NIPS 2023 papers


topic-1

Topic words: data, model, time, based, causal, models, methods, inference

Learning Linear Causal Representations from Interventions under General Nonlinear Mixing
Simon Buchholz Goutham Rajendran Elan Rosenfeld Bryon Aragam Bernhard Schölkopf Pradeep Kumar Ravikumar



Research question: learning causal representations under unknown latent interventions.
Motivation: prove strong identifiability results from unknown single-node interventions in the setting where the latent distribution is Gaussian but the mixing function is completely general.
Method: uncover the high-dimensional geometric structure present in the data distribution after a nonlinear density transformation, by analysing quadratic forms of the precision matrices of the latent distributions.
Results: a contrastive algorithm is proposed to identify the latent variables in practice, and its performance is evaluated on various tasks.

We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of identifiability from non-paired interventions for deep neural network embeddings and general causal structures. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.

How to Turn Your Knowledge Graph Embeddings into Generative Models
Lorenzo Loconte Nicola Di Mauro Robert Peharz Antonio Vergari



Research question: how to turn knowledge graph embedding (KGE) models into generative models.
Motivation: the most successful KGE models for link prediction (CP, RESCAL, TuckER, ComplEx) behave as energy-based models, which makes exact maximum-likelihood estimation, sampling, and the integration of logical constraints difficult.
Method: re-interpret the score functions of these KGEs as circuits, and design two recipes for obtaining efficient generative circuit models, either by restricting activations to be non-negative or by squaring outputs.
Results: little or no loss of link-prediction performance, while unlocking exact learning by MLE, efficient sampling of new triples, and logical constraints satisfied by design; the models also scale more gracefully than the original KGEs to graphs with millions of entities.

Some of the most successful knowledge graph embedding (KGE) models for link prediction – CP, RESCAL, TuckER, ComplEx – can be interpreted as energy-based models. Under this perspective, however, they are not amenable to exact maximum-likelihood estimation (MLE) or sampling, and they struggle to integrate logical constraints. This work re-interprets the score functions of these KGEs as circuits – constrained computational graphs allowing efficient marginalisation. We then design two recipes to obtain efficient generative circuit models by either restricting their activations to be non-negative or squaring their outputs. Our interpretation comes with little or no loss of performance for link prediction, while the circuits framework unlocks exact learning by MLE, efficient sampling of new triples, and the guarantee that logical constraints are satisfied by design. Furthermore, our models scale more gracefully than the original KGEs on graphs with millions of entities.
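
To make the squaring recipe concrete, here is a small numerical sketch (our illustration, not the authors' code) of why a squared CP scorer admits a tractable partition function: expanding the square factorises the sum over all triples into component-wise Gram matrices.

```python
import numpy as np

# Toy illustration of the "squaring" recipe: a CP scorer
# score(s,r,o) = sum_k E[s,k] * R[r,k] * E[o,k], once squared, is an
# unnormalised probability whose partition function is tractable.
rng = np.random.default_rng(0)
n_entities, n_relations, k = 50, 10, 16
E = rng.normal(size=(n_entities, k))   # entity embeddings
R = rng.normal(size=(n_relations, k))  # relation embeddings

def score(s, r, o):
    return np.sum(E[s] * R[r] * E[o])

# Naive partition function: sum of squared scores over ALL triples.
Z_naive = sum(score(s, r, o) ** 2
              for s in range(n_entities)
              for r in range(n_relations)
              for o in range(n_entities))

# Circuit view: expanding the square gives a sum over index pairs (k, k'),
# which factorises into per-component Gram matrices -- O(k^2) work.
GE = E.T @ E          # Gram matrix of entity embeddings
GR = R.T @ R          # Gram matrix of relation embeddings
Z_circuit = np.sum(GE * GR * GE)

print(Z_naive, Z_circuit)  # agree up to floating-point error
# P(s,r,o) = score(s,r,o)**2 / Z_circuit is now a proper distribution.
```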

Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation
Sebastien Lachapelle Divyat Mahajan Ioannis Mitliagkas Simon Lacoste-Julien



Research question: latent variable identification and "out-of-support" image generation in representation learning.
Motivation: for a class of models called additive decoders, both problems can be solved; such decoders are well suited to images that decompose as a sum of object-specific images.
Method: solve the reconstruction problem with an additive decoder; under certain conditions, identification of the blocks of latent variables is guaranteed, relying only on weak assumptions about the distribution of the latent factors.
Results: additive decoders enable nonlinear independent component analysis (ICA) in a new setting and strengthen the theoretical understanding of object-centric representation learning methods; it is further shown theoretically that additive decoders can generate novel images by recombining observed factors of variation in novel ways, an ability termed Cartesian-product extrapolation. Experiments on simulated data show that additivity is crucial for both identification and extrapolation.

We tackle the problems of latent variables identification and "out-of-support" image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
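
A minimal sketch of the structural assumption, with hypothetical per-block MLPs: an additive decoder renders each latent block independently and sums the results, which is what makes recombining blocks (Cartesian-product extrapolation) possible.

```python
import numpy as np

# Minimal sketch (2 latent blocks, tiny random MLPs, flattened images):
# the image is the SUM of object-specific renderings,
# f(z) = f_1(z_1) + f_2(z_2).
rng = np.random.default_rng(0)
block_dim, img_dim = 2, 64

def make_block_decoder():
    W1 = rng.normal(size=(32, block_dim)); b1 = rng.normal(size=32)
    W2 = rng.normal(size=(img_dim, 32));   b2 = rng.normal(size=img_dim)
    return lambda z: W2 @ np.tanh(W1 @ z + b1) + b2

f1, f2 = make_block_decoder(), make_block_decoder()

def additive_decode(z1, z2):
    return f1(z1) + f2(z2)  # each block renders "its" object independently

# Cartesian-product extrapolation: if (z1, z2) and (z1', z2') were observed,
# the decoder renders the *unseen* recombination (z1, z2') just as well,
# because the blocks never interact.
z1, z2 = rng.normal(size=block_dim), rng.normal(size=block_dim)
z1p, z2p = rng.normal(size=block_dim), rng.normal(size=block_dim)
novel_image = additive_decode(z1, z2p)
```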

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent
Jihao Andreas Lin Javier Antoran Shreyas Padhy David Janz José Miguel Hernández-Lobato Alexander Terenin



Research question: Gaussian processes are a powerful framework for uncertainty quantification and sequential decision-making, but they require solving linear systems, which has cubic cost in dataset size and is sensitive to conditioning.
Motivation: explore stochastic gradient algorithms as a computationally efficient way of approximately solving these linear systems, overcoming this limitation of Gaussian processes.
Method: develop low-variance optimisation objectives for sampling from the posterior and extend them to inducing points; explain why stochastic gradient descent often produces accurate predictions even without convergence, via a spectral characterisation of the implicit bias of non-convergence.
Results: experiments show that stochastic gradient descent produces predictive distributions close to the true posterior both in regions with sufficient data coverage and in regions sufficiently far from the data; it achieves state-of-the-art performance on large-scale or ill-conditioned regression tasks, and its uncertainty estimates match significantly more expensive baselines on a large-scale Bayesian optimisation task.

Gaussian processes are a powerful framework for quantifying uncertainty and for sequential decision-making but are limited by the requirement of solving linear systems. In general, this has a cubic cost in dataset size and is sensitive to conditioning. We explore stochastic gradient algorithms as a computationally efficient method of approximately solving these linear systems: we develop low-variance optimization objectives for sampling from the posterior and extend these to inducing points. Counterintuitively, stochastic gradient descent often produces accurate predictions, even in cases where it does not converge quickly to the optimum. We explain this through a spectral characterization of the implicit bias from non-convergence. We show that stochastic gradient descent produces predictive distributions close to the true posterior both in regions with sufficient data coverage, and in regions sufficiently far away from the data. Experimentally, stochastic gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks. Its uncertainty estimates match the performance of significantly more expensive baselines on a large-scale Bayesian optimization task.
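
As a rough illustration of the core idea (not the authors' implementation), the posterior mean requires solving (K + sigma^2 I)v = y; a stochastic gradient method can attack the equivalent convex quadratic with randomly sampled coordinate blocks:

```python
import numpy as np

# Hedged sketch: instead of an O(n^3) solve for v = (K + sigma^2 I)^{-1} y,
# run stochastic (block-coordinate) gradient steps on the convex quadratic
# J(v) = 0.5 v^T (K + sigma^2 I) v - y^T v.
rng = np.random.default_rng(0)
n, sigma2, lr, batch = 500, 0.1, 1e-3, 64
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
K = np.exp(-0.5 * (X - X.T) ** 2)          # RBF kernel matrix

v = np.zeros(n)
for step in range(5000):
    idx = rng.choice(n, size=batch, replace=False)
    # gradient of J restricted to the sampled coordinate block:
    grad = K[idx] @ v + sigma2 * v[idx] - y[idx]
    v[idx] -= lr * grad

v_exact = np.linalg.solve(K + sigma2 * np.eye(n), y)
print(np.linalg.norm(v - v_exact) / np.linalg.norm(v_exact))
```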

Causal normalizing flows: from theory to practice
Adrián Javaloy Pablo Sanchez Martin Isabel Valera



Research question: this paper aims to deepen the use of normalizing flows for causal reasoning.
Motivation: leveraging recent results on non-linear ICA, it shows that, given a causal ordering, causal models are identifiable from observational data and can be recovered with autoregressive normalizing flows (NFs).
Method: analyse different design and learning choices for capturing the underlying causal data-generating process, and describe how to implement the do-operator in causal NFs so as to answer interventional and counterfactual questions.
Results: design and training choices are validated through a comprehensive ablation study; causal NFs are compared with other approaches for approximating causal models; and it is demonstrated empirically that causal NFs can address real-world problems, where mixed discrete-continuous data and partial knowledge of the causal graph are the norm.

In this work, we take a deeper look at the use of normalizing flows for causal reasoning. Specifically, we first leverage recent results on non-linear ICA to show that causal models are identifiable from observational data given a causal ordering, and thus can be recovered using autoregressive normalizing flows (NFs). Second, we analyze different design and learning choices for *causal normalizing flows* to capture the underlying causal data-generating process. Third, we describe how to implement the *do-operator* in causal NFs, and thus, how to answer interventional and counterfactual questions. Finally, in our experiments, we validate our design and training choices through a comprehensive ablation study; compare causal NFs to other approaches for approximating causal models; and empirically demonstrate that causal NFs can be used to address real-world problems—where the presence of mixed discrete-continuous data and partial knowledge on the causal graph is the norm. The code for this work can be found at https://github.com/psanch21/causal-flows.
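
A toy sketch of the do-operator in a causal autoregressive flow, assuming a known ordering and, for readability, affine transforms: intervene by clamping a variable and re-propagating only its descendants through the flow, reusing the abducted noise.

```python
import numpy as np

# Toy causal autoregressive flow with known ordering x1 -> x2 -> x3 and
# assumed affine transforms (real causal NFs learn these maps).
a, b, c = 0.8, -0.5, 1.2

def flow_forward(u):
    x1 = u[0]
    x2 = a * x1 + u[1]
    x3 = b * x1 + c * x2 + u[2]
    return np.array([x1, x2, x3])

def flow_inverse(x):  # abduction: recover the exogenous noise
    return np.array([x[0], x[1] - a * x[0], x[2] - b * x[0] - c * x[1]])

def do(x_obs, j, alpha):
    """Counterfactual via do(x_j = alpha): keep the noise, clamp x_j,
    and re-propagate only the descendants of x_j."""
    u = flow_inverse(x_obs)
    x = x_obs.copy()
    x[j] = alpha
    if j <= 0: x[1] = a * x[0] + u[1]          # recompute downstream vars
    if j <= 1: x[2] = b * x[0] + c * x[1] + u[2]
    return x

x = flow_forward(np.random.default_rng(0).normal(size=3))
print(do(x, j=1, alpha=0.0))  # intervene on x2; x1 untouched, x3 updated
```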

Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach
Fabian Zaiser Andrzej S Murawski Luke Ong



Research question: this paper proposes an exact Bayesian inference method for discrete statistical models, able to solve a large class of discrete inference problems exactly.
Motivation: existing exact inference tools cannot handle discrete models with infinite support and continuous priors, so a new approach is needed.
Method: the paper introduces a probabilistic programming language supporting discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. The key tool is probability generating functions, which provide a compact closed-form representation of distributions definable by programs, enabling exact computation of posterior probabilities, expectations, variances, and higher moments.
Results: experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy; on a range of real-world inference problems that no existing exact tool can solve, Genfer's performance is competitive with approximate Monte Carlo methods while avoiding approximation error.

We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to a large class of discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. Our key tool is *probability generating functions*: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct and fully automated in a tool called *Genfer*, which uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy. On a range of real-world inference problems that none of these exact tools can solve, Genfer's performance is competitive with approximate Monte Carlo methods, while avoiding approximation errors.
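
A small sketch of the key tool using sympy (our example, not Genfer itself): the PGF G(x) = E[x^N] represents a distribution with infinite support in closed form, and exact probabilities and moments fall out of differentiation.

```python
import sympy as sp

# A probability generating function G(x) = E[x^N] compactly represents a
# discrete distribution with infinite support; probabilities and moments
# are exact derivatives, with no truncation or sampling.
x = sp.symbols('x')
lam, mu = sp.Rational(3, 2), sp.Rational(1, 2)

# N = N1 + N2 with independent N1 ~ Poisson(3/2), N2 ~ Poisson(1/2):
# the PGF of an independent sum is the product of the PGFs.
G = sp.exp(lam * (x - 1)) * sp.exp(mu * (x - 1))

# Exact probability masses: P(N = i) = G^(i)(0) / i!
probs = [sp.diff(G, x, i).subs(x, 0) / sp.factorial(i) for i in range(4)]
print([sp.nsimplify(p) for p in probs])   # Poisson(2) masses, exactly

# Exact moments: E[N] = G'(1), Var[N] = G''(1) + G'(1) - G'(1)^2
mean = sp.diff(G, x).subs(x, 1)
var = sp.diff(G, x, 2).subs(x, 1) + mean - mean**2
print(mean, var)                          # both equal 2
```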

A Rigorous Link between Deep Ensembles and (Variational) Bayesian Methods
Veit David Wild Sahra Ghalebikesabi Dino Sejdinovic Jeremias Knoblauch



Research question: establish the first mathematically rigorous link between Bayesian, variational Bayesian, and ensemble methods.
Motivation: a key step is to reformulate the non-convex optimisation problem typically encountered in deep learning as a convex optimisation problem in the space of probability measures.
Method: study generalised variational inference through the lens of Wasserstein gradient flows, yielding a unified theory of various seemingly disconnected approaches commonly used for uncertainty quantification in deep learning.
Results: new ensembling schemes are proposed, and these algorithms are proven to converge to a well-defined global minimiser on the space of probability measures.

We establish the first mathematically rigorous link between Bayesian, variational Bayesian, and ensemble methods. A key step towards this is to reformulate the non-convex optimisation problem typically encountered in deep learning as a convex optimisation in the space of probability measures. On a technical level, our contribution amounts to studying generalised variational inference through the lens of Wasserstein gradient flows. The result is a unified theory of various seemingly disconnected approaches that are commonly used for uncertainty quantification in deep learning---including deep ensembles and (variational) Bayesian methods. This offers a fresh perspective on the reasons behind the success of deep ensembles over procedures based on parameterised variational inference, and allows the derivation of new ensembling schemes with convergence guarantees. We showcase this by proposing a family of interacting deep ensembles with direct parallels to the interactions of particle systems in thermodynamics, and use our theory to prove the convergence of these algorithms to a well-defined global minimiser on the space of probability measures.
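
As a generic illustration of the gradient-flow viewpoint (not the paper's specific interacting scheme), particles following the Wasserstein gradient flow of the KL divergence to a target posterior obey Langevin-type dynamics, and the resulting ensemble approximates the posterior:

```python
import numpy as np

# Generic illustration: an ensemble viewed as particles following the
# Wasserstein gradient flow of KL(rho || posterior) evolves by Langevin
# dynamics. Target: a 1-D double-well posterior p(w) ~ exp(-U(w)).
rng = np.random.default_rng(0)

def grad_U(w):                   # U(w) = (w^2 - 1)^2, a double well
    return 4 * w * (w ** 2 - 1)

n_particles, eta, steps = 200, 1e-3, 20000
ensemble = rng.normal(size=n_particles)     # ensemble members = particles
for _ in range(steps):
    noise = rng.normal(size=n_particles)
    ensemble = ensemble - eta * grad_U(ensemble) + np.sqrt(2 * eta) * noise

# The empirical distribution of the ensemble approximates the posterior,
# giving uncertainty estimates beyond a single point-estimate network.
print(ensemble.mean(), ensemble.std())
```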

Entropic Neural Optimal Transport via Diffusion Processes
Nikita Gushchin Alexander Kolesov Alexander Korotin Dmitry P. Vetrov Evgeny Burnaev



Research question: a novel neural algorithm for the fundamental problem of computing the entropic optimal transport (EOT) plan between probability distributions accessible only through samples.
Motivation: in contrast to existing large-scale EOT methods, the proposed algorithm is end-to-end, consists of a single learning step, has a fast inference procedure, and can handle small values of the entropy regularisation coefficient, which is particularly important in some applied problems.
Method: the algorithm is based on a saddle-point reformulation of the dynamic version of EOT, known as the Schrödinger Bridge problem, and is a single-step learning procedure with fast inference.
Results: the method performs well on several large-scale EOT tasks. Code for the ENOT solver is available at https://github.com/ngushchin/EntropicNeuralOptimalTransport.

We propose a novel neural algorithm for the fundamental problem of computing the entropic optimal transport (EOT) plan between probability distributions which are accessible by samples. Our algorithm is based on the saddle point reformulation of the dynamic version of EOT which is known as the Schrödinger Bridge problem. In contrast to the prior methods for large-scale EOT, our algorithm is end-to-end and consists of a single learning step, has a fast inference procedure, and allows handling small values of the entropy regularization coefficient which is of particular importance in some applied problems. Empirically, we show the performance of the method on several large-scale EOT tasks. The code for the ENOT solver can be found at https://github.com/ngushchin/EntropicNeuralOptimalTransport
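
For intuition about the objective being scaled up, here is the classical discrete Sinkhorn solver for the same EOT problem (a background sketch, not ENOT); the small-regularisation regime that Sinkhorn handles poorly is precisely where neural solvers such as ENOT are aimed.

```python
import numpy as np

# Discrete entropic OT between two empirical 1-D distributions, solved by
# Sinkhorn iterations. With very small eps the kernel underflows, which is
# one motivation for alternative large-scale solvers.
rng = np.random.default_rng(0)
n, m, eps = 50, 60, 0.5
x = rng.normal(size=(n, 1)); y = rng.normal(loc=1.0, size=(m, 1))
a = np.full(n, 1 / n); b = np.full(m, 1 / m)           # marginals
C = (x - y.T) ** 2                                      # quadratic cost
K = np.exp(-C / eps)

u, v = np.ones(n), np.ones(m)
for _ in range(500):                                    # Sinkhorn iterations
    u = a / (K @ v)
    v = b / (K.T @ u)

plan = u[:, None] * K * v[None, :]                      # EOT plan
print(plan.sum(axis=1)[:3], plan.sum(axis=0)[:3])       # marginals respected
```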

A Measure-Theoretic Axiomatisation of Causality
Junhyung Park Simon Buchholz Bernhard Schölkopf Krikamol Muandet



Research question: this paper seeks a universally acceptable axiomatisation of causality.
Motivation: although causality is a central concept in many research areas, no universally agreed axiomatisation exists.
Method: view causality both as an extension of probability theory and as a study of what happens when one intervenes on a system, taking Kolmogorov's measure-theoretic axiomatisation of probability as the starting point; to this end, propose the notion of a causal space, consisting of a probability space together with a collection of transition probability kernels, called causal kernels.
Results: the proposed framework is not only rigorously grounded in measure theory, but also sheds light on long-standing limitations of existing frameworks, such as cycles, latent variables, and stochastic processes.

Causality is a central concept in a wide range of research areas, yet there is still no universally agreed axiomatisation of causality. We view causality both as an extension of probability theory and as a study of what happens when one intervenes on a system, and argue in favour of taking Kolmogorov's measure-theoretic axiomatisation of probability as the starting point towards an axiomatisation of causality. To that end, we propose the notion of a causal space, consisting of a probability space along with a collection of transition probability kernels, called causal kernels, that encode the causal information of the space. Our proposed framework is not only rigorously grounded in measure theory, but it also sheds light on long-standing limitations of existing frameworks including, for example, cycles, latent variables and stochastic processes.

Characteristic Circuits
Zhongjie Yu Martin Trapp Kristian Kersting



Research question: how to reason reliably and efficiently under uncertainty in real-world scenarios while capturing complex relationships in data.
Motivation: probabilistic circuits (PCs) are tractable models of high-dimensional probability distributions, but learning PCs on heterogeneous data is challenging, and the densities of some parametric distributions have no closed form, limiting their potential use.
Method: introduce characteristic circuits (CCs), a family of tractable probabilistic models providing a unified formalisation of distributions over heterogeneous data in the spectral domain; the one-to-one relationship between characteristic functions and probability measures makes it possible to learn high-dimensional distributions over heterogeneous data domains and to perform efficient probabilistic inference even without a closed-form density.
Results: experiments show that the structure and parameters of characteristic circuits can be learned efficiently from data, and that CCs outperform state-of-the-art density estimators for heterogeneous data on common benchmark data sets.

In many real-world scenarios it is crucial to be able to reliably and efficiently reason under uncertainty while capturing complex relationships in data. Probabilistic circuits (PCs), a prominent family of tractable probabilistic models, offer a remedy to this challenge by composing simple, tractable distributions into a high-dimensional probability distribution. However, learning PCs on heterogeneous data is challenging and densities of some parametric distributions are not available in closed form, limiting their potential use. We introduce characteristic circuits (CCs), a family of tractable probabilistic models providing a unified formalization of distributions over heterogeneous data in the spectral domain. The one-to-one relationship between characteristic functions and probability measures enables us to learn high-dimensional distributions on heterogeneous data domains and facilitates efficient probabilistic inference even when no closed-form density function is available. We show that the structure and parameters of CCs can be learned efficiently from the data and find that CCs outperform state-of-the-art density estimators for heterogeneous data domains on common benchmark data sets.
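
A minimal evaluation sketch of the idea (hypothetical leaves and weights): leaves are characteristic functions, product nodes over disjoint scopes multiply CFs, and sum nodes mix them, so a heterogeneous continuous/discrete joint is evaluated exactly in the spectral domain.

```python
import numpy as np

# Tiny characteristic circuit over a heterogeneous pair (X1 continuous,
# X2 count-valued). Leaves are characteristic functions; a product node
# over disjoint scopes multiplies CFs (independence), and a sum node forms
# a mixture (convex combination of CFs).
def cf_gaussian(t, mu, sigma):          # CF of N(mu, sigma^2)
    return np.exp(1j * mu * t - 0.5 * (sigma * t) ** 2)

def cf_poisson(t, lam):                 # CF of Poisson(lam), a discrete leaf
    return np.exp(lam * (np.exp(1j * t) - 1))

def circuit(t1, t2):
    # two mixture components, each a product node over scopes {X1} and {X2}
    comp1 = cf_gaussian(t1, mu=-1.0, sigma=0.5) * cf_poisson(t2, lam=2.0)
    comp2 = cf_gaussian(t1, mu=2.0, sigma=1.0) * cf_poisson(t2, lam=5.0)
    return 0.3 * comp1 + 0.7 * comp2    # sum node with weights (0.3, 0.7)

print(circuit(0.0, 0.0))                # CFs always equal 1 at t = 0
print(circuit(0.5, -0.3))               # exact evaluation, no density needed
```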

Generalizing Nonlinear ICA Beyond Structural Sparsity
Yujia Zheng Kun Zhang



Research question: the identifiability problem of nonlinear independent component analysis (ICA).
Motivation: existing ICA approaches require additional assumptions to achieve identifiability, and these assumptions may not hold in all practical situations.
Method: propose a set of new identifiability results for nonlinear ICA in the general settings of undercompleteness (more observed variables than sources), partial sparsity and source dependence, and flexible grouping structures, and prove identifiability in these settings.
Results: theoretical claims are supported empirically on both synthetic and real-world datasets.

Nonlinear independent component analysis (ICA) aims to uncover the true latent sources from their observable nonlinear mixtures. Despite its significance, the identifiability of nonlinear ICA is known to be impossible without additional assumptions. Recent advances have proposed conditions on the connective structure from sources to observed variables, known as Structural Sparsity, to achieve identifiability in an unsupervised manner. However, the sparsity constraint may not hold universally for all sources in practice. Furthermore, the assumptions of bijectivity of the mixing process and independence among all sources, which arise from the setting of ICA, may also be violated in many real-world scenarios. To address these limitations and generalize nonlinear ICA, we propose a set of new identifiability results in the general settings of undercompleteness, partial sparsity and source dependence, and flexible grouping structures. Specifically, we prove identifiability when there are more observed variables than sources (undercomplete), and when certain sparsity and/or source independence assumptions are not met for some changing sources. Moreover, we show that even in cases with flexible grouping structures (e.g., part of the sources can be divided into irreducible independent groups with various sizes), appropriate identifiability results can also be established. Theoretical claims are supported empirically on both synthetic and real-world datasets.

Implicit Variational Inference for High-Dimensional Posteriors
Anshuk Uppal Kristoffer Stensbo-Smidt Wouter Boomsma Jes Frellsen



Research question: this paper proposes a new approach that uses neural samplers to approximate multimodal and correlated posterior distributions in complex high-dimensional spaces.
Motivation: the benefits of Bayesian models rely on accurately capturing the true posterior, and existing implicit approaches typically depend on additional discriminator networks and unstable adversarial objectives.
Method: a new approach that introduces novel bounds for approximate inference with implicit distributions by locally linearising the neural sampler; in addition, a new sampler architecture that, for the first time, enables implicit distributions over tens of millions of latent variables, addressing computational concerns through differentiable numerical approximations.
Results: experiments show that the method recovers correlations across layers in large Bayesian neural networks, a property crucial for performance but notoriously hard to achieve; in downstream tasks, the expressive posteriors outperform state-of-the-art uncertainty quantification methods, validating the training algorithm and the quality of the learned implicit approximation.

In variational inference, the benefits of Bayesian models rely on accurately capturing the true posterior distribution. We propose using neural samplers that specify implicit distributions, which are well-suited for approximating complex multimodal and correlated posteriors in high-dimensional spaces. Our approach introduces novel bounds for approximate inference using implicit distributions by locally linearising the neural sampler. This is distinct from existing methods that rely on additional discriminator networks and unstable adversarial objectives. Furthermore, we present a new sampler architecture that, for the first time, enables implicit distributions over tens of millions of latent variables, addressing computational concerns by using differentiable numerical approximations. We empirically show that our method is capable of recovering correlations across layers in large Bayesian neural networks, a property that is crucial for a network's performance but notoriously challenging to achieve. To the best of our knowledge, no other method has been shown to accomplish this task for such large models. Through experiments in downstream tasks, we demonstrate that our expressive posteriors outperform state-of-the-art uncertainty quantification methods, validating the effectiveness of our training algorithm and the quality of the learned implicit approximation.

Wasserstein Quantum Monte Carlo: A Novel Approach for Solving the Quantum Many-Body Schrödinger Equation
Kirill Neklyudov Jannes Nys Luca Thiede Juan Felipe Carrasquilla Alvarez Qiang Liu Max Welling Alireza Makhzani



Research question: solving the quantum many-body Schrödinger equation, a fundamental challenge in quantum physics, quantum chemistry, and materials science.
Motivation: the optimisation objective of traditional Quantum Variational Monte Carlo is notoriously hard to minimise and requires second-order methods such as natural gradient; deep learning methods partially address this by representing rich families of wave functions with neural networks.
Method: reformulate energy functional minimisation in the space of Born distributions corresponding to particle-permutation (anti-)symmetric wave functions, rather than the space of wave functions, and interpret Quantum Variational Monte Carlo as a Fisher-Rao gradient flow in this distributional space followed by a projection step onto the variational manifold.
Results: the paper proposes "Wasserstein Quantum Monte Carlo" (WQMC), which uses the gradient flow induced by the Wasserstein metric rather than the Fisher-Rao metric and corresponds to *transporting* probability mass rather than *teleporting* it; empirically, the dynamics of WQMC converge faster to the ground state of molecular systems.

Solving the quantum many-body Schrödinger equation is a fundamental and challenging problem in the fields of quantum physics, quantum chemistry, and material sciences. One of the common computational approaches to this problem is Quantum Variational Monte Carlo (QVMC), in which ground-state solutions are obtained by minimizing the energy of the system within a restricted family of parameterized wave functions. Deep learning methods partially address the limitations of traditional QVMC by representing a rich family of wave functions in terms of neural networks. However, the optimization objective in QVMC remains notoriously hard to minimize and requires second-order optimization methods such as natural gradient. In this paper, we first reformulate energy functional minimization in the space of Born distributions corresponding to particle-permutation (anti-)symmetric wave functions, rather than the space of wave functions. We then interpret QVMC as the Fisher-Rao gradient flow in this distributional space, followed by a projection step onto the variational manifold. This perspective provides us with a principled framework to derive new QMC algorithms, by endowing the distributional space with better metrics, and following the projected gradient flow induced by those metrics. More specifically, we propose "Wasserstein Quantum Monte Carlo" (WQMC), which uses the gradient flow induced by the Wasserstein metric, rather than the Fisher-Rao metric, and corresponds to *transporting* the probability mass, rather than *teleporting* it. We demonstrate empirically that the dynamics of WQMC results in faster convergence to the ground state of molecular systems.

The Rashomon Importance Distribution: Getting RID of Unstable, Single Model-based Variable Importance
Jon Donnelly Srikar Katta Cynthia Rudin Edward P Browne



Research question: how to quantify variable importance, particularly when many models explain the same dataset equally well.
Motivation: existing methods typically compute variable importance only for a given model trained on a given dataset, so different researchers may reach conflicting yet equally valid conclusions from the same data; moreover, even when all possible explanations are accounted for, these insights may not generalise across reasonable data perturbations.
Method: a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution; the framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics.
Results: experiments show the framework recovers variable importance rankings in complex simulation settings where other methods fail, and accurately estimates the true importance of a variable for the underlying data distribution. A real-world case study on which genes are important for predicting HIV load in persons with HIV highlights an important gene not previously studied in connection with HIV.

Quantifying variable importance is essential for answering high-stakes questions in fields like genetics, public policy, and medicine. Current methods generally calculate variable importance for a given model trained on a given dataset. However, for a given dataset, there may be many models that explain the target outcome equally well; without accounting for all possible explanations, different researchers may arrive at many conflicting yet equally valid conclusions given the same data. Additionally, even when accounting for all possible explanations for a given dataset, these insights may not generalize because not all good explanations are stable across reasonable data perturbations. We propose a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution. Our framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics. We demonstrate through experiments that our framework recovers variable importance rankings for complex simulation setups where other methods fail. Further, we show that our framework accurately estimates the _true importance_ of a variable for the underlying data distribution. We provide theoretical guarantees on the consistency and finite sample error rates for our estimator. Finally, we demonstrate its utility with a real-world case study exploring which genes are important for predicting HIV load in persons with HIV, highlighting an important gene that has not previously been studied in connection with HIV.
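
A hedged toy sketch in the spirit of the framework (not the authors' RID estimator): aggregate a global importance metric over bootstrap perturbations of the data and over all near-optimal models found on each, rather than reporting one model's importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy illustration: importance averaged over (i) bootstrap perturbations
# and (ii) a set of near-optimal "good" models on each, instead of one
# model on one dataset. Thresholds and model class are assumptions.
rng = np.random.default_rng(0)
n, d = 400, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

importances = []
for boot in range(10):                            # data perturbations
    idx = rng.integers(0, n, size=n)
    Xb, yb = X[idx], y[idx]
    models, losses = [], []
    for seed in range(10):                        # candidate models
        m = RandomForestClassifier(n_estimators=50,
                                   random_state=seed).fit(Xb, yb)
        models.append(m); losses.append(1 - m.score(Xb, yb))
    eps = min(losses) + 0.01                      # Rashomon-style threshold
    for m, loss in zip(models, losses):
        if loss <= eps:                           # keep all good models
            r = permutation_importance(m, Xb, yb, n_repeats=5,
                                       random_state=0)
            importances.append(r.importances_mean)

print(np.mean(importances, axis=0))               # stable importance profile
```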

Common Ground in Cooperative Communication
Xiaoran Hao Yash Jhaveri Patrick Shafto



Research question: this paper addresses the core challenge of cooperative communication, the problem of common ground: having enough shared knowledge and understanding to communicate successfully.
Motivation: prior models of cooperative communication uniformly assume the strongest form of common ground, perfect and complete knowledge sharing, and therefore fail to capture the core challenge of cooperative communication.
Method: a general, mathematically principled theory of cooperative communication that explicitly defines a spectrum of common ground possibilities, going well beyond perfect and complete knowledge sharing, on spaces that permit arbitrary representations of data and hypotheses.
Results: by considering a parametric form of common ground and viewing the data selection and hypothesis inference processes of communication as encoding and decoding, the framework is connected to variational autoencoders, a powerful model in modern machine learning; a series of empirical simulations supports and elaborates on the theoretical results.

Cooperative communication plays a fundamental role in theories of human-human interaction--cognition, culture, development, language, etc.--as well as human-robot interaction. The core challenge in cooperative communication is the problem of common ground: having enough shared knowledge and understanding to successfully communicate. Prior models of cooperative communication, however, uniformly assume the strongest form of common ground, perfect and complete knowledge sharing, and, therefore, fail to capture the core challenge of cooperative communication. We propose a general theory of cooperative communication that is mathematically principled and explicitly defines a spectrum of common ground possibilities, going well beyond that of perfect and complete knowledge sharing, on spaces that permit arbitrary representations of data and hypotheses. Our framework is a strict generalization of prior models of cooperative communication. After considering a parametric form of common ground and viewing the data selection and hypothesis inference processes of communication as encoding and decoding, we establish a connection to variational autoencoding, a powerful model in modern machine learning. Finally, we carry out a series of empirical simulations to support and elaborate on our theoretical results.

Hierarchical clustering with dot products recovers hidden tree structure
Annie Gray Alexander Modell Patrick Rubin-Delanchy Nick Whiteley



Research question: this paper offers a new perspective on the well-established agglomerative clustering algorithm, focusing on recovering hierarchical structure.
Motivation: existing agglomerative clustering algorithms fall short in recovering the hierarchical structure that generated the data, motivating an improved method.
Method: a simple variant of the standard algorithm is recommended, in which clusters are merged by maximum average dot product rather than by minimum distance or within-cluster variance; the key is understanding how hierarchical information in the model translates into tree geometry that can be recovered from data.
Results: superior tree recovery on real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN, showing that the new method better recovers the hierarchical structure of the data.

In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
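
The recommended variant is easy to state in code; here is a cubic-time toy sketch (our implementation) that merges the pair of clusters with the maximum average dot product:

```python
import numpy as np

# Agglomerative clustering where the pair of clusters with the MAXIMUM
# average dot product is merged (rather than minimum distance or
# within-cluster variance). O(n^3) toy implementation.
def dot_product_agglomeration(X):
    n = X.shape[0]
    clusters = [[i] for i in range(n)]
    merges = []
    S = X @ X.T                                   # all pairwise dot products
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average dot product between the two clusters
                avg = S[np.ix_(clusters[i], clusters[j])].mean()
                if avg > best:
                    best, best_pair = avg, (i, j)
        i, j = best_pair
        merges.append((clusters[i], clusters[j], best))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges                                 # encodes the recovered tree

X = np.random.default_rng(0).normal(size=(20, 10))
tree = dot_product_agglomeration(X)
```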

Normalizing flow neural networks by JKO scheme
Chen Xu Xiuyuan Cheng Yao Xie



Research question: this paper aims to develop an efficient generative model for sampling and likelihood estimation, particularly in high dimensions.
Motivation: existing flow models rely on special network architectures and regularisation of flow trajectories, which is costly in computation and memory.
Method: inspired by the Jordan-Kinderlehrer-Otto (JKO) scheme, a neural ODE flow network called JKO-iFlow stacks residual blocks one after another, avoiding the sampling of SDE trajectories and score matching or variational learning, thereby reducing the memory load and the difficulty of end-to-end training.
Results: experiments show that, compared with existing flow and diffusion models, the JKO-iFlow network achieves comparable performance at significantly reduced computational and memory cost.

Normalizing flow is a class of deep generative models for efficient sampling and likelihood estimation, which achieves attractive performance, particularly in high dimensions. The flow is often implemented using a sequence of invertible residual blocks. Existing works adopt special network architectures and regularization of flow trajectories. In this paper, we develop a neural ODE flow network called JKO-iFlow, inspired by the Jordan-Kinderlehrer-Otto (JKO) scheme, which unfolds the discrete-time dynamics of the Wasserstein gradient flow. The proposed method stacks residual blocks one after another, allowing efficient block-wise training of the residual blocks, avoiding sampling SDE trajectories and score matching or variational learning, thus reducing the memory load and difficulty in end-to-end training. We also develop adaptive time reparameterization of the flow network with a progressive refinement of the induced trajectory in probability space to improve the model accuracy further. Experiments with synthetic and real data show that the proposed JKO-iFlow network achieves competitive performance compared with existing flow and diffusion models at a significantly reduced computational and memory cost.
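
A hedged sketch of a single JKO step for one residual block, under the assumption that the free energy is the KL divergence to a standard normal and that the block stays invertible; the movement penalty is the Wasserstein proximal term:

```python
import torch

# One JKO step for a residual block f(x) = x + 0.5*g(x): minimise
#   KL(f#rho_k || N(0, I)) + (1/2h) * E||f(x) - x||^2.
# Exact per-sample log-determinants via autograd (cheap in 2-D); the damped
# residual is assumed to keep the block invertible.
torch.manual_seed(0)
d, h, batch = 2, 0.5, 64
g = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(),
                        torch.nn.Linear(32, d))
opt = torch.optim.Adam(g.parameters(), lr=1e-3)
block = lambda x: x + 0.5 * g(x)

x_data = torch.randn(512, d) * 2.0 + 1.0           # samples from rho_k
for step in range(100):
    opt.zero_grad()
    xb = x_data[torch.randint(0, len(x_data), (batch,))]
    y = block(xb)
    logdet = torch.stack([
        torch.logdet(torch.autograd.functional.jacobian(
            block, xi, create_graph=True))
        for xi in xb])                             # log|det J_f| per sample
    nll = 0.5 * (y ** 2).sum(dim=1) - logdet       # -log N(0,I) up to const.
    move = ((y - xb) ** 2).sum(dim=1) / (2 * h)    # JKO movement penalty
    loss = (nll + move).mean()
    loss.backward()
    opt.step()
```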

Inferring the Future by Imagining the Past
Kartik Chandra Tony Chen Tzu-Mao Li Jonathan Ragan-Kelley Joshua B. Tenenbaum



Research question: how to model the way humans rapidly and flexibly infer complex sequences of past and future events from a static snapshot.
Motivation: humans can infer complex dynamic events from a static scene, an ability that is valuable in many domains.
Method: building on a long line of work in cognitive science, a Monte Carlo algorithm is proposed whose inferences correlate well with human intuitions while requiring only a small number of samples.
Results: the algorithm's inferences correlate well with human intuition across a wide variety of domains using only a small number of samples; it also uncovers a surprising connection between the inference problem and Monte Carlo path tracing, bringing decades of ideas from computer graphics to bear on this seemingly unrelated theory-of-mind task.

A single panel of a comic book can say a lot: it can depict not only where the characters currently are, but also their motions, their motivations, their emotions, and what they might do next. More generally, humans routinely infer complex sequences of past and future events from a *static snapshot* of a *dynamic scene*, even in situations they have never seen before. In this paper, we model how humans make such rapid and flexible inferences. Building on a long line of work in cognitive science, we offer a Monte Carlo algorithm whose inferences correlate well with human intuitions in a wide variety of domains, while only using a small, cognitively-plausible number of samples. Our key technical insight is a surprising connection between our inference problem and Monte Carlo path tracing, which allows us to apply decades of ideas from the computer graphics community to this seemingly-unrelated theory of mind task.

AMDP: An Adaptive Detection Procedure for False Discovery Rate Control in High-Dimensional Mediation Analysis
Jiarong Ding Xuehu Zhu



Research question: the multiple testing problem in high-dimensional mediation analysis, and how to accurately assess the uncertainty of the detection process.
Motivation: when handling high-dimensional mediation analysis, existing methods either construct p-values without calibration or ignore joint information across tests, resulting in overly conservative FDR control or non-optimal ranking rules for multiple hypotheses.
Method: an adaptive mediation detection procedure (AMDP) that optimises the rule for ranking hypotheses and proposes a data-driven strategy for determining the threshold for mediator selection, asymptotically controlling the FDR in high-dimensional mediation analysis.
Results: numerical studies show that AMDP outperforms existing approaches on synthetic and real data sets.

High-dimensional mediation analysis is often associated with a multiple testing problem for detecting significant mediators. Assessing the uncertainty of this detecting process via false discovery rate (FDR) has garnered great interest. To control the FDR in multiple testing, two essential steps are involved: ranking and selection. Existing approaches either construct p-values without calibration or disregard the joint information across tests, leading to conservative FDR control or non-optimal ranking rules for multiple hypotheses. In this paper, we develop an adaptive mediation detection procedure (referred to as "AMDP") to identify relevant mediators while asymptotically controlling the FDR in high-dimensional mediation analysis. AMDP produces the optimal rule for ranking hypotheses and proposes a data-driven strategy to determine the threshold for mediator selection. This novel method captures information from the proportions of composite null hypotheses and the distribution of p-values, which turns the high dimensionality into an advantage instead of a limitation. The numerical studies on synthetic and real data sets illustrate the performance of AMDP compared with existing approaches.

Encoding Time-Series Explanations through Self-Supervised Model Behavior Consistency
Owen Queen Thomas Hartvigsen Teddy Koker Huan He Theodoros Tsiligkaridis Marinka Zitnik



Research question: interpreting time series models is uniquely challenging: it requires identifying both the location of the time series signals that drive model predictions and their match to interpretable temporal patterns.
Motivation: although explainers from other modalities can be applied to time series, their inductive biases do not transfer well to the inherently challenging interpretation of time series.
Method: TimeX, a time series consistency model for training explainers. TimeX trains an interpretable surrogate to mimic the behaviour of a pretrained time series model, addressing model faithfulness through model behaviour consistency, a novel formulation that matches the relations in the latent space induced by the pretrained model with those induced by TimeX.
Results: TimeX is evaluated on eight synthetic and real-world datasets against state-of-the-art interpretability methods, with case studies on physiological time series; quantitatively, TimeX achieves the highest or second-highest performance on every metric across all datasets, and the case studies show that its novel components can train faithful, interpretable models that capture the behaviour of pretrained time series models.

Interpreting time series models is uniquely challenging because it requires identifying both the location of time series signals that drive model predictions and their matching to an interpretable temporal pattern. While explainers from other modalities can be applied to time series, their inductive biases do not transfer well to the inherently challenging interpretation of time series. We present TimeX, a time series consistency model for training explainers. TimeX trains an interpretable surrogate to mimic the behavior of a pretrained time series model. It addresses the issue of model faithfulness by introducing model behavior consistency, a novel formulation that matches the relations in the latent space induced by the pretrained model with those induced by TimeX. TimeX provides discrete attribution maps and, unlike existing interpretability methods, it learns a latent space of explanations that can be used in various ways, such as to provide landmarks to visually aggregate similar explanations and easily recognize temporal patterns. We evaluate TimeX on eight synthetic and real-world datasets and compare its performance against state-of-the-art interpretability methods. We also conduct case studies using physiological time series. Quantitative evaluations demonstrate that TimeX achieves the highest or second-highest performance in every metric compared to baselines across all datasets. Through case studies, we show that the novel components of TimeX show potential for training faithful, interpretable models that capture the behavior of pretrained time series models.

Streaming PCA for Markovian Data
Syamantak Kumar Purnamrita Sarkar



Research question: estimating the top eigenvector of the unknown covariance matrix of the stationary distribution, when data are sampled from an irreducible, aperiodic, and reversible Markov chain started in stationarity.
Motivation: this setting matters when data can only be sampled through a Markov chain Monte Carlo (MCMC)-type algorithm and the goal is inference on parameters of the stationary distribution; Oja's algorithm offers a solution.
Method: run Oja's algorithm for streaming PCA directly on the Markovian data stream, rather than downsampling the data to obtain a "nearly" independent stream as most existing guarantees require.
Results: the first near-optimal rate for Oja's algorithm on the entire data stream, removing the logarithmic dependence on the sample size that results from throwing data away in downsampling strategies.

Since its inception in 1982, Oja's algorithm has become an established method for streaming principal component analysis (PCA). We study the problem of streaming PCA, where the data-points are sampled from an irreducible, aperiodic, and reversible Markov chain starting in stationarity. Our goal is to estimate the top eigenvector of the unknown covariance matrix of the stationary distribution. This setting has implications in scenarios where data can solely be sampled from a Markov Chain Monte Carlo (MCMC) type algorithm, and the objective is to perform inference on parameters of the stationary distribution. Most convergence guarantees for Oja's algorithm in the literature assume that the data-points are sampled IID. For data streams with Markovian dependence, one typically downsamples the data to get a "nearly" independent data stream. In this paper, we obtain the first near-optimal rate for Oja's algorithm on the entire data, where we remove the logarithmic dependence on the sample size, $n$, resulting from throwing data away in downsampling strategies.
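
Oja's update itself is a two-line algorithm; here is a sketch run directly on a Markovian stream (a reversible AR(1) chain, as an assumed example), i.e. without any downsampling:

```python
import numpy as np

# Oja's streaming PCA update on a Markovian stream: a reversible AR(1)
# chain x_{t+1} = rho*x_t + sqrt(1-rho^2)*L@xi with stationary law
# N(0, Sigma), Sigma = L@L.T. Goal: Sigma's top eigenvector.
rng = np.random.default_rng(0)
d, n, rho = 10, 50000, 0.9
L = np.diag(np.sqrt(np.linspace(1.0, 5.0, d)))     # Sigma = diag(1..5)
x = L @ rng.normal(size=d)                          # start in stationarity

w = rng.normal(size=d); w /= np.linalg.norm(w)      # Oja iterate
for t in range(1, n + 1):
    x = rho * x + np.sqrt(1 - rho**2) * (L @ rng.normal(size=d))
    eta = 1.0 / (t + 100)                           # decaying step size
    w += eta * x * (x @ w)                          # Oja update
    w /= np.linalg.norm(w)

top = np.zeros(d); top[-1] = 1.0                    # true top eigenvector
print(abs(w @ top))                                 # alignment close to 1
```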

MMD-Fuse: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting
Felix Biggs Antonin Schrab Arthur Gretton



Research question: how to maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD).
Motivation: current MMD tests have limitations in how kernels are chosen and combined over a finite kernel set, and need improvement.
Method: new statistics that maximise test power by adapting over the set of kernels used to define the MMD; for finite kernel sets, the (normalised) MMD values under each kernel are combined via a weighted soft maximum. Exponential concentration bounds are proved for the statistics, and a data-dependent but permutation-independent kernel selection procedure avoids data splitting.
Results: the method applies to synthetic low-dimensional and real-world high-dimensional data, and outperforms current state-of-the-art kernel tests in terms of power.

We propose novel statistics which maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD), by adapting over the set of kernels used in defining it. For finite sets, this reduces to combining (normalised) MMD values under each of these kernels via a weighted soft maximum. Exponential concentration bounds are proved for our proposed statistics under the null and alternative. We further show how these kernels can be chosen in a data-dependent but permutation-independent way, in a well-calibrated test, avoiding data splitting. This technique applies more broadly to general permutation-based MMD testing, and includes the use of deep kernels with features learnt using unsupervised models such as auto-encoders. We highlight the applicability of our MMD-Fuse tests on both synthetic low-dimensional and real-world high-dimensional data, and compare its performance in terms of power against current state-of-the-art kernel tests.
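
A simplified sketch of the fused statistic (with a toy normalisation, not the paper's exact one): soft-maximise MMD estimates over a kernel bank and calibrate by permutation, so kernel adaptation needs no data splitting.

```python
import numpy as np

# Combine MMD estimates under several Gaussian kernels via a soft maximum,
# then calibrate with permutations. Bandwidths and temperature are toy
# choices, not the paper's calibrated construction.
rng = np.random.default_rng(0)

def mmd2(X, Y, bw):
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2)
                            .sum(-1) / (2 * bw ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def fused(X, Y, bandwidths, lam=100.0):
    vals = np.array([mmd2(X, Y, bw) for bw in bandwidths])
    return np.log(np.mean(np.exp(lam * vals))) / lam   # weighted soft max

X = rng.normal(size=(100, 2)); Y = rng.normal(loc=0.5, size=(100, 2))
bws = [0.5, 1.0, 2.0]
obs = fused(X, Y, bws)

Z = np.vstack([X, Y]); null = []
for _ in range(200):                      # permutation null distribution
    perm = rng.permutation(len(Z))
    null.append(fused(Z[perm[:100]], Z[perm[100:]], bws))
p_value = (1 + np.sum(np.array(null) >= obs)) / (1 + len(null))
print(p_value)                            # small: the means differ
```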

Fast Approximation of Similarity Graphs with Kernel Density Estimation
Peter Macgregor He Sun



Research question: how to efficiently construct, from a set of data points, a sparse similarity graph that preserves the original cluster structure.
Motivation: typical constructions of a similarity graph have high time complexity and quadratic space in the number of data points, and need improvement.
Method: a new algorithmic framework based on kernel density estimation, applicable to arbitrary kernel functions, that constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure.
Results: across a variety of datasets, the new method significantly outperforms the implementations in the scikit-learn and FAISS libraries.

Constructing a similarity graph from a set $X$ of data points in $ \mathbb{R}^d$ is the first step of many modern clustering algorithms. However, typical constructions of a similarity graph have high time complexity, and a quadratic space dependency with respect to $|X|$. We address this limitation and present a new algorithmic framework that constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure. Our presented algorithm is based on the kernel density estimation problem, and is applicable for arbitrary kernel functions. We compare our designed algorithm with the well-known implementations from the scikit-learn library and the FAISS library, and find that our method significantly outperforms the implementation from both libraries on a variety of datasets.
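
A crude illustration of the underlying idea (the paper's algorithm is more refined and avoids the quadratic step used here): sample a few edges per point with probability proportional to the kernel weight, whose per-point normaliser is exactly a kernel density estimate.

```python
import numpy as np

# Instead of keeping all n^2 edges of the fully connected similarity graph,
# sample t edges per point proportional to kernel weight; the per-point
# normaliser is an (unnormalised) KDE value at that point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(3, 0.3, size=(100, 2))])       # two clusters
n, t, bw = len(X), 5, 0.5

W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * bw**2))
np.fill_diagonal(W, 0.0)
kde = W.sum(axis=1)                  # unnormalised KDE value at each point

rows, cols = [], []
for i in range(n):                   # keep only t sampled edges per node
    nbrs = rng.choice(n, size=t, replace=False, p=W[i] / kde[i])
    rows += [i] * t; cols += list(nbrs)

# The sampled sparse graph almost never links the two clusters:
cross = sum((r < 100) != (c < 100) for r, c in zip(rows, cols))
print(f"{cross} of {len(rows)} sampled edges cross clusters")
```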

Provable benefits of annealing for estimating normalizing constants: Importance Sampling, Noise-Contrastive Estimation, and beyond
Omar Chehab Aapo Hyvarinen Andrej Risteski



Research question: how to estimate normalising constants (partition functions) effectively?
Motivation: current Monte Carlo methods involve design choices with no definitive theory to guide them.
Method: evaluate each design choice by the asymptotic estimation error it produces, including which estimator to use, which path of distributions to take, and whether to use a path at all.
Results: noise-contrastive estimation is found to be more efficient than the importance sampling estimator; the geometric path brings the estimation error down from an exponential to a polynomial function of the parameter distance between target and proposal; and the arithmetic path, though rarely used, can offer optimality properties beyond the universally used geometric path. Based on this theory, a two-step estimator is proposed to approximate the optimal path efficiently.

Recent research has developed several Monte Carlo methods for estimating the normalization constant (partition function) based on the idea of annealing. This means sampling successively from a path of distributions which interpolate between a tractable "proposal" distribution and the unnormalized "target" distribution. Prominent estimators in this family include annealed importance sampling and annealed noise-contrastive estimation (NCE). Such methods hinge on a number of design choices: which estimator to use, which path of distributions to use and whether to use a path at all; so far, there is no definitive theory on which choices are efficient. Here, we evaluate each design choice by the asymptotic estimation error it produces. First, we show that using NCE is more efficient than the importance sampling estimator, but in the limit of infinitesimal path steps, the difference vanishes. Second, we find that using the geometric path brings down the estimation error from an exponential to a polynomial function of the parameter distance between the target and proposal distributions. Third, we find that the arithmetic path, while rarely used, can offer optimality properties over the universally-used geometric path. In fact, in a particular limit, the optimal path is arithmetic. Based on this theory, we finally propose a two-step estimator to approximate the optimal path in an efficient way.
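
For reference, a sketch of annealed importance sampling along the geometric path, the baseline estimator analysed in the paper, estimating a known 1-D normalising constant:

```python
import numpy as np

# Annealed importance sampling along the geometric path
# p_beta(x) ~ proposal(x)^(1-beta) * target(x)^beta, estimating the
# normalising constant of an unnormalised Gaussian target from a
# standard-normal proposal.
rng = np.random.default_rng(0)
log_proposal = lambda x: -0.5 * x ** 2                # N(0,1), unnormalised
log_target = lambda x: -0.5 * (x - 3.0) ** 2 / 0.25   # Z = sqrt(2*pi*0.25)

n_chains, betas = 2000, np.linspace(0.0, 1.0, 50)
x = rng.normal(size=n_chains)                         # exact proposal draws
log_w = np.zeros(n_chains)
for b0, b1 in zip(betas[:-1], betas[1:]):
    log_w += (b1 - b0) * (log_target(x) - log_proposal(x))
    lp = lambda z: (1 - b1) * log_proposal(z) + b1 * log_target(z)
    prop = x + 0.3 * rng.normal(size=n_chains)        # one Metropolis step
    x = np.where(np.log(rng.uniform(size=n_chains)) < lp(prop) - lp(x),
                 prop, x)

Z_hat = np.exp(log_w).mean() * np.sqrt(2 * np.pi)     # rescale by proposal Z
print(Z_hat, np.sqrt(2 * np.pi * 0.25))               # estimate vs truth
```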

Theoretical and Practical Perspectives on what Influence Functions Do
Andrea Schioppa Katja Filippova Ivan Titov Polina Zablotskaia



Research question: influence functions (IF) are used as a technique for explaining model predictions through the training data, but their predictive power is limited for modern deep neural networks.
Motivation: existing methods of estimating IF predict the leave-one-out-and-retrain effect poorly, calling for an explanation of the mismatch between the theoretical promise and the practical results.
Method: analyse five assumptions made by IF methods that are problematic for modern-scale deep neural networks, concerning convexity, numeric stability, training trajectory, and parameter divergence, with experiments on BERT and ResNet models.
Results: while most assumptions can be addressed successfully, parameter divergence poses a clear limitation on the predictive power of IF; even so, when some assumptions do not hold, IF remain useful for model debugging and for correcting mispredictions.

Influence functions (IF) have been seen as a technique for explaining model predictions through the lens of the training data. Their utility is assumed to be in identifying training examples "responsible" for a prediction so that, for example, correcting a prediction is possible by intervening on those examples (removing or editing them) and retraining the model. However, recent empirical studies have shown that the existing methods of estimating IF predict the leave-one-out-and-retrain effect poorly. In order to understand the mismatch between the theoretical promise and the practical results, we analyse five assumptions made by IF methods which are problematic for modern-scale deep neural networks and which concern convexity, numeric stability, training trajectory and parameter divergence. This allows us to clarify what can be expected theoretically from IF. We show that while most assumptions can be addressed successfully, the parameter divergence poses a clear limitation on the predictive power of IF: influence fades over training time even with deterministic training. We illustrate this theoretical result with BERT and ResNet models. Another conclusion from the theoretical analysis is that IF are still useful for model debugging and correcting even though some of the assumptions made in prior work do not hold: using natural language processing and computer vision tasks, we verify that mis-predictions can be successfully corrected by taking only a few fine-tuning steps on influential examples.
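
For context, here is a sketch of the classical influence-function computation whose assumptions the paper scrutinises, in the convex setting (L2-regularised logistic regression) where those assumptions actually hold:

```python
import numpy as np

# Classical influence functions: removing training point i changes the test
# loss by roughly (1/n) * grad_test^T H^{-1} grad_i, where H is the Hessian
# of the regularised training loss at the optimum.
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d)); w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)                               # fit by gradient descent
for _ in range(5000):
    w -= 0.5 * (X.T @ (sigmoid(X @ w) - y) / n + lam * w)

p = sigmoid(X @ w)
H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)   # Hessian at optimum
grads = X * (p - y)[:, None]                          # per-example gradients

x_test, y_test = rng.normal(size=d), 1.0
g_test = x_test * (sigmoid(x_test @ w) - y_test)
influence = (np.linalg.solve(H, grads.T).T @ g_test) / n
print(influence[:5])   # predicted leave-one-out change in the test loss
```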

Multi Time Scale World Models
Vaisakh Shaj Saleh GHOLAM ZADEH Ozan Demir Luiz Ricardo Douat Gerhard Neumann



Research question: how machines can handle complex uncertainty predictions by learning world models at multiple levels of temporal abstraction.
Motivation: existing learning approaches struggle with world models that must operate at multiple levels of temporal abstraction while dealing with complex uncertainty predictions.
Method: a probabilistic formalism called the Multi Time Scale State Space (MTS3) model for learning world models at multiple levels of temporal abstraction.
Results: experiments show that MTS3 outperforms recent methods on several system identification benchmarks, including complex simulated and real-world dynamical systems.

Intelligent agents use internal world models to reason and make predictions about different courses of their actions at many scales. Devising learning paradigms and architectures that allow machines to learn world models that operate at multiple levels of temporal abstractions while dealing with complex uncertainty predictions is a major technical hurdle. In this work, we propose a probabilistic formalism to learn multi-time scale world models which we call the Multi Time Scale State Space (MTS3) model. Our model uses a computationally efficient inference scheme on multiple time scales for highly accurate long-horizon predictions and uncertainty estimates over several seconds into the future. Our experiments, which focus on action conditional long horizon future predictions, show that MTS3 outperforms recent methods on several system identification benchmarks including complex simulated and real-world dynamical systems.

Gaussian Partial Information Decomposition: Bias Correction and Application to High-dimensional Data
Praveen Venkatesh Corbett Bennett Sam Gale Tamina K. Ramirez Greggory Heller Severine Durand Shawn R Olsen Stefan Mihalas



Research question: how to efficiently compute and estimate partial information decompositions (PIDs) on multivariate Gaussian distributions.
Motivation: advances in neuroscientific experimental techniques allow simultaneous recording of thousands of neurons across multiple brain regions, creating a need for computational tools to analyse how task-relevant information is represented and communicated between brain regions; PIDs quantify how much unique, redundant, and synergistic information two or more brain regions carry about a task-relevant message, but computing them is challenging in practice, and statistical issues such as the bias and variance of estimates remain largely unexplored.
Method: this paper proposes a new method for efficiently computing and estimating a PID definition on multivariate Gaussian distributions.
Results: empirically, the method satisfies an intuitive additivity property and recovers the ground truth even at high dimensionality; in addition, a method is proposed, for the first time, to correct the bias of PID estimates at finite sample sizes. Finally, the Gaussian PID effectively characterises inter-areal interactions in the mouse brain, revealing higher redundancy between visual areas when a stimulus is behaviourally relevant.

Recent advances in neuroscientific experimental techniques have enabled us to simultaneously record the activity of thousands of neurons across multiple brain regions. This has led to a growing need for computational tools capable of analyzing how task-relevant information is represented and communicated between several brain regions. Partial information decompositions (PIDs) have emerged as one such tool, quantifying how much unique, redundant and synergistic information two or more brain regions carry about a task-relevant message. However, computing PIDs is computationally challenging in practice, and statistical issues such as the bias and variance of estimates remain largely unexplored. In this paper, we propose a new method for efficiently computing and estimating a PID definition on multivariate Gaussian distributions. We show empirically that our method satisfies an intuitive additivity property, and recovers the ground truth in a battery of canonical examples, even at high dimensionality. We also propose and evaluate, for the first time, a method to correct the bias in PID estimates at finite sample sizes. Finally, we demonstrate that our Gaussian PID effectively characterizes inter-areal interactions in the mouse brain, revealing higher redundancy between visual areas when a stimulus is behaviorally relevant.

Bifurcations and loss jumps in RNN training
Lukas Eisenmann Zahra Monfared Niclas Alexander Göring Daniel Durstewitz



Research question: this paper examines the training process of recurrent neural networks (RNNs) used for modelling sequential data and inferring dynamical systems, and how they solve complex tasks.
Motivation: bifurcations are important phenomena in dynamical systems, including RNNs, referring to topological (qualitative) changes in a system's dynamical behaviour as one or more of its parameters are varied; knowing the bifurcation structure of an RNN allows one to deduce many of its computational and dynamical properties, such as its sensitivity to parameter variations or its behaviour during training.
Method: first, mathematically prove for a class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero; then introduce a novel heuristic algorithm for detecting all fixed points and k-cycles of ReLU-based RNNs, together with their existence and stability regions, i.e. the bifurcation manifolds in parameter space.
Results: in contrast to previous numerical algorithms for finding fixed points and common continuation methods, the algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behaviour; applying it to the analysis of RNN training reveals that the recently introduced technique of generalized teacher forcing entirely avoids certain types of bifurcations during training. Beyond facilitating dynamical-systems analyses of trained RNNs, the algorithm thus provides a powerful tool for analysing the training process itself.

Recurrent neural networks (RNNs) are popular machine learning tools for modeling and forecasting sequential data and for inferring dynamical systems (DS) from observed time series. Concepts from DS theory (DST) have variously been used to further our understanding of both, how trained RNNs solve complex tasks, and the training process itself. Bifurcations are particularly important phenomena in DS, including RNNs, that refer to topological (qualitative) changes in a system's dynamical behavior as one or more of its parameters are varied. Knowing the bifurcation structure of an RNN will thus allow one to deduce many of its computational and dynamical properties, like its sensitivity to parameter variations or its behavior during training. In particular, bifurcations may account for sudden loss jumps observed in RNN training that could severely impede the training process. Here we first mathematically prove for a particular class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero. We then introduce a novel heuristic algorithm for detecting all fixed points and $k$-cycles in ReLU-based RNNs and their existence and stability regions, hence bifurcation manifolds in parameter space. In contrast to previous numerical algorithms for finding fixed points and common continuation methods, our algorithm provides *exact* results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior. We exemplify the algorithm on the analysis of the training process of RNNs, and find that the recently introduced technique of generalized teacher forcing completely avoids certain types of bifurcations in training. Thus, besides facilitating the DST analysis of trained RNNs, our algorithm provides a powerful instrument for analyzing the training process itself.
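
The fixed-point search is easy to illustrate by brute force (the paper's algorithm is far more scalable than this toy): within each ReLU region the map is linear, so candidate fixed points solve a linear system and are kept only if they lie in their own region.

```python
import numpy as np
from itertools import product

# Exact fixed points of a ReLU RNN z_{t+1} = relu(W z_t + b): in each of
# the 2^d linear regions the map is z -> D(W z + b) with a fixed 0/1
# activation mask D, so candidates solve (I - D W) z = D b exactly; a
# candidate is genuine iff it lies in the region that produced it.
rng = np.random.default_rng(3)
d = 4
W = rng.normal(size=(d, d)) * 0.8
b = rng.normal(size=d)

fixed_points = []
for mask in product([0.0, 1.0], repeat=d):
    D = np.diag(mask)
    M = np.eye(d) - D @ W
    if abs(np.linalg.det(M)) < 1e-12:
        continue                          # degenerate region, skip
    z = np.linalg.solve(M, D @ b)
    pre = W @ z + b
    region_ok = all((pre[i] > 0) == (mask[i] == 1.0) for i in range(d))
    if region_ok:
        # stability: spectral radius of the region's Jacobian D @ W
        stable = np.max(np.abs(np.linalg.eigvals(D @ W))) < 1
        fixed_points.append((z, stable))

for z, stable in fixed_points:
    print(np.round(z, 3), "stable" if stable else "unstable")
```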

Hypernetwork-based Meta-Learning for Low-Rank Physics-Informed Neural Networks
Woojin Cho Kookjin Lee Donsub Rim Noseong Park



Research question: how to run efficient numerical simulations for varying PDE input parameters in many-query scenarios?
Motivation: many engineering and applied science applications require repeated numerical simulation of partial differential equations (PDEs), but existing physics-informed neural networks (PINNs) require time-consuming training and are unsuitable for handling large numbers of queries.
Method: a lightweight low-rank PINN containing only hundreds of model parameters, combined with a hypernetwork-based meta-learning algorithm, which allows efficient approximation of PDE solutions across varying ranges of PDE input parameters.
Results: experiments show that the method is effective in overcoming the "failure modes" of PINNs and is efficient when handling large numbers of queries.

In various engineering and applied science applications, repetitive numerical simulations of partial differential equations (PDEs) for varying input parameters are often required (e.g., aircraft shape optimization over many design parameters) and solvers are required to perform rapid execution. In this study, we suggest a path that potentially opens up a possibility for physics-informed neural networks (PINNs), emerging deep-learning-based solvers, to be considered as one such solver. Although PINNs have pioneered a proper integration of deep-learning and scientific computing, they require repetitive time-consuming training of neural networks, which is not suitable for many-query scenarios. To address this issue, we propose a lightweight low-rank PINNs containing only hundreds of model parameters and an associated hypernetwork-based meta-learning algorithm, which allows efficient approximation of solutions of PDEs for varying ranges of PDE input parameters. Moreover, we show that the proposed method is effective in overcoming a challenging issue, known as "failure modes" of PINNs.

Physics-Driven ML-Based Modelling for Correcting Inverse Estimation
Ruiyuan Kang Tingting Mu Panos Liatsis Dimitrios Kyritsis



Research question: how to avoid failed estimations, which can have disastrous consequences, when deploying machine learning estimators in science and engineering domains.
Motivation: failed state estimations in science and engineering, for example in aero engine design, can have severe consequences.
Method: detect and correct failed state estimations through optimisation, using simulations and performance metrics guided by physical laws; the proposed approach, GEESE, comprises a hybrid surrogate error model and two generative models.
Results: tested on three real-world science and engineering inverse problems, GEESE fails the fewest times in finding a feasible state correction and, in general, requires physical evaluations less frequently.

When deploying machine learning estimators in science and engineering (SAE) domains, it is critical to avoid failed estimations that can have disastrous consequences, e.g., in aero engine design. This work focuses on detecting and correcting failed state estimations before adopting them in SAE inverse problems, by utilizing simulations and performance metrics guided by physical laws. We suggest to flag a machine learning estimation when its physical model error exceeds a feasible threshold, and propose a novel approach, GEESE, to correct it through optimization, aiming at delivering both low error and high efficiency. The key designs of GEESE include (1) a hybrid surrogate error model to provide fast error estimations to reduce simulation cost and to enable gradient based backpropagation of error feedback, and (2) two generative models to approximate the probability distributions of the candidate states for simulating the exploitation and exploration behaviours. All three models are constructed as neural networks. GEESE is tested on three real-world SAE inverse problems and compared to a number of state-of-the-art optimization/search approaches. Results show that it fails the least number of times in terms of finding a feasible state correction, and requires physical evaluations less frequently in general.

Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models
Zhong Yi Wan Ricardo Baptista Anudhyan Boral Yi-Fan Chen John Anderson Fei Sha Leonardo Zepeda-Nunez



Research question: how to perform statistical downscaling with unpaired data?
Motivation: existing statistical downscaling methods require paired data and fail to correctly match the statistics of physical quantities when the low-frequency content of inputs and outputs does not match.
Method: a two-stage probabilistic framework consisting of a debiasing step via an optimal transport map and an upsampling step via a probabilistic diffusion model with a posteriori conditional sampling.
Results: demonstrated on one- and two-dimensional fluid flow problems, the method produces high-resolution outputs from low-resolution inputs and correctly matches the statistics of physical quantities.

We introduce a two-stage probabilistic framework for statistical downscaling using unpaired data. Statistical downscaling seeks a probabilistic map to transform low-resolution data from a biased coarse-grained numerical scheme to high-resolution data that is consistent with a high-fidelity scheme. Our framework tackles the problem by composing two transformations: (i) a debiasing step via an optimal transport map, and (ii) an upsampling step achieved by a probabilistic diffusion model with a posteriori conditional sampling. This approach characterizes a conditional distribution without needing paired data, and faithfully recovers relevant physical statistics from biased samples. We demonstrate the utility of the proposed approach on one- and two-dimensional fluid flow problems, which are representative of the core difficulties present in numerical simulations of weather and climate. Our method produces realistic high-resolution outputs from low-resolution inputs, by upsampling resolutions of $8\times$ and $16\times$. Moreover, our procedure correctly matches the statistics of physical quantities, even when the low-frequency content of the inputs and outputs do not match, a crucial but difficult-to-satisfy assumption needed by current state-of-the-art alternatives. Code for this work is available at: https://github.com/google-research/swirl-dynamics/tree/main/swirl_dynamics/projects/probabilistic_diffusion.

Provable benefits of score matching
Chirag Pabbaraju Dhruv Rohatgi Anish Sevekari Holden Lee Ankur Moitra Andrej Risteski



Research question: finding an alternative to maximum likelihood for estimating probability distributions parametrised up to a constant of proportionality.
Motivation: for certain exponential families of fixed degree and parameter radius, optimising the maximum likelihood loss is NP-hard, even though its statistical efficiency is polynomial in the parameter radius and the ambient dimension.
Method: estimate these distributions with score matching, which sidesteps the normalising constant and whose optimisation is both computationally and statistically efficient.
Results: the analysis shows that minimising the score matching loss is efficient both computationally and statistically, making score matching an effective method for this class of exponential family distributions.

Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the "score" of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, the benefits and tradeoffs relative to maximum likelihood---both computational and statistical---are not well understood. In this work, we give the first example of a natural exponential family of distributions such that the score matching loss is computationally efficient to optimize, and has a comparable statistical efficiency to ML, while the ML loss is intractable to optimize using a gradient-based method. The family consists of exponentials of polynomials of fixed degree, and our result can be viewed as a continuous analogue of recent developments in the discrete setting. Precisely, we show: (1) Designing a zeroth-order or first-order oracle for optimizing the maximum likelihood loss is NP-hard. (2) Maximum likelihood has a statistical efficiency polynomial in the ambient dimension and the radius of the parameters of the family. (3) Minimizing the score matching loss is both computationally and statistically efficient, with complexity polynomial in the ambient dimension.
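
A worked 1-D sketch of why score matching is computationally benign in this family: for p(x) proportional to exp(sum_k theta_k x^k), the score matching objective is a quadratic in theta, minimised in closed form without ever touching the normalising constant.

```python
import numpy as np

# For log p(x) = sum_k theta_k x^k (up to a constant), the score
# s(x) = sum_k theta_k * k * x^(k-1) is linear in theta, so the objective
# J(theta) = E[0.5*s(x)^2 + s'(x)] is quadratic and solved exactly.
rng = np.random.default_rng(0)
x = rng.normal(size=50000)            # data from N(0,1) = exp(-x^2/2)/Z
degree = 4                            # family: exponentials of quartics

phi1 = np.stack([k * x ** (k - 1)
                 for k in range(1, degree + 1)])          # s(x) features
phi2 = np.stack([k * (k - 1) * x ** (k - 2) if k > 1 else np.zeros_like(x)
                 for k in range(1, degree + 1)])          # s'(x) features

A = phi1 @ phi1.T / len(x)            # E[phi'(x) phi'(x)^T]
c = phi2.mean(axis=1)                 # E[phi''(x)]
theta = np.linalg.solve(A, -c)        # argmin 0.5 theta^T A theta + c^T theta
print(np.round(theta, 3))             # ~ [0, -0.5, 0, 0]: recovers -x^2/2
```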

Unifying Predictions of Deterministic and Stochastic Physics in Mesh-reduced Space with Sequential Flow Generative Model
Luning Sun Xu Han Han Gao Jian-Xun Wang Liping Liu



Research question: how to accurately predict dynamical systems on unstructured meshes?
Motivation: many dynamical systems exhibit non-negligible stochasticity introduced by various factors (e.g. chaoticity), so a unified framework is needed that captures both the deterministic and the stochastic components in the rollouts of these systems.
Method: inspired by regeneration learning, a new model combines generative and sequential networks to model dynamical systems: an autoencoder learns compact representations of the full-space physical variables in a low-dimensional space, and a transformer is integrated with a conditional normalizing flow model to model the temporal sequence of latent representations.
Results: evaluated on both deterministic and stochastic systems, the new model outperforms several competitive baselines and makes more accurate predictions of deterministic systems, with its prediction error reflected in its uncertainty estimates; for stochastic systems it generates high-quality rollout samples whose mean and variance match well the statistics of samples computed from expensive numerical simulations.

Accurate prediction of dynamical systems in unstructured meshes has recently shown successes in scientific simulations. Many dynamical systems have a nonnegligible level of stochasticity introduced by various factors (e.g. chaoticity), so there is a need for a unified framework that captures both deterministic and stochastic components in the rollouts of these systems. Inspired by regeneration learning, we propose a new model that combines generative and sequential networks to model dynamical systems. Specifically, we use an autoencoder to learn compact representations of full-space physical variables in a low-dimensional space. We then integrate a transformer with a conditional normalizing flow model to model the temporal sequence of latent representations. We evaluate the new model in both deterministic and stochastic systems. The model outperforms several competitive baseline models and makes more accurate predictions of deterministic systems. Its own prediction error is also reflected in its uncertainty estimations. When predicting stochastic systems, the proposed model generates high-quality rollout samples. The mean and variance of these samples well match the statistics of samples computed from expensive numerical simulations.

Distributionally Robust Skeleton Learning of Discrete Bayesian Networks
Yeshu Li Brian D Ziebart



Research question: learning the exact skeleton of general discrete Bayesian networks from potentially corrupted data.
Motivation: to account for the effect of outliers, a worst-case risk optimisation approach is proposed for handling data corruption.
Method: building on distributionally robust optimisation and a regression approach, optimise the most adverse risk over a family of distributions within bounded Wasserstein distance or KL divergence of the empirical distribution.
Results: for bounded-degree graphs, the method enjoys non-asymptotic guarantees for successful structure learning with logarithmic sample complexity; numerical studies on synthetic and real datasets validate its effectiveness.

We consider the problem of learning the exact skeleton of general discrete Bayesian networks from potentially corrupted data. Building on distributionally robust optimization and a regression approach, we propose to optimize the most adverse risk over a family of distributions within bounded Wasserstein distance or KL divergence to the empirical distribution. The worst-case risk accounts for the effect of outliers. The proposed approach applies for general categorical random variables without assuming faithfulness, an ordinal relationship or a specific form of conditional distribution. We present efficient algorithms and show the proposed methods are closely related to the standard regularized regression approach. Under mild assumptions, we derive non-asymptotic guarantees for successful structure learning with logarithmic sample complexities for bounded-degree graphs. Numerical study on synthetic and real datasets validates the effectiveness of our method.

Kernel Quadrature with Randomly Pivoted Cholesky
Ethan Nicholas Epperly Elvira Moreno Ferreira



Research question: this paper proposes new quadrature rules for functions in a reproducing kernel Hilbert space.
Motivation: existing kernel quadrature methods either achieve low accuracy or require solving a computationally challenging sampling problem.
Method: draw quadrature nodes with the randomly pivoted Cholesky sampling algorithm, yielding a new computational procedure.
Results: theoretical and numerical results show that randomly pivoted Cholesky is fast and achieves quadrature error rates comparable to computationally more expensive schemes based on continuous volume sampling, thinning, and recombination; the method adapts easily to complicated geometries with arbitrary kernels, unlocking new potential for kernel quadrature.

This paper presents new quadrature rules for functions in a reproducing kernel Hilbert space using nodes drawn by a sampling algorithm known as randomly pivoted Cholesky. The resulting computational procedure compares favorably to previous kernel quadrature methods, which either achieve low accuracy or require solving a computationally challenging sampling problem. Theoretical and numerical results show that randomly pivoted Cholesky is fast and achieves comparable quadrature error rates to more computationally expensive quadrature schemes based on continuous volume sampling, thinning, and recombination. Randomly pivoted Cholesky is easily adapted to complicated geometries with arbitrary kernels, unlocking new potential for kernel quadrature.
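
The sampling algorithm itself is short; here is a sketch (our implementation, with the quadrature weights omitted, since they come from a small additional linear solve): pivots are drawn with probability proportional to the current diagonal residual.

```python
import numpy as np

# Randomly pivoted Cholesky: greedily build a rank-k approximation of a
# kernel matrix, drawing each pivot at random with probability proportional
# to the current diagonal residual. The pivots serve as quadrature nodes.
def rp_cholesky(K, k, rng):
    n = K.shape[0]
    F = np.zeros((n, k))
    d = np.diag(K).copy().astype(float)   # diagonal of the residual
    pivots = []
    for i in range(k):
        s = rng.choice(n, p=d / d.sum())  # random pivot ~ diagonal residual
        g = K[:, s] - F[:, :i] @ F[s, :i]
        F[:, i] = g / np.sqrt(g[s])
        d -= F[:, i] ** 2
        d = np.clip(d, 0.0, None)         # guard against round-off
        pivots.append(s)
    return F, pivots                      # K is approximated by F @ F.T

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
F, nodes = rp_cholesky(K, k=30, rng=rng)
print(np.linalg.norm(K - F @ F.T) / np.linalg.norm(K))
```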

Bayesian target optimisation for high-precision holographic optogenetics
Marcus Triplett Marta Agnieszka Gajowa Hillel Adesnik Liam Paninski



Research question: how to overcome the inadvertent activation of non-target neurons in optogenetics and achieve precise optogenetic control of neural population activity.
Motivation: current optogenetic techniques suffer from off-target stimulation (OTS), the inadvertent activation of nearby non-target neurons due to imperfect confinement of light onto target neurons, which limits their precision.
Method: a novel computational approach called Bayesian target optimisation, which uses nonparametric Bayesian inference to model neural responses to optogenetic stimulation and optimises laser powers and optical target locations to minimise OTS.
Results: validation in simulations and on data from in vitro experiments shows that Bayesian target optimisation considerably reduces OTS across all tested conditions, substantially improving the precision of optogenetic stimulation.

Two-photon optogenetics has transformed our ability to probe the structure and function of neural circuits. However, achieving precise optogenetic control of neural ensemble activity has remained fundamentally constrained by the problem of off-target stimulation (OTS): the inadvertent activation of nearby non-target neurons due to imperfect confinement of light onto target neurons. Here we propose a novel computational approach to this problem called Bayesian target optimisation. Our approach uses nonparametric Bayesian inference to model neural responses to optogenetic stimulation, and then optimises the laser powers and optical target locations needed to achieve a desired activity pattern with minimal OTS. We validate our approach in simulations and using data from in vitro experiments, showing that Bayesian target optimisation considerably reduces OTS across all conditions we test. Together, these results establish our ability to overcome OTS, enabling optogenetic stimulation with substantially improved precision.

On Learning Necessary and Sufficient Causal Graphs
Hengrui Cai Yixin Wang Michael Jordan Rui Song



Research question: Existing methods attempt to discover causal relationships among all variables in complex large-scale graphs, yet in practice only a small subset of variables is relevant to the outcomes of interest.
Motivation: To address this, the paper proposes learning a class of necessary and sufficient causal graphs (NSCG) that exclusively comprise the causal relationships relevant to the outcome of interest, termed "causal features".
Method: The key idea is to use "probabilities of causation" to systematically evaluate the importance of features in the graph and identify the subgraph relevant to the outcome. To this end, a necessary and sufficient causal structural learning (NSCSL) algorithm is developed by establishing the relationship between probabilities of causation and natural causal effects of features.
Results: Empirical studies on simulated and real data show that NSCSL outperforms existing algorithms and can reveal key yeast genes for target heritable traits.

The causal revolution has stimulated interest in understanding complex relationships in various fields. Most of the existing methods aim to discover causal relationships among all variables within a complex large-scale graph. However, in practice, only a small subset of variables in the graph are relevant to the outcomes of interest. Consequently, causal estimation with the full causal graph---particularly given limited data---could lead to numerous *falsely discovered, spurious* variables that exhibit high correlation with, but exert no causal impact on, the target outcome. In this paper, we propose learning a class of *necessary and sufficient causal graphs (NSCG)* that exclusively comprises causally relevant variables for an outcome of interest, which we term *causal features*. The key idea is to employ *probabilities of causation* to systematically evaluate the importance of features in the causal graph, allowing us to identify a subgraph relevant to the outcome of interest. To learn NSCG from data, we develop a *necessary and sufficient causal structural learning (NSCSL)* algorithm, by establishing theoretical properties and relationships between probabilities of causation and natural causal effects of features. Across empirical studies of simulated and real data, we demonstrate that NSCSL outperforms existing algorithms and can reveal crucial yeast genes for target heritable traits of interest.

Optimal Exploration for Model-Based RL in Nonlinear Systems
Andrew Wagenmaker Guanya Shi Kevin Jamieson



Research question: How to efficiently learn to control unknown nonlinear dynamical systems.
Motivation: In practice, the cost of learning a good controller can depend heavily on a few critical system parameters, so exploration should focus on learning those parameters.
Method: Minimizing the controller loss is shown to reduce to estimating the system parameters in a particular task-dependent metric, and an algorithm is developed that efficiently explores the system to reduce uncertainty in this metric.
Results: Experiments demonstrate the effectiveness of the method on realistic nonlinear robotic systems.

Learning to control unknown nonlinear dynamical systems is a fundamental problem in reinforcement learning and control theory. A commonly applied approach is to first explore the environment (exploration), learn an accurate model of it (system identification), and then compute an optimal controller with the minimum cost on this estimated system (policy optimization). While existing work has shown that it is possible to learn a uniformly good model of the system (Mania et al., 2020), in practice, if we aim to learn a good controller with a low cost on the actual system, certain system parameters may be significantly more critical than others, and we therefore ought to focus our exploration on learning such parameters. In this work, we consider the setting of nonlinear dynamical systems and seek to formally quantify, in such settings, (a) which parameters are most relevant to learning a good controller, and (b) how we can best explore so as to minimize uncertainty in such parameters. Inspired by recent work in linear systems (Wagenmaker et al., 2021), we show that minimizing the controller loss in nonlinear systems translates to estimating the system parameters in a particular, task-dependent metric. Motivated by this, we develop an algorithm able to efficiently explore the system to reduce uncertainty in this metric, and prove a lower bound showing that our approach learns a controller at a near-instance-optimal rate. Our algorithm relies on a general reduction from policy optimization to optimal experiment design in arbitrary systems, and may be of independent interest. We conclude with experiments demonstrating the effectiveness of our method in realistic nonlinear robotic systems.

SE(3) Equivariant Augmented Coupling Flows
Laurence Illing Midgley Vincent Stimper Javier Antoran Emile Mathieu Bernhard Schölkopf José Miguel Hernández-Lobato



Research question: How can coupling normalizing flows retain fast sampling and density evaluation while respecting the SE(3) and permutation invariances of physical systems?
Motivation: The standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with these invariances.
Method: A coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms' positions into learned SE(3)-invariant bases, applies standard flow transformations such as monotonic rational-quadratic splines, and then returns to the original basis.
Results: Trained on the DW4, LJ13, and QM9-positional datasets, the flow is competitive with equivariant continuous normalizing flows while sampling an order of magnitude faster. It is also the first to learn the full Boltzmann distribution of alanine dipeptide by modeling only the Cartesian positions of its atoms, and it can be trained to approximately sample from the Boltzmann distributions of the DW4 and LJ13 particle systems using only their energy functions.

Coupling normalizing flows allow for fast sampling and density evaluation, making them the tool of choice for probabilistic modeling of physical systems. However, the standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with the SE(3) and permutation invariances of physical systems. This work proposes a coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms' positions into learned SE(3) invariant bases, where we apply standard flow transformations, such as monotonic rational-quadratic splines, before returning to the original basis. Crucially, our flow preserves fast sampling and density evaluation, and may be used to produce unbiased estimates of expectations with respect to the target distribution via importance sampling. When trained on the DW4, LJ13, and QM9-positional datasets, our flow is competitive with equivariant continuous normalizing flows, while allowing sampling more than an order of magnitude faster. Moreover, to the best of our knowledge, we are the first to learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms. Lastly, we demonstrate that our flow can be trained to approximately sample from the Boltzmann distribution of the DW4 and LJ13 particle systems using only their energy functions.

GloptiNets: Scalable Non-Convex Optimization with Certificates
Gaspard Beugnot Julien Mairal Alessandro Rudi



Research question: This paper proposes a new approach to non-convex optimization with certificates for smooth functions on the hypercube or the torus.
Motivation: Traditional certified methods rely on algebraic properties, whereas this approach exploits the regularity of the target function encoded in the decay of its Fourier spectrum.
Method: By defining a tractable family of models, the approach simultaneously obtains precise certificates and leverages the powerful computational techniques developed for optimizing neural networks.
Results: Applied to polynomials of moderate dimension but with thousands of coefficients, the method outperforms state-of-the-art certified optimization methods such as those based on Lasserre's hierarchy, solving problems that are intractable for the competitors.

We present a novel approach to non-convex optimization with certificates, which handles smooth functions on the hypercube or on the torus. Unlike traditional methods that rely on algebraic properties, our algorithm exploits the regularity of the target function intrinsic in the decay of its Fourier spectrum. By defining a tractable family of models, we are able {\em at the same time} to obtain precise certificates and to leverage the advanced and powerful computational techniques developed to optimize neural networks. In this way the scalability of our approach is naturally enhanced by parallel computing with GPUs. Our approach, when applied to the case of polynomials of moderate dimensions but with thousands of coefficients, outperforms the state-of-the-art optimization methods with certificates, such as those based on Lasserre's hierarchy, addressing problems intractable for the competitors.

Posterior Contraction Rates for Matérn Gaussian Processes on Riemannian Manifolds
Paul Rosa Viacheslav Borovitskiy Alexander Terenin Judith Rousseau



Research question: Uncertainty quantification with Gaussian processes in geometric settings, in particular when inputs lie on a Riemannian manifold.
Motivation: Computational tools for such intrinsic models have recently been developed, raising the question of whether they can be shown theoretically to outperform simply embedding all relevant quantities into Euclidean space and using the restriction of an ordinary Euclidean Gaussian process.
Method: The paper proves optimal contraction rates for intrinsic Matérn Gaussian processes defined on compact Riemannian manifolds, and analogous rates for extrinsic processes using trace and extension theorems between manifold and ambient Sobolev spaces.
Results: A series of examples shows that intrinsic processes can achieve better performance in practice. The work therefore indicates that finer-grained analyses are needed to distinguish different levels of data efficiency among geometric Gaussian processes, particularly in regimes with small data sets and non-asymptotic behavior.

Gaussian processes are used in many machine learning applications that rely on uncertainty quantification. Recently, computational tools for working with these models in geometric settings, such as when inputs lie on a Riemannian manifold, have been developed. This raises the question: can these intrinsic models be shown theoretically to lead to better performance, compared to simply embedding all relevant quantities into $\mathbb{R}^d$ and using the restriction of an ordinary Euclidean Gaussian process? To study this, we prove optimal contraction rates for intrinsic Matérn Gaussian processes defined on compact Riemannian manifolds. We also prove analogous rates for extrinsic processes using trace and extension theorems between manifold and ambient Sobolev spaces: somewhat surprisingly, the rates obtained turn out to coincide with those of the intrinsic processes, provided that their smoothness parameters are matched appropriately. We illustrate these rates empirically on a number of examples, which, mirroring prior work, show that intrinsic processes can achieve better performance in practice. Therefore, our work shows that finer-grained analyses are needed to distinguish between different levels of data-efficiency of geometric Gaussian processes, particularly in settings which involve small data set sizes and non-asymptotic behavior.

CS4ML: A general framework for active learning with arbitrary data based on Christoffel functions
Juan M Cardenas Ben Adcock Nick Dexter



Research question: This paper proposes a general framework for active learning in regression problems.
Motivation: Standard active learning frameworks only support pointwise samples of the target function, whereas this framework allows more general data types such as Fourier data, vector-valued data, data along continuous curves, and multimodal data.
Method: Random sampling according to a finite number of sampling measures over arbitrary nonlinear approximation spaces (model classes), together with the concept of "generalized Christoffel functions" used to optimize the sampling measures.
Results: Active learning is often desirable in scientific computing, where generating data is usually expensive. The framework's efficacy is demonstrated on gradient-augmented learning with polynomials, Magnetic Resonance Imaging (MRI) using generative models, and adaptive sampling for solving partial differential equations (PDEs) with physics-informed neural networks (PINNs).

We introduce a general framework for active learning in regression problems. Our framework extends the standard setup by allowing for general types of data, rather than merely pointwise samples of the target function. This generalization covers many cases of practical interest, such as data acquired in transform domains (e.g., Fourier data), vector-valued data (e.g., gradient-augmented data), data acquired along continuous curves, and multimodal data (i.e., combinations of different types of measurements). Our framework considers random sampling according to a finite number of sampling measures and arbitrary nonlinear approximation spaces (model classes). We introduce the concept of \textit{generalized Christoffel functions} and show how these can be used to optimize the sampling measures. We prove that this leads to near-optimal sample complexity in various important cases. This paper focuses on applications in scientific computing, where active learning is often desirable, since it is usually expensive to generate data. We demonstrate the efficacy of our framework for gradient-augmented learning with polynomials, Magnetic Resonance Imaging (MRI) using generative models and adaptive sampling for solving PDEs using Physics-Informed Neural Networks (PINNs).
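
A minimal sketch of the classical special case (pointwise samples, polynomial model class) may help: for an orthonormal basis, the reciprocal Christoffel function is the sum of squared basis functions, sampling proportionally to it and reweighting the least-squares fit gives the near-optimal scheme. The Legendre basis on [-1, 1], the target function, and all sizes are our illustrative choices; the paper's generalized Christoffel functions cover far more general sampling operators.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n = 8  # dimension of the polynomial space (degree <= 7)

def basis(x):
    """Orthonormal Legendre basis w.r.t. the uniform measure on [-1, 1]."""
    V = legendre.legvander(x, n - 1)
    return V * np.sqrt(2 * np.arange(n) + 1)

def christoffel(x):
    """Reciprocal Christoffel function k(x) = sum_j phi_j(x)^2."""
    return (basis(x) ** 2).sum(axis=1)

def sample_optimal(m):
    """Rejection-sample from the density k(x)/n; here k(x) <= n^2."""
    out = []
    while len(out) < m:
        x = rng.uniform(-1, 1, size=64)
        acc = rng.uniform(0, 1, size=64) < christoffel(x) / n**2
        out.extend(x[acc])
    return np.array(out[:m])

# Weighted least squares: weights n / k(x) undo the sampling bias.
f = lambda x: np.exp(x)
xs = sample_optimal(4 * n)
w = np.sqrt(n / christoffel(xs))
coef, *_ = np.linalg.lstsq(w[:, None] * basis(xs), w * f(xs), rcond=None)

xt = np.linspace(-1, 1, 200)
print(np.max(np.abs(basis(xt) @ coef - f(xt))))   # small uniform error
```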

Tree Variational Autoencoders
Laura Manduchi Moritz Vandenhirtz Alain Ryser Julia E Vogt



Research question: A new generative hierarchical clustering model, the Tree Variational Autoencoder (TreeVAE).
Motivation: Existing models are limited in uncovering hidden data structure; a model is needed that adaptively discovers the optimal tree for encoding dependencies between latent variables.
Method: TreeVAE hierarchically divides samples according to their intrinsic characteristics, shedding light on hidden structure in the data. Its tree-based generative architecture enables lightweight conditional inference and improves generative performance via specialized leaf decoders.
Results: Experiments show that TreeVAE uncovers underlying clusters and finds meaningful hierarchical relations between groups on a variety of datasets, including real-world imaging data, and provides a more competitive log-likelihood lower bound than its sequential counterparts. Owing to its generative nature, TreeVAE can also generate new samples from the discovered clusters via conditional sampling.

We propose Tree Variational Autoencoder (TreeVAE), a new generative hierarchical clustering model that learns a flexible tree-based posterior distribution over latent variables. TreeVAE hierarchically divides samples according to their intrinsic characteristics, shedding light on hidden structures in the data. It adapts its architecture to discover the optimal tree for encoding dependencies between latent variables. The proposed tree-based generative architecture enables lightweight conditional inference and improves generative performance by utilizing specialized leaf decoders. We show that TreeVAE uncovers underlying clusters in the data and finds meaningful hierarchical relations between the different groups on a variety of datasets, including real-world imaging data. We show empirically that TreeVAE provides a more competitive log-likelihood lower bound than its sequential counterparts. Finally, due to its generative nature, TreeVAE is able to generate new samples from the discovered clusters via conditional sampling.

Auditing Fairness by Betting
Ben Chugg Santiago Cortes-Gomez Bryan Wilder Aaditya Ramdas



Research question: Practical, efficient, and nonparametric methods for auditing the fairness of deployed classification and regression models.
Motivation: Unlike previous work that relies on a fixed sample size, the methods are sequential and allow continuous monitoring of incoming data, making them well suited to tracking the fairness of real-world systems.
Method: Data may be collected by a probabilistic policy rather than sampled uniformly from the population, enabling audits on data gathered for other purposes; the policy may change over time and differ across subpopulations. The methods also handle distribution shift arising from changes to the model or the underlying population.
Results: The approach builds on recent progress in anytime-valid inference and game-theoretic statistics, in particular the "testing by betting" framework, which makes the methods interpretable, fast, and easy to implement. Efficacy is demonstrated on three benchmark fairness datasets.

We provide practical, efficient, and nonparametric methods for auditing the fairness of deployed classification and regression models. Whereas previous work relies on a fixed-sample size, our methods are sequential and allow for the continuous monitoring of incoming data, making them highly amenable to tracking the fairness of real-world systems. We also allow the data to be collected by a probabilistic policy as opposed to sampled uniformly from the population. This enables auditing to be conducted on data gathered for another purpose. Moreover, this policy may change over time and different policies may be used on different subpopulations. Finally, our methods can handle distribution shift resulting from either changes to the model or changes in the underlying population. Our approach is based on recent progress in anytime-valid inference and game-theoretic statistics---the ``testing by betting'' framework in particular. These connections ensure that our methods are interpretable, fast, and easy to implement. We demonstrate the efficacy of our approach on three benchmark fairness datasets.
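
To make the "testing by betting" mechanic concrete, here is a minimal anytime-valid sketch for auditing a mean difference in a bounded per-round fairness metric. The truncated-running-mean betting fraction is a simple stand-in for the paper's betting strategies, and the simulated error indicators are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05

def betting_audit(g, c=0.5):
    """Anytime-valid test of H0: E[g_t] = 0 for increments g_t in [-1, 1].

    Wealth W_t = prod_s (1 + lambda_s * g_s), with lambda_s predictable
    (it only uses g_1..g_{s-1}). Under H0, W_t is a nonnegative
    supermartingale, so by Ville's inequality, rejecting once W_t >= 1/alpha
    controls the Type-I error at alpha simultaneously over all times."""
    wealth, s, n = 1.0, 0.0, 0
    for t, gt in enumerate(g, 1):
        lam = np.clip(s / max(n, 1), -c, c)   # simple predictable bet
        wealth *= 1.0 + lam * gt
        s, n = s + gt, n + 1
        if wealth >= 1.0 / alpha:
            return t                           # audit flags a violation
    return None

# g_t: difference in per-round error indicators between two groups.
unfair = rng.binomial(1, 0.35, 5000) - rng.binomial(1, 0.25, 5000)
fair = rng.binomial(1, 0.30, 5000) - rng.binomial(1, 0.30, 5000)
print("unfair model flagged at t =", betting_audit(unfair))
print("fair model flagged at t =", betting_audit(fair))   # usually None
```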

Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision
Ayush Tewari Tianwei Yin George Cazenavette Semon Rezchikov Joshua B. Tenenbaum Fredo Durand William T. Freeman Vincent Sitzmann



Research question: Denoising diffusion models perform well when training samples are readily available, but how can they learn when the signals of interest are never directly observed?
Motivation: In inverse graphics, for example, the goal is to sample from a distribution of 3D scenes consistent with a given 2D image, yet ground-truth 3D scenes are unavailable and only partial observations are accessible.
Method: A new class of denoising diffusion probabilistic models that integrates a differentiable forward model directly into the denoising process, enabling sampling from distributions of signals that are never directly observed.
Results: Effectiveness is demonstrated on three challenging computer vision tasks; for instance, in inverse graphics the model can sample directly from the distribution of 3D scenes consistent with a single 2D input image.

Denoising diffusion models are a powerful type of generative models used to capture complex distributions of real-world signals. However, their applicability is limited to scenarios where training samples are readily available, which is not always the case in real-world applications. For example, in inverse graphics, the goal is to generate samples from a distribution of 3D scenes that align with a given image, but ground-truth 3D scenes are unavailable and only 2D images are accessible. To address this limitation, we propose a novel class of denoising diffusion probabilistic models that learn to sample from distributions of signals that are never directly observed. Instead, these signals are measured indirectly through a known differentiable forward model, which produces partial observations of the unknown signal. Our approach integrates the differentiable forward model directly into the denoising process. This integration effectively connects the generative modeling of observations with the generative modeling of the underlying signals, allowing for end-to-end training of a conditional generative model over signals. During inference, our approach enables sampling from the distribution of underlying signals that are consistent with a given partial observation. We demonstrate the effectiveness of our method on three challenging computer vision tasks. For instance, in the context of inverse graphics, our model enables direct sampling from the distribution of 3D scenes that align with a single 2D input image.

Conditional score-based diffusion models for Bayesian inference in infinite dimensions
Lorenzo Baldassari Ali Siahkoohi Josselin Garnier Knut Solna Maarten V. de Hoop



Research question: How to effectively solve linear inverse problems in infinite-dimensional function spaces.
Motivation: Score-based diffusion models (SDMs) have solved a variety of linear inverse problems in finite-dimensional vector spaces, but their application to infinite-dimensional function spaces has barely been addressed.
Method: A theoretically grounded method that samples from the posterior of infinite-dimensional Bayesian linear inverse problems using conditional SDMs.
Results: The conditional denoising estimator, a successful approach in finite dimensions, is proven to apply in infinite dimensions as well, and extensive numerical examples validate the effectiveness and feasibility of the method.

Since their initial introduction, score-based diffusion models (SDMs) have been successfully applied to solve a variety of linear inverse problems in finite-dimensional vector spaces due to their ability to efficiently approximate the posterior distribution. However, using SDMs for inverse problems in infinite-dimensional function spaces has only been addressed recently, primarily through methods that learn the unconditional score. While this approach is advantageous for some inverse problems, it is mostly heuristic and involves numerous computationally costly forward operator evaluations during posterior sampling. To address these limitations, we propose a theoretically grounded method for sampling from the posterior of infinite-dimensional Bayesian linear inverse problems based on amortized conditional SDMs. In particular, we prove that one of the most successful approaches for estimating the conditional score in finite dimensions—the conditional denoising estimator—can also be applied in infinite dimensions. A significant part of our analysis is dedicated to demonstrating that extending infinite-dimensional SDMs to the conditional setting requires careful consideration, as the conditional score typically blows up for small times, contrarily to the unconditional score. We conclude by presenting stylized and large-scale numerical examples that validate our approach, offer additional insights, and demonstrate that our method enables large-scale, discretization-invariant Bayesian inference.

Stein $\Pi$-Importance Sampling
Congye Wang Wilson Ye Chen Heishiro Kanagawa Chris J. Oates



Research question: How to design Markov chains that are well suited to Stein importance sampling.
Motivation: Stein discrepancies have become a powerful tool for retrospectively improving Markov chain Monte Carlo output, but how to design Markov chains suited to such post-processing remains unresolved.
Method: The paper studies Stein importance sampling, in which weights are assigned to the states visited by a $\Pi$-invariant Markov chain to obtain a consistent approximation of the target $P$.
Results: Surprisingly, the optimal choice of $\Pi$ is not identical to the target $P$; an explicit construction for $\Pi$ is therefore proposed based on a novel variational argument. For roughly 70% of tasks in the PosteriorDB benchmark, a significant improvement over the analogous post-processing of $P$-invariant Markov chains is reported.

Stein discrepancies have emerged as a powerful tool for retrospective improvement of Markov chain Monte Carlo output. However, the question of how to design Markov chains that are well-suited to such post-processing has yet to be addressed. This paper studies Stein importance sampling, in which weights are assigned to the states visited by a $\Pi$-invariant Markov chain to obtain a consistent approximation of $P$, the intended target. Surprisingly, the optimal choice of $\Pi$ is not identical to the target $P$; we therefore propose an explicit construction for $\Pi$ based on a novel variational argument. Explicit conditions for convergence of Stein $\Pi$-Importance Sampling are established. For $\approx 70$% of tasks in the PosteriorDB benchmark, a significant improvement over the analogous post-processing of $P$-invariant Markov chains is reported.
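
The weighting step itself is simple to sketch: given chain states and the score of $P$, build the Langevin-Stein kernel matrix and pick weights minimizing the kernel Stein discrepancy subject to summing to one. The IMQ base kernel, the Gaussian target, the overdispersed stand-in for a $\Pi$-invariant chain, and dropping the nonnegativity constraint are all simplifications; the paper's variational construction of $\Pi$ itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 300
beta, c2 = -0.5, 1.0
score = lambda x: -x          # score of the target P = N(0, I_d)

# Stand-in for states visited by a Pi-invariant chain, with Pi wider than P.
X = 1.5 * rng.standard_normal((n, d))

# Pairwise Langevin-Stein kernel for the IMQ base kernel k = (c2 + |x-y|^2)^beta.
R = X[:, None, :] - X[None, :, :]
r2 = (R ** 2).sum(-1)
q = c2 + r2
S = score(X)
Kp = (-2 * beta * d * q ** (beta - 1)                       # div_x div_y k
      - 4 * beta * (beta - 1) * r2 * q ** (beta - 2)
      - 2 * beta * q ** (beta - 1) * np.einsum('ijk,ik->ij', R, S)
      + 2 * beta * q ** (beta - 1) * np.einsum('ijk,jk->ij', R, S)
      + q ** beta * (S @ S.T))

# KSD^2 of the weighted empirical measure is w^T Kp w; minimizing it subject
# to sum(w) = 1 (nonnegativity dropped for simplicity) gives w prop. Kp^{-1} 1.
w = np.linalg.solve(Kp + 1e-8 * np.eye(n), np.ones(n))
w /= w.sum()

print("unweighted E[x^2]   :", (X ** 2).mean(0))            # near 2.25 (matches Pi)
print("Stein-weighted E[x^2]:", (w[:, None] * X ** 2).sum(0))  # much closer to 1 (P)
```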

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers
Phillip Lippe Bastiaan S. Veeling Paris Perdikaris Richard E Turner Johannes Brandstetter



Research question: How to improve the long-horizon accuracy and stability of deep neural network solvers for time-dependent partial differential equations (PDEs).
Motivation: Traditional PDE solvers are computationally expensive, so deep-neural-network surrogates have attracted growing interest; however, such neural PDE solvers must deliver accurate long rollouts, which is notoriously hard.
Method: An analysis of common temporal rollout strategies identifies the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the main obstacle to stable, accurate rollouts. Inspired by recent advances in diffusion models, PDE-Refiner models all frequency components more accurately via a multistep refinement process.
Results: PDE-Refiner is validated on challenging fluid dynamics benchmarks, delivering stable, accurate rollouts that outperform state-of-the-art neural, numerical, and hybrid neural-numerical architectures. It also greatly improves data efficiency, since the denoising objective implicitly induces a novel form of spectral data augmentation, and its connection to diffusion models enables accurate, efficient assessment of the model's predictive uncertainty.

Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. Recently, mostly due to the high computational cost of traditional solution techniques, deep neural network based surrogates have gained increased interest. The practical utility of such neural PDE solvers relies on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem. In this work, we present a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Based on these insights, we draw inspiration from recent advances in diffusion models to introduce PDE-Refiner; a novel model class that enables more accurate modeling of all frequency components via a multistep refinement process. We validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. We further demonstrate that PDE-Refiner greatly enhances data efficiency, since the denoising objective implicitly induces a novel form of spectral data augmentation. Finally, PDE-Refiner's connection to diffusion models enables an accurate and efficient assessment of the model's predictive uncertainty, allowing us to estimate when the surrogate becomes inaccurate.

Partial Counterfactual Identification of Continuous Outcomes with a Curvature Sensitivity Model
Valentyn Melnychuk Dennis Frauen Stefan Feuerriegel



Research question: Relaxing the strong assumptions that existing counterfactual inference methods for continuous outcomes place on the underlying structural causal model, and pursuing partial counterfactual identification instead.
Motivation: Existing methods for counterfactual inference with continuous outcomes aim at point identification and therefore make strong and unnatural assumptions about the underlying structural causal model.
Method: A new sensitivity model, the Curvature Sensitivity Model, obtains informative bounds by bounding the curvature of level sets of the functions; it is implemented as a novel deep generative model called the Augmented Pseudo-Invertible Decoder.
Results: Experiments demonstrate the effectiveness of the Augmented Pseudo-Invertible Decoder. To the best of the authors' knowledge, this is the first partial identification model for Markovian structural causal models with continuous outcomes.

Counterfactual inference aims to answer retrospective "what if" questions and thus belongs to the most fine-grained type of inference in Pearl's causality ladder. Existing methods for counterfactual inference with continuous outcomes aim at point identification and thus make strong and unnatural assumptions about the underlying structural causal model. In this paper, we relax these assumptions and aim at partial counterfactual identification of continuous outcomes, i.e., when the counterfactual query resides in an ignorance interval with informative bounds. We prove that, in general, the ignorance interval of the counterfactual queries has non-informative bounds, already when functions of structural causal models are continuously differentiable. As a remedy, we propose a novel sensitivity model called Curvature Sensitivity Model. This allows us to obtain informative bounds by bounding the curvature of level sets of the functions. We further show that existing point counterfactual identification methods are special cases of our Curvature Sensitivity Model when the bound of the curvature is set to zero. We then propose an implementation of our Curvature Sensitivity Model in the form of a novel deep generative model, which we call Augmented Pseudo-Invertible Decoder. Our implementation employs (i) residual normalizing flows with (ii) variational augmentations. We empirically demonstrate the effectiveness of our Augmented Pseudo-Invertible Decoder. To the best of our knowledge, ours is the first partial identification model for Markovian structural causal models with continuous outcomes.

Adversarial Robustness in Graph Neural Networks: A Hamiltonian Approach
Kai Zhao Qiyu Kang Yang Song Rui She Sijie Wang Wee Peng Tay



Research question: The vulnerability of graph neural networks (GNNs) to adversarial perturbations, including those that affect both node features and graph topology.
Motivation: Current GNNs are fragile under adversarial attacks, motivating new methods and theory to improve their robustness.
Method: Drawing on physics principles, GNNs are constructed from conservative Hamiltonian neural flows, and the adversarial robustness of different neural-flow GNNs is compared on various benchmark datasets.
Results: Experiments show that GNNs built on conservative Hamiltonian flows with Lyapunov stability substantially improve robustness against adversarial perturbations.

Graph neural networks (GNNs) are vulnerable to adversarial perturbations, including those that affect both node features and graph topology. This paper investigates GNNs derived from diverse neural flows, concentrating on their connection to various stability notions such as BIBO stability, Lyapunov stability, structural stability, and conservative stability. We argue that Lyapunov stability, despite its common use, does not necessarily ensure adversarial robustness. Inspired by physics principles, we advocate for the use of conservative Hamiltonian neural flows to construct GNNs that are robust to adversarial attacks. The adversarial robustness of different neural flow GNNs is empirically compared on several benchmark datasets under a variety of adversarial attacks. Extensive numerical experiments demonstrate that GNNs leveraging conservative Hamiltonian flows with Lyapunov stability substantially improve robustness against adversarial perturbations. The implementation code of experiments is available at \url{https://github.com/zknus/NeurIPS-2023-HANG-Robustness}.

A Cross-Moment Approach for Causal Effect Estimation
Yaroslav Kivva Saber Salehkaleybar Negar Kiyavash



Research question: In linear structural causal models with latent confounders, how to estimate the causal effect of a treatment on an outcome when only a single proxy variable is available.
Motivation: Existing methods require either restrictive assumptions on the data-generating model or access to at least two proxy variables; a new approach is proposed to remove these requirements.
Method: Estimate the causal effect from cross moments between the treatment, the outcome, and the proxy variable. In particular, if the latent confounder in the linear SCM is non-Gaussian, the causal effect can be identified through simple arithmetic operations on the cross moments.
Results: Experiments show that the method is effective for estimating causal effects.

We consider the problem of estimating the causal effect of a treatment on an outcome in linear structural causal models (SCM) with latent confounders when we have access to a single proxy variable. Several methods (such as the difference-in-differences (DiD) estimator or negative outcome control) have been proposed in this setting in the literature. However, these approaches require either restrictive assumptions on the data generating model or having access to at least two proxy variables. We propose a method to estimate the causal effect using cross moments between the treatment, the outcome, and the proxy variable. In particular, we show that the causal effect can be identified with simple arithmetic operations on the cross moments if the latent confounder in linear SCM is non-Gaussian. In this setting, the DiD estimator provides an unbiased estimate only in the special case where the latent confounder has exactly the same direct causal effects on the outcomes in the pre-treatment and post-treatment phases. This translates to the common trend assumption in DiD, which we effectively relax. Additionally, we provide an impossibility result that shows the causal effect cannot be identified if the observational distribution over the treatment, the outcome, and the proxy is jointly Gaussian. Our experiments on both synthetic and real-world datasets showcase the effectiveness of the proposed approach in estimating the causal effect.
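
To see why non-Gaussianity helps, here is a toy moment-arithmetic derivation in the spirit of the abstract. The SCM, the loadings, and the particular moment combination below are our own illustrative construction (verified algebraically for this toy model), not necessarily the paper's estimator: third cross moments isolate the confounder's contribution when the confounder is skewed and the noises are symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta = 2_000_000, 1.5                 # beta: true causal effect T -> Y
a, b, c = 0.8, 1.2, 0.7                  # confounder / proxy loadings

# Toy linear SCM: non-Gaussian latent confounder U, Gaussian noises.
U = rng.exponential(1.0, N) - 1.0        # E[U]=0, E[U^3] != 0
T = a * U + rng.standard_normal(N)
Y = beta * T + b * U + rng.standard_normal(N)
Z = c * U + rng.standard_normal(N)       # single proxy of U

E = lambda v: v.mean()
# With symmetric noises:  E[TZ] = a c E[U^2],  E[T Z^2] = a c^2 E[U^3],
# E[Z^3] = c^3 E[U^3],    E[YZ] = (beta*a + b) c E[U^2].
c2m2 = E(T * Z) * E(Z ** 3) / E(T * Z ** 2)        # = c^2 E[U^2]
a2m2 = E(T * Z) ** 2 / c2m2                        # = a^2 E[U^2]
abm2 = E(T * Z) * E(Y * Z) / c2m2                  # = a(beta*a + b) E[U^2]

# E[TY] = a(beta*a+b)E[U^2] + beta*Var(noise_T); subtracting the confounded
# part and dividing by the noise variance recovers beta.
beta_hat = (E(T * Y) - abm2) / (E(T ** 2) - a2m2)
print("naive OLS slope     :", E(T * Y) / E(T ** 2))   # biased (about 2.1)
print("cross-moment estimate:", beta_hat)              # about 1.5
```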

Outlier-Robust Gromov-Wasserstein for Graph Data
Lemin Kong Jiajin Li Jianheng Tang Anthony Man-Cho So



Research question: How to effectively compare and align probability distributions supported on different metric spaces, especially in the presence of outliers.
Motivation: The widely used Gromov-Wasserstein distance incurs large errors when outliers are weighted like ordinary samples, and needs to be made robust.
Method: A new, robust version of the Gromov-Wasserstein distance called RGW, which optimistically perturbs the marginal constraints within a Kullback-Leibler-divergence-based ambiguity set, together with a computationally efficient and theoretically provable Bregman proximal alternating linearized minimization algorithm.
Results: Experiments validate the theoretical results and demonstrate the effectiveness of RGW on real-world graph learning tasks such as subgraph matching and partial shape correspondence.

Gromov-Wasserstein (GW) distance is a powerful tool for comparing and aligning probability distributions supported on different metric spaces. Recently, GW has become the main modeling technique for aligning heterogeneous data for a wide range of graph learning tasks. However, the GW distance is known to be highly sensitive to outliers, which can result in large inaccuracies if the outliers are given the same weight as other samples in the objective function. To mitigate this issue, we introduce a new and robust version of the GW distance called RGW. RGW features optimistically perturbed marginal constraints within a Kullback-Leibler divergence-based ambiguity set. To make the benefits of RGW more accessible in practice, we develop a computationally efficient and theoretically provable procedure using Bregman proximal alternating linearized minimization algorithm. Through extensive experimentation, we validate our theoretical results and demonstrate the effectiveness of RGW on real-world graph learning tasks, such as subgraph matching and partial shape correspondence.

Sharp Spectral Rates for Koopman Operator Learning
Vladimir R Kostic Karim Lounici Pietro Novelli massimiliano pontil



Research question: How to learn the Koopman operator and its spectral decomposition from data.
Motivation: Nonlinear dynamical systems can be described by the associated Koopman operator, whose action evolves every observable of the system forward in time.
Method: The first non-asymptotic learning bounds for Koopman eigenvalues and eigenfunctions, with an analysis of two popular estimators: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR).
Results: The results hinge on novel minimax estimation bounds for the operator norm error, which may be of independent interest. The spectral learning bounds are driven by simultaneous control of the operator norm error and a novel metric distortion functional of the estimated eigenfunctions. They indicate that EDMD and RRR have similar variance, but EDMD suffers from a larger bias that can hurt its learning rate. The findings shed new light on the emergence of spurious eigenvalues, a well-known empirical issue, and numerical experiments illustrate the practical implications of the bounds.

Non-linear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator and its spectral decomposition from data is enabled by a number of algorithms. In this work we present for the first time non-asymptotic learning bounds for the Koopman eigenvalues and eigenfunctions. We focus on time-reversal-invariant stochastic dynamical systems, including the important example of Langevin dynamics. We analyze two popular estimators: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). Our results critically hinge on novel minimax estimation bounds for the operator norm error, which may be of independent interest. Our spectral learning bounds are driven by the simultaneous control of the operator norm error and a novel metric distortion functional of the estimated eigenfunctions. The bounds indicate that both EDMD and RRR have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. Our results shed new light on the emergence of spurious eigenvalues, an issue which is well known empirically. Numerical experiments illustrate the implications of the bounds in practice.
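
For readers unfamiliar with EDMD, here is a minimal sketch on an Ornstein-Uhlenbeck process, a time-reversal-invariant Langevin-type example of the kind the paper studies. The monomial dictionary and all sizes are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Trajectory of a discretized OU process: dX = -X dt + sqrt(2) dW.
n, dt = 20000, 0.1
x = np.zeros(n)
for t in range(n - 1):
    x[t + 1] = x[t] - x[t] * dt + np.sqrt(2 * dt) * rng.standard_normal()

# Dictionary of observables (a design choice): monomials up to degree 3.
def phi(v):
    return np.stack([np.ones_like(v), v, v**2, v**3], axis=1)

Phi0, Phi1 = phi(x[:-1]), phi(x[1:])

# EDMD: least-squares Koopman matrix K with Phi1 approx. Phi0 @ K.
K = np.linalg.lstsq(Phi0, Phi1, rcond=None)[0]
eigvals = np.linalg.eigvals(K)

# Reference: eigenvalues are (1 - dt)^k for this Euler discretization,
# close to the continuous-time exp(-k * dt), k = 0, 1, 2, 3.
print(np.sort(eigvals.real)[::-1])
print((1 - dt) ** np.arange(4))
```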

A Dynamical System View of Langevin-Based Non-Convex Sampling
Mohammad Reza Karimi Jaghargh Ya-Ping Hsieh Andreas Krause



Research question: Non-convex sampling is a key challenge in machine learning, central to non-convex optimization in deep learning and to approximate probabilistic inference.
Motivation: Despite its importance, significant theoretical challenges remain: existing guarantees lack last-iterate guarantees, and little is known beyond the elementary scheme of stochastic gradient Langevin dynamics.
Method: A novel framework that resolves these issues by harnessing several tools from the theory of dynamical systems. The key result is that, for a large class of state-of-the-art sampling schemes, last-iterate convergence in Wasserstein distance can be reduced to the study of their continuous-time counterparts, which are much better understood.
Results: Coupled with standard assumptions of MCMC sampling, the theory immediately yields last-iterate Wasserstein convergence for many advanced sampling schemes, such as mirror Langevin, proximal, randomized midpoint, and Runge-Kutta methods.

Non-convex sampling is a key challenge in machine learning, central to non-convex optimization in deep learning as well as to approximate probabilistic inference. Despite its significance, theoretically there remain some important challenges: Existing guarantees suffer from the drawback of lacking guarantees for the last-iterates, and little is known beyond the elementary schemes of stochastic gradient Langevin dynamics. To address these issues, we develop a novel framework that lifts the above issues by harnessing several tools from the theory of dynamical systems. Our key result is that, for a large class of state-of-the-art sampling schemes, their last-iterate convergence in Wasserstein distances can be reduced to the study of their continuous-time counterparts, which is much better understood. Coupled with standard assumptions of MCMC sampling, our theory immediately yields the last-iterate Wasserstein convergence of many advanced sampling schemes such as mirror Langevin, proximal, randomized mid-point, and Runge-Kutta methods.
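
The elementary scheme the abstract mentions is easy to state in code. Below is the unadjusted overdamped Langevin update on a simple non-convex double-well potential; the advanced schemes covered by the paper's framework replace this update rule. The potential and step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-convex potential U(x) = (x^2 - 1)^2 / 4; target density prop. exp(-U).
gradU = lambda x: x * (x**2 - 1)

eta, n_steps = 1e-2, 200_000
x = 0.0
samples = np.empty(n_steps)
for k in range(n_steps):
    # Unadjusted Langevin step: x <- x - eta * grad U(x) + sqrt(2 eta) * noise.
    # SGLD would replace grad U by a stochastic estimate.
    x = x - eta * gradU(x) + np.sqrt(2 * eta) * rng.standard_normal()
    samples[k] = x

# The chain should visit both modes x = -1 and x = +1.
print("mean |x| of final iterates:", np.abs(samples[-50_000:]).mean())
print("fraction of time with x > 0:", (samples > 0).mean())
```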

Tree-Based Diffusion Schrödinger Bridge with Applications to Wasserstein Barycenters
Maxence Noble Valentin De Bortoli Arnaud Doucet Alain Durmus



Research question: Multi-marginal optimal transport (mOT), in particular an entropic version with a tree-structured quadratic cost.
Motivation: mOT aims at minimizing the integral of a cost function with respect to a distribution with prescribed marginals, a setting where existing optimal transport methods struggle.
Method: A Tree-based Diffusion Schrödinger Bridge (TreeDSB) algorithm, a dynamic, continuous state-space method that can be viewed as the dynamic continuous counterpart of the multimarginal Sinkhorn algorithm.
Results: The method applies in high-dimensional settings such as image interpolation and Bayesian fusion, and experiments show that it effectively computes Wasserstein barycenters, which can be recast as solutions of mOT problems on star-shaped trees.

Multi-marginal Optimal Transport (mOT), a generalization of OT, aims at minimizing the integral of a cost function with respect to a distribution with some prescribed marginals. In this paper, we consider an entropic version of mOT with a tree-structured quadratic cost, i.e., a function that can be written as a sum of pairwise cost functions between the nodes of a tree. To address this problem, we develop Tree-based Diffusion Schr\"odinger Bridge (TreeDSB), an extension of the Diffusion Schr\"odinger Bridge (DSB) algorithm. TreeDSB corresponds to a dynamic and continuous state-space counterpart of the multimarginal Sinkhorn algorithm. A notable use case of our methodology is to compute Wasserstein barycenters which can be recast as the solution of a mOT problem on a star-shaped tree. We demonstrate that our methodology can be applied in high-dimensional settings such as image interpolation and Bayesian fusion.

ARTree: A Deep Autoregressive Model for Phylogenetic Inference
Tianyu Xie Cheng Zhang



Research question: Designing flexible probabilistic models over tree topologies, which is important for developing efficient phylogenetic inference methods.
Motivation: Previous work typically exploits the similarity of tree topologies via hand-engineered heuristic features, which requires domain expertise and may suffer from limited approximation capability.
Method: ARTree, a deep autoregressive model for phylogenetic inference based on graph neural networks (GNNs). By decomposing a tree topology into a sequence of leaf-node addition operations and modeling the involved conditional distributions with learnable topological features via GNNs, ARTree provides a rich family of distributions over tree topologies with simple sampling algorithms and no heuristic features.
Results: The method's effectiveness and efficiency are demonstrated on a benchmark of challenging real-data tree topology density estimation and variational Bayesian phylogenetic inference problems.

Designing flexible probabilistic models over tree topologies is important for developing efficient phylogenetic inference methods. To do that, previous works often leverage the similarity of tree topologies via hand-engineered heuristic features which would require domain expertise and may suffer from limited approximation capability. In this paper, we propose a deep autoregressive model for phylogenetic inference based on graph neural networks (GNNs), called ARTree. By decomposing a tree topology into a sequence of leaf node addition operations and modeling the involved conditional distributions based on learnable topological features via GNNs, ARTree can provide a rich family of distributions over tree topologies that have simple sampling algorithms, without using heuristic features. We demonstrate the effectiveness and efficiency of our method on a benchmark of challenging real data tree topology density estimation and variational Bayesian phylogenetic inference problems.

Newton–Cotes Graph Neural Networks: On the Time Evolution of Dynamic Systems
Lingbing Guo Weiqing Wang Zhuo Chen Ningyu Zhang Zequn Sun Yixuan Lai Qiang Zhang Huajun Chen



Research question: How to improve the accuracy of predictions of system dynamics.
Motivation: Existing graph-neural-network-based methods predict the future state of a system with a velocity integrand that is constant in time, limiting prediction accuracy.
Method: A new approach that predicts the integral from several velocity estimations using Newton-Cotes formulas, with a theoretical proof of its effectiveness.
Results: Extensive experiments on several benchmarks demonstrate consistent and significant improvements over state-of-the-art methods.

Reasoning about system dynamics is one of the most important analytical approaches for many scientific studies. With the initial state of a system as input, the recent graph neural networks (GNNs)-based methods are capable of predicting the future state distant in time with high accuracy. Although these methods have diverse designs in modeling the coordinates and interacting forces of the system, we show that they actually share a common paradigm that learns the integration of the velocity over the interval between the initial and terminal coordinates. However, their integrand is constant w.r.t. time. Inspired by this observation, we propose a new approach to predict the integration based on several velocity estimations with Newton–Cotes formulas and prove its effectiveness theoretically. Extensive experiments on several benchmarks empirically demonstrate consistent and significant improvement compared with the state-of-the-art methods.
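
The numerical point in miniature: integrating a time-varying velocity with a single (constant) estimate versus a Newton-Cotes rule over several estimates. In the paper the intermediate velocities come from the GNN; here they are plain function evaluations of a toy velocity, which is an assumption of this sketch.

```python
import numpy as np

# Toy velocity and its exact displacement over [0, T].
v = lambda t: np.cos(t) + 0.5 * t
T = 2.0
exact = np.sin(T) + 0.25 * T**2          # integral of v from 0 to T

# Constant-integrand baseline: one velocity estimate at t = 0.
one_step = T * v(0.0)

# Simpson's rule, the 3-point closed Newton-Cotes formula.
simpson = T / 6 * (v(0.0) + 4 * v(T / 2) + v(T))

print("exact    :", exact)
print("one-step :", one_step, " error", abs(one_step - exact))
print("Simpson  :", simpson, " error", abs(simpson - exact))
# Simpson's error is roughly an order of magnitude smaller here.
```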

Score-based Generative Modeling through Stochastic Evolution Equations in Hilbert Spaces
Sungbin Lim Eunbi Yoon Taehyun Byun Taewon Kang Seungwoo Kim Kyungjae Lee Sungjoon Choi



Research question: Building a bridge between probabilistic diffusion models and stochastic evolution equations in Hilbert spaces.
Motivation: Moving stochastic differential equations to Hilbert spaces expands their applicability in two respects, sample space and evolution operator, so that they encompass recent variations of diffusion models such as generating functional data or replacing drift coefficients with image transformations.
Method: A generalized time-reversal formula is derived to connect probabilistic diffusion models with stochastic evolution equations, yielding a score-based generative model called the Hilbert Diffusion Model (HDM). Combined with a Fourier neural operator, HDM's superiority for sampling functions from functional datasets is verified.
Results: Experiments show HDM's strength in motion synthesis tasks, exploiting a Hilbert-space-valued Wiener process. Empirical results on image datasets further validate a connection between HDM and diffusion models based on heat dissipation, revealing the potential of exploring evolution operators and sample spaces.

Continuous-time score-based generative models consist of a pair of stochastic differential equations (SDEs)—a forward SDE that smoothly transitions data into a noise space and a reverse SDE that incrementally eliminates noise from a Gaussian prior distribution to generate data distribution samples—which are intrinsically connected by the time-reversal theory on diffusion processes. In this paper, we investigate the use of stochastic evolution equations in Hilbert spaces, which expand the applicability of SDEs in two aspects: sample space and evolution operator, so they enable encompassing recent variations of diffusion models, such as generating functional data or replacing drift coefficients with image transformation. To this end, we derive a generalized time-reversal formula to build a bridge between probabilistic diffusion models and stochastic evolution equations and propose a score-based generative model called Hilbert Diffusion Model (HDM). Combining with Fourier neural operator, we verify the superiority of HDM for sampling functions from functional datasets with a power of kernel two-sample test of 4.2 on Quadratic, 0.2 on Melbourne, and 3.6 on Gridwatch, which outperforms existing diffusion models formulated in function spaces. Furthermore, the proposed method shows its strength in motion synthesis tasks by utilizing the Wiener process with values in Hilbert space. Finally, our empirical results on image datasets also validate a connection between HDM and diffusion models using heat dissipation, revealing the potential for exploring evolution operators and sample spaces.

Unpaired Multi-Domain Causal Representation Learning
Nils Sturma Chandler Squires Mathias Drton Caroline Uhler



Research question: Finding a representation of data that consists of causally related latent variables.
Motivation: Data are available from multiple domains that potentially share a causal representation, but observations across domains are unpaired: only the marginal distribution of each domain is observed, not their joint distribution.
Method: Sufficient conditions are given for identifiability of the joint distribution and the shared causal graph in a linear setup; identifiability holds if the joint distribution and the shared causal representation can be uniquely recovered from the marginal distributions of each domain.
Results: The identifiability results translate into a practical method that recovers the shared latent causal graph from unpaired multi-domain data.

The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our results into a practical method to recover the shared latent causal graph.

Compression with Bayesian Implicit Neural Representations
Zongyu Guo Gergely Flamich Jiajun He Zhibo Chen José Miguel Hernández-Lobato



Research question: How to compress data effectively while preserving high-quality reconstruction.
Motivation: Current neural-representation compression methods substantially degrade reconstruction quality when weights are quantized to low-bit precision.
Method: Overfit variational Bayesian neural networks to the data and compress an approximate posterior weight sample with relative entropy coding, instead of quantizing and entropy-coding the weights.
Results: The method achieves strong performance on image and audio compression while retaining simplicity.

Many common types of data can be represented as functions that map coordinates to signal values, such as pixel locations to RGB values in the case of an image. Based on this view, data can be compressed by overfitting a compact neural network to its functional representation and then encoding the network weights. However, most current solutions for this are inefficient, as quantization to low-bit precision substantially degrades the reconstruction quality. To address this issue, we propose overfitting variational Bayesian neural networks to the data and compressing an approximate posterior weight sample using relative entropy coding instead of quantizing and entropy coding it. This strategy enables direct optimization of the rate-distortion performance by minimizing the $\beta$-ELBO, and target different rate-distortion trade-offs for a given network architecture by adjusting $\beta$. Moreover, we introduce an iterative algorithm for learning prior weight distributions and employ a progressive refinement process for the variational posterior that significantly enhances performance. Experiments show that our method achieves strong performance on image and audio compression while retaining simplicity.

Fast Optimal Transport through Sliced Generalized Wasserstein Geodesics
Guillaume Mahey Laetitia Chapel Gilles Gasso Clément Bonet Nicolas Courty



Research question: A new proxy for the squared Wasserstein distance, called min-SWGG, based on the transport map induced by an optimal one-dimensional projection of the two input distributions.
Motivation: The Wasserstein distance and the associated optimal transport plan have proven useful in many applications involving probability measures.
Method: The transport map between the two input distributions is induced by their optimal one-dimensional projection, yielding min-SWGG, together with a fast computational scheme amenable to gradient-descent optimization.
Results: Empirical evidence supports the benefits of min-SWGG in a variety of contexts, from gradient flows to shape matching and image colorization.

Wasserstein distance (WD) and the associated optimal transport plan have been proven useful in many applications where probability measures are at stake. In this paper, we propose a new proxy of the squared WD, coined $\textnormal{min-SWGG}$, that is based on the transport map induced by an optimal one-dimensional projection of the two input distributions. We draw connections between $\textnormal{min-SWGG}$ and Wasserstein generalized geodesics in which the pivot measure is supported on a line. We notably provide a new closed form for the exact Wasserstein distance in the particular case of one of the distributions supported on a line allowing us to derive a fast computational scheme that is amenable to gradient descent optimization. We show that $\textnormal{min-SWGG}$ is an upper bound of WD and that it has a complexity similar to that of Sliced-Wasserstein, with the additional feature of providing an associated transport plan. We also investigate some theoretical properties such as metricity, weak convergence, computational and topological properties. Empirical evidence supports the benefits of $\textnormal{min-SWGG}$ in various contexts, from gradient flows to shape matching and image colorization, among others.
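
The core construction is short enough to sketch: project both point clouds onto a direction, sort the projections, and evaluate the induced one-to-one coupling in the original space; min-SWGG takes the best direction. Random search over directions below is a simplification of the paper's optimization scheme, and equal-size uniform clouds are an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def swgg_cost(X, Y, theta):
    """Squared-Wasserstein surrogate from one direction theta: project, sort,
    and evaluate the induced transport map in the original space."""
    sx, sy = np.argsort(X @ theta), np.argsort(Y @ theta)
    return ((X[sx] - Y[sy]) ** 2).sum(axis=1).mean()

def min_swgg(X, Y, n_dirs=500):
    """min-SWGG by random search over unit directions (the paper also derives
    a scheme amenable to gradient-descent optimization of the direction)."""
    thetas = rng.standard_normal((n_dirs, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    return min(swgg_cost(X, Y, th) for th in thetas)

# Two equal-size point clouds in R^2, shifted by (3, 0).
X = rng.standard_normal((256, 2))
Y = rng.standard_normal((256, 2)) + np.array([3.0, 0.0])
print("min-SWGG (upper bound on squared W2):", min_swgg(X, Y))  # about 9
```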

Safety Verification of Decision-Tree Policies in Continuous Time
Christian Schilling Anna Lukina Emir Demirović Kim Guldstrand Larsen



Research question: How to provide safety guarantees for systems controlled by decision trees.
Motivation: Although decision trees are increasingly popular as interpretable surrogate models for learned control policies, providing safety guarantees for them remains an open challenge.
Method: The paper presents the first algorithm to directly verify decision-tree-controlled systems in continuous time; its core exploits the decision-tree structure to propagate a set-based approximation through the decision nodes.
Results: Effectiveness is demonstrated by verifying the safety of several decision trees distilled to imitate neural-network policies for nonlinear systems.

Decision trees have gained popularity as interpretable surrogate models for learning-based control policies. However, providing safety guarantees for systems controlled by decision trees is an open challenge. We show that the problem is undecidable even for systems with the simplest dynamics, and PSPACE-complete for finite-horizon properties. The latter can be verified for discrete-time systems via bounded model checking. However, for continuous-time systems, such an approach requires discretization, thereby weakening the guarantees for the original system. This paper presents the first algorithm to directly verify decision-tree-controlled systems in continuous time. The key aspect of our method is exploiting the decision-tree structure to propagate a set-based approximation through the decision nodes. We demonstrate the effectiveness of our approach by verifying safety of several decision trees distilled to imitate neural-network policies for nonlinear systems.
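
The set-propagation idea in miniature: represent the reachable states as a box of per-dimension intervals and split it at each decision node, collecting every control action the set can trigger. This sketch omits the continuous-time dynamics the paper's verifier propagates between decisions; the tiny policy is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = List[Tuple[float, float]]   # per-dimension intervals

@dataclass
class Node:
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None     # taken when x[feature] <= threshold
    right: Optional["Node"] = None
    action: Optional[str] = None      # set on leaves only

def reachable_actions(node: Node, box: Box) -> set:
    """Propagate a box through the decision nodes, returning every action
    some state in the box can reach."""
    if node.action is not None:
        return {node.action}
    lo, hi = box[node.feature]
    out = set()
    if lo <= node.threshold:          # part of the box goes left
        left_box = list(box)
        left_box[node.feature] = (lo, min(hi, node.threshold))
        out |= reachable_actions(node.left, left_box)
    if hi > node.threshold:           # part of the box goes right
        right_box = list(box)
        right_box[node.feature] = (max(lo, node.threshold), hi)
        out |= reachable_actions(node.right, right_box)
    return out

# Toy policy: brake when position > 1.0, else choose on velocity sign.
tree = Node(0, 1.0,
            left=Node(1, 0.0, left=Node(action="accelerate"),
                      right=Node(action="coast")),
            right=Node(action="brake"))
print(reachable_actions(tree, [(0.8, 1.2), (-0.1, 0.3)]))
# {'accelerate', 'coast', 'brake'}: the box straddles all three branches.
```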

Squared Neural Families: A New Class of Tractable Density Models
Russell Tsuchida Cheng Soon Ong Dino Sejdinovic



Research question: Developing and studying a new class of probability distribution models, Squared Neural Families (SNEFY), formed by squaring the 2-norm of a neural network and normalising it with respect to a base measure.
Motivation: Flexible probability distribution models are an essential ingredient in many machine learning tasks, motivating this new neural-network-based family.
Method: Squaring the 2-norm of a neural network and normalising it yields the SNEFY model; in many cases of interest, the normalising constant is available in closed form, resulting in flexible yet fully tractable density models.
Results: SNEFY models strictly generalise classical exponential families and have practical applications in tasks such as conditional density estimation and density estimation with missing data.

Flexible models for probability distributions are an essential ingredient in many machine learning tasks. We develop and investigate a new class of probability distributions, which we call a Squared Neural Family (SNEFY), formed by squaring the 2-norm of a neural network and normalising it with respect to a base measure. Following the reasoning similar to the well established connections between infinitely wide neural networks and Gaussian processes, we show that SNEFYs admit closed form normalising constants in many cases of interest, thereby resulting in flexible yet fully tractable density models. SNEFYs strictly generalise classical exponential families, are closed under conditioning, and have tractable marginal distributions. Their utility is illustrated on a variety of density estimation, conditional density estimation, and density estimation with missing data tasks.

VaRT: Variational Regression Trees
Sebastian Salazar



Research question: This paper introduces a novel non-parametric Bayesian model that uses variational inference to approximate a posterior distribution over the space of stochastic decision trees.
Motivation: Decision trees are a well-established tool in machine learning for classification and regression tasks; this work gives them a new non-parametric Bayesian treatment.
Method: Variational inference is used to approximate the posterior over stochastic decision trees; the model is evaluated on 18 datasets, and its application to causal inference problems is explored.
Results: The model is competitive with other state-of-the-art methods on regression tasks, and a fully vectorized implementation of the algorithm is provided in PyTorch.

Decision trees are a well-established tool in machine learning for classification and regression tasks. In this paper, we introduce a novel non-parametric Bayesian model that uses variational inference to approximate a posterior distribution over the space of stochastic decision trees. We evaluate the model's performance on 18 datasets and demonstrate its competitiveness with other state-of-the-art methods in regression tasks. We also explore its application to causal inference problems. We provide a fully vectorized implementation of our algorithm in PyTorch.

Conditional independence testing under misspecified inductive biases
Felipe Maia Polo Yuekai Sun Moulinath Banerjee



Research question: Conditional independence (CI) testing, a fundamental and challenging task in modern statistics and machine learning.
Motivation: Many modern CI tests rely on powerful supervised learning methods to learn regression functions or Bayes predictors as an intermediate step (regression-based tests); their behavior when these methods fail due to misspecified inductive biases is poorly understood.
Method: The paper studies the performance of regression-based CI tests under misspecified inductive biases, proposing new approximations or upper bounds for the testing errors of three regression-based tests that depend on the misspecification error, and introduces the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI test robust to misspecified inductive biases.
Results: Experiments with artificial and real data showcase the usefulness of the theory and methods.

Conditional independence (CI) testing is a fundamental and challenging task in modern statistics and machine learning. Many modern methods for CI testing rely on powerful supervised learning methods to learn regression functions or Bayes predictors as an intermediate step; we refer to this class of tests as regression-based tests. Although these methods are guaranteed to control Type-I error when the supervised learning methods accurately estimate the regression functions or Bayes predictors of interest, their behavior is less understood when they fail due to misspecified inductive biases; in other words, when the employed models are not flexible enough or when the training algorithm does not induce the desired predictors. In this work, we study the performance of regression-based CI tests under misspecified inductive biases. Namely, we propose new approximations or upper bounds for the testing errors of three regression-based tests that depend on misspecification errors. Moreover, we introduce the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI test robust against misspecified inductive biases. Finally, we conduct experiments with artificial and real data, showcasing the usefulness of our theory and methods.

Learning Functional Transduction
Mathieu Chalvidal Thomas Serre Rufin VanRullen



Research question: How to perform regression analysis efficiently.
Motivation: Both existing approaches have drawbacks: transductive methods built directly on exemplar data can suffer from the curse of dimensionality, while inductive methods that fit potentially complex functions require compute-intensive solution searches.
Method: Leveraging the theory of vector-valued Reproducing Kernel Banach Spaces, a hybrid approach meta-learns transductive regression systems to form efficient in-context neural approximators.
Results: Once trained, the Transducer can almost instantaneously capture new functional relationships and produce original image estimates, applicable to physical systems and climate modeling, at a fraction of the usual deep learning training cost.

Research in statistical learning has polarized into two general approaches to perform regression analysis: Transductive methods construct estimates directly based on exemplar data using generic relational principles which might suffer from the curse of dimensionality. Conversely, inductive methods can potentially fit highly complex functions at the cost of compute-intensive solution searches. In this work, we leverage the theory of vector-valued Reproducing Kernel Banach Spaces (RKBS) to propose a hybrid approach: We show that transductive regression systems can be meta-learned with gradient descent to form efficient _in-context_ neural approximators of functions defined over both finite and infinite-dimensional spaces (operator regression). Once trained, our _Transducer_ can almost instantaneously capture new functional relationships and produce original image estimates, given a few pairs of input and output examples. We demonstrate the benefit of our meta-learned transductive approach to model physical systems influenced by varying external factors with little data at a fraction of the usual deep learning training costs for partial differential equations and climate modeling applications.

High-dimensional Asymptotics of Denoising Autoencoders
Hugo Cui Lenka Zdeborova



Research question: Denoising high-dimensional data with a two-layer non-linear autoencoder with tied weights and a skip connection.
Motivation: Addressing the denoising problem in the high-dimensional limit where the number of training samples and the input dimension jointly tend to infinity while the number of hidden units remains bounded.
Method: Analyze a two-layer non-linear autoencoder with tied weights and a skip connection, deriving closed-form expressions for the denoising mean-squared test error.
Results: The architecture outperforms the autoencoder without the skip connection, which relates closely to principal component analysis, and the results accurately capture the learning curves on a range of real datasets.

We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where the number of training samples and the input dimension jointly tend to infinity while the number of hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared test error. Building on this result, we quantitatively characterize the advantage of the considered architecture over the autoencoder without the skip connection that relates closely to principal component analysis. We further show that our results capture accurately the learning curves on a range of real datasets.

Kernelized Cumulants: Beyond Kernel Mean Embeddings
Patric Bonnier Harald Oberhauser Zoltán Szabó



Research question: How to extend cumulants to reproducing kernel Hilbert spaces (RKHS) and establish their computational tractability.
Motivation: Cumulants offer an alternative to moments, with benefits such as lower-variance estimators.
Method: Cumulants are extended to RKHSs using tools from tensor algebras, and their computational tractability is established via a kernel trick.
Results: Theory and experiments (on synthetic, environmental, and traffic data analysis) show that going beyond degree one has several advantages, achievable with the same computational complexity and minimal overhead.

In $\mathbb{R}^d$, it is well-known that cumulants provide an alternative to moments that can achieve the same goals with numerous benefits such as lower variance estimators. In this paper we extend cumulants to reproducing kernel Hilbert spaces (RKHS) using tools from tensor algebras and show that they are computationally tractable by a kernel trick. These kernelized cumulants provide a new set of all-purpose statistics; the classical maximum mean discrepancy and Hilbert-Schmidt independence criterion arise as the degree one objects in our general construction. We argue both theoretically and empirically (on synthetic, environmental, and traffic data analysis) that going beyond degree one has several advantages and can be achieved with the same computational complexity and minimal overhead in our experiments.
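
The degree-one object the abstract mentions is the classical MMD, computable from kernel matrices alone; the higher-degree kernelized cumulants are built from the same matrices via tensor-algebra combinations (not shown here). A minimal unbiased MMD estimate, with a Gaussian kernel and synthetic data as illustrative choices:

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared MMD with a Gaussian kernel: the
    degree-one statistic in the kernelized-cumulant hierarchy."""
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                            / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)       # drop diagonal terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
            - 2 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
X2 = rng.standard_normal((500, 2))
Y = rng.standard_normal((500, 2)) * 1.3    # same mean, different scale
print("same distribution :", mmd2_unbiased(X, X2))  # near 0
print("different scale   :", mmd2_unbiased(X, Y))   # clearly positive
```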

Let the Flows Tell: Solving Graph Combinatorial Problems with GFlowNets
Dinghuai Zhang Hanjun Dai Nikolay Malkin Aaron Courville Yoshua Bengio Ling Pan



Research question: Combinatorial optimization (CO) problems are often NP-hard and out of reach for exact algorithms, making them a natural domain for machine learning methods.
Motivation: The highly structured constraints in CO problems can hinder optimization or sampling directly in the solution space.
Method: Markov decision processes (MDPs) are designed for different combinatorial problems, and conditional GFlowNets are trained to sample from the solution space.
Results: Extensive experiments on various CO tasks with synthetic and realistic data demonstrate that GFlowNet policies can efficiently find high-quality solutions.

Combinatorial optimization (CO) problems are often NP-hard and thus out of reach for exact algorithms, making them a tempting domain to apply machine learning methods. The highly structured constraints in these problems can hinder either optimization or sampling directly in the solution space. On the other hand, GFlowNets have recently emerged as a powerful machinery to efficiently sample from composite unnormalized densities sequentially and have the potential to amortize such solution-searching processes in CO, as well as generate diverse solution candidates. In this paper, we design Markov decision processes (MDPs) for different combinatorial problems and propose to train conditional GFlowNets to sample from the solution space. Efficient training techniques are also developed to benefit long-range credit assignment. Through extensive experiments on a variety of different CO tasks with synthetic and realistic data, we demonstrate that GFlowNet policies can efficiently find high-quality solutions. Our implementation is open-sourced at https://github.com/zdhNarsil/GFlowNet-CombOpt.

Provably Fast Finite Particle Variants of SVGD via Virtual Particle Stochastic Approximation
Aniket Das Dheeraj Mysore Nagaraj



Research question: Addressing the limited understanding of the finite-particle behavior of the particle-based variational inference algorithm SVGD.
Motivation: Although the infinite-particle limit dynamics of SVGD is well characterized, its behavior in the finite-particle regime is far less understood. The notion of "virtual particles" is introduced to develop novel stochastic approximations of the population-limit SVGD dynamics in the space of probability measures that are exactly realizable with finite particles.
Method: Two computationally efficient variants of SVGD, VP-SVGD and GB-SVGD, with provably fast finite-particle convergence rates; both can be viewed as specific random-batch approximations of SVGD that are computationally more efficient than ordinary SVGD.
Results: The $n$ particles produced by VP-SVGD and GB-SVGD after $T$ steps with batch size $K$ are at least as good as i.i.d. samples from a distribution whose Kernel Stein Discrepancy to the target is at most $O(d^{1/3}/(KT)^{1/6})$ under standard assumptions. The results hold under a mild growth condition on the potential function, much weaker than the isoperimetric (e.g., Poincaré inequality) or information-transport (e.g., Talagrand's inequality $\mathsf{T}_1$) conditions considered in prior work. An analysis of the convergence of the empirical measure of the output particles to the target distribution shows a double-exponential improvement over the best known finite-particle analysis of SVGD.

Stein Variational Gradient Descent (SVGD) is a popular particle-based variational inference algorithm with impressive empirical performance across various domains. Although the population (i.e., infinite-particle) limit dynamics of SVGD is well characterized, its behavior in the finite-particle regime is far less understood. To this end, our work introduces the notion of *virtual particles* to develop novel stochastic approximations of population-limit SVGD dynamics in the space of probability measures that are exactly realizable using finite particles. As a result, we design two computationally efficient variants of SVGD, namely VP-SVGD and GB-SVGD, with provably fast finite-particle convergence rates. Our algorithms can be viewed as specific random-batch approximations of SVGD, which are computationally more efficient than ordinary SVGD. We show that the $n$ particles output by VP-SVGD and GB-SVGD, run for $T$ steps with batch-size $K$, are at least as good as i.i.d. samples from a distribution whose Kernel Stein Discrepancy to the target is at most $O(\tfrac{d^{1/3}}{(KT)^{1/6}})$ under standard assumptions. Our results also hold under a mild growth condition on the potential function, which is much weaker than the isoperimetric (e.g., the Poincaré inequality) or information-transport conditions (e.g., Talagrand's inequality $\mathsf{T}_1$) generally considered in prior works. As a corollary, we analyze the convergence of the empirical measure (of the particles output by VP-SVGD and GB-SVGD) to the target distribution and demonstrate a **double exponential improvement** over the best known finite-particle analysis of SVGD. Beyond this, our results present the **first known oracle complexities for this setting with polynomial dimension dependence**, thereby completely eliminating the curse of dimensionality exhibited by previously known finite-particle rates.
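
For reference, the vanilla SVGD update that the paper's variants approximate with random batches is a few lines. The RBF kernel with median-heuristic bandwidth and the Gaussian target are standard illustrative choices; VP-SVGD/GB-SVGD themselves are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def svgd_step(X, score, step=0.1):
    """One SVGD update with an RBF kernel (median-heuristic bandwidth).
    VP-SVGD/GB-SVGD would replace the full sum over particles by batches."""
    n, _ = X.shape
    diff = X[:, None, :] - X[None, :, :]
    r2 = (diff ** 2).sum(-1)
    h = np.median(r2) / np.log(n + 1) + 1e-12
    K = np.exp(-r2 / h)
    gradK = -2.0 / h * diff * K[:, :, None]     # grad of k w.r.t. first arg
    phi = (K @ score(X) + gradK.sum(axis=0)) / n
    return X + step * phi

# Target: N(mu, I) in 2-d, so the score is mu - x.
mu = np.array([2.0, -1.0])
score = lambda X: mu - X
X = rng.standard_normal((100, 2))
for _ in range(500):
    X = svgd_step(X, score)
print("particle mean:", X.mean(0))              # approaches [2, -1]
```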

Trans-Dimensional Generative Modeling via Jump Diffusion Models
Andrew Campbell William Harvey Christian Dietrich Weilbach Valentin De Bortoli Tom Rainforth Arnaud Doucet



Research question: A new class of generative model that naturally handles data of varying dimensionality.
Motivation: Current generative models struggle with data of varying dimensionality and must generate state values and dimensions separately.
Method: The state and dimension of each datapoint are modeled jointly, with the generative process formulated as a jump diffusion process that moves between spaces of different dimension. A dimension-destroying forward noising process is defined first, from which the dimension-creating time-reversed generative process is derived along with a novel evidence lower bound training objective for learning to approximate it.
Results: Simulating the learned approximation of the time-reversed generative process provides an effective way to generate data of varying dimensionality by jointly generating state values and dimensions. Experiments on molecular and video datasets report better compatibility with test-time diffusion-guidance imputation tasks and improved interpolation over fixed-dimension models that generate state values and dimensions separately.

We propose a new class of generative model that naturally handles data of varying dimensionality by jointly modeling the state and dimension of each datapoint. The generative process is formulated as a jump diffusion process that makes jumps between different dimensional spaces. We first define a dimension destroying forward noising process, before deriving the dimension creating time-reversed generative process along with a novel evidence lower bound training objective for learning to approximate it. Simulating our learned approximation to the time-reversed generative process then provides an effective way of sampling data of varying dimensionality by jointly generating state values and dimensions. We demonstrate our approach on molecular and video datasets of varying dimensionality, reporting better compatibility with test-time diffusion guidance imputation tasks and improved interpolation capabilities versus fixed dimensional models that generate state values and dimensions separately.

Explaining the Uncertain: Stochastic Shapley Values for Gaussian Process Models
Siu Lun Chau Krikamol Muandet Dino Sejdinovic



Research question: A new approach for explaining Gaussian processes (GPs) that can exploit the full analytical covariance structure present in GPs.
Motivation: Existing GP explanation methods cannot make full use of this covariance structure, limiting the validity and accuracy of their explanations.
Method: The approach extends the popular solution concept of Shapley values to stochastic cooperative games, producing explanations that are random variables. The resulting GP explanations satisfy axioms analogous to those of standard Shapley values and possess a tractable covariance function across features and data observations, which allows quantifying explanation uncertainty and studying statistical dependencies between explanations.
Results: Extensive illustrations demonstrate the effectiveness of the approach for explaining GPs and for predicting Shapley values for new data from previously computed ones.

We present a novel approach for explaining Gaussian processes (GPs) that can utilize the full analytical covariance structure present in GPs. Our method is based on the popular solution concept of Shapley values extended to stochastic cooperative games, resulting in explanations that are random variables. The GP explanations generated using our approach satisfy similar favorable axioms to standard Shapley values and possess a tractable covariance function across features and data observations. This covariance allows for quantifying explanation uncertainties and studying the statistical dependencies between explanations. We further extend our framework to the problem of predictive explanation, and propose a Shapley prior over the explanation function to predict Shapley values for new data based on previously computed ones. Our extensive illustrations demonstrate the effectiveness of the proposed approach.

Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics
Leon Klein Andrew Y. K. Foong Tor Erlend Fjelde Bruno Kacper Mlodozeniec Marc Brockschmidt Sebastian Nowozin Frank Noe Ryota Tomioka



Research question: how to efficiently simulate long-timescale molecular processes such as binding and folding?
Motivation: conventional molecular dynamics (MD) simulation cannot efficiently sample such long-timescale processes, and a new simulation must be run for every molecular system studied.
Method: propose Timewarp, an enhanced sampling method that uses a normalising flow as the proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution; the flow is trained offline on MD trajectories and learns to take large steps in time.
Results: Timewarp is transferable: once trained, it generalises to unseen small peptides (2-4 amino acids) and explores their metastable states faster than standard MD, an important step towards general, transferable algorithms for accelerating MD.

*Molecular dynamics* (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed for each molecular system studied. We present *Timewarp*, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6} \textrm{fs}$. Crucially, Timewarp is *transferable* between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids) at all-atom resolution, exploring their metastable states and providing wall-clock acceleration of sampling compared to standard MD. Our method constitutes an important step towards general, transferable algorithms for accelerating MD.
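At its core, the MCMC wrapper is a Metropolis-Hastings step whose proposal sampler and proposal density both come from the trained flow. The sketch below shows that generic skeleton under assumed callables `sample_q` and `log_q` (a flow supplies both exactly via the change-of-variables formula); `log_target` would be the temperature-scaled negative potential defining the Boltzmann distribution.

```python
import numpy as np

def mh_flow_step(x, log_target, sample_q, log_q, rng):
    """One Metropolis-Hastings step with a learned proposal (minimal sketch).
    sample_q(x) draws x' ~ q(. | x); log_q(x_new, x_cond) = log q(x_new | x_cond)."""
    x_prop = sample_q(x)
    log_accept = (log_target(x_prop) - log_target(x)
                  + log_q(x, x_prop) - log_q(x_prop, x))
    if np.log(rng.uniform()) < log_accept:
        return x_prop, True   # accepted: a large jump in simulated time
    return x, False           # rejected: stay put, preserving the target
```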

Statistical Guarantees for Variational Autoencoders using PAC-Bayesian Theory
Sokhna Diarra Mbacke Florence Clerc Pascal Germain



Research question: use PAC-Bayesian theory to provide statistical guarantees for variational autoencoders (VAEs).
Motivation: despite the widespread use of VAEs in machine learning, many questions regarding their theoretical properties remain open.
Method: using PAC-Bayesian theory, first derive the first PAC-Bayesian bound for posterior distributions conditioned on individual samples from the data-generating distribution; then use this result to obtain generalization guarantees for the VAE's reconstruction loss and upper bounds on the distance between the input and regenerated distributions, and, more importantly, on the Wasserstein distance between the input distribution and the distribution defined by the VAE's generative model.
Results: the analysis delivers effective theoretical guarantees for VAEs.

Since their inception, Variational Autoencoders (VAEs) have become central in machine learning. Despite their widespread use, numerous questions regarding their theoretical properties remain open. Using PAC-Bayesian theory, this work develops statistical guarantees for VAEs. First, we derive the first PAC-Bayesian bound for posterior distributions conditioned on individual samples from the data-generating distribution. Then, we utilize this result to develop generalization guarantees for the VAE's reconstruction loss, as well as upper bounds on the distance between the input and the regenerated distributions. More importantly, we provide upper bounds on the Wasserstein distance between the input distribution and the distribution defined by the VAE's generative model.

The Geometry of Neural Nets' Parameter Spaces Under Reparametrization
Agustinus Kristiadi Felix Dangel Philipp Hennig



Research question: model reparametrization is a popular way to improve neural-network training, but it can induce inconsistencies in Hessian-based flatness measures, optimization trajectories, and modes of probability densities.
Motivation: study the invariance of neural networks under reparametrization from the perspective of Riemannian geometry.
Method: show that any neural network is invariant if the metric is represented explicitly and the correct associated transformation rules are used.
Results: discuss the implications of invariance for measuring the flatness of minima, for optimization, and for probability-density maximization, and explore some useful directions where invariance helps.

Model reparametrization, which follows the change-of-variable rule of calculus, is a popular way to improve the training of neural nets. But it can also be problematic since it can induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability densities. This complicates downstream analyses: e.g. one cannot definitively relate flatness with generalization since arbitrary reparametrization changes their relationship. In this work, we study the invariance of neural nets under reparametrization from the perspective of Riemannian geometry. From this point of view, invariance is an inherent property of any neural net _if_ one explicitly represents the metric and uses the correct associated transformation rules. This is important since, although the metric is always present, it is often implicitly assumed to be the identity, dropped from the notation, and then lost under reparametrization. We discuss implications for measuring the flatness of minima, optimization, and for probability-density maximization. Finally, we explore some interesting directions where invariance is useful.
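As a worked example of the transformation rules the abstract refers to (standard Riemannian bookkeeping in our notation, not necessarily the paper's): under a reparametrization $\psi = \varphi(\theta)$ with Jacobian $J = \partial \varphi / \partial \theta$, the metric and the gradient of a loss $L$ transform as
$$M_\psi = J^{-\top} M_\theta\, J^{-1}, \qquad \nabla_\psi L = J^{-\top} \nabla_\theta L,$$
so any metric-aware quantity built from the pair, e.g. the natural-gradient norm, is invariant:
$$\nabla_\psi L^{\top} M_\psi^{-1} \nabla_\psi L = \nabla_\theta L^{\top} M_\theta^{-1} \nabla_\theta L.$$
Treating $M$ as the (dropped) identity is exactly what breaks this invariance under reparametrization.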

Online PCA in Converging Self-consistent Field Equations
Xihan Li Xiang Chen Rasul Tutunov Haitham Bou Ammar Lei Wang Jun Wang



Research question: address the non-convergence of self-consistent field (SCF) equations, a class of nonlinear eigenvalue problems.
Motivation: traditional fixed-point iteration methods for such problems suffer from non-convergence issues, while SCF equations are of great significance in computational science.
Method: view the SCF equation as principal component analysis (PCA) for a non-stationary time series, in which a distribution and its own top principal components are updated online so that the model gradually approaches an equilibrium state.
Results: online PCA techniques enhance the model's convergence towards the equilibrium state; experiments on synthetic and real electronic-structure scenarios demonstrate high convergence capacity.

The Self-consistent Field (SCF) equation is a type of nonlinear eigenvalue problem in which the matrix to be eigen-decomposed is a function of its own eigenvectors. It is of great significance in computational science for its connection to the Schrödinger equation. Traditional fixed-point iteration methods for solving such equations suffer from non-convergence issues. In this work, we present a novel perspective on such SCF equations as a principal component analysis (PCA) for non-stationary time series, in which a distribution and its own top principal components are mutually updated over time, and the equilibrium state of the model corresponds to the solution of the SCF equations. Under this new perspective, online PCA techniques can be brought to bear to enhance the convergence of the model towards the equilibrium state, acting as a new set of tools for converging the SCF equations. With several numerical adaptations, we then develop a new algorithm for converging the SCF equation, and demonstrate its high convergence capacity with experiments on both synthesized and real electronic structure scenarios.
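To make the fixed-point-as-PCA viewpoint concrete, here is a toy sketch (not the paper's algorithm): an Oja-style online PCA update tracks the leading eigenvector of a matrix that depends on that very eigenvector. `H` and the toy problem are illustrative.

```python
import numpy as np

def scf_via_online_pca(H, v0, steps=5000, lr=0.05):
    """Track the leading eigenvector of H(v), a matrix that depends on its own
    eigenvector, with an Oja-style update; the fixed point solves H(v) v = lam v."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(steps):
        w = H(v) @ v                    # the matrix drifts with the estimate
        v = v + lr * (w - (v @ w) * v)  # Oja update: keep only tangential motion
        v = v / np.linalg.norm(v)
    return v

# Toy SCF: H(v) = A + 0.5 * v v^T, symmetric and eigenvector-dependent.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2
v = scf_via_online_pca(lambda u: A + 0.5 * np.outer(u, u), rng.standard_normal(6))
```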

On Slicing Optimality for Mutual Information
Ammar Fayad Majd Ibrahim



Research question: measuring the dependence between two random variables in complex, high-dimensional settings is highly important but computationally difficult.
Motivation: current slicing methods can measure mutual information (MI) between high-dimensional variables, but because they use a uniform distribution of slicing directions they typically discard informative features between the variables, leading to inaccurate quantification of dependence.
Method: propose a principled framework that searches for an optimal slicing distribution for MI, comprising theoretical analysis and a practical algorithm connected to modern machine-learning frameworks.
Results: comprehensive experiments in benchmark domains show that the proposed information measure improves significantly over state-of-the-art baselines.

Measuring dependence between two random variables is of great importance in various domains but is difficult to compute in today's complex environments with high-dimensional data. Recently, slicing methods have been shown to be a scalable approach to measuring mutual information (MI) between high-dimensional variables by projecting these variables into one-dimensional spaces. Unfortunately, these methods use uniform distributions of slicing directions, which generally discard informative features between variables and thus lead to inaccurate quantification of dependence. In this paper, we propose a principled framework that searches for an \textit{optimal} distribution of slices for MI. Importantly, we answer theoretical questions about finding the optimal slicing distribution in the context of MI and develop corresponding theoretical analyses. We also develop a practical algorithm, connecting our theoretical results with modern machine learning frameworks. Through comprehensive experiments in benchmark domains, we demonstrate significant gains of our information measure over state-of-the-art baselines.
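For reference, the uniform-slicing baseline the paper improves upon can be sketched in a few lines. The `weights` argument stands in for a learned, non-uniform slicing distribution, and the histogram MI estimator is a crude stand-in for whatever 1-D estimator one prefers.

```python
import numpy as np

def mi_1d(u, v, bins=16):
    # Plug-in MI estimate for two 1-D samples via a joint histogram.
    joint, _, _ = np.histogram2d(u, v, bins=bins)
    p = joint / joint.sum()
    pu, pv = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (pu @ pv)[nz])))

def sliced_mi(X, Y, n_slices=200, weights=None, seed=0):
    """Average 1-D MI over random slicing directions. With weights=None this is
    the uniform-slicing baseline; a learned slicing distribution would replace
    the uniform average with a weighted (or optimized) one."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_slices):
        a = rng.standard_normal(X.shape[1]); a /= np.linalg.norm(a)
        b = rng.standard_normal(Y.shape[1]); b /= np.linalg.norm(b)
        vals.append(mi_1d(X @ a, Y @ b))
    return float(np.average(np.asarray(vals), weights=weights))
```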

Conditional Matrix Flows for Gaussian Graphical Models
Marcello Massimo Negri Fabricio Arend Torres Volker Roth



Research question: how to study conditional independence among many variables from few observations.
Motivation: Gaussian Graphical Models (GGMs) encourage sparsity in the precision matrix through $l_q$ regularization, but most GGMs rely on the $l_1$ norm because the objective is highly non-convex for sub-$l_1$ pseudo-norms.
Method: propose a general framework for matrix-variate normalizing-flow variational inference in GGMs that unifies the benefits of the frequentist and Bayesian frameworks; as a key improvement over previous work, a single flow is trained for a continuum of sparse regression models jointly over all regularization parameters $\lambda$ and all $l_q$ norms.
Results: within one model we have access to (i) the evolution of the posterior for any $\lambda$ and any $l_q$ (pseudo-)norm, (ii) the marginal log-likelihood for model selection, and (iii) the frequentist solution paths through simulated annealing in the MAP limit.

Studying conditional independence among many variables with few observations is a challenging task. Gaussian Graphical Models (GGMs) tackle this problem by encouraging sparsity in the precision matrix through $l_q$ regularization with $q\leq1$. However, most GGMs rely on the $l_1$ norm because the objective is highly non-convex for sub-$l_1$ pseudo-norms. In the frequentist formulation, the $l_1$ norm relaxation provides the solution path as a function of the shrinkage parameter $\lambda$. In the Bayesian formulation, sparsity is instead encouraged through a Laplace prior, but posterior inference for different $\lambda$ requires repeated runs of expensive Gibbs samplers. Here we propose a general framework for variational inference with matrix-variate Normalizing Flow in GGMs, which unifies the benefits of frequentist and Bayesian frameworks. As a key improvement on previous work, we train with one flow a continuum of sparse regression models jointly for all regularization parameters $\lambda$ and all $l_q$ norms, including non-convex sub-$l_1$ pseudo-norms. Within one model we thus have access to (i) the evolution of the posterior for any $\lambda$ and any $l_q$ (pseudo-) norm, (ii) the marginal log-likelihood for model selection, and (iii) the frequentist solution paths through simulated annealing in the MAP limit.
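For reference, the objective being amortized is the standard $l_q$-penalized Gaussian log-likelihood (our paraphrase of the setup, with $S$ the sample covariance and $\Theta$ the precision matrix):
$$\hat{\Theta}(\lambda, q) = \operatorname*{arg\,min}_{\Theta \succ 0} \; \operatorname{tr}(S\Theta) - \log\det\Theta + \lambda \sum_{i \neq j} |\Theta_{ij}|^{q}, \qquad 0 < q \leq 1.$$
Rather than re-solving this problem (or re-running a Gibbs sampler) for each $(\lambda, q)$, the conditional flow makes the corresponding posterior available jointly over the whole continuum.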

Multiply Robust Federated Estimation of Targeted Average Treatment Effects
Larry Han Zhu Shen Jose R Zubizarreta



Research question: how to draw valid causal inferences about a target population from multi-site data while protecting individual data privacy.
Motivation: federated or multi-site studies are more generalizable than single-site studies and allow studying underrepresented populations as well as rare exposures and outcomes; however, they must address privacy of individual-level data, heterogeneity in covariate distributions, and different data structures across sites.
Method: propose a novel federated approach that adjusts for covariate shift and accommodates covariate mismatch between sites by developing a multiply-robust and privacy-preserving nuisance-function estimation method, yielding valid causal inference for the target population from multi-site data.
Results: the method shows finite-sample advantages in efficiency and robustness over existing state-of-the-art approaches, and is applied to study the treatment effect of percutaneous coronary intervention (PCI) on the duration of hospitalization for patients experiencing acute myocardial infarction (AMI), using data from the Centers for Medicare & Medicaid Services (CMS).

Federated or multi-site studies have distinct advantages over single-site studies, including increased generalizability, the ability to study underrepresented populations, and the opportunity to study rare exposures and outcomes. However, these studies are complicated by the need to preserve the privacy of each individual's data, heterogeneity in their covariate distributions, and different data structures between sites. We propose a novel federated approach to derive valid causal inferences for a target population using multi-site data. We adjust for covariate shift and accommodate covariate mismatch between sites by developing a multiply-robust and privacy-preserving nuisance function estimation approach. Our methodology incorporates transfer learning to estimate ensemble weights to combine information from source sites. We show that these learned weights are efficient and optimal under different scenarios. We showcase the finite sample advantages of our approach in terms of efficiency and robustness compared to existing state-of-the-art approaches. We apply our approach to study the treatment effect of percutaneous coronary intervention (PCI) on the duration of hospitalization for patients experiencing acute myocardial infarction (AMI) with data from the Centers for Medicare \& Medicaid Services (CMS).

Improving *day-ahead* Solar Irradiance Time Series Forecasting by Leveraging Spatio-Temporal Context
Oussama Boussif Ghait Boukachab Dan Assouline Stefano Massaroli Tianle Yuan Loubna Benabbou Yoshua Bengio



Research question: how to achieve accurate day-ahead solar irradiance time-series forecasting by leveraging spatio-temporal context.
Motivation: the inherent variability of solar irradiance hinders integrating solar power into the electrical grid, and most prior work relies on purely time-series methods that neglect cloud cover and the surrounding physical context.
Method: propose a deep learning architecture that harnesses spatio-temporal context from satellite data to forecast Global Horizontal Irradiance (GHI) at any given station, together with a methodology to extract a distribution for each time-step prediction as an uncertainty measure and a testing scheme that separates particularly difficult examples (days with varying cloudy conditions) from easy ones.
Results: the approach exhibits robust forecasting performance, including zero-shot generalization tests at unobserved solar stations, and a new multi-modal dataset of satellite imagery and time series from multiple geographically diverse solar stations is presented.

Solar power harbors immense potential in mitigating climate change by substantially reducing CO$_{2}$ emissions. Nonetheless, the inherent variability of solar irradiance poses a significant challenge for seamlessly integrating solar power into the electrical grid. While the majority of prior research has centered on employing purely time series-based methodologies for solar forecasting, only a limited number of studies have taken into account factors such as cloud cover or the surrounding physical context. In this paper, we put forth a deep learning architecture designed to harness spatio-temporal context using satellite data, to attain highly accurate day-ahead time-series forecasting for any given station, with a particular emphasis on forecasting Global Horizontal Irradiance (GHI). We also suggest a methodology to extract a distribution for each time step prediction, which can serve as a very valuable measure of uncertainty attached to the forecast. When evaluating models, we propose a testing scheme in which we separate particularly difficult examples from easy ones, in order to capture the model performances in crucial situations, which in the case of this study are the days suffering from varying cloudy conditions. Furthermore, we present a new multi-modal dataset gathering satellite imagery over a large zone and time series for solar irradiance and other related physical variables from multiple geographically diverse solar stations. Our approach exhibits robust performance in solar irradiance forecasting, including zero-shot generalization tests at unobserved solar stations, and holds great promise in promoting the effective integration of solar power into the grid.

Hierarchical VAEs provide a normative account of motion processing in the primate brain
Hadi Vafaii Jacob L. Yates Daniel Butts



Research question: evaluate the role of hierarchical inference in motion perception and its alignment with brain function.
Motivation: the relationship between perception and inference postulated in the 19th century (by Helmholtz) is paralleled in modern machine learning by generative models such as variational autoencoders (VAEs) and their hierarchical variants.
Method: introduce Retinal Optic Flow Learning (ROFL), a novel synthetic data framework with control over motion statistics and their causes; present a new hierarchical VAE and test it on two downstream tasks: (i) predicting the ground-truth causes of retinal optic flow (e.g., self-motion) and (ii) predicting the responses of neurons in the primate motion-processing pathway, while manipulating the model architecture (hierarchical versus non-hierarchical), the loss function, and the causal structure of the motion stimuli.
Results: hierarchical latent structure yields several improvements: it improves the linear decodability of ground-truth variables in a sparse and disentangled manner, and the hierarchical VAE outperforms previous state-of-the-art models in predicting neuronal responses while exhibiting sparse latent-to-neuron relationships; these results depend on the causal structure of the world, indicating that alignment between brains and artificial neural networks depends not only on architecture but also on matching ecologically relevant stimulus statistics, and suggesting that hierarchical Bayesian inference underlies the brain's understanding of the world and that hierarchical VAEs can effectively model this understanding.

The relationship between perception and inference, as postulated by Helmholtz in the 19th century, is paralleled in modern machine learning by generative models like Variational Autoencoders (VAEs) and their hierarchical variants. Here, we evaluate the role of hierarchical inference and its alignment with brain function in the domain of motion perception. We first introduce a novel synthetic data framework, Retinal Optic Flow Learning (ROFL), which enables control over motion statistics and their causes. We then present a new hierarchical VAE and test it against alternative models on two downstream tasks: (i) predicting ground truth causes of retinal optic flow (e.g., self-motion); and (ii) predicting the responses of neurons in the motion processing pathway of primates. We manipulate the model architectures (hierarchical versus non-hierarchical), loss functions, and the causal structure of the motion stimuli. We find that hierarchical latent structure in the model leads to several improvements. First, it improves the linear decodability of ground truth variables and does so in a sparse and disentangled manner. Second, our hierarchical VAE outperforms previous state-of-the-art models in predicting neuronal responses and exhibits sparse latent-to-neuron relationships. These results depend on the causal structure of the world, indicating that alignment between brains and artificial neural networks depends not only on architecture but also on matching ecologically relevant stimulus statistics. Taken together, our results suggest that hierarchical Bayesian inference underlies the brain's understanding of the world, and hierarchical VAEs can effectively model this understanding.

Optimal testing using combined test statistics across independent studies
Lasse Vuursteen Botond Szabo Aad van der Vaart Harry van Zanten



Research question: develop a theoretical understanding of combining test statistics in meta-analysis, especially in high-dimensional models with composite hypothesis tests.
Motivation: although combining test statistics from independent trials or experiments is a popular meta-analysis method, its theoretical understanding is limited, particularly in high-dimensional models with composite hypotheses.
Method: in the many normal means model, introduce a natural and mild restriction on the meta-level combination functions of the local trials, and derive minimax lower and matching upper bounds for standard combination methods (e.g., p-values and e-values), quantifying the loss relative to using the full, pooled data.
Results: an "elbow effect" is identified, whereby in certain cases combining the locally optimal tests in each trial yields a sub-optimal meta-analysis method; the possible gains from allowing limited coordination between trial designs are also explored, connecting meta-analysis with bandwidth-constrained distributed inference and building on recent information-theoretic developments in that field.

Combining test statistics from independent trials or experiments is a popular method of meta-analysis. However, there is very limited theoretical understanding of the power of the combined test, especially in high-dimensional models considering composite hypotheses tests. We derive a mathematical framework to study standard meta-analysis testing approaches in the context of the many normal means model, which serves as the platform to investigate more complex models. We introduce a natural and mild restriction on the meta-level combination functions of the local trials. This allows us to mathematically quantify the cost of compressing $m$ trials into real-valued test statistics and combining these. We then derive minimax lower and matching upper bounds for the separation rates of standard combination methods, e.g., for p-values and e-values, quantifying the loss relative to using the full, pooled data. We observe an elbow effect, revealing that in certain cases combining the locally optimal tests in each trial results in a sub-optimal meta-analysis method, and we develop approaches to achieve the global optima. We also explore the possible gains of allowing limited coordination between the trial designs. Our results connect meta-analysis with bandwidth-constrained distributed inference and build on recent information theoretic developments in the latter field.
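Two textbook combination rules of the kind analyzed here, sketched for concreteness (these are standard facts, not the paper's new results): Fisher's method for p-values and the product rule for independent e-values.

```python
import numpy as np
from scipy import stats

def fisher_combine(p_values):
    # Fisher's method: -2 * sum(log p_i) ~ chi^2 with 2m df under the global null.
    stat = -2.0 * np.sum(np.log(np.asarray(p_values)))
    return stats.chi2.sf(stat, df=2 * len(p_values))

def e_combine(e_values):
    # The product of independent e-values is again an e-value; by Markov's
    # inequality, rejecting when it exceeds 1/alpha gives a level-alpha test.
    return float(np.prod(e_values))
```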

Fair Adaptive Experiments
Waverly Wei Xinwei Ma Jingshen Wang



Research question: how to assess the effectiveness of a treatment, policy, or intervention via randomized experiments while addressing both fairness and efficiency.
Motivation: classical complete randomization can use data inefficiently; adaptive experiments improve data use and estimation efficiency by learning and updating treatment-assignment probabilities during the experiment, but they can raise fairness and equity concerns.
Method: propose a fair adaptive experiment strategy that simultaneously enhances data-use efficiency, achieves an "envy-free" treatment-assignment guarantee, and improves the overall welfare of participants; the strategy imposes no parametric modeling assumptions on the outcome variables, making it more versatile and broadly applicable.
Results: theory shows that the adaptive treatment-assignment algorithm, despite lacking a closed-form expression, approaches the optimal allocation rule asymptotically; simulation evidence and two synthetic data studies further demonstrate the performance of the fair adaptive experiment strategy.

Randomized experiments have been the gold standard for assessing the effectiveness of a treatment, policy, or intervention, spanning various fields, including social sciences, biomedical studies, and e-commerce. The classical complete randomization approach assigns treatments based on a pre-specified probability and may lead to inefficient use of data. Adaptive experiments improve upon complete randomization by sequentially learning and updating treatment assignment probabilities using accrued evidence during the experiment. Hence, they can help achieve efficient data use and higher estimation efficiency. However, their application can also raise fairness and equity concerns, as assignment probabilities may vary drastically across groups of participants. Furthermore, when treatment is expected to be extremely beneficial to certain groups of participants, it is more appropriate to expose many of these participants to favorable treatment. In response to these challenges, we propose a fair adaptive experiment strategy that simultaneously enhances data use efficiency, achieves an ``envy-free'' treatment assignment guarantee, and improves the overall welfare of participants. An important feature of our proposed strategy is that we do not impose parametric modeling assumptions on the outcome variables, making it more versatile and applicable to a wider array of applications. Through our theoretical investigation, we characterize the convergence rate of the estimated treatment effects and the associated standard deviations at the group level and further prove that our adaptive treatment assignment algorithm, despite not having a closed-form expression, approaches the optimal allocation rule asymptotically. Our proof strategy takes into account the fact that the allocation decisions in our design depend on sequentially accumulated data, which poses a significant challenge in characterizing the properties and conducting statistical inference of our method. We further provide simulation evidence and two synthetic data studies to showcase the performance of our fair adaptive experiment strategy.

Versatile Energy-Based Probabilistic Models for High Energy Physics
Taoli Cheng Aaron Courville



Research question: build a versatile energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider.
Motivation: energy-based models, a classical generative modeling approach, offer flexibility in the form of the energy function and have recently achieved great success in modeling high-dimensional data in computer vision and natural language processing.
Method: build on a powerful generative model that describes higher-order inter-particle interactions, suits different encoding architectures, and is based on implicit generation.
Results: the framework can serve as a powerful parameterized event generator for physics simulation, a generic anomalous-signal detector free from spurious correlations, and an augmented event classifier for particle identification.

As a classical generative modeling approach, energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In line with these advancements, we build a multi-purpose energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. In terms of applications, it can serve as a powerful parameterized event generator for physics simulation, a generic anomalous signal detector free from spurious correlations, and an augmented event classifier for particle identification.

NAS-X: Neural Adaptive Smoothing via Twisting
Dieterich Lawson Michael Y. Li Scott Linderman



Research question: as sequential latent variable models in statistics and machine learning grow more flexible, analytic inference and model learning become challenging.
Motivation: to address this, the authors propose neural adaptive smoothing via twisting (NAS-X), which extends reweighted wake-sleep (RWS) to the sequential setting by using smoothing sequential Monte Carlo (SMC) to estimate intractable posterior expectations.
Method: NAS-X combines RWS with smoothing SMC, providing low-bias, low-variance gradient estimates and fitting both discrete and continuous latent variable models.
Results: experiments show that NAS-X substantially outperforms previous VI- and RWS-based methods in inference and model learning, achieving lower parameter error and tighter likelihood bounds.

Sequential latent variable models (SLVMs) are essential tools in statistics and machine learning, with applications ranging from healthcare to neuroscience. As their flexibility increases, analytic inference and model learning can become challenging, necessitating approximate methods. Here we introduce neural adaptive smoothing via twisting (NAS-X), a method that extends reweighted wake-sleep (RWS) to the sequential setting by using smoothing sequential Monte Carlo (SMC) to estimate intractable posterior expectations. Combining RWS and smoothing SMC allows NAS-X to provide low-bias and low-variance gradient estimates, and fit both discrete and continuous latent variable models. We illustrate the theoretical advantages of NAS-X over previous methods and explore these advantages empirically in a variety of tasks, including a challenging application to mechanistic models of neuronal dynamics. These experiments show that NAS-X substantially outperforms previous VI- and RWS-based methods in inference and model learning, achieving lower parameter error and tighter likelihood bounds.

Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder
Michael Bereket Theofanis Karaletsos



Research question: how to model the effects of interventions on observations, in particular the effects of diverse perturbations on cells in drug discovery.
Motivation: in fields such as drug discovery, the effects of diverse interventions on cells must be modeled in order to characterize unknown biological mechanisms of action.
Method: propose the Sparse Additive Mechanism Shift Variational Autoencoder (SAMS-VAE), which combines compositionality, disentanglement, and interpretability for perturbation models; SAMS-VAE models the latent state of a perturbed sample as the sum of a local latent variable capturing sample-specific variation and sparse global latent variables of intervention effects.
Results: quantitative and qualitative evaluation on two popular single-cell sequencing datasets shows that SAMS-VAE outperforms comparable models in generalization across in-distribution and out-of-distribution tasks and yields interpretable latent structures that correlate strongly with known biological mechanisms.

Generative models of observations under interventions have been a vibrant topic of interest across machine learning and the sciences in recent years. For example, in drug discovery, there is a need to model the effects of diverse interventions on cells in order to characterize unknown biological mechanisms of action. We propose the Sparse Additive Mechanism Shift Variational Autoencoder, SAMS-VAE, to combine compositionality, disentanglement, and interpretability for perturbation models. SAMS-VAE models the latent state of a perturbed sample as the sum of a local latent variable capturing sample-specific variation and sparse global variables of latent intervention effects. Crucially, SAMS-VAE sparsifies these global latent variables for individual perturbations to identify disentangled, perturbation-specific latent subspaces that are flexibly composable. We evaluate SAMS-VAE both quantitatively and qualitatively on a range of tasks using two popular single cell sequencing datasets. In order to measure perturbation-specific model-properties, we also introduce a framework for evaluation of perturbation models based on average treatment effects with links to posterior predictive checks. SAMS-VAE outperforms comparable models in terms of generalization across in-distribution and out-of-distribution tasks, including a combinatorial reasoning task under resource paucity, and yields interpretable latent structures which correlate strongly to known biological mechanisms. Our results suggest SAMS-VAE is an interesting addition to the modeling toolkit for machine learning-driven scientific discovery.
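The additive latent composition is simple to state in code; the sketch below is our reading of the abstract, with all tensor names hypothetical.

```python
import torch

def sams_latent(z_local, e_global, mask, d_multi_hot):
    """Minimal sketch of a SAMS-VAE-style latent: z = z_local plus the sum of
    sparse global intervention effects for the perturbations applied to each
    sample. e_global: (P, d) one effect vector per perturbation; mask: (P, d)
    sparse binary gates; d_multi_hot: (batch, P) applied perturbations."""
    return z_local + d_multi_hot @ (mask * e_global)
```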

Lie Point Symmetry and Physics-Informed Networks
Tara Akhound-Sadegh Laurence Perreault-Levasseur Johannes Brandstetter Max Welling Siamak Ravanbakhsh



Research question: explore the integration of PDE symmetries (Lie point symmetries) into neural PDE solvers, specifically physics-informed neural networks (PINNs).
Motivation: despite the potential of symmetries to improve the generalization of neural networks, their use in neural solvers for partial differential equations (PDEs) remains largely unexplored.
Method: propose a loss function that informs the network about Lie point symmetries, in the same way that PINN models enforce the underlying PDE through a loss function.
Results: empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.

Symmetries have been leveraged to improve the generalization of neural networks through different mechanisms from data augmentation to equivariant architectures. However, despite their potential, their integration into neural solvers for partial differential equations (PDEs) remains largely unexplored. We explore the integration of PDE symmetries, known as Lie point symmetries, in a major family of neural solvers known as physics-informed neural networks (PINNs). We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function. Intuitively, our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.
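A hedged sketch of what such a symmetry loss can look like for the 1-D heat equation $u_t = u_{xx}$, whose Lie point symmetries include space translation with generator $\partial_x$: alongside the usual PINN residual, penalize the derivative of the residual along the generator. This is a simplified stand-in for the paper's loss, and `u_net` is any differentiable network mapping `(N, 2)` inputs to `(N, 1)` outputs.

```python
import torch

def pinn_losses(u_net, x, t):
    """PDE residual for u_t = u_xx plus a symmetry term for the
    space-translation generator d/dx (simplified illustration only)."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = u_net(torch.stack([x, t], dim=-1)).squeeze(-1)
    g = lambda out, var: torch.autograd.grad(out.sum(), var, create_graph=True)[0]
    u_t, u_x = g(u, t), g(u, x)
    residual = u_t - g(u_x, x)      # PINN residual: u_t - u_xx
    sym = g(residual, x)            # generator d/dx applied to the residual
    return (residual ** 2).mean(), (sym ** 2).mean()
```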

PAC-Bayes Generalization Certificates for Learned Inductive Conformal Prediction
Apoorva Sharma Sushant Veer Asher Hancock Heng Yang Marco Pavone Anirudha Majumdar



Research question: how to improve the efficiency of uncertainty estimates for deep learning models.
Motivation: current inductive conformal prediction (ICP) provides coverage guarantees, but the size and contents of the prediction sets are not directly controlled and instead depend on the underlying model and the choice of score function.
Method: learn model and score-function parameters by using data to directly optimize the efficiency of the ICP prediction sets.
Results: PAC-Bayes theory yields generalization bounds on both coverage and efficiency for set-valued predictors that can be directly optimized to maximize efficiency while satisfying a desired test coverage; on regression and classification tasks, the approach outperforms baselines calibrated with a Hoeffding bound-based PAC guarantee on ICP, especially in the low-data regime.

Inductive Conformal Prediction (ICP) provides a practical and effective approach for equipping deep learning models with uncertainty estimates in the form of set-valued predictions which are guaranteed to contain the ground truth with high probability. Despite the appeal of this coverage guarantee, these sets may not be efficient: the size and contents of the prediction sets are not directly controlled, and instead depend on the underlying model and choice of score function. To remedy this, recent work has proposed learning model and score function parameters using data to directly optimize the efficiency of the ICP prediction sets. While appealing, the generalization theory for such an approach is lacking: direct optimization of empirical efficiency may yield prediction sets that are either no longer efficient on test data, or no longer obtain the required coverage on test data. In this work, we use PAC-Bayes theory to obtain generalization bounds on both the coverage and the efficiency of set-valued predictors which can be directly optimized to maximize efficiency while satisfying a desired test coverage. In contrast to prior work, our framework allows us to utilize the entire calibration dataset to learn the parameters of the model and score function, instead of requiring a separate hold-out set for obtaining test-time coverage guarantees. We leverage these theoretical results to provide a practical algorithm for using calibration data to simultaneously fine-tune the parameters of a model and score function while guaranteeing test-time coverage and efficiency of the resulting prediction sets. We evaluate the approach on regression and classification tasks, and outperform baselines calibrated using a Hoeffding bound-based PAC guarantee on ICP, especially in the low-data regime.

Derandomized novelty detection with FDR control via conformal e-values
Meshi Bashari Amir Epstein Yaniv Romano Matteo Sesia



Research question: how to make conformal inference more stable by using suitable conformal e-values instead of p-values to quantify statistical significance, reducing the randomness that arises when the same data are analyzed multiple times.
Motivation: current conformal inference methods are powerful, but their randomness limits the stability and interpretability of their results.
Method: propose a novel approach that weights conformal e-values in an innovative way using additional side information carefully extracted from the same data, effectively aggregating evidence from multiple analyses of the same data while provably controlling the false discovery rate.
Results: simulations with synthetic and real data show that the method can effectively eliminate random noise from the inferences obtained with state-of-the-art alternative techniques, sometimes also leading to higher power.

Conformal inference provides a general distribution-free method to rigorously calibrate the output of any machine learning algorithm for novelty detection. While this approach has many strengths, it has the limitation of being randomized, in the sense that it may lead to different results when the same data are analyzed twice, which can hinder the interpretation of any findings. We propose to make conformal inferences more stable by leveraging suitable conformal e-values instead of p-values to quantify statistical significance. This solution allows the evidence gathered from multiple analyses of the same data to be aggregated effectively while provably controlling the false discovery rate. Further, we show that the proposed method can reduce randomness without much loss of power compared to standard conformal inference, partly thanks to an innovative way of weighting conformal e-values based on additional side information carefully extracted from the same data. Simulations with synthetic and real data confirm this solution can be effective at eliminating random noise in the inferences obtained with state-of-the-art alternative techniques, sometimes also leading to higher power.
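For context, here is the generic e-BH procedure that makes FDR control with e-values possible (a standard rule, not this paper's contribution; the paper's novelty lies in constructing and weighting the conformal e-values fed into such a rule).

```python
import numpy as np

def e_bh(e_values, alpha=0.1):
    """e-BH: reject the k hypotheses with the largest e-values, where k is the
    largest index such that the k-th largest e-value is >= n / (alpha * k)."""
    e = np.asarray(e_values, dtype=float)
    n = len(e)
    order = np.argsort(-e)                       # indices by descending e-value
    thresh = n / (alpha * np.arange(1, n + 1))   # n/(alpha*1), n/(alpha*2), ...
    ks = np.nonzero(e[order] >= thresh)[0]
    if len(ks) == 0:
        return np.array([], dtype=int)           # nothing rejected
    return order[: ks.max() + 1]                 # indices of rejected hypotheses
```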

Discriminative Calibration: Check Bayesian Computation from Simulations and Flexible Classifier
Yuling Yao Justin Domke



Research question: how to accurately check the accuracy of Bayesian computation.
Motivation: the commonly used rank-based simulation-based calibration (SBC) has drawbacks: the test statistic is somewhat ad hoc, interactions are difficult to examine, multiple testing is a challenge, and the resulting p-value is not a divergence metric.
Method: replace the marginal rank test with a flexible classification approach that learns test statistics from data; this typically has higher statistical power than the SBC test and returns an interpretable divergence measure of miscalibration computed from classification accuracy, and it can be used with different data-generating processes for simulation-based inference or traditional methods such as Markov chain Monte Carlo and variational inference.
Results: the method is validated with numerical and real-data experiments.

To check the accuracy of Bayesian computations, it is common to use rank-based simulation-based calibration (SBC). However, SBC has drawbacks: The test statistic is somewhat ad-hoc, interactions are difficult to examine, multiple testing is a challenge, and the resulting p-value is not a divergence metric. We propose to replace the marginal rank test with a flexible classification approach that learns test statistics from data. This measure typically has a higher statistical power than the SBC test and returns an interpretable divergence measure of miscalibration, computed from classification accuracy. This approach can be used with different data generating processes to address simulation-based inference or traditional inference methods like Markov chain Monte Carlo or variational inference. We illustrate an automated implementation using neural networks and statistically-inspired features, and validate the method with numerical and real data experiments.
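The classifier-based check can be prototyped in a few lines; the logistic regression below stands in for the paper's flexible classifier, and constructing the two labeled sample sets follows the usual classifier two-sample-test recipe rather than the paper's exact feature design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def calibration_check(true_joint_pairs, inferred_pairs):
    """Train a classifier to tell apart (theta, y) pairs from the true joint
    p(theta, y) versus pairs whose theta comes from the approximate posterior
    given y. Cross-validated accuracy near 0.5 suggests calibration; clearly
    above 0.5 flags miscalibration (and relates to a divergence measure)."""
    X = np.vstack([true_joint_pairs, inferred_pairs])
    z = np.concatenate([np.zeros(len(true_joint_pairs)),
                        np.ones(len(inferred_pairs))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, z, cv=5, scoring="accuracy").mean()
```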

Prediction and Control in Continual Reinforcement Learning
Nishanth Anand Doina Precup



Research question: value-function estimation in continual reinforcement learning.
Motivation: existing reinforcement-learning algorithms often struggle to adapt quickly when the environment keeps changing.
Method: decompose the value function into a permanent component and a transient component, updated on different timescales.
Results: experiments show that the approach significantly improves performance on both prediction and control problems.

Temporal difference (TD) learning is often used to update the estimate of the value function which is used by RL agents to extract useful policies. In this paper, we focus on value function estimation in continual reinforcement learning. We propose to decompose the value function into two components which update at different timescales: a _permanent_ value function, which holds general knowledge that persists over time, and a _transient_ value function, which allows quick adaptation to new situations. We establish theoretical results showing that our approach is well suited for continual learning and draw connections to the complementary learning systems (CLS) theory from neuroscience. Empirically, this approach improves performance significantly on both prediction and control problems.
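A minimal tabular sketch of the two-timescale decomposition follows; the step sizes, the consolidation schedule, and the reset rule are illustrative choices, not the paper's.

```python
import numpy as np

class PermanentTransientTD:
    """Value estimate V(s) = V_perm(s) + V_trans(s): the transient part adapts
    quickly, and is periodically distilled into the slow permanent part."""
    def __init__(self, n_states, gamma=0.99, alpha_trans=0.5, alpha_perm=0.01):
        self.Vp = np.zeros(n_states)   # permanent: general, persistent knowledge
        self.Vt = np.zeros(n_states)   # transient: fast, task-specific correction
        self.gamma, self.a_t, self.a_p = gamma, alpha_trans, alpha_perm

    def value(self, s):
        return self.Vp[s] + self.Vt[s]

    def td_update(self, s, r, s_next):
        delta = r + self.gamma * self.value(s_next) - self.value(s)
        self.Vt[s] += self.a_t * delta          # only the fast part tracks the TD error

    def consolidate(self, states):
        for s in states:                        # slowly absorb the transient part
            self.Vp[s] += self.a_p * (self.value(s) - self.Vp[s])
        self.Vt[:] = 0.0                        # reset for the next situation
```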

Intensity Profile Projection: A Framework for Continuous-Time Representation Learning for Dynamic Networks
Alexander Modell Ian Gallagher Emma Ceccherini Nick Whiteley Patrick Rubin-Delanchy



Research question: propose Intensity Profile Projection, a new representation learning framework for continuous-time dynamic network data.
Motivation: given time-stamped interactions $(i,j,t)$ between entities, one needs continuous-time node trajectories whose structural and temporal coherence make reliable downstream inference possible.
Method: estimate pairwise intensity functions (e.g., via kernel smoothing), learn a projection that minimizes an intensity reconstruction error, and construct evolving node representations via the learned projection.
Results: estimation theory provides tight control on the error of any estimated trajectory, elucidates the role of smoothing as a bias-variance trade-off, and shows how smoothing can be reduced as the signal-to-noise ratio increases because the algorithm borrows strength across the network.

We present a new representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data. Given triples $(i,j,t)$, each representing a time-stamped ($t$) interaction between two entities ($i,j$), our procedure returns a continuous-time trajectory for each node, representing its behaviour over time. The framework consists of three stages: estimating pairwise intensity functions, e.g. via kernel smoothing; learning a projection which minimises a notion of intensity reconstruction error; and constructing evolving node representations via the learned projection. The trajectories satisfy two properties, known as structural and temporal coherence, which we see as fundamental for reliable inference. Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses. The theory also elucidates the role of smoothing as a bias-variance trade-off, and shows how we can reduce the level of smoothing as the signal-to-noise ratio increases on account of the algorithm 'borrowing strength' across the network.

High Precision Causal Model Evaluation with Conditional Randomization
Chao Ma Cheng Zhang



Research question: how to evaluate causal models, especially in real-world conditional randomization settings.
Motivation: randomized controlled trials are the gold standard but are not always feasible or ethical; conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but can suffer from high estimation variance.
Method: propose a novel low-variance estimator of causal error, dubbed the pairs estimator; by applying the same IPW estimator to both the model and the true experimental effects, the estimator effectively cancels the variance due to IPW and achieves a smaller asymptotic variance.
Results: empirical studies demonstrate the improvement of the estimator, highlighting its potential to achieve near-RCT performance; the method evaluates causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.

The gold standard for causal model evaluation involves comparing model predictions with true effects estimated from randomized controlled trials (RCT). However, RCTs are not always feasible or ethical to perform. In contrast, conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but may suffer from high estimation variance. To tackle this challenge and enhance causal model evaluation in real-world conditional randomization settings, we introduce a novel low-variance estimator for causal error, dubbed as the pairs estimator. By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller asymptotic variance. Empirical studies demonstrate the improved performance of our estimator, highlighting its potential to achieve near-RCT performance. Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.
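One plausible instantiation of the pairs idea, as we read the abstract (applying the weights to model-predicted potential outcomes, in particular, is our assumption): apply the identical IPW transform to observed outcomes and to the model's predictions, so the weight-driven noise is common to both terms and cancels in the difference.

```python
import numpy as np

def pairs_causal_error(y, t, e, y1_hat, y0_hat):
    """Sketch of a paired IPW comparison. y: observed outcomes; t: binary
    treatment; e: propensity P(t=1|x); y1_hat/y0_hat: model-predicted potential
    outcomes. The same weights multiply both terms, so their shared variation
    cancels in the per-unit difference before averaging."""
    w1, w0 = t / e, (1 - t) / (1 - e)
    ipw_true = w1 * y - w0 * y                   # IPW effect from observed data
    ipw_model = w1 * y1_hat - w0 * y0_hat        # identical transform on the model
    return float(np.mean(ipw_true - ipw_model))  # estimated causal error
```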

SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from Diffusion Models
Martin Gonzalez Nelson Fernandez Thuy Vinh Dinh Tran Elies Gherbi Hatem Hajri Nader Masmoudi



Research question: how to improve the sampling speed and quality of pre-trained diffusion probabilistic models (DPMs).
Motivation: existing fast ODE solvers for the underlying differential equations are quick but do not usually reach the quality achieved by slow SDE solvers.
Method: propose Stochastic Explicit Exponential Derivative-free Solvers (SEEDS); by carefully analyzing the formulation of exact solutions of diffusion SDEs, the linear part is computed analytically, and a novel treatment of the stochastic components enables analytical computation of their variance, achieving optimal-quality sampling roughly 3-5x faster than previous SDE methods.
Results: the approach is validated on several image-generation benchmarks, where SEEDS outperform or are competitive with previous SDE solvers; SEEDS are derivative- and training-free, with fully proved strong convergence guarantees.

A potent class of generative models known as Diffusion Probabilistic Models (DPMs) has become prominent. A forward diffusion process gradually adds noise to data, while a model learns to gradually denoise. Sampling from pre-trained DPMs is obtained by solving differential equations (DE) defined by the learnt model, a process which has been shown to be prohibitively slow. Numerous efforts to speed up this process have focused on crafting powerful ODE solvers. Despite being quick, such solvers do not usually reach the optimal quality achieved by available slow SDE solvers. Our goal is to propose SDE solvers that reach optimal quality without requiring several hundreds or thousands of NFEs to achieve that goal. We propose Stochastic Explicit Exponential Derivative-free Solvers (SEEDS), improving and generalizing Exponential Integrator approaches to the stochastic case on several frameworks. After carefully analyzing the formulation of exact solutions of diffusion SDEs, we craft SEEDS to analytically compute the linear part of such solutions. Inspired by the Exponential Time-Differencing method, SEEDS use a novel treatment of the stochastic components of solutions, enabling the analytical computation of their variance, and contain high-order terms allowing them to reach optimal quality sampling $\sim3$-$5\times$ faster than previous SDE methods. We validate our approach on several image generation benchmarks, showing that SEEDS outperform or are competitive with previous SDE solvers. Contrary to the latter, SEEDS are derivative and training free, and we fully prove strong convergence guarantees for them.

Gacs-Korner Common Information Variational Autoencoder
Michael Kleinman Alessandro Achille Stefano Soatto Jonathan Kao



Research question: propose a notion of common information that quantifies and separates the information shared between two random variables from the information unique to each.
Motivation: existing notions of common information do not handle high-dimensional data such as images and videos well, so a new approach is needed to better understand and exploit common information.
Method: define common information via an optimization problem over a family of functions, and partition and quantify the common and unique information using a simple modification of a traditional variational autoencoder.
Results: empirically, the method learns semantically meaningful common and unique factors of variation and accurately quantifies the common information between random variables, even on high-dimensional data.

We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is defined by an optimization problem over a family of functions and recovers the G\'acs-K\"orner common information as a special case. Importantly, our notion can be approximated empirically using samples from the underlying data distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables.

Estimating Causal Effects Identifiable from a Combination of Observations and Experiments
Yonghan Jung Ivan Diaz Jin Tian Elias Bareinboim



Research question: determine whether a collection of observational and interventional distributions can be combined to learn a target causal relation, the problem of generalized identification (g-identification).
Motivation: although g-identification is well understood and solved in theory, applying these results in practice is challenging, in particular when estimating the target distribution from finite samples.
Method: develop a new, general estimator that exhibits multiply-robustness properties for g-identifiable causal functionals; specifically, show that any g-identifiable causal effect can be expressed as a function of generalized multi-outcome sequential back-door adjustments amenable to estimation, construct a corresponding estimator that is robust to bias, and analyze its asymptotic convergence properties.
Results: the use of the proposed estimator is illustrated in experimental studies, and simulation results corroborate the theory.

Learning cause and effect relations is arguably one of the central challenges found throughout the data sciences. Formally, determining whether a collection of observational and interventional distributions can be combined to learn a target causal relation is known as the problem of generalized identification (or g-identification) [Lee et al., 2019]. Although g-identification has been well understood and solved in theory, it turns out to be challenging to apply these results in practice, in particular when considering the estimation of the target distribution from finite samples. In this paper, we develop a new, general estimator that exhibits multiply robustness properties for g-identifiable causal functionals. Specifically, we show that any g-identifiable causal effect can be expressed as a function of generalized multi-outcome sequential back-door adjustments that are amenable to estimation. We then construct a corresponding estimator for the g-identification expression that exhibits robustness properties to bias. We analyze the asymptotic convergence properties of the estimator. Finally, we illustrate the use of the proposed estimator in experimental studies. Simulation results corroborate the theory.

Differentiable sorting for censored time-to-event data
Andre Vauvelle Benjamin Wild Roland Eils Spiros Denaxas



Research question: address survival analysis, an important semi-supervised task in machine learning with significant real-world applications, especially in healthcare.
Motivation: current survival-analysis approaches such as Cox's partial likelihood rest on restrictive assumptions about dependencies in the data, and existing differentiable sorting methods cannot account for censoring, a crucial aspect of many real-world datasets.
Method: propose Diffsurv, a novel method that extends differentiable sorting methods to censored tasks; Diffsurv predicts matrices of possible permutations that accommodate the label uncertainty introduced by censored samples.
Results: experiments show that Diffsurv outperforms established baselines in various simulated and real-world risk-prediction scenarios, and it enables a novel method for top-k risk prediction that surpasses current approaches.

Survival analysis is a crucial semi-supervised task in machine learning with significant real-world applications, especially in healthcare. The most common approach to survival analysis, Cox’s partial likelihood, can be interpreted as a ranking model optimized on a lower bound of the concordance index. We follow these connections further, with listwise ranking losses that allow for a relaxation of the pairwise independence assumption. Given the inherent transitivity of ranking, we explore differentiable sorting networks as a means to introduce a stronger transitive inductive bias during optimization. Despite their potential, current differentiable sorting methods cannot account for censoring, a crucial aspect of many real-world datasets. We propose a novel method, Diffsurv, to overcome this limitation by extending differentiable sorting methods to handle censored tasks. Diffsurv predicts matrices of possible permutations that accommodate the label uncertainty introduced by censored samples. Our experiments reveal that Diffsurv outperforms established baselines in various simulated and real-world risk prediction scenarios. Furthermore, we demonstrate the algorithmic advantages of Diffsurv by presenting a novel method for top-k risk prediction that surpasses current methods.

Causal Discovery in Semi-Stationary Time Series
Shanyun Gao Raghavendra Addanki Tong Yu Ryan A. Rossi Murat Kocaoglu



Research question: how to discover causal relations from observational time series without making the stationarity assumption.
Motivation: this challenge is common in many areas, such as retail sales, transportation systems, and medical science.
Method: propose a constraint-based, non-parametric algorithm for discovering causal relations in semi-stationary time series, in which a finite number of different causal mechanisms occur sequentially and periodically across time.
Results: extensive experiments on continuous and discrete simulated data validate the algorithm's ability to identify causal relations, and it is also applied to a real-world climate dataset.

Discovering causal relations from observational time series without making the stationary assumption is a significant challenge. In practice, this challenge is common in many areas, such as retail sales, transportation systems, and medical science. Here, we consider this problem for a class of non-stationary time series problems. The structural causal model (SCM) of this type of time series, called the semi-stationary time series, exhibits that a finite number of different causal mechanisms occur sequentially and periodically across time. This model holds considerable practical utility because it can represent periodicity, including common occurrences such as seasonality and diurnal variation. We propose a constraint-based, non-parametric algorithm for discovering causal relations in this setting. The resulting algorithm, PCMCI$_{\Omega}$, can capture the alternating and recurring changes in the causal mechanisms and then identify the underlying causal graph with conditional independence (CI) tests. We show that this algorithm is sound in identifying causal relations on discrete time series. We validate the algorithm with extensive experiments on continuous and discrete simulated data. We also apply our algorithm to a real-world climate dataset.

Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis
Victor Letzelter Mathieu Fontaine Mickael Chen Patrick Perez Slim Essid Gaël Richard



Research question: extend the Multiple Choice Learning (MCL) approach to conditional distribution estimation in regression settings where multiple targets may be sampled for each training input.
Motivation: existing MCL variants in regression settings focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions.
Method: introduce Resilient Multiple Choice Learning (rMCL), which relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which a probabilistic interpretation can be derived.
Results: after empirically validating rMCL on synthetic data, its merits are further assessed on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.

We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
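The Winner-Takes-All backbone plus a learned score head can be sketched as follows (a minimal PyTorch sketch; rMCL's actual scoring scheme carries a Voronoi-based probabilistic interpretation that this toy loss does not capture).

```python
import torch
import torch.nn.functional as F

def wta_with_scores(hypotheses, scores, target):
    """hypotheses: (B, K, d) candidate predictions; scores: (B, K) logits
    scoring each hypothesis; target: (B, d). Only the winning hypothesis
    receives a regression gradient; the score head is trained to predict
    which Voronoi cell (winner) the target falls into."""
    dists = ((hypotheses - target.unsqueeze(1)) ** 2).sum(-1)   # (B, K)
    winner = dists.argmin(dim=1)                                # index of best hypothesis
    wta = dists.gather(1, winner.unsqueeze(1)).mean()           # update winner only
    score_loss = F.cross_entropy(scores, winner)                # learn the scoring
    return wta + score_loss
```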

Fast Scalable and Accurate Discovery of DAGs Using the Best Order Score Search and Grow Shrink Trees
Bryan Andrews Joseph Ramsey Ruben Sanchez Romero Jazmin Camchong Erich Kummerfeld



Research question: how to learn graphical conditional independence structures effectively for problems with hundreds of highly connected variables.
Motivation: the accuracy and execution time of existing learning algorithms struggle to scale to such problems, for instance recovering brain networks from fMRI data.
Method: introduce the best order score search (BOSS) and grow-shrink trees (GSTs) for learning directed acyclic graphs (DAGs); BOSS greedily searches over permutations of variables, using GSTs to construct and score DAGs from permutations, while GSTs efficiently cache scores to eliminate redundant computations.
Results: BOSS achieves state-of-the-art accuracy and execution time, comparing favorably to a variety of combinatorial and gradient-based learning algorithms under a broad range of conditions; its practicality is demonstrated on two sets of resting-state fMRI data, and BOSS is available within the TETRAD project, which includes Python and R wrappers.

Learning graphical conditional independence structures is an important machine learning problem and a cornerstone of causal discovery. However, the accuracy and execution time of learning algorithms generally struggle to scale to problems with hundreds of highly connected variables---for instance, recovering brain networks from fMRI data. We introduce the best order score search (BOSS) and grow-shrink trees (GSTs) for learning directed acyclic graphs (DAGs) in this paradigm. BOSS greedily searches over permutations of variables, using GSTs to construct and score DAGs from permutations. GSTs efficiently cache scores to eliminate redundant calculations. BOSS achieves state-of-the-art performance in accuracy and execution time, comparing favorably to a variety of combinatorial and gradient-based learning algorithms under a broad range of conditions. To demonstrate its practicality, we apply BOSS to two sets of resting-state fMRI data: simulated data with pseudo-empirical noise distributions derived from randomized empirical fMRI cortical signals and clinical data from 3T fMRI scans processed into cortical parcels. BOSS is available for use within the TETRAD project which includes Python and R wrappers.

PROTES: Probabilistic Optimization with Tensor Sampling
Anastasia Batsheva Andrei Chertkov Gleb Ryzhakov Ivan Oseledets



Research question: develop PROTES, a new black-box optimization method for complex multidimensional arrays and discretized multivariable functions.
Motivation: existing discrete optimization methods perform poorly on large-scale complex problems, so a more effective method is needed.
Method: develop PROTES based on probabilistic sampling from a probability density function given in the low-parametric tensor train format.
Results: in numerical experiments, on both analytic model functions and complex problems, PROTES outperforms popular discrete optimization methods (Particle Swarm Optimization, Covariance Matrix Adaptation, Differential Evolution, and others).

We developed a new method, PROTES, for black-box optimization, based on probabilistic sampling from a probability density function given in the low-parametric tensor train format. We tested it on complex multidimensional arrays and discretized multivariable functions taken, among others, from real-world applications, including unconstrained binary optimization and optimal control problems, for which the possible number of elements is up to $2^{1000}$. In numerical experiments, both on analytic model functions and on complex problems, PROTES outperforms popular discrete optimization methods (Particle Swarm Optimization, Covariance Matrix Adaptation, Differential Evolution, and others).

Variational Gaussian processes for linear inverse problems
Thibault Christophe RANDRIANARISOA Botond Szabo



Research question: study Bayesian approaches to inverse problems, where the computational cost of standard sampling-based Bayesian methods can be prohibitive in complex models.
Motivation: in inverse problems the parameter or signal of interest is observed only indirectly, and the observations are typically further corrupted by noise; Bayes offers a natural way to regularize these problems via the prior distribution and provides probabilistic solutions that quantify the remaining uncertainty, but standard sampling-based Bayesian computation can be overly costly in complex models, making variational Bayes increasingly popular in practice.
Method: investigate variational Bayesian methods with Gaussian process priors for solving linear inverse problems, covering both mildly and severely ill-posed problems and working with the popular inducing-variable variational Bayes approach proposed by Titsias [Titsias, 2009]; derive posterior contraction rates for the variational posterior in general settings and show that the minimax estimation rate can be attained by correctly tuned procedures; as concrete examples, consider a collection of inverse problems including the heat equation, the Volterra operator, and the Radon transform, together with inducing-variable methods based on population and empirical spectral features.
Results: the analysis demonstrates the effectiveness of variational Bayes for solving linear inverse problems with theoretical guarantees.

By now Bayesian methods are routinely used in practice for solving inverse problems. In inverse problems the parameter or signal of interest is observed only indirectly, as an image of a given map, and the observations are typically further corrupted with noise. Bayes offers a natural way to regularize these problems via the prior distribution and provides a probabilistic solution, quantifying the remaining uncertainty in the problem. However, the computational costs of standard, sampling based Bayesian approaches can be overly large in such complex models. Therefore, in practice variational Bayes is becoming increasingly popular. Nevertheless, the theoretical understanding of these methods is still relatively limited, especially in the context of inverse problems. In our analysis we investigate variational Bayesian methods for Gaussian process priors to solve linear inverse problems. We consider both mildly and severely ill-posed inverse problems and work with the popular inducing variable variational Bayes approach proposed by Titsias [Titsias, 2009]. We derive posterior contraction rates for the variational posterior in general settings and show that the minimax estimation rate can be attained by correctly tuned procedures. As specific examples we consider a collection of inverse problems including the heat equation, Volterra operator and Radon transform and inducing variable methods based on population and empirical spectral features.

Robustifying Generalizable Implicit Shape Networks with a Tunable Non-Parametric Model
Amine Ouasfi Adnane Boukhayma



Research question: remedy the generalization issues of feedforward generalizable models for implicit shape reconstruction from unoriented point clouds.
Motivation: such models offer high performance and inference speed, but they still underfit the input point cloud, misrepresent samples outside the training data distribution, and struggle with topologies unseen at training time.
Method: at test time, combine the inter-shape data prior of the network with an intra-shape regularization prior of a Nyström Kernel Ridge Regression, further adapted by fitting its hyperparameters to the current shape, yielding a shape-adaptive expressiveness-robustness trade-off.
Results: the method improves over baselines and the state of the art on both synthetic and real data.

Feedforward generalizable models for implicit shape reconstruction from unoriented point clouds present multiple advantages, including high performance and inference speed. However, they still suffer from generalization issues, ranging from underfitting the input point cloud, to misrepresenting samples outside of the training data distribution, or with topologies unseen at training. We propose here an efficient mechanism to remedy some of these limitations at test time. We combine the inter-shape data prior of the network with an intra-shape regularization prior of a Nyström Kernel Ridge Regression, that we further adapt by fitting its hyperparameters to the current shape. The resulting shape function defined in a shape specific Reproducing Kernel Hilbert Space benefits from desirable stability and efficiency properties and grants a shape adaptive expressiveness-robustness trade-off. We demonstrate the improvement obtained through our method with respect to baselines and the state-of-the-art using synthetic and real data.
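For readers unfamiliar with the regularizer, a minimal Nyström kernel ridge regression looks like this (a generic regression sketch with an RBF kernel; in the paper it would fit an implicit shape function to point samples, and the shape-adaptive hyperparameter fitting is not shown).

```python
import numpy as np

def nystroem_krr_fit(X, y, landmarks, gamma=10.0, lam=1e-3):
    """Fit f(q) = sum_i alpha_i k(q, landmark_i): solve in the span of m
    landmark points instead of all n training points (m << n)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    Kmm = k(landmarks, landmarks)                 # (m, m)
    Knm = k(X, landmarks)                         # (n, m)
    # Minimize ||Knm a - y||^2 + lam * a^T Kmm a  =>  normal equations below.
    alpha = np.linalg.solve(Knm.T @ Knm + lam * Kmm, Knm.T @ y)
    return lambda Q: k(Q, landmarks) @ alpha      # predictor on queries Q
```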

Topology-Aware Uncertainty for Image Segmentation
Saumya Gupta Yikai Zhang Xiaoling Hu Prateek Prasanna Chao Chen



Research question: how to estimate uncertainty for curvilinear structures (such as vasculature and road networks) in units that human annotators can verify.
Motivation: segmentation of curvilinear structures is challenging due to relatively weak signals and complex geometry/topology; to facilitate and accelerate large-scale annotation, semi-automatic approaches such as expert proofreading must be adopted.
Method: leverage tools from topological data analysis, specifically discrete Morse theory (DMT), to first capture the structures and then reason about their uncertainties; propose a joint prediction model that estimates the uncertainty of a structure while taking neighboring structures into account (inter-structural uncertainty), and a novel probabilistic DMT that models the inherent uncertainty within each structure (intra-structural uncertainty) by sampling its representations via a perturb-and-walk scheme.
Results: on various 2D and 3D datasets, the method produces better structure-wise uncertainty maps than existing works.

Segmentation of curvilinear structures such as vasculature and road networks is challenging due to relatively weak signals and complex geometry/topology. To facilitate and accelerate large scale annotation, one has to adopt semi-automatic approaches such as proofreading by experts. In this work, we focus on uncertainty estimation for such tasks, so that highly uncertain, and thus error-prone structures can be identified for human annotators to verify. Unlike most existing works, which provide pixel-wise uncertainty maps, we stipulate it is crucial to estimate uncertainty in the units of topological structures, e.g., small pieces of connections and branches. To achieve this, we leverage tools from topological data analysis, specifically discrete Morse theory (DMT), to first capture the structures, and then reason about their uncertainties. To model the uncertainty, we (1) propose a joint prediction model that estimates the uncertainty of a structure while taking the neighboring structures into consideration (inter-structural uncertainty); (2) propose a novel Probabilistic DMT to model the inherent uncertainty within each structure (intra-structural uncertainty) by sampling its representations via a perturb-and-walk scheme. On various 2D and 3D datasets, our method produces better structure-wise uncertainty maps compared to existing works. Code available at: https://github.com/Saumya-Gupta-26/struct-uncertainty

Smooth, exact rotational symmetrization for deep learning on point clouds
Sergey Pozdnyakov Michele Ceriotti



Research question: how to add rotational symmetry to existing point-cloud models while preserving all other requirements.
Motivation: in chemical and materials modeling, strict compliance with physical constraints is essential, yet general-purpose point-cloud models often disregard rotational symmetry.
Method: propose a general symmetrization method that adds rotational equivariance to any given model without affecting the other requirements.
Results: with this protocol, the new Point Edge Transformer (PET) architecture, though not intrinsically equivariant, retains its accuracy while becoming exactly equivariant, achieving state-of-the-art performance on several benchmark datasets of molecules and solids.

Point clouds are versatile representations of 3D objects and have found widespread application in science and engineering. Many successful deep-learning models have been proposed that use them as input. The domain of chemical and materials modeling is especially challenging because exact compliance with physical constraints is highly desirable for a model to be usable in practice. These constraints include smoothness and invariance with respect to translations, rotations, and permutations of identical atoms. If these requirements are not rigorously fulfilled, atomistic simulations might lead to absurd outcomes even if the model has excellent accuracy. Consequently, dedicated architectures, which achieve invariance by restricting their design space, have been developed. General-purpose point-cloud models are more varied but often disregard rotational symmetry. We propose a general symmetrization method that adds rotational equivariance to any given model while preserving all the other requirements. Our approach simplifies the development of better atomic-scale ML schemes by relaxing the constraints on the design space and making it possible to incorporate ideas that proved effective in other domains. We demonstrate this idea by introducing the Point Edge Transformer (PET) architecture, which is not intrinsically equivariant but achieves state-of-the-art performance on several benchmark datasets of molecules and solids. A-posteriori application of our general protocol makes PET exactly equivariant, with minimal changes to its accuracy.
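To convey the idea only: averaging a model's output over rotations yields (approximate) invariance, as in the Monte Carlo sketch below. The paper's protocol is smooth and *exact* rather than sampled, so treat this purely as intuition; `model` is any scalar-valued point-cloud predictor, e.g. an energy.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_averaged(model, points, n_rot=64, seed=0):
    """Approximately rotation-invariant prediction by averaging over random
    rotations of the input cloud (points: (N, 3) array)."""
    Rs = Rotation.random(n_rot, random_state=seed).as_matrix()  # (n_rot, 3, 3)
    return float(np.mean([model(points @ R.T) for R in Rs]))
```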

Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control
Jann Spiess Guido Imbens Amar Venugopal



Research question: study highly over-parameterized models in causal inference, including synthetic control with many control units.
Motivation: motivated by the recent machine-learning literature on the double-descent phenomenon, consider models with so many free parameters that they fit the training data perfectly, and ask whether they can still perform well for causal estimation.
Method: first investigate high-dimensional linear regression for imputing wage data and estimating average treatment effects, finding that models with many more covariates than sample size can outperform simple ones; then document the performance of high-dimensional synthetic control estimators with many control units, finding that adding control units can improve imputation performance even beyond the point where the pre-treatment fit is perfect.
Results: a unified theoretical perspective shows that more complex models can be interpreted as model-averaging estimators over simpler ones, which is linked to an improvement in average performance; this perspective yields concrete insights into the use of synthetic control when control units are many relative to the number of pre-treatment periods.

Motivated by a recent literature on the double-descent phenomenon in machine learning, we consider highly over-parameterized models in causal inference, including synthetic control with many control units. In such models, there may be so many free parameters that the model fits the training data perfectly. We first investigate high-dimensional linear regression for imputing wage data and estimating average treatment effects, where we find that models with many more covariates than sample size can outperform simple ones. We then document the performance of high-dimensional synthetic control estimators with many control units. We find that adding control units can help improve imputation performance even beyond the point where the pre-treatment fit is perfect. We provide a unified theoretical perspective on the performance of these high-dimensional models. Specifically, we show that more complex models can be interpreted as model-averaging estimators over simpler ones, which we link to an improvement in average performance. This perspective yields concrete insights into the use of synthetic control when control units are many relative to the number of pre-treatment periods.

Latent SDEs on Homogeneous Spaces
Sebastian Zeng Florian Graf Roland Kwitt



Research question: variational Bayesian inference in a latent variable model where the observed stochastic process is governed by the unobserved solution of a latent stochastic differential equation (SDE).
Motivation: learning a latent SDE in $\mathbb{R}^n$ from large-scale data raises challenges such as efficient gradient computation, so a specific subclass is studied instead: SDEs that evolve inside a homogeneous latent space and are induced by stochastic dynamics of the corresponding (matrix) Lie group.
Method: for variational inference, the sphere not only facilitates a uniform prior on the initial state of the SDE, but also yields a particularly simple and intuitive expression for the KL divergence between the approximate posterior and prior processes in the evidence lower bound.
Results: empirical evidence shows that latent SDEs of the proposed type can be learned efficiently with an existing one-step geometric Euler-Maruyama scheme; despite the restriction to a less diverse class of SDEs, competitive or even state-of-the-art performance is achieved on a collection of time-series interpolation and classification benchmarks.

We consider the problem of variational Bayesian inference in a latent variable model where a (possibly complex) observed stochastic process is governed by the unobserved solution of a latent stochastic differential equation (SDE). Motivated by the challenges that arise when trying to learn a latent SDE in $\mathbb{R}^n$ from large-scale data, such as efficient gradient computation, we take a step back and study a specific subclass instead. In our case, the SDE evolves inside a homogeneous latent space and is induced by stochastic dynamics of the corresponding (matrix) Lie group. In the context of learning problems, SDEs on the $n$-dimensional unit sphere are arguably the most relevant incarnation of this setup. For variational inference, the sphere not only facilitates using a uniform prior on the initial state of the SDE, but we also obtain a particularly simple and intuitive expression for the KL divergence between the approximate posterior and prior process in the evidence lower bound. We provide empirical evidence that a latent SDE of the proposed type can be learned efficiently by means of an existing one-step geometric Euler-Maruyama scheme. Despite restricting ourselves to a less diverse class of SDEs, we achieve competitive or even state-of-the-art performance on a collection of time series interpolation and classification benchmarks.
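A small sketch of a one-step geometric Euler-Maruyama scheme of the kind referenced above, assuming SO(n) dynamics acting on the unit sphere and a single noise channel for simplicity (function names are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def skew(rng, n):
    A = rng.normal(size=(n, n))
    return (A - A.T) / 2.0

def geometric_euler_maruyama(x0, K, sigma, dt, n_steps, rng):
    """One-step geometric Euler-Maruyama on the unit sphere S^{n-1}.
    Each step applies the exponential of a skew-symmetric drift plus
    skew-symmetric noise; expm of a skew-symmetric matrix is orthogonal,
    so ||x|| = 1 is preserved exactly."""
    xs = [x0]
    for _ in range(n_steps):
        noise = np.sqrt(dt) * rng.normal() * sigma
        xs.append(expm(dt * K + noise) @ xs[-1])
    return np.stack(xs)

rng = np.random.default_rng(0)
x0 = np.array([1.0, 0.0, 0.0])
path = geometric_euler_maruyama(x0, skew(rng, 3), 0.3 * skew(rng, 3), 0.01, 500, rng)
print("max deviation from the sphere:", np.abs(np.linalg.norm(path, axis=1) - 1).max())
```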

Neural Sampling in Hierarchical Exponential-family Energy-based Models
Xingsi Dong Si Wu



Research question: This paper proposes the Hierarchical Exponential-family Energy-based (HEE) model to capture how the brain understands the external world through generative models.
Motivation: Bayesian brain theory holds that the brain uses generative models to understand the external world, and the sampling-based perspective posits that it infers posterior distributions through samples of stochastic neuronal responses; moreover, the brain continually updates its generative model to approach the true distribution of the external world.
Method: In the HEE model, we decompose the partition function into individual layers and use groups of neurons with shorter time constants to sample the gradient of the decomposed normalization terms. This lets the model estimate the partition function and perform inference simultaneously, avoiding the negative phase of conventional energy-based models (EBMs); learning is therefore localized in both time and space, and the model converges easily. To match the brain's rapid computation, we show that neural adaptation can act as a momentum term, significantly accelerating inference.
Results: On natural image datasets, the model exhibits representations similar to those observed in the biological visual system. For the machine-learning community, the model can generate observations through joint or marginal generation; marginal generation outperforms joint generation and performs on par with other EBMs.

Bayesian brain theory suggests that the brain employs generative models to understand the external world. The sampling-based perspective posits that the brain infers the posterior distribution through samples of stochastic neuronal responses. Additionally, the brain continually updates its generative model to approach the true distribution of the external world. In this study, we introduce the Hierarchical Exponential-family Energy-based (HEE) model, which captures the dynamics of inference and learning. In the HEE model, we decompose the partition function into individual layers and leverage a group of neurons with shorter time constants to sample the gradient of the decomposed normalization term. This allows our model to estimate the partition function and perform inference simultaneously, circumventing the negative phase encountered in conventional energy-based models (EBMs). As a result, the learning process is localized both in time and space, and the model is easy to converge. To match the brain's rapid computation, we demonstrate that neural adaptation can serve as a momentum term, significantly accelerating the inference process. On natural image datasets, our model exhibits representations akin to those observed in the biological visual system. Furthermore, for the machine learning community, our model can generate observations through joint or marginal generation. We show that marginal generation outperforms joint generation and achieves performance on par with other EBMs.

Switching Autoregressive Low-rank Tensor Models
Hyun Dong Lee Andrew Warrington Joshua I Glaser Scott Linderman



Research question: An important problem in time-series analysis is modeling systems with time-varying dynamics.
Motivation: Commonly used models such as autoregressive hidden Markov models (ARHMMs) and switching linear dynamical systems (SLDSs) each have advantages and disadvantages; a new model is needed that retains the strengths of both while ameliorating their weaknesses.
Method: This paper proposes switching autoregressive low-rank tensor (SALT) models, which parameterize the ARHMM tensor with a low-rank factorization to control the number of parameters, allowing longer-range dependencies to be captured without overfitting.
Results: Experiments demonstrate quantitative advantages of SALT models on a range of simulated and real prediction tasks, including behavioral and neural datasets. Moreover, the learned low-rank tensors provide novel insights into temporal dependencies within each discrete state.

An important problem in time-series analysis is modeling systems with time-varying dynamics. Probabilistic models with joint continuous and discrete latent states offer interpretable, efficient, and experimentally useful descriptions of such data. Commonly used models include autoregressive hidden Markov models (ARHMMs) and switching linear dynamical systems (SLDSs), each with its own advantages and disadvantages. ARHMMs permit exact inference and easy parameter estimation, but are parameter intensive when modeling long dependencies, and hence are prone to overfitting. In contrast, SLDSs can capture long-range dependencies in a parameter efficient way through Markovian latent dynamics, but present an intractable likelihood and a challenging parameter estimation task. In this paper, we propose _switching autoregressive low-rank tensor_ SALT models, which retain the advantages of both approaches while ameliorating the weaknesses. SALT parameterizes the tensor of an ARHMM with a low-rank factorization to control the number of parameters and allow longer range dependencies without overfitting. We prove theoretical and discuss practical connections between SALT, linear dynamical systems, and SLDSs. We empirically demonstrate quantitative advantages of SALT models on a range of simulated and real prediction tasks, including behavioral and neural datasets. Furthermore, the learned low-rank tensor provides novel insights into temporal dependencies within each discrete state.
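A sketch of the core parameterization for a single discrete state (SALT additionally switches between several such tensors via an HMM): the autoregressive tensor is CP-factorized, so a long lag window costs only (2N + L)R parameters instead of N^2 L.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, R, T = 5, 20, 3, 200   # obs dim, AR lags, tensor rank, series length

# CP-factorized AR tensor A[n, m, l] = sum_r U[n, r] V[m, r] W[l, r].
U, V, W = (0.1 * rng.normal(size=s) for s in [(N, R), (N, R), (L, R)])

def salt_predict(history):
    """Predict the next observation from the last L observations (newest last)."""
    lagged = history[::-1][:L]                    # (L, N), most recent lag first
    return np.einsum("nr,mr,lr,lm->n", U, V, W, lagged)

series = list(rng.normal(size=(L, N)))
for _ in range(T):
    series.append(salt_predict(np.array(series)) + 0.01 * rng.normal(size=N))
print("simulated series shape:", np.array(series).shape)
```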

Human spatiotemporal pattern learning as probabilistic program synthesis
Tracey Mills Joshua B. Tenenbaum Samuel J Cheyette



Research question: What representations and algorithms allow people to learn a wide variety of structured patterns from small amounts of data?
Motivation: Human learning is jointly flexible and data-efficient, a conundrum from the standpoint of the bias-variance tradeoff; one possibility is that people "learn by programming", inducing probabilistic models to fit observed data.
Method: Experimentally test human learning of structured 2-dimensional patterns using a task in which participants repeatedly predict where a dot will move based on its previous trajectory, and compare human performance against standard parametric and non-parametric time-series models as well as two Bayesian program synthesis models: a compositional Gaussian Process model and a structured "Language of Thought" (LoT) model.
Results: Signatures of human pattern learning are best explained by the LoT model, supporting the idea that the flexibility and data-efficiency of human structure learning can be understood as probabilistic inference over an expressive space of programs.

People are adept at learning a wide variety of structured patterns from small amounts of data, presenting a conundrum from the standpoint of the bias-variance tradeoff: what kinds of representations and algorithms support the joint flexibility and data-paucity of human learning? One possibility is that people "learn by programming": inducing probabilistic models to fit observed data. Here, we experimentally test human learning in the domain of structured 2-dimensional patterns, using a task in which participants repeatedly predicted where a dot would move based on its previous trajectory. We evaluate human performance against standard parametric and non-parametric time-series models, as well as two Bayesian program synthesis models whose hypotheses vary in their degree of structure: a compositional Gaussian Process model and a structured "Language of Thought" (LoT) model. We find that signatures of human pattern learning are best explained by the LoT model, supporting the idea that the flexibility and data-efficiency of human structure learning can be understood as probabilistic inference over an expressive space of programs.

Quantification of Uncertainty with Adversarial Models
Kajetan Schweighofer Lukas Aichberger Mykyta Ielanskyi Günter Klambauer Sepp Hochreiter



Research question: How can predictive uncertainty be estimated accurately, especially for actionable predictions in real-world applications?
Motivation: Current predictive uncertainty quantification methods such as Deep Ensembles or MC dropout primarily consider the posterior and therefore underperform at estimating epistemic uncertainty.
Method: Propose Quantification of Uncertainty with Adversarial Models (QUAM), which considers not only the posterior but identifies entire regions where the product under the integral is large.
Results: Experiments show that QUAM excels at capturing the epistemic uncertainty of deep learning models and outperforms previous methods on challenging tasks in the vision domain.

Quantifying uncertainty is important for actionable predictions in real-world applications. A crucial part of predictive uncertainty quantification is the estimation of epistemic uncertainty, which is defined as an integral of the product between a divergence function and the posterior. Current methods such as Deep Ensembles or MC dropout underperform at estimating the epistemic uncertainty, since they primarily consider the posterior when sampling models. We suggest Quantification of Uncertainty with Adversarial Models (QUAM) to better estimate the epistemic uncertainty. QUAM identifies regions where the whole product under the integral is large, not just the posterior. Consequently, QUAM has lower approximation error of the epistemic uncertainty compared to previous methods. Models for which the product is large correspond to adversarial models (not adversarial examples!). Adversarial models have both a high posterior as well as a high divergence between their predictions and that of a reference model. Our experiments show that QUAM excels in capturing epistemic uncertainty for deep learning models and outperforms previous methods on challenging tasks in the vision domain.
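A simplified sketch of the adversarial-model search idea: starting from a reference network, ascend the prediction divergence at a test input while penalizing training loss as a crude posterior proxy. The penalty weight, optimizer settings, and divergence choice are illustrative assumptions, not the paper's exact procedure.

```python
import copy
import torch
import torch.nn as nn

def find_adversarial_model(reference, x_train, y_train, x_star, c=1.0, steps=200):
    """Search for a model that disagrees with the reference at x_star while
    still fitting the training data (loss penalty as a crude posterior proxy)."""
    adv = copy.deepcopy(reference)
    opt = torch.optim.Adam(adv.parameters(), lr=1e-2)
    ce = nn.CrossEntropyLoss()
    p_ref = reference(x_star).softmax(-1).detach()
    for _ in range(steps):
        logp_adv = adv(x_star).log_softmax(-1)
        divergence = -(p_ref * logp_adv).sum()   # cross-entropy to the reference
        loss = -divergence + c * ce(adv(x_train), y_train)
        opt.zero_grad(); loss.backward(); opt.step()
    return adv

# Toy usage on synthetic data (hypothetical stand-ins for a real task).
torch.manual_seed(0)
ref = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
x, y = torch.randn(128, 4), torch.randint(0, 3, (128,))
x_star = torch.randn(1, 4)
adv = find_adversarial_model(ref, x, y, x_star)
print(ref(x_star).softmax(-1).data, adv(x_star).softmax(-1).data)
```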

Neural Latent Geometry Search: Product Manifold Inference via Gromov-Hausdorff-Informed Bayesian Optimization
Haitz Sáez de Ocáriz Borde Alvaro Arroyo Ismael Morales López Ingmar Posner Xiaowen Dong



Research question: How can the optimal latent geometry be identified automatically to improve the performance of machine-learning models?
Motivation: Current machine-learning models mostly rely on Euclidean space, yet research shows that hyperbolic and spherical spaces of constant curvature, or combinations thereof, can better model the latent space and improve performance; however, little work addresses automatically identifying the optimal latent geometry.
Method: We propose neural latent geometry search (NLGS). Under some simplifying assumptions, we make an initial attempt to search for latent geometries composed of constant-curvature model spaces using a small number of query evaluations. To this end, we propose a novel notion of distance between candidate latent geometries based on the Gromov-Hausdorff distance from metric geometry, design a graph search space based on smoothness between latent geometries with the computed distances as an additional inductive bias, and finally use Bayesian optimization to search for the optimal latent geometry in a query-efficient manner.
Results: Experiments on synthetic and real-world datasets identify the optimal latent geometry for multiple machine-learning problems, showing that the method finds it effectively and thereby improves model performance.

Recent research indicates that the performance of machine learning models can be improved by aligning the geometry of the latent space with the underlying data structure. Rather than relying solely on Euclidean space, researchers have proposed using hyperbolic and spherical spaces with constant curvature, or combinations thereof, to better model the latent space and enhance model performance. However, little attention has been given to the problem of automatically identifying the optimal latent geometry for the downstream task. We mathematically define this novel formulation and coin it as neural latent geometry search (NLGS). More specifically, we introduce an initial attempt to search for a latent geometry composed of a product of constant curvature model spaces with a small number of query evaluations, under some simplifying assumptions. To accomplish this, we propose a novel notion of distance between candidate latent geometries based on the Gromov-Hausdorff distance from metric geometry. In order to compute the Gromov-Hausdorff distance, we introduce a mapping function that enables the comparison of different manifolds by embedding them in a common high-dimensional ambient space. We then design a graph search space based on the notion of smoothness between latent geometries and employ the calculated distances as an additional inductive bias. Finally, we use Bayesian optimization to search for the optimal latent geometry in a query-efficient manner. This is a general method which can be applied to search for the optimal latent geometry for a variety of models and downstream tasks. We perform experiments on synthetic and real-world datasets to identify the optimal latent geometry for multiple machine learning problems.

SHAP-IQ: Unified Approximation of any-order Shapley Interactions
Fabian Fumagalli Maximilian Muschalik Patrick Kolpaczki Eyke Hüllermeier Barbara Eva Hammer



Research question: How can Shapley interactions be computed efficiently for arbitrary cardinal interaction indices (CII)?
Motivation: Existing Shapley interaction computations require index-specific approximation techniques and come without theoretical guarantees on approximation quality.
Method: Propose SHAPley Interaction Quantification (SHAP-IQ), an efficient sampling-based approximator built on a novel representation, to compute Shapley interactions for any CII.
Results: Applications to language, image classification, and high-dimensional synthetic models demonstrate SHAP-IQ's computational efficiency and explanatory effectiveness.

Predominately in explainable artificial intelligence (XAI) research, the Shapley value (SV) is applied to determine feature attributions for any black box model. Shapley interaction indices extend the SV to define any-order feature interactions. Defining a unique Shapley interaction index is an open research question and, so far, three definitions have been proposed, which differ by their choice of axioms. Moreover, each definition requires a specific approximation technique. Here, we propose SHAPley Interaction Quantification (SHAP-IQ), an efficient sampling-based approximator to compute Shapley interactions for arbitrary cardinal interaction indices (CII), i.e. interaction indices that satisfy the linearity, symmetry and dummy axiom. SHAP-IQ is based on a novel representation and, in contrast to existing methods, we provide theoretical guarantees for its approximation quality, as well as estimates for the variance of the point estimates. For the special case of SV, our approach reveals a novel representation of the SV and corresponds to Unbiased KernelSHAP with a greatly simplified calculation. We illustrate the computational efficiency and effectiveness by explaining language, image classification and high-dimensional synthetic models.
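For grounding, here is the classic Monte Carlo permutation-sampling estimator for the special case of the plain Shapley value (order-1 interactions); SHAP-IQ replaces this kind of index-specific scheme with one unified sampling representation covering any-order CIIs.

```python
import numpy as np

def shapley_values(value_fn, n_features, n_perm=200, seed=0):
    """Permutation sampling for the Shapley value: average each feature's
    marginal contribution over random orderings of all features."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_features)
    for _ in range(n_perm):
        coalition = []
        prev = value_fn(coalition)
        for j in rng.permutation(n_features):
            coalition.append(j)
            cur = value_fn(coalition)
            phi[j] += cur - prev            # marginal contribution of feature j
            prev = cur
    return phi / n_perm

# Toy game: coalition value is the sum of member weights, plus a synergy
# between features 0 and 1 (which the SV splits equally between them).
w = np.array([1.0, 2.0, 0.5])
v = lambda S: w[list(S)].sum() + (1.0 if {0, 1} <= set(S) else 0.0)
print(shapley_values(v, 3))   # approximately [1.5, 2.5, 0.5]
```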

Statistically Valid Variable Importance Assessment through Conditional Permutations
Ahmad Chamma Denis Engemann Bertrand Thirion



Research question: How can variable importance be assessed accurately in complex learners such as deep neural networks?
Motivation: Commonly used removal-based importance assessments risk misidentifying correlated covariates as important; a more accurate approach is needed.
Method: Develop a systematic, model-agnostic, and computationally lean Conditional Permutation Importance (CPI) method, together with reusable benchmarks of state-of-the-art variable importance estimators.
Results: Theory and experiments show that CPI overcomes the limitations of standard permutation importance by providing accurate type-I error control. In real-data experiments on a large-scale medical dataset, CPI yields a more parsimonious selection of statistically significant variables.

Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that \textit{CPI} overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, \textit{CPI} consistently showed top accuracy across benchmarks. An experiment on real-world data analysis in a large-scale medical dataset showed that \textit{CPI} provides a more parsimonious selection of statistically significant variables. Our results suggest that \textit{CPI} can be readily used as drop-in replacement for permutation-based methods.
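A minimal sketch of the conditional permutation idea, assuming a linear conditional model for the covariate of interest (the actual estimator is more general): permuting only the residuals of x_j given the other covariates preserves their correlation structure, unlike marginal permutation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def conditional_permutation_importance(model, X, y, j, n_perm=100, seed=0):
    """Conditional Permutation Importance for covariate j: model X[:, j] given
    the other covariates, permute only the residuals, and measure the loss
    increase when the rebuilt column replaces the original."""
    rng = np.random.default_rng(seed)
    others = np.delete(X, j, axis=1)
    cond = LinearRegression().fit(others, X[:, j])   # simple conditional model
    resid = X[:, j] - cond.predict(others)
    base = mean_squared_error(y, model.predict(X))
    drops = []
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, j] = cond.predict(others) + rng.permutation(resid)
        drops.append(mean_squared_error(y, model.predict(Xp)) - base)
    return float(np.mean(drops))

# Toy usage: two highly correlated covariates, only the first matters.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500); x2 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2]); y = x1 + 0.1 * rng.normal(size=500)
f = LinearRegression().fit(X, y)
print([conditional_permutation_importance(f, X, y, j) for j in (0, 1)])
```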

Topological Obstructions and How to Avoid Them
Babak Esmaeili Robin Walters Heiko Zimmermann Jan-Willem van de Meent



Research question: Incorporating geometric inductive biases into models can improve interpretability and generalization, but encoding to a specific geometric structure can be challenging due to the imposed topological constraints.
Motivation: Training encoders with geometric latent spaces can run into obstructions, including local optima arising from singularities (such as self-intersections) or from an incorrect degree or winding number.
Method: Normalizing flows can potentially circumvent these obstructions by defining multimodal variational distributions; inspired by this observation, we propose a new flow-based model that maps data points to multimodal distributions over geometric spaces.
Results: We evaluate the model empirically on two domains and observe improved stability during training and a higher chance of converging to a homeomorphic encoder.

Incorporating geometric inductive biases into models can aid interpretability and generalization, but encoding to a specific geometric structure can be challenging due to the imposed topological constraints. In this paper, we theoretically and empirically characterize obstructions to training encoders with geometric latent spaces. We show that local optima can arise due to singularities (e.g. self-intersection) or due to an incorrect degree or winding number. We then discuss how normalizing flows can potentially circumvent these obstructions by defining multimodal variational distributions. Inspired by this observation, we propose a new flow-based model that maps data points to multimodal distributions over geometric spaces and empirically evaluate our model on 2 domains. We observe improved stability during training and a higher chance of converging to a homeomorphic encoder.

Perceptual adjustment queries and an inverted measurement paradigm for low-rank metric learning
Austin Xu Andrew McRae Jingyan Wang Mark A. Davenport Ashwin Pananjady



Research question: This paper introduces a new type of human-feedback query mechanism, the perceptual adjustment query (PAQ), and showcases its application to metric learning.
Motivation: To handle the high-dimensional, low-rank matrix estimation problem to which standard matrix estimators cannot be applied, we propose a query mechanism that combines the advantages of cardinal and ordinal queries.
Method: We design the PAQ with an inverted measurement scheme and collect PAQ measurements to learn an unknown Mahalanobis distance; we also develop a two-stage estimator for PAQ-based metric learning and provide sample complexity guarantees for it.
Results: Numerical simulations demonstrate the estimator's strong performance and notable properties.

We introduce a new type of query mechanism for collecting human feedback, called the perceptual adjustment query (PAQ). Being both informative and cognitively lightweight, the PAQ adopts an inverted measurement scheme, and combines advantages from both cardinal and ordinal queries. We showcase the PAQ in the metric learning problem, where we collect PAQ measurements to learn an unknown Mahalanobis distance. This gives rise to a high-dimensional, low-rank matrix estimation problem to which standard matrix estimators cannot be applied. Consequently, we develop a two-stage estimator for metric learning from PAQs, and provide sample complexity guarantees for this estimator. We present numerical simulations demonstrating the performance of the estimator and its notable properties.

Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning
Changyu Chen Ramesha Karunasena Thanh Hong Nguyen Arunesh Sinha Pradeep Varakantham



Research question: In reinforcement learning, how can large discrete, unordered action spaces be optimized, particularly in problems such as randomized resource allocation?
Motivation: Existing RL methods perform poorly on large categorical (discrete and unordered) action spaces, and these problems also require validity of the realized action, which is difficult to express compactly in a closed mathematical form.
Method: This work uses a conditional normalizing-flow network to compactly represent the stochastic policy, with an actor-critic method consuming sampled actions and their corresponding action probabilities, and updates the base policy via an invalid-action rejection method (through a valid-action oracle).
Results: Experiments show the approach scales better than prior methods and can enforce arbitrary state-conditional constraints on the support of the action distribution in any state.

Many problems in Reinforcement Learning (RL) seek an optimal policy with large discrete multidimensional yet unordered action spaces; these include problems in randomized allocation of resources such as placements of multiple security resources and emergency response units, etc. A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large, for which existing RL methods do not perform well. Moreover, these problems require validity of the realized action (allocation); this validity constraint is often difficult to express compactly in a closed mathematical form. The allocation nature of the problem also prefers stochastic optimal policies, if one exists. In this work, we address these challenges by (1) applying a (state) conditional normalizing flow to compactly represent the stochastic policy — the compactness arises due to the network only producing one sampled action and the corresponding log probability of the action, which is then used by an actor-critic method; and (2) employing an invalid action rejection method (via a valid action oracle) to update the base policy. The action rejection is enabled by a modified policy gradient that we derive. Finally, we conduct extensive experiments to show the scalability of our approach compared to prior methods and the ability to enforce arbitrary state-conditional constraints on the support of the distribution of actions in any state.

D-CIPHER: Discovery of Closed-form Partial Differential Equations
Krzysztof Kacprzyk Zhaozhi Qian Mihaela van der Schaar



Research question: How can closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, be discovered directly from data?
Motivation: Existing methods make strong assumptions about the form of the equation and fail to discover many well-known phenomena; moreover, they resolve the equation-data mismatch by estimating derivatives, which makes them inadequate for noisy and infrequent observations.
Method: We propose D-CIPHER, which is robust to measurement artifacts and can uncover a new, very general class of differential equations; we also design a novel optimization procedure, CoLLie, to help D-CIPHER search this class efficiently.
Results: Experiments show that D-CIPHER discovers many well-known equations that are beyond the capabilities of current methods.

Closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, are one of the most important tools used by scientists to model and better understand natural phenomena. Discovering these equations directly from data is challenging because it requires modeling relationships between various derivatives that are not observed in the data (equation-data mismatch) and it involves searching across a huge space of possible equations. Current approaches make strong assumptions about the form of the equation and thus fail to discover many well-known phenomena. Moreover, many of them resolve the equation-data mismatch by estimating the derivatives, which makes them inadequate for noisy and infrequent observations. To this end, we propose D-CIPHER, which is robust to measurement artifacts and can uncover a new and very general class of differential equations. We further design a novel optimization procedure, CoLLie, to help D-CIPHER search through this class efficiently. Finally, we demonstrate empirically that it can discover many well-known equations that are beyond the capabilities of current methods.

Labeling Neural Representations with Inverse Recognition
Kirill Bykov Laura Kopf Shinichi Nakajima Marius Kloft Marina MC Höhne



Research question: Deep learning models demonstrate strong capabilities in learning complex data representations, yet the nature of these representations remains largely unknown.
Motivation: Existing global explainability methods such as Network Dissection rely on segmentation masks, lack statistical significance testing, and have high computational demands.
Method: We propose Inverse Recognition (INVERT), a scalable approach that links learned representations to human-interpretable concepts based on the ability to discriminate between concepts.
Results: We demonstrate INVERT in various scenarios, including identifying representations affected by spurious correlations and interpreting the hierarchical structure of decision-making within models.

Deep Neural Networks (DNNs) demonstrated remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for linking the learned representations to human-interpretable concepts based on the ability to differentiate between concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance, emphasizing its utility and credibility. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.
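A toy sketch of the underlying principle, with hypothetical inputs: score each candidate concept by how well the neuron's activation discriminates its presence (AUC), and label the neuron with the best-discriminated concept. INVERT builds on this with compositional concepts and significance testing.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def label_neuron(activations, concept_masks):
    """Explain a neuron by the concept whose presence its activation
    discriminates best, scored by AUC.

    activations  : (n_images,) scalar activations of one neuron
    concept_masks: dict name -> (n_images,) binary labels (hypothetical)"""
    scores = {c: roc_auc_score(m, activations) for c, m in concept_masks.items()
              if 0 < m.sum() < len(m)}          # need both classes present
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy usage with synthetic activations tied to the concept "stripes".
rng = np.random.default_rng(0)
stripes = rng.integers(0, 2, size=1000)
acts = 1.5 * stripes + rng.normal(size=1000)
concepts = {"stripes": stripes, "dots": rng.integers(0, 2, size=1000)}
print(label_neuron(acts, concepts))
```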

Towards Combinatorial Generalization for Catalysts: A Kohn-Sham Charge-Density Approach
Phil Pope David Jacobs



Research question: How can machine-learning performance in catalyst modeling be improved, particularly on unseen structures?
Motivation: Existing machine-learning approaches focus mainly on energy prediction but have not yet demonstrated significant generalization to new structures.
Method: Learn the Kohn-Sham charge density pointwise, training on a new dataset of bulk catalysts with charge densities.
Results: Experiments show the method generalizes to new structures containing combinations of elements not seen at training time, a form of combinatorial generalization. Over 80% of binary and ternary test cases converge faster than standard baselines, with an average 13% reduction in the number of iterations required to reach convergence, which may be of independent interest.

The Kohn-Sham equations underlie many important applications such as the discovery of new catalysts. Recent machine learning work on catalyst modeling has focused on prediction of the energy, but has so far not yet demonstrated significant out-of-distribution generalization. Here we investigate another approach based on the pointwise learning of the Kohn-Sham charge-density. On a new dataset of bulk catalysts with charge densities, we show density models can generalize to new structures with combinations of elements not seen at train time, a form of combinatorial generalization. We show that over 80% of binary and ternary test cases achieve faster convergence than standard baselines in Density Functional Theory, amounting to an average reduction of 13% in the number of iterations required to reach convergence, which may be of independent interest. Our results suggest that density learning is a viable alternative, trading greater inference costs for a step towards combinatorial generalization, a key property for applications.

Physics-Informed Bayesian Optimization of Variational Quantum Circuits
Kim Andrea Nicoli Christopher J. Anders Lena Funcke Tobias Hartung Karl Jansen Stefan Kuhn Klaus Robert Muller Paolo Stornati Pan Kessel Shinichi Nakajima



Research question: How can Bayesian optimization be harnessed to improve the performance of variational quantum eigensolvers (VQEs)?
Motivation: VQEs are hybrid quantum-classical protocols for approximating the ground state of a quantum Hamiltonian, but they demand substantial computational resources.
Method: Propose a new approach that incorporates important prior information about quantum circuits by deriving a VQE-kernel, and design a new Bayesian-optimization acquisition function, EMICoRe, that actively exploits the VQE-kernel's inductive bias.
Results: Numerical experiments show the approach outperforms state-of-the-art baselines and significantly improves VQE performance.

In this paper, we propose a novel and powerful method to harness Bayesian optimization for variational quantum eigensolvers (VQEs) - a hybrid quantum-classical protocol used to approximate the ground state of a quantum Hamiltonian. Specifically, we derive a *VQE-kernel* which incorporates important prior information about quantum circuits: the kernel feature map of the VQE-kernel exactly matches the known functional form of the VQE's objective function and thereby significantly reduces the posterior uncertainty. Moreover, we propose a novel acquisition function for Bayesian optimization called \emph{Expected Maximum Improvement over Confident Regions} (EMICoRe) which can actively exploit the inductive bias of the VQE-kernel by treating regions with low predictive uncertainty as indirectly "observed". As a result, observations at as few as three points in the search domain are sufficient to determine the complete objective function along an entire one-dimensional subspace of the optimization landscape. Our numerical experiments demonstrate that our approach improves over state-of-the-art baselines.

Causal Effect Identification in Uncertain Causal Networks
Sina Akbari Fateme Jamshidi Ehsan Mokhtarian Matthew James Vowels Jalal Etesami Negar Kiyavash



Research question: Given a causal structure with uncertainty, how can one determine the subgraph with the highest plausibility for which a specific causal effect is identifiable?
Motivation: When the edges of a causal graph carry uncertainties, for example representing degrees of belief from domain experts or the confidence of particular statistical tests, how can valid causal inference proceed?
Method: Formulate the question as an NP-hard combinatorial optimization problem, which we call the edge ID problem, and design efficient approximation algorithms for it.
Results: Evaluations on real-world networks and randomly generated graphs validate the effectiveness of the proposed algorithms.

Causal identification is at the core of the causal inference literature, where complete algorithms have been proposed to identify causal queries of interest. The validity of these algorithms hinges on the restrictive assumption of having access to a correctly specified causal structure. In this work, we study the setting where a probabilistic model of the causal structure is available. Specifically, the edges in a causal graph exist with uncertainties which may, for example, represent degree of belief from domain experts. Alternatively, the uncertainty about an edge may reflect the confidence of a particular statistical test. The question that naturally arises in this setting is: Given such a probabilistic graph and a specific causal effect of interest, what is the subgraph which has the highest plausibility and for which the causal effect is identifiable? We show that answering this question reduces to solving an NP-hard combinatorial optimization problem which we call the edge ID problem. We propose efficient algorithms to approximate this problem and evaluate them against both real-world networks and randomly generated graphs.

Efficient Training of Energy-Based Models Using Jarzynski Equality
Davide Carbone Mengjian Hua Simon Coste Eric Vanden-Eijnden



Research question: How can energy-based models (EBMs), generative models inspired by statistical physics with wide application in unsupervised learning, be trained effectively via the cross-entropy (CE) between the model and data distributions?
Motivation: Using the CE as a training objective is challenging because computing its gradient with respect to the model parameters requires sampling the model distribution.
Method: Use results from nonequilibrium thermodynamics based on the Jarzynski equality, together with tools from sequential Monte Carlo sampling, to perform this computation efficiently and avoid the uncontrolled approximations of the standard contrastive divergence algorithm. Concretely, introduce a modification of the unadjusted Langevin algorithm (ULA) in which each walker acquires a weight, enabling estimation of the CE gradient at any step and bypassing the sampling biases induced by ULA's slow mixing.
Results: Numerical experiments on Gaussian mixture distributions and on the MNIST and CIFAR-10 datasets show that the approach outperforms methods based on the contrastive divergence algorithm in all considered settings.

Energy-based models (EBMs) are generative models inspired by statistical physics with a wide range of applications in unsupervised learning. Their performance is well measured by the cross-entropy (CE) of the model distribution relative to the data distribution. Using the CE as the objective for training is however challenging because the computation of its gradient with respect to the model parameters requires sampling the model distribution. Here we show how results for nonequilibrium thermodynamics based on Jarzynski equality together with tools from sequential Monte-Carlo sampling can be used to perform this computation efficiently and avoid the uncontrolled approximations made using the standard contrastive divergence algorithm. Specifically, we introduce a modification of the unadjusted Langevin algorithm (ULA) in which each walker acquires a weight that enables the estimation of the gradient of the cross-entropy at any step during GD, thereby bypassing sampling biases induced by slow mixing of ULA. We illustrate these results with numerical experiments on Gaussian mixture distributions as well as the MNIST and CIFAR-10 datasets. We show that the proposed approach outperforms methods based on the contrastive divergence algorithm in all the considered situations.
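A heavily simplified one-dimensional sketch of the weighted-walker idea, under the assumption that the Jarzynski log-weights accumulate the energy change caused by each parameter update; all signatures here are hypothetical and the paper's scheme includes further refinements.

```python
import numpy as np

def jarzynski_weighted_ula(U, grad_U, update_theta, theta, x, n_steps, step=1e-2, seed=0):
    """Each walker carries a log-weight tracking the 'work' done on it by
    parameter updates, so weighted averages over walkers approximate
    expectations under the current model without re-equilibration."""
    rng = np.random.default_rng(seed)
    logw = np.zeros(len(x))
    for _ in range(n_steps):
        # Unadjusted Langevin step under the current energy U(theta, .).
        x = x - step * grad_U(theta, x) + np.sqrt(2 * step) * rng.normal(size=len(x))
        U_old = U(theta, x)
        theta = update_theta(theta, x, logw)       # one gradient-descent step
        logw += U_old - U(theta, x)                # Jarzynski weight correction
    return theta, x, logw

# Toy usage: fit the mean of a Gaussian energy U(theta, x) = (x - theta)^2 / 2.
U = lambda th, x: (x - th) ** 2 / 2
gU = lambda th, x: x - th
data_mean = 2.0
def update(th, x, logw):
    w = np.exp(logw - logw.max()); w /= w.sum()
    # CE gradient = E_data[dU/dtheta] - E_model[dU/dtheta], with dU/dtheta = -(x - theta)
    return th - 0.05 * (-(data_mean - th) + np.sum(w * (x - th)))
theta, x, logw = jarzynski_weighted_ula(U, gU, update, 0.0, np.zeros(256), 2000)
print("estimated mean:", theta)
```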

Interaction Measures, Partition Lattices and Kernel Tests for High-Order Interactions
Zhaolu Liu Robert Peach Pedro A. M. Mediano Mauricio Barahona



Research question: Models that rely solely on pairwise relationships often fail to capture the complete statistical structure of complex multivariate data in domains such as socio-economic, ecological, or biomedical systems.
Motivation: Non-trivial dependencies between groups of more than two variables play a significant role in analyzing and modeling such systems, yet extracting these high-order interactions from data remains challenging.
Method: This paper introduces a hierarchy of $d$-order ($d \geq 2$) interaction measures, increasingly inclusive of possible factorisations of the joint probability distribution, and defines non-parametric, kernel-based tests to systematically establish the statistical significance of $d$-order interactions. We also establish mathematical links with lattice theory, which elucidate the derivation of the interaction measures and their composite permutation tests, clarify the connection between simplicial complexes and kernel matrix centring, and provide a means to enhance computational efficiency.
Results: We illustrate the results numerically with validations on synthetic data and through an application to neuroimaging data.

Models that rely solely on pairwise relationships often fail to capture the complete statistical structure of the complex multivariate data found in diverse domains, such as socio-economic, ecological, or biomedical systems. Non-trivial dependencies between groups of more than two variables can play a significant role in the analysis and modelling of such systems, yet extracting such high-order interactions from data remains challenging. Here, we introduce a hierarchy of $d$-order ($d \geq 2$) interaction measures, increasingly inclusive of possible factorisations of the joint probability distribution, and define non-parametric, kernel-based tests to establish systematically the statistical significance of $d$-order interactions. We also establish mathematical links with lattice theory, which elucidate the derivation of the interaction measures and their composite permutation tests; clarify the connection of simplicial complexes with kernel matrix centring; and provide a means to enhance computational efficiency. We illustrate our results numerically with validations on synthetic data, and through an application to neuroimaging data.

Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations
Thomas Edward Yerxa Yilun Kuang Eero P Simoncelli SueYeon Chung



Research question: How can the response properties of sensory systems be measured and optimized to capture maximal information about the environment?
Motivation: Information-theoretic properties are notoriously difficult to measure in practical settings or to employ as objective functions in optimization.
Method: Characterize a novel, ecologically relevant efficiency metric, the "manifold capacity", and optimize it directly under simplifying assumptions, yielding Maximum Manifold Capacity Representations (MMCR).
Results: MMCRs are competitive with state-of-the-art results on standard self-supervised learning benchmarks and are highly competitive as models of the ventral stream on a suite of neural predictivity benchmarks.

The efficient coding hypothesis proposes that the response properties of sensory systems are adapted to the statistics of their inputs such that they capture maximal information about the environment, subject to biological constraints. While elegant, information theoretic properties are notoriously difficult to measure in practical settings or to employ as objective functions in optimization. This difficulty has necessitated that computational models designed to test the hypothesis employ several different information metrics ranging from approximations and lower bounds to proxy measures like reconstruction error. Recent theoretical advances have characterized a novel and ecologically relevant efficiency metric, the ``manifold capacity,” which is the number of object categories that may be represented in a linearly separable fashion. However, calculating manifold capacity is a computationally intensive iterative procedure that until now has precluded its use as an objective. Here we outline the simplifying assumptions that allow manifold capacity to be optimized directly, yielding Maximum Manifold Capacity Representations (MMCR). The resulting method is closely related to and inspired by advances in the field of self supervised learning (SSL), and we demonstrate that MMCRs are competitive with state of the art results on standard SSL benchmarks. Empirical analyses reveal differences between MMCRs and representations learned by other SSL frameworks, and suggest a mechanism by which manifold compression gives rise to class separability. Finally we evaluate a set of SSL methods on a suite of neural predictivity benchmarks, and find MMCRs are highly competitive as models of the ventral stream.

Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks
Steven Adriaensen Herilalaina Rakotoarison Samuel Müller Frank Hutter



Research question: Learning curve extrapolation aims to predict model performance in later epochs of training, based on the performance in earlier epochs.
Motivation: While the inherent uncertainty of learning-curve extrapolation warrants a Bayesian approach, existing methods are (i) overly restrictive and/or (ii) computationally expensive.
Method: We describe the first application of prior-data fitted networks (PFNs) in this context. A PFN is a transformer, pre-trained on data generated from a prior, that performs approximate Bayesian inference in a single forward pass. We propose LC-PFN, a PFN trained to extrapolate 10 million artificial right-censored learning curves generated with MCMC from a parametric prior proposed in prior art.
Results: Experiments show that LC-PFN approximates the posterior predictive distribution more accurately than MCMC while being over 10,000 times faster. The same LC-PFN also achieves competitive performance extrapolating real learning curves from four benchmarks (LCBench, NAS-Bench-201, Taskset, and PD1) that stem from training a wide range of model architectures (MLPs, CNNs, RNNs, and Transformers) on 53 different datasets with varying input modalities (tabular, image, text, and protein data). Finally, in the context of model selection, a simple LC-PFN-based predictive early-stopping criterion obtains 2-6x speed-ups on 45 of these datasets, at virtually no overhead.

Learning curve extrapolation aims to predict model performance in later epochs of training, based on the performance in earlier epochs. In this work, we argue that, while the inherent uncertainty in the extrapolation of learning curves warrants a Bayesian approach, existing methods are (i) overly restrictive, and/or (ii) computationally expensive. We describe the first application of prior-data fitted neural networks (PFNs) in this context. A PFN is a transformer, pre-trained on data generated from a prior, to perform approximate Bayesian inference in a single forward pass. We propose LC-PFN, a PFN trained to extrapolate 10 million artificial right-censored learning curves generated from a parametric prior proposed in prior art using MCMC. We demonstrate that LC-PFN can approximate the posterior predictive distribution more accurately than MCMC, while being over 10 000 times faster. We also show that the same LC-PFN achieves competitive performance extrapolating a total of 20 000 real learning curves from four learning curve benchmarks (LCBench, NAS-Bench-201, Taskset, and PD1) that stem from training a wide range of model architectures (MLPs, CNNs, RNNs, and Transformers) on 53 different datasets with varying input modalities (tabular, image, text, and protein data). Finally, we investigate its potential in the context of model selection and find that a simple LC-PFN based predictive early stopping criterion obtains 2 - 6x speed-ups on 45 of these datasets, at virtually no overhead.

Stabilizing the Optimization of Neural Signed Distance Functions and Finer Shape Representation
Huizong Yang Yuxin Sun Ganesh Sundaramoorthi Anthony Yezzi



Research question: How can implicit neural representations (INRs) be learned to capture the geometry and topology of shapes more accurately?
Motivation: Current network optimization can be unstable on complex shapes and converge to sub-optimal local minima, failing to capture fine shape detail.
Method: Through an analysis of existing loss functions, propose a new stabilizing regularization term and design a network structure based on quadratic layers.
Results: Experiments show the new method captures more precise shape details and more accurate topology than existing state-of-the-art techniques.

We present new insights and a novel paradigm for learning implicit neural representations (INR) of shapes. In particular, we shed light on the popular eikonal loss used for imposing a signed distance function constraint in INR. We show analytically that as the representation power of the network increases, the optimization approaches a partial differential equation (PDE) in the continuum limit that is unstable. We show that this instability can manifest in existing network optimization, leading to irregularities in the reconstructed surface and/or convergence to sub-optimal local minima, and thus fails to capture fine geometric and topological structure. We show analytically how other terms added to the loss, currently used in the literature for other purposes, can actually eliminate these instabilities. However, such terms can over-regularize the surface, preventing the representation of fine shape detail. Based on a similar PDE theory for the continuum limit, we introduce a new regularization term that still counteracts the eikonal instability but without over-regularizing. Furthermore, since stability is now guaranteed in the continuum limit, this stabilization also allows for considering new network structures that are able to represent finer shape detail. We introduce such a structure based on quadratic layers. Experiments on multiple benchmark data sets show that our new regularization and network are able to capture more precise shape details and more accurate topology than existing state-of-the-art.
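For concreteness, the eikonal term analyzed above can be written with automatic differentiation as follows; this is the standard formulation, while the paper's new stabilizing regularizer and quadratic layers are not reproduced here.

```python
import torch

def eikonal_loss(sdf, points):
    """Penalize deviation of the SDF's spatial gradient norm from 1,
    using autograd to obtain the gradient at sampled points."""
    points = points.clone().requires_grad_(True)
    values = sdf(points)
    (grads,) = torch.autograd.grad(values.sum(), points, create_graph=True)
    return ((grads.norm(dim=-1) - 1.0) ** 2).mean()

# Toy usage: an MLP "SDF" evaluated at random 3-d points.
mlp = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(),
                          torch.nn.Linear(64, 1))
x = torch.randn(1024, 3)
print(eikonal_loss(lambda p: mlp(p).squeeze(-1), x))
```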

Structured Neural Networks for Density Estimation and Causal Inference
Asic Q Chen Ruian Shi Xiang Gao Ricardo Baptista Rahul G Krishnan



Research question: How can structured information be injected into neural networks to learn functions that satisfy invariances with respect to subsets of inputs?
Motivation: When using neural networks in generative models, it is advantageous to encode the conditional independence structure of observed variables, often in the form of a Bayesian network.
Method: Propose the Structured Neural Network (StrNN), which injects structure by masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, ensuring that the desired independencies are respected.
Results: We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN; (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF); and (3) interventional and counterfactual analysis with StrAFs for causal inference. This work opens new avenues for data-efficient generative modeling and for using normalizing flows in causal effect estimation.

Injecting structure into neural networks enables learning functions that satisfy invariances with respect to subsets of inputs. For instance, when learning generative models using neural networks, it is advantageous to encode the conditional independence structure of observed variables, often in the form of Bayesian networks. We propose the Structured Neural Network (StrNN), which injects structure through masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, to ensure that the desired independencies are respected. We devise and study practical algorithms for this otherwise NP-hard design problem based on novel objectives that control the model architecture. We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN, (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF), and (3) interventional and counterfactual analysis with StrAFs for causal inference. Our work opens up new avenues for learning neural networks that enable data-efficient generative modeling and the use of normalizing flows for causal effect estimation.
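A naive sketch of the masking idea: build binary masks whose product has exactly the support of a given adjacency matrix, so a masked two-layer network can only route input j to output i when the structure allows it. The block construction below is a trivially valid factorization; the contribution here is finding better ones via binary matrix factorization.

```python
import numpy as np
import torch

def masks_from_adjacency(A, hidden_per_output=8):
    """Build masks M2 (out x hidden) and M1 (hidden x in) so that M2 @ M1 has
    exactly the support of adjacency A: each output gets its own hidden block
    that sees only that output's permitted inputs."""
    n_out, n_in = A.shape
    H = n_out * hidden_per_output
    M1, M2 = np.zeros((H, n_in)), np.zeros((n_out, H))
    for i in range(n_out):
        block = slice(i * hidden_per_output, (i + 1) * hidden_per_output)
        M1[block, :] = A[i]           # hidden block i sees only inputs allowed for output i
        M2[i, block] = 1.0            # output i reads only its own hidden block
    return torch.tensor(M2, dtype=torch.float32), torch.tensor(M1, dtype=torch.float32)

class MaskedLinear(torch.nn.Linear):
    def __init__(self, mask):
        super().__init__(mask.shape[1], mask.shape[0])
        self.register_buffer("mask", mask)
    def forward(self, x):
        return torch.nn.functional.linear(x, self.weight * self.mask, self.bias)

A = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]])   # hypothetical adjacency
M2, M1 = masks_from_adjacency(A)
net = torch.nn.Sequential(MaskedLinear(M1), torch.nn.ReLU(), MaskedLinear(M2))
print(net(torch.randn(4, 3)).shape)
```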

Scalable Transformer for PDE Surrogate Modeling
Zijie Li Dule Shu Amir Barati Farimani



Research question: How can Transformer models be applied to problems with very many grid points while addressing numerical instability and computational cost?
Motivation: Although Transformers show state-of-the-art performance across applications, even with linear-complexity attention they can be numerically unstable and computationally expensive on problems with large numbers of grid points.
Method: Propose the Factorized Transformer (FactFormer), based on an axial factorized kernel integral. Concretely, introduce a learnable projection operator that decomposes the input function into multiple sub-functions with one-dimensional domains; these sub-functions are then evaluated and used to compute an instance-based kernel with an axial factorized scheme.
Results: The proposed model simulates 2D Kolmogorov flow on a $256\times 256$ grid and 3D smoke buoyancy on a $64\times 64\times 64$ grid with good accuracy and efficiency. The factorized scheme can serve as a computationally efficient low-rank surrogate for the full attention scheme on multi-dimensional problems.

Transformer has shown state-of-the-art performance on various applications and has recently emerged as a promising tool for surrogate modeling of partial differential equations (PDEs). Despite the introduction of linear-complexity attention, applying Transformer to problems with a large number of grid points can be numerically unstable and computationally expensive. In this work, we propose Factorized Transformer (FactFormer), which is based on an axial factorized kernel integral. Concretely, we introduce a learnable projection operator that decomposes the input function into multiple sub-functions with one-dimensional domain. These sub-functions are then evaluated and used to compute the instance-based kernel with an axial factorized scheme. We showcase that the proposed model is able to simulate 2D Kolmogorov flow on a $256\times 256$ grid and 3D smoke buoyancy on a $64\times64\times64$ grid with good accuracy and efficiency. The proposed factorized scheme can serve as a computationally efficient low-rank surrogate for the full attention scheme when dealing with multi-dimensional problems.
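A sketch of the axial factorization pattern on a 2-d grid (attention only; FactFormer's learnable projection operator and kernel integral are not reproduced here): attending along rows and then columns reduces the cost from O((HW)^2) to O(HW(H + W)).

```python
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    """Axially factorized self-attention over an H x W grid of feature vectors."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        rows = x.reshape(B * H, W, C)            # attend along each row
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C).permute(0, 2, 1, 3)
        cols = x.reshape(B * W, H, C)            # attend along each column
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, C).permute(0, 2, 1, 3)

layer = AxialAttention2D(dim=32)
print(layer(torch.randn(2, 16, 16, 32)).shape)   # torch.Size([2, 16, 16, 32])
```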

Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
Muhammad Faaiz Taufiq Arnaud Doucet Rob Cornish Jean-Francois Ton



Research question: This paper addresses the high variance of existing off-policy evaluation methods in contextual bandits.
Motivation: Current methods such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators suffer from high variance when the overlap between target and behavior policies is low or the action and context spaces are large.
Method: This paper introduces a new off-policy estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ rather than the policies themselves, and demonstrates through rigorous theoretical analysis its variance-reduction advantages over conventional methods such as IPW and DR.
Results: Experiments on synthetic and real-world datasets show superior performance, confirming the practical utility of the MR estimator for off-policy evaluation in contextual bandits.

Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation. However, current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, suffer from high variance, particularly in cases of low overlap between target and behaviour policies or large action and context spaces. In this paper, we introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves. Through rigorous theoretical analysis, we demonstrate the benefits of the MR estimator compared to conventional methods like IPW and DR in terms of variance reduction. Additionally, we establish a connection between the MR estimator and the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator, proving that MR achieves lower variance among a generalized family of MIPS estimators. We further illustrate the utility of the MR estimator in causal inference settings, where it exhibits enhanced performance in estimating Average Treatment Effects (ATE). Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.
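A small simulation of the marginal-ratio idea under illustrative assumptions: rather than weighting each sample by the policy ratio (IPW), regress that ratio onto the outcome to estimate a marginal outcome weight m(y), and weight by m(y) instead. The regression model below is one simple choice, not the paper's exact estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, n_actions = 20_000, 5
x = rng.normal(size=n)
pi_b = np.full(n_actions, 1.0 / n_actions)                 # uniform behavior policy
a = rng.integers(0, n_actions, size=n)
y = x + a + rng.normal(size=n)                             # outcomes
pi_t = np.where(np.arange(n_actions) == n_actions - 1, 0.6, 0.1)  # target policy

ratio = pi_t[a] / pi_b[a]                                  # per-sample policy ratio
ipw = np.mean(ratio * y)                                   # classic IPW estimate
# MR-style: regress the ratio onto y to approximate m(y) = E[ratio | Y = y],
# the marginal outcome density ratio, then weight outcomes by m(y).
m = RandomForestRegressor(n_estimators=50, random_state=0).fit(y[:, None], ratio)
mr = np.mean(m.predict(y[:, None]) * y)
print("IPW:", ipw, "MR:", mr)   # both target E[Y] under the target policy (= 3.0 here)
```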

Strategic Distribution Shift of Interacting Agents via Coupled Gradient Flows
Lauren E Conger Franca Hoffman Eric Mazumdar Lillian J Ratliff



Research question: This paper proposes a new framework for analyzing the dynamics of distribution shift in real-world systems, capturing the feedback loop between learning algorithms and the distributions on which they are deployed.
Motivation: Prior work largely models feedback-induced distribution shift as adversarial or via an overly simplistic distribution-shift structure. In contrast, we propose a coupled partial differential equation model that captures fine-grained changes in the distribution over time, accounting for complex dynamics arising from strategic responses to algorithmic decision-making, non-local endogenous population interactions, and other exogenous sources of distribution shift.
Method: We consider two common machine-learning settings: cooperative settings with information asymmetries, and competitive settings where a learner faces strategic users. For both, when the algorithm retrains via gradient descent, we prove convergence of the retraining procedure to a steady state in both finite and infinite dimensions, obtaining explicit rates in terms of the model parameters. To do so, we derive new results on the convergence of coupled PDEs that extend what is known about multi-species systems.
Results: Empirically, we show that our approach captures well-documented forms of distribution shift, such as polarization and disparate impacts, that simpler models cannot capture.

We propose a novel framework for analyzing the dynamics of distribution shift in real-world systems that captures the feedback loop between learning algorithms and the distributions on which they are deployed. Prior work largely models feedback-induced distribution shift as adversarial or via an overly simplistic distribution-shift structure. In contrast, we propose a coupled partial differential equation model that captures fine-grained changes in the distribution over time by accounting for complex dynamics that arise due to strategic responses to algorithmic decision-making, non-local endogenous population interactions, and other exogenous sources of distribution shift. We consider two common settings in machine learning: cooperative settings with information asymmetries, and competitive settings where a learner faces strategic users. For both of these settings, when the algorithm retrains via gradient descent, we prove asymptotic convergence of the retraining procedure to a steady-state, both in finite and in infinite dimensions, obtaining explicit rates in terms of the model parameters. To do so we derive new results on the convergence of coupled PDEs that extends what is known on multi-species systems. Empirically, we show that our approach captures well-documented forms of distribution shifts like polarization and disparate impacts that simpler models cannot capture.

Estimating Koopman operators with sketching to provably learn large scale dynamical systems
Giacomo Meanti Antoine Chatalic Vladimir R Kostic Pietro Novelli massimiliano pontil Lorenzo Rosasco



Research question: How can complex dynamical systems be predicted and analyzed effectively?
Motivation: Existing non-parametric machine-learning algorithms are computationally inefficient on complex dynamical systems, and scaling them to very long trajectories is a challenge.
Method: Use random projections (sketching) to boost the efficiency of kernel-space Koopman operator estimators such as principal component regression (PCR) and reduced rank regression (RRR).
Results: Experiments show that the new "sketched" estimators retain the accuracy of PCR and RRR while being much faster.

The theory of Koopman operators makes it possible to deploy non-parametric machine learning algorithms to predict and analyze complex dynamical systems. Estimators such as principal component regression (PCR) or reduced rank regression (RRR) in kernel spaces can be shown to provably learn Koopman operators from finite empirical observations of the system's time evolution. Scaling these approaches to very long trajectories is a challenge and requires introducing suitable approximations to make computations feasible. In this paper, we boost the efficiency of different kernel-based Koopman operator estimators using random projections (sketching). We derive, implement and test the new ``sketched'' estimators with extensive experiments on synthetic and large-scale molecular dynamics datasets. Further, we establish non-asymptotic error bounds giving a sharp characterization of the trade-offs between statistical learning rates and computational efficiency. Our empirical and theoretical analysis shows that the proposed estimators provide a sound and efficient way to learn large scale dynamical systems. In particular our experiments indicate that the proposed estimators retain the same accuracy of PCR or RRR, while being much faster.
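A minimal sketch of the sketching idea, assuming random Fourier features as the kernel approximation and ridge-regularized least squares in place of the paper's PCR/RRR estimators:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

# Simulate a noisy 2-d linear system as stand-in trajectory data.
rng = np.random.default_rng(0)
A = np.array([[0.9, -0.2], [0.2, 0.9]])
X = [rng.normal(size=2)]
for _ in range(2000):
    X.append(A @ X[-1] + 0.01 * rng.normal(size=2))
X = np.array(X)

# Random-feature sketch of the kernel: phi(x) in R^D approximates the RKHS map.
phi = RBFSampler(gamma=1.0, n_components=256, random_state=0)
Z0, Z1 = phi.fit_transform(X[:-1]), phi.transform(X[1:])

# Estimate a finite-dimensional Koopman matrix K with ridge regression so that
# phi(x_{t+1}) ~= K phi(x_t).
lam = 1e-6
K = np.linalg.solve(Z0.T @ Z0 + lam * np.eye(Z0.shape[1]), Z0.T @ Z1).T
pred = Z0 @ K.T
print("one-step feature-space MSE:", np.mean((pred - Z1) ** 2))
```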

Nonparametric Boundary Geometry in Physics Informed Deep Learning
Scott Alexander Cameron Arnu Pretorius Stephen J. Roberts



Research question: How can systems of partial differential equations with boundary conditions specified by a designer on triangular meshes be solved effectively?
Motivation: Current work that uses machine learning to speed up the solution process relies heavily on fixed geometry parameterizations, which limits the reuse of trained models across different design problems.
Method: Propose a novel neural operator architecture that accepts boundary geometry in the form of triangular meshes as input and produces an approximate solution to a given PDE as output.
Results: Once trained, the model can rapidly estimate PDE solutions on new geometries, without retraining or representing the geometry with a pre-specified parameterization.

Engineering design problems frequently require solving systems of partial differential equations with boundary conditions specified on object geometries in the form of a triangular mesh. These boundary geometries are provided by a designer and are problem dependent. The efficiency of the design process greatly benefits from fast turnaround times when repeatedly solving PDEs on various geometries. However, most current work that uses machine learning to speed up the solution process relies heavily on a fixed parameterization of the geometry, which cannot be changed after training. This severely limits the possibility of reusing a trained model across a variety of design problems. In this work, we propose a novel neural operator architecture which accepts boundary geometry, in the form of triangular meshes, as input and produces an approximate solution to a given PDE as output. Once trained, the model can be used to rapidly estimate the PDE solution over a new geometry, without the need for retraining or for representing the geometry with a pre-specified parameterization.

NCDL: A Framework for Deep Learning on non-Cartesian Lattices
Joshua John Horacsek Usman Alim



Research question: How can non-Cartesian lattices be used for machine learning?
Motivation: Although non-Cartesian grids matter in numerical sciences such as simulation and scientific visualization, they are virtually unexplored in machine learning, mainly because representing data on non-Cartesian domains is difficult and standard machine-learning operations lack support for non-Cartesian data.
Method: This paper proposes a new data structure, the lattice tensor, which generalizes traditional spatio-temporal tensor operations so that standard machine-learning algorithms can be applied to non-Cartesian data. We also use non-dyadic downsampling schemes to bring Cartesian data into non-Cartesian spaces for further processing.
Results: We introduce a software library implementing the lattice tensor container (with some common machine-learning operations) and demonstrate its effectiveness. The method provides a general framework for machine learning on non-Cartesian domains, addressing the above challenges and filling a gap in the current literature.

The use of non-Cartesian grids is a niche but important topic in sub-fields of the numerical sciences such as simulation and scientific visualization. However, non-Cartesian approaches are virtually unexplored in machine learning. This is likely due to the difficulties in the representation of data on non-Cartesian domains and the lack of support for standard machine learning operations on non-Cartesian data. This paper proposes a new data structure called the lattice tensor, which generalizes traditional spatio-temporal tensor operations to non-Cartesian lattices, enabling the use of standard machine learning algorithms on non-Cartesian data. Moreover, data need not reside on a non-Cartesian structure: we use non-dyadic downsampling schemes to bring Cartesian data into a non-Cartesian space for further processing. We introduce a software library that implements the lattice tensor container (with some common machine learning operations), and demonstrate its effectiveness. Our method provides a general framework for machine learning on non-Cartesian domains, addressing the challenges mentioned above and filling a gap in the current literature.

Statistical Limits of Adaptive Linear Models: Low-Dimensional Estimation and Inference
Licong Lin Mufang Ying Suvrojit Ghosh Koulik Khamaru Cun-Hui Zhang



Research question: Statistical estimation and inference face significant challenges when data are collected adaptively.
Motivation: Even in linear models, when data are allowed to be arbitrarily adaptive, the ordinary least squares (OLS) estimator may fail to exhibit asymptotic normality for single-coordinate estimation and can have inflated error.
Method: We explore the striking difference in estimation performance between utilizing i.i.d. and adaptively collected data, investigating how the degree of adaptivity in data collection affects the estimation of a low-dimensional parameter component in high-dimensional linear models.
Results: We identify conditions on the data collection mechanism under which the estimation error of the low-dimensional parameter component matches its counterpart in the i.i.d. setting, up to a factor depending on the degree of adaptivity. We also propose a novel single-coordinate estimator obtained by solving a Two-stage Adaptive Linear Estimating equation (TALE), and establish its asymptotic normality under a weaker form of adaptivity in data collection.

Estimation and inference in statistics pose significant challenges when data are collected adaptively. Even in linear models, the Ordinary Least Squares (OLS) estimator may fail to exhibit asymptotic normality for single coordinate estimation and have inflated error. This issue is highlighted by a recent minimax lower bound, which shows that the error of estimating a single coordinate can be enlarged by a multiple of $\sqrt{d}$ when data are allowed to be arbitrarily adaptive, compared with the case when they are i.i.d. Our work explores this striking difference in estimation performance between utilizing i.i.d. and adaptive data. We investigate how the degree of adaptivity in data collection impacts the performance of estimating a low-dimensional parameter component in high-dimensional linear models. We identify conditions on the data collection mechanism under which the estimation error for a low-dimensional parameter component matches its counterpart in the i.i.d. setting, up to a factor that depends on the degree of adaptivity. We show that OLS or OLS on centered data can achieve this matching error. In addition, we propose a novel estimator for single coordinate inference via solving a Two-stage Adaptive Linear Estimating equation (TALE). Under a weaker form of adaptivity in data collection, we establish an asymptotic normality property of the proposed estimator.

A Framework for Fast and Stable Representations of Multiparameter Persistent Homology Decompositions
David Loiseaux Mathieu Carrière Andrew Blumberg



Research question: This paper addresses the problem of representing multiparameter persistent homology for integration with standard machine-learning algorithms.
Motivation: Existing approaches either ignore most of the multiparameter information to reduce to the one-parameter case, or are heuristic and potentially unstable in the face of noise.
Method: Introduce a new general representation framework that leverages recent results on decompositions of multiparameter persistent homology; the framework is rich in information, fast to compute, and encompasses previous approaches.
Results: Numerical experiments validate the stability results and algorithms, demonstrating statistical convergence, prediction accuracy, and fast running times on several real datasets.

Topological data analysis (TDA) is an area of data science that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for geometric data sets such as point clouds. One of the most important such descriptors is persistent homology, which encodes the change in shape as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to simultaneously vary multiple filtration parameters, for example feature scale and density. While the theoretical properties of single parameter persistent homology are well understood, less is known about the multiparameter case. A central question is the problem of representing multiparameter persistent homology by elements of a vector space for integration with standard machine learning algorithms. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a new general representation framework that leverages recent results on decompositions of multiparameter persistent homology. This framework is rich in information, fast to compute, and encompasses previous approaches. Moreover, we establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework an applicable and versatile tool for analyzing geometric and point cloud data. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets.

Cognitive Model Discovery via Disentangled RNNs
Kevin J Miller Maria K Eckstein Matthew Botvinick Zeb Kurth-Nelson



Research question: This paper aims to learn parsimonious cognitive models directly from data.
Motivation: The traditional process of constructing cognitive models is difficult and requires substantial inspiration and creativity, so we adopt an alternative approach that learns parsimonious cognitive models directly from data.
Method: Fit behavioral data with a recurrent neural network that is penalized for carrying excess information between timesteps, yielding sparse, interpretable representations and dynamics.
Results: When fitting synthetic behavioral data from known cognitive models, the method recovers the underlying form of those models. When fit to choice data from rats performing a bandit task, it recovers simple, interpretable models that make testable predictions about neural mechanisms.

Computational cognitive models are a fundamental tool in behavioral neuroscience. They embody in software precise hypotheses about the cognitive mechanisms underlying a particular behavior. Constructing these models is typically a difficult iterative process that requires both inspiration from the literature and the creativity of an individual researcher. Here, we adopt an alternative approach to learn parsimonious cognitive models directly from data. We fit behavior data using a recurrent neural network that is penalized for carrying excess information between timesteps, leading to sparse, interpretable representations and dynamics. When fitting synthetic behavioral data from known cognitive models, our method recovers the underlying form of those models. When fit to choice data from rats performing a bandit task, our method recovers simple and interpretable models that make testable predictions about neural mechanisms.

Flow Matching for Scalable Simulation-Based Inference
Jonas Bernhard Wildberger Maximilian Dax Simon Buchholz Stephen R Green Jakob H. Macke Bernhard Schölkopf



Research question: How can neural posterior estimation methods based on discrete normalizing flows be scaled to high-dimensional problems?
Motivation: Building on recent advances in generative modeling, we present a method for simulation-based inference (SBI) using continuous normalizing flows.
Method: Propose flow matching posterior estimation (FMPE), a technique that exploits flow matching to allow unconstrained architectures, providing enhanced flexibility for complex data modalities.
Results: Experiments show that FMPE achieves competitive performance on an established SBI benchmark and improved scalability on a challenging scientific problem: for gravitational-wave inference, FMPE outperforms methods based on comparable discrete flows, reducing training time by 30% with substantially improved accuracy.

Neural posterior estimation methods based on discrete normalizing flows have become established tools for simulation-based inference (SBI), but scaling them to high-dimensional problems can be challenging. Building on recent advances in generative modeling, we here present flow matching posterior estimation (FMPE), a technique for SBI using continuous normalizing flows. Like diffusion models, and in contrast to discrete flows, flow matching allows for unconstrained architectures, providing enhanced flexibility for complex data modalities. Flow matching, therefore, enables exact density evaluation, fast training, and seamless scalability to large architectures---making it ideal for SBI. We show that FMPE achieves competitive performance on an established SBI benchmark, and then demonstrate its improved scalability on a challenging scientific problem: for gravitational-wave inference, FMPE outperforms methods based on comparable discrete flows, reducing training time by 30\% with substantially improved accuracy. Our work underscores the potential of FMPE to enhance performance in challenging inference scenarios, thereby paving the way for more advanced applications to scientific problems.
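A compact sketch of the flow-matching objective behind this kind of posterior estimator, with a hypothetical linear-Gaussian simulator standing in for a real one: regress a conditional vector field onto straight-line path velocities.

```python
import torch
import torch.nn as nn

dim_theta, dim_x = 2, 3
net = nn.Sequential(nn.Linear(dim_theta + 1 + dim_x, 128), nn.SiLU(),
                    nn.Linear(128, dim_theta))

def flow_matching_loss(theta1, x):
    """Regress v(theta_t, t, x) onto the velocity of linear interpolation
    paths from base noise theta0 to posterior samples theta1."""
    theta0 = torch.randn_like(theta1)          # base noise sample
    t = torch.rand(theta1.shape[0], 1)         # random time in [0, 1]
    theta_t = (1 - t) * theta0 + t * theta1    # point on the interpolation path
    target = theta1 - theta0                   # path velocity
    v = net(torch.cat([theta_t, t, x], dim=-1))
    return ((v - target) ** 2).mean()

# Training on joint samples (theta, x) ~ p(theta) p(x | theta) from a toy simulator.
A = torch.randn(dim_x, dim_theta)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    theta = torch.randn(256, dim_theta)
    x = theta @ A.T + 0.1 * torch.randn(256, dim_x)
    loss = flow_matching_loss(theta, x)
    opt.zero_grad(); loss.backward(); opt.step()
print("final flow-matching loss:", float(loss))
```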

Max-Sliced Mutual Information
Dor Tsur Ziv Goldfeld Kristjan Greenewald



Research question: How can dependence between high-dimensional random variables be quantified, a question central to statistical learning and inference?
Motivation: Classical canonical correlation analysis (CCA) and mutual information (MI) have limitations on high-dimensional data: CCA only captures linear dependence, while MI is often infeasible to compute or estimate in high dimensions.
Method: Propose a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI): the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case.
Results: Experiments show that mSMI consistently outperforms competing methods on independence testing, multi-view representation learning, algorithmic fairness, and generative modeling, with little-to-no computational overhead.

Quantifying dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute/estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several tasks, encompassing independence testing, multi-view representation learning, algorithmic fairness, and generative modeling. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.
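A worked check of the Gaussian reduction mentioned above: for jointly Gaussian variables, the maximal MI over 1-d projections equals -0.5 log(1 - rho^2), with rho the top canonical correlation, computable in closed form from whitened cross-covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 50_000, 5, 4
shared = rng.normal(size=(n, 1))                        # common latent factor
X = shared @ rng.normal(size=(1, dx)) + rng.normal(size=(n, dx))
Y = shared @ rng.normal(size=(1, dy)) + rng.normal(size=(n, dy))

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx, Cyy, Cxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
# Whiten both blocks; the singular values of the whitened cross-covariance
# are the canonical correlations.
Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
rho = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)[0]
print("top canonical correlation:", rho)
print("Gaussian mSMI (nats):", -0.5 * np.log(1 - rho ** 2))
```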

Probabilistic Inference in Reinforcement Learning Done Right
Jean Tarbouriech Tor Lattimore Brendan O'Donoghue



Research question: How can the posterior probability of state-action optimality in reinforcement learning be treated in a rigorous Bayesian manner?
Motivation: Existing approximations can be arbitrarily poor, yielding algorithms that do not implement genuine statistical inference and consequently underperform on challenging problems.
Method: Derive a new variational Bayesian approximation that turns the optimality posterior into a tractable convex optimization problem.
Results: The resulting approach, called VAPOR, outperforms existing methods, and experiments with a deep RL version of VAPOR demonstrate its performance advantage.

A popular perspective in Reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR.

Transformer-based Planning for Symbolic Regression
Parshin Shojaee Kazem Meidani Amir Barati Farimani Chandan K. Reddy



Research question: Addressing symbolic regression (SR) in machine learning, i.e., finding a mathematical expression for a function based on its values.
Motivation: Although pre-trained transformer models excel at generating equations as sequences, they rely mainly on supervised pre-training objectives borrowed from text generation and overlook equation-discovery objectives such as accuracy and complexity.
Method: Proposes TPSR, a transformer-based planning strategy for symbolic regression that integrates Monte Carlo Tree Search into the transformer decoding process.
Results: Extensive experiments show the approach outperforms state-of-the-art methods, improving the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise.

Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values. Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and offering notable advantages in terms of inference time over classical Genetic Programming (GP) methods. However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search into the transformer decoding process. Unlike conventional decoding strategies, TPSR enables the integration of non-differentiable feedback, such as fitting accuracy and complexity, as external sources of knowledge into the transformer-based equation generation process. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, enhancing the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise.
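
The abstract's notion of non-differentiable feedback can be made concrete with a scalar reward that trades fitting accuracy against expression length. The functional form and constants below are illustrative assumptions of that flavor, not necessarily the exact reward used by TPSR.

```python
import numpy as np

def tpsr_style_reward(y, y_hat, complexity, lam=0.1, max_len=30):
    """Reward of the kind fed back through tree search: a fitting term
    in (0, 1] plus a bonus for short equations (form and constants are
    illustrative assumptions)."""
    nmse = np.mean((y - y_hat) ** 2) / (np.var(y) + 1e-12)
    return 1.0 / (1.0 + nmse) + lam * np.exp(-complexity / max_len)

x = np.linspace(-2, 2, 100)
y = x ** 2 + 0.1 * np.sin(5 * x)
print(tpsr_style_reward(y, x ** 2, complexity=3))        # short, accurate
print(tpsr_style_reward(y, x ** 2 + x, complexity=12))   # longer, worse fit
```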

Modulated Neural ODEs
Ilze Amanda Auzina Cagatay Yildiz Sara Magliacane Matthias Bethge Efstratios Gavves



Research question: Existing neural ordinary differential equation (NODE) methods capture variation across trajectories only through the initial state value or auto-regressive encoder updates, limiting how well they model heterogeneous nonlinear dynamics.
Motivation: To overcome this limitation, propose Modulated Neural ODEs (MoNODEs), a novel framework that separates dynamic states from underlying static factors of variation.
Method: Introduces time-invariant modulator variables learned from the data and incorporates them into four existing NODE variants.
Results: On oscillating systems, videos, and human walking trajectories with trajectory-specific modulation, MoNODE markedly improves the existing models' ability to generalize to new dynamic parameterizations and to perform far-horizon forecasting; the learned modulator variables are verified to be informative of the true unknown factors of variation.

Neural ordinary differential equations (NODEs) have been proven useful for learning non-linear dynamics of arbitrary trajectories. However, current NODE methods capture variations across trajectories only via the initial state value or by auto-regressive encoder updates. In this work, we introduce Modulated Neural ODEs (MoNODEs), a novel framework that sets apart dynamics states from underlying static factors of variation and improves the existing NODE methods. In particular, we introduce *time-invariant modulator variables* that are learned from the data. We incorporate our proposed framework into four existing NODE variants. We test MoNODE on oscillating systems, videos and human walking trajectories, where each trajectory has trajectory-specific modulation. Our framework consistently improves the existing model ability to generalize to new dynamic parameterizations and to perform far-horizon forecasting. In addition, we verify that the proposed modulator variables are informative of the true unknown factors of variation as measured by $R^2$ scores.

Pseudo-Likelihood Inference
Theo Gruner Boris Belousov Fabio Muratore Daniel Palenicek Jan Peters



Research question: How to perform simulation-based inference effectively, especially on high-dimensional tasks.
Motivation: Existing simulation-based inference methods struggle on high-dimensional tasks.
Method: Proposes Pseudo-Likelihood Inference (PLI), which brings neural approximation into Approximate Bayesian Computation, making it competitive on challenging Bayesian system-identification tasks.
Results: Evaluated on four classical SBI benchmark tasks and a highly dynamic physical system, PLI shows particular advantages on stochastic simulations and multi-modal posterior landscapes.

Simulation-Based Inference (SBI) is a common name for an emerging family of approaches that infer the model parameters when the likelihood is intractable. Existing SBI methods either approximate the likelihood, such as Approximate Bayesian Computation (ABC) or directly model the posterior, such as Sequential Neural Posterior Estimation (SNPE). While ABC is efficient on low-dimensional problems, on higher-dimensional tasks, it is generally outperformed by SNPE, which leverages function approximation. In this paper, we propose Pseudo-Likelihood Inference (PLI), a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks. By utilizing integral probability metrics, we introduce a smooth likelihood kernel with an adaptive bandwidth that is updated based on information-theoretic trust regions. Thanks to this formulation, our method (i) allows for optimizing neural posteriors via gradient descent, (ii) does not rely on summary statistics, and (iii) enables multiple observations as input. In comparison to SNPE, it leads to improved performance when more data is available. The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system, showing particular advantages on stochastic simulations and multi-modal posterior landscapes.
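
A minimal sketch of the central object, assuming MMD as the integral probability metric and a fixed bandwidth (the paper adapts the bandwidth via information-theoretic trust regions): the pseudo-likelihood is a smooth kernel of the distance between simulations at a parameter and the observed data.

```python
import numpy as np

rng = np.random.default_rng(1)

def mmd2(a, b, gamma=1.0):
    """Biased squared MMD with an RBF kernel, standing in for the paper's
    integral probability metrics."""
    k = lambda u, v: np.exp(-gamma * ((u[:, None] - v[None]) ** 2).sum(-1))
    return k(a, a).mean() + k(b, b).mean() - 2.0 * k(a, b).mean()

def pseudo_likelihood(theta, observed, simulator, h=0.5, n_sim=200):
    """Smooth likelihood kernel exp(-d^2 / 2h^2) between simulations at
    theta and the observed data (fixed bandwidth h here)."""
    return np.exp(-mmd2(simulator(theta, n_sim), observed) / (2.0 * h ** 2))

sim = lambda th, n: th + rng.standard_normal((n, 2))   # toy 2-D simulator
x_obs = sim(np.array([1.0, -1.0]), 200)
print(pseudo_likelihood(np.array([1.0, -1.0]), x_obs, sim))  # high
print(pseudo_likelihood(np.array([3.0, 3.0]), x_obs, sim))   # low
```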

Star-Shaped Denoising Diffusion Probabilistic Models
Andrey Okhotin Dmitry Molchanov Arkhipkin Sergeevich Vladimir Grigory Bartosh Viktor Ohanesian Aibek Alanov Dmitry P. Vetrov



Research question: How to define Denoising Diffusion Probabilistic Models (DDPMs) for distributions other than Gaussian or discrete ones.
Motivation: The Markovian structure of existing DDPMs makes it difficult to define them with distributions other than Gaussian or discrete.
Method: Proposes the Star-Shaped DDPM (SS-DDPM), whose star-shaped diffusion process bypasses the need to define transition probabilities or compute posteriors.
Results: In the Gaussian case SS-DDPM is equivalent to DDPM; beyond it, SS-DDPM gives a simple recipe for diffusion models with Beta, von Mises–Fisher, Dirichlet, Wishart and other distributions, which is especially useful when data lies on a constrained manifold. Evaluated in different settings, the model is competitive even on image data, where a Beta SS-DDPM achieves results comparable to a Gaussian DDPM.

Denoising Diffusion Probabilistic Models (DDPMs) provide the foundation for the recent breakthroughs in generative modeling. Their Markovian structure makes it difficult to define DDPMs with distributions other than Gaussian or discrete. In this paper, we introduce Star-Shaped DDPM (SS-DDPM). Its *star-shaped diffusion process* allows us to bypass the need to define the transition probabilities or compute posteriors. We establish duality between star-shaped and specific Markovian diffusions for the exponential family of distributions and derive efficient algorithms for training and sampling from SS-DDPMs. In the case of Gaussian distributions, SS-DDPM is equivalent to DDPM. However, SS-DDPMs provide a simple recipe for designing diffusion models with distributions such as Beta, von Mises–Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM. Our implementation is available at https://github.com/andrey-okhotin/star-shaped
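
In the Gaussian case the star-shaped forward process is easy to state: every noisy variable is drawn directly from $x_0$ with fresh noise, rather than by chaining on the previous step. A toy numpy sketch (schedule and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def star_shaped_forward(x0, alpha_bar):
    """Gaussian star-shaped forward process: every x_t is drawn directly
    from x0 with fresh, mutually independent noise, instead of chaining
    x_t on x_{t-1} as in a Markovian DDPM (marginals match the DDPM's)."""
    return [np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
            for a in alpha_bar]

x0 = rng.standard_normal(8)
xs = star_shaped_forward(x0, np.linspace(0.99, 0.01, 10))  # x_1 ... x_T
```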

Learning DAGs from Data with Few Root Causes
Panagiotis Misiakos Chris Wendler Markus Püschel



Research question: A new perspective and algorithm for learning directed acyclic graphs (DAGs) from data generated by a linear structural equation model (SEM).
Motivation: Prior work mostly treats data as generated from a dense vector of random root causes, whereas here only a few root causes are present and the data is measured with noise.
Method: Views the linear SEM as a linear transform that computes the data from an input vector of random-valued root causes associated with the nodes; under few root causes and measurement noise, proves identifiability in this new setting and shows that the true DAG is the global minimizer of the $L^0$-norm of the root-cause vector.
Results: On data satisfying the few-root-causes assumption, the method outperforms prior DAG-learning methods.

We present a novel perspective and algorithm for learning directed acyclic graphs (DAGs) from data generated by a linear structural equation model (SEM). First, we show that a linear SEM can be viewed as a linear transform that, in prior work, computes the data from a dense input vector of random valued root causes (as we will call them) associated with the nodes. Instead, we consider the case of (approximately) few root causes and also introduce noise in the measurement of the data. Intuitively, this means that the DAG data is produced by few data generating events whose effect percolates through the DAG. We prove identifiability in this new setting and show that the true DAG is the global minimizer of the $L^0$-norm of the vector of root causes. For data satisfying the few root causes assumption, we show superior performance compared to prior DAG learning methods.
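
The few-root-causes picture can be reproduced in a few lines of numpy: a sparse input vector percolates through a linear SEM, and inverting with the true DAG recovers a sparse vector, which is why the true DAG minimizes the $L^0$-norm. All sizes and sparsity levels below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# random weighted DAG over topologically ordered nodes (upper triangular)
A = np.triu((rng.random((d, d)) < 0.5) * rng.uniform(0.5, 1.5, (d, d)), k=1)

# few root causes: a sparse input vector percolating through the DAG
c = rng.standard_normal(d) * (rng.random(d) < 0.2)
x = np.linalg.solve(np.eye(d) - A.T, c)        # linear SEM: x = A^T x + c

# inverting with the true DAG recovers the sparse root causes, so the true
# DAG is the global minimizer of the L0-norm of the recovered vector
c_hat = (np.eye(d) - A.T) @ x
print(np.count_nonzero(np.abs(c_hat) > 1e-8), "nonzero root causes")
```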

Causal Component Analysis
Wendong Liang Armin Kekić Julius von Kügelgen Simon Buchholz Michel Besserve Luigi Gresele Bernhard Schölkopf



Research question: Introduces a new intermediate problem, Causal Component Analysis (CauCA), which generalizes independent component analysis (ICA) and is a special case of causal representation learning (CRL), capturing both the causal relations and the statistical dependence among latent variables.
Motivation: ICA assumes independent latents and CRL must additionally learn the causal graph; CauCA combines the two by presupposing the causal graph structure, so as to better understand and recover the relations among latent variables.
Method: Proposes a likelihood-based approach using normalizing flows to jointly learn the unmixing function and the causal mechanisms, and demonstrates its effectiveness through interventional experiments on multiple datasets in both the CauCA and nonlinear ICA settings.
Results: The method performs well on synthetic datasets and offers a stepping stone toward extensions to CRL; analyzing multiple datasets with different types of interventions also yields new identifiability results for CauCA.

Independent Component Analysis (ICA) aims to recover independent latent variables from observed mixtures thereof. Causal Representation Learning (CRL) aims instead to infer causally related (thus often statistically _dependent_) latent variables, together with the unknown graph encoding their causal relationships. We introduce an intermediate problem termed _Causal Component Analysis (CauCA)_. CauCA can be viewed as a generalization of ICA, modelling the causal dependence among the latent components, and as a special case of CRL. In contrast to CRL, it presupposes knowledge of the causal graph, focusing solely on learning the unmixing function and the causal mechanisms. Any impossibility results regarding the recovery of the ground truth in CauCA also apply for CRL, while possibility results may serve as a stepping stone for extensions to CRL. We characterize CauCA identifiability from multiple datasets generated through different types of interventions on the latent causal variables. As a corollary, this interventional perspective also leads to new identifiability results for nonlinear ICA—a special case of CauCA with an empty graph—requiring strictly fewer datasets than previous results. We introduce a likelihood-based approach using normalizing flows to estimate both the unmixing function and the causal mechanisms, and demonstrate its effectiveness through extensive synthetic experiments in the CauCA and ICA setting.

A Fast and Accurate Estimator for Large Scale Linear Model via Data Averaging
Rui Wang Yanyan Ouyang Panpan Yu Wangli Xu



Research question: Estimating a linear model when the sample size is extremely large and the data dimension may vary with the sample size.
Motivation: Many existing approaches are based on sketching and perform least squares on the sketched data, but when the dimension is large their convergence rate is often suboptimal.
Method: Proposes a new sketching method based on data averaging, which reduces the original data to a few averaged observations; these averaged observations still satisfy the linear model and are used to estimate the regression coefficients.
Results: Theoretical and numerical results show that the proposed estimator has good statistical performance as well as low computational cost.

This work is concerned with the estimation problem of linear model when the sample size is extremely large and the data dimension can vary with the sample size. In this setting, the least square estimator based on the full data is not feasible with limited computational resources. Many existing methods for this problem are based on the sketching technique which uses the sketched data to perform least square estimation. We derive fine-grained lower bounds of the conditional mean squared error for sketching methods. For sampling methods, our lower bound provides an attainable optimal convergence rate. Our result implies that when the dimension is large, there is hardly a sampling method can have a faster convergence rate than the uniform sampling method. To achieve a better statistical performance, we propose a new sketching method based on data averaging. The proposed method reduces the original data to a few averaged observations. These averaged observations still satisfy the linear model and are used to estimate the regression coefficients. The asymptotic behavior of the proposed estimation procedure is studied. Our theoretical results show that the proposed method can achieve a faster convergence rate than the optimal convergence rate for sampling methods. Theoretical and numerical results show that the proposed estimator has good statistical performance as well as low computational cost.
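
A minimal sketch of the data-averaging idea with naive equal-size groups (the paper designs the averaging scheme and analyzes its rate; this only shows that averaged observations still satisfy the linear model and can be fed to least squares):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100_000, 10, 200              # sample size, dimension, groups

beta = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)

# reduce n rows to m group averages; averaged observations still satisfy
# y_bar = X_bar beta + eps_bar, so least squares applies directly to them
groups = np.arange(n) % m
Xb = np.stack([X[groups == g].mean(axis=0) for g in range(m)])
yb = np.array([y[groups == g].mean() for g in range(m)])

beta_hat, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
print(np.linalg.norm(beta_hat - beta))
```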

Grassmann Manifold Flows for Stable Shape Generation
Ryoma Yataka Kazuki Hirashima Masashi Shiraishi



Research question: How to use the symmetry inherent in a specific manifold as an inductive bias for machine learning.
Motivation: Grassmann manifolds can handle fundamental shapes represented as shape spaces, enabling stable shape analysis.
Method: Learns distributions on the Grassmann manifold via continuous normalizing flows, with the explicit goal of generating stable shapes.
Results: The method effectively removes the influence of extraneous transformations such as rotations and inversions; experiments show it captures the data structure, generates high-quality samples, and significantly outperforms state-of-the-art methods in terms of log-likelihood or evidence lower bound.

Recently, studies on machine learning have focused on methods that use symmetry implicit in a specific manifold as an inductive bias. Grassmann manifolds provide the ability to handle fundamental shapes represented as shape spaces, enabling stable shape analysis. In this paper, we present a novel approach in which we establish the theoretical foundations for learning distributions on the Grassmann manifold via continuous normalization flows, with the explicit goal of generating stable shapes. Our approach facilitates more robust generation by effectively eliminating the influence of extraneous transformations, such as rotations and inversions, through learning and generating within a Grassmann manifold designed to accommodate the essential shape information of the object. The experimental results indicated that the proposed method could generate high-quality samples by capturing the data structure. Furthermore, the proposed method significantly outperformed state-of-the-art methods in terms of the log-likelihood or evidence lower bound. The results obtained are expected to stimulate further research in this field, leading to advances for stable shape generation and analysis.

Bayesian Optimisation of Functions on Graphs
Xingchen Wan Pierre Osselin Henry Kenlay Binxin Ru Michael A Osborne Xiaowen Dong



Research question: How to optimize functions defined on graph-structured data.
Motivation: With the increasing availability of graph-structured data, one needs to optimize functions defined on the node set of graphs; traditional graph search algorithms can be sample-inefficient and ignore function-value information, while Bayesian optimization is an efficient black-box solver that has scarcely been applied to this setting.
Method: Proposes a novel Bayesian optimization framework for functions defined on generic, large-scale, and potentially unknown graphs; by learning suitable kernels on graphs, the framework adapts to the behaviour of the target function, and a local modelling approach further ensures efficiency.
Results: Extensive experiments on synthetic and real-world graphs demonstrate the effectiveness of the proposed optimization framework.

The increasing availability of graph-structured data motivates the task of optimising over functions defined on the node set of graphs. Traditional graph search algorithms can be applied in this case, but they may be sample-inefficient and do not make use of information about the function values; on the other hand, Bayesian optimisation is a class of promising black-box solvers with superior sample efficiency, but it has scarcely been applied to such novel setups. To fill this gap, we propose a novel Bayesian optimisation framework that optimises over functions defined on generic, large-scale and potentially unknown graphs. Through the learning of suitable kernels on graphs, our framework has the advantage of adapting to the behaviour of the target function. The local modelling approach further guarantees the efficiency of our method. Extensive experiments on both synthetic and real-world graphs demonstrate the effectiveness of the proposed optimisation framework.
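
A toy version of the setup, assuming a fixed heat (diffusion) kernel on a random graph and plain GP-UCB over nodes; the paper instead learns suitable kernels and adds local modelling.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

n = 30
A = (rng.random((n, n)) < 0.15).astype(float)
A = np.triu(A, 1); A = A + A.T                 # random undirected graph
L = np.diag(A.sum(1)) - A                      # graph Laplacian
f = expm(-0.5 * L) @ rng.standard_normal(n)    # smooth target on nodes

K = expm(-0.3 * L)                             # diffusion (heat) kernel
queried = [int(rng.integers(n))]
for _ in range(10):                            # GP-UCB over the node set
    Kq = K[np.ix_(queried, queried)] + 1e-6 * np.eye(len(queried))
    mu = K[:, queried] @ np.linalg.solve(Kq, [f[i] for i in queried])
    var = np.diag(K) - np.einsum('ij,ji->i', K[:, queried],
                                 np.linalg.solve(Kq, K[queried, :]))
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0.0))
    ucb[queried] = -np.inf                     # don't re-query
    queried.append(int(ucb.argmax()))

print("best found:", max(f[i] for i in queried), "| optimum:", f.max())
```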

Sample based Explanations via Generalized Representers
Che-Ping Tsai Chih-Kuan Yeh Pradeep Kumar Ravikumar



Research question: Proposes a general class of sample-based explanations of machine learning models, termed generalized representers.
Motivation: Existing sample-based explanation methods fail to satisfy a natural set of axiomatic properties, so a new family of explanations is needed.
Method: Measures the effect of a training sample on a test prediction with two components: a global sample importance that quantifies the training point's importance to the model independently of the test sample, and a local sample importance that measures the similarity between the training sample and the test point with a kernel.
Results: Shows that generalized representers are the only class of sample-based explanations satisfying the axiomatic properties, and empirically compares different generalized representers on two image classification datasets.

We propose a general class of sample based explanations of machine learning models, which we term generalized representers. To measure the effect of a training sample on a model's test prediction, generalized representers use two components: a global sample importance that quantifies the importance of the training point to the model and is invariant to test samples, and a local sample importance that measures similarity between the training sample and the test point with a kernel. A key contribution of the paper is to show that generalized representers are the only class of sample based explanations satisfying a natural set of axiomatic properties. We discuss approaches to extract global importances given a kernel, and also natural choices of kernels given modern non-linear models. As we show, many popular existing sample based explanations could be cast as generalized representers with particular choices of kernels and approaches to extract global importances. Additionally, we conduct empirical comparisons of different generalized representers on two image classification datasets.
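
Kernel ridge regression is the simplest instance of this decomposition: its prediction is exactly a sum of global importances $\alpha_i$ times local kernel similarities. A numpy sketch (data and kernel choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

# kernel ridge regression: f(x) = sum_i alpha_i k(x_i, x), i.e. global
# importance alpha_i times local kernel similarity -- the structure that
# generalized representers axiomatize
X = rng.standard_normal((50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = np.linalg.solve(rbf(X, X) + 0.1 * np.eye(50), y)   # global importances

x_test = rng.standard_normal((1, 3))
contrib = alpha * rbf(X, x_test)[:, 0]     # per-training-point attribution
print("prediction:", contrib.sum())        # attributions sum to the prediction
```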

Bounded rationality in structured density estimation
Tianyuan Teng Li Kevin Wenliang Hang Zhang



Research question: How the human brain, constrained by finite resources, constructs an internal model from an infinite space of probability distributions so as to represent environmental uncertainty accurately.
Motivation: Understanding how humans learn and handle uncertainty is crucial for adaptive and optimal behavior across cognitive tasks.
Method: A novel structured density estimation task in which participants form and report the latent probability distribution functions underlying sequentially presented independent observations.
Results: As the number of observations increases, the reported predictive density approaches the ground truth; nevertheless, a clear inconsistency appears in structure estimation, namely large errors in the number of reported clusters. This inconsistency is invariant to the scale of the distribution and persists across stimulus modalities.

Learning to accurately represent environmental uncertainty is crucial for adaptive and optimal behaviors in various cognitive tasks. However, it remains unclear how the human brain, constrained by finite cognitive resources, constructs an internal model from an infinite space of probability distributions. In this study, we explore how these learned distributions deviate from the ground truth, resulting in observable inconsistency in a novel structured density estimation task. During each trial, human participants were asked to form and report the latent probability distribution functions underlying sequentially presented independent observations. As the number of observations increased, the reported predictive density became closer to the ground truth. Nevertheless, we observed an intriguing inconsistency in human structure estimation, specifically a large error in the number of reported clusters. Such inconsistency is invariant to the scale of the distribution and persists across stimulus modalities. We modeled uncertainty learning as approximate Bayesian inference in a nonparametric mixture prior of distributions. Human reports were best explained under resource rationality embodied in a decaying tendency towards model expansion. Our study offers insights into human cognitive processes under uncertainty and lays the groundwork for further exploration of resource-rational representations in the brain under more complex tasks.

Pairwise Causality Guided Transformers for Event Sequences
Xiao Shou Debarun Bhattacharjya Tian Gao Dharmashankar Subramanian Oktie Hassanzadeh Kristin Bennett



Research question: Although pairwise causal relations have been extensively studied in observational longitudinal analyses across many disciplines, incorporating knowledge of causal pairs into deep learning models for temporal event sequences remains largely unexplored.
Motivation: To improve transformer-based models for multivariate event sequences by injecting pairwise qualitative causal knowledge such as "event Z amplifies future occurrences of event Y".
Method: Establishes a new framework for causal inference in temporal event sequences using a transformer architecture, provides a theoretical justification for the approach, and shows how to obtain unbiased estimates of the proposed measure.
Results: Experiments show the approach outperforms several state-of-the-art models in prediction accuracy by effectively leveraging knowledge about causal pairs; a further application generates societal event sequences with a large language model and shows how a causal knowledge graph can help with event prediction in such sequences. Overall, the framework offers a practical means of improving transformer-based models on multivariate event sequences by explicitly exploiting pairwise causal information.

Although pairwise causal relations have been extensively studied in observational longitudinal analyses across many disciplines, incorporating knowledge of causal pairs into deep learning models for temporal event sequences remains largely unexplored. In this paper, we propose a novel approach for enhancing the performance of transformer-based models in multivariate event sequences by injecting pairwise qualitative causal knowledge such as `event Z amplifies future occurrences of event Y'. We establish a new framework for causal inference in temporal event sequences using a transformer architecture, providing a theoretical justification for our approach, and show how to obtain unbiased estimates of the proposed measure. Experimental results demonstrate that our approach outperforms several state-of-the-art models in terms of prediction accuracy by effectively leveraging knowledge about causal pairs. We also consider a unique application where we extract knowledge around sequences of societal events by generating them from a large language model, and demonstrate how a causal knowledge graph can help with event prediction in such sequences. Overall, our framework offers a practical means of improving the performance of transformer-based models in multivariate event sequences by explicitly exploiting pairwise causal information.

Riemannian Laplace approximations for Bayesian neural networks
Federico Bergamin Pablo Moreno-Muñoz Søren Hauberg Georgios Arvanitidis



Research question: Bayesian neural networks typically approximate the weight posterior with a Gaussian, but the actual posterior is often highly non-Gaussian, degrading performance.
Motivation: To propose a simple parametric approximate posterior that adapts to the shape of the true posterior through a Riemannian metric determined by the log-posterior gradient.
Method: Develops a Riemannian Laplace approximation whose samples naturally fall into weight regions with low negative log-posterior; the required system of ordinary differential equations can be solved efficiently by exploiting the structure of the Riemannian metric and automatic differentiation.
Results: Experiments show the method consistently improves over the conventional Laplace approximation across tasks; unlike the conventional approximation, it is not overly sensitive to the choice of prior, alleviating a practical pitfall of current approaches.

Bayesian neural networks often approximate the weight-posterior with a Gaussian distribution. However, practical posteriors are often, even locally, highly non-Gaussian, and empirical performance deteriorates. We propose a simple parametric approximate posterior that adapts to the shape of the true posterior through a Riemannian metric that is determined by the log-posterior gradient. We develop a Riemannian Laplace approximation where samples naturally fall into weight-regions with low negative log-posterior. We show that these samples can be drawn by solving a system of ordinary differential equations, which can be done efficiently by leveraging the structure of the Riemannian metric and automatic differentiation. Empirically, we demonstrate that our approach consistently improves over the conventional Laplace approximation across tasks. We further show that, unlike the conventional Laplace approximation, our method is not overly sensitive to the choice of prior, which alleviates a practical pitfall of current approaches.

Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation
Richard Gao Michael Deistler Jakob H. Macke



Research question: How to perform robust and simulation-efficient inference for scientific simulators.
Motivation: Targeting the standard Bayesian posterior may be overly restrictive when the simulator model is misspecified.
Method: Proposes amortized cost estimation (ACE) for generalized Bayesian inference (GBI): a neural network is trained to approximate the cost function, and Markov chain Monte Carlo (MCMC) is then used to infer GBI posteriors without running additional simulations.
Results: ACE predicts costs accurately and yields predictive simulations closer to synthetic observations than other SBI methods, especially for misspecified simulators; applied to inferring Hodgkin-Huxley model parameters, it identifies better data-matching parameters while being substantially more simulation-efficient than a standard SBI method.

Simulation-based inference (SBI) enables amortized Bayesian inference for simulators with implicit likelihoods. But when we are primarily interested in the quality of predictive simulations, or when the model cannot exactly reproduce the observed data (i.e., is misspecified), targeting the Bayesian posterior may be overly restrictive. Generalized Bayesian Inference (GBI) aims to robustify inference for (misspecified) simulator models, replacing the likelihood-function with a cost function that evaluates the goodness of parameters relative to data. However, GBI methods generally require running multiple simulations to estimate the cost function at each parameter value during inference, making the approach computationally infeasible for even moderately complex simulators. Here, we propose amortized cost estimation (ACE) for GBI to address this challenge: We train a neural network to approximate the cost function, which we define as the expected distance between simulations produced by a parameter and observed data. The trained network can then be used with MCMC to infer GBI posteriors for any observation without running additional simulations. We show that, on several benchmark tasks, ACE accurately predicts cost and provides predictive simulations that are closer to synthetic observations than other SBI methods, especially for misspecified simulators. Finally, we apply ACE to infer parameters of the Hodgkin-Huxley model given real intracellular recordings from the Allen Cell Types Database. ACE identifies better data-matching parameters while being an order of magnitude more simulation-efficient than a standard SBI method. In summary, ACE combines the strengths of SBI methods and GBI to perform robust and simulation-amortized inference for scientific simulators.

Variational Annealing on Graphs for Combinatorial Optimization
Sebastian Sanokowski Wilhelm Franz Berghammer Sepp Hochreiter Sebastian Lehner



Research question: Overcoming the performance limitations of existing unsupervised learning methods for combinatorial optimization.
Motivation: Current unsupervised methods assume statistically independent solution variables, which limits performance on difficult problem instances.
Method: Introduces Subgraph Tokenization, in which the configuration of a set of solution variables is represented by a single token, alleviating the long sequential sampling inherent to autoregressive methods without sacrificing expressivity; additionally proposes an annealed entropy regularization.
Results: Experiments show superior performance on many popular combinatorial optimization problems, with efficient and stable learning.

Several recent unsupervised learning methods use probabilistic approaches to solve combinatorial optimization (CO) problems based on the assumption of statistically independent solution variables. We demonstrate that this assumption imposes performance limitations in particular on difficult problem instances. Our results corroborate that an autoregressive approach which captures statistical dependencies among solution variables yields superior performance on many popular CO problems. We introduce Subgraph Tokenization in which the configuration of a set of solution variables is represented by a single token. This tokenization technique alleviates the drawback of the long sequential sampling procedure which is inherent to autoregressive methods without sacrificing expressivity. Importantly, we theoretically motivate an annealed entropy regularization and show empirically that it is essential for efficient and stable learning.

The Graph Pencil Method: Mapping Subgraph Densities to Stochastic Block Models
Lee M. Gunderson Gecia Bravo-Hermsdorff Peter Orbanz



Research question: How to map subgraph densities exactly to the parameters of a stochastic block model (SBM).
Motivation: To determine, from a finite set of subgraph densities, the corresponding stochastic block model.
Method: A method that determines an exact map from the densities of a finite set of subgraphs (stars and bistars) to the parameters of a matching stochastic block model.
Results: The method allows subgraph densities to be used directly for inference, with negligible computational overhead.

In this work, we describe a method that determines an exact map from a finite set of subgraph densities to the parameters of a stochastic block model (SBM) matching these densities. Given a number K of blocks, the subgraph densities of a finite number of stars and bistars uniquely determine a single element of the class of all degree-separated stochastic block models with K blocks. Our method makes it possible to translate estimates of these subgraph densities into model parameters, and hence to use subgraph densities directly for inference. The computational overhead is negligible; computing the translation map is polynomial in K, but independent of the graph size once the subgraph densities are given.

Canonical normalizing flows for manifold learning
Kyriakos Flouris Ender Konukoglu



Research question: How to describe data with a low-dimensional manifold so as to obtain an efficient representation.
Motivation: Existing manifold-learning flows often learn an entangled intrinsic basis with degenerate information stored in each dimension, rather than an efficient representation of the data.
Method: Proposes a canonical manifold learning flow that learns an orthogonal and/or sparse basis by minimizing the $\ell_1$-norm of the off-diagonal manifold metric elements, so that the transformation matrix has few prominent, non-degenerate basis functions.
Results: In most experiments, canonical manifold flows use the latent space more efficiently than other manifold-learning methods, automatically generating fewer prominent and distinct dimensions to represent the data, yielding better approximations of target distributions and lower FID scores.

Manifold learning flows are a class of generative modelling techniques that assume a low-dimensional manifold description of the data. The embedding of such a manifold into the high-dimensional space of the data is achieved via learnable invertible transformations. Therefore, once the manifold is properly aligned via a reconstruction loss, the probability density is tractable on the manifold and maximum likelihood can be used to optimize the network parameters. Naturally, the lower-dimensional representation of the data requires an injective-mapping. Recent approaches were able to enforce that the density aligns with the modelled manifold, while efficiently calculating the density volume-change term when embedding to the higher-dimensional space. However, unless the injective-mapping is analytically predefined, the learned manifold is not necessarily an \emph{efficient representation} of the data. Namely, the latent dimensions of such models frequently learn an entangled intrinsic basis, with degenerate information being stored in each dimension. Alternatively, if a locally orthogonal and/or sparse basis is to be learned, here coined canonical intrinsic basis, it can serve in learning a more compact latent space representation. Toward this end, we propose a canonical manifold learning flow method, where a novel optimization objective enforces the transformation matrix to have few prominent and non-degenerate basis functions. We demonstrate that by minimizing the off-diagonal manifold metric elements $\ell_1$-norm, we can achieve such a basis, which is simultaneously sparse and/or orthogonal. Canonical manifold flow yields a more efficient use of the latent space, automatically generating fewer prominent and distinct dimensions to represent data, and consequently a better approximation of target distributions than other manifold flow methods in most experiments we conducted, resulting in lower FID scores.
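
The optimization objective can be sketched directly: compute the decoder Jacobian, form the induced metric $G = J^\top J$, and penalize the $\ell_1$-norm of its off-diagonal entries. A minimal torch version (the decoder is a stand-in, not the paper's architecture):

```python
import torch

def canonical_penalty(decoder, z):
    """l1 norm of the off-diagonal entries of the induced manifold metric
    G = J^T J (J = decoder Jacobian at z); pushing these to zero favours a
    locally orthogonal, non-degenerate ("canonical") latent basis."""
    # pass create_graph=True here if the penalty is part of a training loss
    J = torch.autograd.functional.jacobian(decoder, z)   # (D_out, d_latent)
    G = J.T @ J
    return (G - torch.diag(torch.diag(G))).abs().sum()

decoder = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(),
                              torch.nn.Linear(16, 5))
print(canonical_penalty(decoder, torch.randn(2)))
```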

SmoothHess: ReLU Network Feature Interactions via Stein's Lemma
Max Torop Aria Masoomi Davin Hill Kivanc Kose Stratis Ioannidis Jennifer Dy



Research question: How to model feature interactions in neural networks for interpretability, particularly in ReLU networks.
Motivation: Existing methods model feature interactions via the network Hessian, which is a challenge for ReLU networks whose Hessian is zero almost everywhere.
Method: Proposes SmoothHess, which estimates second-order interactions through Stein's Lemma: the Hessian of the network convolved with a Gaussian is estimated by an efficient sampling algorithm requiring only network gradient calls.
Results: Validates SmoothHess's superior ability to capture interactions on benchmark datasets and a real-world medical spirometry dataset.

Several recent methods for interpretability model feature interactions by looking at the Hessian of a neural network. This poses a challenge for ReLU networks, which are piecewise-linear and thus have a zero Hessian almost everywhere. We propose SmoothHess, a method of estimating second-order interactions through Stein's Lemma. In particular, we estimate the Hessian of the network convolved with a Gaussian through an efficient sampling algorithm, requiring only network gradient calls. SmoothHess is applied post-hoc, requires no modifications to the ReLU network architecture, and the extent of smoothing can be controlled explicitly. We provide a non-asymptotic bound on the sample complexity of our estimation procedure. We validate the superior ability of SmoothHess to capture interactions on benchmark datasets and a real-world medical spirometry dataset.
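
The estimator itself is compact: by Stein's Lemma, the Hessian of the Gaussian-smoothed function is $\mathbb{E}[\nabla f(x+\delta)\,\delta^\top]/\sigma^2$, which needs only gradient calls. A numpy sketch on a toy piecewise-linear function whose raw Hessian is zero almost everywhere (the function and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy piecewise-linear f(x) = relu(x1) * relu(x2); its Hessian is zero
# almost everywhere, yet the smoothed interaction is nonzero
def grad_f(X):
    u, v = X[:, 0], X[:, 1]
    g = np.zeros_like(X)
    g[:, 0] = (u > 0) * np.maximum(v, 0.0)
    g[:, 1] = np.maximum(u, 0.0) * (v > 0)
    return g

def smooth_hess(x, sigma=0.5, n=100_000):
    """Stein's lemma estimator of the Hessian of the Gaussian-smoothed f:
    Hess f_sigma(x) = E[ grad f(x + delta) delta^T ] / sigma^2,
    using gradient calls only."""
    delta = sigma * rng.standard_normal((n, 2))
    G = grad_f(x + delta)
    H = (G[:, :, None] * delta[:, None, :]).mean(0) / sigma ** 2
    return 0.5 * (H + H.T)                     # symmetrize

print(smooth_hess(np.array([1.0, 1.0])))
# off-diagonal entries come out near 1: the smoothed mixed partial
# d^2 f / dx1 dx2 at (1, 1), even though the raw Hessian is 0 a.e.
```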

Effective Bayesian Heteroscedastic Regression with Deep Neural Networks
Alexander Immer Emanuele Palumbo Alexander Marx Julia E Vogt



Research question: How to flexibly quantify both irreducible aleatoric and model-dependent epistemic uncertainty in complex regression problems.
Motivation: Although deep neural networks can in principle provide this flexibility and learn heteroscedastic aleatoric noise through nonlinear functions, recent work shows that maximizing a log-likelihood objective parameterized by mean and variance can compromise the mean fit, because the gradients are scaled by the predictive variance.
Method: Proposes using the natural parametrization of the Gaussian, shown to be more stable for heteroscedastic regression based on nonlinear feature maps and Gaussian processes; further emphasizes principled regularization of network parameters and predictions, via an efficient Laplace approximation for heteroscedastic networks that enables automatic regularization through empirical Bayes and provides epistemic uncertainty, improving generalization.
Results: On a range of regression problems, including a new heteroscedastic image-regression benchmark, the method is scalable, improves over previous heteroscedastic regression approaches, and provides epistemic uncertainty without requiring hyperparameter tuning.

Flexibly quantifying both irreducible aleatoric and model-dependent epistemic uncertainties plays an important role for complex regression problems. While deep neural networks in principle can provide this flexibility and learn heteroscedastic aleatoric uncertainties through non-linear functions, recent works highlight that maximizing the log likelihood objective parameterized by mean and variance can lead to compromised mean fits since the gradients are scaled by the predictive variance, and propose adjustments in line with this premise. We instead propose to use the natural parametrization of the Gaussian, which has been shown to be more stable for heteroscedastic regression based on non-linear feature maps and Gaussian processes. Further, we emphasize the significance of principled regularization of the network parameters and prediction. We therefore propose an efficient Laplace approximation for heteroscedastic neural networks that allows automatic regularization through empirical Bayes and provides epistemic uncertainties, both of which improve generalization. We showcase on a range of regression problems—including a new heteroscedastic image regression benchmark—that our methods are scalable, improve over previous approaches for heteroscedastic regression, and provide epistemic uncertainty without requiring hyperparameter tuning.
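
For reference, the natural parametrization is $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$, with log-partition $A(\eta) = -\eta_1^2/(4\eta_2) + \tfrac{1}{2}\log(\pi/(-\eta_2))$. A minimal torch sketch of the resulting negative log-likelihood (the head producing the parameters is an illustrative assumption):

```python
import torch

def natural_gaussian_nll(eta1, eta2, y):
    """NLL of N(mu, sigma^2) in natural parameters eta1 = mu / sigma^2,
    eta2 = -1 / (2 sigma^2) < 0, with log-partition
    A(eta) = -eta1^2 / (4 eta2) + 0.5 * log(pi / -eta2)."""
    A = -eta1 ** 2 / (4 * eta2) + 0.5 * torch.log(torch.pi / (-eta2))
    return (A - eta1 * y - eta2 * y ** 2).mean()

# a head emitting valid natural parameters (eta2 < 0 via -softplus)
raw = torch.randn(32, 2, requires_grad=True)
eta1 = raw[:, 0]
eta2 = -torch.nn.functional.softplus(raw[:, 1]) - 1e-4
loss = natural_gaussian_nll(eta1, eta2, torch.randn(32))
loss.backward()
```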

Individualized Dosing Dynamics via Neural Eigen Decomposition
Stav Belogolovsky Ido Greenberg Danny Eytan Shie Mannor



Research question: How to use neural differential equations for medical dosing models that are robust to noise and individualized per patient.
Motivation: Traditional dosing models are sensitive to noise and struggle to adapt to changing treatment policies.
Method: Proposes the Neural Eigen Stochastic Differential Equation algorithm (NESDE): individualized modeling via a hypernetwork over patient-level parameters, generalization to new treatment policies via decoupled control, expressiveness tunable to the noise level via piecewise linearity, and fast, continuous, closed-form prediction via a spectral representation.
Results: Demonstrates NESDE's robustness on both synthetic and real medical problems, and uses the learned dynamics to publish simulated medical gym environments.

Dosing models often use differential equations to model biological dynamics. Neural differential equations in particular can learn to predict the derivative of a process, which permits predictions at irregular points of time. However, this temporal flexibility often comes with a high sensitivity to noise, whereas medical problems often present high noise and limited data. Moreover, medical dosing models must generalize reliably over individual patients and changing treatment policies. To address these challenges, we introduce the Neural Eigen Stochastic Differential Equation algorithm (NESDE). NESDE provides individualized modeling (using a hypernetwork over patient-level parameters); generalization to new treatment policies (using decoupled control); tunable expressiveness according to the noise level (using piecewise linearity); and fast, continuous, closed-form prediction (using spectral representation). We demonstrate the robustness of NESDE in both synthetic and real medical problems, and use the learned dynamics to publish simulated medical gym environments.
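
The spectral representation is what buys closed-form continuous-time prediction: for linear dynamics $dz = Az\,dt$, the mean is $z(t) = V e^{\Lambda t} V^{-1} z(0)$, evaluable at arbitrary irregular times without numerical integration. A numpy sketch with a fixed $A$ (NESDE instead predicts the eigendecomposition per patient via a hypernetwork):

```python
import numpy as np

A = np.array([[-0.5,  1.0],
              [-1.0, -0.5]])                    # damped oscillation
lam, V = np.linalg.eig(A)                       # spectral representation
Vinv = np.linalg.inv(V)
z0 = np.array([1.0, 0.0])

def mean_at(t):
    """Closed-form mean of the linear dynamics at continuous time t."""
    return (V @ np.diag(np.exp(lam * t)) @ Vinv @ z0).real

for t in [0.1, 0.7, 3.0]:                       # irregular query times
    print(t, mean_at(t))
```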

Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling
Zhenyu Zhu Francesco Locatello Volkan Cevher



Research question: Providing statistical sample-complexity bounds for score matching and its applications in causal discovery.
Motivation: Score-matching-based causal discovery presupposes a sufficiently good estimate of the score function, so the statistical cost of that estimation needs to be quantified.
Method: Shows that accurate score estimation is achievable by training a standard deep ReLU neural network with stochastic gradient descent, and establishes bounds on the error rate of recovering causal relationships with the score-matching-based causal discovery method of Rolland et al. [2022], assuming a sufficiently good estimate of the score function.
Results: Additionally analyzes the upper bound of score-matching estimation within score-based generative modeling, which has been applied to causal discovery but is also of independent interest within the domain of generative models.

This paper provides statistical sample complexity bounds for score-matching and its applications in causal discovery. We demonstrate that accurate estimation of the score function is achievable by training a standard deep ReLU neural network using stochastic gradient descent. We establish bounds on the error rate of recovering causal relationships using the score-matching-based causal discovery method of Rolland et al. [2022], assuming a sufficiently good estimation of the score function. Finally, we analyze the upper bound of score-matching estimation within the score-based generative modeling, which has been applied for causal discovery but is also of independent interest within the domain of generative models.

Nonparametric Identifiability of Causal Representations from Unknown Interventions
Julius von Kügelgen Michel Besserve Wendong Liang Luigi Gresele Armin Kekić Elias Bareinboim David Blei Bernhard Schölkopf



Research question: Inferring latent causal variables and their causal relations from high-dimensional functions ("mixtures") of the variables, without partial knowledge of the generative process.
Motivation: Prior work relies on weak supervision such as counterfactual pre- and post-intervention views or temporal structure, places restrictive assumptions such as linearity on the mixing function or latent causal model, or requires partial knowledge of the generative process such as the causal graph or intervention targets.
Method: Considers the general setting in which both the causal model and the mixing function are nonparametric; the learning signal takes the form of multiple datasets, or environments, arising from unknown interventions in the latent causal model.
Results: Proves that the observational distribution and one perfect intervention per node suffice for identifiability, subject to a genericity condition; for an arbitrary number of variables, at least one pair of distinct perfect interventional domains per node guarantees identifiability. Moreover, the strengths of causal influences among the latent variables are preserved by all equivalent solutions, making the inferred representation suitable for drawing causal conclusions from new data.

We study causal representation learning, the task of inferring latent causal variables and their causal relations from high-dimensional functions (“mixtures”) of the variables. Prior work relies on weak supervision, in the form of counterfactual pre- and post-intervention views or temporal structure; places restrictive assumptions, such as linearity, on the mixing function or latent causal model; or requires partial knowledge of the generative process, such as the causal graph or intervention targets. We instead consider the general setting in which both the causal model and the mixing function are nonparametric. The learning signal takes the form of multiple datasets, or environments, arising from unknown interventions in the underlying causal model. Our goal is to identify both the ground truth latents and their causal graph up to a set of ambiguities which we show to be irresolvable from interventional data. We study the fundamental setting of two causal variables and prove that the observational distribution and one perfect intervention per node suffice for identifiability, subject to a genericity condition. This condition rules out spurious solutions that involve fine-tuning of the intervened and observational distributions, mirroring similar conditions for nonlinear cause-effect inference. For an arbitrary number of variables, we show that at least one pair of distinct perfect interventional domains per node guarantees identifiability. Further, we demonstrate that the strengths of causal influences among the latent variables are preserved by all equivalent solutions, rendering the inferred representation appropriate for drawing causal conclusions from new data. Our study provides the first identifiability results for the general nonparametric setting with unknown interventions, and elucidates what is possible and impossible for causal representation learning without more direct supervision.

FAST: a Fused and Accurate Shrinkage Tree for Heterogeneous Treatment Effects Estimation
Jia Gu Caizhi Tang Han Yan Qing Cui Longfei Li JUN ZHOU



Research question: Proposes a new strategy for heterogeneous treatment effect estimation, called the Fused and Accurate Shrinkage Tree (FAST).
Motivation: Inspired by shrinkage estimation in statistics, develops an optimal weighting scheme and a corresponding estimator that balances the unbiased estimator based on trial data against the potentially biased estimator based on observational data.
Method: Combined with tree-based techniques, introduces a new split criterion that uses both trial and observational data to estimate the treatment effect more accurately.
Results: Simulations and real-data analysis show that FAST and its ensemble version outperform existing methods in finite-sample performance.

This paper proposes a novel strategy for estimating the heterogeneous treatment effect called the Fused and Accurate Shrinkage Tree ($\mathrm{FAST}$). Our approach utilizes both trial and observational data to improve the accuracy and robustness of the estimator. Inspired by the concept of shrinkage estimation in statistics, we develop an optimal weighting scheme and a corresponding estimator that balances the unbiased estimator based on the trial data with the potentially biased estimator based on the observational data. Specifically, combined with tree-based techniques, we introduce a new split criterion that utilizes both trial data and observational data to more accurately estimate the treatment effect. Furthermore, we confirm the consistency of our proposed tree-based estimator and demonstrate the effectiveness of our criterion in reducing prediction error through theoretical analysis. The advantageous finite sample performance of the $\mathrm{FAST}$ and its ensemble version over existing methods is demonstrated via simulations and real data analysis.
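
The scalar version of the shrinkage idea behind FAST is classical: given an unbiased but noisy trial estimator (variance $v_1$) and a precise but biased observational one (variance $v_2$, bias $b$), the MSE-optimal convex combination puts weight $w = (v_2+b^2)/(v_1+v_2+b^2)$ on the trial arm. A numpy illustration with assumed variances and bias (FAST embeds this weighting into tree splits):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 1.0                                       # true treatment effect

# unbiased but noisy trial estimates; precise but biased observational ones
tau_trial = tau + 1.0 * rng.standard_normal(10_000)          # variance 1.0
tau_obs = tau + 0.4 + 0.2 * rng.standard_normal(10_000)      # bias 0.4

v1, v2, b2 = 1.0, 0.04, 0.4 ** 2
w = (v2 + b2) / (v1 + v2 + b2)                  # MSE-optimal trial weight
fused = w * tau_trial + (1 - w) * tau_obs

mse = lambda est: np.mean((est - tau) ** 2)
print(mse(tau_trial), mse(tau_obs), mse(fused))  # fused has the lowest MSE
```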

Advancing Bayesian Optimization via Learning Correlated Latent Space
Seunghun Lee Jaewon Chu Sihyeon Kim Juyeon Ko Hyunwoo J. Kim



Research question: Effective optimization of black-box functions with limited function evaluations.
Motivation: Existing latent-space optimization methods for discrete or structured data suffer from an inherent gap between the latent space and the input space, leading to potentially suboptimal solutions.
Method: Proposes Correlated latent space Bayesian Optimization (CoBO), which learns a latent space in which distances correlate strongly with differences in the objective function, so as to reduce the inherent gap.
Results: Performs strongly on optimization tasks over discrete data such as molecule design and arithmetic expression fitting, achieving high performance within a small budget.

Bayesian optimization is a powerful method for optimizing black-box functions with limited function evaluations. Recent works have shown that optimization in a latent space through deep generative models such as variational autoencoders leads to effective and efficient Bayesian optimization for structured or discrete data. However, as the optimization does not take place in the input space, it leads to an inherent gap that results in potentially suboptimal solutions. To alleviate the discrepancy, we propose Correlated latent space Bayesian Optimization (CoBO), which focuses on learning correlated latent spaces characterized by a strong correlation between the distances in the latent space and the distances within the objective function. Specifically, our method introduces Lipschitz regularization, loss weighting, and trust region recoordination to minimize the inherent gap around the promising areas. We demonstrate the effectiveness of our approach on several optimization tasks in discrete data, such as molecule design and arithmetic expression fitting, and achieve high performance within a small budget.

Operator Learning with Neural Fields: Tackling PDEs on General Geometries
Louis Serrano Lise Le Boudec Armand Kassaï Koupaï Thomas X Wang Yuan Yin Jean-Noël Vittaut patrick gallinari



Research question: Machine learning approaches for solving partial differential equations require learning mappings between function spaces.
Motivation: Convolutional or graph neural networks are constrained to discretized functions, while neural operators offer a promising milestone toward mapping functions directly; however, they still face challenges with domain geometry and typically rely on some form of discretization.
Method: To alleviate these limitations, proposes CORAL, a new method that leverages coordinate-based networks to solve PDEs on general geometries; CORAL is designed to remove constraints on the input mesh, making it applicable to any spatial sampling and geometry.
Results: CORAL performs robustly across multiple resolutions and in both convex and non-convex domains, surpassing or matching state-of-the-art models.

Machine learning approaches for solving partial differential equations require learning mappings between function spaces. While convolutional or graph neural networks are constrained to discretized functions, neural operators present a promising milestone toward mapping functions directly. Despite impressive results, they still face challenges with respect to the domain geometry and typically rely on some form of discretization. In order to alleviate such limitations, we present CORAL, a new method that leverages coordinate-based networks for solving PDEs on general geometries. CORAL is designed to remove constraints on the input mesh, making it applicable to any spatial sampling and geometry. Its ability extends to diverse problem domains, including PDE solving, spatio-temporal forecasting, and inverse problems like geometric design. CORAL demonstrates robust performance across multiple resolutions and performs well in both convex and non-convex domains, surpassing or performing on par with state-of-the-art models.

Fast Bellman Updates for Wasserstein Distributionally Robust MDPs
Zhuodong Yu Ling Dai Shaohang Xu Siyang Gao Chin Pang Ho



Research question: Markov decision processes often suffer from sensitivity under model ambiguity; how can this be addressed efficiently?
Motivation: Robust MDPs have emerged in recent years as an effective framework to overcome this challenge, and distributionally robust MDPs alleviate their conservativeness by incorporating distributional information about the uncertain model parameters.
Method: Proposes a computationally efficient solution framework for distributionally robust MDPs with Wasserstein ambiguity sets, exploiting the specific problem structure to decompose the optimization problems associated with distributionally robust Bellman updates into smaller subproblems that can be solved efficiently.
Results: Numerical experiments show that the proposed algorithms outperform other state-of-the-art solution methods.

Markov decision processes (MDPs) often suffer from the sensitivity issue under model ambiguity. In recent years, robust MDPs have emerged as an effective framework to overcome this challenge. Distributionally robust MDPs extend the robust MDP framework by incorporating distributional information of the uncertain model parameters to alleviate the conservative nature of robust MDPs. This paper proposes a computationally efficient solution framework for solving distributionally robust MDPs with Wasserstein ambiguity sets. By exploiting the specific problem structure, the proposed framework decomposes the optimization problems associated with distributionally robust Bellman updates into smaller subproblems, which can be solved efficiently. The overall complexity of the proposed algorithm is quasi-linear in both the numbers of states and actions when the distance metric of the Wasserstein distance is chosen to be $L_1$, $L_2$, or $L_{\infty}$ norm, and so the computational cost of distributional robustness is substantially reduced. Our numerical experiments demonstrate that the proposed algorithms outperform other state-of-the-art solution methods.

ContinuAR: Continuous Autoregression For Infinite-Fidelity Fusion
WEI W. XING Yuxin Wang Zheng Xing



Research question: Multi-fidelity fusion is an important surrogate technique that provides insights into expensive computer simulations and effectively improves decision-making, but it lacks a systematic framework to exploit the fidelity indicator, handle high-dimensional and arbitrary data structures, and scale well to infinite-fidelity problems.
Motivation: Despite rapid progress, multi-fidelity fusion techniques still face these challenges.
Method: First generalizes the popular autoregression (AR) into a novel linear fidelity differential equation (FiDE), paving the way to tractable infinite-fidelity fusion; generalizes FiDE to high-dimensional systems, providing a unifying framework that bridges many multi- and single-fidelity GP-based models; then proposes ContinuAR, a rank-1 approximate solution to FiDE that is tractable to train, compatible with arbitrary multi-fidelity data structures, linearly scalable in the output dimension, and consistently outperforms baseline methods.
Results: Compared with the state-of-the-art infinite-fidelity fusion method IFC, ContinuAR achieves up to a 4x improvement in accuracy and a 62,500x speedup in training time.

Multi-fidelity fusion has become an important surrogate technique, which provides insights into expensive computer simulations and effectively improves decision-making, e.g., optimization, with less computational cost. Multi-fidelity fusion is much more computationally efficient compared to traditional single-fidelity surrogates. Despite the fast advancement of multi-fidelity fusion techniques, they lack a systematic framework to make use of the fidelity indicator, deal with high-dimensional and arbitrary data structure, and scale well to infinite-fidelity problems. In this work, we first generalize the popular autoregression (AR) to derive a novel linear fidelity differential equation (FiDE), paving the way to tractable infinite-fidelity fusion. We generalize FiDE to a high-dimensional system, which also provides a unifying framework to seamlessly bridge the gap between many multi- and single-fidelity GP-based models. We then propose ContinuAR, a rank-1 approximation solution to FiDEs, which is tractable to train, compatible with arbitrary multi-fidelity data structure, linearly scalable to the output dimension, and most importantly, delivers consistent SOTA performance with a significant margin over the baseline methods. Compared to the SOTA infinite-fidelity fusion, IFC, ContinuAR achieves up to 4x improvement in accuracy and 62,500x speedup in training time.

Equivariant flow matching
Leon Klein Andreas Krämer Frank Noe



Research question: How to build effective generative models for many-body systems in statistical physics while avoiding the expensive training and sampling of existing continuous normalizing flows (CNFs).
Motivation: Boltzmann generators tackle the long-standing sampling problem in statistical physics by training flows to produce equilibrium samples of many-body systems such as small molecules and proteins; incorporating the symmetries of the target energy into the model is crucial and can be achieved with equivariant CNFs, but their training and sampling can be computationally expensive, limiting scalability and practical application.
Method: Introduces equivariant flow matching, a new training objective for equivariant CNFs based on the recently proposed optimal-transport flow matching, which exploits the physical symmetries of the target energy for efficient, simulation-free training.
Results: Demonstrates the effectiveness of flow matching on rotation- and permutation-invariant many-particle systems and the small molecule alanine dipeptide, obtaining for the first time a Boltzmann generator with significant sampling efficiency that does not rely on tailored internal-coordinate featurization; the equivariant flow matching objective yields flows with shorter integration paths, higher sampling efficiency, and better scalability than existing methods.

Normalizing flows are a class of deep generative models that are especially interesting for modeling probability distributions in physics, where the exact likelihood of flows allows reweighting to known target energy functions and computing unbiased observables. For instance, Boltzmann generators tackle the long-standing sampling problem in statistical physics by training flows to produce equilibrium samples of many-body systems such as small molecules and proteins. To build effective models for such systems, it is crucial to incorporate the symmetries of the target energy into the model, which can be achieved by equivariant continuous normalizing flows (CNFs). However, CNFs can be computationally expensive to train and generate samples from, which has hampered their scalability and practical application. In this paper, we introduce equivariant flow matching, a new training objective for equivariant CNFs that is based on the recently proposed optimal transport flow matching. Equivariant flow matching exploits the physical symmetries of the target energy for efficient, simulation-free training of equivariant CNFs. We demonstrate the effectiveness of flow matching on rotation and permutation invariant many-particle systems and a small molecule, alanine dipeptide, where for the first time we obtain a Boltzmann generator with significant sampling efficiency without relying on tailored internal coordinate featurization. Our results show that the equivariant flow matching objective yields flows with shorter integration paths, improved sampling efficiency, and higher scalability compared to existing methods.

Implicit Manifold Gaussian Process Regression
Bernardo Fichera Viacheslav Borovitskiy Andreas Krause Aude Billard



Research question: How to infer the implicit low-dimensional manifold structure of high-dimensional data directly from data (labeled and unlabeled), so as to improve predictive performance and calibration in high-dimensional settings.
Motivation: Gaussian process regression works well on small or sparse datasets but struggles with high-dimensional data; leveraging the implicit low-dimensional manifold on which the data actually lies offers a way to scale the technique to higher dimensions.
Method: Proposes a Gaussian process regression technique that infers the implicit structure directly from data (labeled and unlabeled) in a fully differentiable way, and discusses the resulting model's convergence to the Matérn Gaussian process on the assumed manifold.
Results: The technique scales to hundreds of thousands of data points and may improve the predictive performance and calibration of standard Gaussian process regression in high-dimensional settings.

Gaussian process regression is widely used because of its ability to provide well-calibrated uncertainty estimates and handle small or sparse datasets. However, it struggles with high-dimensional data. One possible way to scale this technique to higher dimensions is to leverage the implicit low-dimensional manifold upon which the data actually lies, as postulated by the manifold hypothesis. Prior work ordinarily requires the manifold structure to be explicitly provided though, i.e. given by a mesh or be known to be one of the well-known manifolds like the sphere. In contrast, in this paper we propose a Gaussian process regression technique capable of inferring implicit structure directly from data (labeled and unlabeled) in a fully differentiable way. For the resulting model, we discuss its convergence to the Matérn Gaussian process on the assumed manifold. Our technique scales up to hundreds of thousands of data points, and may improve the predictive performance and calibration of the standard Gaussian process regression in high-dimensional settings.
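
One way to picture the construction the model converges to is the graph Matérn kernel: build a kNN graph over the data, take the Laplacian eigenpairs, and filter the spectrum. A numpy sketch on a noisy circle in R^10 (the paper instead infers the graph/manifold structure differentiably from labeled and unlabeled data):

```python
import numpy as np

rng = np.random.default_rng(0)

# noisy circle in R^10: a hidden 1-D manifold inside a high-dim space
t = rng.uniform(0, 2 * np.pi, 200)
X = np.zeros((200, 10))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += 0.01 * rng.standard_normal(X.shape)

# kNN graph Laplacian as a proxy for the unknown manifold geometry
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
nbrs = np.argsort(d2, axis=1)[:, 1:6]           # 5 nearest neighbours
W = np.zeros((200, 200))
W[np.repeat(np.arange(200), 5), nbrs.ravel()] = 1.0
W = np.maximum(W, W.T)                          # symmetrize
L = np.diag(W.sum(1)) - W

# graph Matern kernel: filter the Laplacian spectrum
lam, Phi = np.linalg.eigh(L)
nu, kappa = 1.5, 1.0
K = Phi @ np.diag((2 * nu / kappa ** 2 + lam) ** (-nu)) @ Phi.T
print(K.shape)                                  # (200, 200) kernel matrix
```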

Causal Interpretation of Self-Attention in Pre-Trained Transformers
Raanan Yehezkel Rohekar Yaniv Gurwicz Shami Nisimov



Research question: To propose a causal interpretation of self-attention in the Transformer architecture and explore its use for learning causal relations over input sequences.
Motivation: Existing pre-trained Transformer models capture rich semantic patterns, but lack an explicit account of the causal relations among input symbols.
Method: Interprets self-attention as estimating a structural equation model for a given input sequence of symbols, which can in turn be read as a causal structure over the input symbols in the specific context of that sequence; conditional independence relations between input symbols are estimated by computing partial correlations between their representations in the deepest attention layer, enabling the causal structure of an input sequence to be learned with existing constraint-based algorithms.
Results: Provides causal explanations for Transformer outcomes on two tasks, sentiment classification and recommendation, demonstrating the potential of pre-trained Transformers for zero-shot causal discovery.

We propose a causal interpretation of self-attention in the Transformer neural network architecture. We interpret self-attention as a mechanism that estimates a structural equation model for a given input sequence of symbols (tokens). The structural equation model can be interpreted, in turn, as a causal structure over the input symbols under the specific context of the input sequence. Importantly, this interpretation remains valid in the presence of latent confounders. Following this interpretation, we estimate conditional independence relations between input symbols by calculating partial correlations between their corresponding representations in the deepest attention layer. This enables learning the causal structure over an input sequence using existing constraint-based algorithms. In this sense, existing pre-trained Transformers can be utilized for zero-shot causal-discovery. We demonstrate this method by providing causal explanations for the outcomes of Transformers in two tasks: sentiment classification (NLP) and recommendation.
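
The core computation is standard: partial correlations between token representations, read off the precision matrix as $\rho_{ij} = -P_{ij}/\sqrt{P_{ii}P_{jj}}$. A numpy sketch with random stand-in representations (in practice the rows would come from the deepest attention layer):

```python
import numpy as np

def partial_correlations(R):
    """Partial correlations between token representations, read off the
    precision matrix: rho_ij = -P_ij / sqrt(P_ii * P_jj)."""
    P = np.linalg.inv(np.cov(R) + 1e-6 * np.eye(len(R)))
    d = np.sqrt(np.diag(P))
    rho = -P / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

# rows stand in for deepest-attention-layer representations of 6 tokens
R = np.random.default_rng(0).standard_normal((6, 64))
print(partial_correlations(R).round(2))
```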

Credal Marginal MAP
Radu Marinescu Debarun Bhattacharjya Junkyu Lee Fabio Cozman Alexander G. Gray



Research question: Solving marginal maximum a posteriori (MAP) inference in credal networks, where evaluating each complete MAP assignment requires exact likelihood computations.
Motivation: The task is extremely difficult because it requires exact likelihood computations (combinatorial sums) over the vertices of a complex joint credal set representing the space of all possible marginal distributions of the MAP variables.
Method: Develops new exact methods based on variable elimination and depth-first search, along with several approximation schemes based on mini-bucket partitioning and stochastic local search.
Results: An extensive empirical evaluation demonstrates the effectiveness of the new methods on random as well as real-world benchmark problems.

Credal networks extend Bayesian networks to allow for imprecision in probability values. Marginal MAP is a widely applicable mixed inference task that identifies the most likely assignment for a subset of variables (called MAP variables). However, the task is extremely difficult to solve in credal networks particularly because the evaluation of each complete MAP assignment involves exact likelihood computations (combinatorial sums) over the vertices of a complex joint credal set representing the space of all possible marginal distributions of the MAP variables. In this paper, we explore Credal Marginal MAP inference and develop new exact methods based on variable elimination and depth-first search as well as several approximation schemes based on the mini-bucket partitioning and stochastic local search. An extensive empirical evaluation demonstrates the effectiveness of our new methods on random as well as real-world benchmark problems.

Deep Recurrent Optimal Stopping
NIRANJAN DAMERA VENKATA Chiranjib Bhattacharyya



Research question: How to apply deep neural networks (DNNs) effectively to non-Markovian optimal stopping problems.
Motivation: A ready extension of existing DNN-based methods to non-Markovian settings requires significant state- and parameter-space expansion, manifesting the curse of dimensionality.
Method: Introduces, for the first time, an optimal stopping policy gradient algorithm (OSPG) that leverages RNNs effectively in non-Markovian settings by implicitly optimizing value functions without recursion, mitigating the curse of non-Markovianity.
Results: OSPG is derived from an inference procedure on a novel Bayesian network representation of discrete-time non-Markovian optimal stopping trajectories, and consequently yields an offline policy gradient algorithm that eliminates expensive Monte Carlo policy rollouts.

Deep neural networks (DNNs) have recently emerged as a powerful paradigm for solving Markovian optimal stopping problems. However, a ready extension of DNN-based methods to non-Markovian settings requires significant state and parameter space expansion, manifesting the curse of dimensionality. Further, efficient state-space transformations permitting Markovian approximations, such as those afforded by recurrent neural networks (RNNs), are either structurally infeasible or are confounded by the curse of non-Markovianity. Considering these issues, we introduce, for the first time, an optimal stopping policy gradient algorithm (OSPG) that can leverage RNNs effectively in non-Markovian settings by implicitly optimizing value functions without recursion, mitigating the curse of non-Markovianity. The OSPG algorithm is derived from an inference procedure on a novel Bayesian network representation of discrete-time non-Markovian optimal stopping trajectories and, as a consequence, yields an offline policy gradient algorithm that eliminates expensive Monte Carlo policy rollouts.

Stochastic Approximation Algorithms for Systems of Interacting Particles
Mohammad Reza Karimi Jaghargh Ya-Ping Hsieh Andreas Krause



Research question: Interacting particle systems excel in various machine learning tasks, but their analysis usually relies on the simplifying assumption of the mean-field limit, whereas practice uses discrete time steps, finite particle numbers, and complex integration schemes, creating a theoretical gap between continuous-time and discrete-time processes.
Motivation: To close this gap with a new framework that establishes a precise connection between discrete-time schemes and their corresponding mean-field limits, in terms of convergence properties and asymptotic behavior.
Method: Adopts a dynamical-systems perspective under which the framework seamlessly integrates various numerical schemes that are typically analyzed independently; for example, it provides a unified treatment of optimizing an infinite-width two-layer neural network and of sampling via Stein variational gradient descent, previously studied in isolation.
Results: The framework yields a clear understanding and comparison of different numerical schemes, which helps improve performance on machine learning tasks.

Interacting particle systems have proven highly successful in various machine learning tasks, including approximate Bayesian inference and neural network optimization. However, the analysis of these systems often relies on the simplifying assumption of the \emph{mean-field} limit, where particle numbers approach infinity and infinitesimal step sizes are used. In practice, discrete time steps, finite particle numbers, and complex integration schemes are employed, creating a theoretical gap between continuous-time and discrete-time processes. In this paper, we present a novel framework that establishes a precise connection between these discrete-time schemes and their corresponding mean-field limits in terms of convergence properties and asymptotic behavior. By adopting a dynamical system perspective, our framework seamlessly integrates various numerical schemes that are typically analyzed independently. For example, our framework provides a unified treatment of optimizing an infinite-width two-layer neural network and sampling via Stein Variational Gradient descent, which were previously studied in isolation.
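
One of the discrete-time schemes the framework unifies, Stein variational gradient descent, fits in a few lines; the update below uses the common RBF kernel with a median-heuristic bandwidth (step size and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def svgd_step(x, grad_logp, eps=0.5):
    """One SVGD update: phi(x_i) = mean_j [ k(x_j, x_i) grad log p(x_j)
    + grad_{x_j} k(x_j, x_i) ], with an RBF kernel."""
    diff = x[:, None, :] - x[None, :, :]            # x_i - x_j, (n, n, d)
    sq = (diff ** 2).sum(-1)
    h = np.median(sq) / np.log(len(x) + 1)          # median heuristic
    k = np.exp(-sq / h)                             # kernel matrix
    phi = (k @ grad_logp(x) + 2.0 * (k[:, :, None] * diff).sum(1) / h) / len(x)
    return x + eps * phi

# target: standard 2-D Gaussian, grad log p(x) = -x; particles start far off
x = 5.0 + 3.0 * rng.standard_normal((100, 2))
for _ in range(2000):
    x = svgd_step(x, lambda z: -z)
print(x.mean(0), x.std(0))     # mean near 0, spread near the target's
```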

Structured Voronoi Sampling
Afra Amini Li Du Ryan Cotterell



Research question: To build a theoretically grounded and principled approach for gradient-based text generation.
Motivation: Although gradient-based sampling algorithms have proven effective for text generation, the task lacks theoretical support and principled methods.
Method: Uses the discrete distributions given by language models to define densities and develops a sampler based on Hamiltonian Monte Carlo, naming the gradient-based technique Structured Voronoi Sampling (SVS).
Results: Experiments show that, compared with alternative sampling schemes, the empirical distribution of SVS samples is closer to the reference distribution; in a controlled generation task, SVS generates fluent and diverse samples while following the control targets significantly better than other methods.

Gradient-based sampling algorithms have demonstrated their effectiveness in text generation, especially in the context of controlled text generation. However, there exists a lack of theoretically grounded and principled approaches for this task. In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods. We use discrete distributions given by language models to define densities and develop an algorithm based on Hamiltonian Monte Carlo to sample from them. We name our gradient-based technique Structured Voronoi Sampling (SVS). In an experimental setup where the reference distribution is known, we show that the empirical distribution of SVS samples is closer to the reference distribution compared to alternative sampling schemes. Furthermore, in a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods.

DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting
Salva Rühling Cachay Bo Zhao Hailey James Rose Yu



Research question: Diffusion models are predominantly designed for static image generation and prediction; how can they be trained for dynamics forecasting?
Motivation: Current diffusion models are designed mainly for static images and have limited forecasting ability, so this work proposes training diffusion models by leveraging the temporal dynamics encoded in the data.
Method: Couples the diffusion steps in the network directly with the temporal dynamics in the data, training a stochastic, time-conditioned interpolator and a backbone forecaster network that mimic the forward and reverse processes of conventional diffusion models. This design naturally encodes multi-step and long-range forecasting, allows highly flexible, continuous-time sampling trajectories with a performance/speed trade-off at inference time, and the dynamics-informed diffusion process imposes a strong inductive bias that improves computational efficiency over traditional Gaussian-noise diffusion models.
Results: Experiments show strong performance on complex dynamics forecasting tasks, including sea surface temperatures, Navier-Stokes flows, and spring mesh systems, with competitive probabilistic skill-score metrics.

While diffusion models can successfully generate data and make predictions, they are predominantly designed for static images. We propose an approach for training diffusion models for dynamics forecasting that leverages the temporal dynamics encoded in the data, directly coupling it with the diffusion steps in the network. We train a stochastic, time-conditioned interpolator and a backbone forecaster network that mimic the forward and reverse processes of conventional diffusion models, respectively. This design choice naturally encodes multi-step and long-range forecasting capabilities, allowing for highly flexible, continuous-time sampling trajectories and the ability to trade-off performance with accelerated sampling at inference time. In addition, the dynamics-informed diffusion process imposes a strong inductive bias, allowing for improved computational efficiency compared to traditional Gaussian noise-based diffusion models. Our approach performs competitively on probabilistic skill score metrics in complex dynamics forecasting of sea surface temperatures, Navier-Stokes flows, and spring mesh systems.

Policy Gradient for Rectangular Robust Markov Decision Processes
Navdeep Kumar Esther Derman Matthieu Geist Kfir Yehuda Levy Shie Mannor



Research question: Policy gradient methods in reinforcement learning do not account for transition uncertainty, while learning robust policies is computationally expensive; how can both issues be addressed?
Motivation: Policy gradient methods are the standard for training reinforcement learning agents in a scalable and efficient manner, but they ignore model ambiguity, and current approaches to robustness require solving costly convex optimization problems.
Method: Introduces robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes; derives a closed-form expression for the worst occupation measure, shows that the worst kernel is a rank-one perturbation of the nominal one, and combines the worst occupation measure with robust Q-value estimation to obtain an explicit form of the robust gradient.
Results: The resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent, relieving the computational burden of the convex optimization problems required by current robust policy gradient approaches.

Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.

Automatic Integration for Spatiotemporal Neural Point Processes
Zihao Zhou Rose Yu



Research question: How to effectively learn continuous-time point processes, especially spatiotemporal point processes (STPPs), which are complex in both space and time.
Motivation: Existing methods struggle with the integration required by STPPs: they either assume a parametric form of the intensity function, which lacks flexibility, or approximate the intensity with Monte Carlo sampling, which introduces numerical error.
Method: The paper proposes a new paradigm, Auto-STPP, which extends the dual-network approach to 3D STPPs and introduces a decomposable parametrization of the integral network based on ProdNet, reducing a complex multivariate computational graph to a product of univariate graphs and thereby sidestepping the computational complexity inherent in multivariate graphs.
Results: The consistency of Auto-STPP is proven, and it is validated on synthetic data and benchmark real-world datasets. Auto-STPP shows a clear advantage in recovering complex intensity functions from irregular spatiotemporal events, particularly when the intensity is sharply localized.

Learning continuous-time point processes is essential to many discrete event forecasting tasks. However, integration poses a major challenge, particularly for spatiotemporal point processes (STPPs), as it involves calculating the likelihood through triple integrals over space and time. Existing methods for integrating STPPs either assume a parametric form of the intensity function, which lacks flexibility, or approximate the intensity with Monte Carlo sampling, which introduces numerical errors. Recent work by Omi et al. proposes a dual network approach for efficient integration of flexible intensity functions. However, their method only focuses on the 1D temporal point process. In this paper, we introduce a novel paradigm: `Auto-STPP` (Automatic Integration for Spatiotemporal Neural Point Processes) that extends the dual network approach to 3D STPPs. While previous work provides a foundation, its direct extension overly restricts the intensity function and leads to computational challenges. In response, we introduce a decomposable parametrization for the integral network using ProdNet. This approach, leveraging the product of simplified univariate graphs, effectively sidesteps the computational complexities inherent in multivariate computational graphs. We prove the consistency of `Auto-STPP` and validate it on synthetic data and benchmark real-world datasets. `Auto-STPP` shows a significant advantage in recovering complex intensity functions from irregular spatiotemporal events, particularly when the intensity is sharply localized. Our code is open-source at https://github.com/Rose-STL-Lab/AutoSTPP.
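
The decomposability idea is worth making concrete: if the intensity is a sum of products of univariate functions, a triple integral over a box factorizes into products of one-dimensional integrals. The sketch below illustrates only this factorization with fixed toy functions (names and quadrature choices are ours; Auto-STPP instead uses neural networks whose antiderivatives are obtained via automatic integration).

```python
# ProdNet-style decomposability (illustrative):
# if lambda(t, x, y) = sum_k f_k(t) g_k(x) h_k(y), then
# integral of lambda over a box = sum_k (int f_k)(int g_k)(int h_k).
import numpy as np

rng = np.random.default_rng(0)
K = 4
coef = rng.uniform(0.5, 1.5, size=(3, K))
f = [lambda t, c=c: np.exp(-c * t) for c in coef[0]]
g = [lambda x, c=c: 1.0 / (1.0 + c * x**2) for c in coef[1]]
h = [lambda y, c=c: np.cos(c * y) ** 2 for c in coef[2]]

def integral_1d(fn, lo, hi, n=2001):
    grid = np.linspace(lo, hi, n)
    return np.trapz(fn(grid), grid)

# Factorized form: K products of three cheap 1D integrals.
I_fact = sum(integral_1d(f[k], 0, 1) * integral_1d(g[k], -1, 1)
             * integral_1d(h[k], -1, 1) for k in range(K))

# Brute-force 3D quadrature for comparison.
t, x, y = np.linspace(0, 1, 101), np.linspace(-1, 1, 101), np.linspace(-1, 1, 101)
T, X, Y = np.meshgrid(t, x, y, indexing="ij")
lam = sum(f[k](T) * g[k](X) * h[k](Y) for k in range(K))
I_brute = np.trapz(np.trapz(np.trapz(lam, y, axis=2), x, axis=1), t, axis=0)
print(I_fact, I_brute)  # agree up to quadrature error
```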

Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data
Siyuan Guo Viktor Tóth Bernhard Schölkopf Ferenc Huszár



Research question: Constraint-based causal discovery methods use conditional independence tests to infer causal relationships in a wide variety of applications, but existing work focuses on independent and identically distributed data, which fundamentally limits how much causal structure can be discovered.
Motivation: The authors observe that, compared with i.i.d. data, exchangeable data contains a richer conditional independence structure that can be leveraged for deeper causal discovery.
Method: The authors first present causal de Finetti theorems, which state that exchangeable distributions with certain non-trivial conditional independences can always be represented as independent causal mechanism (ICM) generative processes. They then present the main identifiability theorem, showing that given data from an ICM generative process, its unique causal structure can be identified through conditional independence testing. Finally, they develop a causal discovery algorithm.
Results: The algorithm is demonstrated to be applicable to inferring causal relationships from multi-environment data.

Constraint-based causal discovery methods leverage conditional independence tests to infer causal relationships in a wide variety of applications. As with the majority of machine learning methods, existing work focuses on studying $\textit{independent and identically distributed}$ data. However, it is known that even with infinite $i.i.d.$ data, constraint-based methods can only identify causal structures up to broad Markov equivalence classes, posing a fundamental limitation for causal discovery. In this work, we observe that exchangeable data contains richer conditional independence structure than $i.i.d.$ data, and show how the richer structure can be leveraged for causal discovery. We first present causal de Finetti theorems, which state that exchangeable distributions with certain non-trivial conditional independences can always be represented as $\textit{independent causal mechanism (ICM)}$ generative processes. We then present our main identifiability theorem, which shows that given data from an ICM generative process, its unique causal structure can be identified through performing conditional independence tests. We finally develop a causal discovery algorithm and demonstrate its applicability to inferring causal relationships from multi-environment data.

Differentiable Neuro-Symbolic Reasoning on Large-Scale Knowledge Graphs
CHEN SHENGYUAN YUNFENG CAI Huang Fang Xiao Huang Mingming Sun



Research question: How to effectively combine rules and knowledge graph embeddings for reasoning that is both precise and efficient.
Motivation: Existing knowledge graph reasoning methods, whether rule-based or embedding-based, each have strengths and weaknesses, so a new method is needed that combines the advantages of both.
Method: The paper proposes DiffLogic, a differentiable framework that adaptively selects essential triples according to dynamic rules and weights and uses a continuous probabilistic soft logic network to assess the overall agreement among rules, weights, and observed triples, enabling end-to-end differentiable optimization.
Results: On benchmark datasets, DiffLogic surpasses baselines in both effectiveness and efficiency.

Knowledge graph (KG) reasoning utilizes two primary techniques, i.e., rule-based and KG-embedding based. The former provides precise inferences, but inferring via concrete rules is not scalable. The latter enables efficient reasoning at the cost of ambiguous inference accuracy. Neuro-symbolic reasoning seeks to amalgamate the advantages of both techniques. The crux of this approach is replacing the predicted existence of all possible triples (i.e., truth scores inferred from rules) with a suitable approximation grounded in embedding representations. However, constructing an effective approximation of all possible triples' truth scores is a challenging task, because it needs to balance the tradeoff between accuracy and efficiency while remaining compatible with both the rule-based and KG-embedding models. To this end, we propose a differentiable framework, DiffLogic. Instead of directly approximating all possible triples, we design a tailored filter to adaptively select essential triples based on the dynamic rules and weights. The truth scores assessed by KG-embedding are continuous, so we employ a continuous Markov logic network named probabilistic soft logic (PSL). It employs the truth scores of essential triples to assess the overall agreement among rules, weights, and observed triples. PSL enables end-to-end differentiable optimization, so we can alternately update embeddings and weighted rules. On benchmark datasets, we empirically show that DiffLogic surpasses baselines in both effectiveness and efficiency.

Detecting hidden confounding in observational data using multiple environments
Rickard Karlsson JH Krijthe



Research question: A common assumption in causal inference from observational data is that there is no hidden confounding, yet verifying the presence of hidden confounders from a single dataset is in general impossible.
Motivation: Under the assumption of independent causal mechanisms underlying the data-generating process, the authors demonstrate a way to detect unobserved confounders given multiple observational datasets coming from different environments.
Method: They present a theory of testable conditional independencies that are absent only when there is hidden confounding, and examine violations of its assumptions: degenerate and dependent mechanisms, and faithfulness violations. They further propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset.
Results: In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.

A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent when there is hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.

Entropy-dissipation Informed Neural Network for McKean-Vlasov Type PDEs
Zebang Shen Zhenfu Wang



Research question: Solving McKean-Vlasov equations (MVEs) with singular interaction kernels, in particular with rigorous theoretical guarantees.
Motivation: In physical systems the interaction term can be singular, i.e., it diverges when two particles collide. Notable examples include the Coulomb interaction, fundamental in plasma physics, and the Biot-Savart interaction, present in the vorticity formulation of the 2D Navier-Stokes equation (NSE) in fluid dynamics.
Method: The authors propose a new approach based on the concept of entropy dissipation in the underlying system. They derive a potential function that effectively controls the KL divergence between a hypothesis solution and the ground truth, and on this basis introduce the Entropy-dissipation Informed Neural Network (EINN) framework for solving MVEs, using neural networks to approximate the underlying velocity field and minimizing the proposed potential function. By leveraging the expressive power of neural networks, the approach offers a promising avenue for tackling the complexity of singular interactions.
Results: Comparisons with state-of-the-art neural-network-based MVE solvers demonstrate the effectiveness of the method across various example problems.

The McKean-Vlasov equation (MVE) describes the collective behavior of particles subject to drift, diffusion, and mean-field interaction. In physical systems, the interaction term can be singular, i.e. it diverges when two particles collide. Notable examples of such interactions include the Coulomb interaction, fundamental in plasma physics, and the Biot-Savart interaction, present in the vorticity formulation of the 2D Navier-Stokes equation (NSE) in fluid dynamics. Solving MVEs that involve singular interaction kernels presents a significant challenge, especially when aiming to provide rigorous theoretical guarantees. In this work, we propose a novel approach based on the concept of entropy dissipation in the underlying system. We derive a potential function that effectively controls the KL divergence between a hypothesis solution and the ground truth. Building upon this theoretical foundation, we introduce the Entropy-dissipation Informed Neural Network (EINN) framework for solving MVEs. In EINN, we utilize neural networks (NN) to approximate the underlying velocity field and minimize the proposed potential function. By leveraging the expressive power of NNs, our approach offers a promising avenue for tackling the complexities associated with singular interactions. To assess the empirical performance of our method, we compare EINN with SOTA NN-based MVE solvers. The results demonstrate the effectiveness of our approach in solving MVEs across various example problems.

Convergence analysis of ODE models for accelerated first-order methods via positive semidefinite kernels
Jungbin Kim Insoon Yang



Research question: This paper proposes a new methodology that systematically analyzes ODE models of first-order optimization methods by converting the task of proving convergence rates into verifying the positive semidefiniteness of specific Hilbert-Schmidt integral operators.
Motivation: Unlike previous work on performance estimation problems, which relies on finite-dimensional linear algebra, this approach is based on tools from functional analysis.
Method: Using the proposed methodology, the authors establish convergence rates for various accelerated gradient flow models, some of which are new.
Results: As an immediate consequence of the framework, a correspondence between minimizing function values and minimizing gradient norms is shown.

We propose a novel methodology that systematically analyzes ordinary differential equation (ODE) models for first-order optimization methods by converting the task of proving convergence rates into verifying the positive semidefiniteness of specific Hilbert-Schmidt integral operators. Our approach is based on the performance estimation problems (PEP) introduced by Drori and Teboulle. Unlike previous works on PEP, which rely on finite-dimensional linear algebra, we use tools from functional analysis. Using the proposed method, we establish convergence rates of various accelerated gradient flow models, some of which are new. As an immediate consequence of our framework, we show a correspondence between minimizing function values and minimizing gradient norms.
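
As a hedged illustration of the kind of condition being verified (notation ours): a Hilbert-Schmidt integral operator $T_K$ on $L^2([0,T])$ with kernel $K$ is positive semidefinite when

$$(T_K f)(t) = \int_0^T K(t,s)\, f(s)\, ds, \qquad \langle f, T_K f \rangle = \int_0^T \!\! \int_0^T K(t,s)\, f(t)\, f(s)\, ds\, dt \;\ge\; 0 \quad \text{for all } f \in L^2([0,T]),$$

and the paper's methodology reduces each convergence-rate proof to checking such a condition for a specific kernel derived from the ODE model.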

Computational Guarantees for Doubly Entropic Wasserstein Barycenters
Tomas Vaskevicius Lénaïc Chizat



Research question: This paper studies the computation of doubly regularized Wasserstein barycenters, a recently introduced family of entropic barycenters governed by inner and outer regularization strengths.
Motivation: Previous research has shown that different choices of the regularization parameters unify several notions of entropy-penalized barycenters while also revealing new ones, including a special case of debiased barycenters.
Method: The paper proposes an algorithm for computing doubly regularized Wasserstein barycenters. The procedure builds on damped Sinkhorn iterations followed by exact maximization/minimization steps and guarantees convergence for any choice of regularization parameters.
Results: An inexact variant of the algorithm, implementable using approximate Monte Carlo sampling, offers the first non-asymptotic convergence guarantees for approximating Wasserstein barycenters between discrete point clouds in the free-support/grid-free setting.

We study the computation of doubly regularized Wasserstein barycenters, a recently introduced family of entropic barycenters governed by inner and outer regularization strengths. Previous research has demonstrated that various regularization parameter choices unify several notions of entropy-penalized barycenters while also revealing new ones, including a special case of debiased barycenters. In this paper, we propose and analyze an algorithm for computing doubly regularized Wasserstein barycenters. Our procedure builds on damped Sinkhorn iterations followed by exact maximization/minimization steps and guarantees convergence for any choice of regularization parameters. An inexact variant of our algorithm, implementable using approximate Monte Carlo sampling, offers the first non-asymptotic convergence guarantees for approximating Wasserstein barycenters between discrete point clouds in the free-support/grid-free setting.
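
The building block of the procedure is the Sinkhorn iteration. The sketch below is plain (undamped) Sinkhorn for entropic optimal transport between two weighted point clouds; the paper's algorithm interleaves a damped variant of these updates with exact maximization/minimization steps for the barycenter, which is not reproduced here.

```python
# Plain Sinkhorn for entropic OT between two point clouds (illustrative).
import numpy as np

def sinkhorn(x, y, a, b, eps=0.05, n_iters=500):
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)   # alternate scaling updates
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # plan: rows sum to a, columns approach b
    return P, (P * C).sum()

rng = np.random.default_rng(0)
x, y = rng.normal(size=(30, 2)), rng.normal(loc=1.0, size=(40, 2))
a, b = np.full(30, 1 / 30), np.full(40, 1 / 40)
P, cost = sinkhorn(x, y, a, b)
```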

Neural Processes with Stability
Huafeng Liu Liping Jing Jian Yu



Research question: How to define a flexible class of stochastic processes suited to highly complex functions by combining the strengths of neural networks and stochastic processes.
Motivation: Traditional statistical models depend on hand-specified priors, whereas neural processes (NPs), a new class of powerful neural statistical models, encode contextual knowledge into the function space and are better suited to highly complex functions.
Method: The notion of algorithmic stability is introduced into NPs, providing theoretical guidelines for deriving more stable and more generalizable solutions.
Results: Experiments show that the approach not only achieves more accurate performance but also improves model robustness.

Unlike traditional statistical models depending on hand-specified priors, neural processes (NPs) have recently emerged as a class of powerful neural statistical models that combine the strengths of neural networks and stochastic processes. NPs can define a flexible class of stochastic processes well suited for highly non-trivial functions by encoding contextual knowledge into the function space. However, noisy context points introduce challenges to algorithmic stability: small changes in the training data may significantly change the model and lower its generalization performance. In this paper, we provide theoretical guidelines for deriving stable solutions with high generalization by introducing the notion of algorithmic stability into NPs, which is flexible enough to work with various NPs and achieves a less biased approximation with theoretical guarantees. To illustrate the superiority of the proposed model, we perform experiments on both synthetic and real-world data, and the results demonstrate that our approach not only helps to achieve more accurate performance but also improves model robustness.

A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models
Alexander Gilbert Reisach Myriam Tami Christof Seiler Antoine Chambaz Sebastian Weichwald



Research question: This paper investigates causal discovery from observational data under additive noise models (ANMs).
Motivation: Due to a lack of real-world data with a known underlying ANM, ANMs with randomly sampled parameters are commonly used to simulate data for evaluating causal discovery algorithms. Prior work found that, for many ANM parameter choices, sorting the variables by increasing variance yields an ordering close to a causal order, and introduced 'var-sortability' to quantify this alignment.
Method: The authors identify another pattern: the explainable fraction of a variable's variance, captured by the coefficient of determination R², tends to increase along the causal order, and unlike variance it is scale-invariant, persisting even after standardization. They propose an efficient baseline algorithm, 'R²-SortnRegress', which exploits high R²-sortability and can match and exceed established causal discovery algorithms.
Results: Experiments show high R²-sortability on synthetic data across a range of simulation parameters. These findings reveal high R²-sortability as an assumption about the data-generating process that is relevant to causal discovery and implicit in many ANM sampling schemes.

Additive Noise Models (ANMs) are a common model class for causal discovery from observational data. Due to a lack of real-world data for which an underlying ANM is known, ANMs with randomly sampled parameters are commonly used to simulate data for the evaluation of causal discovery algorithms. While some parameters may be fixed by explicit assumptions, fully specifying an ANM requires choosing all parameters. Reisach et al. (2021) show that, for many ANM parameter choices, sorting the variables by increasing variance yields an ordering close to a causal order and introduce ‘var-sortability’ to quantify this alignment. Since increasing variances may be unrealistic and cannot be exploited when data scales are arbitrary, ANM data are often rescaled to unit variance in causal discovery benchmarking. We show that synthetic ANM data are characterized by another pattern that is scale-invariant and thus persists even after standardization: the explainable fraction of a variable’s variance, as captured by the coefficient of determination $R^2$, tends to increase along the causal order. The result is high ‘$R^2$-sortability’, meaning that sorting the variables by increasing $R^2$ yields an ordering close to a causal order. We propose a computationally efficient baseline algorithm termed ‘$R^2$-SortnRegress’ that exploits high $R^2$-sortability and that can match and exceed the performance of established causal discovery algorithms. We show analytically that sufficiently high edge weights lead to a relative decrease of the noise contributions along causal chains, resulting in increasingly deterministic relationships and high $R^2$. We characterize $R^2$-sortability on synthetic data with different simulation parameters and find high values in common settings. Our findings reveal high $R^2$-sortability as an assumption about the data generating process relevant to causal discovery and implicit in many ANM sampling schemes. It should be made explicit, as its prevalence in real-world data is an open question. For causal discovery benchmarking, we provide implementations of $R^2$-sortability, the $R^2$-SortnRegress algorithm, and ANM simulation procedures in our library CausalDisco at https://causaldisco.github.io/CausalDisco/.
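
A simplified re-implementation of the baseline's two steps makes the idea concrete (the reference implementation lives in the authors' CausalDisco library; the details here are ours): order variables by the $R^2$ obtained when regressing each on all remaining variables, then regress each variable on its predecessors in that order.

```python
# Simplified R^2-SortnRegress sketch (illustrative re-implementation).
import numpy as np
from sklearn.linear_model import LinearRegression

def r2_sort_n_regress(X):
    n, d = X.shape
    r2 = np.empty(d)
    for j in range(d):
        rest = np.delete(np.arange(d), j)
        model = LinearRegression().fit(X[:, rest], X[:, j])
        r2[j] = model.score(X[:, rest], X[:, j])
    order = np.argsort(r2)              # candidate causal order: increasing R^2
    W = np.zeros((d, d))                # W[i, j]: estimated weight of edge i -> j
    for pos, j in enumerate(order):
        parents = order[:pos]
        if len(parents):
            fit = LinearRegression().fit(X[:, parents], X[:, j])
            W[parents, j] = fit.coef_
    return order, W

# Toy linear ANM: x0 -> x1, x0 -> x2, x1 -> x2.
rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)
x1 = 2.0 * x0 + rng.normal(size=5000)
x2 = 2.0 * x0 + 2.0 * x1 + rng.normal(size=5000)
order, W = r2_sort_n_regress(np.column_stack([x0, x1, x2]))
print(order)  # [0 1 2] here; R^2-sortability is a tendency, not a guarantee
```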

Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions
Duligur Ibeling Thomas Icard



Research question: This paper aims to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference.
Motivation: Adopting a neutral logical perspective and drawing on previous work, the paper shows what is required for an RCM to be representable by an SCM.
Method: A key result shows that every RCM, including those that violate algebraic principles implied by the SCM framework, emerges as an abstraction of some representable RCM.
Results: Finally, the paper illustrates the power of this ameliorative perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, it offers a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.

The aim of this paper is to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference. Adopting a neutral logical perspective, and drawing on previous work, we show what is required for an RCM to be representable by an SCM. A key result then shows that every RCM---including those that violate algebraic principles implied by the SCM framework---emerges as an abstraction of some representable RCM. Finally, we illustrate the power of this ameliorative perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, we offer a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.

Learning Interpretable Low-dimensional Representation via Physical Symmetry
Xuanjie Liu Daniel Chin Yichen Huang Gus Xia



Research question: How to learn interpretable representations from unlabelled music audio, in particular low-dimensional factors that agree with human perception.
Motivation: Most music representation learning methods rely heavily on music domain knowledge; the authors instead ask what general computational principles give rise to interpretable representations.
Method: Taking inspiration from modern physics, physical symmetry is used as a self-consistency constraint on the latent space: the prior model that characterizes the dynamics of the latent states is required to be equivariant with respect to certain group transformations.
Results: Experiments show that physical symmetry leads the model to learn a linear pitch factor from unlabelled monophonic music audio in a self-supervised fashion. The same methodology also applies to computer vision, learning a 3D Cartesian space from videos of a simple moving object without labels. Moreover, physical symmetry naturally leads to representation augmentation, a new technique that improves sample efficiency.

We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord and texture. However, most methods rely heavily on music domain knowledge. It remains an open question what general computational principles *give rise to* interpretable representations, especially low-dim factors that agree with human perception. In this study, we take inspiration from modern physics and use *physical symmetry* as a self-consistency constraint for the latent space. Specifically, it requires the prior model that characterises the dynamics of the latent states to be *equivariant* with respect to certain group transformations. We show that physical symmetry leads the model to learn a *linear* pitch factor from unlabelled monophonic music audio in a self-supervised fashion. In addition, the same methodology can be applied to computer vision, learning a 3D Cartesian space from videos of a simple moving object without labels. Furthermore, physical symmetry naturally leads to *representation augmentation*, a new technique which improves sample efficiency.

Stabilized Neural Differential Equations for Learning Dynamics with Explicit Constraints
Alistair White Niki Kilbertus Maximilian Gelbrecht Niklas Boers



Research question: How to learn dynamical systems from data while ensuring that the inferred dynamics preserve known constraints.
Motivation: Existing methods struggle to guarantee known constraints while learning dynamical systems.
Method: The paper proposes stabilized neural differential equations (SNDEs), which enforce arbitrary manifold constraints by adding a stabilization term to the original dynamics that renders the constraint manifold provably asymptotically stable.
Results: SNDEs are compatible with all common neural differential equation (NDE) models, outperform existing methods in extensive empirical evaluations, and broaden the types of constraints that can be incorporated into NDE training.

Many successful methods to learn dynamical systems from data have recently been introduced. However, ensuring that the inferred dynamics preserve known constraints, such as conservation laws or restrictions on the allowed system states, remains challenging. We propose stabilized neural differential equations (SNDEs), a method to enforce arbitrary manifold constraints for neural differential equations. Our approach is based on a stabilization term that, when added to the original dynamics, renders the constraint manifold provably asymptotically stable. Due to its simplicity, our method is compatible with all common neural differential equation (NDE) models and broadly applicable. In extensive empirical evaluations, we demonstrate that SNDEs outperform existing methods while broadening the types of constraints that can be incorporated into NDE training.
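
The stabilization idea admits a compact sketch for a single explicit constraint $c(x)=0$: augment the vector field with a term that points back toward the constraint manifold. The specific form below, $-\gamma\, c(x)\, \nabla c(x)$, is a simplified stand-in chosen by us for illustration; the paper's construction is more general.

```python
# Stabilized dynamics sketch: harmonic oscillator with energy conservation
# as the constraint c(q, p) = E(q, p) - E0 = 0 (illustrative).
import numpy as np

gamma, E0 = 5.0, 0.5

def f(x):                       # nominal dynamics: q' = p, p' = -q
    q, p = x
    return np.array([p, -q])

def c(x):                       # constraint violation (energy drift)
    q, p = x
    return 0.5 * (q**2 + p**2) - E0

def grad_c(x):                  # gradient of the energy w.r.t. (q, p)
    return x.copy()

def f_stab(x):                  # stabilized field: manifold becomes attractive
    return f(x) - gamma * c(x) * grad_c(x)

def euler_rollout(field, x0, dt=0.01, n=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(n):
        x = x + dt * field(x)
    return x

print(c(euler_rollout(f, [1.0, 0.0])))       # plain Euler drifts off the manifold
print(c(euler_rollout(f_stab, [1.0, 0.0])))  # stabilized rollout stays near 0
```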

Homotopy-based training of NeuralODEs for accurate dynamics discovery
Joon-Hyuk Ko Hankyul Koh Nojun Park Wonho Jhe



Research question: How to effectively extract dynamical laws from time series data and improve both the training efficiency and the quality of results for NeuralODE models.
Motivation: Although NeuralODEs bridge neural networks with the differential-equation-based modeling paradigm of the physical sciences, current approaches suffer from long training times and suboptimal results, especially on longer-duration data.
Method: The paper develops a new training method for NeuralODEs based on synchronization and homotopy optimization that requires no changes to the model architecture. Synchronizing the model dynamics with the training data tames the originally irregular loss landscape, which homotopy optimization then leverages to enhance training.
Results: Experiments show competitive or better training loss, often with fewer than half the training epochs required by other model-agnostic techniques. Models trained with this method also display better extrapolation capabilities, highlighting its effectiveness.

Neural Ordinary Differential Equations (NeuralODEs) present an attractive way to extract dynamical laws from time series data, as they bridge neural networks with the differential equation-based modeling paradigm of the physical sciences. However, these models often display long training times and suboptimal results, especially for longer duration data. While a common strategy in the literature imposes strong constraints to the NeuralODE architecture to inherently promote stable model dynamics, such methods are ill-suited for dynamics discovery as the unknown governing equation is not guaranteed to satisfy the assumed constraints. In this paper, we develop a new training method for NeuralODEs, based on synchronization and homotopy optimization, that does not require changes to the model architecture. We show that synchronizing the model dynamics and the training data tames the originally irregular loss landscape, which homotopy optimization can then leverage to enhance training. Through benchmark experiments, we demonstrate our method achieves competitive or better training loss while often requiring less than half the number of training epochs compared to other model-agnostic techniques. Furthermore, models trained with our method display better extrapolation capabilities, highlighting the effectiveness of our method.
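
One common way to write the synchronization coupling (our notation; the paper's exact scheme may differ in detail) is

$$\dot{\hat{x}}(t) = f_\theta\big(\hat{x}(t)\big) + k\,\big(x_{\text{data}}(t) - \hat{x}(t)\big),$$

where the coupling strength $k$ pins the model trajectory to the data early in training, and the homotopy schedule anneals $k \to 0$ so that the final model solves the original, uncoupled NeuralODE problem.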

Assumption violations in causal discovery and the robustness of score matching
Francesco Montagna Atalanti A. Mastakouri Elias Eulig Nicoletta Noceti Lorenzo Rosasco Dominik Janzing Bryon Aragam Francesco Locatello



Research question: When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, how can practitioners use observational causal discovery methods to recover causal structure?
Motivation: Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets.
Method: The paper extensively benchmarks recent causal discovery methods on observational i.i.d. data generated under different background conditions that violate the critical assumptions required by each selected approach.
Results: Experiments show that score-matching-based methods achieve surprisingly good false positive and false negative rates for the inferred graph in these challenging scenarios, and theoretical insights into their performance are provided. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to their hyperparameter values. The authors hope the paper will set a new standard for evaluating causal discovery methods and serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithmic choices.

When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational _iid_ data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.

PICProp: Physics-Informed Confidence Propagation for Uncertainty Quantification
Qianli Shen Wai Hoh Tang Zhun Deng Apostolos Psaros Kenji Kawaguchi



Research question: Standard approaches to uncertainty quantification in deep learning and physics-informed learning have persistent limitations.
Motivation: Current methods require strong assumptions about the data likelihood, their performance depends heavily on the choice of prior, and the posterior can only be sampled approximately, which leads to poor approximations because of the associated computational cost.
Method: The paper introduces and studies confidence interval (CI) estimation for deterministic partial differential equations as a novel problem: propagating confidence, in the form of CIs, from data locations to the entire domain with probabilistic guarantees.
Results: A bi-level optimization method, Physics-Informed Confidence Propagation (PICProp), computes valid CIs without heavy assumptions. A theorem on the validity of the method is provided, along with computational experiments focused on physics-informed learning. Code is available at https://github.com/ShenQianli/PICProp.

Standard approaches for uncertainty quantification in deep learning and physics-informed learning have persistent limitations. Indicatively, strong assumptions regarding the data likelihood are required, the performance highly depends on the selection of priors, and the posterior can be sampled only approximately, which leads to poor approximations because of the associated computational cost. This paper introduces and studies confidence interval (CI) estimation for deterministic partial differential equations as a novel problem. That is, to propagate confidence, in the form of CIs, from data locations to the entire domain with probabilistic guarantees. We propose a method, termed Physics-Informed Confidence Propagation (PICProp), based on bi-level optimization to compute a valid CI without making heavy assumptions. We provide a theorem regarding the validity of our method, and computational experiments, where the focus is on physics-informed learning. Code is available at https://github.com/ShenQianli/PICProp.

Riemannian SAM: Sharpness-Aware Minimization on Riemannian Manifolds
Jihun Yun Eunho Yang



Research question: Optimization algorithms for training geometric deep learning models remain under-explored.
Motivation: Recent advances in deep learning have begun to explore the underlying geometric properties of data, encouraging techniques that consider general manifolds, such as hyperbolic or orthogonal neural networks.
Method: The paper introduces Riemannian SAM by generalizing conventional Euclidean SAM to Riemannian manifolds, formulating sharpness-aware minimization on Riemannian manifolds and leading to a novel instantiation, Lorentz SAM. SAM variants proposed in previous studies, such as Fisher SAM, can be derived as special cases of the Riemannian SAM framework.
Results: The analysis is a theoretically sound contribution encompassing a diverse range of manifolds and also provides convergence guarantees for SAM variants such as Fisher SAM, whose convergence analyses were previously absent. Experiments on knowledge graph completion and machine translation show that Riemannian SAM generalizes better than previous Riemannian optimization algorithms.

Contemporary advances in the field of deep learning have embarked upon an exploration of the underlying geometric properties of data, thus encouraging the investigation of techniques that consider general manifolds, for example, hyperbolic or orthogonal neural networks. However, the optimization algorithms for training such geometric deep learning models still remain highly under-explored. In this paper, we introduce Riemannian SAM by generalizing conventional Euclidean SAM to Riemannian manifolds. We successfully formulate sharpness-aware minimization on Riemannian manifolds, leading to a novel instantiation, Lorentz SAM. In addition, SAM variants proposed in previous studies such as Fisher SAM can be derived as special examples under our Riemannian SAM framework. We provide the convergence analysis of Riemannian SAM under a less aggressively decaying ascent learning rate than Euclidean SAM. Our analysis serves as a theoretically sound contribution encompassing a diverse range of manifolds, also providing the guarantees for SAM variants such as Fisher SAM, whose convergence analyses are absent. Lastly, we illustrate the superiority of Riemannian SAM in terms of generalization over previous Riemannian optimization algorithms through experiments on knowledge graph completion and machine translation tasks.

Analysis of Variance of Multiple Causal Networks
Zhongli Jiang Dabao Zhang



Research question: Constructing a directed cyclic graph (DCG) is challenged by both algorithmic difficulty and computational burden, and comparing multiple DCGs is even harder.
Motivation: The authors propose to unify multiple DCGs with a single structural model and develop a limited-information-based method to simultaneously construct multiple networks and infer their disparities.
Method: The method is designed with two sequential stages, each involving parallel computation tasks that scale with network complexity. Taking advantage of high-performance clusters, it makes it possible to evaluate the statistical significance of DCGs using the bootstrap method.
Results: The effectiveness of the method is demonstrated through applications to synthetic and real datasets.

Constructing a directed cyclic graph (DCG) is challenged by both algorithmic difficulty and computational burden. Comparing multiple DCGs is even more difficult, compounded by the need to identify variational causalities across graphs. We propose to unify multiple DCGs with a single structural model and develop a limited-information-based method to simultaneously construct multiple networks and infer their disparities, which can be visualized by appropriate correspondence analysis. The algorithm provides DCGs with robust non-asymptotic theoretical properties. It is designed with two sequential stages, each of which involves parallel computation tasks that are scalable to the network complexity. Taking advantage of high-performance clusters, our method makes it possible to evaluate the statistical significance of DCGs using the bootstrap method. We demonstrate the effectiveness of our method by applying it to synthetic and real datasets.

Undirected Probabilistic Model for Tensor Decomposition
Zerui Tao Toshihisa Tanaka Qibin Zhao



Research question: How to effectively learn from real-world data without pre-specified structural or distributional assumptions.
Motivation: Traditional tensor decomposition methods require structural or distributional assumptions to be chosen beforehand, which are typically unavailable in real-world applications.
Method: The paper proposes a flexible tensor decomposition framework that learns the underlying structure and distribution of the data through a deep energy-based model (EBM) parameterized by neural networks, and, by designing the energy function, unifies the learning of different types of tensors, such as static tensors and dynamic tensors with time stamps.
Results: Experiments show advantages of the method on both synthetic data and several real-world datasets.

Tensor decompositions (TDs) serve as a powerful tool for analyzing multiway data. Traditional TDs incorporate prior knowledge about the data into the model, such as a directed generative process from latent factors to observations. In practice, selecting proper structural or distributional assumptions beforehand is crucial for obtaining a promising TD representation. However, since such prior knowledge is typically unavailable in real-world applications, choosing an appropriate TD model can be challenging. This paper aims to address this issue by introducing a flexible TD framework that discards the structural and distributional assumptions, in order to learn as much information as possible from the data. Specifically, we construct a TD model that captures the joint probability of the data and latent tensor factors through a deep energy-based model (EBM). Neural networks are then employed to parameterize the joint energy function of tensor factors and tensor entries. The flexibility of EBM and neural networks enables the learning of underlying structures and distributions. In addition, by designing the energy function, our model unifies the learning process of different types of tensors, such as static tensors and dynamic tensors with time stamps. The resulting model presents a doubly intractable nature due to the presence of latent tensor factors and the unnormalized probability function. To efficiently train the model, we derive a variational upper bound of the conditional noise-contrastive estimation objective that learns the unnormalized joint probability by distinguishing data from conditional noises. We show the advantages of our model on both synthetic and several real-world datasets.

Differentiable and Stable Long-Range Tracking of Multiple Posterior Modes
Ali Younis Erik B. Sudderth



Research question: This paper addresses the application of particle filters to high-dimensional observations such as images, and the failures of existing reparameterization-based estimators for mixture gradients.
Motivation: Classical particle filters perform well on tracking problems with known dynamics and observation likelihoods, but such generative models may be inaccurate or unavailable for high-dimensional observations like images.
Method: Deep neural network encoders are used to discriminatively learn particle-based representations of uncertainty in latent object states, conditioned on arbitrary observations. An importance-sampling gradient estimator addresses the failures of existing reparameterization-based estimators for mixture gradients.
Results: On a range of challenging tracking and robot localization problems, the approach achieves dramatic improvements in accuracy, along with much greater stability across multiple training runs.

Particle filters flexibly represent multiple posterior modes nonparametrically, via a collection of weighted samples, but have classically been applied to tracking problems with known dynamics and observation likelihoods. Such generative models may be inaccurate or unavailable for high-dimensional observations like images. We instead leverage training data to discriminatively learn particle-based representations of uncertainty in latent object states, conditioned on arbitrary observations via deep neural network encoders. While prior discriminative particle filters have used heuristic relaxations of discrete particle resampling, or biased learning by truncating gradients at resampling steps, we achieve unbiased and low-variance gradient estimates by representing posteriors as continuous mixture densities. Our theory and experiments expose dramatic failures of existing reparameterization-based estimators for mixture gradients, an issue we address via an importance-sampling gradient estimator. Unlike standard recurrent neural networks, our mixture density particle filter represents multimodal uncertainty in continuous latent states, improving accuracy and robustness. On a range of challenging tracking and robot localization problems, our approach achieves dramatic improvements in accuracy, while also showing much greater stability across multiple training runs.

What is Flagged in Uncertainty Quantification? Latent Density Models for Uncertainty Categorization
Hao Sun Boris van Breugel Jonathan Crabbé Nabeel Seedat Mihaela van der Schaar



Research question: How to categorize the uncertain examples flagged by uncertainty quantification (UQ) methods.
Motivation: Although many UQ methods have recently emerged that can flag suspicious examples, it is often unclear what exactly these methods identify.
Method: The paper proposes a framework that introduces the confusion density matrix to categorize suspicious examples identified by a given uncertainty method into three classes: out-of-distribution (OOD) examples, boundary (Bnd) examples, and examples in regions of high in-distribution misclassification (IDM).
Results: Extensive experiments show that the framework provides a new and distinct perspective for assessing differences between uncertainty quantification methods, forming a valuable assessment benchmark.

Uncertainty quantification (UQ) is essential for creating trustworthy machine learning models. Recent years have seen a steep rise in UQ methods that can flag suspicious examples, however, it is often unclear what exactly these methods identify. In this work, we propose a framework for categorizing uncertain examples flagged by UQ methods. We introduce the confusion density matrix---a kernel-based approximation of the misclassification density---and use this to categorize suspicious examples identified by a given uncertainty method into three classes: out-of-distribution (OOD) examples, boundary (Bnd) examples, and examples in regions of high in-distribution misclassification (IDM). Through extensive experiments, we show that our framework provides a new and distinct perspective for assessing differences between uncertainty quantification methods, thereby forming a valuable assessment benchmark.

Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint
Junghyun Lee Hanseul Cho Se-Young Yun Chulhee Yun



Research question: This paper addresses fair principal component analysis (PCA): performing PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another.
Motivation: Existing fair PCA approaches have two main problems: theoretically, fair PCA has lacked a statistical foundation in terms of learnability; practically, memory limitations prevent their use, since they require full access to the entire data.
Method: The paper rigorously formulates fair PCA using a new notion called probably approximately fair and optimal (PAFO) learnability, and, motivated by advances in streaming algorithms, proposes a fair streaming PCA setting together with a memory-efficient algorithm, the fair noisy power method (FNPM), with a statistical guarantee in terms of PAFO-learnability, the first of its kind in the fair PCA literature.
Results: The algorithm is verified on the CelebA dataset without any pre-processing; while existing approaches are inapplicable due to memory limitations, the streaming formulation performs fair PCA efficiently and effectively.

Fair Principal Component Analysis (PCA) is a problem setting where we aim to perform PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another. However, existing approaches to fair PCA have two main problems: theoretically, there has been no statistical foundation of fair PCA in terms of learnability; practically, limited memory prevents us from using existing approaches, as they explicitly rely on full access to the entire data. On the theoretical side, we rigorously formulate fair PCA using a new notion called probably approximately fair and optimal (PAFO) learnability. On the practical side, motivated by recent advances in streaming algorithms for addressing memory limitation, we propose a new setting called fair streaming PCA along with a memory-efficient algorithm, fair noisy power method (FNPM). We then provide its statistical guarantee in terms of PAFO-learnability, which is the first of its kind in fair PCA literature. We verify our algorithm in the CelebA dataset without any pre-processing; while the existing approaches are inapplicable due to memory limitations, by turning it into a streaming setting, we show that our algorithm performs fair PCA efficiently and effectively.

Estimating Propensity for Causality-based Recommendation without Exposure Data
Zhongzhou Liu Yuan Fang Min Wu



Research question: Existing causality-based recommendation systems require additional exposure data and propensity scores (i.e., the probability of exposure) for training, yet in the real world such key data are often unavailable due to technical or privacy constraints.
Motivation: To address this problem, the paper proposes a new framework: Propensity Estimation for Causality-based Recommendation (PropCare).
Method: By relating the pairwise characteristics between propensity and item popularity, PropCare estimates propensity and exposure from conventional interaction data alone, without any ground truth on exposure or propensity in either training or inference.
Results: Experiments show that PropCare enables competitive causality-based recommendation, and a theoretical analysis of the bias of the causal effect under the model estimation is also presented.

Causality-based recommendation systems focus on the causal effects of user-item interactions resulting from item exposure (i.e., which items are recommended or exposed to the user), as opposed to conventional correlation-based recommendation. They are gaining popularity due to their multi-sided benefits to users, sellers and platforms alike. However, existing causality-based recommendation methods require additional input in the form of exposure data and/or propensity scores (i.e., the probability of exposure) for training. Such data, crucial for modeling causality in recommendation, are often not available in real-world situations due to technical or privacy constraints. In this paper, we bridge the gap by proposing a new framework, called Propensity Estimation for Causality-based Recommendation (PropCare). It can estimate the propensity and exposure from a more practical setup, where only interaction data are available *without* any ground truth on exposure or propensity in training and inference. We demonstrate that, by relating the pairwise characteristics between propensity and item popularity, PropCare enables competitive causality-based recommendation given only the conventional interaction data. We further present a theoretical analysis on the bias of the causal effect under our model estimation. Finally, we empirically evaluate PropCare through both quantitative and qualitative experiments.

Optimal Transport for Treatment Effect Estimation
Hao Wang Jiajun Fan Zhichao Chen Haoxuan Li Weiming Liu Tianqiao Liu Quanyu Dai Yichao Wang Zhenhua Dong Ruiming Tang



Research question: Estimating individual treatment effects from observational data is challenging due to treatment selection bias.
Motivation: Prevalent methods mainly mitigate this issue by aligning different treatment groups in the latent space, the core of which is computing a distribution discrepancy. However, two frequently overlooked issues, mini-batch sampling effects (MSE) and unobserved confounder effects (UCE), can render these methods invalid.
Method: The paper proposes Entire Space CounterFactual Regression (ESCFR), a new take on optimal transport in the context of causality. Based on the canonical optimal transport framework, a relaxed mass-preserving regularizer addresses the MSE issue, and a proximal factual outcome regularizer handles the UCE issue.
Results: Extensive experiments show that ESCFR estimates distribution discrepancy accurately, handles treatment selection bias effectively, and significantly outperforms prevalent competitors.

Estimating individual treatment effects from observational data is challenging due to treatment selection bias. Prevalent methods mainly mitigate this issue by aligning different treatment groups in the latent space, the core of which is the calculation of distribution discrepancy. However, two issues that are often overlooked can render these methods invalid: (1) mini-batch sampling effects (MSE), where the calculated discrepancy is erroneous in non-ideal mini-batches with outcome imbalance and outliers; (2) unobserved confounder effects (UCE), where the unobserved confounders are not considered in the discrepancy calculation. Both of these issues invalidate the calculated discrepancy, mislead the training of estimators, and thus impede the handling of treatment selection bias. To tackle these issues, we propose Entire Space CounterFactual Regression (ESCFR), which is a new take on optimal transport technology in the context of causality. Specifically, based on the canonical optimal transport framework, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that ESCFR estimates distribution discrepancy accurately, handles the treatment selection bias effectively, and outperforms prevalent competitors significantly.

Function Space Bayesian Pseudocoreset for Bayesian Neural Networks
Balhae Kim Hyungi Lee Juho Lee



Research question: How to effectively construct Bayesian pseudocoresets for scalable Bayesian inference on large-scale datasets.
Motivation: Existing construction methods match model parameters (weights) in a high-dimensional parameter space, which suffers from limited scalability and multi-modality issues.
Method: The paper proposes a novel construction method that operates on a function space: it builds a variational approximation to the pseudocoreset posterior on the function space and matches it to the full-data posterior in that space.
Results: Experiments show that the resulting Bayesian pseudocoresets enjoy enhanced uncertainty quantification and better robustness across various model architectures.

A Bayesian pseudocoreset is a compact synthetic dataset summarizing essential information of a large-scale dataset and thus can be used as a proxy dataset for scalable Bayesian inference. Typically, a Bayesian pseudocoreset is constructed by minimizing a divergence measure between the posterior conditioning on the pseudocoreset and the posterior conditioning on the full dataset. However, evaluating the divergence can be challenging, particularly for models like deep neural networks having high-dimensional parameters. In this paper, we propose a novel Bayesian pseudocoreset construction method that operates on a function space. Unlike previous methods, which construct and match the coreset and full data posteriors in the space of model parameters (weights), our method constructs variational approximations to the coreset posterior on a function space and matches it to the full data posterior in the function space. By working directly on the function space, our method could bypass several challenges that may arise when working on a weight space, including limited scalability and multi-modality issues. Through various experiments, we demonstrate that the Bayesian pseudocoresets constructed from our method enjoy enhanced uncertainty quantification and better robustness across various model architectures.

Globally solving the Gromov-Wasserstein problem for point clouds in low dimensional Euclidean spaces
Martin Ryner Jan Kronqvist Johan Karlsson



Research question: This paper presents a framework for computing the Gromov-Wasserstein problem between two sets of points in low-dimensional spaces, where the discrepancy is the squared Euclidean norm.
Motivation: The Gromov-Wasserstein problem generalizes optimal transport by finding the assignment between two sets that preserves pairwise distances as much as possible; it can quantify the similarity between two formations or shapes, a common problem in AI and machine learning.
Method: The framework reformulates the underlying quadratic assignment problem (QAP) as an optimization problem with a low-dimensional domain, leveraging the fact that the problem can be expressed as a concave quadratic optimization problem with low rank. The method scales well with the number of points and can find global solutions for large-scale problems with thousands of points.
Results: The computational complexity of the approach is compared with state-of-the-art methods on synthetic problems, and it is applied to a near-symmetrical problem of particular interest in computational biology.

This paper presents a framework for computing the Gromov-Wasserstein problem between two sets of points in low dimensional spaces, where the discrepancy is the squared Euclidean norm. The Gromov-Wasserstein problem is a generalization of the optimal transport problem that finds the assignment between two sets preserving pairwise distances as much as possible. This can be used to quantify the similarity between two formations or shapes, a common problem in AI and machine learning. The problem can be formulated as a Quadratic Assignment Problem (QAP), which is in general computationally intractable even for small problems. Our framework addresses this challenge by reformulating the QAP as an optimization problem with a low-dimensional domain, leveraging the fact that the problem can be expressed as a concave quadratic optimization problem with low rank. The method scales well with the number of points, and it can be used to find the global solution for large-scale problems with thousands of points. We compare the computational complexity of our approach with state-of-the-art methods on synthetic problems and apply it to a near-symmetrical problem which is of particular interest in computational biology.

Unbiased learning of deep generative models with structured discrete representations
Harry Bendekgey Gabriel Hope Erik B. Sudderth



Research question: How to learn generative models that combine graphical models with deep learning architectures.
Motivation: Graphical models offer structure and interpretability, and deep learning offers flexibility for high-dimensional data, but combining the two poses substantial optimization challenges.
Method: The paper proposes novel algorithms for learning structured variational autoencoders (SVAEs) and is the first to demonstrate the SVAE's ability to handle multimodal uncertainty under missing data by incorporating discrete latent variables.
Results: A memory-efficient implicit differentiation scheme makes the SVAE tractable to learn via gradient descent while remaining robust to incomplete optimization, and a method for computing natural gradients without manual derivations enables faster learning of accurate graphical model parameters, avoiding biases found in prior work. These optimization innovations enable the first comparisons of the SVAE to state-of-the-art time series models, where it performs competitively while learning interpretable and structured discrete data representations.

By composing graphical models with deep learning architectures, we learn generative models with the strengths of both frameworks. The structured variational autoencoder (SVAE) inherits structure and interpretability from graphical models, and flexible likelihoods for high-dimensional data from deep learning, but poses substantial optimization challenges. We propose novel algorithms for learning SVAEs, and are the first to demonstrate the SVAE's ability to handle multimodal uncertainty when data is missing by incorporating discrete latent variables. Our memory-efficient implicit differentiation scheme makes the SVAE tractable to learn via gradient descent, while demonstrating robustness to incomplete optimization. To more rapidly learn accurate graphical model parameters, we derive a method for computing natural gradients without manual derivations, which avoids biases found in prior work. These optimization innovations enable the first comparisons of the SVAE to state-of-the-art time series models, where the SVAE performs competitively while learning interpretable and structured discrete data representations.

Geometry-Informed Neural Operator for Large-Scale 3D PDEs
Zongyi Li Nikola Borislavov Kovachki Chris Choy Boyi Li Jean Kossaifi Shourya Prakash Otta Mohammad Amin Nabian Maximilian Stadler Christian Hundt Kamyar Azizzadenesheli Anima Anandkumar



Research question: How to efficiently learn the solution operator of large-scale partial differential equations, particularly with varying geometries.
Motivation: Existing methods struggle to handle irregular grids and to perform efficient Fourier operations on them.
Method: The paper proposes the geometry-informed neural operator (GINO), a neural operator based on graph and Fourier architectures. GINO uses a signed distance function (SDF) representation of the input shape; its graph neural operator handles irregular grids and transforms them into and from regular latent grids, on which the Fourier neural operator can be applied efficiently.
Results: Experiments show that GINO predicts the pressure on car surfaces effectively, with a 26,000x speed-up over optimized GPU-based computational fluid dynamics (CFD) simulators. When tested on new combinations of geometries and boundary conditions, GINO achieves a one-fourth reduction in error rate compared with deep neural network approaches.

We propose the geometry-informed neural operator (GINO), a highly efficient approach to learning the solution operator of large-scale partial differential equations with varying geometries. GINO uses a signed distance function (SDF) representation of the input shape and neural operators based on graph and Fourier architectures to learn the solution operator. The graph neural operator handles irregular grids and transforms them into and from regular latent grids on which Fourier neural operator can be efficiently applied. We provide an efficient implementation of GINO using an optimized hashing approach, which allows efficient learning in a shared, compressed latent space with reduced computation and memory costs. GINO is discretization-invariant, meaning the trained model can be applied to arbitrary discretizations of the continuous domain and applies to any shape or resolution. To empirically validate the performance of our method on large-scale simulation, we generate the industry-standard aerodynamics dataset of 3D vehicle geometries with Reynolds numbers as high as five million. For such large-scale 3D fluid simulations, computing surface pressure with numerical methods is expensive. We successfully trained GINO to predict the pressure on car surfaces using only five hundred data points. The cost-accuracy experiments show a 26,000x speed-up compared to optimized GPU-based computational fluid dynamics (CFD) simulators on computing the drag coefficient. When tested on new combinations of geometries and boundary conditions (inlet velocities), GINO obtains a one-fourth reduction in error rate compared to deep neural network approaches.

An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions
Mohammad Jalali Cheuk Ting Li Farzan Farnia



Research question: How to evaluate the number of modes captured by a generative model of a multi-modal distribution.
Motivation: The correspondence between existing evaluation scores and the number of modes in the distribution is unclear, motivating a new evaluation method.
Method: The Rényi Kernel Entropy (RKE), an evaluation score based on quantum information theory, is proposed to measure the number of modes in generated samples.
Results: An extensive RKE-based evaluation of state-of-the-art generative models shows that, while recent models improve mode-based diversity, they remain incapable of capturing the full diversity of real data.

The evaluation of generative models has received significant attention in the machine learning community. When applied to a multi-modal distribution which is common among image datasets, an intuitive evaluation criterion is the number of modes captured by the generative model. While several scores have been proposed to evaluate the quality and diversity of a model's generated data, the correspondence between existing scores and the number of modes in the distribution is unclear. In this work, we propose an information-theoretic diversity evaluation method for multi-modal underlying distributions. We utilize the R\'enyi Kernel Entropy (RKE) as an evaluation score based on quantum information theory to measure the number of modes in generated samples. To interpret the proposed evaluation method, we show that the RKE score can output the number of modes of a mixture of sub-Gaussian components. We also prove estimation error bounds for estimating the RKE score from limited data, suggesting a fast convergence of the empirical RKE score to the score for the underlying data distribution. Utilizing the RKE score, we conduct an extensive evaluation of state-of-the-art generative models over standard image datasets. The numerical results indicate that while the recent algorithms for training generative models manage to improve the mode-based diversity over the earlier architectures, they remain incapable of capturing the full diversity of real data. Our empirical results provide a ranking of widely-used generative models based on the RKE score of their generated samples.
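
Under simple assumptions, an order-2 version of the score reduces to a Frobenius norm of a normalized kernel matrix. The sketch below is our illustrative reading, assuming a Gaussian kernel with $k(x,x)=1$, so that $K/n$ has unit trace and $\exp(H_2) = 1/\sum_i \lambda_i^2$ acts as an effective mode count; it is not the authors' implementation.

```python
# Kernel-based effective mode count in the spirit of RKE (illustrative).
import numpy as np

def rke_mode_count(X, bandwidth=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth**2)) / len(X)  # normalized: trace(K) = 1
    frob2 = (K * K).sum()                          # = sum of squared eigenvalues
    return 1.0 / frob2                             # exp(order-2 Renyi entropy)

rng = np.random.default_rng(0)
modes = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.concatenate([rng.normal(m, 0.1, size=(200, 2)) for m in modes])
print(rke_mode_count(X))  # close to 3 when clusters are tight vs. the bandwidth
```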

On Convergence of Polynomial Approximations to the Gaussian Mixture Entropy
Caleb Dahlke Jason Pacheco



Research question: This paper addresses uncertainty quantification for Gaussian mixture models (GMMs), whose differential entropy lacks a closed form.
Motivation: Although GMMs are fundamental to machine learning because of their flexibility as approximating densities, their uncertainty quantification remains a challenge.
Method: The paper explores polynomial approximations to the GMM entropy, specifically Taylor and Legendre series, from both theoretical and practical perspectives. It provides a new analysis of the widely used approach of Huber et al. (2008) and shows that the series diverges under simple conditions. Motivated by this divergence, a novel Taylor series is given that provably converges to the true entropy of any GMM, along with a method for selecting a center so that the series converges from below, yielding a lower bound on GMM entropy. Orthogonal polynomial series are also shown to produce more accurate polynomial approximations.
Results: Experimental validation supports the theoretical results while showing that the method is computationally comparable to that of Huber et al. In applications such as Nonparametric Variational Inference (Gershman et al., 2012), the use of these polynomial approximations relies on their convergence for computing accurate approximations. This work contributes useful analysis of existing methods while introducing a novel approximation supported by firm theoretical guarantees.

Gaussian mixture models (GMMs) are fundamental to machine learning due to their flexibility as approximating densities. However, uncertainty quantification of GMMs remains a challenge as differential entropy lacks a closed form. This paper explores polynomial approximations, specifically Taylor and Legendre, to the GMM entropy from a theoretical and practical perspective. We provide new analysis of a widely used approach due to Huber et al. (2008) and show that the series diverges under simple conditions. Motivated by this divergence we provide a novel Taylor series that is provably convergent to the true entropy of any GMM. We demonstrate a method for selecting a center such that the series converges from below, providing a lower bound on GMM entropy. Furthermore, we demonstrate that orthogonal polynomial series result in more accurate polynomial approximations. Experimental validation supports our theoretical results while showing that our method is comparable in computation to Huber et al. We also show that, in applications such as Nonparametric Variational Inference (Gershman et al., 2012), the use of these polynomial approximations relies on their convergence for computing accurate approximations. This work contributes useful analysis to existing methods while introducing a novel approximation supported by firm theoretical guarantees.
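
To see what a low-order Taylor surrogate looks like in practice, the sketch below compares a zeroth-order approximation in the style of Huber et al. (2008), keeping only the constant term of the expansion of $\log p$ around each component mean, against a Monte Carlo estimate of the true entropy. The paper's provably convergent series is more elaborate than this.

```python
# Zeroth-order Taylor surrogate for GMM entropy vs. Monte Carlo (illustrative).
import numpy as np
from scipy.stats import norm

w = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.0])

def gmm_pdf(x):
    return sum(wi * norm.pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))

# H0 = -sum_i w_i log p(mu_i): crude, cheap surrogate for H = -E[log p(x)].
H0 = -np.sum(w * np.log(gmm_pdf(mu)))

# Monte Carlo reference for the true differential entropy.
rng = np.random.default_rng(0)
comp = rng.choice(len(w), size=200_000, p=w)
samples = rng.normal(mu[comp], sigma[comp])
H_mc = -np.mean(np.log(gmm_pdf(samples)))
print(H0, H_mc)  # the surrogate is biased; the paper studies when series converge
```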

Scaling Riemannian Diffusion Models
Aaron Lou Minkai Xu Adam Farris Stefano Ermon



Research question: How to use Riemannian diffusion models for effective distribution learning in high-dimensional spaces.
Motivation: The geometric complexity of Riemannian diffusion models renders the diffusion transition term inexpressible in closed form, degrading performance and precluding high-dimensional applications.
Method: The paper revisits the existing approximations and proposes practical improvements, in particular exploiting the computational advantages of symmetric spaces to compute the relevant quantities quickly and to high precision.
Results: On low-dimensional datasets the correction produces a noticeable improvement and is competitive with other techniques; the method scales to high-dimensional tasks on nontrivial manifolds, such as SU(n) lattices in lattice quantum chromodynamics; and, applied to contrastively learned hyperspherical embeddings, it curbs the representation collapse problem in the projection head, closing the gap between theory and practice.

Riemannian diffusion models draw inspiration from standard Euclidean space diffusion models to learn distributions on general manifolds. Unfortunately, the additional geometric complexity renders the diffusion transition term inexpressible in closed form, so prior methods resort to imprecise approximations of the score matching training objective that degrade performance and preclude applications in high dimensions. In this work, we reexamine these approximations and propose several practical improvements. Our key observation is that most relevant manifolds are symmetric spaces, which are much more amenable to computation. By leveraging and combining various ansätze, we can quickly compute relevant quantities to high precision. On low dimensional datasets, our correction produces a noticeable improvement and is competitive with other techniques. Additionally, we show that our method enables us to scale to high dimensional tasks on nontrivial manifolds, including $SU(n)$ lattices in the context of lattice quantum chromodynamics (QCD). Finally, we apply our models to contrastively learned hyperspherical embeddings, curbing the representation collapse problem in the projection head and closing the gap between theory and practice.

Nearly Optimal VC-Dimension and Pseudo-Dimension Bounds for Deep Neural Network Derivatives
Yahong Yang Haizhao Yang Yang Xiang



Research question: Estimating nearly optimal Vapnik-Chervonenkis dimension (VC-dimension) and pseudo-dimension bounds for the derivative functions of deep neural networks (DNNs).
Motivation: Such estimations provide learning error bounds for a wide range of physics-informed machine learning models and applications, including generative models, solving partial differential equations, operator learning, network compression, distillation, and regularization.
Method: The paper derives nearly optimal VC-dimension and pseudo-dimension estimations for DNN derivative functions, with two key applications: establishing a nearly tight approximation result for DNNs in the Sobolev space, and characterizing the generalization error of machine learning methods whose loss functions involve function derivatives.
Results: The theoretical investigation fills the gap in learning error estimations for physics-informed machine learning models and applications.

This paper addresses the problem of nearly optimal Vapnik--Chervonenkis dimension (VC-dimension) and pseudo-dimension estimations of the derivative functions of deep neural networks (DNNs). Two important applications of these estimations include: 1) Establishing a nearly tight approximation result of DNNs in the Sobolev space; 2) Characterizing the generalization error of machine learning methods with loss functions involving function derivatives. This theoretical investigation fills the gap of learning error estimations for a wide range of physics-informed machine learning models and applications including generative models, solving partial differential equations, operator learning, network compression, distillation, regularization, etc.

BasisFormer: Attention-based Time Series Forecasting with Learnable and Interpretable Basis
Zelin Ni Hang Yu Shizhan Liu Jianguo Li Weiyao Lin



Research question: How to make bases serve simultaneously as feature extractors and as future references, improving deep learning-based time series forecasting.
Motivation: Current state-of-the-art methods are limited in their ability to satisfy both requirements at once.
Method: The paper proposes BasisFormer, an end-to-end time series forecasting architecture with learnable and interpretable bases, comprising three components: bases acquired through adaptive self-supervised learning; a Coef module that computes similarity coefficients between the time series and the bases in the historical view via bidirectional cross-attention; and a Forecast module that selects and consolidates the bases in the future view according to the similarity coefficients, producing accurate predictions.
Results: Extensive experiments on six datasets show that BasisFormer outperforms previous state-of-the-art methods by 11.04% and 15.78% on univariate and multivariate forecasting tasks, respectively.

Bases have become an integral part of modern deep learning-based models for time series forecasting due to their ability to act as feature extractors or future references. To be effective, a basis must be tailored to the specific set of time series data and exhibit distinct correlation with each time series within the set. However, current state-of-the-art methods are limited in their ability to satisfy both of these requirements simultaneously. To address this challenge, we propose BasisFormer, an end-to-end time series forecasting architecture that leverages learnable and interpretable bases. This architecture comprises three components: First, we acquire bases through adaptive self-supervised learning, which treats the historical and future sections of the time series as two distinct views and employs contrastive learning. Next, we design a Coef module that calculates the similarity coefficients between the time series and bases in the historical view via bidirectional cross-attention. Finally, we present a Forecast module that selects and consolidates the bases in the future view based on the similarity coefficients, resulting in accurate future predictions. Through extensive experiments on six datasets, we demonstrate that BasisFormer outperforms previous state-of-the-art methods by 11.04% and 15.78% respectively for univariate and multivariate forecasting tasks. Code is available at: https://github.com/nzl5116190/Basisformer.
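
The Coef module's role can be sketched as scaled dot-product cross-attention between series embeddings and basis embeddings (shapes and names below are ours, not the BasisFormer code):

```python
# Similarity coefficients between series and learnable bases (illustrative).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_series, n_bases, horizon = 16, 8, 4, 24
series_emb = rng.normal(size=(n_series, d))  # embedded historical series
basis_emb = rng.normal(size=(n_bases, d))    # embedded learnable bases

# Each series attends over the bases: one row of coefficients per series.
coef = softmax(series_emb @ basis_emb.T / np.sqrt(d), axis=-1)   # (8, 4)

# Forecast sketch: combine future views of the bases with the coefficients.
basis_future = rng.normal(size=(n_bases, horizon))
forecast = coef @ basis_future                                   # (8, 24)
```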

Finite Population Regression Adjustment and Non-asymptotic Guarantees for Treatment Effect Estimation
Mehrdad Ghadiri David Arbour Tung Mai Cameron N Musco Anup Rao



Research question: This paper studies regression adjustment for randomized experiments in finite populations, for estimating the sample mean, individual treatment effects, and the average treatment effect.
Motivation: Existing statistical work mostly assumes the entire population is exposed to the experiment, whereas in practice researchers often seek to minimize the number of subjects exposed, for ethical and pragmatic reasons.
Method: The paper proposes approaches that use techniques from randomized numerical linear algebra to sample a subset of the population on which to run the experiment, and gives non-asymptotic accuracy bounds.
Results: Experiments show that the methods compare favorably with prior approaches.

The design and analysis of randomized experiments is fundamental to many areas, from the physical and social sciences to industrial settings. Regression adjustment is a popular technique to reduce the variance of estimates obtained from experiments, by utilizing information contained in auxiliary covariates. While there is a large literature within the statistics community studying various approaches to regression adjustment and their asymptotic properties, little focus has been given to approaches in the finite population setting with non-asymptotic accuracy bounds. Further, prior work typically assumes that an entire population is exposed to an experiment, whereas practitioners often seek to minimize the number of subjects exposed to an experiment, for ethical and pragmatic reasons. In this work, we study the problems of estimating the sample mean, individual treatment effects, and average treatment effect with regression adjustment. We propose approaches that use techniques from randomized numerical linear algebra to sample a subset of the population on which to perform an experiment. We give non-asymptotic accuracy bounds for our methods and demonstrate that they compare favorably with prior approaches.

Neural Ideal Large Eddy Simulation: Modeling Turbulence with Neural Stochastic Differential Equations
Anudhyan Boral Zhong Yi Wan Leonardo Zepeda-Nunez James Lottes Qing Wang Yi-Fan Chen John Roberts Anderson Fei Sha



Research question: How to effectively combine ideal large eddy simulation and neural stochastic differential equations for data-driven learning.
Motivation: Ideal large eddy simulation models turbulence closure but is analytically intractable; neural stochastic differential equations enable stochastic modeling, in contrast to closure parameterizations that treat each trajectory as a deterministic realization of the dynamics.
Method: A latent neural SDE models the evolution of the stochastic process, with an encoder-decoder pair transforming between the latent space and the desired ideal flow field.
Results: The approach is effective on two challenging chaotic dynamical systems, handles non-uniform geometries seamlessly, and, compared with competing methods, yields trajectories with more accurate statistics and enhanced stability.

We introduce a data-driven learning framework that assimilates two powerful ideas: ideal large eddy simulation (LES) from turbulence closure modeling and neural stochastic differential equations (SDE) for stochastic modeling. The ideal LES models the LES flow by treating each full-order trajectory as a random realization of the underlying dynamics; as such, the effect of small scales is marginalized to obtain the deterministic evolution of the LES state. However, ideal LES is analytically intractable. In our work, we use a latent neural SDE to model the evolution of the stochastic process and an encoder-decoder pair for transforming between the latent space and the desired ideal flow field. This stands in sharp contrast to other types of neural parameterization of closure models where each trajectory is treated as a deterministic realization of the dynamics. We show the effectiveness of our approach (niLES – neural ideal LES) on two challenging chaotic dynamical systems: Kolmogorov flow at a Reynolds number of 20,000 and flow past a cylinder at Reynolds number 500. Compared to competing methods, our method can handle non-uniform geometries using unstructured meshes seamlessly. In particular, niLES leads to trajectories with more accurate statistics and enhances stability, particularly for long-horizon rollouts. (Source codes and datasets will be made publicly available.)

Adaptive Linear Estimating Equations
Mufang Ying Koulik Khamaru Cun-Hui Zhang



Research question: This paper addresses the complications that sequential data collection introduces into statistical inference, even as it improves the efficiency of data gathering.
Motivation: Despite its advantages, sequential data collection often complicates inference; for example, in adaptive linear regression models the ordinary least squares (OLS) estimator can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation.
Method: The paper proposes a general method for constructing debiased estimators based on the idea of adaptive linear estimating equations, establishes guarantees of asymptotic normality, and discusses how to achieve near-optimal asymptotic variance.
Results: In multi-armed bandits, the estimator retains the non-asymptotic performance of the least squares estimator while attaining asymptotic normality, connecting the concentration-based and asymptotic-normality paradigms of adaptive inference.

Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such a data collection mechanism often introduces complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing a debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that in the context of multi-armed bandits, our estimator retains the non-asymptotic performance of the least squares estimator while obtaining the asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.

Score-based Source Separation with Applications to Digital Communication Signals
Tejas Jayashankar Gary C.F. Lee Alejandro Lancho Amir Weiss Yury Polyanskiy Gregory Wornell



Research question: Propose a new method for separating superimposed sources with diffusion-based generative models.
Motivation: Motivated by radio-frequency (RF) systems, where the underlying sources are discrete and the goal is to recover encoded bits from a signal of interest, measured by the bit error rate (BER).
Method: The method relies only on separately trained statistical priors of independent sources, building a new objective via maximum a posteriori estimation with an α-posterior across multiple levels of Gaussian smoothing.
Results: On RF mixtures, the method achieves a 95% BER reduction over classical and existing learning-based methods; analysis shows its solutions asymptotically approach the modes of the underlying discrete distribution. It can also be viewed as a multi-source extension of the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. Project webpage: https://alpha-rgs.github.io.

We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by $\textit{maximum a posteriori}$ estimation with an $\textit{$\alpha$-posterior}$, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95\% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io.
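
To make the α-posterior objective concrete, the toy below performs MAP separation of a two-component mixture with analytic Gaussian priors standing in for the learned diffusion priors; the real method optimizes against score models across multiple Gaussian smoothing levels, so treat every name and constant here as an assumption:

```python
import numpy as np

def separate(y, mu1, var1, mu2, var2, alpha=1.0, lr=0.1, steps=500):
    """MAP separation of y = s1 + s2: gradient descent on
    -log p1(s1) - alpha * log p2(y - s1) with Gaussian priors p1, p2.
    alpha > 1 upweights the interference prior (the alpha-posterior idea)."""
    s1 = y / 2.0
    for _ in range(steps):
        grad = (s1 - mu1) / var1 - alpha * (y - s1 - mu2) / var2
        s1 = s1 - lr * grad
    return s1, y - s1

y = np.array([1.0, 2.5, 0.3])
s1, s2 = separate(y, mu1=0.0, var1=1.0, mu2=1.0, var2=0.5, alpha=2.0)
print(s1 + s2)  # reconstructs y exactly by construction
```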

Generalized Belief Transport
Junqi Wang PEI WANG Patrick Shafto



Research question: How to understand the relationships between learning models in order to build agents that can switch between different modes of learning.
Motivation: Existing learning models are typically considered in isolation rather than in relation to one another; understanding how they relate is essential for building agents that move between learning modes.
Method: Introduces a mathematical framework, Generalized Belief Transport (GBT), that unifies existing models including Bayesian inference, cooperative communication, and classification as parameterizations of three learning constraints within Unbalanced Optimal Transport (UOT).
Results: GBT visualizes the space of learning models and is proven continuous and differentiable, laying the groundwork for model interpolation. The paper also studies limiting behavior, explores convergence properties of models within GBT, demonstrates the ability to learn under distribution drift, formulates conjectures about general behavior, and closes with open questions and implications for more unified models of learning.

Human learners have the ability to adopt appropriate learning approaches depending on constraints such as a prior on the hypothesis, the urgency of the decision, and drift of the environment. However, existing learning models are typically considered individually rather than in relation to one another. To build agents that have the ability to move between different modes of learning over time, it is important to understand how learning models are related as points in a broader space of possibilities. We introduce a mathematical framework, Generalized Belief Transport (GBT), that unifies and generalizes prior models, including Bayesian inference, cooperative communication and classification, as parameterizations of three learning constraints within Unbalanced Optimal Transport (UOT). We visualize the space of learning models encoded by GBT as a cube which includes classic learning models as special points. We derive critical properties of this parameterized space, including proofs of continuity and differentiability which are the basis for model interpolation, and study the limiting behavior of the parameters, which allows attaching learning models on the boundaries. Moreover, we investigate the long-run behavior of GBT, explore convergence properties of models in GBT mathematically and computationally, document the ability to learn in the presence of distribution drift, and formulate conjectures about general behavior. We conclude with open questions and implications for more unified models of learning.

Sequential Predictive Two-Sample and Independence Testing
Aleksandr Podkopaev Aaditya Ramdas



Research question: This paper studies sequential nonparametric two-sample and independence testing.
Motivation: For high-dimensional or structured data such as images, choosing a suitable kernel for existing sequential tests is often difficult.
Method: The paper designs prediction-based betting strategies relying on the fact that if a sequentially updated predictor begins to consistently determine (a) which distribution an instance was drawn from, or (b) whether an instance was drawn from the joint distribution or the product of the marginals (the latter produced by external randomization), this provides evidence against the two-sample or independence null, respectively.
Results: Experiments show the tests outperform kernel-based methods in structured settings, and they remain valid and powerful even when the data distribution drifts over time.

We study the problems of sequential nonparametric two-sample and independence testing. Sequential tests process data online and allow using observed data to decide whether to stop and reject the null hypothesis or to collect more data, while maintaining type I error control. We build upon the principle of (nonparametric) testing by betting, where a gambler places bets on future observations and their wealth measures evidence against the null hypothesis. While recently developed kernel-based betting strategies often work well on simple distributions, selecting a suitable kernel for high-dimensional or structured data, such as images, is often nontrivial. To address this drawback, we design prediction-based betting strategies that rely on the following fact: if a sequentially updated predictor starts to consistently determine (a) which distribution an instance is drawn from, or (b) whether an instance is drawn from the joint distribution or the product of the marginal distributions (the latter produced by external randomization), it provides evidence against the two-sample or independence nulls respectively. We empirically demonstrate the superiority of our tests over kernel-based approaches under structured settings. Our tests can be applied beyond the case of independent and identically distributed data, remaining valid and powerful even when the data distribution drifts over time.
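
The betting mechanism reduces to a very small loop. Below is a hedged sketch for the two-sample case: under the null, with labels assigned by a fair coin, the wealth process is a nonnegative martingale, so stopping when it exceeds 1/α controls the type-I error by Ville's inequality. The fixed bet fraction and the `predict_proba`/`update` interface are simplifying assumptions (the paper's strategies tune the bets):

```python
def sequential_two_sample_test(stream, predict_proba, update, alpha=0.05):
    """stream yields (x, label) pairs with label in {0, 1} indicating the
    sample of origin; predict_proba(x) returns P(label = 1); update(x, label)
    trains the predictor online. Returns (reject?, final wealth)."""
    wealth = 1.0
    for x, label in stream:
        p = predict_proba(x)
        payoff = (2 * label - 1) * (2 * p - 1)  # in [-1, 1]; > 0 when correct
        wealth *= 1.0 + 0.5 * payoff            # bet half the wealth each round
        update(x, label)
        if wealth >= 1.0 / alpha:
            return True, wealth                  # evidence against the null
    return False, wealth
```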

Estimating Noise Correlations Across Continuous Conditions With Wishart Processes
Amin Nejatbakhsh Isabel Garon Alex H Williams



Research question: How to accurately estimate the noise covariance of neural populations, especially when trials per condition are limited.
Motivation: Existing noise covariance estimators require many stereotyped trials and perform poorly for naturalistic behaviors and sensory experiences.
Method: Exploits the fact that conditions are smoothly parameterized in many experiments, using Wishart process models to pool statistical power from trials in neighboring conditions.
Results: The method performs well on experimental data from mouse visual cortex and monkey motor cortex, produces covariance estimates that vary smoothly with stimulus parameters, estimates noise correlations for entirely unseen conditions, and yields continuous estimates of Fisher information, paving the way toward understanding the role of noise in complex neural computations and behavior.

The signaling capacity of a neural population depends on the scale and orientation of its covariance across trials. Estimating this "noise" covariance is challenging and is thought to require a large number of stereotyped trials. New approaches are therefore needed to interrogate the structure of neural noise across rich, naturalistic behaviors and sensory experiences, with few trials per condition. Here, we exploit the fact that conditions are smoothly parameterized in many experiments and leverage Wishart process models to pool statistical power from trials in neighboring conditions. We demonstrate that these models perform favorably on experimental data from the mouse visual cortex and monkey motor cortex relative to standard covariance estimators. Moreover, they produce smooth estimates of covariance as a function of stimulus parameters, enabling estimates of noise correlations in entirely unseen conditions as well as continuous estimates of Fisher information—a commonly used measure of signal fidelity. Together, our results suggest that Wishart processes are broadly applicable tools for quantification and uncertainty estimation of noise correlations in trial-limited regimes, paving the way toward understanding the role of noise in complex neural computations and behavior.

Uncertainty-Aware Instance Reweighting for Off-Policy Learning
Xiaoying Zhang Junpu Chen Hongning Wang Hong Xie Yang Liu John C.S. Lui Hang Li



Research question: This paper addresses the bias and variance in estimates of the logging policy used for off-policy learning, and the negative effects they cause.
Motivation: Off-policy learning has proven important in many real-world applications such as search engines and recommender systems, but since the ground-truth logging policy is usually unknown, prior work simply plugs in its estimate, ignoring the high bias and high variance such an estimator introduces.
Method: The paper proposes an Uncertainty-aware Inverse Propensity Score estimator (UIPS) that explicitly models the uncertainty in the estimated logging policy to improve off-policy learning.
Results: Experiments on synthetic and real-world recommendation datasets show that UIPS significantly improves the quality of the discovered policy compared with an extensive list of state-of-the-art baselines.

Off-policy learning, referring to the procedure of policy optimization with access only to logged feedback data, has shown importance in various real-world applications, such as search engines and recommender systems. While the ground-truth logging policy is usually unknown, previous work simply takes its estimated value for off-policy learning, ignoring the negative impact of both the high bias and the high variance resulting from such an estimator. This impact is often magnified on samples with small and inaccurately estimated logging probabilities. The contribution of this work is to explicitly model the uncertainty in the estimated logging policy and propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning, with a theoretical convergence guarantee. Experiment results on synthetic and real-world recommendation datasets demonstrate that UIPS significantly improves the quality of the discovered policy when compared against an extensive list of state-of-the-art baselines.
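
As a caricature of the idea (not the UIPS estimator itself), the sketch below shrinks importance weights according to a hypothetical posterior mean and variance of the logging propensities, so that samples with uncertain, small propensity estimates contribute less; the shrinkage form and all names are illustrative assumptions:

```python
import numpy as np

def shrunken_ips_value(rewards, pi_target, prop_mean, prop_var, lam=1.0):
    """Off-policy value estimate with uncertainty-shrunken importance weights.
    rewards, pi_target, prop_mean, prop_var: (n,) arrays; prop_mean/prop_var
    describe a posterior over the unknown logging propensities."""
    w = pi_target / prop_mean                     # plug-in importance weights
    rel_uncertainty = prop_var / prop_mean**2     # large when estimate is shaky
    shrink = 1.0 / (1.0 + lam * rel_uncertainty)  # tame high-variance weights
    return np.mean(w * shrink * rewards)
```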

DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets
Lazar Atanackovic Alexander Tong BO WANG Leo J Lee Yoshua Bengio Jason Hartford



Research question: A grand challenge of cell biology is inferring the gene regulatory network (GRN), which describes the interactions between genes and their products that control gene expression and cellular function.
Motivation: Regulatory networks are inherently cyclic and observations carry significant measurement noise, so existing methods focus either on identifying cyclic structure from dynamics (challenge 1) or on learning complex Bayesian posteriors over directed acyclic graphs (challenge 2), but not both.
Method: The paper uses RNA velocity techniques, which estimate the "velocity" of gene expression, to develop an approach addressing both challenges: with velocity information, Bayesian structure learning becomes sparse identification of a dynamical system, capturing cyclic feedback loops through time.
Results: Generative Flow Networks (GFlowNets) estimate the posterior over the combinatorial space of possible sparse dependencies; results show the learned posteriors better encapsulate distributions over cyclic structures than state-of-the-art Bayesian structure learning baselines.

One of the grand challenges of cell biology is inferring the gene regulatory network (GRN) which describes interactions between genes and their products that control gene expression and cellular function. We can treat this as a causal discovery problem but with two non-standard challenges: (1) regulatory networks are inherently cyclic so we should not model a GRN as a directed acyclic graph (DAG), and (2) observations have significant measurement noise so for typical sample sizes, there will always be a large equivalence class of graphs that are likely given the data, and we want methods that capture this uncertainty. Existing methods either focus on challenge (1), identifying cyclic structure from dynamics, or on challenge (2) learning complex Bayesian posteriors over directed acyclic graphs, but not both. In this paper we leverage the fact that it is possible to estimate the ``velocity'' of the expression of a gene with RNA velocity techniques to develop an approach that addresses both challenges. Because we have access to velocity information, we can treat the Bayesian structure learning problem as a problem of sparse identification of a dynamical system, capturing cyclic feedback loops through time. We leverage Generative Flow Networks (GFlowNets) to estimate the posterior distribution over the combinatorial space of possible sparse dependencies. Our results indicate that our method learns posteriors that better encapsulate the distributions of cyclic structures compared to counterpart state-of-the-art Bayesian structure learning approaches.

Practical and Asymptotically Exact Conditional Sampling in Diffusion Models
Luhuan Wu Brian L. Trippe Christian A Naesseth David Blei John Patrick Cunningham



Research question: Diffusion models have succeeded on a range of conditional generation tasks, but these achievements largely rely on task-specific conditional training or error-prone heuristic approximations; an ideal conditional generation method would provide exact samples for a broad range of conditional distributions without task-specific training.
Motivation: To address this, the authors propose the Twisted Diffusion Sampler (TDS), a sequential Monte Carlo (SMC) algorithm targeting the conditional distributions of diffusion models.
Method: The main idea of TDS is to use twisting, an SMC technique with good computational efficiency, to incorporate heuristic approximations without compromising asymptotic exactness.
Results: On simulations and conditional image generation tasks, TDS offers a computational-statistical trade-off: many particles yield more accurate approximations, yet with as few as two particles it already improves over heuristics. On motif scaffolding, a core task in protein design, a TDS extension to Riemannian diffusion models allows flexible conditioning criteria and often outperforms the state-of-the-art conditionally trained model on benchmark test cases.

Diffusion models have been successful on a range of conditional generation tasks including molecular design and text-to-image generation. However, these achievements have primarily depended on task-specific conditional training or error-prone heuristic approximations. Ideally, a conditional generation method should provide exact samples for a broad range of conditional distributions without requiring task-specific training. To this end, we introduce the Twisted Diffusion Sampler, or TDS. TDS is a sequential Monte Carlo (SMC) algorithm that targets the conditional distributions of diffusion models. The main idea is to use twisting, an SMC technique that enjoys good computational efficiency, to incorporate heuristic approximations without compromising asymptotic exactness. We first find in simulation and in conditional image generation tasks that TDS provides a computational-statistical trade-off, yielding more accurate approximations with many particles but with empirical improvements over heuristics with as few as two particles. We then turn to motif-scaffolding, a core task in protein design, using a TDS extension to Riemannian diffusion models; on benchmark test cases, TDS allows flexible conditioning criteria and often outperforms the state-of-the-art, conditionally trained model.

OneNet: Enhancing Time Series Forecasting Models under Concept Drift by Online Ensembling
YiFan Zhang Qingsong Wen Xue Wang Weiqi Chen Liang Sun Zhang Zhang Liang Wang Rong Jin Tieniu Tan



Research question: This paper addresses concept drift in time series forecasting by efficiently updating forecasting models from streaming data.
Motivation: Many algorithms have been designed for online time series forecasting, some exploiting cross-variable dependency and others assuming independence among variables; since each data assumption has its own pros and cons in online time series modeling, the authors propose the Online ensembling Network (OneNet).
Method: OneNet dynamically updates and combines two models, one modeling dependency along the time dimension and the other cross-variable dependency, and incorporates a reinforcement-learning-based approach into the traditional online convex programming framework to combine the two models linearly with dynamically adjusted weights.
Results: Empirically, OneNet reduces online forecasting error by more than 50% compared with the state-of-the-art method.

Online updating of time series forecasting models aims to address the concept drifting problem by efficiently updating forecasting models based on streaming data. Many algorithms are designed for online time series forecasting, with some exploiting cross-variable dependency while others assume independence among variables. Given every data assumption has its own pros and cons in online time series modeling, we propose **On**line **e**nsembling **Net**work (**OneNet**). It dynamically updates and combines two models, with one focusing on modeling the dependency across the time dimension and the other on cross-variate dependency. Our method incorporates a reinforcement learning-based approach into the traditional online convex programming framework, allowing for the linear combination of the two models with dynamically adjusted weights. OneNet addresses the main shortcoming of classical online learning methods that tend to be slow in adapting to the concept drift. Empirical results show that OneNet reduces online forecasting error by more than $\mathbf{50}\%$ compared to the State-Of-The-Art (SOTA) method.
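
The combination step can be sketched with a simple exponentiated-gradient update over the two experts' streaming losses; OneNet itself learns the weights with a reinforcement-learning component, so the update rule, model interface, and step size below are placeholders:

```python
import numpy as np

def online_ensemble(stream, model_time, model_var, eta=1.0):
    """Combine a time-dimension forecaster and a cross-variable forecaster
    online. Each model exposes forecast(x) and update(x, y) (assumed API)."""
    logw = np.zeros(2)
    for x, y in stream:
        preds = np.array([model_time.forecast(x), model_var.forecast(x)])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        yhat = float(w @ preds)            # weighted ensemble forecast
        logw -= eta * (preds - y) ** 2     # favor the recently better model
        model_time.update(x, y)
        model_var.update(x, y)
        yield yhat, w
```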

Exploring and Interacting with the Set of Good Sparse Generalized Additive Models
Chudi Zhong Zhi Chen Jiachang Liu Margo Seltzer Cynthia Rudin



Research question: This paper addresses the difficulty of interaction between machine learning models and domain experts by approximating and exploring the Rashomon set (the set of all near-optimal models), giving users a searchable space containing diverse models.
Motivation: The classical machine learning paradigm usually produces only a single model, which hinders interaction with domain experts; approximating and exploring the Rashomon set addresses this.
Method: The paper proposes algorithms that efficiently and accurately approximate the Rashomon set of sparse generalized additive models with ellipsoids for fixed support sets, and uses these ellipsoids to approximate Rashomon sets for many different support sets.
Results: Experiments show the approximated Rashomon set is highly faithful and effective for practical challenges such as studying variable importance for the model class, finding models that satisfy user-specified constraints, and investigating sudden changes in shape functions.

In real applications, interaction between machine learning models and domain experts is critical; however, the classical machine learning paradigm that usually produces only a single model does not facilitate such interaction. Approximating and exploring the Rashomon set, i.e., the set of all near-optimal models, addresses this practical challenge by providing the user with a searchable space containing a diverse set of models from which domain experts can choose. We present algorithms to efficiently and accurately approximate the Rashomon set of sparse, generalized additive models with ellipsoids for fixed support sets and use these ellipsoids to approximate Rashomon sets for many different support sets. The approximated Rashomon set serves as a cornerstone to solve practical challenges such as (1) studying the variable importance for the model class; (2) finding models under user-specified constraints (monotonicity, direct editing); and (3) investigating sudden changes in the shape functions. Experiments demonstrate the fidelity of the approximated Rashomon set and its effectiveness in solving practical challenges.

Conformal PID Control for Time Series Prediction
Anastasios Nikolas Angelopoulos Emmanuel Candes Ryan Tibshirani



Research question: Uncertainty quantification for time series prediction, with the goal of providing easy-to-use algorithms with formal guarantees.
Motivation: Existing online prediction methods cannot adapt to systematic errors due to seasonality, trends, and general distribution shifts.
Method: Building on ideas from conformal prediction and control theory, the paper constructs algorithms that prospectively model conformal scores in an online setting.
Results: On 4-week-ahead forecasting of COVID-19 death counts in the U.S., the algorithm's coverage improves over the ensemble forecaster used by the CDC; it also performs well on predicting electricity demand, market returns, and temperature.

We study the problem of uncertainty quantification for time series prediction, with the goal of providing easy-to-use algorithms with formal guarantees. The algorithms we present build upon ideas from conformal prediction and control theory, are able to prospectively model conformal scores in an online setting, and adapt to the presence of systematic errors due to seasonality, trends, and general distribution shifts. Our theory both simplifies and strengthens existing analyses in online conformal prediction. Experiments on 4-week-ahead forecasting of statewide COVID-19 death counts in the U.S. show an improvement in coverage over the ensemble forecaster used in official CDC communications. We also run experiments on predicting electricity demand, market returns, and temperature using autoregressive, Theta, Prophet, and Transformer models. We provide an extendable codebase for testing our methods and for the integration of new algorithms, data sets, and forecasting rules at [this link](http://github.com/aangelopoulos/conformal-time-series).
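
The proportional part of the controller is easy to state: nudge the conformal threshold up after a miscoverage and down after a cover, so the long-run miss rate tracks the target level α. The sketch below shows only this P-term (the paper adds integral and derivative/scorecasting terms on top); variable names are ours:

```python
import numpy as np

def online_conformal_thresholds(scores, alpha=0.1, eta=0.05, q0=1.0):
    """scores: conformal score of each arriving point. Returns the threshold
    used at each step; the prediction set at step t is {y : score(y) <= q_t}."""
    q, history = q0, []
    for s in scores:
        history.append(q)
        err = float(s > q)        # 1 if the current set missed the truth
        q += eta * (err - alpha)  # proportional control toward miss rate alpha
    return np.array(history)
```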

Training neural operators to preserve invariant measures of chaotic attractors
Ruoxi Jiang Peter Y. Lu Elena Orlova Rebecca Willett



Research question: This paper addresses the difficulty of long-horizon forecasting in chaotic systems, where small perturbations of the initial conditions make trajectories diverge at an exponential rate.
Motivation: Neural operators trained to minimize squared error losses make accurate short-term forecasts but fail to reproduce statistical or structural properties of the dynamics over long horizons and can yield degenerate results.
Method: The paper proposes an alternative framework that preserves invariant measures of chaotic attractors, which characterize the time-invariant statistical properties of the dynamics. Specifically, in the multi-environment setting (each sample trajectory governed by slightly different dynamics), two new approaches to training with noisy data are considered. The first proposes a loss based on the optimal transport distance between the observed dynamics and the neural operator outputs, which requires expert knowledge of which statistical features to include. The second shows that a contrastive learning framework, requiring no specialized prior knowledge, preserves the statistical properties of the dynamics nearly as well as the optimal transport approach.
Results: On a variety of chaotic systems, the method is shown empirically to preserve invariant measures of chaotic attractors.

Chaotic systems make long-horizon forecasts difficult because small perturbations in initial conditions cause trajectories to diverge at an exponential rate. In this setting, neural operators trained to minimize squared error losses, while capable of accurate short-term forecasts, often fail to reproduce statistical or structural properties of the dynamics over longer time horizons and can yield degenerate results. In this paper, we propose an alternative framework designed to preserve invariant measures of chaotic attractors that characterize the time-invariant statistical properties of the dynamics. Specifically, in the multi-environment setting (where each sample trajectory is governed by slightly different dynamics), we consider two novel approaches to training with noisy data. First, we propose a loss based on the optimal transport distance between the observed dynamics and the neural operator outputs. This approach requires expert knowledge of the underlying physics to determine what statistical features should be included in the optimal transport loss. Second, we show that a contrastive learning framework, which does not require any specialized prior knowledge, can preserve statistical properties of the dynamics nearly as well as the optimal transport approach. On a variety of chaotic systems, our method is shown empirically to preserve invariant measures of chaotic attractors.
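
In one dimension, the optimal transport distance between empirical distributions of rollout statistics has a closed form (mean absolute difference of sorted samples), which is enough to sketch the training signal; the statistics chosen below are arbitrary stand-ins for the expert-selected features the paper requires:

```python
import numpy as np

def w1_empirical(a, b):
    """Wasserstein-1 distance between equal-size 1-D empirical samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def invariant_measure_loss(true_traj, pred_traj, stat_fns):
    """Compare long-run statistics of (T, d) trajectories under a list of
    scalar summary functions, mimicking the optimal-transport objective."""
    return sum(w1_empirical(f(true_traj), f(pred_traj)) for f in stat_fns)

stats = [lambda tr: tr.ravel(),               # pointwise value distribution
         lambda tr: (tr ** 2).sum(axis=1)]    # per-step energy distribution
loss = invariant_measure_loss(np.random.randn(1000, 3),
                              np.random.randn(1000, 3), stats)
```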

Gaussian Process Probes (GPP) for Uncertainty-Aware Probing
Zi Wang Alexander Ku Jason Michael Baldridge Thomas L. Griffiths Been Kim



Research question: How to understand and evaluate a model's ability to represent concepts, including whether it can represent certain concepts and how certain it is about them.
Motivation: Understanding which concepts a model can represent underlies many tasks, from effective and responsible use of models to detecting out-of-distribution data.
Method: Introduces Gaussian process probes (GPP), a unified and simple framework for probing and measuring uncertainty about the concepts a model represents. As a Bayesian extension of linear probing, GPP asks what distribution over classifiers (of concepts) the model induces; this distribution measures both what the model represents and how confident the probe is about it.
Results: Experiments show GPP can probe a model's concept representations from very few examples, accurately measures both epistemic uncertainty (the probe's confidence) and aleatory uncertainty (how fuzzy the concepts are to the model), and detects out-of-distribution data using these uncertainty measures as well as classic methods do. By using Gaussian processes to expand what probing can offer, GPP is a data-efficient, versatile, and uncertainty-aware tool for understanding and evaluating machine learning models.

Understanding which concepts models can and cannot represent has been fundamental to many tasks: from effective and responsible use of models to detecting out of distribution data. We introduce Gaussian process probes (GPP), a unified and simple framework for probing and measuring uncertainty about concepts represented by models. As a Bayesian extension of linear probing methods, GPP asks what kind of distribution over classifiers (of concepts) is induced by the model. This distribution can be used to measure both what the model represents and how confident the probe is about what the model represents. GPP can be applied to any pre-trained model with vector representations of inputs (e.g., activations). It does not require access to training data, gradients, or the architecture. We validate GPP on datasets containing both synthetic and real images. Our experiments show it can (1) probe a model's representations of concepts even with a very small number of examples, (2) accurately measure both epistemic uncertainty (how confident the probe is) and aleatory uncertainty (how fuzzy the concepts are to the model), and (3) detect out of distribution data using those uncertainty measures as well as classic methods do. By using Gaussian processes to expand what probing can offer, GPP provides a data-efficient, versatile and uncertainty-aware tool for understanding and evaluating the capabilities of machine learning models.

GAUCHE: A Library for Gaussian Processes in Chemistry
Ryan-Rhys Griffiths Leo Klarner Henry Moss Aditya Ravuri Sang T. Truong Yuanqi Du Samuel Don Stanton Gary Tom Bojana Ranković Arian Rokkum Jamasb Aryan Deshwal Julius Schwartz Austin Tripp Gregory Kell Simon Frieder Anthony Bourached Alex James Chan Jacob Moss Chengzhi Guo Johannes P. Dürholt Saudamini Chaurasia Ji Won Park Felix Strieth-Kalthoff Alpha Lee Bingqing Cheng Alan Aspuru-Guzik Philippe Schwaller Jian Tang



Research question: This paper develops GAUCHE, an open-source library for Gaussian processes in chemistry.
Motivation: Gaussian processes are a cornerstone of probabilistic machine learning, with particular advantages for uncertainty quantification and Bayesian optimisation; extending them to molecular representations, however, requires kernels defined over structured inputs such as graphs, strings, and bit vectors.
Method: By providing such kernels in a modular, robust, and easy-to-use framework, the library aims to let expert chemists and materials scientists use state-of-the-art black-box optimization techniques.
Results: Motivated by scenarios frequently encountered in practice, the paper showcases GAUCHE on molecular discovery, chemical reaction optimisation, and protein design. The codebase is available at https://github.com/leojklarner/gauche.

We introduce GAUCHE, an open-source library for GAUssian processes in CHEmistry. Gaussian processes have long been a cornerstone of probabilistic machine learning, affording particular advantages for uncertainty quantification and Bayesian optimisation. Extending Gaussian processes to molecular representations, however, necessitates kernels defined over structured inputs such as graphs, strings and bit vectors. By providing such kernels in a modular, robust and easy-to-use framework, we seek to enable expert chemists and materials scientists to make use of state-of-the-art black-box optimization techniques. Motivated by scenarios frequently encountered in practice, we showcase applications for GAUCHE in molecular discovery, chemical reaction optimisation and protein design. The codebase is made available at https://github.com/leojklarner/gauche.
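
For flavor, here is a self-contained numpy re-implementation of one such structured-input kernel (Tanimoto similarity over binary fingerprints) plugged into a plain GP posterior mean; this is not GAUCHE's API, just the kind of kernel it packages:

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of binary fingerprint matrices."""
    inter = A @ B.T
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def gp_posterior_mean(K_train, K_cross, y, noise=1e-2):
    alpha = np.linalg.solve(K_train + noise * np.eye(len(y)), y)
    return K_cross @ alpha

rng = np.random.default_rng(0)
X = (rng.random((50, 128)) < 0.1).astype(float)   # stand-in fingerprints
y = X[:, 0] + 0.1 * rng.normal(size=50)           # toy property to regress
X_test = (rng.random((5, 128)) < 0.1).astype(float)
mu = gp_posterior_mean(tanimoto_kernel(X, X), tanimoto_kernel(X_test, X), y)
```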

Learning Nonparametric Latent Causal Graphs with Unknown Interventions
Yibo Jiang Bryon Aragam



Research question: How to reconstruct latent causal graphs from unknown interventions and identify the latent structure of measurement models without parametric assumptions such as linearity or Gaussianity.
Motivation: Extends a recent line of work on learning causal representations from observations and interventions with a new way to handle unknown interventions and nonparametric latent structure.
Method: Introduces two new graphical concepts, "imaginary subsets" and "isolated edges", to establish conditions under which latent causal graphs can be reconstructed from unknown interventions in the latent space.
Results: The first results to characterize when causal representations are identifiable in a general setting without any parametric assumptions and without faithfulness, broadening our understanding of latent structure.

We establish conditions under which latent causal graphs are nonparametrically identifiable and can be reconstructed from unknown interventions in the latent space. Our primary focus is the identification of the latent structure in measurement models without parametric assumptions such as linearity or Gaussianity. Moreover, we do not assume the number of hidden variables is known, and we show that at most one unknown intervention per hidden variable is needed. This extends a recent line of work on learning causal representations from observations and interventions. The proofs are constructive and introduce two new graphical concepts---_imaginary subsets_ and _isolated edges_---that may be useful in their own right. As a matter of independent interest, the proofs also involve a novel characterization of the limits of edge orientations within the equivalence class of DAGs induced by _unknown_ interventions. These are the first results to characterize the conditions under which causal representations are identifiable without making any parametric assumptions in a general setting with unknown interventions and without faithfulness.

Contextual Gaussian Process Bandits with Neural Networks
Haoting Zhang Jinghai He Rhonda Righter Zuo-Jun Shen Zeyu Zheng



Research question: How to choose a suitable surrogate model to capture unknown, complicated reward functions in contextual decision-making problems such as online content recommendation, personalized healthcare, and autonomous driving.
Motivation: Practical applications demand both high approximation accuracy and explicit uncertainty quantification.
Method: Proposes a neural network-accompanied Gaussian process (NN-AGP) model that uses a neural network to approximate the unknown and potentially complicated reward function with respect to the contextual variable, while maintaining a Gaussian process surrogate with respect to the decision variable.
Results: Experiments show the model outperforms existing approaches, offering better approximation accuracy thanks to the neural network and explicit uncertainty quantification from the Gaussian process.

Contextual decision-making problems have witnessed extensive applications in various fields such as online content recommendation, personalized healthcare, and autonomous vehicles, where a core practical challenge is to select a suitable surrogate model for capturing unknown complicated reward functions. It is often the case that both high approximation accuracy and explicit uncertainty quantification are desired. In this work, we propose a neural network-accompanied Gaussian process (NN-AGP) model, which leverages neural networks to approximate the unknown and potentially complicated reward function regarding the contextual variable, and maintains a Gaussian process surrogate model with respect to the decision variable. Our model is shown to outperform existing approaches by offering better approximation accuracy thanks to the use of neural networks and possessing explicit uncertainty quantification from the Gaussian process. We also analyze the maximum information gain of the NN-AGP model and prove the regret bounds for the corresponding algorithms. Moreover, we conduct the experiments on both synthetic and practical problems, illustrating the effectiveness of our approach.

Bayesian Metric Learning for Uncertainty Quantification in Image Retrieval
Frederik Rahbæk Warburg Marco Miani Silas Brack Søren Hauberg



Research question: Propose a Bayesian encoder for metric learning.
Motivation: Rather than relying on the neural amortization of prior work, learn a distribution over the network weights with the Laplace approximation.
Method: First prove that the contrastive loss is a negative log-likelihood on the spherical space, then propose three methods that ensure a positive definite covariance matrix, and finally present a novel decomposition of the generalized Gauss-Newton approximation.
Results: Experiments show the Laplacian Metric Learner (LAM) yields well-calibrated uncertainties, reliably detects out-of-distribution examples, and achieves state-of-the-art predictive performance.

We propose a Bayesian encoder for metric learning. Rather than relying on neural amortization as done in prior works, we learn a distribution over the network weights with the Laplace Approximation. We first prove that the contrastive loss is a negative log-likelihood on the spherical space. We propose three methods that ensure a positive definite covariance matrix. Lastly, we present a novel decomposition of the Generalized Gauss-Newton approximation. Empirically, we show that our Laplacian Metric Learner (LAM) yields well-calibrated uncertainties, reliably detects out-of-distribution examples, and has state-of-the-art predictive performance.

The s-value: evaluating stability with respect to distributional shifts
Suyash Gupta Dominik Rothenhaeusler



Research question: How to quantify the distributional instability of statistical parameters, particularly as distributions change across locations and over time.
Motivation: Traditional uncertainty measures such as $p$-values and confidence intervals mainly capture sampling uncertainty, but in practice distributional change is another important source of uncertainty.
Method: Proposes a new stability measure that quantifies distributional instability as the sensitivity of a statistical parameter under general distributional perturbations within a Kullback-Leibler divergence ball, and additionally quantifies stability with respect to directional or variable-specific shifts.
Results: Experiments show the measure can elucidate the distributional instability of a parameter under certain shifts and helps improve the accuracy of parameter estimation under shifted distributions.

Common statistical measures of uncertainty such as $p$-values and confidence intervals quantify the uncertainty due to sampling, that is, the uncertainty due to not observing the full population. However, sampling is not the only source of uncertainty. In practice, distributions change between locations and across time. This makes it difficult to gather knowledge that transfers across data sets. We propose a measure of instability that quantifies the distributional instability of a statistical parameter with respect to Kullback-Leibler divergence, that is, the sensitivity of the parameter under general distributional perturbations within a Kullback-Leibler divergence ball. In addition, we quantify the instability of parameters with respect to directional or variable-specific shifts. Measuring instability with respect to directional shifts can be used to detect under which kind of distribution shifts a statistical conclusion might be reversed. We discuss how such knowledge can inform data collection for transfer learning of statistical parameters under shifted distributions. We evaluate the performance of the proposed measure on real data and show that it can elucidate the distributional instability of a parameter with respect to certain shifts and can be used to improve estimation accuracy under shifted distributions.
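
A finite-sample flavor of KL-ball sensitivity can be computed by exponential tilting: among all reweightings of the empirical distribution within a KL ball of radius ρ, the one that maximizes the mean puts weight proportional to exp(λx), for the λ that exhausts the budget. This is an illustration of the underlying quantity under our own simplifying assumptions, not the paper's exact s-value:

```python
import numpy as np
from scipy.optimize import brentq

def worst_case_mean(x, rho):
    """Largest mean of x achievable by reweighting its empirical distribution
    within KL(Q || P_n) <= rho, via exponential tilting."""
    def kl_of_tilt(lam):
        w = np.exp(lam * (x - x.max()))   # shift for numerical stability
        w /= w.sum()
        nz = w > 0
        return np.sum(w[nz] * np.log(w[nz] * len(x)))
    lam = brentq(lambda l: kl_of_tilt(l) - rho, 0.0, 50.0)
    w = np.exp(lam * (x - x.max()))
    w /= w.sum()
    return float(np.sum(w * x))

x = np.random.default_rng(0).normal(size=2000)
print(worst_case_mean(x, rho=0.1))  # how far a KL-budget shift can move the mean
```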

End-To-End Latent Variational Diffusion Models for Inverse Problems in High Energy Physics
Alexander Shmakov Kevin Greif Michael James Fenton Aishik Ghosh Pierre Baldi Daniel Whiteson



Research question: This paper addresses how to approximately solve, via deep generative learning, the inverse problem of mapping detector observations at the Large Hadron Collider (LHC) back to theoretical quantities.
Motivation: Current particle physics analyses must compare measurements with theoretical predictions or results from other detectors, which first requires correcting for detector effects.
Method: The paper introduces a novel unified architecture, termed latent variational diffusion models, which combines the latent learning of cutting-edge generative approaches with an end-to-end variational framework.
Results: The unified approach excels at reconstructing global distributions of theoretical kinematic quantities and at keeping learned posterior distributions consistent with known physics constraints, achieving a distribution-free distance to the truth over 20 times smaller than the non-latent state-of-the-art baseline and 3 times smaller than traditional latent diffusion models.

High-energy collisions at the Large Hadron Collider (LHC) provide valuable insights into open questions in particle physics. However, detector effects must be corrected before measurements can be compared to certain theoretical predictions or measurements from other detectors. Methods to solve this inverse problem of mapping detector observations to theoretical quantities of the underlying collision are essential parts of many physics analyses at the LHC. We investigate and compare various generative deep learning methods to approximate this inverse mapping. We introduce a novel unified architecture, termed latent variational diffusion models, which combines the latent learning of cutting-edge generative art approaches with an end-to-end variational framework. We demonstrate the effectiveness of this approach for reconstructing global distributions of theoretical kinematic quantities, as well as for ensuring the adherence of the learned posterior distributions to known physics constraints. Our unified approach achieves a distribution-free distance to the truth over 20 times smaller than that of the non-latent state-of-the-art baseline and 3 times smaller than that of traditional latent diffusion models.

Temporally Disentangled Representation Learning under Unknown Nonstationarity
Xiangchen Song Weiran Yao Yewen Fan Xinshuai Dong Guangyi Chen Juan Carlos Niebles Eric Xing Kun Zhang



Research question: In nonstationary environments, how to recover and identify time-delayed latent causal relations from observed sequential data.
Motivation: Existing methods either require auxiliary variables (e.g., class labels and/or domain indexes) or assume simplified latent causal dynamics, limiting their applicability.
Method: The study further explores the Markov assumption for time-delayed causally related processes in nonstationary settings and shows that, under mild conditions, the independent latent components can be recovered from their nonlinear mixture up to a permutation and component-wise transformation, without observing auxiliary variables. It then introduces NCTRL, a principled estimation framework that reconstructs time-delayed latent causal variables and identifies their relations from measured sequential data alone.
Results: Empirical evaluations show reliable identification of time-delayed latent causal relations, substantially outperforming existing baselines that fail to exploit the nonstationarity adequately and consequently cannot distinguish distribution shifts.

In unsupervised causal representation learning for sequential data with time-delayed latent causal influences, strong identifiability results for the disentanglement of causally related latent variables have been established in stationary settings by leveraging temporal structure. However, in nonstationary settings, existing work has only partially addressed the problem, either utilizing observed auxiliary variables (e.g., class labels and/or domain indexes) as side information or assuming simplified latent causal dynamics. Both constrain the method to a limited range of scenarios. In this study, we further explore the Markov assumption under time-delayed causally related processes in nonstationary settings and show that under mild conditions, the independent latent components can be recovered from their nonlinear mixture up to a permutation and a component-wise transformation, without the observation of auxiliary variables. We then introduce NCTRL, a principled estimation framework, to reconstruct time-delayed latent causal variables and identify their relations from measured sequential data only. Empirical evaluations demonstrate the reliable identification of time-delayed latent causal influences, with our methodology substantially outperforming existing baselines that fail to exploit the nonstationarity adequately and, consequently, cannot distinguish distribution shifts.

Embracing the chaos: analysis and diagnosis of numerical instability in variational flows
Zuheng Xu Trevor Campbell



Research question: This paper investigates the impact of numerical instability on the reliability of sampling, density evaluation, and evidence lower bound (ELBO) estimation in variational flows.
Motivation: Common variational flows can accumulate numerical error that affects the accuracy of sampling, density, and ELBO computations, yet surprisingly their results are often accurate enough for applications.
Method: The paper treats variational flows as chaotic dynamical systems, leverages shadowing theory to explain this behavior with theoretical guarantees, and develops a diagnostic procedure to validate results produced by numerically unstable flows in practice.
Results: The theoretical analysis and experiments show that, despite serious numerical instability, the results of variational flows remain accurate enough in practice.

In this paper, we investigate the impact of numerical instability on the reliability of sampling, density evaluation, and evidence lower bound (ELBO) estimation in variational flows. We first empirically demonstrate that common flows can exhibit a catastrophic accumulation of error: the numerical flow map deviates significantly from the exact map---which affects sampling---and the numerical inverse flow map does not accurately recover the initial input---which affects density and ELBO computations. Surprisingly though, we find that results produced by flows are often accurate enough for applications despite the presence of serious numerical instability. In this work, we treat variational flows as chaotic dynamical systems, and leverage shadowing theory to elucidate this behavior via theoretical guarantees on the error of sampling, density evaluation, and ELBO estimation. Finally, we develop and empirically test a diagnostic procedure that can be used to validate results produced by numerically unstable flows in practice.

Intervention Generalization: A View from Factor Graph Models
Gecia Bravo-Hermsdorff David Watson Jialin Yu Jakob Zeitler Ricardo Silva



Research question: How to generalize from past experiments and observational data to novel conditions, particularly when facing a large combinatorial space of possible interventions.
Motivation: Under a typical sparse experimental design, this mapping is ill-posed without heavy regularization or prior distributions, and such assumptions can be hard to defend or test.
Method: In the language of factor graph models, the paper postulates an interventional factor model (IFM) that conveniently abstracts away explicit modeling of unmeasured confounding and feedback mechanisms, leading to directly testable claims.
Results: The framework yields identifiability conditions for the expected outcomes of regimes never observed in the training data, is implemented with several efficient algorithms, and is applied to a range of semi-synthetic experiments.

One of the goals of causal inference is to generalize from past experiments and observational data to novel conditions. While it is in principle possible to eventually learn a mapping from a novel experimental condition to an outcome of interest, provided a sufficient variety of experiments is available in the training data, coping with a large combinatorial space of possible interventions is hard. Under a typical sparse experimental design, this mapping is ill-posed without relying on heavy regularization or prior distributions. Such assumptions may or may not be reliable, and can be hard to defend or test. In this paper, we take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system, communicated in the well-understood language of factor graph models. A postulated interventional factor model (IFM) may not always be informative, but it conveniently abstracts away a need for explicitly modeling unmeasured confounding and feedback mechanisms, leading to directly testable claims. Given an IFM and datasets from a collection of experimental regimes, we derive conditions for identifiability of the expected outcomes of new regimes never observed in these training data. We implement our framework using several efficient algorithms, and apply them on a range of semi-synthetic experiments.

Mirror Diffusion Models for Constrained and Watermarked Generation
Guan-Horng Liu Tianrong Chen Evangelos Theodorou Molei Tao



Research question: How can diffusion models remain tractable on constrained datasets?
Motivation: Existing diffusion models work well in standard Euclidean space but can lose their desirable properties on constrained sets.
Method: Proposes Mirror Diffusion Models (MDM), which learn the diffusion process in the dual space of a mirror map, generating data on convex constrained sets while retaining tractability.
Results: Experiments show MDM outperforms existing methods in efficiency and performance on common constraint sets such as simplices and L2 balls, and it can also be used for safe, privacy-preserving embedding of information (watermarks) in generated data.

Modern successes of diffusion models in learning complex, high-dimensional data distributions are attributed, in part, to their capability to construct diffusion processes with analytic transition kernels and score functions. The tractability results in a simulation-free framework with stable regression losses, from which reversed, generative processes can be learned at scale. However, when data is confined to a constrained set as opposed to a standard Euclidean space, these desirable characteristics appear to be lost based on prior attempts. In this work, we propose Mirror Diffusion Models (MDM), a new class of diffusion models that generate data on convex constrained sets without losing any tractability. This is achieved by learning diffusion processes in a dual space constructed from a mirror map, which, crucially, is a standard Euclidean space. We derive efficient computation of mirror maps for popular constrained sets, such as simplices and $\ell_2$-balls, showing significantly improved performance of MDM over existing methods. For safety and privacy purposes, we also explore constrained sets as a new mechanism to embed invisible but quantitative information (i.e., watermarks) in generated data, for which MDM serves as a compelling approach. Our work brings new algorithmic opportunities for learning tractable diffusion on complex domains.
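
For the simplex, the entropic mirror map makes the construction concrete: the gradient of $\phi(x) = \sum_i x_i \log x_i$ sends the open simplex to ordinary Euclidean space (up to a constant shift), and softmax maps back, so a diffusion model trained on the dual-space points keeps all the usual tractability. A minimal sketch with our own helper names:

```python
import numpy as np

def to_dual(x, eps=1e-12):
    """Entropic mirror map (up to a constant shift): open simplex -> R^d."""
    return np.log(x + eps)

def to_primal(y):
    """Inverse mirror map: softmax returns the point to the simplex."""
    e = np.exp(y - y.max())
    return e / e.sum()

x = np.array([0.2, 0.3, 0.5])
assert np.allclose(to_primal(to_dual(x)), x, atol=1e-6)
# train an ordinary Euclidean diffusion on to_dual(data); map each generated
# sample back with to_primal, which lands in the simplex by construction
```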

Meek Separators and Their Applications in Targeted Causal Discovery
Kirankumar Shiragur Jiaqi Zhang Caroline Uhler



Research question: How to learn causal structure from interventional data, especially when only part of the causal graph needs to be learned.
Motivation: Much prior work focuses on recovering the entire causal graph, but in practice scenarios where learning only part of the graph suffices are common.
Method: Introduces the "Meek separator", a subset of vertices that, when intervened upon, decomposes the remaining unoriented edges into smaller connected components, and designs an efficient algorithm for finding small Meek separators.
Results: Proposes two randomized algorithms achieving logarithmic approximation for subset search and causal matching respectively, providing the first average-case provable guarantees for these two problems.

Learning causal structures from interventional data is a fundamental problem with broad applications across various fields. While many previous works have focused on recovering the entire causal graph, in practice, there are scenarios where learning only part of the causal graph suffices. This is called \emph{targeted} causal discovery. In our work, we focus on two such well-motivated problems: subset search and causal matching. We aim to minimize the number of interventions in both cases. Towards this, we introduce the \emph{Meek separator}, which is a subset of vertices that, when intervened, decomposes the remaining unoriented edges into smaller connected components. We then present an efficient algorithm to find Meek separators that are of small sizes. Such a procedure is helpful in designing various divide-and-conquer-based approaches. In particular, we propose two randomized algorithms that achieve logarithmic approximation for subset search and causal matching, respectively. Our results provide the first known average-case provable guarantees for both problems. We believe that this opens up possibilities to design near-optimal methods for many other targeted causal structure learning problems arising from various applications.

Causal Imitability Under Context-Specific Independence Relations
Fateme Jamshidi Sina Akbari Negar Kiyavash



Research question: The drawbacks of ignoring causal mechanisms when performing imitation learning are widely recognized, but the potential benefits of exploiting additional information about the underlying structure remain unexplored.
Motivation: This paper considers causal imitation learning when context-specific independence (CSI) relations are known.
Method: The paper proves that the decision problem pertaining to the feasibility of imitation in this setting is NP-hard, and provides a necessary graphical criterion for imitation learning under CSI, which is also sufficient under a structural assumption.
Results: Finally, the paper proposes a sound algorithmic approach to causal imitation learning that takes both CSI relations and data into account.

Drawbacks of ignoring the causal mechanisms when performing imitation learning have recently been acknowledged. Several approaches both to assess the feasibility of imitation and to circumvent causal confounding and causal misspecifications have been proposed in the literature. However, the potential benefits of the incorporation of additional information about the underlying causal structure are left unexplored. An example of such overlooked information is context-specific independence (CSI), i.e., independence that holds only in certain contexts. We consider the problem of causal imitation learning when CSI relations are known. We prove that the decision problem pertaining to the feasibility of imitation in this setting is NP-hard. Further, we provide a necessary graphical criterion for imitation learning under CSI and show that under a structural assumption, this criterion is also sufficient. Finally, we propose a sound algorithmic approach for causal imitation learning which takes both CSI relations and data into account.

Identifiability Guarantees for Causal Disentanglement from Soft Interventions
Jiaqi Zhang Kristjan Greenewald Chandler Squires Akash Srivastava Karthikeyan Shanmugam Caroline Uhler



Research question: This paper addresses how to identify a causal model, through a generalized notion of faithfulness, when some causal variables are unobserved.
Motivation: When causal variables are fully observed, existing algorithms identify the causal model under faithfulness assumptions; this paper aims to show that identifiability can still be achieved with unobserved causal variables.
Method: The paper implements its causal disentanglement framework with an autoencoding variational Bayes algorithm and applies it to predicting combinatorial perturbation effects in genomics.
Results: Experiments show the method recovers the latent causal model up to an equivalence class and, in the limit of infinite data, predicts the effect of unseen combinations of interventions.

Causal disentanglement aims to uncover a representation of data using latent variables that are interrelated through a causal model. Such a representation is identifiable if the latent model that explains the data is unique. In this paper, we focus on the scenario where unpaired observational and interventional data are available, with each intervention changing the mechanism of a latent variable. When the causal variables are fully observed, statistically consistent algorithms have been developed to identify the causal model under faithfulness assumptions. We here show that identifiability can still be achieved with unobserved causal variables, given a generalized notion of faithfulness. Our results guarantee that we can recover the latent causal model up to an equivalence class and predict the effect of unseen combinations of interventions, in the limit of infinite data. We implement our causal disentanglement framework by developing an autoencoding variational Bayes algorithm and apply it to the problem of predicting combinatorial perturbation effects in genomics.

Time Series as Images: Vision Transformer for Irregularly Sampled Time Series
Zekun Li Shiyang Li Xifeng Yan



Research question: How to handle irregularly sampled time series effectively, particularly in medical domains.
Motivation: Although various specialized methods have been developed to handle these irregularities, effectively modeling their complex dynamics and pronounced sparsity remains a challenge.
Method: Converts irregularly sampled time series into line-graph images and then uses powerful pre-trained vision transformers for time series classification, just like image classification. This not only greatly simplifies specialized algorithm design but also has the potential to serve as a universal framework for time series modeling.
Results: Despite its simplicity, the method outperforms state-of-the-art specialized algorithms on several popular healthcare and human activity datasets. In the rigorous leave-sensors-out setting, where a portion of variables is omitted at test time, it shows strong robustness to varying degrees of missing observations, improving absolute F1 score by 42.8% over leading specialized baselines even with half the variables masked. Code and data are available at https://github.com/Leezekun/ViTST.

Irregularly sampled time series are increasingly prevalent, particularly in medical domains. While various specialized methods have been developed to handle these irregularities, effectively modeling their complex dynamics and pronounced sparsity remains a challenge. This paper introduces a novel perspective by converting irregularly sampled time series into line graph images, then utilizing powerful pre-trained vision transformers for time series classification in the same way as image classification. This method not only largely simplifies specialized algorithm designs but also presents the potential to serve as a universal framework for time series modeling. Remarkably, despite its simplicity, our approach outperforms state-of-the-art specialized algorithms on several popular healthcare and human activity datasets. Especially in the rigorous leave-sensors-out setting where a portion of variables is omitted during testing, our method exhibits strong robustness against varying degrees of missing observations, achieving an impressive improvement of 42.8% in absolute F1 score points over leading specialized baselines even with half the variables masked. Code and data are available at https://github.com/Leezekun/ViTST.
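
The conversion step is genuinely this simple: render each variable as a line plot with visible markers (so sampling gaps stay visible) and hand the resulting picture to a standard ViT pipeline. A minimal rendering sketch, with our own helper name and styling choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def series_to_image(times, values, px=224):
    """Render one irregularly sampled variable as a (px, px, 3) uint8 image
    suitable for a pretrained vision transformer's preprocessing pipeline."""
    fig, ax = plt.subplots(figsize=(px / 100, px / 100), dpi=100)
    ax.plot(times, values, marker="*", linewidth=1)  # markers expose the gaps
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()
    plt.close(fig)
    return img

t = np.sort(np.random.rand(30)) * 48.0   # 30 irregular timestamps over 48h
img = series_to_image(t, np.random.randn(30))
print(img.shape)  # (224, 224, 3)
```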

Explaining Predictive Uncertainty with Information Theoretic Shapley Values
David Watson Joshua O'Hara Niek Tax Richard Mudd Ido Guy



Research question: Methods for explaining the predictions of complex supervised learning models are now fairly mature, but explaining the uncertainty of model outputs has received comparatively little attention.
Motivation: To address this, the authors adapt the popular Shapley value framework to explain various types of predictive uncertainty, quantifying each feature's contribution to the conditional entropy of individual model outputs.
Method: By considering games with modified characteristic functions, the paper finds deep connections between the resulting Shapley values and fundamental quantities from information theory and conditional independence testing; it also outlines inference procedures for finite-sample error rate control and implements efficient algorithms that perform well on real and simulated data.
Results: The method applies to covariate shift detection, active learning, feature selection, and active feature-value acquisition, with good experimental results.

Researchers in explainable artificial intelligence have developed numerous methods for helping users understand the predictions of complex supervised learning models. By contrast, explaining the $\textit{uncertainty}$ of model outputs has received relatively little attention. We adapt the popular Shapley value framework to explain various types of predictive uncertainty, quantifying each feature's contribution to the conditional entropy of individual model outputs. We consider games with modified characteristic functions and find deep connections between the resulting Shapley values and fundamental quantities from information theory and conditional independence testing. We outline inference procedures for finite sample error rate control with provable guarantees, and implement efficient algorithms that perform well in a range of experiments on real and simulated data. Our method has applications to covariate shift detection, active learning, feature selection, and active feature-value acquisition.
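
Concretely, the Shapley machinery only needs the characteristic function swapped for an entropy. Below is an exact (exponential-time, so small d only) computation of each feature's contribution to reducing predictive entropy; the `cond_predict` interface, which must return the predictive distribution given a revealed feature subset (e.g., by marginalizing the hidden features), is an assumption the caller has to supply:

```python
import numpy as np
from itertools import combinations
from math import comb

def entropy(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def uncertainty_shapley(x, cond_predict, d):
    """Exact Shapley values for v(S) = reduction in predictive entropy when
    feature i joins the revealed set S. Sums to H(empty) - H(all features)."""
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                weight = 1.0 / (d * comb(d - 1, k))
                gain = (entropy(cond_predict(x, S))
                        - entropy(cond_predict(x, S + (i,))))
                phi[i] += weight * gain
    return phi
```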

Energy Discrepancies: A Score-Independent Loss for Energy-Based Models
Tobias Schröder Zijing Ou Jen Ning Lim Yingzhen Li Sebastian Josef Vollmer Andrew Duncan



Research question: Propose a new loss function, called Energy Discrepancy (ED), to address the heavy computational burden of training energy-based models.
Motivation: Existing energy-based models are powerful but computationally burdensome to train, limiting their widespread adoption.
Method: The proposed ED loss avoids expensive Markov chain Monte Carlo and score computation; numerical experiments show it learns low-dimensional data distributions faster and more accurately than explicit score matching or contrastive divergence.
Results: Numerical experiments demonstrate ED's effectiveness on high-dimensional image data and show the effectiveness of training the energy-based model as the prior of a variational decoder model.

Energy-based models are a simple yet powerful class of probabilistic models, but their widespread adoption has been limited by the computational burden of training them. We propose a novel loss function called Energy Discrepancy (ED) which does not rely on the computation of scores or expensive Markov chain Monte Carlo. We show that energy discrepancy approaches the explicit score matching and negative log-likelihood loss under different limits, effectively interpolating between both. Consequently, minimum energy discrepancy estimation overcomes the problem of nearsightedness encountered in score-based estimation methods, while also enjoying theoretical guarantees. Through numerical experiments, we demonstrate that ED learns low-dimensional data distributions faster and more accurately than explicit score matching or contrastive divergence. For high-dimensional image data, we describe how the manifold hypothesis puts limitations on our approach and demonstrate the effectiveness of energy discrepancy by training the energy-based model as a prior of a variational decoder model.

Deep Stochastic Processes via Functional Markov Transition Operators
Jin Xu Emilien Dupont Kaspar Märtens Tom Rainforth Yee Whye Teh



Research question: This paper proposes a new class of stochastic processes, Markov Neural Processes (MNPs), to increase the flexibility and expressivity of the original Neural Processes (NPs).
Motivation: Existing neural processes can be limited on complex tasks, so a new approach is needed to improve their flexibility and expressivity.
Method: MNPs are constructed by stacking sequences of neural-parameterised Markov transition operators in function space; these Markov transition operators preserve the exchangeability and consistency of stochastic processes.
Results: Experiments show clear advantages of MNPs over baseline models on a variety of tasks.

We introduce Markov Neural Processes (MNPs), a new class of Stochastic Processes (SPs) which are constructed by stacking sequences of neural parameterised Markov transition operators in function space. We prove that these Markov transition operators can preserve the exchangeability and consistency of SPs. Therefore, the proposed iterative construction adds substantial flexibility and expressivity to the original framework of Neural Processes (NPs) without compromising consistency or adding restrictions. Our experiments demonstrate clear advantages of MNPs over baseline models on a variety of tasks.

Thin and deep Gaussian processes
Daniel Augusto de Souza Alexander V Nikitin S. T. John Magnus Ross Mauricio A Álvarez Marc Peter Deisenroth João Paulo Pordeus Gomes Diego Mesquita César Lincoln Mattos



Research question: How to use Gaussian processes effectively for uncertainty quantification while choosing suitable kernels.
Motivation: Manually selecting and designing kernels for Gaussian processes is challenging; deep Gaussian processes (deep GPs) avoid manual kernel engineering but can lose the interpretability of shallow GPs.
Method: The paper proposes a new approach, Thin and Deep GPs (TDGP), where each layer defines locally linear transformations of the original input data, preserving both the notion of latent embeddings and the interpretation of kernel lengthscales. Moreover, unlike prior solutions, TDGP induces non-pathological manifolds that admit learning lower-dimensional representations.
Results: Theoretical and experimental results show that (i) TDGP, unlike previous models, is tailored to discovering lower-dimensional manifolds in the input data, (ii) TDGP behaves well as the number of layers increases, and (iii) TDGP performs well on standard benchmark datasets.

Gaussian processes (GPs) can provide a principled approach to uncertainty quantification with easy-to-interpret kernel hyperparameters, such as the lengthscale, which controls the correlation distance of function values. However, selecting an appropriate kernel can be challenging. Deep GPs avoid manual kernel engineering by successively parameterizing kernels with GP layers, allowing them to learn low-dimensional embeddings of the inputs that explain the output data. Following the architecture of deep neural networks, the most common deep GPs warp the input space layer-by-layer but lose all the interpretability of shallow GPs. An alternative construction is to successively parameterize the lengthscale of a kernel, improving the interpretability but ultimately giving away the notion of learning lower-dimensional embeddings. Unfortunately, both methods are susceptible to particular pathologies which may hinder fitting and limit their interpretability. This work proposes a novel synthesis of both previous approaches: Thin and Deep GP (TDGP). Each TDGP layer defines locally linear transformations of the original input data, maintaining the concept of latent embeddings while also retaining the interpretation of lengthscales of a kernel. Moreover, unlike the prior solutions, TDGP induces non-pathological manifolds that admit learning lower-dimensional representations. We show with theoretical and experimental results that i) TDGP is, unlike previous models, tailored to specifically discover lower-dimensional manifolds in the input data, ii) TDGP behaves well when increasing the number of layers, and iii) TDGP performs well in standard benchmark datasets.

On the Generalization Properties of Diffusion Models
Puheng Li Zhong Li Huishuai Zhang Jiang Bian



Research question: This paper provides an in-depth theoretical exploration of the generalization properties of diffusion models.
Motivation: Despite their remarkable success in real-world applications, the theoretical understanding of diffusion models' generalization capabilities remains underdeveloped.
Method: The paper establishes theoretical estimates relating the training dynamics of score-based diffusion models to the generalization gap.
Results: With early stopping, diffusion models achieve a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) that is independent of the data dimension. When target distributions are portrayed as a succession of densities, the estimates also reveal the adverse effect of "modes shift" in ground truths on model generalization. These findings deepen the understanding of diffusion models' generalization properties and offer guidance for practical applications.

Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish the theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., independent of the data dimension) when *early-stopped*. Furthermore, we extend our quantitative analysis to a *data-dependent* scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the *adverse* effect of "*modes shift*'' in ground truths on the model generalization. Furthermore, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.

An Efficient Doubly-Robust Test for the Kernel Treatment Effect
Diego Martinez-Taboada Aaditya Ramdas Edward Kennedy



Research question: How to accurately measure and test distributional effects of a binary treatment.
Motivation: The most popular target effect, the average treatment effect, is a difference in expectations, but treatments may have effects beyond the mean, such as decreasing or increasing the variance.
Method: Proposes a new kernel-based test for distributional effects of the treatment, the first kernel-based, doubly-robust test with provably valid type-I error.
Results: The algorithm is computationally efficient, avoids permutations, and performs well in empirical studies.

The average treatment effect, which is the difference in expectation of the counterfactuals, is probably the most popular target effect in causal inference with binary treatments. However, treatments may have effects beyond the mean, for instance decreasing or increasing the variance. We propose a new kernel-based test for distributional effects of the treatment. It is, to the best of our knowledge, the first kernel-based, doubly-robust test with provably valid type-I error. Furthermore, our proposed algorithm is computationally efficient, avoiding the use of permutations.

Swarm Reinforcement Learning for Adaptive Mesh Refinement
Niklas Freymuth Philipp Dahlinger Tobias Daniel Würth Simon Reisch Luise Kärger Gerhard Neumann



Research question: How to perform adaptive mesh refinement (AMR) effectively to improve the computational speed and simulation accuracy of the finite element method (FEM).
Motivation: Classical AMR methods rely on task-specific heuristics or expensive error estimators, limiting their use for complex simulations.
Method: Formulates AMR as a novel adaptive swarm Markov decision process in which the mesh is modeled as a system of simple collaborating agents that can split into multiple new agents, combined with message passing networks to propagate information between neighboring mesh elements and a spatial reward formulation that simplifies credit assignment.
Results: Experiments validate the approach, Adaptive Swarm Mesh Refinement (ASMR), showing it learns reliable, scalable, and efficient refinement strategies on a set of challenging problems. It significantly speeds up computation, achieving up to a 30-fold improvement over uniform refinement in complex simulations, outperforms learned baselines, and matches the refinement quality of a traditional error-based AMR strategy without expensive oracle information about the error signal.

The Finite Element Method, an important technique in engineering, is aided by Adaptive Mesh Refinement (AMR), which dynamically refines mesh regions to allow for a favorable trade-off between computational speed and simulation accuracy. Classical methods for AMR depend on task-specific heuristics or expensive error estimators, hindering their use for complex simulations. Recent learned AMR methods tackle these problems, but so far scale only to simple toy examples. We formulate AMR as a novel Adaptive Swarm Markov Decision Process in which a mesh is modeled as a system of simple collaborating agents that may split into multiple new agents. This framework allows for a spatial reward formulation that simplifies the credit assignment problem, which we combine with Message Passing Networks to propagate information between neighboring mesh elements. We experimentally validate the effectiveness of our approach, Adaptive Swarm Mesh Refinement (ASMR), showing that it learns reliable, scalable, and efficient refinement strategies on a set of challenging problems. Our approach significantly speeds up computation, achieving up to 30-fold improvement compared to uniform refinements in complex simulations. Additionally, we outperform learned baselines and achieve a refinement quality that is on par with a traditional error-based AMR strategy without expensive oracle information about the error signal.

On the Statistical Consistency of Risk-Sensitive Bayesian Decision-Making
Prateek Jaiswal Harsha Honnappa Vinayak Rao



Research question: This paper studies data-driven decision-making in the Bayesian framework, where the expectation in the Bayes risk is replaced by a risk-sensitive entropic risk measure with respect to the posterior distribution.
Motivation: In modern applications, large datasets and complex data-generating models make computing the posterior intractable, calling for a new approach.
Method: The paper proposes a novel risk-sensitive variational Bayesian (RSVB) framework for jointly computing a risk-sensitive posterior approximation and the corresponding decision rule; the framework includes loss-calibrated VB (Lacoste-Julien et al., 2011) as a special case.
Results: The paper computes convergence rates for the RSVB approximate posterior and the corresponding optimal value, and illustrates the theoretical findings in parametric and nonparametric settings with three examples.

We study data-driven decision-making problems in the Bayesian framework, where the expectation in the Bayes risk is replaced by a risk-sensitive entropic risk measure with respect to the posterior distribution. We focus on problems where calculating the posterior distribution is intractable, a typical situation in modern applications with large datasets and complex data generating models. We leverage a dual representation of the entropic risk measure to introduce a novel risk-sensitive variational Bayesian (RSVB) framework for jointly computing a risk-sensitive posterior approximation and the corresponding decision rule. Our general framework includes \textit{loss-calibrated} VB (Lacoste-Julien et al. [2011] ) as a special case. We also study the impact of these computational approximations on the predictive performance of the inferred decision rules. We compute the convergence rates of the RSVB approximate posterior and the corresponding optimal value. We illustrate our theoretical findings in parametric and nonparametric settings with the help of three examples.

Representation Equivalent Neural Operators: a Framework for Alias-free Operator Learning
Francesca Bartolucci Emmanuel de Bezenac Bogdan Raonic Roberto Molinaro Siddhartha Mishra Rima Alaifari



Research question: How to learn mappings between infinite-dimensional function spaces, particularly for learning partial differential equations from data.
Motivation: Although conceptually clear on paper, neural operators require discretization when implemented on computers, which can compromise their integrity and cause them to deviate from the underlying operators.
Method: Proposes the Representation equivalent Neural Operators (ReNO) framework to address these issues; at its core is the concept of operator aliasing, which measures the inconsistency between a neural operator and its discrete representation.
Results: The findings detail how aliasing introduces errors when handling different discretizations and grids and causes loss of crucial continuous structures. Being constructive and broad, the framework not only sheds light on existing challenges but may also offer tools for developing new neural operators.

Recently, operator learning, or learning mappings between infinite-dimensional function spaces, has garnered significant attention, notably in relation to learning partial differential equations from data. Conceptually clear when outlined on paper, neural operators necessitate discretization in the transition to computer implementations. This step can compromise their integrity, often causing them to deviate from the underlying operators. This research offers a fresh take on neural operators with a framework Representation equivalent Neural Operators (ReNO) designed to address these issues. At its core is the concept of operator aliasing, which measures inconsistency between neural operators and their discrete representations. We explore this for widely-used operator learning techniques. Our findings detail how aliasing introduces errors when handling different discretizations and grids and loss of crucial continuous structures. More generally, this framework not only sheds light on existing challenges but, given its constructive and broad nature, also potentially offers tools for developing new neural operators.

Collapsed Inference for Bayesian Deep Learning
Zhe Zeng Guy Van den Broeck



Research question: Inference in Bayesian neural networks: how to improve sample efficiency while maintaining predictive performance.
Motivation: Current BNN inference approaches sample the network weights, which is either computationally prohibitive or harmful to predictive performance.
Method: A novel collapsed inference scheme performs Bayesian model averaging using collapsed samples: sampling is limited to a subset of the network weights, paired with a closed-form conditional distribution over the rest, so that each collapsed sample represents uncountably many models drawn from the approximate posterior and yields higher sample efficiency.
Results: On various regression and classification tasks, the proposed collapsed Bayesian deep learning approach outperforms existing methods and sets a new state of the art in both uncertainty estimation and predictive performance.

Bayesian neural networks (BNNs) provide a formalism to quantify and calibrate uncertainty in deep learning. Current inference approaches for BNNs often resort to few-sample estimation for scalability, which can harm predictive performance, while its alternatives tend to be computationally prohibitively expensive. We tackle this challenge by revealing a previously unseen connection between inference on BNNs and volume computation problems. With this observation, we introduce a novel collapsed inference scheme that performs Bayesian model averaging using collapsed samples. It improves over a Monte-Carlo sample by limiting sampling to a subset of the network weights while pairing it with some closed-form conditional distribution over the rest. A collapsed sample represents uncountably many models drawn from the approximate posterior and thus yields higher sample efficiency. Further, we show that the marginalization of a collapsed sample can be solved analytically and efficiently despite the non-linearity of neural networks by leveraging existing volume computation solvers. Our proposed use of collapsed samples achieves a balance between scalability and accuracy. On various regression and classification tasks, our collapsed Bayesian deep learning approach demonstrates significant improvements over existing methods and sets a new state of the art in terms of uncertainty estimation as well as predictive performance.

Policy Optimization for Continuous Reinforcement Learning
Hanyang Zhao Wenpin Tang David Yao



Research question: Reinforcement learning in continuous time and space, with an infinite-horizon discounted objective and underlying dynamics driven by a stochastic differential equation.
Motivation: Building on recent advances in the continuous approach to RL, the authors develop a notion of occupation time (specifically for a discounted objective) and show how it can be used effectively to derive performance-difference and local-approximation formulas.
Method: These results are extended to illustrate their applications in PG (policy gradient) and TRPO/PPO (trust region policy optimization / proximal policy optimization) methods, which are familiar and powerful in the discrete RL setting but under-developed in continuous RL.
Results: Numerical experiments demonstrate the effectiveness and advantages of the approach.

We study reinforcement learning (RL) in the setting of continuous time and space, for an infinite horizon with a discounted objective and the underlying dynamics driven by a stochastic differential equation. Built upon recent advances in the continuous approach to RL, we develop a notion of occupation time (specifically for a discounted objective), and show how it can be effectively used to derive performance difference and local approximation formulas. We further extend these results to illustrate their applications in the PG (policy gradient) and TRPO/PPO (trust region policy optimization/ proximal policy optimization) methods, which have been familiar and powerful tools in the discrete RL setting but under-developed in continuous RL. Through numerical experiments, we demonstrate the effectiveness and advantages of our approach.
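
For orientation, a discounted occupation measure in continuous time can be sketched as (notation illustrative, not necessarily the paper's)

$$ \nu^{\pi}_{x_0}(A) \;=\; \mathbb{E}\left[\int_0^{\infty} \beta e^{-\beta t}\, \mathbf{1}\{X_t^{\pi} \in A\}\, dt \right], $$

the continuous-time analogue of the discounted state visitation distribution behind the classic discrete-time performance difference lemma, $J(\pi') - J(\pi) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi'}}\mathbb{E}_{a \sim \pi'}\left[A^{\pi}(s,a)\right]$; the paper derives formulas of this flavor for the SDE-driven setting.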

A Causal Framework for Decomposing Spurious Variations
Drago Plecko Elias Bareinboim



Research question: How to explain why things happen in specific ways, or through which mechanisms a variable X exerts influence over another variable Y.
Motivation: Many variations throughout the applied sciences are spurious in nature, yet despite the statistical power to estimate correlations and the identification power to decompose causal effects, little is understood about how spurious associations decompose in terms of the underlying causal mechanisms.
Method: New formal tools are developed for decomposing spurious variations in both Markovian and Semi-Markovian models.
Results: The first results allowing a non-parametric decomposition of spurious effects are proven, together with sufficient conditions for identifying such decompositions. The approach has broad applications, from explainable and fair AI to epidemiology and medicine, and is demonstrated empirically on a real-world dataset.

One of the fundamental challenges found throughout the data sciences is to explain why things happen in specific ways, or through which mechanisms a certain variable $X$ exerts influences over another variable $Y$. In statistics and machine learning, significant efforts have been put into developing machinery to estimate correlations across variables efficiently. In causal inference, a large body of literature is concerned with the decomposition of causal effects under the rubric of mediation analysis. However, many variations are spurious in nature, including different phenomena throughout the applied sciences. Despite the statistical power to estimate correlations and the identification power to decompose causal effects, there is still little understanding of the properties of spurious associations and how they can be decomposed in terms of the underlying causal mechanisms. In this manuscript, we develop formal tools for decomposing spurious variations in both Markovian and Semi-Markovian models. We prove the first results that allow a non-parametric decomposition of spurious effects and provide sufficient conditions for the identification of such decompositions. The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine, and we empirically demonstrate its use on a real-world dataset.

Towards Understanding the Dynamics of Gaussian-Stein Variational Gradient Descent
Tianle Liu Promit Ghosal Krishna Balasubramanian Natesh S. Pillai



Research question: Despite its wide usage, understanding the theoretical properties of SVGD remains a challenge.
Motivation: To make progress, the authors undertake a detailed study of Gaussian-SVGD, i.e., SVGD projected onto the family of Gaussian distributions via the bilinear kernel, or equivalently Gaussian variational inference (GVI) with SVGD.
Method: Both the mean-field PDE and discrete particle systems are considered, and density-based and particle-based implementations of Gaussian-SVGD are proposed.
Results: When the target is strongly log-concave, the mean-field Gaussian-SVGD dynamics converge linearly to the Gaussian distribution closest to the target in KL divergence. In the finite-particle setting, there is uniform-in-time convergence to the mean-field limit and linear convergence in time to the equilibrium when the target is Gaussian. In the general case, several recent GVI algorithms, proposed from different perspectives, emerge as special cases of the unified framework, and, interestingly, one new particle-based instance from this framework empirically outperforms existing approaches.

Stein Variational Gradient Descent (SVGD) is a nonparametric particle-based deterministic sampling algorithm. Despite its wide usage, understanding the theoretical properties of SVGD has remained a challenging problem. For sampling from a Gaussian target, the SVGD dynamics with a bilinear kernel will remain Gaussian as long as the initializer is Gaussian. Inspired by this fact, we undertake a detailed theoretical study of the Gaussian-SVGD, i.e., SVGD projected to the family of Gaussian distributions via the bilinear kernel, or equivalently Gaussian variational inference (GVI) with SVGD. We present a complete picture by considering both the mean-field PDE and discrete particle systems. When the target is strongly log-concave, the mean-field Gaussian-SVGD dynamics is proven to converge linearly to the Gaussian distribution closest to the target in KL divergence. In the finite-particle setting, there is both uniform in time convergence to the mean-field limit and linear convergence in time to the equilibrium if the target is Gaussian. In the general case, we propose a density-based and a particle-based implementation of the Gaussian-SVGD, and show that several recent algorithms for GVI, proposed from different perspectives, emerge as special cases of our unified framework. Interestingly, one of the new particle-based instance from this framework empirically outperforms existing approaches. Our results make concrete contributions towards obtaining a deeper understanding of both SVGD and GVI.
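
For concreteness, below is a minimal sketch of a plain SVGD update with the bilinear kernel $k(a,b) = a^\top b + 1$ on a Gaussian target. This is illustrative only; the paper's Gaussian-SVGD additionally projects the dynamics onto the Gaussian family and analyzes the mean-field PDE.

    import numpy as np

    def svgd_step(x, score, eps=0.05):
        # x: (n, d) particles; score(x): (n, d) gradient of the log-target.
        # Bilinear kernel k(a, b) = a.b + 1; grad_a k(a, b) = b, so the
        # repulsive term (1/n) sum_j grad_{x_j} k(x_j, x_i) equals x_i.
        n = x.shape[0]
        k = x @ x.T + 1.0
        phi = (k @ score(x)) / n + x
        return x + eps * phi

    rng = np.random.default_rng(0)
    mu = np.array([2.0, -1.0])
    x = rng.normal(size=(500, 2))
    for _ in range(400):
        x = svgd_step(x, lambda z: -(z - mu))   # target N(mu, I)
    print(x.mean(axis=0))                       # drifts toward mu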

Active Observing in Continuous-time Control
Samuel Holt Alihan Hüyük Mihaela van der Schaar



Research question: How to control continuous-time environments while actively deciding when to take costly observations.
Motivation: Existing approaches either rely on continuous-time control methods that take regular, expensive observations, or on discrete-time control with costly observations, which does not suit continuous-time settings.
Method: The continuous-time control problem with costly observations is formalized for the first time, together with a new method that can take irregular observations in continuous-time control.
Results: The theoretical findings are validated empirically in various continuous-time environments, including a cancer simulation. Although determining the optimal method remains an open problem, the work offers valuable insights into this unique problem and lays a foundation for future research in this area.

The control of continuous-time environments while actively deciding when to take costly observations in time is a crucial yet unexplored problem, particularly relevant to real-world scenarios such as medicine, low-power systems, and resource management. Existing approaches either rely on continuous-time control methods that take regular, expensive observations in time or discrete-time control with costly observation methods, which are inapplicable to continuous-time settings due to the compounding discretization errors introduced by time discretization. In this work, we are the first to formalize the continuous-time control problem with costly observations. Our key theoretical contribution shows that observing at regular time intervals is not optimal in certain environments, while irregular observation policies yield higher expected utility. This perspective paves the way for the development of novel methods that can take irregular observations in continuous-time control with costly observations. We empirically validate our theoretical findings in various continuous-time environments, including a cancer simulation, by constructing a simple initial method to solve this new problem, with a heuristic threshold on the variance of reward rollouts in an offline continuous-time model-based model predictive control (MPC) planner. Although determining the optimal method remains an open problem, our work offers valuable insights and understanding of this unique problem, laying the foundation for future research in this area.

Gaussian Differential Privacy on Riemannian Manifolds
Yangdi Jiang Xiaotian Chang Yi Liu Lei Ding Linglong Kong Bei Jiang



Research question: How to extend Gaussian Differential Privacy (GDP) to general Riemannian manifolds.
Motivation: Owing to its central limit properties, GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings.
Method: Harnessing the Bishop-Gromov theorem from geometric analysis, a Riemannian Gaussian distribution that integrates the Riemannian distance is proposed, achieving GDP on Riemannian manifolds with bounded Ricci curvature.
Results: Simulations on the unit sphere $S^d$, one of the most prevalent manifolds in statistics, show that the Riemannian Gaussian mechanism outperforms the previously proposed Riemannian Laplace mechanism for implementing GDP.

We develop an advanced approach for extending Gaussian Differential Privacy (GDP) to general Riemannian manifolds. The concept of GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings, due to its central limit properties. By harnessing the power of the renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian Gaussian distribution that integrates the Riemannian distance, allowing us to achieve GDP in Riemannian manifolds with bounded Ricci curvature. To the best of our knowledge, this work marks the first instance of extending the GDP framework to accommodate general Riemannian manifolds, encompassing curved spaces, and circumventing the reliance on tangent space summaries. We provide a simple algorithm to evaluate the privacy budget $\mu$ on any one-dimensional manifold and introduce a versatile Markov Chain Monte Carlo (MCMC)-based algorithm to calculate $\mu$ on any Riemannian manifold with constant curvature. Through simulations on one of the most prevalent manifolds in statistics, the unit sphere $S^d$, we demonstrate the superior utility of our Riemannian Gaussian mechanism in comparison to the previously proposed Riemannian Laplace mechanism for implementing GDP.
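
The Riemannian Gaussian distribution referred to here plausibly follows the standard form in the literature, replacing the Euclidean distance with the geodesic distance $d(\cdot,\cdot)$ on the manifold $\mathcal{M}$ (a sketch of the common definition, not necessarily the paper's exact construction):

$$ p_{\mu,\sigma}(x) \;=\; \frac{1}{Z(\mu,\sigma)} \exp\!\left( -\frac{d(x,\mu)^2}{2\sigma^2} \right), \qquad Z(\mu,\sigma) = \int_{\mathcal{M}} \exp\!\left( -\frac{d(x,\mu)^2}{2\sigma^2} \right) d\mathrm{vol}(x), $$

with the Bishop-Gromov volume comparison presumably used to control $Z$ under the Ricci curvature lower bound.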

Deep Gaussian Markov Random Fields for Graph-Structured Dynamical Systems
Fiona Lippert Bart Kranstauber E. Emiel van Loon Patrick Forré



Research question: Probabilistic inference in high-dimensional state-space models is computationally challenging.
Motivation: For many spatiotemporal systems, prior knowledge about the dependency structure of the state variables is available.
Method: This structure is leveraged to develop a computationally efficient approach to state estimation and learning in graph-structured state-space models with (partially) unknown dynamics and limited historical data.
Results: Under linear Gaussian assumptions, a closed-form posterior is retained that can be sampled efficiently using the conjugate gradient method, scaling favourably compared to classical Kalman-filter-based approaches.

Probabilistic inference in high-dimensional state-space models is computationally challenging. For many spatiotemporal systems, however, prior knowledge about the dependency structure of state variables is available. We leverage this structure to develop a computationally efficient approach to state estimation and learning in graph-structured state-space models with (partially) unknown dynamics and limited historical data. Building on recent methods that combine ideas from deep learning with principled inference in Gaussian Markov random fields (GMRF), we reformulate graph-structured state-space models as Deep GMRFs defined by simple spatial and temporal graph layers. This results in a flexible spatiotemporal prior that can be learned efficiently from a single time sequence via variational inference. Under linear Gaussian assumptions, we retain a closed-form posterior, which can be sampled efficiently using the conjugate gradient method, scaling favourably compared to classical Kalman filter based approaches.
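
The conjugate-gradient role mentioned above can be illustrated on a toy GMRF: under linear Gaussian assumptions the posterior precision stays sparse, so the posterior mean is a single matrix-free linear solve. A minimal sketch with an assumed 1-D chain prior, not the paper's spatiotemporal model:

    import numpy as np
    from scipy.sparse import diags, identity
    from scipy.sparse.linalg import cg

    # Prior: 1-D chain GMRF with sparse precision Q_prior; data y = x + noise.
    n = 200
    Q_prior = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n))
    s2 = 0.5                                   # observation noise variance
    rng = np.random.default_rng(0)
    y = np.sin(np.linspace(0, 6, n)) + rng.normal(scale=s2**0.5, size=n)

    # Posterior precision Q = Q_prior + I/s2; posterior mean solves Q m = y/s2.
    Q = Q_prior + identity(n) / s2
    m, info = cg(Q, y / s2)
    assert info == 0                           # converged, no dense factorization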

Accelerating Motion Planning via Optimal Transport
An Thai Le Georgia Chalvatzaki Armin Biess Jan Peters



Research question: Motion planning remains an open problem in robotics, autonomous driving, and other disciplines, because its high computational demands hinder real-time, efficient decision-making.
Motivation: Existing gradient-based trajectory optimization methods usually suffer from bad local minima, and in many settings they may be inapplicable because gradients of the optimization objectives are not easy to access.
Method: Motion Planning via Optimal Transport (MPOT) is proposed: a gradient-free method that optimizes a batch of smooth trajectories over highly nonlinear costs, even for high-dimensional tasks, while imposing smoothness through a Gaussian Process dynamics prior via the planning-as-inference perspective.
Results: An original zero-order, highly parallelizable update rule is introduced, the Sinkhorn Step, which uses the regular polytope family for its search directions. Across problems ranging from low-dimensional point-mass navigation to high-dimensional whole-body robot motion planning, MPOT proves efficient and outperforms popular motion planners, paving the way for new applications of optimal transport in motion planning.

Motion planning is still an open problem for many disciplines, e.g., robotics, autonomous driving, due to the need for high computational resources that hinders real-time, efficient decision-making. A class of methods striving to provide smooth solutions is gradient-based trajectory optimization. However, those methods usually suffer from bad local minima, while for many settings, they may be inapplicable due to the absence of easy-to-access gradients of the optimization objectives. In response to these issues, we introduce Motion Planning via Optimal Transport (MPOT)---a \textit{gradient-free} method that optimizes a batch of smooth trajectories over highly nonlinear costs, even for high-dimensional tasks, while imposing smoothness through a Gaussian Process dynamics prior via the planning-as-inference perspective. To facilitate batch trajectory optimization, we introduce an original zero-order and highly-parallelizable update rule---the Sinkhorn Step, which uses the regular polytope family for its search directions. Each regular polytope, centered on trajectory waypoints, serves as a local cost-probing neighborhood, acting as a \textit{trust region} where the Sinkhorn Step ``transports'' local waypoints toward low-cost regions. We theoretically show that Sinkhorn Step guides the optimizing parameters toward local minima regions of non-convex objective functions. We then show the efficiency of MPOT in a range of problems from low-dimensional point-mass navigation to high-dimensional whole-body robot motion planning, evincing its superiority compared to popular motion planners, paving the way for new applications of optimal transport in motion planning.
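
Since the Sinkhorn Step builds on entropic optimal transport machinery, a minimal sketch of the classical Sinkhorn iterations it draws its name from may help (a standard algorithm; the paper's contribution is the polytope-structured, parallelized use of it for trajectory waypoints):

    import numpy as np

    def sinkhorn_plan(a, b, C, eps=0.1, iters=200):
        # Entropic OT: min_P <C, P> + eps * KL(P || a b^T), via matrix scaling.
        K = np.exp(-C / eps)
        u = np.ones_like(a)
        for _ in range(iters):
            v = b / (K.T @ u)
            u = a / (K @ v)
        return u[:, None] * K * v[None, :]     # coupling P with marginals a, b

    a = np.full(4, 0.25); b = np.full(3, 1 / 3)
    C = np.abs(np.arange(4)[:, None] - np.arange(3)[None, :]).astype(float)
    P = sinkhorn_plan(a, b, C)
    print(P.sum(axis=1), P.sum(axis=0))        # approximately a and b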

Change point detection and inference in multivariate non-parametric models under mixing conditions
Carlos Misael Madrid Padilla Haotian Xu Daren Wang OSCAR HERNAN MADRID PADILLA Yi Yu



Research question: Localizing and inferring multiple change points in non-parametric multivariate time series settings.
Motivation: In a multivariate time series with potentially short-range dependence, the underlying distributions may change over time in a piecewise-constant manner, with the corresponding change points unknown.
Method: Limiting distributions of the change point estimators are presented under scenarios where the minimal jump size vanishes or remains constant; such results had not previously appeared in the non-parametric change point literature. As byproducts, a sharp estimator that accurately localizes change points in multivariate non-parametric time series and a consistent block-type long-run variance estimator are developed.
Results: Numerical studies show the good performance of the estimators and complement the theoretical findings.

This paper addresses the problem of localizing and inferring multiple change points, in non-parametric multivariate time series settings. Specifically, we consider a multivariate time series with potentially short-range dependence, whose underlying distributions have Hölder smooth densities and can change over time in a piecewise-constant manner. The change points, which correspond to the times when the distribution changes, are unknown. We present the limiting distributions of the change point estimators under the scenarios where the minimal jump size vanishes or remains constant. Such results have not been revealed in the literature in non-parametric change point settings. As byproducts, we develop a sharp estimator that can accurately localize the change points in multivariate non-parametric time series, and a consistent block-type long-run variance estimator. Numerical studies are provided to complement our theoretical findings.

On the Identifiability of Sparse ICA without Assuming Non-Gaussianity
Ignavier Ng Yujia Zheng Xinshuai Dong Kun Zhang



Research question: Traditional independent component analysis (ICA) struggles with the rotational invariance inherent in Gaussian distributions and often requires assuming non-Gaussianity of the sources, which may limit its applicability in broader contexts.
Motivation: To accommodate Gaussian sources, an identifiability theory is developed that relies on second-order statistics without imposing further preconditions on the source distribution, by introducing novel assumptions on the connective structure from sources to observed variables.
Method: Two estimation methods based on second-order statistics and a sparsity constraint are proposed; unlike recent work, the proposed structural variability assumption is both considerably less restrictive and provably necessary.
Results: Experimental results validate the identifiability theory and the estimation methods.

Independent component analysis (ICA) is a fundamental statistical tool used to reveal hidden generative processes from observed data. However, traditional ICA approaches struggle with the rotational invariance inherent in Gaussian distributions, often necessitating the assumption of non-Gaussianity in the underlying sources. This may limit their applicability in broader contexts. To accommodate Gaussian sources, we develop an identifiability theory that relies on second-order statistics without imposing further preconditions on the distribution of sources, by introducing novel assumptions on the connective structure from sources to observed variables. Different from recent work that focuses on potentially restrictive connective structures, our proposed assumption of structural variability is both considerably less restrictive and provably necessary. Furthermore, we propose two estimation methods based on second-order statistics and sparsity constraint. Experimental results are provided to validate our identifiability theory and estimation methods.

Unbalanced Low-rank Optimal Transport Solvers
Meyer Scetbon Michal Klein Giovanni Palla marco cuturi



Research question: The relevance of optimal transport (OT) methods to machine learning has long been hindered by two salient limitations.
Motivation: First, the $O(n^3)$ computational cost of standard sample-based solvers (on batches of n samples) is prohibitive. Second, the mass conservation constraint makes OT solvers too rigid in practice: because they must match all points from both measures, their output can be heavily influenced by outliers.
Method: Recent OT works have addressed these computational and modelling limitations, but through two separate strains: entropic regularization greatly improved the computational outlook, with more recent O(n) linear-time low-rank solvers promising to scale OT further, while modelling rigidity has been eased by unbalanced OT variants that rely on penalty terms to promote, rather than impose, mass conservation.
Results: This paper merges the two strains to deliver versatile, scalable unbalanced low-rank OT solvers, proposing custom algorithms for the linear OT problem and its Fused-Gromov-Wasserstein generalization and demonstrating their practical relevance on challenging spatial transcriptomics matching problems.

The relevance of optimal transport methods to machine learning has long been hindered by two salient limitations. First, the $O(n^3)$ computational cost of standard sample-based solvers (when used on batches of $n$ samples) is prohibitive. Second, the mass conservation constraint makes OT solvers too rigid in practice: because they must match \textit{all} points from both measures, their output can be heavily influenced by outliers. A flurry of recent works in OT has addressed these computational and modelling limitations, but has resulted in two separate strains of methods: While the computational outlook was much improved by entropic regularization, more recent $O(n)$ linear-time \textit{low-rank} solvers hold the promise to scale up OT further. On the other hand, modelling rigidities have been eased owing to unbalanced variants of OT, that rely on penalization terms to promote, rather than impose, mass conservation. The goal of this paper is to merge these two strains, to achieve the promise of \textit{both} versatile/scalable unbalanced/low-rank OT solvers. We propose custom algorithms to implement these extensions for the linear OT problem and its Fused-Gromov-Wasserstein generalization, and demonstrate their practical relevance to challenging spatial transcriptomics matching problems.
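
For reference, the unbalanced relaxation mentioned above is commonly written as (a standard formulation; $\rho$ controls how strongly marginal violations are penalized):

$$ \min_{P \in \mathbb{R}_{+}^{n \times m}} \;\langle C, P\rangle \;+\; \rho\,\mathrm{KL}(P\mathbf{1}_m \,\|\, a) \;+\; \rho\,\mathrm{KL}(P^{\top}\mathbf{1}_n \,\|\, b), $$

recovering balanced OT as $\rho \to \infty$. Low-rank solvers additionally restrict the coupling to factorizations $P = Q\,\mathrm{diag}(1/g)\,R^{\top}$ with inner dimension $r \ll n$, which is what yields the $O(n)$ complexity cited above; merging the two ideas is the paper's goal.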

Differentiable Sampling of Categorical Distributions Using the CatLog-Derivative Trick
Lennert De Smet Emanuele Sansone Pedro Zuidberg Dos Martires



Research question: How to learn the parameters of categorical probability distributions in discrete latent variable models, particularly for products of independent categorical distributions.
Motivation: Existing gradient estimators for categorical distributions, such as the Log-Derivative trick, do not take into account the discrete nature of the distribution itself, motivating a new trick.
Method: The CatLog-Derivative trick is proposed, a variation of the Log-Derivative trick tailored to categorical distributions, and is used to derive IndeCateR, an unbiased gradient estimator for products of independent categorical distributions with provably lower variance than REINFORCE.
Results: Experiments show that IndeCateR can be implemented efficiently and that its gradient estimates have significantly lower bias and variance than the state of the art for the same number of samples.

Categorical random variables can faithfully represent the discrete and uncertain aspects of data as part of a discrete latent variable model. Learning in such models necessitates taking gradients with respect to the parameters of the categorical probability distributions, which is often intractable due to their combinatorial nature. A popular technique to estimate these otherwise intractable gradients is the Log-Derivative trick. This trick forms the basis of the well-known REINFORCE gradient estimator and its many extensions. While the Log-Derivative trick allows us to differentiate through samples drawn from categorical distributions, it does not take into account the discrete nature of the distribution itself. Our first contribution addresses this shortcoming by introducing the CatLog-Derivative trick -- a variation of the Log-Derivative trick tailored towards categorical distributions. Secondly, we use the CatLog-Derivative trick to introduce IndeCateR, a novel and unbiased gradient estimator for the important case of products of independent categorical distributions with provably lower variance than REINFORCE. Thirdly, we empirically show that IndeCateR can be efficiently implemented and that its gradient estimates have significantly lower bias and variance for the same number of samples compared to the state of the art.
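
To make the variance-reduction idea concrete, here is a sketch of a Rao-Blackwellized score-function gradient for a product of independent categoricals, in the spirit of IndeCateR (names, shapes, and the exact estimator are illustrative, not the authors' code): for each dimension, the categories of that dimension are summed out exactly while the remaining dimensions are sampled.

    import numpy as np

    def rb_categorical_grad(logits, f, n_samples=64, rng=None):
        # logits: (D, K) parameters of D independent categoricals.
        # f: maps an (n, D) integer array to (n,) rewards.
        # Returns an estimate of d E_{x~p}[f(x)] / d logits, shape (D, K).
        rng = rng or np.random.default_rng(0)
        D, K = logits.shape
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        x = np.stack([rng.choice(K, size=n_samples, p=p[d]) for d in range(D)],
                     axis=1)
        grad = np.zeros_like(p)
        for d in range(D):
            fvals = np.empty((n_samples, K))
            for k in range(K):               # enumerate dimension d exactly ...
                xk = x.copy()
                xk[:, d] = k
                fvals[:, k] = f(xk)          # ... with the rest sampled
            m = fvals.mean(axis=0)           # approx E_{x_{-d}}[f(x_{-d}, k)]
            # chain rule through the softmax: dp/dlogits = diag(p) - p p^T
            grad[d] = (np.diag(p[d]) - np.outer(p[d], p[d])) @ m
        return grad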

Generative Neural Fields by Mixtures of Neural Implicit Functions
Tackgeun You Mijeong Kim Jungtaek Kim Bohyung Han



Research question: A novel approach to learning generative neural fields represented by linear combinations of implicit basis networks.
Motivation: Learning the basis networks and their coefficients in a latent space, via meta-learning or auto-decoding paradigms, makes it easy to enlarge the capacity of generative neural fields.
Method: The number of basis networks is increased while the inference network is kept small through weighted model averaging, so that sampling instances from the model is efficient in latency and memory footprint.
Results: Experiments show competitive generation performance on diverse benchmarks for images, voxel data, and NeRF scenes, without sophisticated designs for specific modalities and domains.

We propose a novel approach to learning the generative neural fields represented by linear combinations of implicit basis networks. Our algorithm learns basis networks in the form of implicit neural representations and their coefficients in a latent space by either conducting meta-learning or adopting auto-decoding paradigms. The proposed method easily enlarges the capacity of generative neural fields by increasing the number of basis networks while maintaining the size of a network for inference to be small through their weighted model averaging. Consequently, sampling instances using the model is efficient in terms of latency and memory footprint. Moreover, we customize denoising diffusion probabilistic model for a target task to sample latent mixture coefficients, which allows our final model to generate unseen data effectively. Experiments show that our approach achieves competitive generation performance on diverse benchmarks for images, voxel data, and NeRF scenes without sophisticated designs for specific modalities and domains.

Monte Carlo Tree Search with Boltzmann Exploration
Michael Painter Mohamed Baioumy Nick Hawes Bruno Lacerda



Research question: How to improve the efficiency and accuracy of Monte Carlo Tree Search (MCTS) methods in finding optimal actions.
Motivation: Existing MCTS methods such as UCT can be slow to explore an optimal action; MENTS encourages more exploration via the maximum entropy principle, but its optimal actions do not always correspond to optimal actions for the original objective.
Method: Two algorithms are proposed, Boltzmann Tree Search (BTS) and Decaying ENtropy Tree-Search (DENTS), which address MENTS's limitations while preserving the benefits of Boltzmann policies, such as faster action sampling via the Alias method.
Results: Empirical analysis shows consistent high performance across several benchmark domains, including the game of Go.

Monte-Carlo Tree Search (MCTS) methods, such as Upper Confidence Bound applied to Trees (UCT), are instrumental to automated planning techniques. However, UCT can be slow to explore an optimal action when it initially appears inferior to other actions. Maximum ENtropy Tree-Search (MENTS) incorporates the maximum entropy principle into an MCTS approach, utilising Boltzmann policies to sample actions, naturally encouraging more exploration. In this paper, we highlight a major limitation of MENTS: optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective. We introduce two algorithms, Boltzmann Tree Search (BTS) and Decaying ENtropy Tree-Search (DENTS), that address these limitations and preserve the benefits of Boltzmann policies, such as allowing actions to be sampled faster by using the Alias method. Our empirical analysis shows that our algorithms show consistent high performance across several benchmark domains, including the game of Go.

PETAL: Physics Emulation Through Averaged Linearizations for Solving Inverse Problems
Jihui Jin Etienne Ollivier Richard Touret Matthew McKinley Karim Sabra Justin Romberg



Research question: How to recover an underlying signal of interest from observables.
Motivation: Inverting a non-linear forward model is typically computationally expensive, and current emulator-training methods operate in a black-box supervised machine learning fashion that fails to exploit existing knowledge of the forward model.
Method: A simple learned weighted-average model is proposed that embeds linearizations of the forward model around various reference points into the model itself, explicitly incorporating known physics.
Results: Demonstrated on an ocean acoustic tomography (OAT) example, the method recovers ocean sound speed profile (SSP) variations more accurately, improving forward-modeling accuracy and providing richer physics-based gradient information during the inversion process.

Inverse problems describe the task of recovering an underlying signal of interest given observables. Typically, the observables are related via some non-linear forward model applied to the underlying unknown signal. Inverting the non-linear forward model can be computationally expensive, as it often involves computing and inverting a linearization at a series of estimates. Rather than inverting the physics-based model, we instead train a surrogate forward model (emulator) and leverage modern auto-grad libraries to solve for the input within a classical optimization framework. Current methods to train emulators are done in a black box supervised machine learning fashion and fail to take advantage of any existing knowledge of the forward model. In this article, we propose a simple learned weighted average model that embeds linearizations of the forward model around various reference points into the model itself, explicitly incorporating known physics. Grounding the learned model with physics based linearizations improves the forward modeling accuracy and provides richer physics based gradient information during the inversion process leading to more accurate signal recovery. We demonstrate the efficacy on an ocean acoustic tomography (OAT) example that aims to recover ocean sound speed profile (SSP) variations from acoustic observations (e.g. eigenray arrival times) within simulation of ocean dynamics in the Gulf of Mexico.
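
The "weighted average of linearizations" admits a compact sketch (notation illustrative, not necessarily the paper's exact parameterization): with reference points $x_1,\dots,x_m$, forward model $F$, Jacobians $J_i = \nabla F(x_i)$, and learned input-dependent weights $w_i(x)$ summing to one, a surrogate of this kind plausibly takes the form

$$ \widehat{F}(x) \;=\; \sum_{i=1}^{m} w_i(x)\,\big[\,F(x_i) + J_i\,(x - x_i)\,\big], $$

so each local contribution to $\nabla \widehat{F}$ is anchored by a physics-based Jacobian $J_i$, which is what supplies the "richer physics based gradient information" during inversion.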

Amortized Reparametrization: Efficient and Scalable Variational Inference for Latent SDEs
Kevin Course Prasanth B. Nair



Research question: Inferring latent stochastic differential equations (SDEs) with time and memory costs that scale independently of the amount of data, the total length of the time series, and the stiffness of the approximate differential equations.
Motivation: Typical methods for inferring latent differential equations, despite their constant memory cost, have a time complexity that depends heavily on the stiffness of the approximate differential equation.
Method: The need to solve differential equations when approximating gradients is removed through a novel amortization strategy coupled with a recently derived reparametrization of expectations under linear SDEs.
Results: In practice, this allows performance similar to adjoint-sensitivity-based methods with more than an order of magnitude fewer evaluations of the model in training.

We consider the problem of inferring latent stochastic differential equations (SDEs) with a time and memory cost that scales independently with the amount of data, the total length of the time series, and the stiffness of the approximate differential equations. This is in stark contrast to typical methods for inferring latent differential equations which, despite their constant memory cost, have a time complexity that is heavily dependent on the stiffness of the approximate differential equation. We achieve this computational advancement by removing the need to solve differential equations when approximating gradients using a novel amortization strategy coupled with a recently derived reparametrization of expectations under linear SDEs. We show that, in practice, this allows us to achieve similar performance to methods based on adjoint sensitivities with more than an order of magnitude fewer evaluations of the model in training.

Learning Causal Models under Independent Changes
Sarah Mameche David Kaltenpoth Jilles Vreeken



Research question: Explaining the generating process of a system observed in multiple contexts in which its components may change.
Motivation: Recent models are limited to settings where the sparse mechanism shift hypothesis holds and only a subset of causal conditionals change; as this assumption is not easily verifiable in practice, the more general principle that mechanism shifts are independent is studied.
Method: An approach for causal discovery beyond partially directed graphs is introduced, using Gaussian Process models, together with conditions under which the correct causal model is provably identified.
Results: Experiments show that the method performs well across a range of synthetic settings, on realistic gene expression simulations, and on real-world cell signaling data.

In many scientific applications, we observe a system in different conditions in which its components may change, rather than in isolation. In our work, we are interested in explaining the generating process of such a multi-context system using a finite mixture of causal mechanisms. Recent work shows that this causal model is identifiable from data, but is limited to settings where the sparse mechanism shift hypothesis holds and only a subset of the causal conditionals change. As this assumption is not easily verifiable in practice, we study the more general principle that mechanism shifts are independent, which we formalize using the algorithmic notion of independence. We introduce an approach for causal discovery beyond partially directed graphs using Gaussian Process models, and give conditions under which we provably identify the correct causal model. In our experiments, we show that our method performs well in a range of synthetic settings, on realistic gene expression simulations, as well as on real-world cell signaling data.

Moment Matching Denoising Gibbs Sampling
Mingtian Zhang Alex Hawkins-Hooker Brooks Paige David Barber



Research question: Addressing the challenges of training and sampling energy-based models (EBMs), in particular the inconsistency of the widely used denoising score matching (DSM) objective.
Motivation: DSM's inconsistency causes the energy model to learn a noisy data distribution.
Method: An efficient sampling framework is proposed, (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model given a noisy model that has been well-trained via DSM.
Results: The approach shows advantages over related methods and can be scaled to high-dimensional datasets.

Energy-Based Models (EBMs) offer a versatile framework for modelling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a noisy data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a noisy model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.

Equivariant Neural Simulators for Stochastic Spatiotemporal Dynamics
Koen Minartz Yoeri Poels Simon Martinus Koop Vlado Menkovski



Research question: How to use neural networks for scalable data-driven simulation of high-dimensional dynamical systems, especially where numerical methods are infeasible or computationally expensive.
Motivation: Incorporating domain symmetries in deterministic neural simulators can substantially improve their accuracy, sample efficiency, and parameter efficiency; however, incorporating symmetries in probabilistic neural simulators that can capture stochastic phenomena requires a model producing equivariant distributions over trajectories, rather than equivariant function approximations.
Method: The Equivariant Probabilistic Neural Simulation (EPNS) framework is proposed for autoregressive probabilistic modeling of equivariant distributions over system evolutions, and is used to design models for a stochastic n-body system and stochastic cellular dynamics.
Results: EPNS considerably outperforms existing neural-network-based methods for probabilistic simulation; incorporating equivariance improves simulation quality, data efficiency, rollout stability, and uncertainty quantification, making EPNS a promising method for efficient and effective data-driven probabilistic simulation across diverse domains.

Neural networks are emerging as a tool for scalable data-driven simulation of high-dimensional dynamical systems, especially in settings where numerical methods are infeasible or computationally expensive. Notably, it has been shown that incorporating domain symmetries in deterministic neural simulators can substantially improve their accuracy, sample efficiency, and parameter efficiency. However, to incorporate symmetries in probabilistic neural simulators that can simulate stochastic phenomena, we need a model that produces equivariant distributions over trajectories, rather than equivariant function approximations. In this paper, we propose Equivariant Probabilistic Neural Simulation (EPNS), a framework for autoregressive probabilistic modeling of equivariant distributions over system evolutions. We use EPNS to design models for a stochastic n-body system and stochastic cellular dynamics. Our results show that EPNS considerably outperforms existing neural network-based methods for probabilistic simulation. More specifically, we demonstrate that incorporating equivariance in EPNS improves simulation quality, data efficiency, rollout stability, and uncertainty quantification. We conclude that EPNS is a promising method for efficient and effective data-driven probabilistic simulation in a diverse range of domains.

Probabilistic Exponential Integrators
Nathanael Bosch Philipp Hennig Filip Tronarp



Research question: Addressing the performance degradation of probabilistic solvers for dynamical systems on certain stiff problems.
Motivation: In stiff systems, small steps are required not for numerical accuracy but for stability, which penalizes standard solvers.
Method: By including the fast, linear dynamics in the prior, a class of probabilistic exponential integrators with favorable properties is proposed.
Results: Experiments demonstrate improved stability and efficiency over established probabilistic solvers when handling stiff differential equations.

Probabilistic solvers provide a flexible and efficient framework for simulation, uncertainty quantification, and inference in dynamical systems. However, like standard solvers, they suffer performance penalties for certain stiff systems, where small steps are required not for reasons of numerical accuracy but for the sake of stability. This issue is greatly alleviated in semi-linear problems by the probabilistic exponential integrators developed in this paper. By including the fast, linear dynamics in the prior, we arrive at a class of probabilistic integrators with favorable properties. Namely, they are proven to be L-stable, and in a certain case reduce to a classic exponential integrator---with the added benefit of providing a probabilistic account of the numerical error. The method is also generalized to arbitrary non-linear systems by imposing piece-wise semi-linearity on the prior via Jacobians of the vector field at the previous estimates, resulting in probabilistic exponential Rosenbrock methods. We evaluate the proposed methods on multiple stiff differential equations and demonstrate their improved stability and efficiency over established probabilistic solvers. The present contribution thus expands the range of problems that can be effectively tackled within probabilistic numerics.
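
For context, the classic (non-probabilistic) exponential integrator that the method is said to reduce to acts on semi-linear systems $\dot{x} = L x + N(x)$ by treating the linear part exactly; the exponential Euler step is

$$ x_{k+1} \;=\; e^{hL} x_k \;+\; h\,\varphi_1(hL)\, N(x_k), \qquad \varphi_1(z) = \frac{e^{z} - 1}{z}, $$

and the probabilistic version described above additionally carries a calibrated posterior over the numerical error.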

Learning Efficient Surrogate Dynamic Models with Graph Spline Networks
Chuanbo Hua Federico Berto Michael Poli Stefano Massaroli Jinkyoo Park



Research question: How to lower the computational demands of physical-system simulation and speed up forecasting.
Motivation: Although deep learning is widely used in engineering and scientific computing, reducing the often prohibitive computational requirements of physical simulations has only recently been tackled.
Method: GraphSplineNets, a novel deep-learning method, speeds up the forecasting of physical systems by reducing the grid size and the number of iteration steps of deep surrogate models. It uses two differentiable orthogonal spline collocation methods to efficiently predict responses at any location in time and space, plus an adaptive spatial collocation strategy that prioritizes sampling from the most important regions.
Results: GraphSplineNets improve the accuracy-speedup tradeoff in forecasting various complex dynamical systems, including the heat equation, damped wave propagation, Navier-Stokes equations, and real-world ocean currents in both regular and irregular domains.

While complex simulations of physical systems have been widely used in engineering and scientific computing, lowering their often prohibitive computational requirements has only recently been tackled by deep learning approaches. In this paper, we present GraphSplineNets, a novel deep-learning method to speed up the forecasting of physical systems by reducing the grid size and number of iteration steps of deep surrogate models. Our method uses two differentiable orthogonal spline collocation methods to efficiently predict response at any location in time and space. Additionally, we introduce an adaptive collocation strategy in space to prioritize sampling from the most important regions. GraphSplineNets improve the accuracy-speedup tradeoff in forecasting various dynamical systems with increasing complexity, including the heat equation, damped wave propagation, Navier-Stokes equations, and real-world ocean currents in both regular and irregular domains.

Differentiable Random Partition Models
Thomas M. Sutter Alain Ryser Joram Liebeskind Julia E Vogt



Research question: Partitioning a set of elements into an unknown number of mutually exclusive subsets, which is essential in many machine learning problems.
Motivation: Assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters.
Method: A novel two-step method for inferring partitions overcomes this limitation and enables use in variational inference tasks: the number of elements per subset is inferred first, and the subsets are then filled in a learned order, yielding reparameterized gradients with respect to the parameters of the new random partition model.
Results: The versatility of the approach is shown on three challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.

Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.

Leveraging Locality and Robustness to Achieve Massively Scalable Gaussian Process Regression
Robert F Allison Anthony Stephenson Samuel F Edward Pyzer-Knapp



Research question: The accurate predictions and principled uncertainty measures of Gaussian process (GP) regression are prohibitively costly for modern large-scale applications.
Motivation: This cost has motivated extensive work on computationally efficient approximations.
Method: The robustness properties and limiting behaviour of GP nearest-neighbour (GPnn) prediction are explored through theory and simulation.
Results: As the data size n increases, the accuracy of estimated parameters and of the GP model assumptions becomes increasingly irrelevant to GPnn predictive accuracy, so high MSE accuracy can be achieved with little effort spent on parameter estimation, even under gross misspecification. As n tends to infinity, uncertainty calibration and negative log-likelihood remain sensitive to just one parameter, the additive noise variance, but this source of inaccuracy can be corrected for, achieving well-calibrated uncertainty measures and accurate predictions at remarkably low computational cost.

The accurate predictions and principled uncertainty measures provided by GP regression incur $O(n^3)$ cost which is prohibitive for modern-day large-scale applications. This has motivated extensive work on computationally efficient approximations. We introduce a new perspective by exploring robustness properties and limiting behaviour of GP nearest-neighbour (GPnn) prediction. We demonstrate through theory and simulation that as the data-size $n$ increases, accuracy of estimated parameters and GP model assumptions become increasingly irrelevant to GPnn predictive accuracy. Consequently, it is sufficient to spend small amounts of work on parameter estimation in order to achieve high MSE accuracy, even in the presence of gross misspecification. In contrast, as $n \rightarrow \infty$, uncertainty calibration and NLL are shown to remain sensitive to just one parameter, the additive noise-variance; but we show that this source of inaccuracy can be corrected for, thereby achieving both well-calibrated uncertainty measures and accurate predictions at remarkably low computational cost. We exhibit a very simple GPnn regression algorithm with stand-out performance compared to other state-of-the-art GP approximations as measured on large UCI datasets. It operates at a small fraction of those other methods' training costs, for example on a basic laptop taking about 30 seconds to train on a dataset of size $n = 1.6 \times 10^6$.
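
A minimal sketch of the GP nearest-neighbour prediction pattern described above (an RBF kernel and rough hyperparameters are assumed; this is not the authors' exact algorithm): restrict exact GP inference to the m nearest neighbours of each test point.

    import numpy as np

    def gpnn_predict(X, y, x_star, m=400, ls=1.0, noise=0.1):
        # Select the m nearest training points to x_star, then do exact GP
        # regression on that subset only: O(m^3) instead of O(n^3).
        idx = np.argpartition(np.linalg.norm(X - x_star, axis=1), m)[:m]
        Xn, yn = X[idx], y[idx]
        k = lambda A, B: np.exp(
            -0.5 * ((A[:, None] - B[None, :]) ** 2).sum(-1) / ls**2)
        K = k(Xn, Xn) + noise * np.eye(m)
        ks = k(Xn, x_star[None])[:, 0]
        mean = ks @ np.linalg.solve(K, yn)
        var = 1.0 + noise - ks @ np.linalg.solve(K, ks)
        return mean, var

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(5000, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=5000)
    print(gpnn_predict(X, y, np.array([0.5]), m=200, noise=0.09))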

SAMoSSA: Multivariate Singular Spectrum Analysis with Stochastic Autoregressive Noise
Abdullah Omar Alomar Munther A. Dahleh Sean Mann Devavrat Shah



Research question: Estimating deterministic, non-stationary trend and seasonality components while learning the residual stochastic, stationary component in time series analysis, and establishing a theoretical underpinning for such multi-stage learning algorithms.
Motivation: Deterministic non-stationary components can be learned accurately with multivariate singular spectrum analysis (mSSA) in the absence of a correlated stationary component, and an autoregressive (AR) stationary component can be learned readily, e.g. via ordinary least squares (OLS), in the absence of deterministic non-stationary components; yet a theoretical underpinning of multi-stage learning algorithms combining both has been absent from the literature.
Method: A natural two-stage algorithm, SAMoSSA, first applies mSSA to estimate the non-stationary components despite the presence of a correlated stationary AR component, and then learns that AR component from the residual time series. A finite-sample forecasting consistency bound is provided for this data-driven algorithm, which requires minimal parameter tuning.
Results: Representative empirical studies validate SAMoSSA's superior performance over existing baselines; notably, its ability to account for the AR noise structure yields improvements ranging from 5% to 37% across various benchmark datasets.

The well-established practice of time series analysis involves estimating deterministic, non-stationary trend and seasonality components followed by learning the residual stochastic, stationary components. Recently, it has been shown that one can learn the deterministic non-stationary components accurately using multivariate Singular Spectrum Analysis (mSSA) in the absence of a correlated stationary component; meanwhile, in the absence of deterministic non-stationary components, the Autoregressive (AR) stationary component can also be learnt readily, e.g. via Ordinary Least Squares (OLS). However, a theoretical underpinning of multi-stage learning algorithms involving both deterministic and stationary components has been absent in the literature despite its pervasiveness. We resolve this open question by establishing desirable theoretical guarantees for a natural two-stage algorithm, where mSSA is first applied to estimate the non-stationary components despite the presence of a correlated stationary AR component, which is subsequently learned from the residual time series. We provide a finite-sample forecasting consistency bound for the proposed algorithm, SAMoSSA, which is data-driven and thus requires minimal parameter tuning. To establish theoretical guarantees, we overcome three hurdles: (i) we characterize the spectra of Page matrices of stable AR processes, thus extending the analysis of mSSA; (ii) we extend the analysis of AR process identification in the presence of arbitrary bounded perturbations; (iii) we characterize the out-of-sample or forecasting error, as opposed to solely considering model identification. Through representative empirical studies, we validate the superior performance of SAMoSSA compared to existing baselines. Notably, SAMoSSA's ability to account for AR noise structure yields improvements ranging from 5% to 37% across various benchmark datasets.
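
The two-stage structure has a compact sketch (the Page-matrix construction, rank, and AR order below are illustrative; the actual estimator and its guarantees are in the paper):

    import numpy as np

    def samossa_sketch(ts, L=25, rank=3, p=2):
        # Stage 1 (mSSA-style): a low-rank approximation of the Page matrix
        # estimates the deterministic trend + seasonality component.
        T = (len(ts) // L) * L
        page = ts[:T].reshape(L, -1, order="F")        # L x (T/L), blockwise
        U, s, Vt = np.linalg.svd(page, full_matrices=False)
        det = ((U[:, :rank] * s[:rank]) @ Vt[:rank]).reshape(-1, order="F")
        resid = ts[:T] - det
        # Stage 2: fit AR(p) to the residual by ordinary least squares.
        X = np.column_stack([resid[p - i - 1:T - i - 1] for i in range(p)])
        ar_coef, *_ = np.linalg.lstsq(X, resid[p:], rcond=None)
        return det, ar_coef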

On kernel-based statistical learning theory in the mean field limit
Christian Fiedler Michael Herty Sebastian Trimpe



Research question: How to do machine learning when the number of input variables goes to infinity.
Motivation: The setting is motivated by machine learning of interacting particle systems.
Method: The recent investigation of the mean field limit of kernels and their reproducing kernel Hilbert spaces is continued, completing the existing theory; results relevant for approximation with such kernels in the mean field limit, including a representer theorem, are provided; and these kernels are then used in statistical learning in the mean field limit, focusing on support vector machines.
Results: Mean field convergence of empirical and infinite-sample solutions, as well as of the corresponding risks, is shown. On the one hand, the results establish rigorous mean field limits for kernel methods, providing new theoretical tools and insights for large-scale problems; on the other hand, the setting corresponds to a new form of limit of learning problems that appears not to have been investigated in the statistical learning theory literature.

In many applications of machine learning, a large number of variables are considered. Motivated by machine learning of interacting particle systems, we consider the situation when the number of input variables goes to infinity. First, we continue the recent investigation of the mean field limit of kernels and their reproducing kernel Hilbert spaces, completing the existing theory. Next, we provide results relevant for approximation with such kernels in the mean field limit, including a representer theorem. Finally, we use these kernels in the context of statistical learning in the mean field limit, focusing on Support Vector Machines. In particular, we show mean field convergence of empirical and infinite-sample solutions as well as the convergence of the corresponding risks. On the one hand, our results establish rigorous mean field limits in the context of kernel methods, providing new theoretical tools and insights for large-scale problems. On the other hand, our setting corresponds to a new form of limit of learning problems, which seems to have not been investigated yet in the statistical learning theory literature.

ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling
Yuqi Chen Kan Ren Yansen Wang Yuchen Fang Weiwei Sun Dongsheng Li



Research question: How to model irregular time series effectively so as to capture continuously evolving data and continuously occurring correlations.
Motivation: Traditional neural approaches such as recurrent neural networks or Transformer models are limited on continuous-time data by their discrete nature. Although neural ordinary differential equations (Neural ODEs) and their variants show promise on irregular time series, they often fail to capture the intricate correlations within these sequences; concurrently modeling the relationships between input data points and the dynamics of the continuous-time system is challenging yet demanded.
Method: ContiFormer is proposed, a novel model combining the continuous-dynamics modeling of Neural ODEs with the attention mechanism of Transformers. Its expressive power is characterized mathematically, and through curated designs of function hypotheses, many Transformer variants specialized for irregular time series modeling are covered as special cases of ContiFormer.
Results: Extensive experiments on synthetic and real-world datasets show ContiFormer's superior modeling capacity and prediction performance on irregular time series data.

Modeling continuous-time dynamics on irregular time series is critical to account for data evolution and correlations that occur continuously. Traditional methods including recurrent neural networks or Transformer models leverage inductive bias via powerful neural architectures to capture complex patterns. However, due to their discrete characteristic, they have limitations in generalizing to continuous-time data paradigms. Though neural ordinary differential equations (Neural ODEs) and their variants have shown promising results in dealing with irregular time series, they often fail to capture the intricate correlations within these sequences. It is challenging yet demanding to concurrently model the relationship between input data points and capture the dynamic changes of the continuous-time system. To tackle this problem, we propose ContiFormer that extends the relation modeling of vanilla Transformer to the continuous-time domain, which explicitly incorporates the modeling abilities of continuous dynamics of Neural ODEs with the attention mechanism of Transformers. We mathematically characterize the expressive power of ContiFormer and illustrate that, by curated designs of function hypothesis, many Transformer variants specialized in irregular time series modeling can be covered as a special case of ContiFormer. A wide range of experiments on both synthetic and real-world datasets have illustrated the superior modeling capacities and prediction performance of ContiFormer on irregular time series data. The project link is https://seqml.github.io/contiformer/.

On the Identifiability and Interpretability of Gaussian Process Models
Jiawen Chen Wancen Mu Yun Li Didong Li



Research question: A critical examination of the prevalent practice of using additive mixtures of Matérn kernels in single-output Gaussian process (GP) models, together with the properties of multiplicative mixtures of Matérn kernels for multi-output GP models.
Motivation: For the single-output case, a series of theoretical results shows that the smoothness of a mixture of Matérn kernels is determined by the least smooth component, that a GP with such a kernel is effectively equivalent to its least smooth kernel component, and that none of the mixing weights or parameters within individual kernel components are identifiable.
Method: Attention then turns to multi-output GP models, analyzing the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single-output kernel such as Matérn; $A$ is shown to be identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks.
Results: The findings are supported by extensive simulations and real applications in both single- and multi-output settings, providing insight into kernel selection and interpretation for GP models and emphasizing the importance of choosing appropriate kernel structures for different tasks.

In this paper, we critically examine the prevalent practice of using additive mixtures of Mat\'ern kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Mat\'ern kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Mat\'ern kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Mat\'ern. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.

Efficient Exploration in Continuous-time Model-based Reinforcement Learning
Lenart Treven Jonas Hübotter Bhavya Sukhija Florian Dorfler Andreas Krause



Research question: Addressing reinforcement learning for continuous-time dynamical systems.
Motivation: Although the underlying systems are often continuous in time, existing reinforcement learning algorithms typically consider only discrete-time dynamics.
Method: A model-based reinforcement learning algorithm is proposed that represents continuous-time dynamics using nonlinear ordinary differential equations (ODEs), captures epistemic uncertainty with well-calibrated probabilistic models, and uses the optimistic principle for exploration.
Results: The analysis shows that the regret is sublinear when modeling ODEs with Gaussian processes (GPs) under common measurement selection strategies (MSS) such as equidistant sampling. An adaptive, data-dependent, practical MSS is also proposed that, combined with GP dynamics, achieves sublinear regret with significantly fewer samples.

Reinforcement learning algorithms typically consider discrete-time dynamics, even though the underlying systems are often continuous in time. In this paper, we introduce a model-based reinforcement learning algorithm that represents continuous-time dynamics using nonlinear ordinary differential equations (ODEs). We capture epistemic uncertainty using well-calibrated probabilistic models, and use the optimistic principle for exploration. Our regret bounds surface the importance of the measurement selection strategy (MSS), since in continuous time we not only must decide how to explore, but also when to observe the underlying system. Our analysis demonstrates that the regret is sublinear when modeling ODEs with Gaussian Processes (GP) for common choices of MSS, such as equidistant sampling. Additionally, we propose an adaptive, data-dependent, practical MSS that, when combined with GP dynamics, also achieves sublinear regret with significantly fewer samples. We showcase the benefits of continuous-time modeling over its discrete-time counterpart, as well as our proposed adaptive MSS over standard baselines, on several applications.

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
Kareem Ahmed Kai-Wei Chang Guy Van den Broeck



Research question: How to combine symbolic and neural approaches to learning.
Motivation: Bridging the gap between purely symbolic and neural learning often requires maximizing the likelihood of a symbolic constraint with respect to the neural network's output distribution, which is #P-hard for expressive auto-regressive distributions.
Method: Instead of enforcing the constraint on the entire likelihood distribution, a random, local approximation is proposed: the likelihood of the constraint is approximated by its pseudolikelihood centered around a model sample. The approach is factorizable, allowing reuse of solutions to sub-problems, a main tenet for the efficient computation of neuro-symbolic losses.
Results: On Sudoku and shortest-path prediction cast as auto-regressive generation, the method greatly improves the base model's ability to predict logically consistent outputs. On detoxifying large language models, a simple constraint disallowing a list of toxic words steers outputs away from toxic generations, achieving state-of-the-art results compared to previous approaches.

Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. This often requires maximizing the likelihood of a symbolic constraint w.r.t. the neural network's output distribution. Such output distributions are typically assumed to be fully-factorized. This limits the applicability of neuro-symbolic learning to the more expressive auto-regressive distributions, e.g., transformers. Under such distributions, computing the likelihood of even simple constraints is #P-hard. Instead of attempting to enforce the constraint on the entire likelihood distribution, we propose to do so on a random, local approximation thereof. More precisely, we approximate the likelihood of the constraint with the pseudolikelihood of the constraint centered around a model sample. Our approach is factorizable, allowing us to reuse solutions to sub-problems---a main tenet for the efficient computation of neuro-symbolic losses. It also provides a local, high fidelity approximation of the likelihood: it exhibits low entropy and KL-divergence around the model sample. We tested our approach on Sudoku and shortest-path prediction cast as auto-regressive generation, and observe that we greatly improve upon the base model's ability to predict logically-consistent outputs. We also tested our approach on the task of detoxifying large language models. We observe that using a simple constraint disallowing a list of toxic words, we are able to steer the model's outputs away from toxic generations, achieving SoTA compared to previous approaches.

Structure Learning with Adaptive Random Neighborhood Informed MCMC
Xitong Liang Alberto Caron Samuel Livingstone Jim Griffin



Research question: A novel MCMC sampler, PARNI-DAG, for a fully Bayesian approach to structure learning from observational data.
Motivation: Under the assumption of causal sufficiency, the algorithm allows approximate sampling directly from the posterior distribution over directed acyclic graphs (DAGs).
Method: PARNI-DAG samples DAGs efficiently via locally informed, adaptive random neighborhood proposals that yield better mixing properties. In addition, to ensure better scalability with the number of nodes, PARNI-DAG is coupled with a pre-tuning procedure for the sampler's parameters that exploits a skeleton graph derived through constraint-based or scoring-based algorithms.
Results: Thanks to these novel features, PARNI-DAG quickly converges to high-probability regions and is less likely to get stuck in local modes in the presence of high correlation between nodes in high-dimensional settings; a variety of experiments empirically demonstrate its mixing efficiency and accuracy in learning DAG structures.

In this paper, we introduce a novel MCMC sampler, PARNI-DAG, for a fully-Bayesian approach to the problem of structure learning under observational data. Under the assumption of causal sufficiency, the algorithm allows for approximate sampling directly from the posterior distribution on Directed Acyclic Graphs (DAGs). PARNI-DAG performs efficient sampling of DAGs via locally informed, adaptive random neighborhood proposal that results in better mixing properties. In addition, to ensure better scalability with the number of nodes, we couple PARNI-DAG with a pre-tuning procedure of the sampler's parameters that exploits a skeleton graph derived through some constraint-based or scoring-based algorithms. Thanks to these novel features, PARNI-DAG quickly converges to high-probability regions and is less likely to get stuck in local modes in the presence of high correlation between nodes in high-dimensional settings. After introducing the technical novelties in PARNI-DAG, we empirically demonstrate its mixing efficiency and accuracy in learning DAG structures on a variety of experiments.

Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions
Xiang Cheng Bohan Wang Jingzhao Zhang Yusong Zhu



Research question: MCMC algorithms have slow theoretical mixing rates when the target distribution is non-log-concave.
Motivation: To close the gap between theory and the empirical efficiency of MCMC under non-log-concave targets.
Method: When a Poincaré-style inequality holds on a subset of the state space, the conditional distribution of MCMC iterates over that subset is shown to mix fast to the true conditional distribution.
Results: This fast-mixing guarantee can hold in cases where global mixing is provably slow. Conditional mixing is further shown to have interesting implications for sampling from mixtures of Gaussians, parameter estimation for Gaussian mixture models, and Gibbs sampling with well-connected local minima.

MCMC algorithms offer empirically efficient tools for sampling from a target distribution $\pi(x) \propto \exp(-V(x))$. However, on the theory side, MCMC algorithms suffer from slow mixing rate when $\pi(x)$ is non-log-concave. Our work examines this gap and shows that when Poincar\'e-style inequality holds on a subset $\mathcal{X}$ of the state space, the conditional distribution of MCMC iterates over $\mathcal{X}$ mixes fast to the true conditional distribution. This fast mixing guarantee can hold in cases when global mixing is provably slow. We formalize the statement and quantify the conditional mixing rate. We further show that conditional mixing can have interesting implications for sampling from mixtures of Gaussians, parameter estimation for Gaussian mixture models, and Gibbs-sampling with well-connected local minima.
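
Concretely, a Poincaré-style inequality restricted to a subset $\mathcal{X}$ states (in its standard form) that for the conditional law $\pi|_{\mathcal{X}}(A) = \pi(A \cap \mathcal{X}) / \pi(\mathcal{X})$ there is a constant $C_P$ with

$$ \operatorname{Var}_{\pi|_{\mathcal{X}}}(f) \;\le\; C_P\, \mathbb{E}_{\pi|_{\mathcal{X}}}\!\left[ \|\nabla f\|^2 \right] \quad \text{for all smooth } f, $$

and the result above says that under such a local inequality, the law of the iterates conditioned on $\mathcal{X}$ approaches $\pi|_{\mathcal{X}}$ quickly even when transitions between modes are slow.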

Asymptotics of Bayesian Uncertainty Estimation in Random Features Regression
Youngsoo Baek Samuel Berchuck Sayan Mukherjee



Research question: Comparing and contrasting the behavior of the posterior predictive distribution with the risk of the maximum a posteriori (MAP) estimator for the random features regression model in the overparameterized regime.
Motivation: The focus is on the variance of the posterior predictive distribution (the Bayesian model average) and on comparing its asymptotics to the risk of the MAP estimator.
Method: In the regime where model dimensions grow faster than any constant multiple of the number of samples, the asymptotic agreement between the two quantities is shown to be governed by a phase transition in the signal-to-noise ratio; the two quantities also agree asymptotically when the number of samples grows faster than any constant multiple of the model dimensions.
Results: Numerical simulations illustrate finer distributional properties of the two quantities in finite dimensions; they are conjectured to have Gaussian fluctuations and to exhibit properties similar to those found by previous authors in a Gaussian sequence model, which is of independent theoretical interest.

In this paper we compare and contrast the behavior of the posterior predictive distribution to the risk of the maximum a posteriori estimator for the random features regression model in the overparameterized regime. We will focus on the variance of the posterior predictive distribution (Bayesian model average) and compare its asymptotics to that of the risk of the MAP estimator. In the regime where the model dimensions grow faster than any constant multiple of the number of samples, asymptotic agreement between these two quantities is governed by the phase transition in the signal-to-noise ratio. They also asymptotically agree with each other when the number of samples grows faster than any constant multiple of model dimensions. Numerical simulations illustrate finer distributional properties of the two quantities for finite dimensions. We conjecture that they have Gaussian fluctuations and exhibit properties similar to those found by previous authors in a Gaussian sequence model; this is of independent theoretical interest.

Variational Weighting for Kernel Density Ratios
Sangwoong Yoon Frank C. Park Gunsu S YUN Iljung Kim Yung-Kyun Noh



Research question: Reducing the bias of standard kernel density estimates by optimizing a weight function, thereby improving estimates of prediction posteriors and information-theoretic measures.
Motivation: Kernel density estimation (KDE) is integral to a range of generative and discriminative tasks in machine learning, but standard kernel density estimates of density ratios are biased, degrading prediction posteriors and information-theoretic measures.
Method: Drawing on tools from the multidimensional calculus of variations, an optimal weight function is derived that reduces the bias in standard kernel density estimates for density ratios.
Results: The optimal weight function leads to improved estimates of prediction posteriors and information-theoretic measures, and the analysis sheds light on fundamental aspects of density estimation, particularly for algorithms that employ KDEs as their main building blocks.

Kernel density estimation (KDE) is integral to a range of generative and discriminative tasks in machine learning. Drawing upon tools from the multidimensional calculus of variations, we derive an optimal weight function that reduces bias in standard kernel density estimates for density ratios, leading to improved estimates of prediction posteriors and information-theoretic measures. In the process, we shed light on some fundamental aspects of density estimation, particularly from the perspective of algorithms that employ KDEs as their main building blocks.
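
The shape of such an estimator is easy to sketch: a weighted KDE multiplies each kernel by a weight evaluated at the sample, and the paper's contribution is a variationally optimal choice of that weight. In the sketch below the weights are a user-supplied placeholder, not the derived optimum:

    import numpy as np

    def weighted_kde(x_eval, samples, w, h=0.3):
        # \hat p(x) = (1/n) sum_i w(x_i) K_h(x - x_i), Gaussian kernel K_h.
        z = (x_eval[:, None] - samples[None, :]) / h
        K = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * h)
        return (K * w[None, :]).mean(axis=1)

    rng = np.random.default_rng(0)
    xs = rng.normal(size=1000)
    grid = np.linspace(-3, 3, 7)
    print(weighted_kde(grid, xs, np.ones_like(xs)))  # uniform w: standard KDE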

A Bayesian Take on Gaussian Process Networks
Enrico Giudice Jack Kuipers Giusi Moffa



Research question: Bayesian structure learning with Gaussian Process Networks (GPNs): sampling from the posterior distribution over network structures.
Motivation: Computing the posterior over graphs of the network is computationally infeasible even in low dimensions, so an effective sampling method is needed.
Method: Monte Carlo and Markov Chain Monte Carlo methods are implemented to sample from the posterior distribution of network structures, following the Bayesian paradigm by comparing models via their marginal likelihood and computing the posterior probability of GPN features.
Results: Simulation studies show that the method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution.

Gaussian Process Networks (GPNs) are a class of directed graphical models which employ Gaussian processes as priors for the conditional expectation of each variable given its parents in the network. The model allows the description of continuous joint distributions in a compact but flexible manner with minimal parametric assumptions on the dependencies between variables. Bayesian structure learning of GPNs requires computing the posterior over graphs of the network and is computationally infeasible even in low dimensions. This work implements Monte Carlo and Markov Chain Monte Carlo methods to sample from the posterior distribution of network structures. As such, the approach follows the Bayesian paradigm, comparing models via their marginal likelihood and computing the posterior probability of the GPN features. Simulation studies show that our method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution.

Granger Components Analysis: Unsupervised learning of latent temporal dependencies
Jacek Dmochowski



Research question: A new technique for unsupervised learning of time series data based on the notion of Granger causality.
Motivation: Existing techniques do not effectively identify and exploit latent temporal dependencies in multivariate data sets.
Method: A coordinate descent algorithm that learns pairs of coefficient vectors in an alternating fashion is developed, learning pairs of projections of a multivariate data set that maximize the strength of the Granger causality between the latent time series.
Results: On simulated vector autoregressive (VAR) data, the technique blindly identifies the underlying sources (up to scale). Tested on scalp electroencephalography (EEG) data from a motor imagery experiment, the resulting components lateralize with the side of the cued hand; on functional magnetic resonance imaging (fMRI) data, the recovered components express previously reported resting-state networks.

A new technique for unsupervised learning of time series data based on the notion of Granger causality is presented. The technique learns pairs of projections of a multivariate data set such that the resulting components -- "driving" and "driven" -- maximize the strength of the Granger causality between the latent time series (how strongly the past of the driving signal predicts the present of the driven signal). A coordinate descent algorithm that learns pairs of coefficient vectors in an alternating fashion is developed and shown to blindly identify the underlying sources (up to scale) on simulated vector autoregressive (VAR) data. The technique is tested on scalp electroencephalography (EEG) data from a motor imagery experiment where the resulting components lateralize with the side of the cued hand, and also on functional magnetic resonance imaging (fMRI) data, where the recovered components express previously reported resting-state networks.
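
The objective being maximized can be sketched with the standard log variance-ratio Granger statistic between two projected series; the paper alternates coordinate-descent updates of the two projection vectors to maximize a quantity of this kind (a sketch of the statistic only, assuming equal-length series and AR order p):

    import numpy as np

    def granger_strength(drive, driven, p=5):
        # Compare predicting `driven` from its own past vs. its own past plus
        # the past of `drive`; a larger log variance-ratio means the past of
        # the driving signal better predicts the present of the driven one.
        T = len(driven)
        lags = lambda z: np.column_stack(
            [z[p - i - 1:T - i - 1] for i in range(p)])
        own = lags(driven)
        both = np.column_stack([lags(driven), lags(drive)])
        y = driven[p:]
        rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0])**2)
        return np.log(rss(own) / rss(both))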

Entropy-based Training Methods for Scalable Neural Implicit Samplers
Weijian Luo Boya Zhang Zhihua Zhang



Research question: How to sample efficiently from un-normalized target distributions, a fundamental problem in scientific computing and machine learning.
Motivation: Traditional methods such as MCMC guarantee asymptotically unbiased samples from such distributions but are computationally inefficient, particularly for high-dimensional targets, where many iterations are needed to generate a batch of samples.
Method: An efficient and scalable neural implicit sampler that maps easily sampled latent vectors directly to target samples without iterative procedures, generating large batches at low computational cost; it is trained with two new objectives, a KL training method and a Fisher training method.
Results: On three sampling benchmarks of different scales (2D targets, Bayesian inference, and high-dimensional energy-based models), the sampler proves effective, efficient, and scalable; in the high-dimensional EBM experiments it produces samples comparable to MCMC-based methods while being more than 100 times more efficient.

Efficiently sampling from un-normalized target distributions is a fundamental problem in scientific computing and machine learning. Traditional approaches such as Markov Chain Monte Carlo (MCMC) guarantee asymptotically unbiased samples from such distributions but suffer from computational inefficiency, particularly when dealing with high-dimensional targets, as they require numerous iterations to generate a batch of samples. In this paper, we introduce an efficient and scalable neural implicit sampler that overcomes these limitations. The implicit sampler can generate large batches of samples with low computational costs by leveraging a neural transformation that directly maps easily sampled latent vectors to target samples without the need for iterative procedures. To train the neural implicit samplers, we introduce two novel methods: the KL training method and the Fisher training method. The former method minimizes the Kullback-Leibler divergence, while the latter minimizes the Fisher divergence between the sampler and the target distributions. By employing the two training methods, we effectively optimize the neural implicit samplers to learn and generate from the desired target distribution. To demonstrate the effectiveness, efficiency, and scalability of our proposed samplers, we evaluate them on three sampling benchmarks with different scales. These benchmarks include sampling from 2D targets, Bayesian inference, and sampling from high-dimensional energy-based models (EBMs). Notably, in the experiment involving high-dimensional EBMs, our sampler produces samples that are comparable to those generated by MCMC-based methods while being more than 100 times more efficient, showcasing the efficiency of our neural sampler. Besides the theoretical contributions and strong empirical performances, the proposed neural samplers and corresponding training methods will shed light on further research on developing efficient samplers for various applications beyond the ones explored in this study.

Learning Space-Time Continuous Latent Neural PDEs from Partially Observed States
Valerii Iakovlev Markus Heinonen Harri Lähdesmäki



Research question: How to learn partial differential equations (PDEs) from noisy and partial observations on irregular spatiotemporal grids.
Motivation: Existing methods handle partially observed data poorly, so a model that copes effectively with this setting is needed.
Method: A space-time continuous latent neural PDE model that combines the collocation method with the method of lines, together with an efficient probabilistic framework and a novel encoder design for improved data efficiency and grid independence.
Results: The model achieves state-of-the-art performance on complex synthetic and real-world datasets, overcoming limitations of previous approaches and handling partially observed data effectively.

We introduce a novel grid-independent model for learning partial differential equations (PDEs) from noisy and partial observations on irregular spatiotemporal grids. We propose a space-time continuous latent neural PDE model with an efficient probabilistic framework and a novel encoder design for improved data efficiency and grid independence. The latent state dynamics are governed by a PDE model that combines the collocation method and the method of lines. We employ amortized variational inference for approximate posterior estimation and utilize a multiple shooting technique for enhanced training speed and stability. Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, overcoming limitations of previous approaches and effectively handling partially-observed data. The proposed model outperforms recent methods, showing its potential to advance data-driven PDE modeling and enabling robust, grid-independent modeling of complex partially-observed dynamic processes across various domains.
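
As background, a minimal method-of-lines sketch on the 1D heat equation: space is discretized and the resulting ODE system is integrated in time. The paper combines this idea with collocation in a learned, space-time continuous latent space; none of the specifics below come from the paper.

    import numpy as np
    from scipy.integrate import solve_ivp

    # Method of lines on u_t = u_xx with Dirichlet boundaries:
    # discretize space, then integrate the ODE system in time.
    nx = 64
    x = np.linspace(0.0, 1.0, nx)
    dx = x[1] - x[0]
    u0 = np.sin(np.pi * x)

    def rhs(t, u):
        d2u = np.zeros_like(u)
        d2u[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
        return d2u

    sol = solve_ivp(rhs, (0.0, 0.1), u0, t_eval=np.linspace(0.0, 0.1, 11))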

Curve Your Enthusiasm: Concurvity Regularization in Differentiable Generalized Additive Models
Julien Niklas Siems Konstantin Ditschuneit Winfried Ripken Alma Lindborg Maximilian Schambach Johannes Otterbach Martin Genzel



Research question: Addressing concurvity in Generalized Additive Models (GAMs), i.e., (possibly nonlinear) dependencies between features that can undermine interpretability.
Motivation: Although GAMs are popular for their interpretability, their susceptibility to concurvity has received little attention.
Method: A conceptually simple yet effective regularizer that penalizes pairwise correlations of the nonlinearly transformed feature variables; it applies to any differentiable additive model, such as Neural Additive Models or NeuralProphet.
Results: Experiments show that concurvity in GAMs can be reduced without significantly compromising prediction quality, improving interpretability and reducing the variance of feature importances.

Generalized Additive Models (GAMs) have recently experienced a resurgence in popularity due to their interpretability, which arises from expressing the target value as a sum of non-linear transformations of the features. Despite the current enthusiasm for GAMs, their susceptibility to concurvity — i.e., (possibly non-linear) dependencies between the features — has hitherto been largely overlooked. Here, we demonstrate how concurvity can severely impair the interpretability of GAMs and propose a remedy: a conceptually simple, yet effective regularizer which penalizes pairwise correlations of the non-linearly transformed feature variables. This procedure is applicable to any differentiable additive model, such as Neural Additive Models or NeuralProphet, and enhances interpretability by eliminating ambiguities due to self-canceling feature contributions. We validate the effectiveness of our regularizer in experiments on synthetic as well as real-world datasets for time-series and tabular data. Our experiments show that concurvity in GAMs can be reduced without significantly compromising prediction quality, improving interpretability and reducing variance in the feature importances.
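
A minimal sketch of such a regularizer, assuming a PyTorch model whose per-feature shape functions expose their outputs; the variable names and the weighting constant lam are placeholders, not the paper's API.

    import torch

    def concurvity_penalty(feature_outputs):
        """Mean absolute pairwise Pearson correlation of the per-feature
        shape-function outputs f_1(x_1), ..., f_d(x_d), each of shape (batch,)."""
        F = torch.stack(feature_outputs, dim=1)           # (batch, d)
        F = F - F.mean(dim=0, keepdim=True)
        F = F / (F.norm(dim=0, keepdim=True) + 1e-8)
        corr = F.T @ F                                    # (d, d) correlation matrix
        off_diag = corr - torch.diag(torch.diag(corr))
        d = corr.shape[0]
        return off_diag.abs().sum() / (d * (d - 1))

    # training loss for a differentiable GAM, with lam trading off
    # interpretability against fit (all names here are placeholders):
    # loss = mse_loss(y_hat, y) + lam * concurvity_penalty([f1, f2, f3])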

Geometric Neural Diffusion Processes
Emile Mathieu Vincent Dutordoir Michael John Hutchinson Valentin De Bortoli Yee Whye Teh Richard E Turner



Research question: Diffusion models have proven flexible and effective for generative modelling, but they struggle with the symmetries and non-Euclidean data common in the natural sciences.
Motivation: To address this, the diffusion-model framework on infinite-dimensional spaces is extended to incorporate a series of geometric priors.
Method: (a) Construct a noising process that admits, as its limiting distribution, a geometric Gaussian process transforming under the symmetry group of interest, and (b) approximate the score with a neural network that is equivariant with respect to this group.
Results: The resulting generative functional model provably admits the same symmetry; using a novel Langevin-based conditional sampler, the model demonstrates scalability and capacity in fitting complex scalar and vector fields, with Euclidean and spherical codomain, on synthetic and real-world weather data.

Denoising diffusion models have proven to be a flexible and effective paradigm for generative modelling. Their recent extension to infinite dimensional Euclidean spaces has allowed for the modelling of stochastic processes. However, many problems in the natural sciences incorporate symmetries and involve data living in non-Euclidean spaces. In this work, we extend the framework of diffusion models to incorporate a series of geometric priors in infinite-dimension modelling. We do so by a) constructing a noising process which admits, as limiting distribution, a geometric Gaussian process that transforms under the symmetry group of interest, and b) approximating the score with a neural network that is equivariant w.r.t. this group. We show that with these conditions, the generative functional model admits the same symmetry. We demonstrate scalability and capacity of the model, using a novel Langevin-based conditional sampler, to fit complex scalar and vector fields, with Euclidean and spherical codomain, on synthetic and real-world weather data.

Reliable Off-Policy Learning for Dosage Combinations
Jonas Schweisthal Dennis Frauen Valentyn Melnychuk Stefan Feuerriegel



Research question: How to make optimal decisions on dosage combinations, i.e., multiple continuous treatments, in personalized medicine.
Motivation: Existing work models the effects of multiple treatments independently, while estimating the joint effect has received little attention and poses non-trivial challenges.
Method: A novel method for reliable off-policy learning of dosage combinations, proceeding in three steps: (1) a tailored neural network estimates the individualized dose-response function while accounting for the joint effect of multiple dependent dosages; (2) the generalized propensity score is estimated with conditional normalizing flows to detect regions of limited overlap in the shared covariate-treatment space; (3) a gradient-based learning algorithm finds the optimal individualized dosage combination, ensuring reliable estimation of the policy value by avoiding regions of limited overlap.
Results: An extensive evaluation demonstrates the method's effectiveness; to the authors' knowledge, this is the first reliable off-policy learning method for optimal dosage combinations.

Decision-making in personalized medicine such as cancer therapy or critical care must often make choices for dosage combinations, i.e., multiple continuous treatments. Existing work for this task has modeled the effect of multiple treatments independently, while estimating the joint effect has received little attention but comes with non-trivial challenges. In this paper, we propose a novel method for reliable off-policy learning for dosage combinations. Our method proceeds along three steps: (1) We develop a tailored neural network that estimates the individualized dose-response function while accounting for the joint effect of multiple dependent dosages. (2) We estimate the generalized propensity score using conditional normalizing flows in order to detect regions with limited overlap in the shared covariate-treatment space. (3) We present a gradient-based learning algorithm to find the optimal, individualized dosage combinations. Here, we ensure reliable estimation of the policy value by avoiding regions with limited overlap. We finally perform an extensive evaluation of our method to show its effectiveness. To the best of our knowledge, ours is the first work to provide a method for reliable off-policy learning for optimal dosage combinations.

Bounce: Reliable High-Dimensional Bayesian Optimization for Combinatorial and Mixed Spaces
Leonard Papenmeier Luigi Nardi Matthias Poloczek



Research question: How to optimize high-dimensional black-box functions, particularly over mixed and combinatorial input spaces.
Motivation: Existing Bayesian optimization methods are unreliable on such problems: their performance degrades substantially when the unknown optima of the function lack a certain structure.
Method: Bounce, a method that maps the various variable types into nested embeddings of increasing dimensionality.
Results: Experiments show that Bounce reliably achieves, and often improves upon, state-of-the-art performance on a variety of high-dimensional problems.

Impactful applications such as materials discovery, hardware design, neural architecture search, or portfolio optimization require optimizing high-dimensional black-box functions with mixed and combinatorial input spaces. While Bayesian optimization has recently made significant progress in solving such problems, an in-depth analysis reveals that the current state-of-the-art methods are not reliable. Their performances degrade substantially when the unknown optima of the function do not have a certain structure. To fill the need for a reliable algorithm for combinatorial and mixed spaces, this paper proposes Bounce that relies on a novel map of various variable types into nested embeddings of increasing dimensionality. Comprehensive experiments show that Bounce reliably achieves and often even improves upon state-of-the-art performance on a variety of high-dimensional problems.

Add and Thin: Diffusion for Temporal Point Processes
David Lüdke Marin Biloš Oleksandr Shchur Marten Lienen Stephan Günnemann



Research question: Long-term forecasting of continuous-time event data within the temporal point process (TPP) framework.
Motivation: Autoregressive neural TPP models capture event sequences well in a one-step-ahead fashion, but their sequential nature causes errors to accumulate, limiting long-term forecasting.
Method: ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences and naturally handles data with both discrete and continuous components.
Results: On synthetic and real-world datasets, the model matches state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.

Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.

MMGP: a Mesh Morphing Gaussian Process-based machine learning method for regression of physical problems under nonparametrized geometrical variability
Fabien Casenave Brian Staber Xavier Roynard



Research question: Learning simulations of physical phenomena in industrial design, where geometrical variability is of prime interest.
Motivation: Classical regression techniques work for parameterized geometries, but practical scenarios often lack a shape parametrization at inference time, leaving only mesh discretizations as available data; learning simulations from such mesh-based representations poses significant challenges.
Method: A machine learning method that does not rely on graph neural networks: complex geometrical shapes and variations with fixed topology are handled by well-known mesh morphing onto a common support, combined with classical dimensionality reduction techniques and Gaussian processes.
Results: The method easily handles large meshes without explicit shape parameterization and provides the predictive uncertainties essential for informed decision-making; in the numerical experiments considered, it is competitive with existing graph neural networks in training efficiency and prediction accuracy.

When learning simulations for modeling physical phenomena in industrial designs, geometrical variabilities are of prime interest. While classical regression techniques prove effective for parameterized geometries, practical scenarios often involve the absence of shape parametrization during the inference stage, leaving us with only mesh discretizations as available data. Learning simulations from such mesh-based representations poses significant challenges, with recent advances relying heavily on deep graph neural networks to overcome the limitations of conventional machine learning approaches. Despite their promising results, graph neural networks exhibit certain drawbacks, including their dependency on extensive datasets and limitations in providing built-in predictive uncertainties or handling large meshes. In this work, we propose a machine learning method that does not rely on graph neural networks. Complex geometrical shapes and variations with fixed topology are dealt with using well-known mesh morphing onto a common support, combined with classical dimensionality reduction techniques and Gaussian processes. The proposed methodology can easily deal with large meshes without the need for explicit shape parameterization and provides crucial predictive uncertainties, which are essential for informed decision-making. In the considered numerical experiments, the proposed method is competitive with respect to existing graph neural networks, regarding training efficiency and accuracy of the predictions.

Bayesian nonparametric (non-)renewal processes for analyzing neural spike train variability
David Liu Máté Lengyel



Research question: How to capture and quantify the instantaneous variability of neural spiking activity and its complex dependence on covariates such as sensory input or behavior.
Motivation: Current point-process approaches only allow the instantaneous mean of spiking activity, not its instantaneous variability, to depend on covariates; to resolve this, a scalable Bayesian approach is proposed that generalizes modulated renewal processes using sparse variational Gaussian processes.
Method: Pathwise conditioning is used to compute nonparametric priors over conditional interspike-interval distributions, and automatic relevance determination detects lagging interspike-interval dependencies beyond renewal order.
Results: After systematic validation on synthetic data, the method is applied to two foundational animal-navigation datasets: head direction cells in freely moving mice and hippocampal place cells in rats running along a linear track. The model shows competitive or better predictive power than state-of-the-art baselines and outperforms them in capturing interspike-interval statistics, confirming the importance of modeling covariate-dependent spiking variability.

Neural spiking activity is generally variable, non-stationary, and exhibits complex dependencies on covariates, such as sensory input or behavior. These dependencies have been proposed to be signatures of specific computations, and so characterizing them with quantitative rigor is critical for understanding neural computations. Approaches based on point processes provide a principled statistical framework for modeling neural spiking activity. However, currently, they only allow the instantaneous mean, but not the instantaneous variability, of responses to depend on covariates. To resolve this limitation, we propose a scalable Bayesian approach generalizing modulated renewal processes using sparse variational Gaussian processes. We leverage pathwise conditioning for computing nonparametric priors over conditional interspike interval distributions and rely on automatic relevance determination to detect lagging interspike interval dependencies beyond renewal order. After systematically validating our method on synthetic data, we apply it to two foundational datasets of animal navigation: head direction cells in freely moving mice and hippocampal place cells in rats running along a linear track. Our model exhibits competitive or better predictive power compared to state-of-the-art baselines, and outperforms them in terms of capturing interspike interval statistics. These results confirm the importance of modeling covariate-dependent spiking variability, and further analyses of our fitted models reveal rich patterns of variability modulation beyond the temporal resolution of flexible count-based approaches.

L-C2ST: Local Diagnostics for Posterior Approximations in Simulation-Based Inference
Julia Linhart Alexandre Gramfort Pedro L. C. Rodrigues



Research question: How to assess whether approximations of complex, high-dimensional posterior distributions in simulation-based inference (SBI) can be trusted.
Motivation: Most methods evaluate the posterior estimator only in expectation over the observation space, which limits interpretability and is insufficient to identify for which observations the approximation can be trusted or should be improved.
Method: $\ell$-C2ST, a new method that enables local evaluation of the posterior estimator at any given observation; it offers theoretically grounded and easy-to-interpret diagnostics and, unlike C2ST, requires no access to samples from the true posterior.
Results: On standard SBI benchmarks, $\ell$-C2ST provides results comparable to C2ST and outperforms alternative local approaches such as coverage tests based on highest predictive density (HPD); the importance of local evaluation and the interpretability benefits of $\ell$-C2ST are further highlighted on a challenging application from computational neuroscience.

Many recent works in simulation-based inference (SBI) rely on deep generative models to approximate complex, high-dimensional posterior distributions. However, evaluating whether or not these approximations can be trusted remains a challenge. Most approaches evaluate the posterior estimator only in expectation over the observation space. This limits their interpretability and is not sufficient to identify for which observations the approximation can be trusted or should be improved. Building upon the well-known classifier two-sample test (C2ST), we introduce $\ell$-C2ST, a new method that allows for a local evaluation of the posterior estimator at any given observation. It offers theoretically grounded and easy to interpret -- e.g. graphical -- diagnostics, and unlike C2ST, does not require access to samples from the true posterior. In the case of normalizing flow-based posterior estimators, $\ell$-C2ST can be specialized to offer better statistical power, while being computationally more efficient. On standard SBI benchmarks, $\ell$-C2ST provides comparable results to C2ST and outperforms alternative local approaches such as coverage tests based on highest predictive density (HPD). We further highlight the importance of local evaluation and the benefit of interpretability of $\ell$-C2ST on a challenging application from computational neuroscience.
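
For context, a minimal sketch of the global C2ST the method builds on: a classifier is trained to distinguish the two sample sets, and accuracy near 0.5 indicates indistinguishability. The classifier choice is illustrative; $\ell$-C2ST additionally conditions on the observation to localize the diagnostic.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def c2st(samples_p, samples_q, seed=0):
        """Classifier two-sample test: cross-validated accuracy near 0.5
        means the classifier cannot tell the two sample sets apart."""
        X = np.vstack([samples_p, samples_q])
        y = np.r_[np.zeros(len(samples_p)), np.ones(len(samples_q))]
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

    rng = np.random.default_rng(0)
    acc = c2st(rng.normal(0.0, 1.0, (500, 2)), rng.normal(0.3, 1.0, (500, 2)))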

Continuous Parametric Optical Flow
Jianqin Luo Zhexiong Wan yuxin mao Bo Li Yuchao Dai



Research question: A continuous parametric optical flow model representing dense and continuous motion over arbitrary time intervals.
Motivation: Existing discrete-time representations (i.e., flow between consecutive frames) cannot adequately capture continuous dense motion.
Method: B-splines fit point trajectories over a limited number of frames, and an encoder with a neural ordinary differential equation (ODE) represents the features associated with specific times.
Results: Benefiting from the combination of explicit parametric modeling and implicit feature optimization, the model focuses on motion continuity and outperforms flow-based and point-tracking approaches in fitting long-term and variable sequences.

In this paper, we present continuous parametric optical flow, a parametric representation of dense and continuous motion over arbitrary time intervals. In contrast to existing discrete-time representations (i.e., flow in between consecutive frames), this new representation transforms the frame-to-frame pixel correspondences to dense continuous flow. In particular, we present a temporal-parametric model that employs B-splines to fit point trajectories using a limited number of frames. To further improve the stability and robustness of the trajectories, we also add an encoder with a neural ordinary differential equation (ODE) to represent features associated with specific times. We also contribute a synthetic dataset and introduce two evaluation perspectives to measure the accuracy and robustness of continuous flow estimation. Benefiting from the combination of explicit parametric modeling and implicit feature optimization, our model focuses on motion continuity and outperforms the flow-based and point-tracking approaches for fitting long-term and variable sequences.

The Rank-Reduced Kalman Filter: Approximate Dynamical-Low-Rank Filtering In High Dimensions
Jonathan Schmidt Philipp Hennig Jörg Nick Filip Tronarp



Research question: Inference and simulation in high-dimensional dynamical systems remain computationally challenging.
Motivation: Some form of dimensionality reduction is required to make the problem tractable in general.
Method: A novel approximate Gaussian filtering and smoothing method that propagates low-rank approximations of the covariance matrices. The Lyapunov equations of the prediction step are projected onto a manifold of low-rank matrices and solved by a recently developed, numerically stable dynamical low-rank integrator; the update step is made tractable by noting that the covariance update only transforms the column space of the covariance matrix, which is low-rank by construction. Unlike ensemble-based approaches, the low-rank approximations are deterministic rather than stochastic.
Results: The method reduces computational complexity from cubic (for the Kalman filter) to, at worst, quadratic in the state-space size, and achieves linear complexity when the state-space model satisfies certain criteria. In classical data-assimilation and spatio-temporal regression experiments, it consistently outperforms ensemble-based methods in the error of the mean and covariance relative to the exact Kalman filter, at no additional asymptotic computational cost.

Inference and simulation in the context of high-dimensional dynamical systems remain computationally challenging problems. Some form of dimensionality reduction is required to make the problem tractable in general. In this paper, we propose a novel approximate Gaussian filtering and smoothing method which propagates low-rank approximations of the covariance matrices. This is accomplished by projecting the Lyapunov equations associated with the prediction step to a manifold of low-rank matrices, which are then solved by a recently developed, numerically stable, dynamical low-rank integrator. Meanwhile, the update steps are made tractable by noting that the covariance update only transforms the column space of the covariance matrix, which is low-rank by construction. The algorithm differentiates itself from existing ensemble-based approaches in that the low-rank approximations of the covariance matrices are deterministic, rather than stochastic. Crucially, this enables the method to reproduce the exact Kalman filter as the low-rank dimension approaches the true dimensionality of the problem. Our method reduces computational complexity from cubic (for the Kalman filter) to quadratic in the state-space size in the worst-case, and can achieve linear complexity if the state-space model satisfies certain criteria. Through a set of experiments in classical data-assimilation and spatio-temporal regression, we show that the proposed method consistently outperforms the ensemble-based methods in terms of error in the mean and covariance with respect to the exact Kalman filter. This comes at no additional cost in terms of asymptotic computational complexity.
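
For reference, one exact Kalman predict/update step; the cubic-cost covariance operations visible here are what the proposed filter replaces with deterministic low-rank factors. A generic sketch, not the paper's implementation.

    import numpy as np

    def kalman_step(m, P, A, Q, H, R, y):
        """One exact Kalman predict/update. The covariance products below
        cost O(d^3) in the state dimension d."""
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q                      # prediction step
        S = H @ P_pred @ H.T + R                      # innovation covariance
        K = np.linalg.solve(S, H @ P_pred.T).T        # Kalman gain P H^T S^-1
        m_new = m_pred + K @ (y - H @ m_pred)
        P_new = (np.eye(len(m)) - K @ H) @ P_pred     # update step
        return m_new, P_new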

Approximate inference of marginals using the IBIA framework
Shivani Bathla Vinita Vasudevan



Research question: Exact inference of marginals in probabilistic graphical models (PGMs) is intractable, so approximate methods are required.
Motivation: Existing variational techniques perform iterative message passing in loopy graphs, which converges slowly on many benchmarks.
Method: A new marginal-inference algorithm based on the incremental build-infer-approximate (IBIA) paradigm: the PGM is converted into a sequence of linked clique tree forests (SLCTF) with bounded clique sizes, and a heuristic belief-update algorithm infers the marginals.
Results: For the special case of Bayesian networks, if the incremental build step in IBIA uses a topological order of the variables, then (a) the prior marginals are consistent across all CTFs in the SLCTF and (b) the posterior marginals are consistent once all evidence variables have been added to the SLCTF. The belief propagation step is non-iterative, and the accuracy-complexity trade-off is controlled by user-defined clique size bounds. On several benchmark sets from recent UAI competitions, the method gives accuracy at least comparable to existing variational and sampling methods, with smaller runtimes.

Exact inference of marginals in probabilistic graphical models (PGM) is known to be intractable, necessitating the use of approximate methods. Most of the existing variational techniques perform iterative message passing in loopy graphs which is slow to converge for many benchmarks. In this paper, we propose a new algorithm for marginal inference that is based on the incremental build-infer-approximate (IBIA) paradigm. Our algorithm converts the PGM into a sequence of linked clique tree forests (SLCTF) with bounded clique sizes, and then uses a heuristic belief update algorithm to infer the marginals. For the special case of Bayesian networks, we show that if the incremental build step in IBIA uses the topological order of variables then (a) the prior marginals are consistent in all CTFs in the SLCTF and (b) the posterior marginals are consistent once all evidence variables are added to the SLCTF. In our approach, the belief propagation step is non-iterative and the accuracy-complexity trade-off is controlled using user-defined clique size bounds. Results for several benchmark sets from recent UAI competitions show that our method gives accuracy that is better than or comparable to existing variational and sampling-based methods, with smaller runtimes.

Unbiased constrained sampling with Self-Concordant Barrier Hamiltonian Monte Carlo
Maxence Noble Valentin De Bortoli Alain Durmus



Research question: A sampling method based on Barrier Hamiltonian Monte Carlo (BHMC), which aims to sample from a Gibbs distribution on a manifold endowed with a Hessian metric.
Motivation: Existing generalizations of Hamiltonian Monte Carlo (HMC) to Riemannian manifolds suffer from unavoidable bias.
Method: A new filter step, called the "involution checking step", addresses this problem; it is implemented in two versions, continuous BHMC (c-bHMC) and numerical BHMC (n-BHMC).
Results: Both new algorithms generate Markov chains that are reversible with respect to $\pi$ and, unlike previous implementations, are free of bias. This conclusion is supported by numerical experiments on target distributions defined over polytopes.

In this paper, we propose Barrier Hamiltonian Monte Carlo (BHMC), a version of the HMC algorithm which aims at sampling from a Gibbs distribution $\pi$ on a manifold $\mathsf{M}$, endowed with a Hessian metric $\mathfrak{g}$ derived from a self-concordant barrier. Our method relies on Hamiltonian dynamics which comprises $\mathfrak{g}$. Therefore, it incorporates the constraints defining $\mathsf{M}$ and is able to exploit its underlying geometry. However, the corresponding Hamiltonian dynamics is defined via non-separable Ordinary Differential Equations (ODEs) in contrast to the Euclidean case. This implies unavoidable bias in existing generalizations of HMC to Riemannian manifolds. In this paper, we propose a new filter step, called ``involution checking step'', to address this problem. This step is implemented in two versions of BHMC, coined continuous BHMC (c-bHMC) and numerical BHMC (n-BHMC) respectively. Our main results establish that these two new algorithms generate reversible Markov chains with respect to $\pi$ and do not suffer from any bias in comparison to previous implementations. Our conclusions are supported by numerical experiments where we consider target distributions defined on polytopes.
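
A minimal sketch of an involution check in the plain Euclidean HMC setting: integrate forward, flip the momentum, integrate again, and accept only if the dynamics return to the starting point. The explicit leapfrog used here is exactly reversible up to roundoff, so the check nearly always passes; the point of the paper's step is that implicit integrators for non-separable Riemannian dynamics can fail it, and such proposals must be filtered out.

    import numpy as np

    def leapfrog(theta, rho, grad_U, step, n_steps):
        """Standard (Euclidean) leapfrog integrator; the paper's setting
        instead requires an implicit integrator for non-separable dynamics."""
        rho = rho - 0.5 * step * grad_U(theta)
        for _ in range(n_steps - 1):
            theta = theta + step * rho
            rho = rho - step * grad_U(theta)
        theta = theta + step * rho
        rho = rho - 0.5 * step * grad_U(theta)
        return theta, rho

    def involution_check(theta, rho, grad_U, step, n_steps, tol=1e-8):
        """Keep a proposal only if forward integration, momentum flip, and
        integration again returns (numerically) to the starting point."""
        theta1, rho1 = leapfrog(theta, rho, grad_U, step, n_steps)
        theta2, rho2 = leapfrog(theta1, -rho1, grad_U, step, n_steps)
        ok = np.allclose(theta2, theta, atol=tol) and np.allclose(-rho2, rho, atol=tol)
        return (theta1, rho1), ok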

Generative Modeling through the Semi-dual Formulation of Unbalanced Optimal Transport
Jaemoo Choi Jaewoong Choi Myungjoo Kang



Research question: Applying optimal transport (OT) to generative modeling while addressing its sensitivity to outliers and the optimization challenges it faces during training.
Motivation: OT is widely used for generative modeling tasks, but it is susceptible to outliers and difficult to optimize during training.
Method: A novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution matching, improving robustness to outliers, training stability, and convergence speed.
Results: Experiments show the model outperforms existing OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 5.80 on CelebA-HQ-256.

Optimal Transport (OT) problem investigates a transport map that bridges two distributions while minimizing a given cost function. In this regard, OT between tractable prior distribution and data has been utilized for generative modeling tasks. However, OT-based methods are susceptible to outliers and face optimization challenges during training. In this paper, we propose a novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution matching. This approach provides better robustness against outliers, stability during training, and faster convergence. We validate these properties empirically through experiments. Moreover, we study the theoretical upper-bound of divergence between distributions in UOT. Our model outperforms existing OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 5.80 on CelebA-HQ-256. The code is available at \url{https://github.com/Jae-Moo/UOTM}.

Integration-free Training for Spatio-temporal Multimodal Covariate Deep Kernel Point Processes
YIXUAN ZHANG Quyu Kong Feng Zhou



Research question: A new deep spatio-temporal point process model, Deep Kernel Mixture Point Processes (DKMPP), that incorporates multimodal covariate information.
Motivation: To overcome the limitations of previous models in capturing complex relationships between events and covariate data, a more flexible deep kernel is used to improve the model's expressiveness.
Method: An integration-free training method based on score matching sidesteps the intractable likelihood caused by the non-integrable deep kernel, with efficiency further improved by a scalable denoising score matching method.
Results: Experiments show that DKMPP and its score-based estimators outperform baseline models, demonstrating the advantages of incorporating covariate information, using a deep kernel, and employing score-based estimators.

In this study, we propose a novel deep spatio-temporal point process model, Deep Kernel Mixture Point Processes (DKMPP), that incorporates multimodal covariate information. DKMPP is an enhanced version of Deep Mixture Point Processes (DMPP), which uses a more flexible deep kernel to model complex relationships between events and covariate data, improving the model's expressiveness. To address the intractable training procedure of DKMPP due to the non-integrable deep kernel, we utilize an integration-free method based on score matching, and further improve efficiency by adopting a scalable denoising score matching method. Our experiments demonstrate that DKMPP and its corresponding score-based estimators outperform baseline models, showcasing the advantages of incorporating covariate information, utilizing a deep kernel, and employing score-based estimators.

Joint Bayesian Inference of Graphical Structure and Parameters with a Single Generative Flow Network
Tristan Deleu Mizu Nishikawa-Toomey Jithendaraa Subramanian Nikolay Malkin Laurent Charlin Yoshua Bengio



Research question: How to approximate the joint posterior over both the structure and the parameters of a Bayesian network more accurately.
Motivation: Existing methods approximate only the posterior over structures, neglecting the parameters.
Method: A method based on Generative Flow Networks (GFlowNets) that estimates structure and parameters jointly with a single GFlowNet and a two-phase sampling policy: the DAG is generated sequentially one edge at a time, and the corresponding parameters are sampled once the full structure is known.
Results: Experiments show the method accurately approximates the joint posterior and compares favorably against existing approaches on both simulated and real data.

Generative Flow Networks (GFlowNets), a class of generative models over discrete and structured sample spaces, have been previously applied to the problem of inferring the marginal posterior distribution over the directed acyclic graph (DAG) of a Bayesian Network, given a dataset of observations. Based on recent advances extending this framework to non-discrete sample spaces, we propose in this paper to approximate the joint posterior over not only the structure of a Bayesian Network, but also the parameters of its conditional probability distributions. We use a single GFlowNet whose sampling policy follows a two-phase process: the DAG is first generated sequentially one edge at a time, and then the corresponding parameters are picked once the full structure is known. Since the parameters are included in the posterior distribution, this leaves more flexibility for the local probability models of the Bayesian Network, making our approach applicable even to non-linear models parametrized by neural networks. We show that our method, called JSP-GFN, offers an accurate approximation of the joint posterior, while comparing favorably against existing methods on both simulated and real data.

Conformal Prediction for Time Series with Modern Hopfield Networks
Andreas Auer Martin Gauch Daniel Klotz Sepp Hochreiter



Research question: How to apply conformal prediction to time series, whose autocorrelation structure violates the basic assumptions required by conformal prediction.
Motivation: Existing conformal prediction methods are difficult to apply to time series for exactly this reason.
Method: HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structure but exploits it.
Results: Experiments show the approach outperforms state-of-the-art conformal prediction methods on time series exhibiting temporal dependencies.

To quantify uncertainty, conformal prediction methods are gaining continuously more interest and have already been successfully applied to various domains. However, they are difficult to apply to time series as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. We propose HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structures but leverages them. We show that our approach is theoretically well justified for time series where temporal dependencies are present. In experiments, we demonstrate that our new approach outperforms state-of-the-art conformal prediction methods on multiple real-world time series datasets from four different domains.
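
For contrast, a minimal split conformal baseline whose validity rests on the exchangeability assumption that autocorrelated time series violate; HopCPT instead reweights errors from similar past regimes. All names below are illustrative.

    import numpy as np

    def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
        """Split conformal prediction: absolute residuals on a calibration
        set (y_cal - y_pred_cal) give a finite-sample (1 - alpha) interval
        under exchangeability -- exactly what autocorrelation breaks."""
        n = len(residuals_cal)
        q_level = np.ceil((n + 1) * (1 - alpha)) / n
        q = np.quantile(np.abs(residuals_cal), min(q_level, 1.0))
        return y_pred_test - q, y_pred_test + q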

Particle-based Variational Inference with Generalized Wasserstein Gradient Flow
Ziheng Cheng Shiyue Zhang Longlin Yu Cheng Zhang



Research question: Designing kernels for existing particle-based variational inference (ParVI) methods is often non-trivial and restricts their flexibility.
Motivation: Recent work shows that functional gradient flow approximations with quadratic-form regularization terms can improve performance.
Method: A ParVI framework based on a generalized Wasserstein gradient flow of the KL divergence (GWG), which can be viewed as a functional gradient method with a broader class of regularizers induced by convex functions.
Results: GWG enjoys strong convergence guarantees, and an adaptive version automatically chooses the Wasserstein metric to accelerate convergence. Experiments on simulated and real data problems demonstrate the effectiveness and efficiency of the framework.

Particle-based variational inference methods (ParVIs) such as Stein variational gradient descent (SVGD) update the particles based on the kernelized Wasserstein gradient flow for the Kullback-Leibler (KL) divergence. However, the design of kernels is often non-trivial and can be restrictive for the flexibility of the method. Recent works show that functional gradient flow approximations with quadratic form regularization terms can improve performance. In this paper, we propose a ParVI framework, called generalized Wasserstein gradient descent (GWG), based on a generalized Wasserstein gradient flow of the KL divergence, which can be viewed as a functional gradient method with a broader class of regularizers induced by convex functions. We show that GWG exhibits strong convergence guarantees. We also provide an adaptive version that automatically chooses Wasserstein metric to accelerate convergence. In experiments, we demonstrate the effectiveness and efficiency of the proposed framework on both simulated and real data problems.
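
For orientation, one update of Stein variational gradient descent (SVGD), the kernelized Wasserstein gradient flow of the KL divergence that GWG generalizes; the RBF kernel and fixed bandwidth are illustrative choices.

    import numpy as np

    def svgd_step(particles, grad_log_p, step=0.1, h=1.0):
        """One SVGD update:
        phi(x_i) = (1/n) sum_j [k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i)]."""
        diff = particles[:, None, :] - particles[None, :, :]   # (n, n, d)
        sq = (diff ** 2).sum(-1)
        K = np.exp(-sq / (2 * h))                              # RBF kernel matrix
        grad_K = -diff / h * K[:, :, None]                     # grad wrt first argument
        scores = np.apply_along_axis(grad_log_p, 1, particles)  # (n, d)
        phi = (K @ scores + grad_K.sum(axis=0)) / len(particles)
        return particles + step * phi

    # e.g. a standard normal target:
    # particles = svgd_step(particles, lambda x: -x)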

Robust covariance estimation with missing values and cell-wise contamination
gregoire pacreau Karim Lounici



Research question: Large datasets are often affected by cell-wise outliers in the form of missing or erroneous entries; how should these be handled?
Motivation: Discarding every sample containing an outlier may leave a dataset too small to estimate the covariance matrix accurately, while robust procedures designed for this problem require the covariance operator to be invertible and therefore perform poorly on high-dimensional data.
Method: An unbiased estimator of the covariance in the presence of missing values that requires no imputation step and still achieves near-minimax statistical accuracy in operator norm; it is advocated in combination with cell-wise outlier detection methods to tackle cell-wise contamination in high-dimensional, low-rank settings.
Results: An experimental study demonstrates the superiority of the approach over the state of the art in both low- and high-dimensional settings.

Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding any samples containing outliers may result in a dataset that is too small to accurately estimate the covariance matrix. Moreover, the robust procedures designed to address this problem require the invertibility of the covariance operator and thus are not effective on high-dimensional data. In this paper, we propose an unbiased estimator for the covariance in the presence of missing values that does not require any imputation step and still achieves near minimax statistical accuracy with the operator norm. We also advocate for its use in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional and low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conducted an experimental study which demonstrates the superiority of our approach over the state of the art both in low and high dimension settings.
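
A minimal sketch of one classical inverse-probability-weighted covariance correction, assuming entries are observed independently with known probability delta and zero-mean data; this illustrates the flavor of imputation-free unbiased estimation, not necessarily the paper's exact estimator.

    import numpy as np

    def cov_with_missing(X_obs, mask, delta):
        """Unbiased covariance estimate for zero-mean data with entries
        observed independently with probability delta (one classical IPW
        correction; not necessarily the paper's estimator)."""
        X0 = np.where(mask, X_obs, 0.0)
        S = X0.T @ X0 / len(X0)
        S_corr = S / delta**2                        # off-diag pairs observed w.p. delta^2
        np.fill_diagonal(S_corr, np.diag(S) / delta)  # diagonal observed w.p. delta
        return S_corr

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(np.zeros(3), [[1, .5, 0], [.5, 1, .3], [0, .3, 1]], 20000)
    mask = rng.random(X.shape) < 0.7
    Sigma_hat = cov_with_missing(X, mask, delta=0.7)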

On the Consistency of Maximum Likelihood Estimation of Probabilistic Principal Component Analysis
Arghya Datta Sayak Chakrabarty



Research question: Maximum likelihood estimation in the probabilistic principal component analysis (PPCA) model lacks theoretical guarantees.
Motivation: Despite PPCA's wide use across science, engineering, and finance, hardly any theoretical guarantees justify the soundness of its maximum likelihood (ML) solution.
Method: A novel approach using quotient spaces shows that the maximum likelihood solution is consistent in an appropriate quotient Euclidean space.
Results: Strong consistency of the ML estimate, and hence strong covariance estimation, is established for the PPCA model, and the consistency results extend to a more general class of estimators beyond the MLE.

Probabilistic principal component analysis (PPCA) is currently one of the most used statistical tools to reduce the ambient dimension of the data. From multidimensional scaling to the imputation of missing data, PPCA has a broad spectrum of applications ranging from science and engineering to quantitative finance. Despite this wide applicability in various fields, hardly any theoretical guarantees exist to justify the soundness of the maximum likelihood (ML) solution for this model. In fact, it is well known that the maximum likelihood estimation (MLE) can only recover the true model parameters up to a rotation. The main obstruction is posed by the inherent non-identifiability of the PPCA model resulting from the rotational symmetry of the parameterization. To resolve this ambiguity, we propose a novel approach using quotient topological spaces and in particular, we show that the maximum likelihood solution is consistent in an appropriate quotient Euclidean space. Furthermore, our consistency results encompass a more general class of estimators beyond the MLE. Strong consistency of the ML estimate and consequently strong covariance estimation of the PPCA model have also been established under a compactness assumption.
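
For concreteness, the closed-form PPCA MLE of Tipping and Bishop, which is identified only up to a rotation R; the assertion at the end exhibits the rotational symmetry that the quotient-space analysis factors out. Data and dimensions are illustrative.

    import numpy as np

    def ppca_mle(X, q):
        """Closed-form PPCA MLE: W = U_q (L_q - s2 I)^(1/2) R, with the
        rotation R unidentified (taken here as the identity)."""
        S = np.cov(X, rowvar=False, bias=True)
        lam, U = np.linalg.eigh(S)
        lam, U = lam[::-1], U[:, ::-1]       # descending eigenvalues
        s2 = lam[q:].mean()                  # noise variance from trailing spectrum
        W = U[:, :q] * np.sqrt(lam[:q] - s2)
        return W, s2

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4)) @ rng.normal(size=(4, 6)) + 0.1 * rng.normal(size=(5000, 6))
    W, s2 = ppca_mle(X, q=2)
    R = np.array([[0.0, -1.0], [1.0, 0.0]])  # any rotation gives the same model fit:
    C1 = W @ W.T + s2 * np.eye(6)
    C2 = (W @ R) @ (W @ R).T + s2 * np.eye(6)
    assert np.allclose(C1, C2)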

Flat Seeking Bayesian Neural Networks
Van-Anh Nguyen Long Tung Vuong Hoang Phan Thanh-Toan Do Dinh Phung Trung Le



Research question: Developing theories, the Bayesian setting, and a variational inference approach for a sharpness-aware posterior in Bayesian Neural Networks (BNNs), which provide a probabilistic interpretation of deep models via a prior over parameters and posterior inference from observed data.
Motivation: Existing posterior inferences do not account for the sharpness/flatness of models, so sampled models may be overly sharp, whereas deep models with lower sharpness typically generalize better.
Method: Impose a prior distribution over the model parameters and infer a sharpness-aware posterior from the observed data, developing the associated theory, Bayesian setting, and variational inference approach.
Results: Experiments combining the sharpness-aware posterior with state-of-the-art BNNs show that the resulting flat-seeking models outperform their baselines on all metrics of interest.

Bayesian Neural Networks (BNNs) provide a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferring a posterior distribution based on observed data. The model sampled from the posterior distribution can be used for providing ensemble predictions and quantifying prediction uncertainty. It is well-known that deep learning models with lower sharpness have better generalization ability. However, existing posterior inferences are not aware of sharpness/flatness in terms of formulation, possibly leading to high sharpness for the models sampled from them. In this paper, we develop theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior, and the optimal approximate posterior estimating this sharpness-aware posterior, have better flatness, hence possibly possessing higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.

Langevin Quasi-Monte Carlo
Sifan Liu



Research question: Using completely uniformly distributed (CUD) sequences with low-discrepancy properties to generate the Gaussian perturbations in Langevin Monte Carlo (LMC), thereby reducing its estimation error.
Motivation: LMC and its stochastic gradient versions are powerful algorithms for sampling from complex high-dimensional distributions. In ordinary Monte Carlo, replacing independent random samples with quasi-random samples such as low-discrepancy sequences substantially reduces estimation error, suggesting the same should hold for LMC.
Method: Generate the Gaussian perturbations from completely uniformly distributed sequences with a certain low-discrepancy property; under smoothness and convexity conditions, LMC driven by a low-discrepancy CUD sequence is proven to achieve smaller error than standard LMC.
Results: The theoretical analysis is supported by compelling numerical experiments demonstrating the effectiveness of the approach.

Langevin Monte Carlo (LMC) and its stochastic gradient versions are powerful algorithms for sampling from complex high-dimensional distributions. To sample from a distribution with density $\pi(\theta)\propto \exp(-U(\theta)) $, LMC iteratively generates the next sample by taking a step in the gradient direction $\nabla U$ with added Gaussian perturbations. Expectations w.r.t. the target distribution $\pi$ are estimated by averaging over LMC samples. In ordinary Monte Carlo, it is well known that the estimation error can be substantially reduced by replacing independent random samples by quasi-random samples like low-discrepancy sequences. In this work, we show that the estimation error of LMC can also be reduced by using quasi-random samples. Specifically, we propose to use completely uniformly distributed (CUD) sequences with certain low-discrepancy property to generate the Gaussian perturbations. Under smoothness and convexity conditions, we prove that LMC with a low-discrepancy CUD sequence achieves smaller error than standard LMC. The theoretical analysis is supported by compelling numerical experiments, which demonstrate the effectiveness of our approach.
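
A minimal sketch of the LMC recursion with the driving Gaussian noise swapped between i.i.d. draws and quasi-random draws (a scrambled Sobol' sequence pushed through the Gaussian inverse CDF); note the paper uses CUD sequences, which differ from Sobol' points, so this substitution is purely illustrative.

    import numpy as np
    from scipy.stats import norm, qmc

    def lmc(grad_U, theta0, h, gauss):
        """LMC: theta <- theta - h * grad_U(theta) + sqrt(2h) * xi."""
        theta, out = np.array(theta0, float), []
        for xi in gauss:
            theta = theta - h * grad_U(theta) + np.sqrt(2 * h) * xi
            out.append(theta.copy())
        return np.array(out)

    d, n = 2, 4096
    grad_U = lambda th: th                     # U(theta) = ||theta||^2 / 2
    gauss_iid = np.random.default_rng(0).normal(size=(n, d))
    u = qmc.Sobol(d, scramble=True, seed=0).random(n)
    gauss_qmc = norm.ppf(np.clip(u, 1e-12, 1 - 1e-12))
    est_iid = lmc(grad_U, np.zeros(d), 0.1, gauss_iid).mean(axis=0)
    est_qmc = lmc(grad_U, np.zeros(d), 0.1, gauss_qmc).mean(axis=0)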

A Unified Discretization Framework for Differential Equation Approach with Lyapunov Arguments for Convex Optimization
Kansei Ushiyama Shun Sato Takayasu Matsuo



Research question: The differential equation (DE) approach to convex optimization relates optimization methods to specific continuous DEs with rate-revealing Lyapunov functionals; the problem is how to transition consistently from this continuous analysis back to discrete optimization methods.
Motivation: Although the DE approach has attracted growing interest since the seminal paper of Su-Boyd-Candès (2014), it lacks a crucial component: there is no general, consistent way to transition back to discrete optimization methods, so even with insights from continuous DEs, each method's analysis still requires individualized and tedious calculations.
Method: A new concept, the "weak discrete gradient" (wDG), consolidates the conditions required of discrete gradients in DE-approach arguments; abstract optimization methods are then defined via wDG, with abstract convergence theories paralleling those of the continuous DEs.
Results: Many typical optimization methods and their convergence rates are recovered as special cases of the abstract theory. The proposed unified discretization framework offers a simple environment for developing new optimization methods and achieving convergence rates competitive with state-of-the-art methods such as Nesterov's accelerated gradient.

The differential equation (DE) approach for convex optimization, which relates optimization methods to specific continuous DEs with rate-revealing Lyapunov functionals, has gained increasing interest since the seminal paper by Su--Boyd--Candès (2014). However, the approach still lacks a crucial component to make it truly useful: there is no general, consistent way to transition back to discrete optimization methods. Consequently, even if we derive insights from continuous DEs, we still need to perform individualized and tedious calculations for the analysis of each method. This paper aims to bridge this gap by introducing a new concept called ``weak discrete gradient'' (wDG), which consolidates the conditions required for discrete versions of gradients in the DE approach arguments. We then define abstract optimization methods using wDG and provide abstract convergence theories that parallel those in continuous DEs. We demonstrate that many typical optimization methods and their convergence rates can be derived as special cases of this abstract theory. The proposed unified discretization framework for the differential equation approach to convex optimization provides an easy environment for developing new optimization methods and achieving competitive convergence rates with state-of-the-art methods, such as Nesterov's accelerated gradient.

Exploring the Optimal Choice for Generative Processes in Diffusion Models: Ordinary vs Stochastic Differential Equations
Yu Cao Jingrun Chen Yixin Luo Xiang ZHOU



Research question: Whether the ODE-based probability flow or the SDE-based diffusion model is superior in computer vision, and under what circumstances.
Motivation: Comparing the two is challenging because of dependencies on data distributions, score training, and other numerical issues, so the question is studied mathematically.
Method: A pulse-shape error is introduced to perturb the score function and the resulting accumulation of sampling error is analyzed, followed by a thorough analysis generalizing to arbitrary errors.
Results: When the perturbation occurs at the end of the generative process, the ODE model outperforms the SDE model with a large diffusion coefficient; when it occurs earlier, the SDE model outperforms the ODE model. Moreover, the sample-generation error due to pulse-shape perturbations is exponentially suppressed as the magnitude of the diffusion term grows to infinity. Numerical validation is provided on Gaussian, Gaussian mixture, and Swiss roll distributions, as well as on realistic datasets such as MNIST and CIFAR-10.

The diffusion model has shown remarkable success in computer vision, but it remains unclear whether the ODE-based probability flow or the SDE-based diffusion model is superior and under what circumstances. Comparing the two is challenging due to dependencies on data distributions, score training, and other numerical issues. In this paper, we study the problem mathematically for two limiting scenarios: the zero diffusion (ODE) case and the large diffusion case. We first introduce a pulse-shape error to perturb the score function and analyze error accumulation of sampling quality, followed by a thorough analysis for generalization to arbitrary error. Our findings indicate that when the perturbation occurs at the end of the generative process, the ODE model outperforms the SDE model with a large diffusion coefficient. However, when the perturbation occurs earlier, the SDE model outperforms the ODE model. We demonstrate that the error of sample generation due to the pulse-shape perturbation is exponentially suppressed as the diffusion term's magnitude increases to infinity. Numerical validation of this phenomenon is provided using Gaussian, Gaussian mixture, and Swiss roll distribution, as well as realistic datasets like MNIST and CIFAR-10.

Errors-in-variables Fr\'echet Regression with Low-rank Covariate Approximation
Dogyoon Song Kyunghee Han



Research question: A new estimation method that addresses the limitations of regression analysis with non-Euclidean response variables.
Motivation: Existing Fréchet regression methods rely on the ideal scenario of abundant, noiseless covariate data, limiting their practical applicability.
Method: A novel estimation method that leverages the low-rank structure inherent in the covariate matrix, combining the concepts of global Fréchet regression and principal component regression.
Results: The method enables more effective modeling and estimation, particularly in high-dimensional and errors-in-variables regression settings. Theoretical analysis and numerical experiments both support its superior performance, introducing a promising framework for regression analysis of non-Euclidean variables.

Fr\'echet regression has emerged as a promising approach for regression analysis involving non-Euclidean response variables. However, its practical applicability has been hindered by its reliance on ideal scenarios with abundant and noiseless covariate data. In this paper, we present a novel estimation method that tackles these limitations by leveraging the low-rank structure inherent in the covariate matrix. Our proposed framework combines the concepts of global Fr\'echet regression and principal component regression, aiming to improve the efficiency and accuracy of the regression estimator. By incorporating the low-rank structure, our method enables more effective modeling and estimation, particularly in high-dimensional and errors-in-variables regression settings. We provide a theoretical analysis of the proposed estimator's large-sample properties, including a comprehensive rate analysis of bias, variance, and additional variations due to measurement errors. Furthermore, our numerical experiments provide empirical evidence that supports the theoretical findings, demonstrating the superior performance of our approach. Overall, this work introduces a promising framework for regression analysis of non-Euclidean variables, effectively addressing the challenges associated with limited and noisy covariate data, with potential applications in diverse fields.

Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration
Longlin Yu Tianyu Xie Yu Zhu Tong Yang Xiangyu Zhang Cheng Zhang



Research question: The single-layer architecture commonly used in current semi-implicit variational inference (SIVI) methods can be insufficient when the target posterior has a complicated structure.
Motivation: To address this, hierarchical semi-implicit variational inference (HSIVI) is proposed: auxiliary distributions interpolating between a simple base distribution and the target distribution allow the conditional layers to be trained by progressively matching these auxiliary distributions one layer after another.
Method: HSIVI generalizes SIVI to allow more expressive multi-layer constructions of semi-implicit distributions. Given pre-trained score networks, HSIVI can also accelerate the sampling process of diffusion models via a score matching objective.
Results: Experiments show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, HSIVI produces high-quality samples comparable to or better than existing fast diffusion-model-based samplers with a small number of function evaluations on various datasets.

Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.

Gaussian Mixture Solvers for Diffusion Models
Hanzhong Allan Guo Cheng Lu Fan Bao Tianyu Pang Shuicheng YAN Chao Du Chongxuan Li



Research question: SDE-based solvers for diffusion models generate high-quality samples and suit image-translation tasks, but they face an efficiency-effectiveness dilemma.
Motivation: With a limited number of discretization steps, the Gaussian assumption on the reverse transition kernel is frequently violated, severely constraining existing SDE-based solvers during inference.
Method: A new class of SDE-based solvers, Gaussian Mixture Solvers (GMS), which at each sampling step estimate the first three moments and optimize the parameters of a Gaussian mixture transition kernel.
Results: Empirically, GMS outperforms numerous SDE-based solvers in sample quality for image generation and stroke-based synthesis across various diffusion models, validating its motivation and effectiveness.

Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called \emph{Gaussian Mixture Solvers (GMS)} for diffusion models. Our solver estimates the first three-order moments and optimizes the parameters of a Gaussian mixture transition kernel using generalized methods of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis in various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS.

Hyperbolic VAE via Latent Gaussian Distributions
Seunghyuk Cho Juyong Lee Dongwoo Kim



Research question: A Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of Gaussian distributions.
Motivation: Existing variational auto-encoders fall short on tasks such as density estimation of image datasets and state representation learning for model-based reinforcement learning.
Method: A pseudo-Gaussian manifold normal distribution, based on the Kullback-Leibler divergence as a local approximation of the squared Fisher-Rao distance, defines a density over the latent space.
Results: Experiments show GM-VAE outperforms other hyperbolic and Euclidean VAE variants on density estimation and is competitive in model-based reinforcement learning, while providing strong numerical stability that addresses a commonly reported limitation of previous hyperbolic VAEs.

We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of Gaussian distributions. It is known that the set of the univariate Gaussian distributions with the Fisher information metric form a hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed with the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal distribution based on the Kullback-Leibler divergence, a local approximation of the squared Fisher-Rao distance, to define a density over the latent space. We demonstrate the efficacy of GM-VAE on two different tasks: density estimation of image datasets and state representation learning for model-based reinforcement learning. GM-VAE outperforms the other variants of hyperbolic- and Euclidean-VAEs on density estimation tasks and shows competitive performance in model-based reinforcement learning. We observe that our model provides strong numerical stability, addressing a common limitation reported in previous hyperbolic-VAEs. The implementation is available at https://github.com/ml-postech/GM-VAE.

The probability flow ODE is provably fast
Sitan Chen Sinho Chewi Holden Lee Yuanzhi Li Jianfeng Lu Adil Salim



Research question: Providing the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling.
Motivation: The analysis follows recent results that obtained such guarantees for the SDE implementation (i.e., denoising diffusion probabilistic modeling, DDPM), but requires the development of new techniques for studying deterministic dynamics without contractivity.
Method: A specially chosen corrector step based on the underdamped Langevin diffusion.
Results: The analysis yields better dimension dependence than prior work on DDPM ($O(\sqrt d)$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.

We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt d)$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.
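
A toy probability flow ODE for a one-dimensional Gaussian data distribution under a constant-beta VP forward process, where the score is available in closed form; the paper's corrector step based on the underdamped Langevin diffusion is omitted, and all constants are illustrative.

    import numpy as np
    from scipy.integrate import solve_ivp

    beta, m0, v0 = 1.0, 2.0, 0.25      # constant noise schedule; data ~ N(m0, v0)

    def score(x, t):
        """Closed-form score of the VP forward marginal for Gaussian data."""
        a = np.exp(-0.5 * beta * t)
        m_t, v_t = m0 * a, v0 * a**2 + 1.0 - a**2
        return -(x - m_t) / v_t

    def pf_ode(t, x):
        # probability flow ODE: dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t)
        return -0.5 * beta * x - 0.5 * beta * score(x, t)

    rng = np.random.default_rng(0)
    x_T = rng.normal(size=1000)                  # start near the N(0, 1) prior
    sol = solve_ivp(pf_ode, (5.0, 1e-3), x_T, rtol=1e-6)  # integrate backward in t
    x0 = sol.y[:, -1]                            # approximate samples from N(m0, v0)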

On Calibrating Diffusion Probabilistic Models
Tianyu Pang Cheng Lu Chao Du Min Lin Shuicheng YAN Zhijie Deng



Research question: How to improve pretrained diffusion probabilistic models (DPMs) across diverse generative tasks.
Motivation: The stochastic reverse process of data scores is a martingale, from which concentration bounds and an optional stopping theorem for data scores can be derived.
Method: A simple, one-time calibration of an arbitrary pretrained DPM that reduces the score matching loss and thereby increases the lower bounds of the model likelihood, with general calibration guidelines provided under various model parametrizations.
Results: Experiments on multiple datasets show the method significantly improves sampling from DPMs, and the calibrated models can be reused repeatedly.

Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. Then, we discover a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is available at https://github.com/thudzj/Calibrated-DPMs.

Statistical Insights into HSIC in High Dimensions
Tao Zhang Yaowu Zhang Tingyou Zhou



Research question: Measuring the nonlinear dependence between random vectors and testing their statistical independence is a fundamental problem in statistics.
Motivation: The Hilbert-Schmidt independence criterion (HSIC) is one of the most popular dependence measures and has attracted increasing attention in recent years, yet most existing work focuses on either fixed or very high-dimensional covariates.
Method: Bridge the gap between these two scenarios and provide statistical insights into the performance of HSIC when the dimensions grow at different rates.
Results: Under the null hypothesis, the rescaled HSIC converges in distribution to a standard normal. A general condition is given for HSIC-based tests to have nontrivial power in high dimensions, and its decomposition shows how HSIC's ability to measure nonlinear dependence changes as dimensions increase. Moreover, depending on the sample size, the covariate dimensions, and the dependence structure within the covariates, HSIC can capture different types of associations between random vectors. Extensive numerical studies validate the theoretical results.

Measuring the nonlinear dependence between random vectors and testing for their statistical independence is a fundamental problem in statistics. One of the most popular dependence measures is the Hilbert-Schmidt independence criterion (HSIC), which has attracted increasing attention in recent years. However, most existing works have focused on either fixed or very high-dimensional covariates. In this work, we bridge the gap between these two scenarios and provide statistical insights into the performance of HSIC when the dimensions grow at different rates. We first show that, under the null hypothesis, the rescaled HSIC converges in distribution to a standard normal distribution. Then we provide a general condition for the HSIC based tests to have nontrivial power in high dimensions. By decomposing this condition, we illustrate how the ability of HSIC to measure nonlinear dependence changes with increasing dimensions. Moreover, we demonstrate that, depending on the sample size, the covariate dimensions and the dependence structures within covariates, the HSIC can capture different types of associations between random vectors. We also conduct extensive numerical studies to validate our theoretical results.
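
For reference, the standard biased empirical HSIC with Gaussian kernels, trace(KHLH)/(n-1)^2, together with a permutation test; the paper's contribution concerns the behavior of a rescaled version of this statistic as the dimensions grow, which this fixed-dimension sketch does not capture.

    import numpy as np

    def hsic(X, Y, sx=1.0, sy=1.0):
        """Biased empirical HSIC with Gaussian kernels."""
        n = len(X)
        def gram(Z, s):
            sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
            return np.exp(-sq / (2 * s**2))
        H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
        K, L = gram(X, sx), gram(Y, sy)
        return np.trace(K @ H @ L @ H) / (n - 1) ** 2

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(200, 1))  # nonlinear dependence
    stat = hsic(X, Y)
    null = [hsic(X, Y[rng.permutation(200)]) for _ in range(200)]
    p_value = np.mean(np.array(null) >= stat)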

Multinomial Logistic Regression: Asymptotic Normality on Null Covariates in High-Dimensions
Kai Tan Pierre C Bellec



Research question: The asymptotic distribution of the maximum likelihood estimate (MLE) in multinomial logistic models in the high-dimensional regime.
Motivation: Classical large-sample theory provides asymptotic normality of the MLE under certain conditions, but such classical results can fail in high dimensions.
Method: For classification problems with 3 or more classes, asymptotic normality and asymptotic chi-square results are developed for the multinomial logistic MLE (also known as the cross-entropy minimizer) on null covariates.
Results: Extensive simulations on synthetic data corroborate the asymptotic results and confirm the validity of the proposed p-values for testing the significance of a given feature.

This paper investigates the asymptotic distribution of the maximum-likelihood estimate (MLE) in multinomial logistic models in the high-dimensional regime where dimension and sample size are of the same order. While classical large-sample theory provides asymptotic normality of the MLE under certain conditions, such classical results are expected to fail in high-dimensions as documented for the binary logistic case in the seminal work of Sur and Candès [2019]. We address this issue in classification problems with 3 or more classes, by developing asymptotic normality and asymptotic chi-square results for the multinomial logistic MLE (also known as cross-entropy minimizer) on null covariates. Our theory leads to a new methodology to test the significance of a given feature. Extensive simulation studies on synthetic data corroborate these asymptotic results and confirm the validity of proposed p-values for testing the significance of a given feature.

Perceptual Kalman Filters: Online State Estimation under a Perfect Perceptual-Quality Constraint
Dror Freirich Tomer Michaeli Ron Meir



Research question: How to reconstruct temporal signals from corrupted or missing data while achieving the best quality for human perception.
Motivation: Many practical settings, such as decoding, tracking, signal enhancement, and denoising, require reconstructing temporal signals from corrupted or missing data; since the reconstructions are ultimately viewed by humans, they should conform to human perception.
Method: The problem of optimal causal filtering under a perfect perceptual-quality constraint is studied, a task of fundamentally different nature, for a Gaussian Markov signal observed through a linear noisy transformation. Without perceptual constraints, the Kalman filter is MSE-optimal in this setting; adding the perfect perceptual-quality constraint (i.e., requiring temporal consistency) introduces a fundamental dilemma whereby the filter may have to "knowingly" ignore new information revealed by the observations in order to conform to its past decisions, often at the cost of a significant increase in MSE beyond that of the static setting. The analysis goes beyond the classic innovation process of the Kalman filter and introduces the new concept of an unutilized information process, yielding a recursive formula for perceptual filters.
Results: The qualitative effects of perfect perceptual-quality estimation are demonstrated on a video reconstruction problem.

Many practical settings call for the reconstruction of temporal signals from corrupted or missing data. Classic examples include decoding, tracking, signal enhancement and denoising. Since the reconstructed signals are ultimately viewed by humans, it is desirable to achieve reconstructions that are pleasing to human perception. Mathematically, perfect perceptual-quality is achieved when the distribution of restored signals is the same as that of natural signals, a requirement which has been heavily researched in static estimation settings (i.e. when a whole signal is processed at once). Here, we study the problem of optimal causal filtering under a perfect perceptual-quality constraint, which is a task of fundamentally different nature. Specifically, we analyze a Gaussian Markov signal observed through a linear noisy transformation. In the absence of perceptual constraints, the Kalman filter is known to be optimal in the MSE sense for this setting. Here, we show that adding the perfect perceptual quality constraint (i.e. the requirement of temporal consistency), introduces a fundamental dilemma whereby the filter may have to ``knowingly'' ignore new information revealed by the observations in order to conform to its past decisions. This often comes at the cost of a significant increase in the MSE (beyond that encountered in static settings). Our analysis goes beyond the classic innovation process of the Kalman filter, and introduces the novel concept of an unutilized information process. Using this tool, we present a recursive formula for perceptual filters, and demonstrate the qualitative effects of perfect perceptual-quality estimation on a video reconstruction problem.

Generator Identification for Linear SDEs with Additive and Multiplicative Noise
Yuanyuan Wang Xi Geng Wei Huang Biwei Huang Mingming Gong



Research question: How to identify the generator of a linear stochastic differential equation (SDE) from the distribution of its solution process with a given fixed initial state.
Motivation: Such identifiability conditions are crucial for causal inference with linear SDEs, as they enable identification of post-intervention distributions from the observational distribution.
Method: A sufficient and necessary condition is derived for identifying the generator of linear SDEs with additive noise, along with a sufficient condition for linear SDEs with multiplicative noise; geometric interpretations of the derived conditions are provided to enhance understanding.
Results: A series of simulations validates the theoretical results, supporting and substantiating the established findings.

In this paper, we present conditions for identifying the generator of a linear stochastic differential equation (SDE) from the distribution of its solution process with a given fixed initial state. These identifiability conditions are crucial in causal inference using linear SDEs as they enable the identification of the post-intervention distributions from its observational distribution. Specifically, we derive a sufficient and necessary condition for identifying the generator of linear SDEs with additive noise, as well as a sufficient condition for identifying the generator of linear SDEs with multiplicative noise. We show that the conditions derived for both types of SDEs are generic. Moreover, we offer geometric interpretations of the derived identifiability conditions to enhance their understanding. To validate our theoretical results, we perform a series of simulations, which support and substantiate the established findings.

A Riemannian Exponential Augmented Lagrangian Method for Computing the Projection Robust Wasserstein Distance
Bo Jiang Ya-Feng Liu



Research question: How to effectively mitigate the curse of dimensionality in the classical Wasserstein distance.
Motivation: The projection robust Wasserstein (PRW) distance was proposed to address the challenges of the classical Wasserstein distance on high-dimensional problems.
Method: The computation of the PRW distance is equivalently reformulated as an optimization problem over the Cartesian product of the Stiefel manifold and a Euclidean space with additional nonlinear inequality constraints, and a Riemannian exponential augmented Lagrangian method (REALM) is proposed to solve it.
Results: Compared with existing Riemannian exponential penalty approaches, REALM avoids overly small penalty parameters and exhibits more stable numerical performance. To solve the subproblems efficiently, an inexact Riemannian Barzilai-Borwein method with Sinkhorn iteration (iRBBS) selects stepsizes adaptively rather than by manual tuning; iRBBS returns an $\epsilon$-stationary point of the original PRW problem within $\mathcal{O}(\epsilon^{-3})$ iterations, matching the best known iteration complexity. Extensive numerical results show the proposed methods outperform state-of-the-art solvers for computing the PRW distance.

The projection robust Wasserstein (PRW) distance was recently proposed to efficiently mitigate the curse of dimensionality in the classical Wasserstein distance. In this paper, by equivalently reformulating the computation of the PRW distance as an optimization problem over the Cartesian product of the Stiefel manifold and the Euclidean space with additional nonlinear inequality constraints, we propose a Riemannian exponential augmented Lagrangian method (REALM) for solving this problem. Compared with the existing Riemannian exponential penalty-based approaches, REALM can potentially avoid too small penalty parameters and exhibit more stable numerical performance. To solve the subproblems in REALM efficiently, we design an inexact Riemannian Barzilai-Borwein method with Sinkhorn iteration (iRBBS), which selects the stepsizes adaptively rather than tuning them by hand as done in the existing methods. We show that iRBBS can return an $\epsilon$-stationary point of the original PRW distance problem within $\mathcal{O}(\epsilon^{-3})$ iterations, which matches the best known iteration complexity result. Extensive numerical results demonstrate that our proposed methods outperform the state-of-the-art solvers for computing the PRW distance.
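
A minimal sketch of the inner evaluation behind the PRW distance: fix a projection $U$ on the Stiefel manifold, project both samples, and compute an entropy-regularized OT cost via Sinkhorn. REALM/iRBBS additionally optimize over $U$, which this sketch omits.

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=0.05, iters=500):
    """Entropic OT cost between two empirical measures (uniform weights)."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
    K = np.exp(-C / eps)
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
Y = rng.normal(loc=0.5, size=(200, 20))
U, _ = np.linalg.qr(rng.normal(size=(20, 2)))   # a random Stiefel point, k = 2
print(sinkhorn_cost(X @ U, Y @ U))              # cost for this fixed projection
```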

Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies
Oscar Li James Harrison Jascha Sohl-Dickstein Virginia Smith Luke Metz



Research question: how to estimate gradients for machine-learning loss functions that exhibit extreme local sensitivity, discontinuity, or blackbox characteristics, where automatic differentiation struggles.
Motivation: online evolution strategies are more parallelizable than vanilla evolution strategies because they interleave partial unrolls with gradient updates, but this interleaving must be handled without introducing bias.
Method: we propose a general class of unbiased online evolution strategies, analytically characterize the variance of these gradient estimators, and identify the minimum-variance member, Noise-Reuse Evolution Strategies (NRES).
Results: experiments show that NRES converges faster than existing automatic-differentiation and evolution-strategies methods, in both wall-clock time and number of unroll steps, across applications including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.

Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivity, discontinuity, or blackbox characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolution strategies (ES) by interleaving partial unrolls and gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show NRES results in faster convergence than existing AD and ES methods in terms of wall-clock time and number of unroll steps across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.
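
A minimal numpy sketch of the noise-reuse idea as I read the abstract: within one episode, the same antithetic perturbation is reused across every partial unroll rather than redrawn per truncation. The toy dynamics, loss, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def step(state, theta):
    return 0.9 * state + np.sin(theta * state)      # toy unrolled dynamics

def loss(state):
    return state ** 2

def noise_reuse_es_grad(theta, T=32, W=4, sigma=0.1, n_pairs=64, seed=0):
    rng = np.random.default_rng(seed)
    grad = 0.0
    for _ in range(n_pairs):
        eps = rng.normal()                          # one noise draw, reused
        s_pos = s_neg = 1.0                         # shared initial state
        for t0 in range(0, T, W):                   # online truncations of length W
            g_trunc = 0.0
            for _ in range(W):
                s_pos = step(s_pos, theta + sigma * eps)
                s_neg = step(s_neg, theta - sigma * eps)
                g_trunc += loss(s_pos) - loss(s_neg)
            grad += eps / (2 * sigma) * g_trunc     # antithetic ES contribution
    return grad / n_pairs

print(noise_reuse_es_grad(theta=0.5))
```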

Koopman Kernel Regression
Petar Bevanda Max Beier Armin Lederer Stefan Georg Sosnowski Eyke Hüllermeier Sandra Hirche



Research question: how to use simulators or predictive models for decision making, in particular the challenges posed by nonlinear dynamical systems.
Motivation: existing machine-learning approaches for complex forecasting phenomena usually lack learning-theoretic guarantees, so the behavior of the learned models with increasing data and dimensionality is unclear.
Method: we derive a novel trajectory-based reproducing kernel Hilbert space (RKHS) grounded in Koopman operator theory, turning multi-step forecasting into sparse matrix multiplications, and use statistical-learning tools for function approximation to obtain new convergence results and generalization error bounds.
Results: experiments demonstrate superior forecasting performance compared with Koopman-operator and sequential-data predictors in RKHS.

Many machine learning approaches for decision making, such as reinforcement learning, rely on simulators or predictive models to forecast the time-evolution of quantities of interest, e.g., the state of an agent or the reward of a policy. Forecasts of such complex phenomena are commonly described by highly nonlinear dynamical systems, making their use in optimization-based decision-making challenging. Koopman operator theory offers a beneficial paradigm for addressing this problem by characterizing forecasts via linear time-invariant (LTI) ODEs -- turning multi-step forecasting into sparse matrix multiplications. Though there exists a variety of learning approaches, they usually lack crucial learning-theoretic guarantees, making the behavior of the obtained models with increasing data and dimensionality unclear. We address the aforementioned issues by deriving a novel reproducing kernel Hilbert space (RKHS) over trajectories that solely spans transformations into LTI dynamical systems. The resulting Koopman Kernel Regression (KKR) framework enables the use of statistical learning tools from function approximation for novel convergence results and generalization error bounds under weaker assumptions than existing work. Our experiments demonstrate superior forecasting performance compared to Koopman operator and sequential data predictors in RKHS.

Front-door Adjustment Beyond Markov Equivalence with Limited Graph Knowledge
Abhin Shah Karthikeyan Shanmugam Murat Kocaoglu



Research question: how to effectively estimate causal effects from data, particularly when the treatment and outcome variables are confounded.
Motivation: traditional causal-effect estimation requires either an explicit causal graph structure or (conditional) independence statements between counterfactual variables in the potential-outcomes framework, and such assumptions are difficult to learn in practice.
Method: we provide testable conditional independence statements that allow computing the causal effect via a front-door-like adjustment without knowing the graph, using only limited structural side information; the method applies even in scenarios where knowing the Markov equivalence class is insufficient for causal effect estimation.
Results: effectiveness is demonstrated on a class of random graphs as well as on real causal fairness benchmarks.

Causal effect estimation from data typically requires assumptions about the cause-effect relations either explicitly in the form of a causal graph structure within the Pearlian framework, or implicitly in terms of (conditional) independence statements between counterfactual variables within the potential outcomes framework. When the treatment variable and the outcome variable are confounded, front-door adjustment is an important special case where, given the graph, the causal effect of the treatment on the target can be estimated using post-treatment variables. However, the exact formula for front-door adjustment depends on the structure of the graph, which is difficult to learn in practice. In this work, we provide testable conditional independence statements to compute the causal effect using front-door-like adjustment without knowing the graph under limited structural side information. We show that our method is applicable in scenarios where knowing the Markov equivalence class is not sufficient for causal effect estimation. We demonstrate the effectiveness of our method on a class of random graphs as well as real causal fairness benchmarks.
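
For reference, the classic front-door adjustment (with treatment $X$, mediator $M$, outcome $Y$) whose dependence on the exact graph structure the paper works around:

```latex
P(y \mid \mathrm{do}(x))
  \;=\; \sum_{m} P(m \mid x)\,\sum_{x'} P(y \mid m, x')\,P(x').
```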

Deep Equilibrium Based Neural Operators for Steady-State PDEs
Tanya Marwah Ashwini Pokle J Zico Kolter Zachary Chase Lipton Jianfeng Lu Andrej Risteski



Research question: how to solve partial differential equations (PDEs) with data-driven machine learning, where the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood.
Motivation: the solution of most steady-state PDEs can be expressed as the fixed point of a nonlinear operator; motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer.
Method: a black-box root solver finds this fixed point and we differentiate analytically through it, yielding $\mathcal{O}(1)$ training memory.
Results: experiments show that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ as many parameters on steady-state PDEs such as Darcy flow and the incompressible Navier-Stokes equations; FNO-DEQ is also more robust when the training data contain substantial observation noise, demonstrating the benefit of appropriate inductive biases in architecture design for neural PDE solvers. Finally, a universal approximation result shows that FNO-DEQ can approximate the solution of any steady-state PDE that can be written as a fixed-point equation.

Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Moreover, FNO-DEQ is more robust than the FNO-based baselines when trained on datasets with noisier observations, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Finally, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.
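
A minimal PyTorch sketch of the deep-equilibrium pattern described above: the fixed point is found under `no_grad`, then gradients are re-attached. Note the paper differentiates analytically through the fixed point; the single differentiable re-application below is a common memory-cheap approximation, and `SimpleOp` is a toy stand-in rather than an FNO layer.

```python
import torch

class SimpleOp(torch.nn.Module):
    """Toy stand-in for the implicit operator layer (not an FNO block)."""
    def __init__(self, d):
        super().__init__()
        self.lin_z = torch.nn.Linear(d, d)
        self.lin_x = torch.nn.Linear(d, d)

    def forward(self, z, x):
        return torch.tanh(self.lin_z(z) + self.lin_x(x))

def deq_forward(f, x, iters=50):
    z = torch.zeros_like(x)
    with torch.no_grad():                 # root solve: no graph is stored
        for _ in range(iters):
            z = f(z, x)
    return f(z, x)                        # one differentiable re-application

f = SimpleOp(8)
x = torch.randn(4, 8)
u = deq_forward(f, x)
u.sum().backward()                        # gradients flow through the last step
```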

Percentile Criterion Optimization in Offline Reinforcement Learning
Cyrus Cousins Elita Lobo Marek Petrik Yair Zick



Research question: how to optimize robust policies for high-stakes decision problems in reinforcement learning, particularly with limited data.
Motivation: existing methods optimize the percentile criterion by constructing a high-probability uncertainty set containing the true model and optimizing the policy against the worst model in the set, but this approach faces challenges such as non-convexity and overly conservative policies.
Method: we propose a Value-at-Risk based dynamic programming algorithm that optimizes the percentile criterion without explicitly constructing any uncertainty set.
Results: theoretical and empirical results show that the method implicitly constructs much smaller uncertainty sets and learns less conservative robust policies.

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion. The percentile criterion is optimized by constructing an uncertainty set that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing these sets itself is challenging. Existing works use Bayesian credible regions as uncertainty sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any uncertainty sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller uncertainty sets and learns less-conservative robust policies.

Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models
Naoki Egami Musashi Hinck Brandon M. Stewart Hanying Wei



Research question: how to conduct unbiased statistical inference with proper uncertainty quantification given imperfect annotations from large language models (LLMs).
Motivation: LLMs can annotate documents cheaply at scale, but such surrogate annotations are often imperfect and biased.
Method: we propose a new algorithm, the design-based supervised learning (DSL) estimator, which combines surrogate labels with a smaller number of high-quality gold-standard labels to achieve unbiased statistical inference.
Results: theoretical analysis and experiments show that DSL guarantees valid statistical inference while achieving root mean squared errors comparable to existing methods that focus only on prediction without inferential guarantees.

In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then explain labels using interpretable regression analyses in the second step. One increasingly common way to annotate documents cheaply at scale is through large language models (LLMs). However, like other scalable ways of producing annotations, such surrogate labels are often imperfect and biased. We present a new algorithm for using imperfect annotation surrogates for downstream statistical analyses while guaranteeing statistical properties—like asymptotic unbiasedness and proper uncertainty quantification—which are fundamental to CSS research. We show that direct use of surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even with high surrogate accuracy of 80-90\%. To address this, we build on debiased machine learning to propose the design-based supervised learning (DSL) estimator. DSL employs a doubly-robust procedure to combine surrogate labels with a smaller number of high-quality, gold-standard labels. Our approach guarantees valid inference for downstream statistical analyses, even when surrogates are arbitrarily biased and without requiring stringent assumptions, by controlling the probability of sampling documents for gold-standard labeling. Both our theoretical analysis and experimental results show that DSL provides valid statistical inference while achieving root mean squared errors comparable to existing alternatives that focus only on prediction without inferential guarantees.
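
A minimal sketch of a design-based correction in the spirit of DSL; the exact pseudo-outcome, sampling scheme, and downstream OLS below are assumptions for illustration, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(float)
yhat = np.clip(y + rng.normal(0.15, 0.3, size=n), 0, 1)   # biased surrogate labels

pi = np.full(n, 0.1)                 # *known* gold-labeling probability
R = rng.random(n) < pi               # documents sampled for gold labels

# Bias-corrected pseudo-outcome: has the same conditional mean as y even
# when the surrogate yhat is arbitrarily biased, because E[R/pi] = 1.
y_tilde = yhat + (R / pi) * (y - yhat)

# Downstream interpretable regression on the pseudo-outcomes (OLS).
Z = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Z, y_tilde, rcond=None)[0]
print(beta)
```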

Continuous-Time Functional Diffusion Processes
Giulio Franzese Giulio Corallo Simone Rossi Markus Heinonen Maurizio Filippone Pietro Michiardi



Research question: to propose functional diffusion processes (FDPs) that generalize score-based diffusion models to infinite-dimensional function spaces.
Motivation: existing score-based diffusion models require specialized network architectures and can handle only particular kinds of continuous data.
Method: we introduce a new mathematical framework describing the forward and backward dynamics, together with several extensions needed to derive practical training objectives, including infinite-dimensional versions of the Girsanov theorem and of the sampling theorem, and use them to build a new breed of generative models.
Results: results on real data show that FDPs achieve high-quality image generation with a simple multilayer-perceptron architecture using orders of magnitude fewer parameters than existing diffusion models.

We introduce Functional Diffusion Processes (FDPs), which generalize score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of the Girsanov theorem, in order to be able to compute an ELBO, and of the sampling theorem, in order to guarantee that functional evaluations in a countable set of points are equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces, which do not require specialized network architectures, and that can work with any kind of continuous data. Our results on real data show that FDPs achieve high-quality image generation, using a simple MLP architecture with orders of magnitude fewer parameters than existing diffusion models.

Refined Mechanism Design for Approximately Structured Priors via Active Regression
Christos Boutsikas Petros Drineas Marios Mertzanidis Alexandros Psomas Paritosh Verma



Research question: how to design a revenue-maximizing mechanism for a seller with a large number of items facing multiple strategic bidders whose prior distributions are unknown.
Motivation: optimal mechanisms for this setting are notoriously difficult to compute or characterize and, even when they can be found, are often rife with counter-intuitive properties.
Method: following a model recently introduced by Cai and Daskalakis, the bidders' prior distributions are approximated by a topic model; we design an active-learning component that interacts with the bidders and outputs low-dimensional approximations of their types, and a mechanism-design component that robustifies mechanisms for the low-dimensional model so they work with the former component's approximate types.
Results: this is the first work to connect mechanism design with Randomized Linear Algebra (RLA) for active learning of regression problems, opening the door for further applications of randomized linear algebra primitives to mechanism design.

We consider the problem of a revenue-maximizing seller with a large number of items $m$ for sale to $n$ strategic bidders, whose valuations are drawn independently from high-dimensional, unknown prior distributions. It is well-known that optimal and even approximately-optimal mechanisms for this setting are notoriously difficult to characterize or compute, and, even when they can be found, are often rife with various counter-intuitive properties. In this paper, following a model introduced recently by Cai and Daskalakis [CD22], we consider the case that bidders' prior distributions can be well-approximated by a topic model. We design an active learning component, responsible for interacting with the bidders and outputting low-dimensional approximations of their types, and a mechanism design component, responsible for robustifying mechanisms for the low-dimensional model to work for the approximate types of the former component. On the active learning front, we cast our problem in the framework of Randomized Linear Algebra (RLA) for regression problems, allowing us to import several breakthrough results from that line of research, and adapt them to our setting. On the mechanism design front, we remove many restrictive assumptions of prior work on the type of access needed to the underlying distributions and the associated mechanisms. To the best of our knowledge, our work is the first to formulate connections between mechanism design, and RLA for active learning of regression problems, opening the door for further applications of randomized linear algebra primitives to mechanism design.

Block Coordinate Plug-and-Play Methods for Blind Inverse Problems
Weijie Gan Shirin Shoushtari Yuyang Hu Jiaming Liu Hongyu An Ulugbek Kamilov



Research question: solving blind inverse problems, i.e., image recovery when the measurement operator is unknown.
Motivation: PnP methods with known measurement operators have been widely used for image recovery, but there is little work on PnP for blind inverse problems.
Method: we propose a new block-coordinate PnP (BC-PnP) method that efficiently solves the joint estimation problem by introducing learned denoisers as priors on both the unknown image and the unknown measurement operator.
Results: we provide a new convergence theory for BC-PnP compatible with blind inverse problems, allowing nonconvex data-fidelity terms and expansive denoisers; numerical experiments on automatic coil sensitivity estimation in MRI and on blind image deblurring show that BC-PnP offers an efficient and principled framework for jointly estimating measurement operators and images with denoisers as PnP priors.

Plug-and-play (PnP) prior is a well-known class of methods for solving imaging inverse problems by computing fixed-points of operators combining physical measurement models and learned image denoisers. While PnP methods have been extensively used for image recovery with known measurement operators, there is little work on PnP for solving blind inverse problems. We address this gap by presenting a new block-coordinate PnP (BC-PnP) method that efficiently solves this joint estimation problem by introducing learned denoisers as priors on both the unknown image and the unknown measurement operator. We present a new convergence theory for BC-PnP compatible with blind inverse problems by considering nonconvex data-fidelity terms and expansive denoisers. Our theory analyzes the convergence of BC-PnP to a stationary point of an implicit function associated with an approximate minimum mean-squared error (MMSE) denoiser. We numerically validate our method on two blind inverse problems: automatic coil sensitivity estimation in magnetic resonance imaging (MRI) and blind image deblurring. Our results show that BC-PnP provides an efficient and principled framework for using denoisers as PnP priors for jointly estimating measurement operators and images.

Self-Consistent Velocity Matching of Probability Flows
Lingxiao Li Samuel Hurault Justin Solomon



Research question: to propose a discretization-free, scalable framework for solving a large class of mass-conserving partial differential equations (PDEs), including the time-dependent Fokker-Planck equation and Wasserstein gradient flows.
Motivation: existing approaches face computational obstacles and scope limitations, and directly minimizing the residual of the fixed-point equation with neural parameterization is costly.
Method: we use an iterative formulation with a biased gradient estimator that bypasses significant computational obstacles while retaining strong empirical performance.
Results: experiments show the method recovers analytical solutions accurately when they are available and achieves superior performance in high dimensions with less training time than alternatives.

We present a discretization-free scalable framework for solving a large class of mass-conserving partial differential equations (PDEs), including the time-dependent Fokker-Planck equation and the Wasserstein gradient flow. The main observation is that the time-varying velocity field of the PDE solution needs to be self-consistent: it must satisfy a fixed-point equation involving the probability flow characterized by the same velocity field. Instead of directly minimizing the residual of the fixed-point equation with neural parameterization, we use an iterative formulation with a biased gradient estimator that bypasses significant computational obstacles with strong empirical performance. Compared to existing approaches, our method does not suffer from temporal or spatial discretization, covers a wider range of PDEs, and scales to high dimensions. Experimentally, our method recovers analytical solutions accurately when they are available and achieves superior performance in high dimensions with less training time compared to alternatives.

Beyond Normal: On the Evaluation of Mutual Information Estimators
Paweł Czyż Frederic Grabowski Julia E Vogt Niko Beerenwinkel Alexander Marx



Research question: to construct a diverse family of distributions with known ground-truth mutual information and to propose a language-independent benchmarking platform for mutual information estimators.
Motivation: mutual information is a general statistical dependency measure with applications in representation learning, causality, domain generalization, and computational biology, yet estimators are typically evaluated only on simple families such as multivariate normal distributions and selected distributions with one-dimensional random variables.
Method: we construct the family of distributions with known ground-truth mutual information, build the benchmarking platform, and discuss the applicability and limitations of classical and neural estimators in settings with high dimensions, sparse interactions, long-tailed distributions, and high mutual information.
Results: the benchmark effectively evaluates the performance of a range of mutual information estimators and yields guidelines for practitioners on selecting an estimator suited to the difficulty of the problem and on issues to consider when applying an estimator to a new data set.

Mutual information is a general statistical dependency measure which has found applications in representation learning, causality, domain generalization and computational biology. However, mutual information estimators are typically evaluated on simple families of probability distributions, namely multivariate normal distribution and selected distributions with one-dimensional random variables. In this paper, we show how to construct a diverse family of distributions with known ground-truth mutual information and propose a language-independent benchmarking platform for mutual information estimators. We discuss the general applicability and limitations of classical and neural estimators in settings involving high dimensions, sparse interactions, long-tailed distributions, and high mutual information. Finally, we provide guidelines for practitioners on how to select an estimator appropriate to the difficulty of the problem considered, and on the issues one needs to consider when applying an estimator to a new data set.
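
A minimal sketch of the benchmark's core construction as described above: start from a distribution with known mutual information and push each variable through an invertible transform, which leaves MI unchanged. For a bivariate Gaussian with correlation $\rho$, $I(X;Y) = -\tfrac{1}{2}\log(1-\rho^2)$.

```python
import numpy as np

rho = 0.8
true_mi = -0.5 * np.log(1 - rho ** 2)       # ground truth in nats

rng = np.random.default_rng(0)
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0, 0], cov, size=100_000)

# Invertible marginal transforms: MI(f(X); g(Y)) == MI(X; Y).
x = np.tanh(xy[:, 0])                       # bounded marginal
y = np.exp(xy[:, 1])                        # log-normal (long-tailed) marginal

print(f"ground-truth MI = {true_mi:.3f} nats (unchanged by the transforms)")
# Any MI estimator can now be scored against `true_mi` on samples (x, y).
```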

Bayesian Learning via Q-Exponential Process
Shuyi Li Michael O'Connor Shiwei Lan



Research question: a fundamental problem in optimization, statistics, and machine learning: how to estimate sparse parameters by adding an $\ell_q$ penalty.
Motivation: to obtain sparsity when estimating a parameter $u\in\mathbb{R}^d$, an $\ell_q$ penalty $\Vert u\Vert_q$ is usually added to the objective. What probability distribution corresponds to such an $\ell_q$ penalty, and what is the correct stochastic process corresponding to $\Vert u\Vert_q$ when we model functions in $L^q$? This is essential for statistically modeling high-dimensional objects such as images while preserving properties such as edges.
Method: we generalize the $q$-exponential distribution (with density proportional to $\exp(-\frac{1}{2}|u|^q)$) to a stochastic process named the $Q$-exponential (Q-EP) process, which corresponds to the $L_q$ regularization of functions. The key step is specifying consistent multivariate $q$-exponential distributions chosen from a large family of elliptic contour distributions. The work is closely related to the Besov process, which is usually defined in terms of series; Q-EP can be regarded as a definition of the Besov process with an explicit probabilistic formulation, direct control over correlation strength, and a tractable prediction formula. From the Bayesian perspective, Q-EP provides a flexible prior on functions with a sharper penalty ($q<2$) than the commonly used Gaussian process (GP, $q=2$).
Results: we compare GP, Besov, and Q-EP on modeling functional data, reconstructing images, and solving inverse problems, demonstrating the advantages of the proposed methodology.

Regularization is one of the most fundamental topics in optimization, statistics and machine learning. To get sparsity in estimating a parameter $u\in\mathbb{R}^d$, an $\ell_q$ penalty term, $\Vert u\Vert_q$, is usually added to the objective function. What is the probabilistic distribution corresponding to such $\ell_q$ penalty? What is the \emph{correct} stochastic process corresponding to $\Vert u\Vert_q$ when we model functions $u\in L^q$? This is important for statistically modeling high-dimensional objects such as images, with penalties that preserve certain properties, e.g. edges in the image. In this work, we generalize the $q$-exponential distribution (with density proportional to $\exp{(- \frac{1}{2}|u|^q)}$) to a stochastic process named \emph{$Q$-exponential (Q-EP) process} that corresponds to the $L_q$ regularization of functions. The key step is to specify consistent multivariate $q$-exponential distributions by choosing from a large family of elliptic contour distributions. The work is closely related to Besov process which is usually defined in terms of series. Q-EP can be regarded as a definition of Besov process with explicit probabilistic formulation, direct control on the correlation strength, and tractable prediction formula. From the Bayesian perspective, Q-EP provides a flexible prior on functions with sharper penalty ($q<2$) than the commonly used Gaussian process (GP, $q=2$). We compare GP, Besov and Q-EP in modeling functional data, reconstructing images and solving inverse problems and demonstrate the advantage of our proposed methodology.

Learning Energy-based Model via Dual-MCMC Teaching
Jiali Cui Tian Han



Research question: the fundamental learning problem of the energy-based model (EBM).
Motivation: learning an EBM by maximum likelihood estimation (MLE) with Markov chain Monte Carlo (MCMC) sampling such as Langevin dynamics is challenging in practice, since noise-initialized Langevin dynamics mixes poorly.
Method: we propose a joint training framework that brings in a generator model as a complementary model; learned by MLE to match both the EBM and the empirical data distribution, the generator becomes a more informative initializer for the EBM's MCMC sampling.
Results: through two (dual-) MCMC teaching processes, three separate models integrate seamlessly into the joint framework, enabling effective and efficient EBM learning.

This paper studies the fundamental learning problem of the energy-based model (EBM). Learning the EBM can be achieved using the maximum likelihood estimation (MLE), which typically involves the Markov Chain Monte Carlo (MCMC) sampling, such as the Langevin dynamics. However, noise-initialized Langevin dynamics is challenging in practice and often mixes poorly. This motivates the exploration of joint training with the generator model where the generator model serves as a complementary model to bypass MCMC sampling. However, such a method can be less accurate than the MCMC and result in biased EBM learning. While the generator can also serve as an initializer model for better MCMC sampling, its learning can be biased since it only matches the EBM and has no access to empirical training examples. Such biased generator learning may limit the potential of learning the EBM. To address this issue, we present a joint learning framework that interweaves the maximum likelihood learning algorithm for both the EBM and the complementary generator model. In particular, the generator model is learned by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of EBM. Learning generator with observed examples typically requires inference of the generator posterior. To ensure accurate and efficient inference, we adopt the MCMC posterior sampling and introduce a complementary inference model to initialize such latent MCMC sampling. We show that three separate models can be seamlessly integrated into our joint framework through two (dual-) MCMC teaching, enabling effective and efficient EBM learning.
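
A minimal PyTorch sketch of the initializer idea: Langevin sampling for an EBM started from a generator sample rather than from noise, so far fewer steps are needed to mix. The energy and generator below are toy stand-ins, not the paper's models.

```python
import torch

def langevin(energy, z, n_steps=30, step=0.01):
    """Unadjusted Langevin dynamics targeting exp(-energy(z))."""
    for _ in range(n_steps):
        z = z.detach().requires_grad_(True)
        e = energy(z).sum()
        (grad,) = torch.autograd.grad(e, z)
        z = z - 0.5 * step * grad + torch.randn_like(z) * step ** 0.5
    return z.detach()

energy = lambda z: 0.5 * (z ** 2).sum(dim=-1)         # toy quadratic EBM
generator = lambda n: torch.randn(n, 2) * 0.8 + 0.1   # toy initializer model

z0 = generator(64)                 # generator-informed chain initialization
samples = langevin(energy, z0)
print(samples.mean(0))
```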

Characterization and Learning of Causal Graphs with Small Conditioning Sets
Murat Kocaoglu



Research question: constraint-based causal discovery algorithms struggle when data is limited, because conditional independence tests quickly lose statistical power, especially when the conditioning set is large.
Motivation: to address this, we propose performing conditional independence tests with the conditioning-set size upper bounded by some integer k, enabling robust causal discovery.
Method: we first define the notion of k-Markov equivalence, then propose a novel representation that graphically characterizes k-Markov equivalence between two causal graphs, and introduce a new algorithm, k-PC, for learning this equivalence class.
Results: synthetic and semi-synthetic experiments show that k-PC achieves more robust causal discovery in the small-sample regime than baseline algorithms.

Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer k for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of k-Markov equivalence: Two causal graphs are k-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by k. We propose a novel representation that allows us to graphically characterize k-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the k-PC algorithm for learning this equivalence class. Finally, we conduct synthetic and semi-synthetic experiments to demonstrate that the k-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline algorithms.
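
A minimal sketch of the skeleton phase of a PC-style search with the conditioning-set size capped at k, the restriction k-PC builds on; `ci_test(i, j, S)` is a placeholder for any conditional-independence test returning True when X_i and X_j are judged independent given X_S.

```python
from itertools import combinations

def skeleton_capped(n_vars, ci_test, k):
    """PC-style skeleton search testing only conditioning sets of size <= k."""
    adj = {i: set(range(n_vars)) - {i} for i in range(n_vars)}
    for size in range(k + 1):                    # |S| = 0, 1, ..., k only
        for i in range(n_vars):
            for j in list(adj[i]):
                if j <= i:
                    continue                     # test each pair once
                others = adj[i] - {j}
                for S in combinations(sorted(others), size):
                    if ci_test(i, j, S):         # separating set found
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
    return adj
```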

Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery
Mateusz Olko Michał Zając Aleksandra Nowak Nino Scherrer Yashas Annadani Stefan Bauer Łukasz Kuciński Piotr Miłoś



Research question: how to infer causal structure from data, particularly when observational data alone cannot uniquely determine a system's causal structure.
Motivation: interventional data can resolve this ambiguity, but acquiring it typically demands a considerable investment of time and resources.
Method: we propose a novel Gradient-based Intervention Targeting method (GIT) that uses the signals provided by the gradient estimator of a gradient-based causal discovery framework to drive the intervention targeting function.
Results: extensive experiments on simulated and real-world datasets show that GIT performs on par with competitive baselines, surpassing them in the low-data regime.

Inferring causal structure from data is a challenging task of fundamental importance in science. Often, observational data alone is not enough to uniquely identify a system’s causal structure. The use of interventional data can address this issue, however, acquiring these samples typically demands a considerable investment of time and physical or financial resources. In this work, we are concerned with the acquisition of interventional data in a targeted manner to minimize the number of required experiments. We propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that ‘trusts’ the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention targeting function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.

PreDiff: Precipitation Nowcasting with Latent Diffusion Models
Zhihan Gao Xingjian Shi Boran Han Hao Wang Xiaoyong Jin Danielle C. Maddix Yi Zhu Mu Li Bernie Wang



Research question: traditional Earth system forecasting relies on complex physical models that are computationally expensive and require substantial domain expertise.
Motivation: the unprecedented growth of spatiotemporal Earth observation data over the past decade has made data-driven forecasting models based on deep learning possible.
Method: we propose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1) develop *PreDiff*, a conditional latent diffusion model capable of probabilistic forecasts; 2) introduce an explicit knowledge alignment mechanism to align forecasts with domain-specific physical constraints.
Results: empirical studies on two datasets, N-body MNIST (a synthetic dataset with chaotic behavior) and SEVIR (a real-world precipitation nowcasting dataset), show that PreDiff handles uncertainty, incorporates domain-specific prior knowledge, and generates forecasts with high operational utility.

Earth system forecasting has traditionally relied on complex physical models that are computationally expensive and require significant domain expertise. In the past decade, the unprecedented increase in spatiotemporal Earth observation data has enabled data-driven forecasting models using deep learning techniques. These models have shown promise for diverse Earth system forecasting tasks but either struggle with handling uncertainty or neglect domain-specific prior knowledge, resulting in averaging possible futures to blurred forecasts or generating physically implausible predictions. To address these limitations, we propose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1) We develop *PreDiff*, a conditional latent diffusion model capable of probabilistic forecasts. 2) We incorporate an explicit knowledge alignment mechanism to align forecasts with domain-specific physical constraints. This is achieved by estimating the deviation from imposed constraints at each denoising step and adjusting the transition distribution accordingly. We conduct empirical studies on two datasets: N-body MNIST, a synthetic dataset with chaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset. Specifically, we impose the law of conservation of energy in N-body MNIST and anticipated precipitation intensity in SEVIR. Experiments demonstrate the effectiveness of PreDiff in handling uncertainty, incorporating domain-specific prior knowledge, and generating forecasts that exhibit high operational utility.

A Heat Diffusion Perspective on Geodesic Preserving Dimensionality Reduction
Guillaume Huguet Alexander Tong Edward De Brouwer Yanlei Zhang Guy Wolf Ian Adelstein Smita Krishnaswamy



Research question: diffusion-based manifold learning methods have proven useful for representation learning and dimensionality reduction on modern high-dimensional, high-throughput, noisy datasets.
Motivation: although these methods are thought to preserve the underlying manifold structure of data by learning a proxy for geodesic distances, no specific theoretical link had been established.
Method: using results from Riemannian geometry that explicitly connect heat diffusion to manifold distances, we formulate a more general heat-kernel-based manifold embedding method that we call heat geodesic embeddings.
Results: experiments show the method outperforms existing state of the art in preserving ground-truth manifold distances and cluster structure on toy datasets; it also performs well on single-cell RNA-sequencing datasets with both continuum and cluster structure, enabling interpolation of withheld timepoints. Finally, the parameters of the more general method can be configured to reproduce results similar to PHATE (a state-of-the-art diffusion-based manifold learning method) and to SNE (an attraction/repulsion neighborhood-based method underlying t-SNE).

Diffusion-based manifold learning methods have proven useful in representation learning and dimensionality reduction of modern high dimensional, high throughput, noisy datasets. Such datasets are especially present in fields like biology and physics. While it is thought that these methods preserve underlying manifold structure of data by learning a proxy for geodesic distances, no specific theoretical links have been established. Here, we establish such a link via results in Riemannian geometry explicitly connecting heat diffusion to manifold distances. In this process, we also formulate a more general heat kernel based manifold embedding method that we call heat geodesic embeddings. This novel perspective makes clearer the choices available in manifold learning and denoising. Results show that our method outperforms existing state of the art in preserving ground truth manifold distances, and preserving cluster structure in toy datasets. We also showcase our method on single cell RNA-sequencing datasets with both continuum and cluster structure, where our method enables interpolation of withheld timepoints of data. Finally, we show that parameters of our more general method can be configured to give results similar to PHATE (a state-of-the-art diffusion based manifold learning method) as well as SNE (an attraction/repulsion neighborhood based method that forms the basis of t-SNE).
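
A minimal sketch of the Riemannian link the abstract refers to, via Varadhan's formula $d(x,y)^2 \approx -4t\log h_t(x,y)$ for small $t$, computed here on a dense Gaussian-affinity graph; the graph construction and the choice of $t$ are assumptions for illustration, and the paper's heat geodesic embedding is more elaborate.

```python
import numpy as np

def heat_distances(X, t=0.1, eps=0.5):
    """Varadhan-style distances from the heat kernel of a data graph."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / eps)                     # Gaussian affinities
    L = np.diag(W.sum(1)) - W                 # unnormalized graph Laplacian
    lam, U = np.linalg.eigh(L)
    H = (U * np.exp(-t * lam)) @ U.T          # heat kernel e^{-tL}
    H = np.clip(H, 1e-12, None)
    return np.sqrt(np.clip(-4 * t * np.log(H), 0, None))

X = np.random.default_rng(0).normal(size=(100, 3))
D = heat_distances(X)
print(D.shape, D[0, :5])
```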

Learning Rate Free Bayesian Inference in Constrained Domains
Louis Sharrock Lester Mackey Christopher Nemeth



Research question: to propose a suite of new particle-based algorithms for sampling on constrained domains that are entirely learning-rate free.
Motivation: existing constrained sampling algorithms require tuning many hyperparameters; our approach requires none.
Method: we leverage coin betting ideas from convex optimization and view constrained sampling as a mirrored optimization problem on the space of probability measures; based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent.
Results: on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference, our algorithms achieve performance competitive with existing constrained sampling methods without tuning any hyperparameters.

We introduce a suite of new particle-based algorithms for sampling on constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.

Self-Supervised Learning with Lie Symmetries for Partial Differential Equations
Grégoire Mialon Quentin Garrido Hannah Lawrence Danyal Rehman Yann LeCun Bobak Kiani



Research question: how to learn general-purpose representations of partial differential equations from heterogeneous data.
Motivation: current algorithms require simulated training data tailored to a given setting, whereas one may wish to learn useful information from heterogeneous sources or from messy, incomplete observations of real dynamical systems.
Method: we learn general-purpose representations of PDEs from heterogeneous data by implementing joint embedding methods for self-supervised learning.
Results: the representation outperforms baseline approaches on invariant tasks such as regressing the coefficients of a PDE, while also improving the time-stepping performance of neural solvers.

Machine learning for differential equations paves the way for computationally efficient alternatives to numerical solvers, with potentially broad impacts in science and engineering. Though current algorithms typically require simulated training data tailored to a given setting, one may instead wish to learn useful information from heterogeneous sources, or from real dynamical systems observations that are messy or incomplete. In this work, we learn general-purpose representations of PDEs from heterogeneous data by implementing joint embedding methods for self-supervised learning (SSL), a framework for unsupervised representation learning that has had notable success in computer vision. Our representation outperforms baseline approaches to invariant tasks, such as regressing the coefficients of a PDE, while also improving the time-stepping performance of neural solvers. We hope that our proposed methodology will prove useful in the eventual development of general-purpose foundation models for PDEs.

GeoPhy: Differentiable Phylogenetic Inference via Geometric Gradients of Tree Topologies
Takahiro Mimori Michiaki Hamada



Research question: existing phylogenetic inference methods grounded in molecular evolution models must restrict the combinatorially vast number of possible tree topologies when accounting for uncertainty in tree variables, which include tree topologies and evolutionary distances on branches.
Motivation: given this uncertainty, and the challenge of conducting inference without restricting the number of candidate tree structures, the authors propose a novel, fully differentiable formulation of phylogenetic inference.
Method: the approach leverages a unique representation of topological distributions in continuous geometric spaces; with practical considerations on design spaces and control variates for gradient estimation, it enables variational inference without limiting the topological candidates.
Results: in experiments on real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that consider whole topologies.

Phylogenetic inference, grounded in molecular evolution models, is essential for understanding the evolutionary relationships in biological data. Accounting for the uncertainty of phylogenetic tree variables, which include tree topologies and evolutionary distances on branches, is crucial for accurately inferring species relationships from molecular data and tasks requiring variable marginalization. Variational Bayesian methods are key to developing scalable, practical models; however, it remains challenging to conduct phylogenetic inference without restricting the combinatorially vast number of possible tree topologies. In this work, we introduce a novel, fully differentiable formulation of phylogenetic inference that leverages a unique representation of topological distributions in continuous geometric spaces. Through practical considerations on design spaces and control variates for gradient estimations, our approach, GeoPhy, enables variational inference without limiting the topological candidates. In experiments using real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that considered whole topologies.

Convergent Bregman Plug-and-Play Image Restoration for Poisson Inverse Problems
Samuel Hurault Ulugbek Kamilov Arthur Leclaire Nicolas Papadakis



Research question: efficient iterative algorithms for solving ill-posed image inverse problems.
Motivation: current PnP methods rely on data-fidelity terms that have either Lipschitz gradients or closed-form proximal operators, which does not hold for Poisson inverse problems.
Method: we generalize PnP using the Bregman Proximal Gradient (BPG) method, which replaces the Euclidean distance with a Bregman divergence that better captures the smoothness properties of the problem; we also introduce the Bregman Score Denoiser, specifically parametrized and trained for the new Bregman geometry, and prove that it corresponds to the proximal operator of a nonconvex potential.
Results: the proposed algorithms, evaluated on various Poisson inverse problems, are shown to converge and deliver effective restoration performance.

Plug-and-Play (PnP) methods are efficient iterative algorithms for solving ill-posed image inverse problems. PnP methods are obtained by using deep Gaussian denoisers instead of the proximal operator or the gradient-descent step within proximal algorithms. Current PnP schemes rely on data-fidelity terms that have either Lipschitz gradients or closed-form proximal operators, which is not applicable to Poisson inverse problems. Based on the observation that the Gaussian noise is not the adequate noise model in this setting, we propose to generalize PnP using the Bregman Proximal Gradient (BPG) method. BPG replaces the Euclidean distance with a Bregman divergence that can better capture the smoothness properties of the problem. We introduce the Bregman Score Denoiser specifically parametrized and trained for the new Bregman geometry and prove that it corresponds to the proximal operator of a nonconvex potential. We propose two PnP algorithms based on the Bregman Score Denoiser for solving Poisson inverse problems. Extending the convergence results of BPG in the nonconvex settings, we show that the proposed methods converge, targeting stationary points of an explicit global functional. Experimental evaluations conducted on various Poisson inverse problems validate the convergence results and showcase effective restoration performance.
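
For reference, the standard Bregman Proximal Gradient step the method builds on; the Burg-entropy example is an assumption about the natural Poisson geometry, not a quote from the paper.

```latex
% BPG step for  min_x f(x) + g(x),  with Bregman divergence
% D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle:
x^{k+1} \in \operatorname*{arg\,min}_{x} \; g(x)
  + \big\langle \nabla f(x^k),\, x - x^k \big\rangle
  + \tfrac{1}{\tau}\, D_h(x, x^k).
% h(x) = \tfrac12 \|x\|^2 recovers the usual proximal gradient step; for
% Poisson likelihoods a natural (assumed here) choice is Burg's entropy
% h(x) = -\sum_i \log x_i.
```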

Bias in Evaluation Processes: An Optimization-Based Model
L. Elisa Celis Amit Kumar Anay Mehrotra Nisheeth K Vishnoi



Research question: biases with respect to socially salient attributes of individuals in evaluation processes such as admissions and hiring.
Motivation: to understand the mechanisms by which bias arises in evaluation processes and to provide tools that guide interventions for mitigating it.
Method: we view an evaluation process as a transformation of the distribution of an individual's true utility for a task into an observed distribution, and model it as a loss minimization problem subject to an information constraint. The model has two parameters identified as factors leading to bias: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function.
Results: we validate the model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task; these results help explain the emergence of bias in evaluation processes and guide the deployment of mitigating interventions.

Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.

Calibrating Neural Simulation-Based Inference with Differentiable Coverage Probability
Maciej Falkiewicz Naoya Takeishi Imahn Shekhzadeh Antoine Wehenkel Arnaud Delaunoy Gilles Louppe Alexandros Kalousis



Research question: existing simulation-based Bayesian inference algorithms can produce overconfident posteriors, making the uncertainty quantification inaccurate.
Motivation: to address this, we propose incorporating a calibration term directly into the training objective of the neural model to improve the accuracy of uncertainty quantification.
Method: our method is not tied to any particular neural model, is compatible with existing computational pipelines, and enables reliable black-box posterior inference; introducing a relaxation of the classical calibration-error formulation allows end-to-end backpropagation.
Results: empirical studies on six benchmark problems show that the method achieves competitive or better results than existing approaches in terms of coverage and expected posterior density.

Bayesian inference allows expressing the uncertainty of posterior belief under a probabilistic model given prior information and the likelihood of the evidence. Predominantly, the likelihood function is only implicitly established by a simulator posing the need for simulation-based inference (SBI). However, the existing algorithms can yield overconfident posteriors (Hermans *et al.*, 2022) defeating the whole purpose of credibility if the uncertainty quantification is inaccurate. We propose to include a calibration term directly into the training objective of the neural model in selected amortized SBI techniques. By introducing a relaxation of the classical formulation of calibration error we enable end-to-end backpropagation. The proposed method is not tied to any particular neural model and brings moderate computational overhead compared to the profits it introduces. It is directly applicable to existing computational pipelines allowing reliable black-box posterior inference. We empirically show on six benchmark problems that the proposed method achieves competitive or better results in terms of coverage and expected posterior density than the previously existing approaches.

A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference
Emile van Krieken Thiviyan Thanapalasingam Jakub M. Tomczak Frank Van Harmelen Annette Ten Teije



Research question: how to combine neural networks with symbolic reasoning.
Motivation: existing Probabilistic Neurosymbolic Learning (PNL) frameworks, such as DeepProbLog, perform exponential-time exact inference, limiting the scalability of PNL solutions.
Method: we introduce Approximate Neurosymbolic Inference (A-NeSI), a new PNL framework that uses neural networks for scalable approximate inference. A-NeSI 1) performs approximate inference in polynomial time without changing the semantics of probabilistic logics; 2) is trained on data generated from the background knowledge; 3) can generate symbolic explanations of predictions; and 4) can guarantee the satisfaction of logical constraints at test time, which is vital in safety-critical applications.
Results: experiments show that A-NeSI is the first end-to-end method to solve three neurosymbolic tasks with exponential combinatorial scaling, and that it achieves explainability and safety without a performance penalty.

We study the problem of combining neural networks with symbolic reasoning. Recently introduced frameworks for Probabilistic Neurosymbolic Learning (PNL), such as DeepProbLog, perform exponential-time exact inference, limiting the scalability of PNL solutions. We introduce Approximate Neurosymbolic Inference (A-NeSI): a new framework for PNL that uses neural networks for scalable approximate inference. A-NeSI 1) performs approximate inference in polynomial time without changing the semantics of probabilistic logics; 2) is trained using data generated by the background knowledge; 3) can generate symbolic explanations of predictions; and 4) can guarantee the satisfaction of logical constraints at test time, which is vital in safety-critical applications. Our experiments show that A-NeSI is the first end-to-end method to solve three neurosymbolic tasks with exponential combinatorial scaling. Finally, our experiments show that A-NeSI achieves explainability and safety without a penalty in performance.

Relative Entropic Optimal Transport: a (Prior-aware) Matching Perspective to (Unbalanced) Classification
Liangliang Shi Haoyu Zhen Gu Zhang Junchi Yan



Research question: to rethink classification through optimal transport (OT) theory by studying the matching probabilities between samples and labels.
Motivation: classification is a fundamental problem in machine learning, especially in the demanding long-tailed setting, which is prevalent in nature.
Method: we propose a new variant of optimal transport, Relative Entropic Optimal Transport (RE-OT), which guides the coupling solution toward a known prior information matrix, and adopt inverse RE-OT for training on long-tailed data.
Results: the loss derived from RE-OT has a form similar to softmax-based cross-entropy, indicating a close connection between optimal transport and classification and the potential for transferring concepts between the two fields; experiments on image classification, molecule classification, instance segmentation, and representation learning demonstrate its effectiveness.

Classification is a fundamental problem in machine learning, and considerable efforts have been recently devoted to the demanding long-tailed setting due to its prevalence in nature. Departing from the Bayesian framework, this paper rethinks classification from a matching perspective by studying the matching probability between samples and labels with optimal transport (OT) formulation. Specifically, we first propose a new variant of optimal transport, called Relative Entropic Optimal Transport (RE-OT), which guides the coupling solution to a known prior information matrix. We give some theoretical results and their proofs for RE-OT and, surprisingly, find that RE-OT can help to deblur barycenter images. Then we adopt inverse RE-OT for training on long-tailed data and find that the loss derived from RE-OT has a similar form to Softmax-based cross-entropy loss, indicating a close connection between optimal transport and classification and the potential for transferring concepts between these two academic fields, such as barycentric projection in OT, which can map the labels back to the feature space. We further derive an epoch-varying RE-OT loss and conduct experiments on unbalanced image classification, molecule classification, instance segmentation and representation learning. Experimental results show its effectiveness.
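
A minimal sketch of how a prior matrix can guide an entropic coupling: minimizing $\langle C, P\rangle + \varepsilon\,\mathrm{KL}(P\,\|\,Q)$ over the transport polytope amounts to Sinkhorn iterations on the kernel $Q \odot e^{-C/\varepsilon}$. This illustrates the spirit of RE-OT; the paper's exact formulation may differ.

```python
import numpy as np

def sinkhorn_with_prior(C, Q, a, b, eps=0.1, iters=500):
    """Entropic OT regularized toward a positive prior matrix Q."""
    K = Q * np.exp(-C / eps)          # prior-weighted Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n, m = 50, 10                               # e.g. 50 samples, 10 classes
C = rng.random((n, m))                      # matching cost
Q = rng.dirichlet(np.ones(m), size=n)       # prior matching matrix (rows > 0)
P = sinkhorn_with_prior(C, Q / Q.sum(), np.full(n, 1 / n), np.full(m, 1 / m))
print(P.sum(0), P.sum(1)[:5])               # marginals match b and a
```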

Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Models
Waïss Azizian Franck Iutzeler Jérôme Malick



Research question: how to make predictions and decisions effectively under uncertainty.
Motivation: existing generalization guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms.
Method: we show that the generalization guarantees of Wasserstein distributionally robust estimators hold for general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing.
Results: we further prove that these results carry over to the newly introduced regularized versions of Wasserstein distributionally robust problems.

Wasserstein distributionally robust estimators have emerged as powerful models for prediction and decision-making under uncertainty. These estimators provide attractive generalization guarantees: the robust objective obtained from the training distribution is an exact upper bound on the true risk with high probability. However, existing guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms. In this paper, we show that these generalization guarantees actually hold on general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing. We also prove that these results carry over to the newly-introduced regularized versions of Wasserstein distributionally robust problems.

Likelihood Ratio Confidence Sets for Sequential Decision Making
Nicolas Emmenegger Mojmir Mutny Andreas Krause



Research question: how to provide certifiable, adaptive uncertainty estimates for unknown quantities, an essential ingredient of sequential decision-making algorithms.
Motivation: standard approaches rely on problem-dependent concentration results and are limited to specific combinations of parameterization, noise family, and estimator.
Method: we revisit the likelihood-based inference principle and propose using likelihood ratios to construct any-time valid confidence sequences, without requiring specialized treatment in each application scenario.
Results: the method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on the choice of estimator sequence in the likelihood ratio; we show how to provably choose the best sequence and highlight connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also enables deployment in non-parametric settings such as RKHS function classes. We give a non-asymptotic analysis of the likelihood-ratio confidence-set size for generalized linear models using convex duality and online learning, and demonstrate practical strength on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.

Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use \emph{likelihood ratios} to construct \emph{any-time valid} confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a \emph{non-asymptotic} analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.
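
A minimal numpy sketch of a likelihood-ratio confidence sequence for a Gaussian mean on a parameter grid; the running-mean plug-in in the numerator is one simple choice of estimator sequence (computed only from past data, so the ratio is a martingale), and by Ville's inequality the set $\{\theta : \Lambda_t(\theta) \le 1/\alpha\}$ is any-time valid. Details are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, alpha = 0.3, 0.05
grid = np.linspace(-2, 2, 401)
log_lr = np.zeros_like(grid)          # log Lambda_t(theta) for each grid theta
mean, t = 0.0, 0

for _ in range(200):
    x = rng.normal(theta_star)        # unit-variance Gaussian observations
    # increment: log N(x; mean, 1) - log N(x; theta, 1); constants cancel
    log_lr += -0.5 * (x - mean) ** 2 + 0.5 * (x - grid) ** 2
    t += 1
    mean += (x - mean) / t            # update plug-in *after* using it

conf_set = grid[log_lr <= np.log(1 / alpha)]
print(f"[{conf_set.min():.3f}, {conf_set.max():.3f}] should contain {theta_star}")
```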

Neural Fields with Hard Constraints of Arbitrary Differential Order
Fangcheng Zhong Kyle Thomas Fogarty Param Hanji Tianhao Walter Wu Alejandro Sztrajman Andrew Everett Spielberg Andrea Tagliasacchi Petra Bosilj Cengiz Oztireli



Research question: although deep learning techniques are popular for solving a broad range of optimization problems, methods for enforcing hard constraints during optimization, particularly on deep neural networks, remain underdeveloped.
Motivation: inspired by the rich literature on meshless interpolation and its extension to spectral collocation methods in scientific computing, we develop a series of approaches for enforcing hard constraints on neural fields, which we call Constrained Neural Fields (CNF).
Method: constraints are specified as linear operators applied to the neural field and its derivatives; we also design specific model representations and training strategies for problems where standard models struggle, such as conditioning of the system, memory consumption, and network capacity under constraints.
Results: the approaches are demonstrated on a wide range of real-world applications; additionally, we develop a framework for highly efficient model and constraint specification that readily applies to any downstream task where hard constraints must be explicitly satisfied during optimization.

While deep learning techniques have become extremely popular for solving a broad range of optimization problems, methods to enforce hard constraints during optimization, particularly on deep neural networks, remain underdeveloped. Inspired by the rich literature on meshless interpolation and its extension to spectral collocation methods in scientific computing, we develop a series of approaches for enforcing hard constraints on neural fields, which we refer to as Constrained Neural Fields (CNF). The constraints can be specified as a linear operator applied to the neural field and its derivatives. We also design specific model representations and training strategies for problems where standard models may encounter difficulties, such as conditioning of the system, memory consumption, and capacity of the network when being constrained. Our approaches are demonstrated in a wide range of real-world applications. Additionally, we develop a framework that enables highly efficient model and constraint specification, which can be readily applied to any downstream task where hard constraints need to be explicitly satisfied during optimization.

Score-based Data Assimilation
François Rozet Gilles Louppe



Research question: solving the Bayesian inverse problem of identifying plausible state trajectories that explain noisy or incomplete observations.
Motivation: most existing algorithms depend on the transition dynamics for inference, which becomes intractable for long time horizons or for high-dimensional systems with complex dynamics, such as oceans or atmospheres.
Method: we introduce score-based data assimilation for trajectory inference: a score-based generative model of state trajectories is learned based on the key insight that the score of an arbitrarily long trajectory can be decomposed into a series of scores over short segments; after training, inference generates all states simultaneously in a non-autoregressive manner.
Results: the method effectively solves the Bayesian inverse problem and performs well across a wide range of zero-shot observation scenarios.

Data assimilation, in its most comprehensive form, addresses the Bayesian inverse problem of identifying plausible state trajectories that explain noisy or incomplete observations of stochastic dynamical systems. Various approaches have been proposed to solve this problem, including particle-based and variational methods. However, most algorithms depend on the transition dynamics for inference, which becomes intractable for long time horizons or for high-dimensional systems with complex dynamics, such as oceans or atmospheres. In this work, we introduce score-based data assimilation for trajectory inference. We learn a score-based generative model of state trajectories based on the key insight that the score of an arbitrarily long trajectory can be decomposed into a series of scores over short segments. After training, inference is carried out using the score model, in a non-autoregressive manner by generating all states simultaneously. Quite distinctively, we decouple the observation model from the training procedure and use it only at inference to guide the generative process, which enables a wide range of zero-shot observation scenarios. We present theoretical and empirical evidence supporting the effectiveness of our method.

Learning Energy-Based Prior Model with Diffusion-Amortized MCMC
Peiyu Yu Yaxuan Zhu Sirui Xie Xiaojian Ma Ruiqi Gao Song-Chun Zhu Ying Nian Wu



Research question: the learning of latent space energy-based models (EBMs), also known as energy-based priors, for generative modeling.
Motivation: the common practice of learning latent space EBMs with non-convergent short-run MCMC for prior and posterior sampling hinders further progress; the degenerate MCMC sampling quality often degrades generation and destabilizes training, especially for highly multi-modal or high-dimensional target distributions.
Method: we introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it, with theoretical evidence that the learned amortization of MCMC is a valid long-run MCMC sampler.
Results: experiments on several image modeling benchmark datasets demonstrate superior performance compared with strong counterparts.

Latent space EBMs, also known as energy-based priors, have drawn growing interest in the field of generative modeling due to their flexibility in formulation and the strong modeling power of the latent space. However, the common practice of learning latent space EBMs with non-convergent short-run MCMC for prior and posterior sampling is hindering the model from further progress; the degenerate MCMC sampling quality in practice often leads to degraded generation quality and instability in training, especially with highly multi-modal and/or high-dimensional target distributions. To remedy this sampling issue, in this paper we introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it. We provide theoretical evidence that the learned amortization of MCMC is a valid long-run MCMC sampler. Experiments on several image modeling benchmark datasets demonstrate the superior performance of our method compared with strong counterparts.

DeepSimHO: Stable Pose Estimation for Hand-Object Interaction via Physics Simulation
Rong Wang Wei Mao Hongdong Li



Research question: estimating the 3D pose of a hand interacting with an object from a single image.
Motivation: existing methods mainly exploit proximity cues while overlooking the dynamical nature of the task: the hand must stably grasp the object to counteract gravity and prevent it from slipping or falling, so these methods often produce unstable estimates; moreover, refining unstable configurations with physics-based reasoning within a data-driven learning framework is complex and difficult.
Method: we propose DeepSimHO, a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. For an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability; since existing differentiable simulators cannot provide reliable state gradients under non-smooth contact geometry and penetration, a deep network learns the stability evaluation process from the simulator while smoothly approximating its gradient, enabling effective back-propagation.
Results: extensive experiments show that the method noticeably improves the stability of the estimation and achieves superior efficiency over test-time optimization.

This paper addresses the task of 3D pose estimation for a hand interacting with an object from a single image observation. When modeling hand-object interaction, previous works mainly exploit proximity cues, while overlooking the dynamical nature of the task: the hand must stably grasp the object to counteract gravity and thus prevent the object from slipping or falling. These works fail to leverage dynamical constraints in the estimation and consequently often produce unstable results. Meanwhile, refining unstable configurations with physics-based reasoning remains challenging, both by the complexity of contact dynamics and by the lack of effective and efficient physics inference in the data-driven learning framework. To address both issues, we present DeepSimHO: a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. Specifically, for an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability. However, due to non-smooth contact geometry and penetration, existing differentiable simulators can not provide reliable state gradient. To remedy this, we further introduce a deep network to learn the stability evaluation process from the simulator, while smoothly approximating its gradient and thus enabling effective back-propagation. Extensive experiments show that our method noticeably improves the stability of the estimation and achieves superior efficiency over test-time optimization. The code is available at https://github.com/rongakowang/DeepSimHO.

Optimized Covariance Design for AB Test on Social Network under Interference
Qianyi Chen Bo Li LU DENG Yong Wang



Research question: how to accurately estimate the global average treatment effect (GATE) in online A/B tests on social network platforms, addressing the challenge network interference poses to experimental design.
Motivation: network interference violates the Stable Unit Treatment Value Assumption (SUTVA); existing research on network experimental design mostly relies on the unbiased Horvitz-Thompson (HT) estimator, which requires substantial data trimming to ensure unbiasedness at the price of high estimation variance.
Method: we propose a new randomized network experimental design that balances bias and variance by optimizing the covariance matrix of the treatment assignment vector to minimize the estimator's mean squared error (MSE).
Results: extensive simulation studies show the method outperforms other existing methods in many settings, across different levels of model misspecification.

Online A/B tests have become increasingly popular and important for social platforms. However, accurately estimating the global average treatment effect (GATE) has proven to be challenging due to network interference, which violates the Stable Unit Treatment Value Assumption (SUTVA) and poses great challenges to experimental design. Existing network experimental design research was mostly based on the unbiased Horvitz-Thompson (HT) estimator with substantial data trimming to ensure unbiasedness at the price of high resultant estimation variance. In this paper, we strive to balance the bias and variance in designing randomized network experiments. Under a potential outcome model with 1-hop interference, we derive the bias and variance of the standard HT estimator and reveal their relation to the network topological structure and the covariance of the treatment assignment vector. We then propose to formulate the experimental design problem as optimizing the covariance matrix of the treatment assignment vector to achieve the bias and variance balance by minimizing the mean squared error (MSE) of the estimator. An efficient projected gradient descent algorithm is presented to implement the desired randomization scheme. Finally, we carry out extensive simulation studies to demonstrate the advantages of our proposed method over other existing methods in many settings, with different levels of model misspecification.

Non-adversarial training of Neural SDEs with signature kernel scores
Zacharia Issa Blanka Horvath Maud Lemercier Cristopher Salvi



Research question: the stability and efficiency of training neural SDEs, continuous-time generative models for sequential data.
Motivation: although neural SDEs trained adversarially as GANs achieve strong results for irregular time series generation, such training is notoriously unstable, often suffers from mode collapse, and requires specialized techniques such as weight clipping and gradient penalties.
Method: we introduce a novel class of scoring rules on pathspace based on signature kernels and use them as a non-adversarial objective for training neural SDEs; by showing strict properness of these kernel scores and consistency of the corresponding estimators, we provide existence and uniqueness guarantees for the minimizer.
Results: the method significantly outperforms alternative ways of training neural SDEs on a variety of tasks, including simulating rough volatility models, conditional probabilistic forecasting of real-world forex pairs, and mesh-free generation of limit order book dynamics.

Neural SDEs are continuous-time generative models for sequential data. State-of-the-art performance for irregular time series generation has been previously obtained by training these models adversarially as GANs. However, as typical for GAN architectures, training is notoriously unstable, often suffers from mode collapse, and requires specialised techniques such as weight clipping and gradient penalty to mitigate these issues. In this paper, we introduce a novel class of scoring rules on pathspace based on signature kernels and use them as objective for training Neural SDEs non-adversarially. By showing strict properness of such kernel scores and consistency of the corresponding estimators, we provide existence and uniqueness guarantees for the minimiser. With this formulation, evaluating the generator-discriminator pair amounts to solving a system of linear path-dependent PDEs which allows for memory-efficient adjoint-based backpropagation. Moreover, because the proposed kernel scores are well-defined for paths with values in infinite dimensional spaces of functions, our framework can be easily extended to generate spatiotemporal data. Our procedure significantly outperforms alternative ways of training Neural SDEs on a variety of tasks including the simulation of rough volatility models, the conditional probabilistic forecasts of real-world forex pairs where the conditioning variable is an observed past trajectory, and the mesh-free generation of limit order book dynamics.
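
For reference, the generic kernel-score form such objectives take; the paper instantiates $k$ as a signature kernel on pathspace.

```latex
% Kernel scoring rule for a generative model P evaluated at observation y:
S_k(\mathbb{P}, y) \;=\; \mathbb{E}_{X, X' \sim \mathbb{P}}\big[k(X, X')\big]
\;-\; 2\,\mathbb{E}_{X \sim \mathbb{P}}\big[k(X, y)\big].
% For a characteristic kernel k this score is strictly proper: its expectation
% under y ~ Q equals MMD_k^2(P, Q) up to a P-independent constant, so it is
% uniquely minimized in expectation at P = Q.
```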

Koopa: Learning Non-stationary Time Series Dynamics with Koopman Predictors
Yong Liu Chenyu Li Jianmin Wang Mingsheng Long



Research question: How to handle the non-stationarity of real-world time series, which poses a principal challenge for deep forecasting models.
Motivation: Existing models suffer from complicated series variations induced by changing temporal distributions; we tackle non-stationary time series with modern Koopman theory.
Method: Disentangle time-variant and time-invariant components from intricate non-stationary series with a Fourier filter, and design Koopman predictors to advance the respective dynamics. Concretely, we propose Koopa, a novel Koopman forecaster composed of stackable blocks that learn hierarchical dynamics.
Results: Compared with state-of-the-art models, Koopa achieves competitive performance while saving 77.3% training time and 76.0% memory.

Real-world time series are characterized by intrinsic non-stationarity that poses a principal challenge for deep forecasting models. While previous models suffer from complicated series variations induced by the changing temporal distribution, we tackle non-stationary time series with modern Koopman theory that fundamentally considers the underlying time-variant dynamics. Inspired by Koopman theory of portraying complex dynamical systems, we disentangle time-variant and time-invariant components from intricate non-stationary series by Fourier Filter and design Koopman Predictor to advance respective dynamics forward. Technically, we propose Koopa as a novel Koopman forecaster composed of stackable blocks that learn hierarchical dynamics. Koopa seeks measurement functions for Koopman embedding and utilizes Koopman operators as linear portraits of implicit transition. To cope with time-variant dynamics that exhibit strong locality, Koopa calculates context-aware operators in the temporal neighborhood and is able to utilize incoming ground truth to scale up the forecast horizon. Besides, by integrating Koopman Predictors into the deep residual structure, we remove the binding reconstruction loss of previous Koopman forecasters and achieve end-to-end optimization of the forecasting objective. Compared with the state-of-the-art model, Koopa achieves competitive performance while saving 77.3% training time and 76.0% memory.
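
A minimal sketch of the disentanglement step: split a series into a time-invariant component carried by dominant frequencies and a time-variant remainder. The top-k amplitude rule and the keep ratio are illustrative assumptions about how the Fourier Filter selects components.

    import numpy as np

    def fourier_filter(x, keep_ratio=0.2):
        """Split series x (shape (T,)) into a time-invariant component,
        reconstructed from the highest-amplitude frequencies, and a
        time-variant remainder."""
        spec = np.fft.rfft(x)
        k = max(1, int(keep_ratio * len(spec)))
        top = np.argsort(np.abs(spec))[-k:]   # dominant frequencies
        mask = np.zeros(len(spec), dtype=bool)
        mask[top] = True
        time_invariant = np.fft.irfft(np.where(mask, spec, 0), n=len(x))
        return time_invariant, x - time_invariant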

Generalization bounds for neural ordinary differential equations and deep residual networks
Pierre Marion



Research question: Derive generalization bounds for a class of parameterized ODEs with continuous-in-time parameters, via the continuous-depth deep learning model of neural ordinary differential equations (neural ODEs).
Motivation: By leveraging the analogy between neural ODEs and deep residual networks, the approach yields, in particular, a generalization bound for a class of deep residual networks.
Method: A Lipschitz-based argument gives a generalization bound for this class of neural ODEs that involves the magnitude of the difference between successive weight matrices.
Results: Numerical experiments illustrate how this quantity affects the generalization capability of neural networks.

Neural ordinary differential equations (neural ODEs) are a popular family of continuous-depth deep learning models. In this work, we consider a large family of parameterized ODEs with continuous-in-time parameters, which include time-dependent neural ODEs. We derive a generalization bound for this class by a Lipschitz-based argument. By leveraging the analogy between neural ODEs and deep residual networks, our approach yields in particular a generalization bound for a class of deep residual networks. The bound involves the magnitude of the difference between successive weight matrices. We illustrate numerically how this quantity affects the generalization capability of neural networks.
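
The quantity in the bound is easy to compute for a trained residual network; a minimal sketch follows, assuming the bound's dependence can be summarized by summing spectral norms of successive weight differences (the exact aggregation used in the paper may differ).

    import numpy as np

    def weight_increment_norm(weights):
        """Total magnitude of differences between successive weight
        matrices, the quantity appearing in the generalization bound.
        weights: list of (d, d) arrays, one per residual block."""
        return sum(np.linalg.norm(w2 - w1, ord=2)
                   for w1, w2 in zip(weights[:-1], weights[1:]))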

CARE: Modeling Interacting Dynamics Under Temporal Environmental Variation
Xiao Luo Haixin Wang Zijie Huang Huiyu Jiang Abhijeet Sadashiv Gangan Song Jiang Yizhou Sun



Research question: How to effectively model and understand complex interacting dynamical systems, such as fluid dynamics and intermolecular interactions.
Motivation: Most existing models assume the underlying dynamics do not change over time, which is untrue in practice; for example, environment temperature affects molecular dynamics.
Method: Take a probabilistic view of time-varying dynamics and propose the Context-attended Graph ODE (CARE), which models the time-varying environment with a context variable and uses a neural ODE to describe the dynamic evolution of the context variable inferred from system states.
Results: Comprehensive experiments on four datasets demonstrate the effectiveness of CARE compared with several state-of-the-art approaches.

Modeling interacting dynamical systems, such as fluid dynamics and intermolecular interactions, is a fundamental research problem for understanding and simulating complex real-world systems. Many of these systems can be naturally represented by dynamic graphs, and graph neural network-based approaches have been proposed and shown promising performance. However, most of these approaches assume the underlying dynamics do not change over time, which is unfortunately untrue. For example, molecular dynamics can be affected by the environment temperature over time. In this paper, we provide a probabilistic view of time-varying dynamics and propose a model, Context-attended Graph ODE (CARE), for modeling time-varying interacting dynamical systems. In our CARE, we explicitly use a context variable to model the time-varying environment and construct an encoder to initialize the context variable from historical trajectories. Furthermore, we employ a neural ODE model to depict the dynamic evolution of the context variable inferred from system states. This context variable is incorporated into a coupled ODE to simultaneously drive the evolution of systems. Comprehensive experiments on four datasets demonstrate the effectiveness of our proposed CARE compared with several state-of-the-art approaches.

Directed Cyclic Graph for Causal Discovery from Multivariate Functional Data
Saptarshi Roy Raymond K. W. Wong Yang Ni



Research question: How to discover causal relationships from multivariate functional data.
Motivation: Causal discovery from multivariate functional data has recently received significant attention, but existing methods typically cannot handle functional graph structures that involve cycles.
Method: Propose a functional linear structural equation model for learning causal structure from multivariate functional data, incorporating a low-dimensional causal embedding space that preserves all relevant causal information.
Results: Extensive simulation studies and an application to a brain EEG dataset demonstrate superior performance in causal graph estimation.

Discovering causal relationships using multivariate functional data has received a significant amount of attention very recently. In this article, we introduce a functional linear structural equation model for causal structure learning when the underlying graph involving the multivariate functions may have cycles. To enhance interpretability, our model involves a low-dimensional causal embedded space such that all the relevant causal information in the multivariate functional data is preserved in this lower-dimensional subspace. We prove that the proposed model is causally identifiable under standard assumptions that are often made in the causal discovery literature. To carry out inference of our model, we develop a fully Bayesian framework with suitable prior specifications and uncertainty quantification through posterior summaries. We illustrate the superior performance of our method over existing methods in terms of causal graph estimation through extensive simulation studies. We also demonstrate the proposed method using a brain EEG dataset.

Survival Permanental Processes for Survival Analysis with Time-Varying Covariates
Hideaki Kim



Research question: How to accurately analyze the nonlinear dependence of survival (time-to-event) outcomes on time-varying covariates.
Motivation: Traditional survival analysis methods such as the Cox proportional hazards model have been extended through a counting process formulation to handle time-varying covariates, but sophisticated machine learning methods that can accommodate them remain limited.
Method: This paper proposes a non-parametric Bayesian survival model for the nonlinear dependence of time-to-event outcomes on time-varying covariates. It builds on a computationally feasible Cox process, the permanental process, which assumes the square root of the hazard function is generated from a Gaussian process, and tailors it to survival data with time-varying covariates.
Results: Evaluated on synthetic and real-world data, the algorithm achieves predictive accuracy comparable to state-of-the-art methods while being tens to hundreds of times faster.

Survival or time-to-event data with time-varying covariates are common in practice, and exploring the non-stationarity in covariates is essential to accurately analyzing the nonlinear dependence of time-to-event outcomes on covariates. Traditional survival analysis methods such as Cox proportional hazards model have been extended to address the time-varying covariates through a counting process formulation, although sophisticated machine learning methods that can accommodate time-varying covariates have been limited. In this paper, we propose a non-parametric Bayesian survival model to analyze the nonlinear dependence of time-to-event outcomes on time-varying covariates. We focus on a computationally feasible Cox process called permanental process, which assumes the square root of hazard function to be generated from a Gaussian process, and tailor it for survival data with time-varying covariates. We verify that the representer theorem holds for the proposed model, a beneficial property for functional analysis, which offers us a fast Bayesian estimation algorithm that scales linearly with the number of observed events without relying on Markov Chain Monte Carlo computation. We evaluate our algorithm on synthetic and real-world data, and show that it achieves comparable predictive accuracy while being tens to hundreds of times faster than state-of-the-art methods.

Beyond Unimodal: Generalising Neural Processes for Multimodal Uncertainty Estimation
Myong Chol Jung He Zhao Joanna Dipnall Lan Du



Research question: How to effectively estimate uncertainty for multimodal data.
Motivation: While existing methods have achieved notable results for unimodal uncertainty estimation, uncertainty estimation for multimodal data remains a challenge.
Method: Propose Multimodal Neural Processes (MNPs), which generalise neural processes (NPs) to accommodate the characteristics of multimodal data.
Results: Experiments show state-of-the-art multimodal uncertainty estimation performance, strong robustness to noisy samples, reliable out-of-distribution detection, and faster computation than the current state-of-the-art multimodal uncertainty estimation method.

Uncertainty estimation is an important research area to make deep neural networks (DNNs) more trustworthy. While extensive research on uncertainty estimation has been conducted with unimodal data, uncertainty estimation for multimodal data remains a challenge. Neural processes (NPs) have been demonstrated to be an effective uncertainty estimation method for unimodal data by providing the reliability of Gaussian processes with efficient and powerful DNNs. While NPs hold significant potential for multimodal uncertainty estimation, the adaptation of NPs for multimodal data has not been carefully studied. To bridge this gap, we propose Multimodal Neural Processes (MNPs) by generalising NPs for multimodal uncertainty estimation. Based on the framework of NPs, MNPs consist of several novel and principled mechanisms tailored to the characteristics of multimodal data. In extensive empirical evaluation, our method achieves state-of-the-art multimodal uncertainty estimation performance, showing its appealing robustness against noisy samples and reliability in out-of-distribution detection with faster computation time compared to the current state-of-the-art multimodal uncertainty estimation method.

Variational Imbalanced Regression: Fair Uncertainty Quantification via Probabilistic Smoothing
Ziyan Wang Hao Wang



Research question: Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced.
Motivation: Propose a new probabilistic deep learning model, variational imbalanced regression (VIR), to address imbalanced regression while naturally producing reasonable uncertainty estimates.
Method: Unlike typical variational autoencoders that assume I.I.D. representations, VIR borrows data with similar regression labels to compute the latent representation's variational distribution; and unlike deterministic regression models that produce only point estimates, VIR predicts entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to probabilistically reweight the imbalanced data, yielding better uncertainty estimates.
Results: Experiments on several real-world datasets show that VIR outperforms state-of-the-art imbalanced regression models in both accuracy and uncertainty estimation.

Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders assuming I.I.D. representations (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments in several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation. Code will soon be available at https://github.com/Wang-ML-Lab/variational-imbalanced-regression.
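
To make the normal-inverse-gamma output concrete, the sketch below converts a predicted (mu, nu, alpha, beta) tuple into a point estimate plus aleatoric and epistemic uncertainties, following the standard evidential-regression decomposition; whether VIR uses exactly these formulas is an assumption.

    def nig_uncertainty(mu, nu, alpha, beta):
        """Uncertainties implied by a normal-inverse-gamma prediction.
        Standard decomposition (requires alpha > 1):
            aleatoric = E[sigma^2] = beta / (alpha - 1)
            epistemic = Var[mu]    = beta / (nu * (alpha - 1))
        These formulas are conventional in evidential deep learning and
        stand in for VIR's exact treatment."""
        aleatoric = beta / (alpha - 1.0)
        epistemic = beta / (nu * (alpha - 1.0))
        return mu, aleatoric, epistemic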

Many-body Approximation for Non-negative Tensors
Kazu Ghalamkari Mahito Sugiyama Yoshinobu Kawahara



Research question: Propose a novel decomposition method for non-negative tensors, called many-body approximation.
Motivation: Traditional decomposition methods assume a low-rank representation, which makes global optimization and target rank selection difficult.
Method: Avoid these problems via energy-based modeling of tensors, in which a tensor and its modes correspond to a probability distribution and random variables, respectively. The model can be globally optimized in terms of KL divergence minimization by accounting for interactions between variables (i.e., modes), which are more intuitive to tune than ranks. Interactions between modes are visualized as tensor networks, revealing a nontrivial relationship between many-body approximation and low-rank approximation.
Results: The effectiveness of the approach is demonstrated on tensor completion and approximation.

We present an alternative approach to decompose non-negative tensors, called many-body approximation. Traditional decomposition methods assume low-rankness in the representation, resulting in difficulties in global optimization and target rank selection. We avoid these problems by energy-based modeling of tensors, where a tensor and its mode correspond to a probability distribution and a random variable, respectively. Our model can be globally optimized in terms of the KL divergence minimization by taking into account the interaction between variables (that is, modes), which can be tuned more intuitively than ranks. Furthermore, we visualize interactions between modes as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation. We demonstrate the effectiveness of our approach in tensor completion and approximation.

Sparse Deep Learning for Time Series Data: Theory and Applications
Mingxuan Zhang Yan Sun Faming Liang



Research question: Address the shortcomings of existing sparse deep learning when handling dependent data, such as time series and sequential data in natural language processing.
Motivation: Most existing research on sparse deep learning has focused on i.i.d. observations; problems with dependent data, such as time series, have received little attention.
Method: Studying the theory of sparse deep learning with dependent data, the authors show that sparse recurrent neural networks (RNNs) can be consistently estimated and that their predictions are asymptotically normally distributed under appropriate assumptions, so prediction uncertainty can be correctly quantified.
Results: Numerical results show the method outperforms state-of-the-art approaches such as conformal prediction in prediction uncertainty quantification for time series; it also consistently identifies the autoregressive order of time series data and surpasses existing methods in large-scale model compression.

Sparse deep learning has become a popular technique for improving the performance of deep neural networks in areas such as uncertainty quantification, variable selection, and large-scale network compression. However, most existing research has focused on problems where the observations are independent and identically distributed (i.i.d.), and there has been little work on the problems where the observations are dependent, such as time series data and sequential data in natural language processing. This paper aims to address this gap by studying the theory for sparse deep learning with dependent data. We show that sparse recurrent neural networks (RNNs) can be consistently estimated, and their predictions are asymptotically normally distributed under appropriate assumptions, enabling the prediction uncertainty to be correctly quantified. Our numerical results show that sparse deep learning outperforms state-of-the-art methods, such as conformal predictions, in prediction uncertainty quantification for time series data. Furthermore, our results indicate that the proposed method can consistently identify the autoregressive order for time series data and outperform existing methods in large-scale model compression. Our proposed method has important practical implications in fields such as finance, healthcare, and energy, where both accurate point estimates and prediction uncertainty quantification are of concern.

Disentanglement via Latent Quantization
Kyle Hsu Will Dorrell James C. R. Whittington Jiajun Wu Chelsea Finn



Research question: How to learn disentangled representations that tease apart a dataset's underlying sources of variation and represent them independently of one another.
Motivation: Since the model receives no ground-truth information about these sources, inductive biases play a paramount role in enabling disentanglement.
Method: Construct an inductive bias towards encoding to and decoding from an organized latent space by (i) quantizing the latent space into discrete code vectors with a separate learnable scalar codebook per dimension and (ii) applying strong model regularization via unusually high weight decay; also propose InfoMEC, a new information-theoretic set of disentanglement metrics.
Results: Together with regularization, latent quantization dramatically improves the modularity and explicitness of learned representations on benchmark datasets; the quantized-latent autoencoder (QLAE) consistently outperforms strong prior methods in these key disentanglement properties without compromising data reconstruction.

In disentangled representation learning, a model is asked to tease apart a dataset's underlying sources of variation and represent them independently of one another. Since the model is provided with no ground truth information about these sources, inductive biases take a paramount role in enabling disentanglement. In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. Concretely, we do this by (i) quantizing the latent space into discrete code vectors with a separate learnable scalar codebook per dimension and (ii) applying strong model regularization via an unusually high weight decay. Intuitively, the latent space design forces the encoder to combinatorially construct codes from a small number of distinct scalar values, which in turn enables the decoder to assign a consistent meaning to each value. Regularization then serves to drive the model towards this parsimonious strategy. We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models. For reliable evaluation, we also propose InfoMEC, a new set of metrics for disentanglement that is cohesively grounded in information theory and fixes well-established shortcomings in previous metrics. Together with regularization, latent quantization dramatically improves the modularity and explicitness of learned representations on a representative suite of benchmark datasets. In particular, our quantized-latent autoencoder (QLAE) consistently outperforms strong methods from prior work in these key disentanglement properties without compromising data reconstruction.
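
A minimal sketch of the per-dimension scalar-codebook quantization at the heart of QLAE; the shapes and the straight-through detail noted in the comments are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def quantize_latents(z, codebooks):
        """Snap each latent dimension j of z to its nearest value in that
        dimension's own scalar codebook.
        z: (batch, d) continuous latents; codebooks: (d, n_codes) scalars.
        With autograd one would apply the straight-through trick,
        z_q = z + stop_gradient(z_q - z), so gradients reach the encoder."""
        dist = (z[:, :, None] - codebooks[None, :, :]) ** 2  # (batch, d, n_codes)
        idx = dist.argmin(axis=-1)                           # nearest code per dim
        dims = np.arange(codebooks.shape[0])[None, :]        # (1, d) row selector
        z_q = codebooks[dims, idx]                           # (batch, d)
        return z_q, idx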

AdaVAE: Bayesian Structural Adaptation for Variational Autoencoders
Paribesh Regmi Rui Li



Research question: The network structures of the generative models and paired inference models in variational autoencoders (VAEs) play a critical role in generative performance, yet these powerful structures are fixed before training and require heavy computation to tune for given data.
Motivation: Existing VAE regularization methods largely overlook the importance of network structure and fail to prevent overfitting in deep VAE models.
Method: Propose a Bayesian inference framework that automatically adapts VAE network structures to the data and prevents overfitting as networks grow deeper; the number of hidden layers is modeled with a beta process to infer the most plausible encoding/decoding depths, combined with layer-wise dropout regularization.
Results: Experiments show the inference framework effectively prevents overfitting in both shallow and deep VAE models, achieving state-of-the-art performance; it is compatible with different types of VAE backbone networks and can be applied to various VAE variants, further improving their performance.

The neural network structures of generative models and their corresponding inference models paired in variational autoencoders (VAEs) play a critical role in the models' generative performance. However, powerful VAE network structures are hand-crafted and fixed prior to training, resulting in a one-size-fits-all approach that requires heavy computation to tune for given data. Moreover, existing VAE regularization methods largely overlook the importance of network structures and fail to prevent overfitting in deep VAE models with cascades of hidden layers. To address these issues, we propose a Bayesian inference framework that automatically adapts VAE network structures to data and prevents overfitting as they grow deeper. We model the number of hidden layers with a beta process to infer the most plausible encoding/decoding network depths warranted by data and perform layer-wise dropout regularization with a conjugate Bernoulli process. We develop a scalable estimator that performs joint inference on both VAE network structures and latent variables. Our experiments show that the inference framework effectively prevents overfitting in both shallow and deep VAE models, yielding state-of-the-art performance. We demonstrate that our framework is compatible with different types of VAE backbone networks and can be applied to various VAE variants, further improving their performance.

Causal discovery from observational and interventional data across multiple environments
Adam Li Amin Jaber Elias Bareinboim



Research question: How to learn the causal structure of non-Markovian systems from data collected across multiple domains.
Motivation: Existing methods can only learn equivalence classes of causal diagrams from observational and experimental data within a single domain and cannot handle multi-domain data.
Method: Using observational and interventional data from different domains, define the S-Markov property and build on it a new constraint-based causal discovery algorithm, S-FCI.
Results: The algorithm is proven sound and subsumes existing constraint-based causal discovery algorithms.

A fundamental problem in many sciences is the learning of causal structure underlying a system, typically through observation and experimentation. Commonly, one even collects data across multiple domains, such as gene sequencing from different labs, or neural recordings from different species. Although there exist methods for learning the equivalence class of causal diagrams from observational and experimental data, they are meant to operate in a single domain. In this paper, we develop a fundamental approach to structure learning in non-Markovian systems (i.e. when there exist latent confounders) leveraging observational and interventional data collected from multiple domains. Specifically, we start by showing that learning from observational data in multiple domains is equivalent to learning from interventional data with unknown targets in a single domain. But there are also subtleties when considering observational and experimental data. Using causal invariances derived from do-calculus, we define a property called S-Markov that connects interventional distributions from multiple-domains to graphical criteria on a selection diagram. Leveraging the S-Markov property, we introduce a new constraint-based causal discovery algorithm, S-FCI, that can learn from observational and interventional data from different domains. We prove that the algorithm is sound and subsumes existing constraint-based causal discovery algorithms.

Neural Lyapunov Control for Discrete-Time Systems
Junlin Wu Andrew Clark Yiannis Kantaros Yevgeniy Vorobeychik



Research question: How to find stabilizing control policies for nonlinear systems.
Motivation: While stabilizing control of linear systems is well understood, stabilizing nonlinear systems remains a major challenge.
Method: Propose a method for learning neural Lyapunov control in discrete-time systems: verify the discrete-time Lyapunov stability conditions via mixed-integer linear programming, compute verified sublevel sets by exploiting the particular structure of the conditions, and use a heuristic gradient-based method to quickly find counterexamples and accelerate Lyapunov function learning.
Results: The method significantly outperforms state-of-the-art baselines on four standard benchmarks. For example, on the path tracking benchmark it improves on recent neural Lyapunov control baselines by an order of magnitude in both running time and region-of-attraction size, and on two of the four benchmarks (cartpole and PVTOL) it is the first automated method to return a provably stable controller.

While ensuring stability for linear systems is well understood, it remains a major challenge for nonlinear systems. A general approach in such cases is to compute a combination of a Lyapunov function and an associated control policy. However, finding Lyapunov functions for general nonlinear systems is a challenging task. To address this challenge, several methods have been proposed that represent Lyapunov functions using neural networks. However, such approaches either focus on continuous-time systems, or highly restricted classes of nonlinear dynamics. We propose the first approach for learning neural Lyapunov control in a broad class of discrete-time systems. Three key ingredients enable us to effectively learn provably stable control policies. The first is a novel mixed-integer linear programming approach for verifying the discrete-time Lyapunov stability conditions, leveraging the particular structure of these conditions. The second is a novel approach for computing verified sublevel sets. The third is a heuristic gradient-based method for quickly finding counterexamples to significantly speed up Lyapunov function learning. Our experiments on four standard benchmarks demonstrate that our approach significantly outperforms state-of-the-art baselines. For example, on the path tracking benchmark, we outperform recent neural Lyapunov control baselines by an order of magnitude in both running time and the size of the region of attraction, and on two of the four benchmarks (cartpole and PVTOL), ours is the first automated approach to return a provably stable controller. Our code is available at: https://github.com/jlwu002/nlc_discrete.
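
For intuition, the hinge-style surrogate below penalizes violations of the discrete-time Lyapunov conditions on sampled states during learning; it stands in for the paper's MILP-verified procedure, and the margin and loss form are illustrative assumptions.

    import torch

    def lyapunov_violation_loss(V, f, x, margin=1e-3):
        """Penalize violations of the discrete-time Lyapunov conditions
            V(0) = 0,   V(x) > 0,   V(f(x)) - V(x) < 0   for x != 0,
        on a batch of states x.
        V: nn.Module mapping state -> scalar; f: closed-loop dynamics
        x_{t+1} = f(x_t), e.g. plant composed with the learned policy."""
        v = V(x).squeeze(-1)
        v_next = V(f(x)).squeeze(-1)
        positivity = torch.relu(margin - v).mean()         # V(x) >= margin
        decrease = torch.relu(v_next - v + margin).mean()  # V decreases along f
        origin = V(torch.zeros_like(x[:1])).abs().mean()   # V(0) = 0
        return positivity + decrease + origin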

DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics
Zhiao Huang Feng Chen Yewen Pu Chunru Lin Hao Su Chuang Gan



Research question: How can non-expert users effectively describe soft-body manipulation tasks in a form a physics solver can handle?
Motivation: Existing optimization objectives require expert knowledge to write, limiting the ability to collect large sets of naturalistic problems from non-expert users.
Method: We propose DiffVL, which lets non-expert users describe soft-body manipulation tasks through a combination of vision and natural language, and uses large language models to translate the task descriptions into machine-interpretable optimization objectives.
Results: We developed GUI tools with which non-expert users specified 100 tasks based on real-life soft-body manipulations from online videos. Experiments show the approach effectively helps differentiable physics solvers tackle these long-horizon, multi-stage tasks, which previous baselines struggle to solve.

Combining gradient-based trajectory optimization with differentiable physics simulation is an efficient technique for solving soft-body manipulation problems. Using a well-crafted optimization objective, the solver can quickly converge onto a valid trajectory. However, writing the appropriate objective functions requires expert knowledge, making it difficult to collect a large set of naturalistic problems from non-expert users. We introduce DiffVL, a method that enables non-expert users to communicate soft-body manipulation tasks -- as a combination of vision and natural language, given in multiple stages -- that can be readily leveraged by a differentiable physics solver. We have developed GUI tools that enable non-expert users to specify 100 tasks inspired by real-life soft-body manipulations from online videos, which we will make public. We leverage large language models to translate task descriptions into machine-interpretable optimization objectives. The optimization objectives can help differentiable physics solvers to solve these long-horizon multistage tasks that are challenging for previous baselines.

Dense-Exponential Random Features: Sharp Positive Estimators of the Gaussian Kernel
Valerii Likhosherstov Krzysztof Marcin Choromanski Kumar Avinava Dubey Frederick Liu Tamas Sarlos Adrian Weller



Research question: How to efficiently approximate linear operators induced by the Gaussian or softmax kernel.
Motivation: Traditional random features (RFs) approximate the result of such operators without bias, but their parameters cannot be optimized to reduce the variance of the approximation.
Method: Propose parameterized, positive, non-trigonometric RFs for approximating Gaussian and softmax kernels. The parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form.
Results: Experiments show substantial variance reduction in practice (e^{10} times smaller and beyond) and improvements over previous methods on a kernel regression task. Building on this mechanism, we also present FAVOR#, a method for approximating self-attention in Transformers, which outperforms other random feature methods in speech modelling and natural language processing.

The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice (e^{10}-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
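
As background, here is a sketch of the positive, non-trigonometric random features (in the FAVOR+ style) that the proposed parameterized features generalize, using the identity k(x, y) = E_w[phi_w(x) phi_w(y)] with phi_w(x) = exp(w.x - ||x||^2) and w ~ N(0, I) for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2).

    import numpy as np

    def positive_rf_features(X, W):
        """Positive random features for the Gaussian kernel.
        X: (n, d) inputs; W: (m, d) rows drawn i.i.d. from N(0, I_d)."""
        sq_norms = (X ** 2).sum(-1, keepdims=True)  # ||x||^2 per row
        return np.exp(X @ W.T - sq_norms) / np.sqrt(W.shape[0])

    # Sanity check: phi(X) phi(X)^T approximates the exact Gaussian kernel.
    rng = np.random.default_rng(0)
    X = 0.3 * rng.normal(size=(5, 3))
    W = rng.normal(size=(8192, 3))
    Phi = positive_rf_features(X, W)
    approx = Phi @ Phi.T
    exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
    print("max abs error:", np.abs(approx - exact).max())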

Differentially Private Statistical Inference through $\beta$-Divergence One Posterior Sampling
Jack Jewson Sahra Ghalebikesabi Christopher C. Holmes



Research question: How to release the results of statistical analyses involving sensitive data without compromising the privacy of any participant.
Motivation: Current privacy-preserving methods generally inject noise directly into parameter estimates or the estimation process. Sampling from Bayesian posteriors instead provides private estimates without artificial perturbation, but existing approaches rely on strong bounding assumptions that fail even for basic models.
Method: Propose βD-Bayes, a posterior sampling scheme that samples from a generalised posterior targeting minimisation of the β-divergence between the model and the data generating process, yielding private estimation that is generally applicable without changing the underlying model.
Results: Experiments show that βD-Bayes produces more precise inference for the same privacy guarantees, and further enables differentially private estimation of complex classifiers and continuous regression models such as neural networks, beyond what private posterior sampling could previously achieve.

Differential privacy guarantees allow the results of a statistical analysis involving sensitive data to be released without compromising the privacy of any individual taking part. Achieving such guarantees generally requires the injection of noise, either directly into parameter estimates or into the estimation process. Instead of artificially introducing perturbations, sampling from Bayesian posterior distributions has been shown to be a special case of the exponential mechanism, producing consistent, and efficient private estimates without altering the data generative process. The application of current approaches has, however, been limited by their strong bounding assumptions which do not hold for basic models, such as simple linear regressors. To ameliorate this, we propose $\beta$D-Bayes, a posterior sampling scheme from a generalised posterior targeting the minimisation of the $\beta$-divergence between the model and the data generating process. This provides private estimation that is generally applicable without requiring changes to the underlying model and consistently learns the data generating parameter. We show that $\beta$D-Bayes produces more precise inference estimation for the same privacy guarantees, and further facilitates differentially private estimation of complex classifiers, and continuous regression models such as neural networks, which goes beyond what has been currently possible with private posterior sampling.

Do Not Marginalize Mechanisms, Rather Consolidate!
Moritz Willig Matej Zečević Devendra Singh Dhami Kristian Kersting



Research question: How to simplify and compress large-scale structural causal models (SCMs) in the face of ever-growing data.
Motivation: As systems grow, the number of variables and the complexity of their interactions increase, making SCMs convoluted and difficult to analyze, particularly in machine learning and artificial intelligence.
Method: Introduce the concept of consolidating causal mechanisms to transform large-scale SCMs while preserving consistent interventional behaviour.
Results: Consolidation is a powerful method for simplifying SCMs; it reduces computational complexity and offers a perspective on the generalizing abilities of consolidated SCMs.

Structural causal models (SCMs) are a powerful tool for understanding the complex causal relationships that underlie many real-world systems. As these systems grow in size, the number of variables and the complexity of interactions between them grow too, making the models convoluted and difficult to analyze. This is particularly true in the context of machine learning and artificial intelligence, where an ever-increasing amount of data demands new methods to simplify and compress large-scale SCMs. While methods for marginalizing and abstracting SCMs already exist today, they may destroy the causality of the marginalized model. To alleviate this, we introduce the concept of consolidating causal mechanisms to transform large-scale SCMs while preserving consistent interventional behaviour. We show consolidation is a powerful method for simplifying SCMs, discuss the reduction of computational complexity, and give a perspective on the generalizing abilities of consolidated SCMs.

Neural Lad: A Neural Latent Dynamics Framework for Times Series Modeling
Ting Li Jianguo Li Zhanxing Zhu



Research question: Existing neural ODE forecasting models have two shortcomings: first, controlling the latent states only through a linear transformation of the local change of observed signals may be inadequate; second, they lack the ability to capture inherent periodicity in time series forecasting tasks.
Motivation: To overcome these issues, a new neural ODE framework called Neural Lad is proposed: a neural latent dynamics model in which the latent representation evolves through an ODE enhanced by the change of the observed signal and a seasonality-trend characterization.
Method: Incorporate the local change of the input signal into the latent dynamics in an attention-based manner, and design a residual architecture over basis expansion to describe the periodicity of the latent dynamics. To accommodate multivariate time series forecasting, extend Neural Lad by learning an adaptive relationship between multiple time series.
Results: Experiments show the model achieves better or comparable performance against existing neural ODE families and transformer variants on various datasets. Notably, the empirical advantage of Neural Lad is consistent across short- and long-horizon forecasting for univariate, multivariate, and even irregularly sampled time series.

Neural ordinary differential equation (Neural ODE) is an elegant yet powerful framework to learn the temporal dynamics for time series modeling. However, we observe that existing Neural ODE forecasting models suffer from two disadvantages: i) controlling the latent states only through the linear transformation over the local change of the observed signals may be inadequate; ii) lacking the ability to capture the inherent periodicity in time series forecasting tasks. To overcome the two issues, we introduce a new neural ODE framework called \textbf{Neural Lad}, a \textbf{Neural} \textbf{La}tent \textbf{d}ynamics model in which the latent representations evolve with an ODE enhanced by the change of observed signal and seasonality-trend characterization. We incorporate the local change of the input signal into the latent dynamics in an attention-based manner and design a residual architecture over basis expansion to depict the periodicity in the underlying dynamics. To accommodate the multivariate time series forecasting, we extend the Neural Lad through learning an adaptive relationship between multiple time series. Experiments demonstrate that our model can achieve better or comparable performance against existing neural ODE families and transformer variants on various datasets. Remarkably, the empirical superiority of Neural Lad is consistent across short- and long-horizon forecasting for univariate, multivariate, and even irregularly sampled time series.

Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods
Shang Liu Zhongze Cai Xiaocheng Li



Research question: Uncertainty quantification for regression models, specifically an individual calibration objective that characterizes the quantiles of the prediction model.
Motivation: Although this objective is well motivated by downstream tasks such as newsvendor cost, existing methods are largely heuristic and lack statistical guarantees for individual calibration.
Method: We propose simple nonparametric calibration methods that are agnostic to the underlying prediction model and enjoy both computational efficiency and statistical consistency.
Results: Our analysis combines nonparametric analysis with a covering number argument to establish matching upper and lower bounds for the calibration error of the proposed methods. Technically, this offers new theoretical insight into the curse of dimensionality and the impossibility of individual calibration, and achieves individual calibration with finite-sample guarantees under minimal assumptions.

In this paper, we consider the uncertainty quantification problem for regression models. Specifically, we consider an individual calibration objective for characterizing the quantiles of the prediction model. While such an objective is well-motivated from downstream tasks such as newsvendor cost, the existing methods have been largely heuristic and lack of statistical guarantee in terms of individual calibration. We show via simple examples that the existing methods focusing on population-level calibration guarantees such as average calibration or sharpness can lead to harmful and unexpected results. We propose simple nonparametric calibration methods that are agnostic of the underlying prediction model and enjoy both computational efficiency and statistical consistency. Our approach enables a better understanding of the possibility of individual calibration, and we establish matching upper and lower bounds for the calibration error of our proposed methods. Technically, our analysis combines the nonparametric analysis with a covering number argument for parametric analysis, which advances the existing theoretical analyses in the literature of nonparametric density estimation and quantile bandit problems. Importantly, the nonparametric perspective sheds new theoretical insights into regression calibration in terms of the curse of dimensionality and reconciles the existing results on the impossibility of individual calibration. To our knowledge, we make the first effort to reach both individual calibration and finite-sample guarantee with minimal assumptions in terms of conformal prediction. Numerical experiments show the advantage of such a simple approach under various metrics, and also under covariates shift. We hope our work provides a simple benchmark and a starting point of theoretical ground for future research on regression calibration.

Near-Linear Time Algorithm for the Chamfer Distance
Ainesh Bakshi Piotr Indyk Rajesh Jayaram Sandeep Silwal Erik Waingarten



Research question: How to efficiently compute the Chamfer distance between two point sets.
Motivation: The Chamfer distance is a popular measure of dissimilarity between point clouds, but existing computations have quadratic time complexity, which is intractable for large datasets.
Method: Propose a fast $(1+\epsilon)$-approximate algorithm for the Chamfer distance that runs in time $O(nd \log (n)/\epsilon^2)$.
Results: Experiments show the algorithm is both accurate and fast on large high-dimensional datasets, opening new avenues for analyzing large high-dimensional point clouds.

For any two point sets $A,B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\texttt{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance). The Chamfer distance is a popular measure of dissimilarity between point clouds, used in many machine learning, computer vision, and graphics applications, and admits a straightforward $O(d n^2)$-time brute force algorithm. Further, Chamfer distance is often used as a proxy for the more computationally demanding Earth-Mover (Optimal Transport) Distance. However, the \emph{quadratic} dependence on $n$ in the running time makes the naive approach intractable for large datasets. We overcome this bottleneck and present the first $(1+\epsilon)$-approximate algorithm for estimating Chamfer distance with a near-linear running time. Specifically, our algorithm runs in time $O(nd \log (n)/\epsilon^2)$ and is implementable. Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets. We believe that our algorithm will open new avenues for analyzing large high-dimensional point clouds. We also give evidence that if the goal is to report a $(1+\epsilon)$-approximate mapping from $A$ to $B$ (as opposed to just its value), then any sub-quadratic time algorithm is unlikely to exist.
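
The definition translates directly into the $O(d n^2)$ brute-force baseline; a k-d tree gives exact nearest neighbors much faster in low dimension, although, unlike the paper's algorithm, it carries no near-linear guarantee in high dimension. A minimal sketch:

    import numpy as np
    from scipy.spatial import cKDTree

    def chamfer_bruteforce(A, B):
        """CH(A, B) = sum over a in A of min over b in B of ||a - b||_2,
        computed directly from the O(d n^2) definition (Euclidean d_X)."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.sqrt(d2.min(axis=1)).sum()

    def chamfer_kdtree(A, B):
        """Same quantity via exact nearest-neighbor queries on a k-d tree."""
        dist, _ = cKDTree(B).query(A, k=1)
        return dist.sum()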

History Filtering in Imperfect Information Games: Algorithms and Complexity
Christopher Solinas Doug Rebstock Nathan R. Sturtevant Michael Buro



Research question: How to effectively perform depth-limited search and subgame decomposition in imperfect information settings.
Motivation: Although subgame decomposition is widely used in depth-limited search, the computational complexity and tractability of the underlying history filtering have not been formally analyzed.
Method: Introduce and analyze algorithms for filtering and generating the histories at the root of a subgame, characterizing their computational complexity and tractability; a novel Markov Chain Monte Carlo-based generation algorithm is introduced for trick-taking card games.
Results: Experiments show the approach improves the scalability of depth-limited search in imperfect information settings, with good scaling demonstrated in the trick-taking card game Oh Hell.

Historically applied exclusively to perfect information games, depth-limited search with value functions has been key to recent advances in AI for imperfect information games. Most prominent approaches with strong theoretical guarantees require *subgame decomposition* - a process in which a subgame is computed from public information and player beliefs. However, subgame decomposition can itself require non-trivial computations, and its tractability depends on the existence of efficient algorithms for either full enumeration or generation of the histories that form the root of the subgame. Despite this, no formal analysis of the tractability of such computations has been established in prior work, and application domains have often consisted of games, such as poker, for which enumeration is trivial on modern hardware. Applying these ideas to more complex domains requires understanding their cost. In this work, we introduce and analyze the computational aspects and tractability of filtering histories for subgame decomposition. We show that constructing a single history from the root of the subgame is generally intractable, and then provide a necessary and sufficient condition for efficient enumeration. We also introduce a novel Markov Chain Monte Carlo-based generation algorithm for trick-taking card games - a domain where enumeration is often prohibitively expensive. Our experiments demonstrate its improved scalability in the trick-taking card game *Oh Hell*. These contributions clarify when and how depth-limited search via subgame decomposition can be an effective tool for sequential decision-making in imperfect information settings.

Revisiting Implicit Differentiation for Learning Problems in Optimal Control
Ming Xu Timothy L Molloy Stephen Gould



Research question: How to differentiate through optimal trajectories of non-convex, constrained discrete-time optimal control (COC) problems using the implicit function theorem (IFT).
Motivation: Previous works solve a differential KKT system for the trajectory derivative, achieving efficiency by solving an auxiliary linear quadratic regulator (LQR) problem. In contrast, we directly evaluate the matrix equations that arise after eliminating the Lagrange multiplier terms.
Method: We directly evaluate the matrix equations produced by applying variable elimination to the Lagrange multiplier terms. By properly accounting for the structure of the terms in the resulting equations, we show that the trajectory derivatives scale linearly with the number of timesteps. Moreover, the approach parallelizes easily, scales significantly better with model size, computes vector-Jacobian products directly, and has improved numerical stability compared to prior work.
Results: We evaluate the method on a synthetic benchmark and four challenging learning-from-demonstration benchmarks, including a 6-DoF maneuvering quadrotor and a 6-DoF rocket-powered landing.

This paper proposes a new method for differentiating through optimal trajectories arising from non-convex, constrained discrete-time optimal control (COC) problems using the implicit function theorem (IFT). Previous works solve a differential Karush-Kuhn-Tucker (KKT) system for the trajectory derivative, and achieve this efficiently by solving an auxiliary Linear Quadratic Regulator (LQR) problem. In contrast, we directly evaluate the matrix equations which arise from applying variable elimination on the Lagrange multiplier terms in the (differential) KKT system. By appropriately accounting for the structure of the terms within the resulting equations, we show that the trajectory derivatives scale linearly with the number of timesteps. Furthermore, our approach allows for easy parallelization, significantly improved scalability with model size, direct computation of vector-Jacobian products and improved numerical stability compared to prior works. As an additional contribution, we unify prior works, addressing claims that computing trajectory derivatives using IFT scales quadratically with the number of timesteps. We evaluate our method on both a synthetic benchmark and four challenging learning-from-demonstration benchmarks, including a 6-DoF maneuvering quadrotor and a 6-DoF rocket-powered landing.

CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement
Qihe Huang Lei Shen Ruixin Zhang Shouhong Ding Binwu Wang Zhengyang Zhou Yang Wang



Research question: Existing multivariate time series forecasting techniques handle sudden noise in the time dimension and heterogeneity between variables poorly.
Motivation: To address these issues, we propose CrossGNN, which improves multivariate time series forecasting by extracting time scales with clearer trends and weaker noise and by exploiting the homogeneity and heterogeneity between variables.
Method: We design an adaptive multi-scale identifier (AMSI) to reduce noise in the time dimension and construct multi-scale time series, and we propose a Cross-Scale GNN to extract scales with clearer trends and weaker noise together with a Cross-Variable GNN to utilize the homogeneity and heterogeneity between variables.
Results: Experimental results show that CrossGNN outperforms state-of-the-art methods on 8 real-world multivariate time series datasets.

Recently, multivariate time series (MTS) forecasting techniques have seen rapid development and widespread applications across various fields. Transformer-based and GNN-based methods have shown promising potential due to their strong ability to model interaction of time and variables. However, by conducting a comprehensive analysis of the real-world data, we observe that the temporal fluctuations and heterogeneity between variables are not well handled by existing methods. To address the above issues, we propose CrossGNN, a linear complexity GNN model to refine the cross-scale and cross-variable interaction for MTS. To deal with the unexpected noise in time dimension, an adaptive multi-scale identifier (AMSI) is leveraged to construct multi-scale time series with reduced noise. A Cross-Scale GNN is proposed to extract the scales with clearer trend and weaker noise. Cross-Variable GNN is proposed to utilize the homogeneity and heterogeneity between different variables. By simultaneously focusing on edges with higher saliency scores and constraining those edges with lower scores, the time and space complexity (i.e., $O(L)$) of CrossGNN can be linear with the input sequence length $L$. Extensive experimental results on 8 real-world MTS datasets demonstrate the effectiveness of CrossGNN compared with state-of-the-art methods.

BayesDAG: Gradient-Based Posterior Inference for Causal Discovery
Yashas Annadani Nick Pawlowski Joel Jennings Stefan Bauer Cheng Zhang Wenbo Gong



Research question: Address the computational challenges of Bayesian causal discovery: inferring the posterior distribution over causal models from observational data and quantifying epistemic uncertainty.
Motivation: Despite recent progress towards efficient posterior inference, existing methods are either limited to variational inference over node permutation matrices for linear causal models, compromising inference accuracy, or rely on continuous relaxations of adjacency matrices constrained by a DAG regularizer, which cannot guarantee that the resulting graphs are DAGs.
Method: We propose a scalable Bayesian causal discovery framework combining stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and variational inference (VI). It samples DAGs directly from the posterior without any DAG regularization, simultaneously draws function parameter samples, and applies to both linear and nonlinear causal models.
Results: Empirical evaluation shows the method outperforms state-of-the-art baselines on both synthetic and real-world datasets.

Bayesian causal discovery aims to infer the posterior distribution over causal models from observed data, quantifying epistemic uncertainty and benefiting downstream tasks. However, computational challenges arise due to joint inference over combinatorial space of Directed Acyclic Graphs (DAGs) and nonlinear functions. Despite recent progress towards efficient posterior inference over DAGs, existing methods are either limited to variational inference on node permutation matrices for linear causal models, leading to compromised inference accuracy, or continuous relaxation of adjacency matrices constrained by a DAG regularizer, which cannot ensure resulting graphs are DAGs. In this work, we introduce a scalable Bayesian causal discovery framework based on a combination of stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and Variational Inference (VI) that overcomes these limitations. Our approach directly samples DAGs from the posterior without requiring any DAG regularization, simultaneously draws function parameter samples and is applicable to both linear and nonlinear causal models. To enable our approach, we derive a novel equivalence to the permutation-based DAG learning, which opens up possibilities of using any relaxed gradient estimator defined over permutations. To our knowledge, this is the first framework applying gradient-based MCMC sampling for causal discovery. Empirical evaluation on synthetic and real-world datasets demonstrates our approach's effectiveness compared to state-of-the-art baselines.

GRAND-SLAMIN’ Interpretable Additive Modeling with Structural Constraints
Shibal Ibrahim Gabriel Isaac Afriat Kayhan Behdin Rahul Mazumder



Research question: How to improve the flexibility and interpretability of generalized additive models (GAMs) while retaining computational efficiency and good statistical properties?
Motivation: Existing methods face computational challenges when handling interaction terms and struggle to guarantee sparsity and interpretability.
Method: Propose the flexible GRAND-SLAMIN framework, which learns sparse GAMs with interactions in an end-to-end fashion, customizing first-order gradient-based optimization to perform sparse backpropagation for any differentiable loss function in a GPU-compatible manner.
Results: Experiments show the method compares favorably with popular toolkits in performance, variable selection, and scalability, while maintaining predictive accuracy competitive with non-interpretable black-box models.

Generalized Additive Models (GAMs) are a family of flexible and interpretable models with old roots in statistics. GAMs are often used with pairwise interactions to improve model accuracy while still retaining flexibility and interpretability but lead to computational challenges as we are dealing with order of $p^2$ terms. It is desirable to restrict the number of components (i.e., encourage sparsity) for easier interpretability, and better computational and statistical properties. Earlier approaches, considering sparse pairwise interactions, have limited scalability, especially when imposing additional structural interpretability constraints. We propose a flexible GRAND-SLAMIN framework that can learn GAMs with interactions under sparsity and additional structural constraints in a differentiable end-to-end fashion. We customize first-order gradient-based optimization to perform sparse backpropagation to exploit sparsity in additive effects for any differentiable loss function in a GPU-compatible manner. Additionally, we establish novel non-asymptotic prediction bounds for our estimators with tree-based shape functions. Numerical experiments on real-world datasets show that our toolkit performs favorably in terms of performance, variable selection and scalability when compared with popular toolkits to fit GAMs with interactions. Our work expands the landscape of interpretable modeling while maintaining prediction accuracy competitive with non-interpretable black-box models. Our code is available at https://github.com/mazumder-lab/grandslamin.

Deep Momentum Multi-Marginal Schrödinger Bridge
Tianrong Chen Guan-Horng Liu Molei Tao Evangelos Theodorou



Research question: How to reconstruct population dynamics from unlabeled samples taken at coarse time intervals.
Motivation: Existing flow-based or Schrödinger Bridge models infer sample trajectories that either fail to account for the underlying stochasticity or are unnecessarily rigid.
Method: Extend the Schrödinger Bridge to phase space and propose the Deep Momentum Multi-Marginal Schrödinger Bridge (DMSB), a novel computational framework that learns smooth measure-valued splines for stochastic systems satisfying position marginal constraints across time.
Results: Experiments show the algorithm significantly outperforms baselines on synthetic datasets and a real-world single-cell RNA sequence dataset. Moreover, when the ground-truth velocity exists but is inaccessible, the method can reasonably reconstruct the evolution of the velocity distribution from position snapshots alone.

It is a crucial challenge to reconstruct population dynamics using unlabeled samples from distributions at coarse time intervals. Recent approaches such as flow-based models or Schrödinger Bridge (SB) models have demonstrated appealing performance, yet the inferred sample trajectories either fail to account for the underlying stochasticity or are unnecessarily rigid. In this article, we extend SB into phase space and propose $\underline{D}$eep $\underline{M}$omentum Multi-Marginal $\underline{S}$chrödinger $\underline{B}$ridge (DMSB), a novel computational framework that learns the smooth measure-valued spline for stochastic systems that satisfy position marginal constraints across time. By tailoring the celebrated Bregman Iteration and extending the Iteration Proportional Fitting to phase space, we manage to handle high-dimensional multi-marginal trajectory inference tasks efficiently. Our algorithm outperforms baselines significantly, as evidenced by experiments for synthetic datasets and a real-world single-cell RNA sequence dataset. Additionally, the proposed approach can reasonably reconstruct the evolution of velocity distribution, from position snapshots only, when there is a ground truth velocity that is nevertheless inaccessible.

False Discovery Proportion control for aggregated Knockoffs
Alexandre Blain Bertrand Thirion Olivier Grisel Pierre Neuvial



Research question: How to perform effective variable selection in high-dimensional data while controlling the proportion of false discoveries.
Motivation: In scientific fields such as brain imaging or genomics, considering too many variables leads to poor models and high costs, hence the need for statistical guarantees on false positives.
Method: Propose KOPI, a new method based on Knockoff inference that controls the false discovery proportion; it also relies on a new type of aggregation to address the undesirable randomness associated with classical Knockoff inference.
Results: FDP control and substantial power gains over existing Knockoff-based methods are demonstrated in various simulation settings, with good sensitivity/specificity tradeoffs achieved on brain imaging data.

Controlled variable selection is an important analytical step in various scientific fields, such as brain imaging or genomics. In these high-dimensional data settings, considering too many variables leads to poor models and high costs, hence the need for statistical guarantees on false positives. Knockoffs are a popular statistical tool for conditional variable selection in high dimension. However, they control for the expected proportion of false discoveries (FDR) and not the actual proportion of false discoveries (FDP). We present a new method, KOPI, that controls the proportion of false discoveries for Knockoff-based inference. The proposed method also relies on a new type of aggregation to address the undesirable randomness associated with classical Knockoff inference. We demonstrate FDP control and substantial power gains over existing Knockoff-based methods in various simulation settings and achieve good sensitivity/specificity tradeoffs on brain imaging data.

Diffusion Schrödinger Bridge Matching
Yuyang Shi Valentin De Bortoli Andrew Campbell Arnaud Doucet



Research question: Solving transport problems, i.e., finding a map that transports one given distribution to another, which has numerous applications in machine learning.
Motivation: Novel mass transport methods motivated by generative modeling have recently been proposed; denoising diffusion models (DDMs) and flow matching models (FMMs) implement such transports through a stochastic differential equation (SDE) or an ordinary differential equation (ODE). However, while many applications would prefer to approximate the deterministic dynamic optimal transport (OT) map, DDMs and FMMs are not guaranteed to provide transports close to the OT map.
Method: We introduce Iterative Markovian Fitting (IMF), a new methodology for solving Schrödinger bridge (SB) problems, and Diffusion Schrödinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers various recent transport methods as special or limiting cases.
Results: We demonstrate the performance of DSBM on a variety of problems.

Solving transport problems, i.e. finding a map transporting one given distribution to another, has numerous applications in machine learning. Novel mass transport methods motivated by generative modeling have recently been proposed, e.g. Denoising Diffusion Models (DDMs) and Flow Matching Models (FMMs) implement such a transport through a Stochastic Differential Equation (SDE) or an Ordinary Differential Equation (ODE). However, while it is desirable in many applications to approximate the deterministic dynamic Optimal Transport (OT) map which admits attractive properties, DDMs and FMMs are not guaranteed to provide transports close to the OT map. In contrast, Schrödinger bridges (SBs) compute stochastic dynamic mappings which recover entropy-regularized versions of OT. Unfortunately, existing numerical methods approximating SBs either scale poorly with dimension or accumulate errors across iterations. In this work, we introduce Iterative Markovian Fitting (IMF), a new methodology for solving SB problems, and Diffusion Schrödinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers as special/limiting cases various recent transport methods. We demonstrate the performance of DSBM on a variety of problems.

Synthetic Combinations: A Causal Inference Framework for Combinatorial Interventions
Abhineet Agarwal Anish Agarwal Suhas Vijaykumar



Research question: How to learn unit-specific potential outcomes for any combination of multiple interventions across heterogeneous units.
Motivation: Choosing a combination of interventions arises naturally in applications such as factorial design experiments and recommendation engines. As the number of units and interventions grows, running enough experiments to estimate all parameters becomes expensive and/or infeasible. Moreover, observational data are likely confounded: whether a unit appears under a combination is correlated with its potential outcome under that combination.
Method: We study a novel model that imposes latent structure across both units and combinations of interventions. Specifically, we assume latent similarity in potential outcomes across units (the matrix of potential outcomes has rank approximately r) and regularity in how combinations of interventions interact (the coefficients of the Fourier expansion of the potential outcomes are approximately s-sparse). Despite unobserved confounding, we establish identification of all N x 2^p parameters, propose an estimation procedure, Synthetic Combinations, and establish finite-sample consistency under precise conditions on the observation pattern.
Results: We show that, given a total of poly(r) x (N + s^2 p) observations, Synthetic Combinations consistently estimates unit-specific potential outcomes. In comparison, previous methods that do not exploit structure across both units and combinations have poorer sample complexity, scaling as min(N x s^2 p, r x (N + 2^p)).

We consider a setting where there are $N$ heterogeneous units and $p$ interventions. Our goal is to learn unit-specific potential outcomes for any combination of these $p$ interventions, i.e., $N \times 2^p$ causal parameters. Choosing a combination of interventions is a problem that naturally arises in a variety of applications such as factorial design experiments and recommendation engines (e.g., showing a set of movies that maximizes engagement for a given user). Running $N \times 2^p$ experiments to estimate the various parameters is likely expensive and/or infeasible as $N$ and $p$ grow. Further, with observational data there is likely confounding, i.e., whether or not a unit is seen under a combination is correlated with its potential outcome under that combination. We study this problem under a novel model that imposes latent structure across both units and combinations of interventions. Specifically, we assume latent similarity in potential outcomes across units (i.e., the matrix of potential outcomes is approximately rank $r$) and regularity in how combinations of interventions interact (i.e., the coefficients in the Fourier expansion of the potential outcomes is approximately $s$ sparse). We establish identification for all $N \times 2^p$ parameters despite unobserved confounding. We propose an estimation procedure, Synthetic Combinations, and establish finite-sample consistency under precise conditions on the observation pattern. We show that Synthetic Combinations is able to consistently estimate unit-specific potential outcomes given a total of $\text{poly}(r) \times \left( N + s^2p\right)$ observations. In comparison, previous methods that do not exploit structure across both units and combinations have poorer sample complexity scaling as $\min(N \times s^2p, \ \ r \times (N + 2^p))$.
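
To make the sparsity assumption concrete, the sketch below builds the Boolean Fourier (parity) design matrix over intervention combinations, under which a unit's potential outcomes are an approximately $s$-sparse linear function; truncating at a low interaction order is an illustrative assumption, not the paper's estimator.

    import numpy as np
    from itertools import combinations

    def fourier_design(Z, max_order=2):
        """Boolean Fourier features for intervention combinations:
        y(z) = sum over subsets S of c_S * prod_{i in S} chi(z_i),
        with chi mapping {0, 1} to {-1, +1}; sparsity in c_S is the
        regularity assumption above.
        Z: (n, p) binary matrix, one intervention combination per row."""
        X = 2.0 * Z - 1.0                      # {0,1} -> {-1,+1}
        n, p = X.shape
        cols = [np.ones(n)]                    # empty-set (constant) feature
        for order in range(1, max_order + 1):
            for S in combinations(range(p), order):
                cols.append(X[:, S].prod(axis=1))
        return np.column_stack(cols)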

Learning World Models with Identifiable Factorization
Yu-Ren Liu Biwei Huang Zhengmao Zhu Honglong Tian Mingming Gong Yang Yu Kun Zhang



Research question: How to effectively extract and disentangle the different kinds of information in high-dimensional, noisy, and non-stationary environments for efficient reinforcement learning.
Motivation: Different categories of information coexist in such environments; effectively extracting and disentangling them is a challenging problem.
Method: This paper proposes IFactor, a framework that models four distinct categories of latent state variables, capturing different aspects of information in the RL system based on their interactions with actions and rewards.
Results: Experiments show the method accurately identifies ground-truth latent variables and outperforms baselines in variants of the DeepMind Control Suite and RoboDesk.

Extracting a stable and compact representation of the environment is crucial for efficient reinforcement learning in high-dimensional, noisy, and non-stationary environments. Different categories of information coexist in such environments -- how to effectively extract and disentangle the information remains a challenging problem. In this paper, we propose IFactor, a general framework to model four distinct categories of latent state variables that capture various aspects of information within the RL system, based on their interactions with actions and rewards. Our analysis establishes block-wise identifiability of these latent variables, which not only provides a stable and compact representation but also discloses that all reward-relevant factors are significant for policy learning. We further present a practical approach to learning the world model with identifiable blocks, ensuring the removal of redundancies but retaining minimal and sufficient information for policy optimization. Experiments in synthetic worlds demonstrate that our method accurately identifies the ground-truth latent variables, substantiating our theoretical findings. Moreover, experiments in variants of the DeepMind Control Suite and RoboDesk showcase the superior performance of our approach over baselines.

Identification of Nonlinear Latent Hierarchical Models
Lingjing Kong Biwei Huang Feng Xie Eric Xing Yuejie Chi Kun Zhang



Research question: How to identify latent variables and causal structure from observational data, particularly when observed variables are generated by causally related latent variables through nonlinear relationships.
Motivation: Identifying latent variables and causal structure from observational data is essential in many real-world applications involving biological data, medical data, and unstructured data such as images and language; the task is highly challenging when observed variables are generated by causally related latent variables and the relationships are nonlinear.
Method: This work studies the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables and some latent variables may have no observed children. Causal structure and latent variables (up to invertible transformations) are shown to be identifiable under mild assumptions: on the causal structure, multiple paths are allowed between any pair of variables, relaxing the latent tree assumption of prior work; on the structural functions, general nonlinearity and multi-dimensional continuous variables are permitted, alleviating the parametric assumptions of existing work.
Results: A novel identification criterion is first developed that provides identifiability guarantees for an elementary latent variable model. Leveraging this criterion, both the causal structure and the latent variables of the hierarchical model are shown to be asymptotically identifiable via an explicitly constructed estimation procedure. To our knowledge, this is the first work to establish identifiability guarantees for both causal structure and latent variables in nonlinear latent hierarchical models.

Identifying latent variables and causal structures from observational data is essential to many real-world applications involving biological data, medical data, and unstructured data such as images and languages. However, this task can be highly challenging, especially when observed variables are generated by causally related latent variables and the relationships are nonlinear. In this work, we investigate the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables, and some latent variables may not have observed children. We show that the identifiability of causal structures and latent variables (up to invertible transformations) can be achieved under mild assumptions: on causal structures, we allow for multiple paths between any pair of variables in the graph, which relaxes latent tree assumptions in prior work; on structural functions, we permit general nonlinearity and multi-dimensional continuous variables, alleviating existing work's parametric assumptions. Specifically, we first develop an identification criterion in the form of novel identifiability guarantees for an elementary latent variable model. Leveraging this criterion, we show that both causal structures and latent variables of the hierarchical model can be identified asymptotically by explicitly constructing an estimation procedure. To the best of our knowledge, our work is the first to establish identifiability guarantees for both causal structures and latent variables in nonlinear latent hierarchical models.

Flow Factorized Representation Learning
Yue Song T. Anderson Keller Nicu Sebe Max Welling



Research question: How to factorize representations in a useful manner with respect to the ground-truth factors of variation.
Motivation: Existing disentangled and equivariant representation learning methods often prove ill-specified or insufficiently flexible to separate all realistic factors of interest in practice.
Method: Propose a new structured representation learning approach, Flow Factorized Representation Learning, which introduces a generative model that specifies a distinct set of latent probability paths defining different input transformations; each latent flow is generated by the gradient field of a learned potential following dynamic optimal transport.
Results: Experiments show the method achieves higher likelihoods on standard representation learning benchmarks while being closer to approximately equivariant models. Moreover, the learned transformations are flexibly composable and extrapolate to new data, indicating robustness and generalizability that approach the goal of usefully factorized representation learning.

A prominent goal of representation learning research is to achieve representations which are factorized in a useful manner with respect to the ground truth factors of variation. The fields of disentangled and equivariant representation learning have approached this ideal from a range of complementary perspectives; however, to date, most approaches have proven to either be ill-specified or insufficiently flexible to effectively separate all realistic factors of interest in a learned latent space. In this work, we propose an alternative viewpoint on such structured representation learning which we call Flow Factorized Representation Learning, and demonstrate it to learn both more efficient and more usefully structured representations than existing frameworks. Specifically, we introduce a generative model which specifies a distinct set of latent probability paths that define different input transformations. Each latent flow is generated by the gradient field of a learned potential following dynamic optimal transport. Our novel setup brings new understandings to both \textit{disentanglement} and \textit{equivariance}. We show that our model achieves higher likelihoods on standard representation learning benchmarks while simultaneously being closer to approximately equivariant models. Furthermore, we demonstrate that the transformations learned by our model are flexibly composable and can also extrapolate to new data, implying a degree of robustness and generalizability approaching the ultimate goal of usefully factorized representation learning.

Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off
Zichen Zhang Johannes Kirschner Junxi Zhang Francesco Zanini Alex Ayoub Masood Dehghan Dale Schuurmans



Research question: This paper addresses the impact on continuous-time systems of the default assumption in reinforcement learning and optimal control that observations arrive at discrete time points on a fixed clock cycle.
Motivation: Many applications involve continuous-time systems where the time discretization can, in principle, be managed. Existing theory has not fully characterized the effect of time discretization on RL methods, and a more detailed analysis could reveal opportunities for improving data efficiency.
Method: Analyze Monte-Carlo policy evaluation for LQR systems, uncovering a fundamental trade-off between approximation error and statistical error in value estimation.
Results: The analysis shows that with finite data, managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems. The trade-off is demonstrated empirically in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.

A default assumption in reinforcement learning (RL) and optimal control is that observations arrive at discrete time points on a fixed clock cycle. Yet, many applications involve continuous-time systems where the time discretization, in principle, can be managed. The impact of time discretization on RL methods has not been fully characterized in existing theory, but a more detailed analysis of its effect could reveal opportunities for improving data-efficiency. We address this gap by analyzing Monte-Carlo policy evaluation for LQR systems and uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently to time discretization, leading to an optimal choice of temporal resolution for a given data budget. These findings show that managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems with finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.
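
As a toy illustration of the discretization choice the paper analyzes, the sketch below runs Monte-Carlo value estimation for a scalar linear system at a given temporal resolution dt; the dynamics, cost, and all names are our own assumptions, not the paper's experimental setup.

import numpy as np

def mc_value_estimate(dt, horizon=5.0, n_traj=20, a=-0.5, sigma=0.3, seed=0):
    # Monte-Carlo return estimate for dx = a*x dt + sigma dW with a
    # quadratic running cost, discretized at temporal resolution dt.
    # Finer dt lowers the discretization (approximation) error, but for a
    # fixed data budget it also changes the statistical error of the estimate.
    rng = np.random.default_rng(seed)
    steps = int(horizon / dt)
    returns = []
    for _ in range(n_traj):
        x, ret = 1.0, 0.0
        for _ in range(steps):
            ret += x ** 2 * dt
            x += a * x * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        returns.append(ret)
    return np.mean(returns)

for dt in (0.5, 0.1, 0.02):
    print(dt, mc_value_estimate(dt))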

Neural Frailty Machine: Beyond proportional hazard assumption in neural survival regressions
Ruofan Wu Jiawei Qiao Mingzhe Wu Wen Yu Ming Zheng Tengfei LIU Tianyi Zhang Weiqiang Wang



Research question: Develop a powerful and flexible neural modeling framework for survival regression.
Motivation: Existing survival models cannot handle nonlinear covariate dependence effectively, whereas the strong approximation power of neural architectures can.
Method: Propose the neural frailty machine (NFM) framework, which uses the classical idea of multiplicative frailty in survival analysis to extend the proportional hazard assumption while leveraging neural architectures to handle nonlinear covariate dependence.
Results: The proposed NFM models are validated theoretically and empirically; evaluations on 6 benchmark datasets of different scales show predictive performance comparable to or surpassing state-of-the-art survival models.

We present neural frailty machine (NFM), a powerful and flexible neural modeling framework for survival regressions. The NFM framework utilizes the classical idea of multiplicative frailty in survival analysis as a principled way of extending the proportional hazard assumption, at the same time being able to leverage the strong approximation power of neural architectures for handling nonlinear covariate dependence. Two concrete models are derived under the framework that extend neural proportional hazard models and nonparametric hazard regression models. Both models allow efficient training under the likelihood objective. Theoretically, for both proposed models, we establish statistical guarantees of neural function approximation with respect to nonparametric components via characterizing their rate of convergence. Empirically, we provide synthetic experiments that verify our theoretical statements. We also conduct experimental evaluations over $6$ benchmark datasets of different scales, showing that the proposed NFM models achieve predictive performance comparable to or sometimes surpassing state-of-the-art survival models. Our code is publicly available at https://github.com/Rorschach1989/nfm.
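
For concreteness, a minimal sketch of the multiplicative-frailty form that the framework extends (our notation, not the paper's): with covariates $x$, a frailty variable $Z$, a baseline hazard $\lambda_0$, and a neural network $m$,

\[ \lambda(t \mid x, Z) = Z \, \lambda_0(t) \, e^{m(x)}, \qquad Z \sim G, \quad \mathbb{E}[Z] = 1, \]

so that marginalizing over $Z$ relaxes the proportional hazard assumption while $m$ captures nonlinear covariate dependence.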

Optimal Treatment Regimes for Proximal Causal Learning
Tao Shen Yifan Cui



Research question: When policymakers draw causal inferences and make decisions from observational data, the measured covariates are often insufficiently rich to account for all sources of confounding.
Motivation: The recently proposed proximal causal inference framework shows that proxy variables, which abound in real-life scenarios, can be leveraged to identify causal effects and thereby facilitate decision-making.
Method: Building on this line of work, propose a novel optimal individualized treatment regime based on so-called outcome and treatment confounding bridges.
Results: Theoretical guarantees include identification, superiority, an excess value bound, and consistency of the estimated regime; the proposed optimal regime is further demonstrated via numerical experiments and a real-data application.

A common concern when a policymaker draws causal inferences from and makes decisions based on observational data is that the measured covariates are insufficiently rich to account for all sources of confounding, i.e., the standard unconfoundedness assumption fails to hold. The recently proposed proximal causal inference framework shows that proxy variables that abound in real-life scenarios can be leveraged to identify causal effects and therefore facilitate decision-making. Building upon this line of work, we propose a novel optimal individualized treatment regime based on so-called outcome and treatment confounding bridges. We then show that the value function of this new optimal treatment regime is superior to that of existing ones in the literature. Theoretical guarantees, including identification, superiority, excess value bound, and consistency of the estimated regime, are established. Furthermore, we demonstrate the proposed optimal regime via numerical experiments and a real data application.

Designing Robust Transformers using Robust Kernel Density Estimation
Xing Han Tongzheng Ren Tan Minh Nguyen Khai Nguyen Joydeep Ghosh Nhat Ho



Research question: Existing transformer models focus mainly on predictive accuracy and computational cost, with insufficient attention to robustness against adversarial attacks and data contamination.
Motivation: Re-interpreting the self-attention mechanism as a non-parametric kernel density estimator allows classical robust kernel density estimation methods to be adapted into new classes of transformers resistant to adversarial attacks and data contamination.
Method: First, down-weight outliers in the reproducing kernel Hilbert space (RKHS) when computing the self-attention operations; then leverage the median-of-means principle to obtain another efficient approach that noticeably improves performance and robustness on language modeling and time-series classification tasks.
Results: Experiments show improved performance over state-of-the-art methods, particularly on image data under adversarial attacks; the methods can be combined with existing transformers to augment their robustness, promising impact on a wide variety of applications.

Transformer-based architectures have recently exhibited remarkable successes across different domains beyond just powering large language models. However, existing approaches typically focus on predictive accuracy and computational cost, largely ignoring certain other practical issues such as robustness to contaminated samples. In this paper, by re-interpreting the self-attention mechanism as a non-parametric kernel density estimator, we adapt classical robust kernel density estimation methods to develop novel classes of transformers that are resistant to adversarial attacks and data contamination. We first propose methods that down-weight outliers in RKHS when computing the self-attention operations. We empirically show that these methods produce improved performance over existing state-of-the-art methods, particularly on image data under adversarial attacks. Then we leverage the median-of-means principle to obtain another efficient approach that results in noticeably enhanced performance and robustness on language modeling and time series classification tasks. Our methods can be combined with existing transformers to augment their robust properties, thus promising to impact a wide variety of applications.
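
To make the KDE view concrete, here is a minimal numpy sketch of one way to down-weight outlier keys in a kernel-density reading of attention; the Gaussian kernel, the density-based weighting rule, and all names are our assumptions, not the paper's exact estimators.

import numpy as np

def robust_kernel_attention(Q, K, V, h=1.0, eps=0.3):
    # Attention as Nadaraya-Watson / KDE: each query averages values,
    # weighted by a kernel over query-key distances, with low-density
    # (outlier) keys down-weighted before row normalization.
    sq = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    G = np.exp(-sq / (2 * h ** 2))                        # Gaussian kernel
    density = G.mean(0)                                   # KDE value per key
    w = np.minimum(1.0, density / (eps * density.max()))  # shrink outlier keys
    A = G * w[None, :]
    A = A / A.sum(1, keepdims=True)                       # row-stochastic
    return A @ V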

iSCAN: Identifying Causal Mechanism Shifts among Nonlinear Additive Noise Models
Tianyu Chen Kevin Bello Bryon Aragam Pradeep Kumar Ravikumar



Research question: How to identify causal mechanism shifts of variables across two or more related datasets without estimating the full DAG structure of each SCM.
Motivation: In many situations the goal is to localize causal mechanism changes between related datasets rather than learn the full causal structure of each individual dataset.
Method: Assuming each SCM belongs to the class of nonlinear additive noise models (ANMs), identify shifts via the Jacobian of the score function of the mixture distribution; once the shifted variables are identified, recent work can be leveraged to estimate their structural differences.
Results: Experiments on synthetic and real-world data showcase the applicability of the approach; open-source code is available at https://github.com/kevinsbello/iSCAN.

Structural causal models (SCMs) are widely used in various disciplines to represent causal relationships among variables in complex systems. Unfortunately, the underlying causal structure is often unknown, and estimating it from data remains a challenging task. In many situations, however, the end goal is to localize the changes (shifts) in the causal mechanisms between related datasets instead of learning the full causal structure of the individual datasets. Some applications include root cause analysis, analyzing gene regulatory network structure changes between healthy and cancerous individuals, or explaining distribution shifts. This paper focuses on identifying the causal mechanism shifts in two or more related datasets over the same set of variables---*without estimating the entire DAG structure of each SCM*. Prior work under this setting assumed linear models with Gaussian noises; instead, in this work we assume that each SCM belongs to the more general class of *nonlinear* additive noise models (ANMs). A key technical contribution of this work is to show that the Jacobian of the score function for the *mixture distribution* allows for the identification of shifts under general non-parametric functional mechanisms. Once the shifted variables are identified, we leverage recent work to estimate the structural differences, if any, for the shifted variables. Experiments on synthetic and real-world data are provided to showcase the applicability of this approach. Code implementing the proposed method is open-source and publicly available at https://github.com/kevinsbello/iSCAN.

Efficient Robust Bayesian Optimization for Arbitrary Uncertain inputs
Lin Yang Junlong Lyu Wenlong Lyu Zhitang Chen



Research question: Address the performance fluctuations in Bayesian optimization caused by input uncertainty.
Motivation: In challenging BO tasks, inputs are uncertain due to inevitable randomness in the optimization process (e.g., machining errors, execution noise, or contextual variability), which causes large performance fluctuations in the final result.
Method: Propose a novel robust Bayesian optimization algorithm, AIRBO, which directly models uncertain inputs of arbitrary distributions by empowering the Gaussian process with the Maximum Mean Discrepancy (MMD) and further accelerates posterior inference via Nystrom approximation.
Results: A rigorous theoretical regret bound is established under the MMD estimation error, and extensive experiments on synthetic functions and real problems show that the method handles various input uncertainties and achieves state-of-the-art performance.

Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty deviates the input from the intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nystrom approximation. A rigorous theoretical regret bound is established under the MMD estimation error, and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieves state-of-the-art performance.

Markovian Sliced Wasserstein Distances: Beyond Independent Projections
Khai Nguyen Tongzheng Ren Nhat Ho



Research question: The sliced Wasserstein distance suffers from redundant projections due to independent uniform random projecting directions, and the metricity of optimization-based variants cannot be guaranteed because of the non-optimality of the optimization.
Motivation: To address this, propose a new family of SW distances, the Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions.
Method: Discuss various members of MSW by specifying the Markov structure, including the prior distribution, the transition distribution, and the burning and thinning technique; investigate theoretical properties of MSW, including topological properties (metricity, weak convergence, and connections to other distances), statistical properties (sample complexity and Monte Carlo estimation error), and computational properties (computational and memory complexity).
Results: MSW distances are compared with previous SW variants in applications such as gradient flows, color transfer, and deep generative modeling, demonstrating their favorable performance.

Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, the max-K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$) seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of the Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of the MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of the MSW.
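
A minimal numpy sketch of the Markovian slicing idea: successive projection directions form a first-order Markov chain on the sphere instead of being drawn independently. The Gaussian-perturbation transition and all names are assumptions for illustration, not the paper's exact constructions.

import numpy as np

def sw1d(u, v, p=2):
    # 1D Wasserstein-p between empirical measures of equal sample size
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p) ** (1.0 / p)

def markovian_sw(X, Y, L=50, kappa=10.0, p=2, seed=0):
    # Directions follow a Markov chain: perturb the previous direction
    # (transition kernel is an assumption) and renormalize to the sphere.
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(X.shape[1])
    theta /= np.linalg.norm(theta)
    total = 0.0
    for _ in range(L):
        total += sw1d(X @ theta, Y @ theta, p) ** p
        theta = theta + rng.standard_normal(theta.size) / kappa
        theta /= np.linalg.norm(theta)
    return (total / L) ** (1.0 / p)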

Energy-Based Sliced Wasserstein Distance
Khai Nguyen Nhat Ho



Research question: Propose a parameter-free energy-based slicing distribution that addresses the limitations of the two existing ways of choosing the slicing distribution for sliced Wasserstein distances.
Motivation: The existing choices have limitations: a fixed prior distribution is non-informative, while optimizing for the best distribution is expensive and unstable, and its parametric candidate family is easily misspecified.
Method: Design the slicing distribution as a parameter-free energy-based distribution whose density is proportional to an energy function of the projected one-dimensional Wasserstein distance, yielding a new variant, the energy-based sliced Wasserstein (EBSW) distance.
Results: The topological, statistical, and computational properties of EBSW are studied via importance sampling, sampling importance resampling, and Markov chain methods; experiments on point-cloud gradient flows, color transfer, and point-cloud reconstruction show favorable performance.

The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best distribution which belongs to a parametric family of distributions and can maximize the expected distance. However, both approaches have their limitations. A fixed prior distribution is non-informative in terms of highlighting projecting directions that can discriminate two general probability measures. Doing optimization for the best distribution is often expensive and unstable. Moreover, designing the parametric family of the candidate distribution could be easily misspecified. To address the issues, we propose to design the slicing distribution as an energy-based distribution that is parameter-free and has the density proportional to an energy function of the projected one-dimensional Wasserstein distance. We then derive a novel sliced Wasserstein variant, energy-based sliced Wasserstein (EBSW) distance, and investigate its topological, statistical, and computational properties via importance sampling, sampling importance resampling, and Markov Chain methods. Finally, we conduct experiments on point-cloud gradient flow, color transfer, and point-cloud reconstruction to show the favorable performance of the EBSW.
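
A rough importance-sampling sketch of the energy-based slicing idea, assuming the exponential energy f(w) = exp(w) and equal sample sizes; names and details are ours, not the paper's estimators.

import numpy as np

def w1d(u, v, p=2):
    # 1D Wasserstein-p between empirical measures of equal sample size
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p) ** (1.0 / p)

def ebsw_importance_sampling(X, Y, L=100, p=2, seed=0):
    # Draw directions uniformly on the sphere, then re-weight each with
    # density proportional to an energy function of its projected distance.
    rng = np.random.default_rng(seed)
    thetas = rng.standard_normal((L, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    w = np.array([w1d(X @ t, Y @ t, p) ** p for t in thetas])
    iw = np.exp(w - w.max())            # energy weights, stabilized
    return (iw @ w / iw.sum()) ** (1.0 / p)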

Estimating Riemannian Metric with Noise-Contaminated Intrinsic Distance
Jiaming Qiu Xiongtao Dai



Research question: How to extend metric learning by studying the Riemannian manifold structure of the underlying data space induced by similarity measures between data points.
Motivation: The Riemannian metric characterizes the Riemannian geometry and defines straight lines and derivatives on the manifold; estimating it gives insight into the underlying manifold and enables computing geometric features such as geodesic curves.
Method: Model observed similarity measures as noisy responses generated from a function of the intrinsic geodesic distance between data points, and propose a new local regression approach to learn the Riemannian metric tensor and its derivatives based on a Taylor expansion of the squared geodesic distances, accommodating continuous, binary, or comparative responses.
Results: Convergence rates for the asymptotic bias and variance of the estimated metric tensor provide a theoretical foundation; the method proves versatile in simulation studies and real-data applications involving New York City taxi trip times and MNIST digits.

We extend metric learning by studying the Riemannian manifold structure of the underlying data space induced by similarity measures between data points. The key quantity of interest here is the Riemannian metric, which characterizes the Riemannian geometry and defines straight lines and derivatives on the manifold. Being able to estimate the Riemannian metric allows us to gain insights into the underlying manifold and compute geometric features such as the geodesic curves. We model the observed similarity measures as noisy responses generated from a function of the intrinsic geodesic distance between data points. A new local regression approach is proposed to learn the Riemannian metric tensor and its derivatives based on a Taylor expansion for the squared geodesic distances, accommodating different types of data such as continuous, binary, or comparative responses. We develop theoretical foundation for our method by deriving the rates of convergence for the asymptotic bias and variance of the estimated metric tensor. The proposed method is shown to be versatile in simulation studies and real data applications involving taxi trip time in New York City and MNIST digits.

Gradient-Free Kernel Stein Discrepancy
Matthew A Fisher Chris J. Oates



Research question: Make Stein discrepancies practical for complex statistical models where the stable numerical computation of derivatives is difficult.
Motivation: For complex statistical models, stably computing derivatives can require bespoke algorithmic development, rendering Stein discrepancies impractical in applications.
Method: Introduce a collection of non-canonical Stein discrepancies that are gradient-free, and establish sufficient conditions for convergence detection and control.
Results: The method is applied to sampling and variational inference, providing a new tool for handling complex statistical models.

Stein discrepancies have emerged as a powerful statistical tool, being applied to fundamental statistical problems including parameter inference, goodness-of-fit testing, and sampling. The canonical Stein discrepancies require the derivatives of a statistical model to be computed, and in return provide theoretical guarantees of convergence detection and control. However, for complex statistical models, the stable numerical computation of derivatives can require bespoke algorithmic development and render Stein discrepancies impractical. This paper focuses on posterior approximation using Stein discrepancies, and introduces a collection of non-canonical Stein discrepancies that are gradient-free, meaning that derivatives of the statistical model are not required. Sufficient conditions for convergence detection and control are established, and applications to sampling and variational inference are presented.

Fast Exact Leverage Score Sampling from Khatri-Rao Products with Applications to Tensor Decomposition
Vivek Bharadwaj Osman Asif Malik Riley Murray Laura Grigori Aydin Buluc James Demmel



Research question: How to randomly sample rows from the Khatri-Rao product of several matrices according to the exact distribution of its leverage scores.
Motivation: Existing methods cannot sample effectively and efficiently when the matrices forming the Khatri-Rao product have tens of millions of rows each.
Method: Propose a data structure that samples rows according to the exact leverage-score distribution of the Khatri-Rao product, drawing each row in time logarithmic in the height of the product and quadratic in its column count, with persistent space overhead at most the size of the input matrices.
Results: Experiments on billion-scale sparse tensors and synthetic data show lower asymptotic complexity per solve than recent state-of-the-art methods, with higher accuracy as the decomposition rank grows.

We present a data structure to randomly sample rows from the Khatri-Rao product of several matrices according to the exact distribution of its leverage scores. Our proposed sampler draws each row in time logarithmic in the height of the Khatri-Rao product and quadratic in its column count, with persistent space overhead at most the size of the input matrices. As a result, it tractably draws samples even when the matrices forming the Khatri-Rao product have tens of millions of rows each. When used to sketch the linear least-squares problems arising in Candecomp / PARAFAC decomposition, our method achieves lower asymptotic complexity per solve than recent state-of-the-art methods. Experiments on billion-scale sparse tensors and synthetic data validate our theoretical claims, with our algorithm achieving higher accuracy than competing methods as the decomposition rank grows.
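
For reference, the target distribution can be computed naively on a small instance by materializing the Khatri-Rao product, which is exactly what the paper's data structure avoids at scale; this sketch only defines the sampling problem, with all names ours.

import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: (I*J) x R
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def leverage_scores(M):
    # Exact leverage scores: squared row norms of an orthonormal basis of col(M)
    Q, _ = np.linalg.qr(M)
    return (Q ** 2).sum(axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
B = rng.standard_normal((5, 3))
KR = khatri_rao(A, B)
p = leverage_scores(KR)
p /= p.sum()                                   # leverage-score distribution
rows = rng.choice(KR.shape[0], size=4, p=p)    # sample rows from it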

PCF-GAN: generating sequential data via the characteristic function of measures on the path space
Hang Lou Siran Li Hao Ni



Research question: How to use generative adversarial networks (GANs) to generate high-fidelity time-series data, in particular capturing the temporal dependence of the joint probability distributions induced by time series.
Motivation: Because such temporal dependence is hard to capture, generating high-fidelity time-series data with GANs remains challenging.
Method: Propose PCF-GAN, a novel GAN that incorporates the path characteristic function (PCF) as a principled representation of the time-series distribution into the discriminator to improve generative performance; establish the theoretical foundations of the PCF distance, design efficient initialisation and optimisation schemes for PCFs to strengthen discriminative power and training efficiency, and integrate an auto-encoder structure via sequential embedding for additional reconstruction functionality.
Results: Extensive numerical experiments on various datasets show that PCF-GAN consistently outperforms state-of-the-art baselines in both generation and reconstruction quality.

Generating high-fidelity time series data using generative adversarial networks (GANs) remains a challenging task, as it is difficult to capture the temporal dependence of joint probability distributions induced by time-series data. Towards this goal, a key step is the development of an effective discriminator to distinguish between time series distributions. We propose the so-called PCF-GAN, a novel GAN that incorporates the path characteristic function (PCF) as the principled representation of time series distribution into the discriminator to enhance its generative performance. On the one hand, we establish theoretical foundations of the PCF distance by proving its characteristicity, boundedness, differentiability with respect to generator parameters, and weak continuity, which ensure the stability and feasibility of training the PCF-GAN. On the other hand, we design efficient initialisation and optimisation schemes for PCFs to strengthen the discriminative power and accelerate training efficiency. To further boost the capabilities of complex time series generation, we integrate the auto-encoder structure via sequential embedding into the PCF-GAN, which provides additional reconstruction functionality. Extensive numerical experiments on various datasets demonstrate the consistently superior performance of PCF-GAN over state-of-the-art baselines, in both generation and reconstruction quality.

SNEkhorn: Dimension Reduction with Symmetric Entropic Affinities
Hugues Van Assel Titouan Vayer Rémi Flamary Nicolas Courty



Research question: How to encode the similarities between samples of a dataset with a weighted graph while guaranteeing robustness to heterogeneous sampling densities.
Motivation: Existing methods symmetrize entropic affinities heuristically, violating the row-wise constant entropy and stochasticity properties, so a natural, efficient symmetrization is needed.
Method: Characterize entropic affinities (EAs) as an optimal transport problem, enabling a natural symmetrization that can be computed efficiently using dual ascent.
Results: The new affinity matrix gains the clustering advantages of symmetric doubly stochastic normalization while effectively controlling the entropy of each row, making it robust to varying noise levels; the new DR algorithm SNEkhorn built on it clearly outperforms state-of-the-art approaches on both synthetic and real-world datasets.

Many approaches in machine learning rely on a weighted graph to encode the similarities between samples in a dataset. Entropic affinities (EAs), which are notably used in the popular Dimensionality Reduction (DR) algorithm t-SNE, are particular instances of such graphs. To ensure robustness to heterogeneous sampling densities, EAs assign a kernel bandwidth parameter to every sample in such a way that the entropy of each row in the affinity matrix is kept constant at a specific value, whose exponential is known as perplexity. EAs are inherently asymmetric and row-wise stochastic, but they are used in DR approaches after undergoing heuristic symmetrization methods that violate both the row-wise constant entropy and stochasticity properties. In this work, we uncover a novel characterization of EA as an optimal transport problem, allowing a natural symmetrization that can be computed efficiently using dual ascent. The corresponding novel affinity matrix derives advantages from symmetric doubly stochastic normalization in terms of clustering performance, while also effectively controlling the entropy of each row thus making it particularly robust to varying noise levels. Following, we present a new DR algorithm, SNEkhorn, that leverages this new affinity matrix. We show its clear superiority to state-of-the-art approaches with several indicators on both synthetic and real-world datasets.
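
A minimal sketch of the optimal-transport ingredient: projecting a Gibbs kernel onto symmetric doubly stochastic matrices with a damped symmetric Sinkhorn iteration. SNEkhorn's per-row entropy constraints and dual-ascent solver are omitted here; names are ours.

import numpy as np

def symmetric_sinkhorn_affinity(D, eps=1.0, iters=500):
    # D: symmetric matrix of pairwise squared distances.
    # Returns P = diag(u) K diag(u) with all row sums equal to 1.
    K = np.exp(-D / eps)
    u = np.ones(len(K))
    for _ in range(iters):
        u = np.sqrt(u / (K @ u))   # damped fixed point: u * (K u) -> 1
    return u[:, None] * K * u[None, :]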

Kernel Stein Discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization
Clement Benard Brian Staber Sébastien Da Veiga



Research question: Provide a theoretical analysis of the Stein thinning algorithm to address the pathologies it exhibits in practice.
Motivation: Stein thinning is a promising MCMC post-processing method, but it suffers from several empirical pathologies that can result in poor approximations.
Method: Analyze these pathologies theoretically to clearly identify the mechanisms at stake and suggest improved strategies, and introduce the regularized Stein thinning algorithm to alleviate them.
Results: Theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm; a Python and JAX implementation is available at https://gitlab.com/drti/kernax.

Stein thinning is a promising algorithm proposed by (Riabiz et al., 2022) for post-processing outputs of Markov chain Monte Carlo (MCMC). The main principle is to greedily minimize the kernelized Stein discrepancy (KSD), which only requires the gradient of the log-target distribution, and is thus well-suited for Bayesian inference. The main advantages of Stein thinning are the automatic removal of the burn-in period, the correction of the bias introduced by recent MCMC algorithms, and the asymptotic properties of convergence towards the target distribution. Nevertheless, Stein thinning suffers from several empirical pathologies, which may result in poor approximations, as observed in the literature. In this article, we conduct a theoretical analysis of these pathologies, to clearly identify the mechanisms at stake, and suggest improved strategies. Then, we introduce the regularized Stein thinning algorithm to alleviate the identified pathologies. Finally, theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm. An implementation of regularized Stein thinning, the kernax library in Python and JAX, is available at https://gitlab.com/drti/kernax.
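
A compact sketch of the greedy KSD-minimization loop that Stein thinning is built on, using the Langevin Stein kernel over an IMQ base kernel; the regularization term the paper adds is omitted and all names are ours.

import numpy as np

def stein_kernel(x, y, sx, sy, c2=1.0, beta=-0.5):
    # Langevin Stein kernel on the IMQ base kernel k(x,y) = (c2 + ||x-y||^2)^beta,
    # where sx, sy are the target log-density gradients (scores) at x and y.
    r = x - y
    r2 = float(r @ r)
    u = c2 + r2
    d = x.size
    k = u ** beta
    gx = 2 * beta * u ** (beta - 1) * r          # grad_x k
    gy = -gx                                     # grad_y k
    tr = -4 * beta * (beta - 1) * u ** (beta - 2) * r2 - 2 * beta * d * u ** (beta - 1)
    return tr + gx @ sy + gy @ sx + k * (sx @ sy)

def stein_thinning(samples, scores, m):
    # Greedily pick m points minimizing the KSD of the selected set.
    n = len(samples)
    K = np.array([[stein_kernel(samples[i], samples[j], scores[i], scores[j])
                   for j in range(n)] for i in range(n)])
    idx = [int(np.argmin(np.diag(K)))]
    for _ in range(m - 1):
        obj = np.diag(K) + 2 * K[:, idx].sum(axis=1)
        idx.append(int(np.argmin(obj)))
    return idx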

Sharp Calibrated Gaussian Processes
Alexandre Capone Sandra Hirche Geoff Pleiss



Research question: Gaussian processes are widely used in engineering and scientific applications, but their uncertainty estimates do not satisfy frequentist guarantees and can be miscalibrated in practice.
Motivation: State-of-the-art calibration approaches inflate the Gaussian process posterior variance, which yields confidence intervals that are potentially too coarse.
Method: Generate predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance but with a different set of hyperparameters chosen to satisfy an empirical calibration constraint; the resulting approach is considerably more flexible than existing ones and is optimized to yield tight predictive quantiles.
Results: The approach yields a calibrated model under reasonable assumptions and outperforms existing approaches in sharpness when employed for calibrated regression.

While Gaussian processes are a mainstay for various engineering and scientific applications, their uncertainty estimates do not satisfy frequentist guarantees and can be miscalibrated in practice. State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance, which yields confidence intervals that are potentially too coarse. To remedy this, we present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance but using a different set of hyperparameters chosen to satisfy an empirical calibration constraint. This results in a calibration approach that is considerably more flexible than existing approaches, which we optimize to yield tight predictive quantiles. Our approach is shown to yield a calibrated model under reasonable assumptions. Furthermore, it outperforms existing approaches in sharpness when employed for calibrated regression.

Solving Inverse Physics Problems with Score Matching
Benjamin Holzschuh Simona Vegetti Nils Thuerey



Research question: Solve inverse problems involving the temporal evolution of physics systems.
Motivation: Leverage recent advances in diffusion models: move the system's current state backward in time step by step by combining an approximate inverse physics simulator with a learned correction function.
Method: Training the learned correction with a single-step loss is equivalent to a score matching objective, while recursively predicting longer parts of the trajectory during training relates to maximum likelihood training of a corresponding probability flow.
Results: The algorithm shows clear advantages over standard denoising score matching, implicit score matching, and fully learned baselines on a wide range of inverse physics problems; the resulting inverse solver has excellent accuracy and temporal stability and, unlike other learned inverse solvers, allows sampling the posterior of the solutions.

We propose to solve inverse problems involving the temporal evolution of physics systems by leveraging recent advances from diffusion models. Our method moves the system's current state backward in time step by step by combining an approximate inverse physics simulator and a learned correction function. A central insight of our work is that training the learned correction with a single-step loss is equivalent to a score matching objective, while recursively predicting longer parts of the trajectory during training relates to maximum likelihood training of a corresponding probability flow. We highlight the advantages of our algorithm compared to standard denoising score matching and implicit score matching, as well as fully learned baselines for a wide range of inverse physics problems. The resulting inverse solver has excellent accuracy and temporal stability and, in contrast to other learned inverse solvers, allows for sampling the posterior of the solutions. Code and experiments are available at https://github.com/tum-pbs/SMDP.

K-Nearest-Neighbor Local Sampling Based Conditional Independence Testing
Shuai Li Yingjie Zhang Hongtu Zhu Christina Dan Wang Hai Shu Ziqi Chen Zhuoran Sun Yanfeng Yang



Research question: Conditional independence testing is a fundamental task in statistics and machine learning, but its effectiveness is hindered by high-dimensional conditioning variables and limited data samples.
Motivation: Propose a new test that addresses these challenges, enhancing control of the type I error while achieving high power under alternative hypotheses.
Method: Introduce a computationally efficient classifier-based conditional mutual information (CMI) estimator that captures intricate dependence structures, and approximate a distribution encoding the null hypothesis with a $k$-nearest-neighbor local sampling strategy; the approach needs no assumptions on distribution forms or feature dependencies, eliminates the need to derive asymptotic null distributions for the estimated CMI, and avoids dataset splitting, making it well suited to small datasets.
Results: The method achieves asymptotic control of the type I error and consistency against all alternatives; extensive analyses on synthetic and real data highlight its computational efficiency, it outperforms state-of-the-art methods in type I and II errors even with high-dimensional conditioning sets, and it remains robust in the presence of heavy-tailed data.

Conditional independence (CI) testing is a fundamental task in statistics and machine learning, but its effectiveness is hindered by the challenges posed by high-dimensional conditioning variables and limited data samples. This article introduces a novel testing approach to address these challenges and enhance control of the type I error while achieving high power under alternative hypotheses. The proposed approach incorporates a computationally efficient classifier-based conditional mutual information (CMI) estimator, capable of capturing intricate dependence structures among variables. To approximate a distribution encoding the null hypothesis, a $k$-nearest-neighbor local sampling strategy is employed. An important advantage of this approach is its ability to operate without assumptions about distribution forms or feature dependencies. Furthermore, it eliminates the need to derive asymptotic null distributions for the estimated CMI and avoids dataset splitting, making it particularly suitable for small datasets. The method presented in this article demonstrates asymptotic control of the type I error and consistency against all alternative hypotheses. Extensive analyses using both synthetic and real data highlight the computational efficiency of the proposed test. Moreover, it outperforms existing state-of-the-art methods in terms of type I and II errors, even in scenarios with high-dimensional conditioning sets. Additionally, the proposed approach exhibits robustness in the presence of heavy-tailed data.
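
A minimal sketch of the null-generation step: resampling X locally among k-nearest neighbours in Z approximately preserves P(X|Z) while breaking any X-Y dependence beyond Z. The resampling rule and names are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def knn_local_null_sample(X, Z, k=5, seed=0):
    # Z: (n, d_z) conditioning variables; X: (n, d_x) or (n,) variable to resample.
    rng = np.random.default_rng(seed)
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists in Z
    np.fill_diagonal(D, np.inf)
    X_null = X.copy()
    for i in range(len(Z)):
        nbrs = np.argsort(D[i])[:k]          # k nearest neighbours of Z_i
        X_null[i] = X[rng.choice(nbrs)]      # replace X_i with a local draw
    return X_null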

Optimization or Architecture: How to Hack Kalman Filtering
Ido Greenberg Netanel Yannay Shie Mannor



Research question: Compare nonlinear architectures (such as neural networks) with the standard linear Kalman filter (KF) fairly, by optimizing both.
Motivation: Traditional nonlinear-filtering evaluations mix two separate components, the nonlinear architecture and the parameter optimization method: the nonlinear model is optimized while the reference KF is not, leading to flawed experimental conclusions.
Method: Propose the Optimized Kalman Filter (OKF), which optimizes the KF's parameters in the same way the nonlinear models are optimized, making the KF competitive.
Results: Theoretical and empirical studies across a variety of problems show the advantage of OKF over the standard KF; OKF can replace the KF in real-world systems by merely updating the parameters.

In non-linear filtering, it is traditional to compare non-linear architectures such as neural networks to the standard linear Kalman Filter (KF). We observe that this mixes the evaluation of two separate components: the non-linear architecture, and the parameters optimization method. In particular, the non-linear model is often optimized, whereas the reference KF model is not. We argue that both should be optimized similarly, and to that end present the Optimized KF (OKF). We demonstrate that the KF may become competitive with neural models – if optimized using OKF. This implies that experimental conclusions of certain previous studies were derived from a flawed process. The advantage of OKF over the standard KF is further studied theoretically and empirically, in a variety of problems. Conveniently, OKF can replace the KF in real-world systems by merely updating the parameters.
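
A toy numpy/scipy sketch of the underlying point: treat the KF's noise covariances Q and R as trainable parameters and fit them to data instead of hand-tuning them. The diagonal parameterization, the supervised objective against ground-truth states, and all names are our assumptions, not the paper's exact training setup.

import numpy as np
from scipy.optimize import minimize

def kf_prediction_mse(log_qr, F, H, zs, xs_true, x0, P0):
    # Run a Kalman filter with Q = diag(exp(.)), R = diag(exp(.)) and
    # return the mean squared error of its one-step state predictions.
    dx, dz = F.shape[0], H.shape[0]
    Q = np.diag(np.exp(log_qr[:dx]))
    R = np.diag(np.exp(log_qr[dx:]))
    x, P, err = x0.copy(), P0.copy(), 0.0
    for z, x_true in zip(zs, xs_true):
        x, P = F @ x, F @ P @ F.T + Q                 # predict
        err += np.sum((x - x_true) ** 2)
        S = H @ P @ H.T + R
        Kg = P @ H.T @ np.linalg.inv(S)               # Kalman gain
        x = x + Kg @ (z - H @ x)                      # update
        P = (np.eye(dx) - Kg @ H) @ P
    return err / len(zs)

# "Optimized KF" idea: fit Q and R to data instead of hand-tuning, e.g.
# res = minimize(kf_prediction_mse, np.zeros(dx + dz),
#                args=(F, H, zs, xs_true, x0, P0), method="Nelder-Mead")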

Formulating Discrete Probability Flow Through Optimal Transport
Pengze Zhang Hubery Yin Chen Li Xiaohua Xie



Research question: Establish the fundamental theory of the probability flow of discrete diffusion models.
Motivation: Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not, so a corresponding theory is needed.
Method: First prove that, under certain conditions, the continuous probability flow is the Monge optimal transport map, and present an equivalent result for the discrete case; based on these findings, define a discrete probability flow in line with optimal transport principles and propose a novel sampling method that generates more certain outcomes than previous discrete diffusion models.
Results: Extensive experiments on a synthetic toy dataset and CIFAR-10 validate the effectiveness of the proposed discrete probability flow; code is released on GitHub.

Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not. In this paper, we aim to establish the fundamental theory for the probability flow of discrete diffusion models. Specifically, we first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and also present an equivalent evidence for discrete cases. In view of these findings, we are then able to define the discrete probability flow in line with the principles of optimal transport. Finally, drawing upon our newly established definitions, we propose a novel sampling method that surpasses previous discrete diffusion models in its ability to generate more certain outcomes. Extensive experiments on the synthetic toy dataset and the CIFAR-10 dataset have validated the effectiveness of our proposed discrete probability flow. Code is released at: https://github.com/PangzeCheung/Discrete-Probability-Flow.

Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing
Xiangyu Sun Oliver Schulte



Research question: A fundamental problem of causal discovery is cause-effect inference: learning the correct causal direction between two random variables.
Motivation: Modeling the effect as a function of its cause and a noise term leverages assumptions about the generating function class and has enabled significant progress; however, the accuracy of likelihood-based LSNM model selection deteriorates sharply when the user misspecifies the form of the noise distribution.
Method: Propose causal model selection through residual independence testing as an alternative, which is far more robust to noise misspecification and misleading conditional variances.
Results: An extensive empirical evaluation shows that the likelihood-based failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction, and that the independence-testing alternative handles noise misspecification much better.

A fundamental problem of causal discovery is cause-effect inference, to learn the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.

A Heavy-Tailed Algebra for Probabilistic Programming
Feynman T. Liang Liam Hodgkinson Michael W. Mahoney



Research question: Probabilistic models based on passing noise through neural networks often fail to capture tail behavior accurately unless the tails of the base distribution are appropriately calibrated.
Motivation: To overcome this deficiency, propose a systematic approach for analyzing the tails of random variables that can be used during the static analysis (before drawing samples) pass of a probabilistic programming language (PPL) compiler.
Method: Develop an algebra acting on a three-parameter family of tail asymptotics based on the generalized Gamma distribution to characterize how tails change under various operations; the operations are closed under addition and multiplication, can distinguish sub-Gaussians with differing scales, and handle ratios well enough to reproduce the tails of most important statistical distributions directly from their definitions.
Results: Empirical results confirm that inference algorithms leveraging the heavy-tailed algebra attain superior performance across a number of density modeling and variational inference (VI) tasks.

Despite the successes of probabilistic models based on passing noise through neural networks, recent work has identified that such methods often fail to capture tail behavior accurately---unless the tails of the base distribution are appropriately calibrated. To overcome this deficiency, we propose a systematic approach for analyzing the tails of random variables, and we illustrate how this approach can be used during the static analysis (before drawing samples) pass of a probabilistic programming language (PPL) compiler. To characterize how the tails change under various operations, we develop an algebra which acts on a three-parameter family of tail asymptotics and which is based on the generalized Gamma distribution. Our algebraic operations are closed under addition and multiplication; they are capable of distinguishing sub-Gaussians with differing scales; and they handle ratios sufficiently well to reproduce the tails of most important statistical distributions directly from their definitions. Our empirical results confirm that inference algorithms that leverage our heavy-tailed algebra attain superior performance across a number of density modeling and variational inference (VI) tasks.

SutraNets: Sub-series Autoregressive Networks for Long-Sequence, Probabilistic Forecasting
Shane Bergsma Tim Zeyl Lei Guo



Research question: Propose SutraNets, a novel method for neural probabilistic forecasting of long-sequence time series.
Motivation: Most autoregressive approaches suffer from harmful error accumulation when generating long sequences, along with difficulties modeling long-distance dependencies.
Method: SutraNets use an autoregressive generative model to factorize the likelihood of long sequences into products of conditional probabilities, treating long univariate prediction as multivariate prediction over lower-frequency sub-series; autoregression proceeds across both time and sub-series to ensure coherent multivariate (and hence high-frequency univariate) outputs. Because sub-series can be generated in fewer steps, SutraNets effectively reduce error accumulation and signal path distances.
Results: On six real-world datasets, SutraNets significantly improve forecasting accuracy over competitive alternatives, including when varying the number of sub-series and scaling up the depth and width of the underlying sequence models.

We propose SutraNets, a novel method for neural probabilistic forecasting of long-sequence time series. SutraNets use an autoregressive generative model to factorize the likelihood of long sequences into products of conditional probabilities. When generating long sequences, most autoregressive approaches suffer from harmful error accumulation, as well as challenges in modeling long-distance dependencies. SutraNets treat long, univariate prediction as multivariate prediction over lower-frequency sub-series. Autoregression proceeds across time and across sub-series in order to ensure coherent multivariate (and, hence, high-frequency univariate) outputs. Since sub-series can be generated using fewer steps, SutraNets effectively reduce error accumulation and signal path distances. We find SutraNets to significantly improve forecasting accuracy over competitive alternatives on six real-world datasets, including when we vary the number of sub-series and scale up the depth and width of the underlying sequence models.
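
The sub-series re-framing is easy to state in code: a length-T univariate series becomes K lower-frequency series (one per phase), so T univariate steps become T/K multivariate steps. A minimal sketch with assumed names:

import numpy as np

def to_sub_series(y, K):
    # Split a univariate series into K lower-frequency sub-series,
    # where row k collects y[k], y[k+K], y[k+2K], ...
    T = len(y) - len(y) % K
    return y[:T].reshape(-1, K).T   # shape (K, T // K)

y = np.arange(12)
S = to_sub_series(y, 3)   # rows: [0,3,6,9], [1,4,7,10], [2,5,8,11]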

Learning Robust Statistics for Simulation-based Inference under Model Misspecification
Daolang Huang Ayush Bharti Amauri H Souza Luigi Acerbi Samuel Kaski



Research question: Simulation-based inference (SBI) methods such as approximate Bayesian computation (ABC), synthetic likelihood, and neural posterior estimation (NPE) yield untrustworthy and misleading inference outcomes under model misspecification, hindering their widespread applicability.
Motivation: Propose a general approach to handle model misspecification that works across different classes of SBI methods.
Method: Leveraging the fact that the choice of statistics determines the degree of misspecification in SBI, introduce a regularized loss function that penalizes statistics that increase the mismatch between the data and the model; with NPE and ABC as use cases, demonstrate the method on artificially misspecified high-dimensional time-series models, and apply it to real radio-propagation data where the model is known to be misspecified.
Results: The method yields robust inference in misspecified scenarios while remaining accurate when the model is well specified.

Simulation-based inference (SBI) methods such as approximate Bayesian computation (ABC), synthetic likelihood, and neural posterior estimation (NPE) rely on simulating statistics to infer parameters of intractable likelihood models. However, such methods are known to yield untrustworthy and misleading inference outcomes under model misspecification, thus hindering their widespread applicability. In this work, we propose the first general approach to handle model misspecification that works across different classes of SBI methods. Leveraging the fact that the choice of statistics determines the degree of misspecification in SBI, we introduce a regularized loss function that penalizes those statistics that increase the mismatch between the data and the model. Taking NPE and ABC as use cases, we demonstrate the superior performance of our method on high-dimensional time-series models that are artificially misspecified. We also apply our method to real data from the field of radio propagation where the model is known to be misspecified. We show empirically that the method yields robust inference in misspecified scenarios, whilst still being accurate when the model is well-specified.

Sharp Bounds for Generalized Causal Sensitivity Analysis
Dennis Frauen Valentyn Melnychuk Stefan Feuerriegel



Research question: How to perform causal inference from observational data, particularly in the presence of unobserved confounding.
Motivation: Causal inference is crucial in disciplines such as medicine and economics, but sharp bounds for causal effects under relaxations of the unconfoundedness assumption remain subject to ongoing research.
Method: Propose a unified framework for causal sensitivity analysis under unobserved confounding in various settings, based on a flexible generalization of the marginal sensitivity model (MSM), and derive sharp bounds for a large class of causal effects.
Results: The sensitivity model applies to discrete, continuous, and time-varying treatments; in the special case of a single binary treatment, the bounds for (conditional) average treatment effects coincide with recent optimality results for causal sensitivity analysis; a scalable algorithm estimates the sharp bounds from observational data.

Causal inference from observational data is crucial for many disciplines such as medicine and economics. However, sharp bounds for causal effects under relaxations of the unconfoundedness assumption (causal sensitivity analysis) are subject to ongoing research. So far, works with sharp bounds are restricted to fairly simple settings (e.g., a single binary treatment). In this paper, we propose a unified framework for causal sensitivity analysis under unobserved confounding in various settings. For this, we propose a flexible generalization of the marginal sensitivity model (MSM) and then derive sharp bounds for a large class of causal effects. This includes (conditional) average treatment effects, effects for mediation analysis and path analysis, and distributional effects. Furthermore, our sensitivity model is applicable to discrete, continuous, and time-varying treatments. It allows us to interpret the partial identification problem under unobserved confounding as a distribution shift in the latent confounders while evaluating the causal effect of interest. In the special case of a single binary treatment, our bounds for (conditional) average treatment effects coincide with recent optimality results for causal sensitivity analysis. Finally, we propose a scalable algorithm to estimate our sharp bounds from observational data.

Distributional Learning of Variational AutoEncoder: Application to Synthetic Data Generation
SeungHwan An Jong-June Jeon



Research question: Despite its computational efficiency, the Gaussianity assumption of the Variational Autoencoder (VAE) has been consistently criticized as a main limitation.
Motivation: Propose a new approach that expands model capacity (the expressive power of the distributional family) without sacrificing the computational advantages of the VAE framework.
Method: The VAE decoder is composed of an infinite mixture of asymmetric Laplace distributions, which has general distribution-fitting capabilities for continuous variables; the model is represented by a special form of nonparametric M-estimator for estimating general quantile functions, and the relevance between the proposed model and quantile estimation is established theoretically.
Results: Applied to synthetic data generation, the model demonstrates superiority in easily adjusting the level of data privacy.

The Gaussianity assumption has been consistently criticized as a main limitation of the Variational Autoencoder (VAE) despite its efficiency in computational modeling. In this paper, we propose a new approach that expands the model capacity (i.e., expressive power of distributional family) without sacrificing the computational advantages of the VAE framework. Our VAE model's decoder is composed of an infinite mixture of asymmetric Laplace distribution, which possesses general distribution fitting capabilities for continuous variables. Our model is represented by a special form of a nonparametric M-estimator for estimating general quantile functions, and we theoretically establish the relevance between the proposed model and quantile estimation. We apply the proposed model to synthetic data generation, and particularly, our model demonstrates superiority in easily adjusting the level of data privacy.

Pointwise uncertainty quantification for sparse variational Gaussian process regression with a Brownian motion prior
Luke Travis Kolyan Ray



Research question: Study pointwise estimation and uncertainty quantification for sparse variational Gaussian process methods with eigenvector inducing variables.
Motivation: For a rescaled Brownian motion prior, theoretical guarantees and limitations can be derived for the frequentist size and coverage of pointwise credible sets.
Method: With sufficiently many inducing variables, precisely characterize the asymptotic frequentist coverage, deducing when credible sets from this variational method are conservative and when they are overconfident or misleading.
Results: Numerical results illustrate the applicability of the theory, with a discussion of connections to other common Gaussian process priors.

We study pointwise estimation and uncertainty quantification for a sparse variational Gaussian process method with eigenvector inducing variables. For a rescaled Brownian motion prior, we derive theoretical guarantees and limitations for the frequentist size and coverage of pointwise credible sets. For sufficiently many inducing variables, we precisely characterize the asymptotic frequentist coverage, deducing when credible sets from this variational method are conservative and when overconfident/misleading. We numerically illustrate the applicability of our results and discuss connections with other common Gaussian process priors.

Deciphering Spatio-Temporal Graph Forecasting: A Causal Lens and Treatment
Yutong Xia Yuxuan Liang Haomin Wen Xu Liu Kun Wang Zhengyang Zhou Roger Zimmermann



Research question: Address two main challenges in spatio-temporal graph forecasting: temporal out-of-distribution (OoD) issues and dynamic spatial causation.
Motivation: Spatio-temporal graph neural networks are the mainstream method for these forecasting tasks, but they struggle with temporal OoD issues and dynamic spatial causation.
Method: Propose a new framework, CaST, that tackles both via causal treatments: build a structural causal model to decipher the data-generation process of spatio-temporal graphs; handle the temporal OoD issue through back-door adjustment with a novel disentanglement block that separates temporal environments from input data; and use front-door adjustment with edge-level convolution to model the ripple effect of causation.
Results: Experiments on three real-world datasets show that CaST consistently outperforms existing methods with good interpretability.

Spatio-Temporal Graph (STG) forecasting is a fundamental task in many real-world applications. Spatio-Temporal Graph Neural Networks have emerged as the most popular method for STG forecasting, but they often struggle with temporal out-of-distribution (OoD) issues and dynamic spatial causation. In this paper, we propose a novel framework called CaST to tackle these two challenges via causal treatments. Concretely, leveraging a causal lens, we first build a structural causal model to decipher the data generation process of STGs. To handle the temporal OoD issue, we employ the back-door adjustment by a novel disentanglement block to separate the temporal environments from input data. Moreover, we utilize the front-door adjustment and adopt edge-level convolution to model the ripple effect of causation. Experimental results on three real-world datasets demonstrate the effectiveness of CaST, which consistently outperforms existing methods with good interpretability. Our source code is available at https://github.com/yutong-xia/CaST.

Inferring Hybrid Neural Fluid Fields from Videos
Hong-Xing Yu Yang Zheng Yuan Gao Yitong Deng Bo Zhu Jiajun Wu



Research question: Recover fluid density and velocity from sparse multiview videos.
Motivation: Existing neural dynamic reconstruction methods rely mainly on optical flow; because fluids are often shapeless and lack stable visual features, such methods cannot accurately estimate density or uncover the underlying velocity.
Method: Propose hybrid neural fluid fields (HyFluid), a neural approach that jointly infers fluid density and velocity fields. To resolve the visual ambiguity of fluid velocity, introduce a set of physics-based losses that enforce a physically plausible velocity field that is divergence-free and drives density transport; to handle the turbulent nature of fluids, design a hybrid neural velocity representation combining a base neural velocity field capturing most irrotational energy with vortex-particle velocities modeling the residual turbulence.
Results: The method recovers vortical flow details, opening up learning and reconstruction applications around 3D incompressible flow, including fluid re-simulation and editing, future prediction, and neural dynamic scene composition.

We study recovering fluid density and velocity from sparse multiview videos. Existing neural dynamic reconstruction methods predominantly rely on optical flows; therefore, they cannot accurately estimate the density and uncover the underlying velocity due to the inherent visual ambiguities of fluid velocity, as fluids are often shapeless and lack stable visual features. The challenge is further pronounced by the turbulent nature of fluid flows, which calls for properly designed fluid velocity representations. To address these challenges, we propose hybrid neural fluid fields (HyFluid), a neural approach to jointly infer fluid density and velocity fields. Specifically, to deal with visual ambiguities of fluid velocity, we introduce a set of physics-based losses that enforce inferring a physically plausible velocity field, which is divergence-free and drives the transport of density. To deal with the turbulent nature of fluid velocity, we design a hybrid neural velocity representation that includes a base neural velocity field that captures most irrotational energy and a vortex particle-based velocity that models residual turbulent velocity. We show that our method enables recovering vortical flow details. Our approach opens up possibilities for various learning and reconstruction applications centered around 3D incompressible flow, including fluid re-simulation and editing, future prediction, and neural dynamic scene composition. Project website: https://kovenyu.com/HyFluid/
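
One of the physics-based losses is straightforward to sketch on a regular grid: penalize the divergence of the velocity field to push it toward incompressibility. The finite-difference form and names below are our assumptions, not the paper's exact loss.

import numpy as np

def divergence_loss(v):
    # v: velocity field sampled on a regular grid, shape (H, W, 2).
    # Penalizes div v = d v_x / dx + d v_y / dy, i.e. deviation from
    # incompressible (divergence-free) flow.
    dvx_dx = np.gradient(v[..., 0], axis=1)
    dvy_dy = np.gradient(v[..., 1], axis=0)
    return np.mean((dvx_dx + dvy_dy) ** 2)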

Causal Discovery from Subsampled Time Series with Proxy Variables
Mingzhou Liu Xinwei Sun Lingjing Hu Yizhou Wang



Research question: How to infer causal structure from time-series data that is sampled at a much lower frequency than that of the causal influence.
Motivation: Such subsampling is a major barrier to scientific inquiry, and prior methods are either limited to the linear case or fail to achieve identifiability.
Method: Propose a constraint-based algorithm without parametric constraints: the challenge of subsampling arises mainly from hidden variables at unobserved time steps, but every hidden variable has an observed proxy, essentially itself at some observable future time, which can be leveraged to remove the bias induced by the hidden variables and achieve identifiability.
Results: The algorithm achieves full causal identification, with theoretical advantages reflected in both synthetic and real-world experiments.

Inferring causal structures from time series data is the central interest of many scientific inquiries. A major barrier to such inference is the problem of subsampling, *i.e.*, the frequency of measurement is much lower than that of causal influence. To overcome this problem, numerous methods have been proposed, yet either was limited to the linear case or failed to achieve identifiability. In this paper, we propose a constraint-based algorithm that can identify the entire causal structure from subsampled time series, without any parametric constraint. Our observation is that the challenge of subsampling arises mainly from hidden variables at the unobserved time steps. Meanwhile, every hidden variable has an observed proxy, which is essentially itself at some observable time in the future, benefiting from the temporal structure. Based on these, we can leverage the proxies to remove the bias induced by the hidden variables and hence achieve identifiability. Following this intuition, we propose a proxy-based causal discovery algorithm. Our algorithm is nonparametric and can achieve full causal identification. Theoretical advantages are reflected in synthetic and real-world experiments.

Streaming Factor Trajectory Learning for Temporal Tensor Decomposition
Shikai Fang Xin Yu Shibo Li Zheng Wang Robert Kirby Shandian Zhe



Research question: Existing temporal tensor decomposition methods estimate fixed factors and cannot capture the evolution of object representations, and an effective approach to capture such evolution from streaming data is lacking.
Motivation: To address these issues, propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition.
Method: Model factor trajectories with Gaussian processes (GPs) to flexibly estimate their temporal evolution; convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE); and develop an efficient online filtering algorithm that estimates a decoupled running posterior of the involved factor states upon receiving new data.
Results: Experiments show the advantage of SFTL in both synthetic tasks and real-world applications.

Practical tensor data often comes with time information. Most existing temporal decomposition approaches estimate a set of fixed factors for the objects in each tensor mode, and hence cannot capture the temporal evolution of the objects' representation. More importantly, we lack an effective approach to capture such evolution from streaming data, which is common in real-world applications. To address these issues, we propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition. We use Gaussian processes (GPs) to model the trajectory of factors so as to flexibly estimate their temporal evolution. To address the computational challenges in handling streaming data, we convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE). We develop an efficient online filtering algorithm to estimate a decoupled running posterior of the involved factor states upon receiving new data. The decoupled estimation enables us to conduct standard Rauch-Tung-Striebel smoothing to compute the full posterior of all the trajectories in parallel, without the need for revisiting any previous data. We have shown the advantage of SFTL in both synthetic tasks and real-world applications.

Variational Inference with Gaussian Score Matching
Chirag Modi Robert M. Gower Charles Margossian Yuling Yao David Blei Lawrence K. Saul



Research question: Propose a new variational inference (VI) method to approximate the computationally intractable posterior distributions that arise in Bayesian statistics.
Motivation: Traditional VI fits a simple parametric distribution to the target posterior by optimizing an objective such as the evidence lower bound (ELBO).
Method: The new approach rests on the principle of score matching: if two distributions are equal, their score functions (gradients of the log density) agree at every point of their support. Score-matching VI is an iterative algorithm that seeks to match the scores of the variational approximation and the exact posterior.
Results: When the variational family is Gaussian, the inner optimization admits a closed-form solution, yielding Gaussian score matching VI (GSM-VI), a "black box" algorithm that only requires a differentiable joint distribution and thus applies to a wide class of models; comparisons on real-world Bayesian inference problems show GSM-VI is faster than black-box variational inference (BBVI) with comparable or better accuracy.

Variational inference (VI) is a method to approximate the computationally intractable posterior distributions that arise in Bayesian statistics. Typically, VI fits a simple parametric distribution to be close to the target posterior, optimizing an appropriate objective such as the evidence lower bound (ELBO). In this work, we present a new approach to VI. Our method is based on the principle of score matching---namely, that if two distributions are equal then their score functions (i.e., gradients of the log density) are equal at every point on their support. With this principle, we develop score-matching VI, an iterative algorithm that seeks to match the scores between the variational approximation and the exact posterior. At each iteration, score-matching VI solves an inner optimization, one that minimally adjusts the current variational estimate to match the scores at a newly sampled value of the latent variables. We show that when the variational family is a Gaussian, this inner optimization enjoys a closed-form solution, which we call Gaussian score matching VI (GSM-VI). GSM-VI is a ``black box'' variational algorithm in that it only requires a differentiable joint distribution, and as such it can be applied to a wide class of models. We compare GSM-VI to black box variational inference (BBVI), which has similar requirements but instead optimizes the ELBO. We first study how GSM-VI behaves as a function of the problem dimensionality, the condition number of the target covariance matrix (when the target is Gaussian), and the degree of mismatch between the approximating and exact posterior distribution. We then study GSM-VI on a collection of real-world Bayesian inference problems from the posteriorDB database of datasets and models. We find that GSM-VI is faster than BBVI and equally or more accurate. Specifically, over a wide range of target posteriors, GSM-VI requires 10-100x fewer gradient evaluations than BBVI to obtain a comparable quality of approximation.
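
To make the objective concrete: for a Gaussian q = N(mu, L L^T) the score is -Sigma^{-1}(z - mu), and each iteration nudges q so that its score matches the target's score at a newly sampled z. The paper derives a closed-form inner solution; the plain gradient step on the mean below is only a stand-in to illustrate the objective, with all names ours.

import numpy as np

def gsm_vi_mean_step(mu, L_chol, z, target_score, lr=0.05):
    # Score of q = N(mu, Sigma) at z is -Sigma^{-1} (z - mu); take one
    # gradient step on 0.5 * ||score_q(z) - score_p(z)||^2 w.r.t. mu.
    # (The covariance update, and the paper's closed form, are omitted.)
    Sigma_inv = np.linalg.inv(L_chol @ L_chol.T)
    q_score = -Sigma_inv @ (z - mu)
    resid = q_score - target_score
    return mu - lr * (Sigma_inv @ resid)   # grad w.r.t. mu is Sigma_inv @ resid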

GPEX, A Framework For Interpreting Artificial Neural Networks
Amir Akbarnejad Gilbert Bigras Nilanjan Ray



Research question: How to better understand the decision process of deep artificial neural networks (ANNs) through Gaussian processes (GPs).
Motivation: Existing theoretical works place strict assumptions on the ANN that are hard to accommodate in recent deep architectures, so a new approach is needed to understand and explain ANN decisions.
Method: Derive an evidence lower bound that encourages the GP posterior to match the ANN output without any requirement on the ANN, and develop novel computational techniques that allow training GPs with hundreds of thousands of inducing points under GPU acceleration.
Results: On 5 datasets, the method finds GPs whose outputs closely match those of the corresponding ANNs; using the GPs' kernel functions to explain the ANNs' decisions, more than 200 highly human-interpretable explanations are provided, showing that the obtained GPs can reveal the ANNs' decision process.

The analogy between Gaussian processes (GPs) and deep artificial neural networks (ANNs) has received a lot of interest, and has shown promise to unbox the blackbox of deep ANNs. Existing theoretical works put strict assumptions on the ANN (e.g. requiring all intermediate layers to be wide, or using specific activation functions). Accommodating those theoretical assumptions is hard in recent deep architectures, and those theoretical conditions need refinement as new deep architectures emerge. In this paper we derive an evidence lower-bound that encourages the GP's posterior to match the ANN's output without any requirement on the ANN. Using our method we find out that on 5 datasets, only a subset of those theoretical assumptions are sufficient. Indeed, in our experiments we used a normal ResNet-18 or feed-forward backbone with a single wide layer in the end. One limitation of training GPs is the lack of scalability with respect to the number of inducing points. We use novel computational techniques that allow us to train GPs with hundreds of thousands of inducing points and with GPU acceleration. As shown in our experiments, doing so has been essential to get a close match between the GPs and the ANNs on 5 datasets. We implement our method as a publicly available tool called GPEX: https://github.com/amirakbarnejad/gpex. On 5 datasets (4 image datasets, and 1 biological dataset) and ANNs with 2 types of functionality (classifier or attention-mechanism) we were able to find GPs whose outputs closely match those of the corresponding ANNs. After matching the GPs to the ANNs, we used the GPs' kernel functions to explain the ANNs' decisions. We provide more than 200 explanations (around 30 in the paper and the rest in the supplementary) which are highly interpretable by humans and show the ability of the obtained GPs to unbox the ANNs' decisions.

Why Did This Model Forecast This Future? Information-Theoretic Saliency for Counterfactual Explanations of Probabilistic Regression Models
Chirag Raman Alec Nonnemaker Amelia Villegas-Morcillo Hayley Hung Marco Loog



Research question: Propose a post hoc saliency-based explanation framework for counterfactual reasoning in probabilistic multivariate time-series forecasting (regression) settings.
Motivation: Decisions of multivariate time-series forecasting models lack explanation and understanding.
Method: Building on Miller's framework of explanations derived from the social sciences, establish a conceptual link between counterfactual reasoning and saliency-based explanation techniques; leverage an information-theoretic definition of saliency and extend it to the forecasting setting, obtaining a closed-form expression that identifies which observed timesteps appear salient to a model's probabilistic forecasts.
Results: The framework is validated empirically on synthetic data with established ground-truth saliency, and demonstrations on real-world data and forecasting models show how it can help domain experts form new data-driven hypotheses about causal relationships between features.

We propose a post hoc saliency-based explanation framework for counterfactual reasoning in probabilistic multivariate time-series forecasting (regression) settings. Building upon Miller's framework of explanations derived from research in multiple social science disciplines, we establish a conceptual link between counterfactual reasoning and saliency-based explanation techniques. To address the lack of a principled notion of saliency, we leverage a unifying definition of information-theoretic saliency grounded in preattentive human visual cognition and extend it to forecasting settings. Specifically, we obtain a closed-form expression for commonly used density functions to identify which observed timesteps appear salient to an underlying model in making its probabilistic forecasts. We empirically validate our framework in a principled manner using synthetic data to establish ground-truth saliency that is unavailable for real-world data. Finally, using real-world data and forecasting models, we demonstrate how our framework can assist domain experts in forming new data-driven hypotheses about the causal relationships between features in the wild.

Temporal Causal Mediation through a Point Process: Direct and Indirect Effects of Healthcare Interventions
Çağlar Hızlı S. T. John Anne Tuulikki Juuti Tuure Tapani Saarinen Kirsi Hannele Pietiläinen Pekka Marttinen



Research question: How to accurately estimate the direct and indirect effects of an external intervention on an outcome and show how each affects the whole future trajectory.
Motivation: Existing dynamic causal mediation approaches are limited to regular measurement intervals and simple parametric models, and disregard long-range mediator-outcome interactions.
Method: Propose a non-parametric mediator-outcome model in which the mediator is assumed to be a temporal point process interacting with the outcome process, and use it to estimate the direct and indirect effects of an external intervention on the outcome.
Results: On semi-synthetic data the method accurately estimates direct and indirect effects; on real-world healthcare data the model infers clinically meaningful direct and indirect effect trajectories for blood glucose after surgery.

Deciding on an appropriate intervention requires a causal model of a treatment, the outcome, and potential mediators. Causal mediation analysis lets us distinguish between direct and indirect effects of the intervention, but has mostly been studied in a static setting. In healthcare, data come in the form of complex, irregularly sampled time-series, with dynamic interdependencies between a treatment, outcomes, and mediators across time. Existing approaches to dynamic causal mediation analysis are limited to regular measurement intervals, simple parametric models, and disregard long-range mediator--outcome interactions. To address these limitations, we propose a non-parametric mediator--outcome model where the mediator is assumed to be a temporal point process that interacts with the outcome process. With this model, we estimate the direct and indirect effects of an external intervention on the outcome, showing how each of these affects the whole future trajectory. We demonstrate on semi-synthetic data that our method can accurately estimate direct and indirect effects. On real-world healthcare data, our model infers clinically meaningful direct and indirect effect trajectories for blood glucose after a surgery.

Practical Equivariances via Relational Conditional Neural Processes
Daolang Huang Manuel Haussmann Ulpu Remes S. T. John Grégoire Clarté Kevin Sebastian Luck Samuel Kaski Luigi Acerbi



Research question: How to effectively incorporate equivariances into conditional neural process (CNP) models to improve their performance and applicability.
Motivation: Many machine learning tasks, such as spatio-temporal modeling, Bayesian optimization, and continuous control, inherently contain equivariances, yet prior attempts to include equivariances in CNPs do not scale beyond two input dimensions.
Method: Propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model, extending the applicability and impact of equivariant neural processes to higher dimensions.
Results: Experiments demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.

Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances – for example to translation – which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.

Variational Gaussian Processes with Decoupled Conditionals
Xinran Zhu Kaiwen Wu Natalie Maus Jacob R. Gardner David Bindel



Research question: Address the scalability of variational Gaussian processes (VGPs): increasing the number of inducing points improves model fidelity but creates optimization challenges and computational complexity.
Motivation: Although the approximation error can in principle be reduced with more inducing points, this leads to scaling optimization challenges; instead, consider making the training and test conditionals more flexible.
Method: Investigate decoupling the parametric form of the predictive mean and covariance in the conditionals, learning independent parameters for each; derive new evidence lower bounds (ELBO) under these more flexible conditionals and provide two concrete examples of applying decoupled conditionals.
Results: Empirically, the additional flexibility improves model performance on a variety of regression tasks and Bayesian optimization (BO) applications.

Variational Gaussian processes (GPs) approximate exact GP inference by using a small set of inducing points to form a sparse approximation of the true posterior, with the fidelity of the model increasing with additional inducing points. Although the approximation error in principle can be reduced through the use of more inducing points, this leads to scaling optimization challenges and computational complexity. To achieve scalability, inducing point methods typically introduce conditional independencies and then approximations to the training and test conditional distributions. In this paper, we consider an alternative approach to modifying the training and test conditionals, in which we make them more flexible. In particular, we investigate decoupling the parametric form of the predictive mean and covariance in the conditionals, and learn independent parameters for predictive mean and covariance. We derive new evidence lower bounds (ELBO) under these more flexible conditionals, and provide two concrete examples of applying the decoupled conditionals. Empirically, we find this additional flexibility leads to improved model performance on a variety of regression tasks and Bayesian optimization (BO) applications.

When can Regression-Adjusted Control Variate Help? Rare Events, Sobolev Embedding and Minimax Optimality
Jose Blanchet Haoxuan Chen Yiping Lu Lexing Ying



Research question: This paper studies the use of a machine learning-based estimator as a control variate for mitigating the variance of Monte Carlo sampling.
Motivation: To identify the key factors that determine how effective control variates are at reducing variance.
Method: We examine a prototype problem of estimating moments of a Sobolev function from observations at (random) quadrature nodes, and study a quadrature rule that employs a nonparametric regression-adjusted control variate to reduce the variance of the Monte Carlo simulation.
Results: This quadrature rule improves the Monte Carlo rate and achieves the minimax optimal rate under a sufficient smoothness assumption. In the presence of rare and extreme events, a truncated version of the Monte Carlo algorithm achieves the minimax optimal rate, while the control variate cannot improve the convergence rate.

This paper studies the use of a machine learning-based estimator as a control variate for mitigating the variance of Monte Carlo sampling. Specifically, we seek to uncover the key factors that influence the efficiency of control variates in reducing variance. We examine a prototype estimation problem that involves simulating the moments of a Sobolev function based on observations obtained from (random) quadrature nodes. Firstly, we establish an information-theoretic lower bound for the problem. We then study a specific quadrature rule that employs a nonparametric regression-adjusted control variate to reduce the variance of the Monte Carlo simulation. We demonstrate that this kind of quadrature rule can improve the Monte Carlo rate and achieve the minimax optimal rate under a sufficient smoothness assumption. Due to the Sobolev Embedding Theorem, the sufficient smoothness assumption eliminates the existence of rare and extreme events. Finally, we show that, in the presence of rare and extreme events, a truncated version of the Monte Carlo algorithm can achieve the minimax optimal rate while the control variate cannot improve the convergence rate.
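
To make the mechanism concrete, the following is a minimal sketch, under simplifying assumptions (a uniform density on [0, 1] and a polynomial regression in place of the paper's nonparametric estimator), of a regression-adjusted control variate: Monte Carlo is applied only to the residual f - g, and the mean of the cheap surrogate g is added back via fine-grained quadrature.

```python
# Hedged sketch (illustrative, not the paper's estimator).
import numpy as np

rng = np.random.default_rng(0)
f = np.exp                                      # a smooth integrand on [0, 1] (illustrative)

# Fit a cheap surrogate g ~ f from a handful of (random) quadrature nodes.
x_fit = rng.random(50)
g = np.poly1d(np.polyfit(x_fit, f(x_fit), deg=5))

# Monte Carlo on the residual f - g; add back E[g] via a fine Riemann sum.
x_mc = rng.random(10_000)
grid = np.linspace(0.0, 1.0, 100_000, endpoint=False)
Eg = g(grid).mean()                             # near-exact mean of the surrogate

plain = f(x_mc).mean()
adjusted = (f(x_mc) - g(x_mc)).mean() + Eg
print(f"exact: {np.e - 1:.6f}  plain MC: {plain:.6f}  adjusted MC: {adjusted:.6f}")
```

Because the residual f - g has far smaller variance than f when f is smooth, the adjusted estimator converges much faster; the paper's point is that exactly this smoothness requirement fails in the presence of rare and extreme events.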

Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Paul Pu Liang Yun Cheng Xiang Fan Chun Kai Ling Suzanne Nie Richard J. Chen Zihao Deng Nicholas Allen Randy Auerbach Faisal Mahmood Ruslan Salakhutdinov Louis-Philippe Morency



Research question: How can we quantify the interactions between input modalities and an output task in multimodal settings, and which multimodal models are best suited to capture these interactions?
Motivation: The surge of interest in multimodal applications has produced many datasets and methods for representing and integrating information from different modalities, yet fundamental research questions remain despite this empirical progress.
Method: Propose an information-theoretic approach to quantify the degree of interaction needed to solve a multimodal task. We term the three measures the PID statistics of a multimodal distribution (PID for short) and introduce two new estimators of these statistics that scale to high-dimensional distributions.
Results: PID estimation is validated through extensive experiments on synthetic datasets with known PID and on large-scale multimodal benchmarks. We further demonstrate its usefulness in (1) quantifying interactions within multimodal datasets; (2) quantifying interactions captured by multimodal models; (3) principled model selection; and (4) three real-world case studies in pathology, mood prediction, and robotic perception, where the framework helps recommend strong multimodal models for each application.

The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimodal models to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures as the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimations are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies engaging with domain experts in pathology, mood prediction, and robotic perception where our framework helps to recommend strong multimodal models for each application.

Training Energy-Based Normalizing Flow with Score-Matching Objectives
Chen-Hao Chao Wei-Fang Sun Yen-Chang Hsu Zsolt Kira Chun-Yi Lee



Research question: Establish a connection between the parameterization of flow-based and energy-based generative models, and propose a new flow-based modeling approach, the energy-based normalizing flow (EBFlow).
Motivation: Optimizing EBFlow with score-matching objectives allows the computation of Jacobian determinants for linear transformations to be bypassed entirely.
Method: By training EBFlow with score-matching objectives, arbitrary linear layers can be used in the construction of flow-based models without increasing the computational complexity of each training iteration.
Results: Experiments show a significant speedup over the commonly used maximum likelihood estimation approach, together with a clear advantage in negative log-likelihood (NLL).

In this paper, we establish a connection between the parameterization of flow-based and energy-based generative models, and present a new flow-based modeling approach called energy-based normalizing flow (EBFlow). We demonstrate that by optimizing EBFlow with score-matching objectives, the computation of Jacobian determinants for linear transformations can be entirely bypassed. This feature enables the use of arbitrary linear layers in the construction of flow-based models without increasing the computational time complexity of each training iteration from $\mathcal{O}(D^2L)$ to $\mathcal{O}(D^3L)$ for an $L$-layered model that accepts $D$-dimensional inputs. This makes the training of EBFlow more efficient than the commonly-adopted maximum likelihood training method. In addition to the reduction in runtime, we enhance the training stability and empirical performance of EBFlow through a number of techniques developed based on our analysis of the score-matching methods. The experimental results demonstrate that our approach achieves a significant speedup compared to maximum likelihood estimation while outperforming prior methods by a noticeable margin in terms of negative log-likelihood (NLL).
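
The Jacobian-free property comes from the training objective rather than the architecture: score matching only ever differentiates the model's log-density with respect to its input. Below is a minimal denoising score matching step for a generic energy function (a hedged, generic sketch, not EBFlow's actual parameterization), showing that no log-determinant appears anywhere in the loss:

```python
# Hedged sketch: one denoising score-matching (DSM) step on a toy energy model.
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.Softplus(), nn.Linear(64, 1))

def dsm_loss(x, sigma=0.1):
    noise = torch.randn_like(x)
    x_t = x + sigma * noise                    # perturb the data
    x_t.requires_grad_(True)
    # Model score = -dE/dx; no Jacobian determinant of any layer is needed.
    score = -torch.autograd.grad(energy(x_t).sum(), x_t, create_graph=True)[0]
    target = -noise / sigma                    # score of the Gaussian noising kernel
    return ((score - target) ** 2).sum(-1).mean()

x = torch.randn(128, 2)
loss = dsm_loss(x)
loss.backward()                                # gradients flow into the energy's parameters
print(float(loss))
```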

Metropolis Sampling for Constrained Diffusion Models
Nic Fishman Leo Klarner Emile Mathieu Michael John Hutchinson Valentin De Bortoli



Research question: Existing diffusion-model methods cannot specify arbitrary, domain-informed constraints, and the available noising processes are either computationally burdensome or apply only to convex subsets of Euclidean space.
Motivation: To address these issues, this paper proposes a new noising scheme based on Metropolis sampling that improves computational efficiency and empirical performance.
Method: Construct a new noising process and prove that it corresponds to a valid discretisation of the reflected Brownian motion.
Results: The method's scalability and flexibility are demonstrated on a range of problem settings with convex and non-convex constraints, including applications in geospatial modelling, robotics, and protein design.

Denoising diffusion models have recently emerged as the predominant paradigm for generative modelling on image domains. In addition, their extension to Riemannian manifolds has facilitated a range of applications across the natural sciences. While many of these problems stand to benefit from the ability to specify arbitrary, domain-informed constraints, this setting is not covered by the existing (Riemannian) diffusion model methodology. Recent work has attempted to address this issue by constructing novel noising processes based on the reflected Brownian motion and logarithmic barrier methods. However, the associated samplers are either computationally burdensome or only apply to convex subsets of Euclidean space. In this paper, we introduce an alternative, simple noising scheme based on Metropolis sampling that affords substantial gains in computational efficiency and empirical performance compared to the earlier samplers. Of independent interest, we prove that this new process corresponds to a valid discretisation of the reflected Brownian motion. We demonstrate the scalability and flexibility of our approach on a range of problem settings with convex and non-convex constraints, including applications from geospatial modelling, robotics and protein design.
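
A minimal sketch of the core idea (heavily simplified, with an illustrative unit-ball constraint standing in for a real domain constraint): each noising step proposes a Gaussian move and rejects proposals that exit the constraint set, which the paper shows yields a valid discretisation of reflected Brownian motion.

```python
# Hedged sketch: Metropolis-style constrained noising, not the authors' sampler.
import numpy as np

rng = np.random.default_rng(0)
inside = lambda x: np.linalg.norm(x) <= 1.0        # example constraint set C (unit ball)

def constrained_noising_step(x, step=0.05):
    prop = x + np.sqrt(step) * rng.standard_normal(x.shape)
    return prop if inside(prop) else x             # reject moves that leave C

x = np.zeros(2)
for _ in range(1000):
    x = constrained_noising_step(x)
assert inside(x)                                   # trajectory never leaves the constraint set
```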

Should We Learn Most Likely Functions or Parameters?
Shikai Qiu Tim G. J. Rudner Sanyam Kapoor Andrew Gordon Wilson



Research question: Should we estimate the most likely parameters under the parameter posterior, or the most likely function that the model and data imply?
Motivation: Model parameters matter only insofar as they induce a predictive function, and the most likely parameters generally do not correspond to the most likely function; indeed, a model can be re-parametrized so that any setting of parameters maximizes the parameter posterior.
Method: Investigate direct estimation of the most likely function implied by the model and the data, characterize when this procedure is well-behaved (and when it yields pathological solutions with neural networks), and develop a scalable approximation.
Results: Under suitable conditions, function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.

Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks and prove conditions under which the procedure is well-behaved, as well as a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.

ForecastPFN: Synthetically-Trained Zero-Shot Forecasting
Samuel Dooley Gurnoor Singh Khurana Chirag Mohapatra Siddartha Venkat Naidu Colin White



Research question: Most time-series forecasting methods require a substantial training dataset, but many real-world applications have very few initial observations, which restricts the applicability of these methods.
Motivation: Although there is recent work on forecasting from very limited initial data (so-called "zero-shot" forecasting), its performance depends on the data used for pretraining and is inconsistent.
Method: Devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. It is a prior-data fitted network, trained to approximate Bayesian inference, that can make predictions on a new time-series dataset in a single forward pass.
Results: Experiments show that ForecastPFN's zero-shot predictions are more accurate and faster than state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.

The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very few initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called `zero-shot' forecasting), its performance is inconsistent depending on the data used for pretraining. In this work, we take a different approach and devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. ForecastPFN is a prior-data fitted network, trained to approximate Bayesian inference, which can make predictions on a new time series dataset in a single forward pass. Through extensive experiments, we show that zero-shot predictions made by ForecastPFN are more accurate and faster compared to state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.

topic-2

Topic words :  3d,  object,  segmentation,  image,  video,  visual,  point,  semantic

Rotating Features for Object Discovery
Sindy Löwe Phillip Lippe Francesco Locatello Max Welling



Research question: The binding problem in human cognition, i.e., how the brain represents and connects objects within a fixed network of neural connections, remains a subject of intense debate.
Motivation: Most machine learning efforts addressing this problem in an unsupervised setting have focused on slot-based methods, which may be limited by their discrete nature and difficulty expressing uncertainty.
Method: This paper proposes Rotating Features, a generalization of complex-valued features to higher dimensions, together with a new evaluation procedure for extracting objects from distributed representations. We also show that the approach applies to pre-trained features.
Results: These advances enable scaling distributed object-centric representations from simple toy data to real-world data. We believe this work advances a new paradigm for addressing the binding problem in machine learning and has the potential to inspire further innovation in the field.

The binding problem in human cognition, concerning how the brain represents and connects objects within a fixed network of neural connections, remains a subject of intense debate. Most machine learning efforts addressing this issue in an unsupervised setting have focused on slot-based methods, which may be limiting due to their discrete nature and difficulty to express uncertainty. Recently, the Complex AutoEncoder was proposed as an alternative that learns continuous and distributed object-centric representations. However, it is only applicable to simple toy data. In this paper, we present Rotating Features, a generalization of complex-valued features to higher dimensions, and a new evaluation procedure for extracting objects from distributed representations. Additionally, we show the applicability of our approach to pre-trained features. Together, these advancements enable us to scale distributed object-centric representations from simple toy to real-world data. We believe this work advances a new paradigm for addressing the binding problem in machine learning and has the potential to inspire further innovation in the field.

Siamese Masked Autoencoders
Agrim Gupta Jiajun Wu Jia Deng Li Fei-Fei



Research question: How to establish correspondence between images or scenes in computer vision, especially in the presence of occlusions, viewpoint changes, and varying object appearances.
Motivation: Learning visual correspondence from video is a significant challenge due to the sheer volume of information and the dynamic nature of objects; existing methods often require complex data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.
Method: This paper proposes Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from video. SiamMAE samples pairs of video frames at random and masks them asymmetrically. The frames are processed independently by an encoder network, while a decoder composed of cross-attention layers predicts the missing patches in the future frame. By masking a large fraction (95%) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations.
Results: Despite its conceptual simplicity, features learned by SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, keypoint propagation, and semantic part propagation. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction (95%) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.
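
The asymmetry is the key design choice, and it is simple to state in code. The sketch below (illustrative shapes only, not the released implementation) keeps the past frame's patch tokens intact while dropping 95% of the future frame's tokens before encoding:

```python
# Hedged sketch of SiamMAE-style asymmetric masking on patch tokens.
import torch

def asymmetric_mask(past_tokens, future_tokens, mask_ratio=0.95):
    # tokens: (num_patches, dim) patch embeddings of each frame
    n = future_tokens.shape[0]
    keep = max(1, int(n * (1 - mask_ratio)))
    perm = torch.randperm(n)
    visible_idx = perm[:keep]                  # the few unmasked future patches
    masked_idx = perm[keep:]                   # the decoder must predict these
    return past_tokens, future_tokens[visible_idx], visible_idx, masked_idx

past = torch.randn(196, 768)                   # e.g. 14x14 ViT patches
future = torch.randn(196, 768)
_, visible, _, masked = asymmetric_mask(past, future)
print(visible.shape, masked.shape)             # ~9 visible vs ~187 masked patches
```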

EgoEnv: Human-centric environment representations from egocentric video
Tushar Nagarajan Santhosh Kumar Ramakrishnan Ruta Desai James Hillis Kristen Grauman



Research question: How to link first-person video with its persistent environment for better human-centric environment understanding.
Motivation: Current video understanding approaches reason over visual features from short clips that are detached from the underlying physical space and capture only what is immediately visible.
Method: Link egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings.
Results: On two human-centric video tasks, models equipped with our environment-aware features consistently outperform counterparts using traditional clip features. Moreover, despite being trained only on simulated videos, the approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge.

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge.

Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion
Yash Sanjay Bhalgat Iro Laina Joao F. Henriques Andrea Vedaldi Andrew Zisserman



Research question: 3D instance segmentation is a challenging task due to the lack of large-scale annotated datasets.
Motivation: This paper proposes leveraging 2D pre-trained models to address 3D instance segmentation.
Method: Propose a novel approach that lifts 2D segments to 3D via a neural field representation and fuses them across frames through multi-view consistency. At its core is a slow-fast clustering objective function that scales to scenes with large numbers of objects.
Results: A newly created semi-realistic dataset, "Messy Rooms", demonstrates the scalability of slow-fast clustering. The method outperforms state-of-the-art approaches on the ScanNet, Hypersim, and Replica datasets, as well as on the new Messy Rooms dataset, demonstrating both the effectiveness and the scalability of the slow-fast clustering method.

Instance segmentation in 3D is a challenging task due to the lack of large-scale annotated datasets. In this paper, we show that this task can be addressed effectively by leveraging instead 2D pre-trained models for instance segmentation. We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation, which encourages multi-view consistency across frames. The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects. Unlike previous approaches, our method does not require an upper bound on the number of objects or object tracking across frames. To demonstrate the scalability of the slow-fast clustering, we create a new semi-realistic dataset called the Messy Rooms dataset, which features scenes with up to 500 objects per scene. Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets, as well as on our newly created Messy Rooms dataset, demonstrating the effectiveness and scalability of our slow-fast clustering method.

PAPR: Proximity Attention Point Rendering
Yanshu Zhang Shichong Peng Seyed Alireza Moazenipourasil Ke Li



Research question: How to learn accurate and parsimonious point cloud representations of scene surfaces from scratch, a challenge in 3D representation learning.
Motivation: Existing point-based learning methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture.
Method: We propose Proximity Attention Point Rendering (PAPR), a novel method consisting of a point-based scene representation and a differentiable renderer. The scene representation uses a point cloud in which each point is characterized by its spatial position, a foreground score, and a view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features.
Results: PAPR effectively learns point positions that represent the correct scene geometry, even when the initialization differs drastically from the target geometry. Notably, the method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications: geometry editing, object manipulation, texture transfer, and exposure control.

Learning accurate and parsimonious point cloud representations of scene surfaces from scratch remains a challenge in 3D representation learning. Existing point-based methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture. To address these limitations, we propose Proximity Attention Point Rendering (PAPR), a novel method that consists of a point-based scene representation and a differentiable renderer. Our scene representation uses a point cloud where each point is characterized by its spatial position, foreground score, and view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features. PAPR effectively learns point cloud positions to represent the correct scene geometry, even when the initialization drastically differs from the target geometry. Notably, our method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications of our method: geometry editing, object manipulation, texture transfer, and exposure control. More results and code are available on our project website at https://zvict.github.io/papr/.

$SE(3)$ Equivariant Convolution and Transformer in Ray Space
Yinshuang Xu Jiahui Lei Kostas Daniilidis



Research question: How to learn geometric priors from multiple views to improve 3D reconstruction and novel view rendering.
Motivation: 3D reconstruction and novel view rendering benefit greatly from geometric priors when the input views are insufficient in terms of coverage and inter-view baselines.
Method: Given only the relative poses of the cameras, learn multi-view priors that are equivariant to coordinate frame transformations by proposing an $SE(3)$-equivariant convolution and transformer in the space of rays.
Results: Our mathematical framework allows us to go beyond convolution to $SE(3)$-equivariant attention in ray space. We demonstrate $SE(3)$-equivariance on roto-translated datasets without performing transformation augmentation.

3D reconstruction and novel view rendering can greatly benefit from geometric priors when the input views are not sufficient in terms of coverage and inter-view baselines. Deep learning of geometric priors from 2D images requires each image to be represented in a $2D$ canonical frame and the prior to be learned in a given or learned $3D$ canonical frame. In this paper, given only the relative poses of the cameras, we show how to learn priors from multiple views equivariant to coordinate frame transformations by proposing an $SE(3)$-equivariant convolution and transformer in the space of rays in 3D. We model the ray space as a homogeneous space of $SE(3)$ and introduce the $SE(3)$-equivariant convolution in ray space. Depending on the output domain of the convolution, we present convolution-based $SE(3)$-equivariant maps from ray space to ray space and to $\mathbb{R}^3$. Our mathematical framework allows us to go beyond convolution to $SE(3)$-equivariant attention in the ray space. We showcase how to tailor and adapt the equivariant convolution and transformer in the tasks of equivariant $3D$ reconstruction and equivariant neural rendering from multiple views. We demonstrate $SE(3)$-equivariance by obtaining robust results in roto-translated datasets without performing transformation augmentation.

Multi-Object Representation Learning via Feature Connectivity and Object-Centric Regularization
Alex Foo Wynne Hsu Mong-Li Lee



Research question: How to discover object-centric representations from images to improve the robustness, sample efficiency, and interpretability of machine learning algorithms.
Motivation: Current work on multi-object images typically follows a generative approach that optimizes input reconstruction, yet fails to scale to real-world datasets despite significant increases in model capacity.
Method: Propose a novel method that leverages feature connectivity to cluster neighboring pixels likely to belong to the same object, and design two object-centric regularization terms to refine object representations in the latent space, enabling the approach to scale to complex real-world images.
Results: Experiments on simulated, real-world, complex-texture, and common-object images show a substantial improvement in the quality of discovered objects over state-of-the-art methods, along with sample efficiency and generalizability. The discovered object-centric representations also accurately predict key object properties in downstream tasks, highlighting the method's potential to advance multi-object representation learning.

Discovering object-centric representations from images has the potential to greatly improve the robustness, sample efficiency and interpretability of machine learning algorithms. Current works on multi-object images typically follow a generative approach that optimizes for input reconstruction and fail to scale to real-world datasets despite significant increases in model capacity. We address this limitation by proposing a novel method that leverages feature connectivity to cluster neighboring pixels likely to belong to the same object. We further design two object-centric regularization terms to refine object representations in the latent space, enabling our approach to scale to complex real-world images. Experimental results on simulated, real-world, complex texture and common object images demonstrate a substantial improvement in the quality of discovered objects compared to state-of-the-art methods, as well as the sample efficiency and generalizability of our approach. We also show that the discovered object-centric representations can accurately predict key object properties in downstream tasks, highlighting the potential of our method to advance the field of multi-object representation learning.

Explore In-Context Learning for 3D Point Cloud Understanding
Zhongbin Fang Xiangtai Li Xia Li Joachim M. Buhmann Chen Change Loy Mengyuan Liu



Research question: How to apply in-context learning to the 3D point cloud domain, particularly when handling large amounts of data.
Motivation: With the rise of large-scale models trained on broad data, in-context learning has become a new learning paradigm with great potential in natural language processing and computer vision, but its application to 3D point clouds remains largely unexplored.
Method: Propose Point-In-Context, a novel framework designed specifically for in-context learning on 3D point clouds, in which both inputs and outputs are modeled as coordinates for each task. A Joint Sampling module, working in tandem with the general point sampling operator, effectively resolves the associated technical issues.
Results: Extensive experiments validate the versatility and adaptability of the proposed method across a wide range of tasks. Furthermore, with a more effective prompt selection strategy, the framework surpasses the results of individually trained models.

With the rise of large-scale models trained on broad data, in-context learning has become a new learning paradigm that has demonstrated significant potential in natural language processing and computer vision tasks. Meanwhile, in-context learning is still largely unexplored in the 3D point cloud domain. Although masked modeling has been successfully applied for in-context learning in 2D vision, directly extending it to 3D point clouds remains a formidable challenge. In the case of point clouds, the tokens themselves are the point cloud positions (coordinates) that are masked during inference. Moreover, position embedding in previous works may inadvertently introduce information leakage. To address these challenges, we introduce a novel framework, named Point-In-Context, designed especially for in-context learning in 3D point clouds, where both inputs and outputs are modeled as coordinates for each task. Additionally, we propose the Joint Sampling module, carefully designed to work in tandem with the general point sampling operator, effectively resolving the aforementioned technical issues. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks. Furthermore, with a more effective prompt selection strategy, our framework surpasses the results of individually trained models.

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
Stephanie Fu Netanel Yakir Tamir Shobhita Sundaram Lucy Chai Richard Zhang Tali Dekel Phillip Isola



Research question: Current perceptual similarity metrics operate at the level of pixels and patches and fail to capture mid-level similarities and differences in image layout, object pose, and semantic content.
Motivation: To develop a perceptual metric that assesses images holistically.
Method: Collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways, using recent text-to-image models to create synthetic pairs perturbed along various dimensions, and then introduce a new metric, DreamSim, tuned to better align with human perception.
Results: Experiments show that despite being trained on synthetic data, DreamSim generalizes well to real images and outperforms both prior learned metrics and recent large vision models on retrieval and reconstruction tasks.

Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks. Our project page: https://dreamsim-nights.github.io/
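
At inference time, a metric of this family reduces to a distance between backbone embeddings. A minimal sketch of that general form (conceptual only; the actual DreamSim backbone, weights, and feature extraction are not reproduced here):

```python
# Hedged sketch: a learned perceptual distance as cosine distance between embeddings.
import torch
import torch.nn.functional as F

def perceptual_distance(feat_a, feat_b):
    # feat_*: (d,) embeddings of two images from a (fine-tuned) vision backbone
    return 1.0 - F.cosine_similarity(feat_a.unsqueeze(0), feat_b.unsqueeze(0)).item()

fa, fb = torch.randn(768), torch.randn(768)
print(perceptual_distance(fa, fb))   # 0 = identical embeddings, up to 2 = opposite
```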

Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
Xudong XU Dejan Markovic Jacob Sandakly Todd Keebler Steven Krenn Alexander Richard



Research question: How to model the 3D spatial sound produced by full-body motion and speech, from body pose and audio signals.
Motivation: While 3D human body modeling has received much attention in computer vision, modeling the acoustic counterpart, i.e., the 3D spatial audio produced by body motion and speech, has not received adequate attention from the community.
Method: We present a model that generates accurate 3D spatial audio for full human bodies. The system takes audio signals from headset microphones and body pose as input, and produces as output a 3D sound field surrounding the transmitter's body, from which spatial audio can be rendered at any position in 3D space.
Results: We collect a first-of-its-kind multimodal human-body dataset, recorded with multiple cameras and 345 microphones. Empirical evaluation shows that the model produces accurate body-induced sound fields when trained with a suitable loss. Dataset and code are available online.

While 3D human body modeling has received much attention in computer vision, modeling the acoustic equivalent, i.e. modeling 3D spatial audio produced by body motion and speech, has fallen short in the community. To close this gap, we present a model that can generate accurate 3D spatial audio for full human bodies. The system consumes, as input, audio signals from headset microphones and body pose, and produces, as output, a 3D sound field surrounding the transmitter's body, from which spatial audio can be rendered at any arbitrary position in the 3D space. We collect a first-of-its-kind multimodal dataset of human bodies, recorded with multiple cameras and a spherical array of 345 microphones. In an empirical evaluation, we demonstrate that our model can produce accurate body-induced sound fields when trained with a suitable loss. Dataset and code are available online.

DreamHuman: Animatable 3D Avatars from Text
Nikos Kolotouros Thiemo Alldieck Andrei Zanfir Eduard Gabriel Bazavan Mihai Fieraru Cristian Sminchisescu



Research question: How to generate realistic, animatable 3D human models from textual descriptions.
Motivation: Existing text-to-3D methods have made considerable progress in generation, but control and spatial resolution remain limited; they produce fixed 3D human models rather than ones that can be placed in different poses (i.e., re-posable or animatable), and anthropometric consistency for complex structures like people remains challenging.
Method: Connect large text-to-image synthesis models, neural radiance fields, and statistical human body models in a novel optimization framework to generate dynamic 3D human avatars from text.
Results: The method generates a wide variety of animatable, realistic 3D human models with diverse appearance, clothing, skin tones, and body shapes, outperforming both generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity.

We present \emph{DreamHuman}, a method to generate realistic animatable 3D human avatar models entirely from textual descriptions. Recent text-to-3D methods have made considerable strides in generation, but are still lacking in important aspects. Control and often spatial resolution remain limited; existing methods produce fixed 3D human models rather than ones that can be placed in different poses (i.e. re-posable or animatable); and anthropometric consistency for complex structures like people remains a challenge. \emph{DreamHuman} connects large text-to-image synthesis models, neural radiance fields, and statistical human body models in a novel optimization framework. This makes it possible to generate dynamic 3D human avatars with high-quality textures and learnt per-instance rigid and non-rigid geometric deformations. We demonstrate that our method is capable of generating a wide variety of animatable, realistic 3D human models from text. These have diverse appearance, clothing, skin tones and body shapes, and outperform both generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity.

SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling
Jiaxiang Dong Haixu Wu Haoran Zhang Li Zhang Jianmin Wang Mingsheng Long



Research question: Existing time-series pre-training models that mask a portion of time points seriously disrupt vital temporal variations, making the reconstruction task too difficult to effectively guide representation learning.
Motivation: To address this, we propose SimMTM, a simple and effective pre-training framework that relates masked modeling to manifold learning in order to ease the reconstruction task.
Method: SimMTM recovers masked time points by the weighted aggregation of multiple neighbors outside the manifold, assembling ruined but complementary temporal variations from multiple masked series. SimMTM additionally learns to uncover the local structure of the manifold, which is helpful for masked modeling.
Results: Experiments show that SimMTM achieves state-of-the-art fine-tuning performance over the most advanced time-series pre-training methods on the two canonical tasks of forecasting and classification, in both in-domain and cross-domain settings.

Time series analysis is widely used in extensive areas. Recently, to reduce labeling expenses and benefit various tasks, self-supervised pre-training has attracted immense interest. One mainstream paradigm is masked modeling, which successfully pre-trains deep models by learning to reconstruct the masked content based on the unmasked part. However, since the semantic information of time series is mainly contained in temporal variations, the standard way of randomly masking a portion of time points will seriously ruin vital temporal variations of time series, making the reconstruction task too difficult to guide representation learning. We thus present SimMTM, a Simple pre-training framework for Masked Time-series Modeling. By relating masked modeling to manifold learning, SimMTM proposes to recover masked time points by the weighted aggregation of multiple neighbors outside the manifold, which eases the reconstruction task by assembling ruined but complementary temporal variations from multiple masked series. SimMTM further learns to uncover the local structure of the manifold, which is helpful for masked modeling. Experimentally, SimMTM achieves state-of-the-art fine-tuning performance compared to the most advanced time series pre-training methods in two canonical time series analysis tasks: forecasting and classification, covering both in- and cross-domain settings.
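
The central reconstruction rule, stripped of the architecture, is a similarity-weighted average over multiple masked views. Here is a hedged sketch of that aggregation step (tensor shapes and the temperature tau are illustrative, not the paper's exact formulation):

```python
# Hedged sketch: SimMTM-style recovery of a series from masked "neighbor" views.
import torch
import torch.nn.functional as F

def aggregate_neighbors(target_repr, neighbor_reprs, neighbor_series, tau=0.1):
    # target_repr: (d,) embedding of the series to recover
    # neighbor_reprs: (k, d) embeddings of masked views; neighbor_series: (k, T)
    sims = F.cosine_similarity(neighbor_reprs, target_repr.unsqueeze(0), dim=-1)
    weights = torch.softmax(sims / tau, dim=0)     # series-wise similarity weights
    return (weights.unsqueeze(-1) * neighbor_series).sum(dim=0)   # (T,) reconstruction

target = torch.randn(32)
neighbor_reprs, neighbor_series = torch.randn(4, 32), torch.randn(4, 128)
recovered = aggregate_neighbors(target, neighbor_reprs, neighbor_series)
print(recovered.shape)                             # torch.Size([128])
```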

Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction
Feng Wang Zilong Chen Guokang Wang Yafei Song Huaping Liu



Research question: How to efficiently reconstruct dynamic 3D scenes from multi-view or monocular videos.
Motivation: Dynamic scenes often contain substantial static areas, which cause redundancy in storage and computation.
Method: Propose a novel Masked Space-Time Hash encoding (MSTH) that represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding, where the weights are given by a learnable mask reflecting the spatial and temporal importance of each 3D position.
Results: The method reduces the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent large numbers of space-time voxels with small hash tables. Moreover, without having to fit many temporally redundant features independently, the method is easier to optimize and converges rapidly, requiring only 20 minutes of training for a 300-frame dynamic scene. Across extensive dynamic-scene evaluations, MSTH consistently outperforms previous state-of-the-art methods with only 20 minutes of training time and 130 MB of memory storage.

In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels by hash tables of small size. Besides, without the requirement to fit large numbers of temporally redundant features independently, our method is easier to optimize and converges rapidly, with only twenty minutes of training for a 300-frame dynamic scene. We evaluate our method on extensive dynamic scenes. As a result, MSTH obtains consistently better results than previous state-of-the-art methods with only 20 minutes of training time and 130 MB of memory storage.
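
The representation itself is a two-branch encoding gated by a learnable mask. The sketch below uses dummy linear layers as stand-ins for the real hash grids, so it illustrates only the masked combination, not the paper's actual encoding:

```python
# Hedged sketch: MSTH-style masked combination of a 3D and a 4D encoding.
import torch
import torch.nn as nn

class MaskedSpaceTimeEncoding(nn.Module):
    def __init__(self, feat_dim=16):
        super().__init__()
        self.hash3d = nn.Linear(3, feat_dim)      # stand-in for a 3D hash encoding
        self.hash4d = nn.Linear(4, feat_dim)      # stand-in for a 4D hash encoding
        self.mask_net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(),
                                      nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, xyz, t):
        m = self.mask_net(xyz)                    # ~1 for static regions, ~0 for dynamic
        static = self.hash3d(xyz)
        dynamic = self.hash4d(torch.cat([xyz, t], dim=-1))
        return m * static + (1 - m) * dynamic     # static areas bypass the 4D table

enc = MaskedSpaceTimeEncoding()
feats = enc(torch.rand(1024, 3), torch.rand(1024, 1))
print(feats.shape)                                # torch.Size([1024, 16])
```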

AIMS: All-Inclusive Multi-Level Segmentation for Anything
Lu Qi Jason Kuen Weidong Guo Jiuxiang Gu Zhe Lin Bo Du Yu Xu Ming-Hsuan Yang



Research question: Despite progress in image segmentation for accurate visual entity segmentation, meeting the diverse requirements of image editing applications for region-of-interest selection at different levels remains unsolved.
Motivation: This paper proposes a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation (two entities with some semantic relationship).
Method: Build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation; specifically, we propose task complementarity, association, and a prompt mask encoder for three-level prediction.
Results: Extensive experiments show the method is more effective and generalizes better than other state-of-the-art methods trained on a single dataset, as well as the concurrent segment-anything work. Code and trained models will be released publicly.

Despite the progress of image segmentation for accurate visual entity segmentation, meeting the diverse requirements of image editing applications for different-level region-of-interest selections remains unsolved. In this paper, we propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation (two entities with some semantic relationships). We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation. Specifically, we propose task complementarity, association, and a prompt mask encoder for three-level predictions. Extensive experiments demonstrate the effectiveness and generalization capacity of our method compared to other state-of-the-art methods trained on a single dataset, as well as to the concurrent work on segment anything. We will make our code and trained models publicly available.

L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors
Zheng Chang Shuchen Weng Peixuan Zhang Yu Li Si Li Boxin Shi



Research question: Existing language-based colorization methods require users to provide comprehensive color descriptions for most objects in an image, which leads to suboptimal performance.
Motivation: To remove this requirement and support color descriptions at any level of detail.
Method: Propose a unified model for language-based colorization with any-level descriptions, leveraging a pretrained cross-modality generative model to handle the inherent ambiguity of such descriptions, and designing modules that preserve local spatial structure and prevent the ghosting effect.
Results: With the proposed novel sampling strategy, the model achieves instance-aware colorization in diverse and complex scenarios, and experimental results show that it outperforms both language-based and automatic colorization methods.

Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions. Previous methods implicitly assume that users provide comprehensive color descriptions for most of the objects in the image, which leads to suboptimal performance. In this paper, we propose a unified model to perform language-based colorization with any-level descriptions. We leverage the pretrained cross-modality generative model for its robust language understanding and rich color priors to handle the inherent ambiguity of any-level descriptions. We further design modules to align with input conditions to preserve local spatial structures and prevent the ghosting effect. With the proposed novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios. Extensive experimental results demonstrate our advantages of effectively handling any-level descriptions and outperforming both language-based and automatic colorization methods. The code and pretrained models are available at: https://github.com/changzheng123/L-CAD.

Transient Neural Radiance Fields for Lidar View Synthesis and 3D Reconstruction
Anagh Malik Parsa Mirdehghan Sotiris Nousias Kyros Kutulakos David B. Lindell



Research question: How to exploit additional supervision from lidar or depth sensors for rendering within the NeRF framework.
Motivation: Previous lidar-supervised NeRFs focus on rendering conventional camera imagery and use lidar-derived point clouds only as auxiliary supervision, so they fail to incorporate the lidar's underlying image formation model.
Method: Propose a novel method for rendering transient NeRFs that takes as input the raw, time-resolved photon count histograms measured by a single-photon lidar system and renders such histograms from novel views. Unlike conventional NeRFs, the approach relies on a time-resolved version of the volume rendering equation to render lidar measurements and capture transient light transport phenomena at picosecond timescales.
Results: The method is evaluated on a first-of-its-kind dataset of simulated and captured transient multi-view scans from a prototype single-photon lidar. Overall, this work brings NeRFs to a new dimension of imaging at transient timescales, newly enabling the rendering of transient imagery from novel views. Furthermore, when trained on few input viewpoints, the method recovers better geometry and conventional appearance than point-cloud-based supervision. Transient NeRFs may be especially useful for applications that seek to simulate raw lidar measurements for downstream tasks in autonomous driving, robotics, and remote sensing.

Neural radiance fields (NeRFs) have become a ubiquitous tool for modeling scene appearance and geometry from multiview imagery. Recent work has also begun to explore how to use additional supervision from lidar or depth sensor measurements in the NeRF framework. However, previous lidar-supervised NeRFs focus on rendering conventional camera imagery and use lidar-derived point cloud data as auxiliary supervision; thus, they fail to incorporate the underlying image formation model of the lidar. Here, we propose a novel method for rendering transient NeRFs that take as input the raw, time-resolved photon count histograms measured by a single-photon lidar system, and we seek to render such histograms from novel views. Different from conventional NeRFs, the approach relies on a time-resolved version of the volume rendering equation to render the lidar measurements and capture transient light transport phenomena at picosecond timescales. We evaluate our method on a first-of-its-kind dataset of simulated and captured transient multiview scans from a prototype single-photon lidar. Overall, our work brings NeRFs to a new dimension of imaging at transient timescales, newly enabling rendering of transient imagery from novel views. Additionally, we show that our approach recovers improved geometry and conventional appearance compared to point cloud-based supervision when training on few input viewpoints. Transient NeRFs may be especially useful for applications which seek to simulate raw lidar measurements for downstream tasks in autonomous driving, robotics, and remote sensing.

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
Ziyi Wu Jingyu Hu Wuyue Lu Igor Gilitschenski Animesh Garg



Research question: How to improve slot-based object-centric learning models on unsupervised object discovery and visual generation tasks.
Motivation: Existing slot-based methods perform poorly on these tasks, particularly in the generative quality of image and video data, often producing blurry images and distorted objects.
Method: Propose SlotDiffusion, an object-centric Latent Diffusion Model (LDM) that improves slot-to-image decoding, raising the quality of both unsupervised object segmentation and visual generation.
Results: Experiments show that SlotDiffusion surpasses previous slot models on unsupervised object segmentation and visual generation across six datasets. The learned object features can also be used by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, when integrated with self-supervised pre-trained image encoders, SlotDiffusion scales to unconstrained real-world datasets such as PASCAL VOC and COCO.

Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
Shitao Tang Fuyang Zhang Jiacheng Chen Peng Wang Yasutaka Furukawa



Research question: Multi-view image generation in scenarios where pixel-to-pixel correspondences are available, such as perspective crops from a panorama or multi-view images given geometry (depth maps and poses).
Motivation: Existing methods rely on iterative image warping and inpainting; MVDiffusion instead generates all images concurrently with global awareness, addressing the error accumulation of preceding models.
Method: MVDiffusion introduces a correspondence-aware attention mechanism that enables effective cross-view interaction. This mechanism underpins three key modules: 1) a generation module that produces low-resolution images while maintaining global correspondence; 2) an interpolation module that densifies spatial coverage between images; and 3) a super-resolution module that upscales to high-resolution images.
Results: For panoramic imagery, MVDiffusion generates high-resolution photorealistic images up to 1024*1024 pixels. For geometry-conditioned multi-view image generation, MVDiffusion demonstrates state-of-the-art performance on texture-map generation for a given scene mesh.

This paper introduces MVDiffusion, a simple yet effective multi-view image generation method for scenarios where pixel-to-pixel correspondences are available, such as perspective crops from panorama or multi-view images given geometry (depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion concurrently generates all images with a global awareness, encompassing high resolution and rich content, effectively addressing the error accumulation prevalent in preceding models. MVDiffusion specifically incorporates a correspondence-aware attention mechanism, enabling effective cross-view interaction. This mechanism underpins three pivotal modules: 1) a generation module that produces low-resolution images while maintaining global correspondence, 2) an interpolation module that densifies spatial coverage between images, and 3) a super-resolution module that upscales into high-resolution images. In terms of panoramic imagery, MVDiffusion generates high-resolution photorealistic images up to 1024*1024 pixels. For geometry-conditioned multi-view image generation, MVDiffusion demonstrates state-of-the-art performance on texture-map generation for a given scene mesh. We recommend referring to our Arxiv version at https://arxiv.org/pdf/2307.01097.pdf for the latest update. The project page is at https://mvdiffusion.github.io/.

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models
Youquan Liu Lingdong Kong Jun CEN Runnan Chen Wenwei Zhang Liang Pan Kai Chen Ziwei Liu



Research question: This paper develops a new framework that harnesses vision foundation models (VFMs) for segmenting diverse automotive point cloud sequences.
Motivation: Existing methods require extensive annotations for point cloud data and generalize poorly across point clouds of different sources, resolutions, and scales.
Method: Propose the Seal framework, which distills VFMs directly into point clouds for pretraining while enforcing spatial and temporal relationships at the camera-to-LiDAR and point-to-segment regularization stages to facilitate cross-modal representation learning.
Results: Experiments on eleven different point cloud datasets demonstrate Seal's effectiveness and superiority, with significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all tested datasets.

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, obviating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment regularization stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior arts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets. The code is available at this link.

Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching
Junsheng Zhou Baorui Ma Wenyuan Zhang Yi Fang Yu-Shen Liu Zhizhong Han



Research question: How to achieve cross-modality registration between 2D images and 3D point clouds.
Motivation: Existing methods struggle to match point and pixel patterns and to estimate rigid transformations robustly, which leads to unstable registration results.
Method: Propose learning a structured cross-modality latent space that represents pixel features and 3D features via a differentiable probabilistic PnP solver. A triplet network learns VoxelPoint-to-Pixel matching, with CNN-based voxel and pixel branches operating convolutions on voxels/pixels represented in grids, and the whole framework is trained with supervision imposed directly on the predicted pose distribution.
Results: Experimental results on the KITTI and nuScenes datasets show significant improvements over existing methods.

Cross-modality registration between 2D images captured by cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotics. Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks, and use Perspective-n-Points (PnP) to estimate rigid transformation during post-processing. However, these methods struggle to map points and pixels to a shared latent space robustly since points and pixels have very different characteristics with patterns learned in different manners (MLP and CNN), and they also fail to construct supervision directly on the transformation since the PnP is non-differentiable, which leads to unstable registration results. To address these problems, we propose to learn a structured cross-modality latent space to represent pixel features and 3D features via a differentiable probabilistic PnP solver. Specifically, we design a triplet network to learn VoxelPoint-to-Pixel matching, where we represent 3D elements using both voxels and points to learn the cross-modality latent space with pixels. We design both the voxel and pixel branch based on CNNs to operate convolutions on voxels/pixels represented in grids, and integrate an additional point branch to regain the information lost during voxelization. We train our framework end-to-end by imposing supervisions directly on the predicted pose distribution with a probabilistic PnP solver. To explore distinctive patterns of cross-modality features, we design a novel loss with adaptive-weighted optimization for cross-modality feature description. The experimental results on KITTI and nuScenes datasets show significant improvements over the state-of-the-art methods.

Context-PIPs: Persistent Independent Particles Demands Context Features
Weikang BIAN Zhaoyang Huang Xiaoyu Shi Yitong Dong Yijin Li Hongsheng Li



Research question: This paper addresses Persistent Independent Particles (PIPs) in videos, i.e., tracking arbitrary points in a video.
Motivation: Existing methods estimate long-term point trajectories independently and neglect spatial context features.
Method: We propose Context-PIPs, a novel framework that effectively improves point trajectory accuracy by aggregating spatial context features in videos. The framework contains two main modules: 1) a SOurce Feature Enhancement (SOFE) module and 2) a TArget Feature Aggregation (TAFA) module.
Results: Context-PIPs improves PIPs across the board, reducing the Average Trajectory Error of occluded points on CroHD by 11.4% and increasing the average percentage of correct keypoints on TAP-Vid-Kinetics by 11.8%.

We tackle the problem of Persistent Independent Particles (PIPs), also called Tracking Any Point (TAP), in videos, which specifically aims at estimating persistent long-term trajectories of query points in videos. Previous methods attempted to estimate these trajectories independently to incorporate longer image sequences, thereby ignoring the potential benefits of incorporating spatial context features. We argue that independent video point tracking also demands spatial context features. To this end, we propose a novel framework, Context-PIPs, which effectively improves point trajectory accuracy by aggregating spatial context features in videos. Context-PIPs contains two main modules: 1) a SOurce Feature Enhancement (SOFE) module, and 2) a TArget Feature Aggregation (TAFA) module. Context-PIPs significantly improves PIPs across the board, reducing the Average Trajectory Error of Occluded Points (ATE-Occ) on CroHD by 11.4\% and increasing the Average Percentage of Correct Keypoints (A-PCK) on TAP-Vid-Kinetics by 11.8\%. Demos are available at \url{https://wkbian.github.io/Projects/Context-PIPs/}.

4D Panoptic Scene Graph Generation
Jingkang Yang Jun CEN Wenxuan Peng Shuai Liu Fangzhou Hong Xiangtai Li Kaiyang Zhou Qifeng Chen Ziwei Liu



Research question: How to give artificial intelligence a comprehensive understanding of 4D environments.
Motivation: We live in a three-dimensional space while moving forward through a fourth dimension, time. To allow AI to develop a comprehensive understanding of such a 4D environment, we propose the 4D Panoptic Scene Graph (PSG-4D).
Method: PSG-4D abstracts the raw visual data perceived in a dynamic 4D world into nodes, which represent entities with precise location and status information, and edges, which capture temporal relations. We also build a richly annotated PSG-4D dataset and design the PSG4DFormer model for prediction and generation.
Results: Experiments show that our model can serve as a strong baseline for future PSG-4D research. By integrating a large language model, we further achieve dynamic scene understanding.

We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce **4D Panoptic Scene Graph (PSG-4D)**, a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.
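
As a data structure, a PSG-4D graph is simply entity nodes carrying per-frame state plus temporally extended relation edges. A hypothetical minimal schema (illustrative only, not the dataset's actual format) might look like:

```python
# Hedged sketch: an illustrative PSG-4D-style graph schema.
from dataclasses import dataclass, field

@dataclass
class Node:
    entity_id: int
    category: str
    masks: dict = field(default_factory=dict)  # frame index -> panoptic segmentation mask

@dataclass
class Edge:
    subject: int                               # entity_id of the subject node
    obj: int                                   # entity_id of the object node
    relation: str
    frames: tuple = ()                         # time span over which the relation holds

graph = {
    "nodes": [Node(0, "person"), Node(1, "cup")],
    "edges": [Edge(0, 1, "holding", frames=(12, 87))],
}
print(graph["edges"][0].relation)
```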

VoxDet: Voxel Learning for Novel Instance Detection
Bowen Li Jiashun Wang Yaoyu Hu Chen Wang Sebastian Scherer



Research question: Detecting unseen instances from multi-view templates is a challenging problem due to its open-world nature.
Motivation: Traditional methods that rely mainly on 2D representations and matching techniques often fall short when handling pose variations and occlusions.
Method: We introduce VoxDet, an innovative 3D geometry-aware framework that fully utilizes a strong 3D voxel representation and a reliable voxel matching mechanism. VoxDet first proposes a Template Voxel Aggregation (TVA) module that transforms multi-view 2D images into 3D voxel features; using the associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation shows increased resilience to occlusion and pose variations. We also find that a 3D reconstruction objective helps pre-train the 2D-3D mapping in TVA. Second, for fast alignment with the template voxel, VoxDet integrates a Query Voxel Matching (QVM) module: 2D queries are first converted into voxel representations with the learned 2D-3D mapping, and because the 3D voxels encode geometry, we can first estimate the relative rotation and then compare the aligned voxels, improving both accuracy and efficiency.
Results: We conduct exhaustive experiments on the demanding LineMod-Occlusion, YCB-video, and RoboTools benchmarks, where VoxDet significantly outperforms various 2D baselines at faster speed. RoboTools, the first instance detection benchmark of its kind that we introduce, records 20 unique instances with camera extrinsics and provides 24 challenging cluttered scenes with over 9k box annotations. To our knowledge, VoxDet is the first method to incorporate implicit 3D knowledge for 2D novel instance detection.

Detecting unseen instances based on multi-view templates is a challenging problem due to its open-world nature. Traditional methodologies, which primarily rely on $2 \mathrm{D}$ representations and matching techniques, are often inadequate in handling pose variations and occlusions. To solve this, we introduce VoxDet, a pioneering 3D geometry-aware framework that fully utilizes the strong 3D voxel representation and reliable voxel matching mechanism. VoxDet first proposes a template voxel aggregation (TVA) module, effectively transforming multi-view 2D images into 3D voxel features. By leveraging associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation demonstrates heightened resilience to occlusion and pose variations. We also discover that a $3 \mathrm{D}$ reconstruction objective helps to pre-train the 2D-3D mapping in TVA. Second, to quickly align with the template voxel, VoxDet incorporates a Query Voxel Matching (QVM) module. The 2D queries are first converted into their voxel representation with the learned 2D-3D mapping. We find that since the 3D voxel representations encode the geometry, we can first estimate the relative rotation and then compare the aligned voxels, leading to improved accuracy and efficiency. In addition to the method, we also introduce the first instance detection benchmark, RoboTools, where 20 unique instances are video-recorded with camera extrinsics. RoboTools also provides 24 challenging cluttered scenarios with more than $9 \mathrm{k}$ box annotations. Exhaustive experiments are conducted on the demanding LineMod-Occlusion, YCB-video, and RoboTools benchmarks, where VoxDet remarkably outperforms various $2 \mathrm{D}$ baselines at faster speed. To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D novel instance detection tasks.

Diverse Shape Completion via Style Modulated Generative Adversarial Networks
Wesley Khademi Li Fuxin



Research question: How to recover the full 3D geometry of an object from a partial observation.
Motivation: Shape completion is inherently multi-modal, since there are many plausible ways to complete the missing regions of a shape. Such diversity reflects the underlying uncertainty of the shape and can be preferable for downstream tasks such as planning.
Method: Propose a novel conditional generative adversarial network that produces multiple diverse plausible completions from a partially observed point cloud. Stochasticity is introduced into the network via style modulation: style codes are extracted from complete shapes during training and their distribution is learned, so the codes explicitly carry shape category information, leading to better completions.
Results: Evaluations on several synthetic and real datasets show that the method achieves significant improvements in respecting the partial observations while obtaining greater diversity in completions.

Shape completion aims to recover the full 3D geometry of an object from a partial observation. This problem is inherently multi-modal since there can be many ways to plausibly complete the missing regions of a shape. Such diversity would be indicative of the underlying uncertainty of the shape and could be preferable for downstream tasks such as planning. In this paper, we propose a novel conditional generative adversarial network that can produce many diverse plausible completions of a partially observed point cloud. To enable our network to produce multiple completions for the same partial input, we introduce stochasticity into our network via style modulation. By extracting style codes from complete shapes during training, and learning a distribution over them, our style codes can explicitly carry shape category information leading to better completions. We further introduce diversity penalties and discriminators at multiple scales to prevent conditional mode collapse and to train without the need for multiple ground truth completions for each partial input. Evaluations across several synthetic and real datasets demonstrate that our method achieves significant improvements in respecting the partial observations while obtaining greater diversity in completions.

A polar prediction model for learning to represent visual transformations
Pierre-Étienne H Fiquet Eero P Simoncelli



Research question: How the visual system can represent sensory inputs in a form that simplifies temporal prediction, and how to learn representations of the transformations present in natural videos.
Motivation: All organisms make temporal predictions, and their evolutionary fitness depends on the accuracy of these predictions. In visual perception, the motion of the observer and of objects in the scene structures the dynamics of sensory signals, allowing partial prediction of future signals from past ones.
Method: We propose a self-supervised representation-learning framework that extracts and exploits the regularities of natural videos to compute accurate predictions. Motivated by the Fourier shift theorem and its group-theoretic generalization, we optimize the parameters of a polar architecture for next-frame prediction.
Results: Controlled experiments show that this approach discovers representations of simple transformation groups acting in the data. Trained on natural video datasets, the framework outperforms traditional motion compensation and rivals conventional deep networks while maintaining interpretability and speed. Moreover, the polar computations can be restructured into components resembling normalized simple and direction-selective complex cell models of primate V1 neurons, so polar prediction offers a principled account of how the visual system may represent sensory inputs in a form that simplifies temporal prediction.

All organisms make temporal predictions, and their evolutionary fitness level depends on the accuracy of these predictions. In the context of visual perception, the motions of both the observer and objects in the scene structure the dynamics of sensory signals, allowing for partial prediction of future signals based on past ones. Here, we propose a self-supervised representation-learning framework that extracts and exploits the regularities of natural videos to compute accurate predictions. We motivate the polar architecture by appealing to the Fourier shift theorem and its group-theoretic generalization, and we optimize its parameters on next-frame prediction. Through controlled experiments, we demonstrate that this approach can discover the representation of simple transformation groups acting in data. When trained on natural video datasets, our framework achieves better prediction performance than traditional motion compensation and rivals conventional deep networks, while maintaining interpretability and speed. Furthermore, the polar computations can be restructured into components resembling normalized simple and direction-selective complex cell models of primate V1 neurons. Thus, polar prediction offers a principled framework for understanding how the visual system represents sensory inputs in a form that simplifies temporal prediction.
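
A toy instance of the underlying principle (not the trained model): by the Fourier shift theorem, a global translation multiplies each frequency coefficient by a phase factor, so advancing every coefficient's phase by its last observed frame-to-frame change predicts the next frame of a translating signal almost exactly.

```python
# Hedged sketch: next-frame prediction by per-frequency phase extrapolation.
import numpy as np

x = np.linspace(0.0, 1.0, 256, endpoint=False)
# A periodic Gaussian bump translating at constant speed (illustrative signal).
frame = lambda s: np.exp(-(((x - s) % 1.0 - 0.5) ** 2) / 0.01)

f0, f1 = np.fft.fft(frame(0.00)), np.fft.fft(frame(0.02))   # two observed frames
phase_step = np.exp(1j * (np.angle(f1) - np.angle(f0)))     # per-frequency phase advance
pred = np.real(np.fft.ifft(f1 * phase_step))                # polar extrapolation

truth = frame(0.04)                                         # the signal keeps translating
print(np.abs(pred - truth).max())                           # tiny: near-exact for pure shifts
```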

Equivariant Single View Pose Prediction Via Induced and Restriction Representations
Owen Lewis Howell David Klee Ondrej Biza Linfeng Zhao Robin Walters



Research question: How to learn about the three-dimensional world from two-dimensional images and make 3D predictions that respect rotations and translations.
Motivation: An ideal neural network architecture would exploit the fact that objects can be rotated and translated in three dimensions when making predictions about novel images, but imposing $SO(3)$-equivariance on two-dimensional inputs is difficult.
Method: By formulating the required consistency properties as $SO(2)$-equivariance constraints and using the induced representation of $SO(2)$ on $SO(3)$, we construct a new algorithm that learns three-dimensional representations of the world from two-dimensional images.
Results: Our algorithm achieves state-of-the-art results on the PASCAL3D+ and SYMSOL pose estimation tasks, demonstrating its effectiveness.

Learning about the three-dimensional world from two-dimensional images is a fundamental problem in computer vision. An ideal neural network architecture for such tasks would leverage the fact that objects can be rotated and translated in three dimensions to make predictions about novel images. However, imposing $SO(3)$-equivariance on two-dimensional inputs is difficult because the group of three-dimensional rotations does not have a natural action on the two-dimensional plane. Specifically, it is possible that an element of $SO(3)$ will rotate an image out of plane. We show that an algorithm that learns a three-dimensional representation of the world from two dimensional images must satisfy certain consistency properties which we formulate as $SO(2)$-equivariance constraints. We use the induced representation of $SO(2)$ on $SO(3)$ to construct and classify architectures that have two-dimensional inputs and which satisfy these consistency constraints. We prove that any architecture which respects said consistency constraints can be realized as an instance of our construction. We show that three previously proposed neural architectures for 3D pose prediction are special cases of our construction. We propose a new algorithm that is a learnable generalization of previously considered methods. We test our architecture on three pose prediction tasks and achieve SOTA results on both the PASCAL3D+ and SYMSOL pose estimation tasks.

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Antonín Vobecký Oriane Siméoni David Hurych Spyros Gidaris Andrei Bursuc Patrick Perez Josef Sivic



Research question: How to predict an open-vocabulary 3D semantic voxel occupancy map from 2D images, to enable 3D grounding, segmentation, and retrieval of free-form language queries.
Motivation: This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, for which annotated 3D training data is difficult to obtain.
Method: Design a new model architecture for open-vocabulary 3D semantic occupancy prediction, consisting of a 2D-3D encoder together with occupancy prediction and 3D-language heads. Develop a self-supervised learning algorithm that leverages three modalities (images, language, and LiDAR point clouds) to train the proposed architecture without any manual 3D language annotations.
Results: We demonstrate the model quantitatively on several open-vocabulary tasks, including zero-shot 3D semantic segmentation on existing datasets, and 3D grounding and retrieval of free-form language queries using a small dataset we propose as an extension of nuScenes.

We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: Zero-shot 3D semantic segmentation using existing datasets; 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes.

UP-NeRF: Unconstrained Pose Prior-Free Neural Radiance Field
Injae Kim Minhyuk Choi Hyunwoo J. Kim



Research question: Current neural radiance field (NeRF) models struggle with unconstrained image collections that exhibit varying illumination and transient occluders.
Motivation: To address these issues, this paper proposes UP-NeRF, a neural radiance field that requires no camera pose prior.
Method: Optimize color-insensitive feature fields and a separate module that blocks the influence of transient occluders on pose estimation, introduce a candidate head for more robust pose estimation, and adopt transient-aware depth supervision to minimize the effect of incorrect priors.
Results: Experiments show that UP-NeRF outperforms baselines, including BARF and its variants, on the challenging internet photo collection, the Phototourism dataset.

Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the necessity of pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings such as photometrically consistent and occluder-free image collections or a sequence of images from a video. So they have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose **UP-NeRF** (**U**nconstrained **P**ose-prior-free **Ne**ural **R**adiance **F**ields) to optimize NeRF with unconstrained image collections without camera pose prior. We tackle these challenges with surrogate tasks that optimize color-insensitive feature fields and a separate module for transient occluders to block their influence on pose estimation. In addition, we introduce a candidate head to enable more robust pose estimation and transient-aware depth supervision to minimize the effect of incorrect prior. Our experiments verify the superior performance of our method compared to the baselines including BARF and its variants in a challenging internet photo collection, *Phototourism dataset*. The code of UP-NeRF is available at https://github.com/mlvlab/UP-NeRF.

Detection Based Part-level Articulated Object Reconstruction from Single RGBD Image
Yuki Kawana Tatsuya Harada



Research question: How to reconstruct multiple man-made articulated objects from a single RGBD image while estimating their pose and kinematics.
Motivation: Existing reconstruction methods rely mainly on learning an instance-level latent space and are restricted to man-made articulated objects with predefined part counts.
Method: Propose a novel part-level representation that models instances as combinations of detected parts, together with test-time kinematics-aware part fusion, anisotropic scale normalization, and a balancing strategy for cross-refinement between feature space and output space to address detection performance, false positives, and varying part sizes and scales.
Results: Experiments demonstrate that the method successfully reconstructs variously structured instances and outperforms existing methods in shape reconstruction and kinematics estimation.

We propose an end-to-end trainable, cross-category method for reconstructing multiple man-made articulated objects from a single RGBD image, focusing on part-level shape reconstruction and pose and kinematics estimation. We depart from previous works that rely on learning instance-level latent space, focusing on man-made articulated objects with predefined part counts. Instead, we propose a novel alternative approach that employs part-level representation, representing instances as combinations of detected parts. While our detect-then-group approach effectively handles instances with diverse part structures and various part counts, it faces issues of false positives, varying part sizes and scales, and an increasing model size due to end-to-end training. To address these challenges, we propose 1) test-time kinematics-aware part fusion to improve detection performance while suppressing false positives, 2) anisotropic scale normalization for part shape learning to accommodate various part sizes and scales, and 3) a balancing strategy for cross-refinement between feature space and output space to improve part detection while maintaining model size. Evaluation on both synthetic and real data demonstrates that our method successfully reconstructs variously structured multiple instances that previous works cannot handle, and outperforms prior works in shape reconstruction and kinematics estimation.

Flow-Attention-based Spatio-Temporal Aggregation Network for 3D Mask Detection
Yuxin Cao Yian Li Yumeng Zhu Derui Wang Minhui Xue



Research question: Anti-spoofing detection has become a necessity for face recognition systems because of the security threat posed by spoofing attacks. Most deep-learning-based methods perform poorly on 3D masks, which closely imitate real faces in appearance and structure, and generalize insufficiently because they focus only on the spatial domain with single-frame input.
Motivation: Despite great success on traditional attacks, deep-learning-based methods perform poorly on 3D masks. The recently introduced biomedical technique rPPG (remote photoplethysmography) mitigates this problem to some extent, but it is sensitive to noise and requires at least one second (>25 frames) of observation, incurring high computational overhead.
Method: To address these challenges, we propose FASTEN (Flow-Attention-based Spatio-Temporal aggrEgation Network), a new 3D mask detection framework. The network is tailored to focus on fine-grained details in large movements, eliminating redundant spatio-temporal feature interference and quickly capturing the splicing traces of 3D masks in fewer frames. It contains three key modules: 1) a facial optical flow network that obtains non-RGB inter-frame flow information; 2) flow attention that assigns different significance to each frame; 3) spatio-temporal aggregation that combines high-level spatial features and temporal transition features.
Results: Extensive experiments show that FASTEN requires only five input frames and outperforms eight competitors on multiple detection metrics in both intra-dataset and cross-dataset evaluations. FASTEN has also been deployed on real-world mobile devices for practical 3D mask detection.

Anti-spoofing detection has become a necessity for face recognition systems due to the security threat posed by spoofing attacks. Despite great success on traditional attacks, most deep-learning-based methods perform poorly on 3D masks, which can closely simulate real faces in appearance and structure; they suffer from insufficient generalizability because they focus only on the spatial domain with single-frame input. This has been mitigated by the recent introduction of a biomedical technology called rPPG (remote photoplethysmography). However, rPPG-based methods are sensitive to noisy interference and require at least one second (> 25 frames) of observation time, which induces high computational overhead. To address these challenges, we propose a novel 3D mask detection framework, called FASTEN (Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the network to focus more on fine-grained details in large movements, which eliminates redundant spatio-temporal feature interference and quickly captures the splicing traces of 3D masks in fewer frames. Our proposed network contains three key modules: 1) a facial optical flow network to obtain non-RGB inter-frame flow information; 2) flow attention to assign different significance to each frame; 3) spatio-temporal aggregation to aggregate high-level spatial features and temporal transition features. Extensive experiments show that FASTEN requires only five frames of input and outperforms eight competitors in both intra-dataset and cross-dataset evaluations in terms of multiple detection metrics. Moreover, FASTEN has been deployed in real-world mobile devices for practical 3D mask detection.
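
The flow-attention module described above reduces, at its core, to learning a scalar significance per frame and pooling. A minimal sketch of this idea (not the authors' code; the shapes and the linear scoring head are illustrative assumptions):

```python
# A minimal sketch of flow attention: assign a scalar significance to each of
# T frames from their flow features, then pool with the resulting weights.
import torch
import torch.nn as nn

class FlowAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame significance logit

    def forward(self, flow_feats):             # (B, T, D) inter-frame flow features
        logits = self.score(flow_feats)        # (B, T, 1)
        attn = torch.softmax(logits, dim=1)    # weights over the T frames
        return (attn * flow_feats).sum(dim=1)  # (B, D) aggregated feature

pool = FlowAttentionPool(dim=128)
out = pool(torch.randn(2, 5, 128))  # five frames, as in FASTEN
```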

Reusable Slotwise Mechanisms
Trang Nguyen Amin Mansouri Kanika Madan Nguyen Duy Khuong Kartik Ahuja Dianbo Liu Yoshua Bengio



Research question: How to improve the robustness and generalization of agents in novel scenarios through effective scene representation and an understanding of the interaction mechanisms among object subsets.
Motivation: Existing scene representations rely mainly on object slots, which cannot effectively handle complex interactions that involve only a sparse subset of objects.
Method: Propose the Reusable Slotwise Mechanisms (RSM) framework, which models object dynamics through inter-slot communication and a modular architecture that dynamically selects reusable mechanisms for predicting the future state of each object slot. RSM also leverages Central Contextual Information (CCI), allowing selected mechanisms to access the remaining slots through a bottleneck in order to model higher-order, complex interactions.
Results: Experiments show that RSM outperforms existing methods on various future prediction and related downstream tasks, including visual question answering and action planning, and exhibits out-of-distribution generalization in intricate scenes.

Agents with the ability to comprehend and reason about the dynamics of objects would be expected to exhibit improved robustness and generalization in novel scenarios. However, achieving this capability necessitates not only an effective scene representation but also an understanding of the mechanisms governing interactions among object subsets. Recent studies have made significant progress in representing scenes using object slots. In this work, we introduce Reusable Slotwise Mechanisms, or RSM, a framework that models object dynamics by leveraging communication among slots along with a modular architecture capable of dynamically selecting reusable mechanisms for predicting the future states of each object slot. Crucially, RSM leverages the Central Contextual Information (CCI), enabling selected mechanisms to access the remaining slots through a bottleneck, effectively allowing for modeling of higher order and complex interactions that might require a sparse subset of objects. Experimental results demonstrate the superior performance of RSM compared to state-of-the-art methods across various future prediction and related downstream tasks, including Visual Question Answering and action planning. Furthermore, we showcase RSM’s Out-of-Distribution generalization ability to handle scenes in intricate scenarios.

Spatio-Angular Convolutions for Super-resolution in Diffusion MRI
Matthew Lyon Paul Armitage Mauricio A Álvarez



Research question: This paper proposes a novel angular super-resolution method for diffusion MRI (dMRI) built on the parametric continuous convolution (PCConv) framework.
Motivation: Acquiring high-resolution dMRI datasets requires long scan times. By exploiting the unique geometric structure of this domain, a new dMRI angular super-resolution method is proposed.
Method: Extend the PCConv framework with Fourier feature mapping, global coordinates, and domain-specific context, build a fully parametric continuous convolution network (PCCNN), and compare it against existing models.
Results: Experiments show that the PCCNN performs competitively while using significantly fewer parameters, and the formulation generalizes well to clinically relevant downstream analyses such as fixel-based analysis and neurite orientation dispersion and density imaging.

Diffusion MRI (dMRI) is a widely used imaging modality, but requires long scanning times to acquire high resolution datasets. By leveraging the unique geometry present within this domain, we present a novel approach to dMRI angular super-resolution that extends upon the parametric continuous convolution (PCConv) framework. We introduce several additions to the operation including a Fourier feature mapping, 'global' co-ordinates, and domain specific context. Using this framework, we build a fully parametric continuous convolution network (PCCNN) and compare against existing models. We demonstrate the PCCNN performs competitively while using significantly fewer parameters. Moreover, we show that this formulation generalises well to clinically relevant downstream analyses such as fixel-based analysis, and neurite orientation dispersion and density imaging.
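
Fourier feature mapping, one of the additions to the PCConv operation mentioned above, lifts continuous coordinates into a sinusoidal basis so the network can represent high-frequency structure. A short sketch under assumed log-spaced frequencies (the paper's exact parameterization may differ):

```python
# Sketch of a Fourier feature mapping for continuous coordinates; the frequency
# bank B here is a hypothetical log-spaced choice, not the paper's values.
import torch

def fourier_features(coords, num_freqs=8, scale=1.0):
    """coords: (N, d) continuous coordinates -> (N, 2 * num_freqs * d)."""
    B = scale * 2.0 ** torch.arange(num_freqs)  # log-spaced frequencies
    proj = coords[..., None] * B                # (N, d, num_freqs)
    proj = proj.flatten(-2)                     # (N, d * num_freqs)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

feats = fourier_features(torch.rand(1024, 3))   # e.g. gradient directions
```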

Mip-Grid: Anti-aliased Grid Representations for Neural Radiance Fields
Seungtae Nam Daniel Rho Jong Hwan Ko Eunbyung Park



Research question: Existing neural radiance field (NeRF) models suffer from aliasing, producing "jaggy" or blurry images when rendering 3D scenes and generating novel views.
Motivation: To address this, mip-Grid integrates anti-aliasing techniques into grid-based radiance field representations.
Method: mip-Grid uses a single shared grid representation and a single-sampling approach; it generates multiple grids by applying simple convolution operations and retrieves the appropriate features from the generated grids using scale-aware coordinates.
Results: Experiments show that mip-Grid greatly improves the rendering performance of two representative grid-based methods, TensoRF and K-Planes, and matches mip-NeRF on multi-scale datasets while training significantly faster.

Despite the remarkable achievements of neural radiance fields (NeRF) in representing 3D scenes and generating novel view images, the aliasing issue, rendering 'jaggies' or 'blurry' images at varying camera distances, remains unresolved in most existing approaches. The recently proposed mip-NeRF has effectively addressed this challenge by introducing integrated positional encodings (IPE). However, it relies on MLP architecture to represent the radiance fields, missing out on the fast training speed offered by the latest grid-based methods. In this work, we present mip-Grid, a novel approach that integrates anti-aliasing techniques into grid-based representations for radiance fields, mitigating the aliasing artifacts while enjoying fast training time. Notably, the proposed method uses a single-scale shared grid representation and a single-sampling approach, which only introduces minimal additions to the model parameters and computational costs. To handle scale ambiguity, mip-Grid generates multiple grids by applying simple convolution operations over the shared grid and uses the scale-aware coordinate to retrieve the appropriate features from the generated multiple grids. To test the effectiveness, we incorporated the proposed approach into the two recent representative grid-based methods, TensoRF and K-Planes. The experimental results demonstrated that mip-Grid greatly improved the rendering performance of both methods and showed comparable performance to mip-NeRF on multi-scale datasets while achieving significantly faster training time.

DreamSparse: Escaping from Plato’s Cave with 2D Diffusion Model Given Sparse Views
Paul Yoo Jiaxian Guo Yutaka Matsuo Shixiang Shane Gu



Research question: How to synthesize novel-view images from only a few views.
Motivation: In few-view settings, existing methods often struggle to produce high-quality results or require per-object optimization because the available information is insufficient.
Method: We explore leveraging the strong 2D priors of pre-trained diffusion models to synthesize novel-view images, and propose the DreamSparse framework, which enables a pre-trained diffusion model to generate geometry- and identity-consistent novel views.
Results: Experiments show that our framework effectively synthesizes novel-view images from sparse views and outperforms baselines on both trained and open-set category images.

Synthesizing novel view images from a few views is a challenging but practical problem. Existing methods often struggle with producing high-quality results or necessitate per-object optimization in such few-view settings due to the insufficient information provided. In this work, we explore leveraging the strong 2D priors in pre-trained diffusion models for synthesizing novel view images. 2D diffusion models, nevertheless, lack 3D awareness, leading to distorted image synthesis and compromising the identity. To address these problems, we propose $\textit{DreamSparse}$, a framework that enables the frozen pre-trained diffusion model to generate geometry and identity-consistent novel view images. Specifically, DreamSparse incorporates a geometry module designed to capture features about spatial information from sparse views as a 3D prior. Subsequently, a spatial guidance model is introduced to convert rendered feature maps as spatial information for the generative process. This information is then used to guide the pre-trained diffusion model to encourage the synthesis of geometrically consistent images without further tuning. Leveraging the strong image priors in the pre-trained diffusion models, DreamSparse is capable of synthesizing high-quality novel views for both object and object-centric scene-level images and generalising to open-set images. Experimental results demonstrate that our framework can effectively synthesize novel view images from sparse views and outperforms baselines in both trained and open-set category images. More results can be found on our project page: https://sites.google.com/view/dreamsparse-webpage.

Object-centric Learning with Cyclic Walks between Parts and Whole
Ziyu Wang Mike Zheng Shou Mengmi Zhang



Research question: How to learn object-centric representations from complex natural environments, enabling both humans and machines to reason from low-level perceptual features.
Motivation: Current models often overlook the compositionality of object entities in complex scenes and the correspondence between visual features and object entities.
Method: Propose cyclic walks between perceptual features extracted by vision transformers and object entities: a slot-attention module establishes correspondence between perceptual features and slot-bound object representations, and the interactions between parts and whole form cycle consistencies that serve as supervisory signals to train the slot-attention module.
Results: Experiments show that networks trained with cyclic walks can separate foreground from background, discover objects, and segment semantic objects in complex scenes. Compared with object-centric models that attach a decoder for pixel- or feature-level reconstruction, the method provides strong learning signals, avoids computational overhead, and improves memory efficiency.

Learning object-centric representations from complex natural environments enables both humans and machines with reasoning abilities from low-level perceptual features. To capture compositional entities of the scene, we proposed cyclic walks between perceptual features extracted from vision transformers and object entities. First, a slot-attention module interfaces with these perceptual features and produces a finite set of slot representations. These slots can bind to any object entities in the scene via inter-slot competitions for attention. Next, we establish entity-feature correspondence with cyclic walks along high transition probability based on the pairwise similarity between perceptual features (aka "parts") and slot-binded object representations (aka "whole"). The whole is greater than its parts and the parts constitute the whole. The part-whole interactions form cycle consistencies, as supervisory signals, to train the slot-attention module. Our rigorous experiments on \textit{seven} image datasets in \textit{three} \textit{unsupervised} tasks demonstrate that the networks trained with our cyclic walks can disentangle foregrounds and backgrounds, discover objects, and segment semantic objects in complex scenes. In contrast to object-centric models attached with a decoder for the pixel-level or feature-level reconstructions, our cyclic walks provide strong learning signals, avoiding computation overheads and enhancing memory efficiency. Our source code and data are available at: \href{https://github.com/ZhangLab-DeepNeuroCogLab/Parts-Whole-Object-Centric-Learning/}{link}.
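
The cyclic walk itself can be written in a few lines: a part-to-slot transition matrix composed with a slot-to-part transition matrix should approximate the identity. A minimal sketch, with illustrative shapes and temperature (the `cyclic_walk_loss` name is ours, not the authors'):

```python
# Sketch of the parts-whole cyclic walk: transition probabilities from ViT
# patch features ("parts") to slots ("whole") and back should form a cycle
# that returns to the starting part; cross-entropy to the identity supervises
# slot attention.
import torch
import torch.nn.functional as F

def cyclic_walk_loss(parts, slots, tau=0.1):
    """parts: (N, D) patch features; slots: (K, D) slot representations."""
    parts = F.normalize(parts, dim=-1)
    slots = F.normalize(slots, dim=-1)
    p2s = torch.softmax(parts @ slots.t() / tau, dim=-1)  # (N, K) part -> slot
    s2p = torch.softmax(slots @ parts.t() / tau, dim=-1)  # (K, N) slot -> part
    roundtrip = p2s @ s2p                                 # (N, N) part -> part
    target = torch.arange(parts.size(0))                  # cycle should return home
    return F.nll_loss(torch.log(roundtrip + 1e-8), target)
```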

Injecting Multimodal Information into Rigid Protein Docking via Bi-level Optimization
Ruijia Wang YiWu Sun Yujie Luo Shaochuan Li Cheng Yang Xingyi Cheng Hui Li Chuan Shi Le Song



Research question: This paper addresses protein-protein complex structure prediction, i.e., predicting the 3D structure of a complex from its unbound states.
Motivation: Existing docking methods typically use only single-modal information (such as sequence or structure), leading to suboptimal predictions.
Method: This paper proposes xTrimoBiDock, a new model that effectively integrates sequence- and structure-modal information through bi-level optimization. Specifically, a cross-modal transformer combines multimodal information to predict an inter-protein distance map, and a roto-translation transformation is then optimized to align the docked pose with the predicted distance map.
Results: Experiments show that, compared with baselines, BiDock achieves up to a 234% relative improvement on the challenging antibody-antigen docking problem.

The structure of protein-protein complexes is critical for understanding binding dynamics, biological mechanisms, and intervention strategies. Rigid protein docking, a fundamental problem in this field, aims to predict the 3D structure of complexes from their unbound states without conformational changes. In this scenario, we have access to two types of valuable information: sequence-modal information, such as coevolutionary data obtained from multiple sequence alignments, and structure-modal information, including the 3D conformations of rigid structures. However, existing docking methods typically utilize single-modal information, resulting in suboptimal predictions. In this paper, we propose xTrimoBiDock (or BiDock for short), a novel rigid docking model that effectively integrates sequence- and structure-modal information through bi-level optimization. Specifically, a cross-modal transformer combines multimodal information to predict an inter-protein distance map. To achieve rigid docking, the roto-translation transformation is optimized to align the docked pose with the predicted distance map. In order to tackle this bi-level optimization problem, we unroll the gradient descent of the inner loop and further derive a better initialization for the roto-translation transformation based on spectral estimation. Compared to baselines, BiDock achieves up to a 234% relative improvement on the challenging antibody-antigen docking problem.

Template-free Articulated Neural Point Clouds for Reposable View Synthesis
Lukas Uzolas Elmar Eisemann Petr Kellnhofer



Research question: This paper addresses the difficulties of existing dynamic neural radiance fields (NeRFs) in synthesizing novel views of 3D scenes: reanimating captured object poses is hard, and models suffer from low visual fidelity, long reconstruction times, or narrow application domains.
Motivation: Current dynamic models usually rely on backward deformation fields, which makes reanimating captured object poses challenging. Moreover, state-of-the-art dynamic models are often limited by low visual fidelity, long reconstruction time, or specificity to narrow application domains.
Method: This paper presents a novel method that uses a point-based representation and Linear Blend Skinning (LBS) to jointly learn a dynamic NeRF and an associated skeletal model from sparse multi-view video. The forward-warping approach achieves state-of-the-art visual fidelity when synthesizing novel views and poses while significantly reducing the required learning time compared with existing work.
Results: We demonstrate the versatility of our representation on a variety of common datasets and obtain reposable 3D reconstructions without the need for object-specific skeletal templates.

Dynamic Neural Radiance Fields (NeRFs) achieve remarkable visual quality when synthesizing novel views of time-evolving 3D scenes. However, the common reliance on backward deformation fields makes reanimation of the captured object poses challenging. Moreover, state-of-the-art dynamic models are often limited by low visual fidelity, long reconstruction time or specificity to narrow application domains. In this paper, we present a novel method utilizing a point-based representation and Linear Blend Skinning (LBS) to jointly learn a Dynamic NeRF and an associated skeletal model from even sparse multi-view video. Our forward-warping approach achieves state-of-the-art visual fidelity when synthesizing novel views and poses while significantly reducing the necessary learning time when compared to existing work. We demonstrate the versatility of our representation on a variety of articulated objects from common datasets and obtain reposable 3D reconstructions without the need of object-specific skeletal templates.
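
Linear Blend Skinning, the deformation model the method pairs with its point-based representation, is standard: each point is moved by a weighted blend of per-bone rigid transforms. A minimal NumPy sketch:

```python
# Linear Blend Skinning: deform each point by a convex combination of
# per-bone rigid transforms, weighted by learned skinning weights.
import numpy as np

def linear_blend_skinning(points, weights, rotations, translations):
    """points: (N, 3); weights: (N, J) convex skinning weights;
    rotations: (J, 3, 3); translations: (J, 3). Returns deformed (N, 3)."""
    # Rigidly transform every point by every bone: (J, N, 3)
    per_bone = np.einsum('jab,nb->jna', rotations, points) + translations[:, None, :]
    # Blend with skinning weights: sum_j w[n, j] * per_bone[j, n]
    return np.einsum('nj,jna->na', weights, per_bone)
```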

Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction
Zechuan Zhang Li Sun Zongxin Yang Ling Chen Yi Yang



Research question: How to reconstruct a clothed 3D human body model from a single image, especially under complex poses and loose clothing.
Motivation: Current methods exhibit limited performance, largely because they depend on insufficient 2D image features and inconsistent query methods.
Method: Propose the Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), a novel transformer-based architecture. It uses a Vision Transformer model as an encoder to capture globally correlated image features, then decouples tri-plane features through cross-plane generation with learnable embeddings as queries.
Results: Comprehensive experiments on the CAPE and THuman2.0 datasets show that the method outperforms state-of-the-art approaches in both geometry and texture reconstruction, is highly robust to challenging poses and loose clothing, and produces higher-resolution textures.

Reconstructing 3D clothed human avatars from single images is a challenging task, especially when encountering complex poses and loose clothing. Current methods exhibit limitations in performance, largely attributable to their dependence on insufficient 2D image features and inconsistent query methods. Owing to this, we present the Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), a novel transformer-based architecture that reconstructs clothed human avatars from monocular images. Our approach leverages transformer architectures by utilizing a Vision Transformer model as an encoder for capturing global-correlated image features. Subsequently, our innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane features, using learnable embeddings as queries for cross-plane generation. To effectively enhance feature fusion with the tri-plane 3D feature and human body prior, we propose a hybrid prior fusion strategy combining spatial and prior-enhanced queries, leveraging the benefits of spatial localization and human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0 datasets illustrate that our method outperforms state-of-the-art approaches in both geometry and texture reconstruction, exhibiting high robustness to challenging poses and loose clothing, and producing higher-resolution textures. Codes are available at https://github.com/River-Zhang/GTA.

Towards Robust and Expressive Whole-body Human Pose and Shape Estimation
Hui En Pang Zhongang Cai Lei Yang Qingyi Tao Zhonghua Wu Tianwei Zhang Ziwei Liu



Research question: Whole-body pose and shape estimation aims to jointly predict the different behaviors (e.g., pose, hand gesture, facial expression) of the entire human body from a monocular image. Existing methods often perform poorly because of the complexity of in-the-wild scenarios.
Motivation: The prediction accuracy of these models is significantly affected by the quality of the bounding box (e.g., scale, alignment). The natural discrepancy between ideal bounding box annotations and model detection results is particularly detrimental to whole-body pose and shape estimation.
Method: This paper proposes a new framework to enhance the robustness of whole-body pose and shape estimation, with three new modules addressing the above challenges from three perspectives: (1) a Localization Module enhances the model's awareness of the subject's location and semantics within the image space; (2) a Contrastive Feature Extraction Module encourages the model to be invariant to robust augmentations by incorporating a contrastive loss and positive samples; (3) a Pixel Alignment Module ensures that the mesh reprojected from the predicted camera and body model parameters is more accurate and pixel-aligned.
Results: We conduct comprehensive experiments demonstrating the effectiveness of the proposed framework on body, hand, face, and whole-body benchmarks.

Whole-body pose and shape estimation aims to jointly predict different behaviors (e.g., pose, hand gesture, facial expression) of the entire human body from a monocular image. Existing methods often exhibit suboptimal performance due to the complexity of in-the-wild scenarios. We argue that the prediction accuracy of these models is significantly affected by the quality of the _bounding box_, e.g., scale, alignment. The natural discrepancy between the ideal bounding box annotations and model detection results is particularly detrimental to the performance of whole-body pose and shape estimation. In this paper, we propose a novel framework to enhance the robustness of whole-body pose and shape estimation. Our framework incorporates three new modules to address the above challenges from three perspectives: (1) a **Localization Module** enhances the model's awareness of the subject's location and semantics within the image space; (2) a **Contrastive Feature Extraction Module** encourages the model to be invariant to robust augmentations by incorporating a contrastive loss and positive samples; (3) a **Pixel Alignment Module** ensures the reprojected mesh from the predicted camera and body model parameters are more accurate and pixel-aligned. We perform comprehensive experiments to demonstrate the effectiveness of our proposed framework on body, hands, face and whole-body benchmarks.

Contrastive Training of Complex-Valued Autoencoders for Object Discovery
Aleksandar Stanić Anand Gopalakrishnan Kazuki Irie Jürgen Schmidhuber



Research question: How to improve existing object-centric models so that they bind objects better and handle more complex tasks.
Motivation: Current models have several conceptual limitations, such as a hardwired number of slots, equal capacity for all slots, high training cost, and no object-level relational factors within slots.
Method: Introduce architectural modifications and a novel contrastive learning method that greatly improve the state-of-the-art synchrony-based model. This yields, for the first time, synchrony-based models that can discover objects in multi-object color datasets and simultaneously represent more than three objects.
Results: Experiments show that the approach significantly improves model performance, enabling it to handle more complex tasks.

Current state-of-the-art object-centric models use slots and attention-based routing for binding. However, this class of models has several conceptual limitations: the number of slots is hardwired; all slots have equal capacity; training has high computational cost; there are no object-level relational factors within slots. Synchrony-based models in principle can address these limitations by using complex-valued activations which store binding information in their phase components. However, working examples of such synchrony-based models have been developed only very recently, and are still limited to toy grayscale datasets and simultaneous storage of less than three objects in practice. Here we introduce architectural modifications and a novel contrastive learning method that greatly improve the state-of-the-art synchrony-based model. For the first time, we obtain a class of synchrony-based models capable of discovering objects in an unsupervised manner in multi-object color datasets and simultaneously representing more than three objects.

ViCA-NeRF: View-Consistency-Aware 3D Editing of Neural Radiance Fields
Jiahua Dong Yu-Xiong Wang



Research question: This paper proposes ViCA-NeRF, a text-instruction-driven 3D editing method that achieves multi-view consistency.
Motivation: Current approaches to 3D editing pay insufficient attention to multi-view consistency.
Method: The method uses depth information and learned regularization to ensure consistency across views, and refines the scene's appearance through two training stages.
Results: Experiments show that ViCA-NeRF provides more flexible and efficient editing with higher consistency and more detail than the state of the art.

We introduce ViCA-NeRF, a view-consistency-aware method for 3D editing with text instructions. In addition to the implicit NeRF modeling, our key insight is to exploit two sources of regularization that {\em explicitly} propagate the editing information across different views, thus ensuring multi-view consistency. As {\em geometric regularization}, we leverage the depth information derived from the NeRF model to establish image correspondence between different views. As {\em learned regularization}, we align the latent codes in the 2D diffusion model between edited and unedited images, enabling us to edit key views and propagate the update to the whole scene. Incorporating these two regularizations, our ViCA-NeRF framework consists of two stages. In the initial stage, we blend edits from different views to create a preliminary 3D edit. This is followed by a second stage of NeRF training that is dedicated to further refining the scene's appearance. Experiments demonstrate that ViCA-NeRF provides more flexible, efficient editing with higher levels of consistency and details, compared with the state of the art.

MomentDiff: Generative Video Moment Retrieval from Random to Real
Pandeng Li Chen-Wei Xie Hongtao Xie Liming Zhao Lei Zhang Yun Zheng Deli Zhao Yongdong Zhang



Research question: Video moment retrieval aims to identify the specific temporal segment in an untrimmed video that corresponds to a given language description.
Motivation: Existing methods struggle with random initialization and dataset location biases, so a more efficient and generalized solution is needed.
Method: Propose MomentDiff, a generative diffusion framework that simulates the human retrieval process from random browsing to gradual localization. By diffusing the real span to random noise and learning to denoise under the guidance of text-video similarity, the model learns to map arbitrary random locations to real moments, enabling localization from randomly initialized spans.
Results: Experiments show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks and exhibits better generalization and robustness on the proposed anti-bias datasets.

Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff could sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., based on learnable proposals or queries), MomentDiff with random initialized spans could resist the temporal location biases from datasets. To evaluate the influence of the temporal location biases, we propose two ``anti-bias'' datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets will be released publicly.
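
The diffusion of spans to noise follows the usual forward process, applied here to a normalized (center, width) parameterization. A sketch with an illustrative noise schedule (the paper's schedule and span encoding may differ):

```python
# Sketch of the forward process MomentDiff-style training would use: diffuse a
# ground-truth span toward Gaussian noise; the model learns the reverse step
# conditioned on text-video similarity. Schedule values are illustrative.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffuse_span(span, t):
    """span: (B, 2) normalized (center, width) in [0, 1]; t: (B,) timesteps."""
    noise = torch.randn_like(span)
    a = alpha_bar[t].unsqueeze(-1)
    noisy = a.sqrt() * span + (1.0 - a).sqrt() * noise
    return noisy, noise  # the model is trained to predict `noise` (or the span)
```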

Prototypical Variational Autoencoder for 3D Few-shot Object Detection
Weiliang Tang Biqi YANG Xianzhi Li Yun-Hui Liu Pheng-Ann Heng Chi-Wing Fu



Research question: How to detect objects in 3D point clouds using only limited annotated samples.
Motivation: With few annotated samples, the detection performance of existing methods is often limited by the quality of the latent features.
Method: Design a VAE-based prototype learning scheme, called prototypical VAE (P-VAE), to enhance the diversity and distinctiveness of sampled features. The network encodes a multi-center, GMM-like posterior in which each distribution is centered at a prototype. For regularization, P-VAE incorporates a reconstruction task that preserves geometric information.
Results: Experiments show that the approach outperforms the state of the art on two FS3D benchmarks. Quantitative ablations and qualitative prototype analysis further demonstrate that the probabilistic modeling significantly boosts prototype learning for FS3D.

Few-Shot 3D Point Cloud Object Detection (FS3D) is a challenging task, aiming to detect 3D objects of novel classes using only limited annotated samples for training. Considering that the detection performance highly relies on the quality of the latent features, we design a VAE-based prototype learning scheme, named prototypical VAE (P-VAE), to learn a probabilistic latent space for enhancing the diversity and distinctiveness of the sampled features. The network encodes a multi-center GMM-like posterior, in which each distribution centers at a prototype. For regularization, P-VAE incorporates a reconstruction task to preserve geometric information. To adopt P-VAE for the detection framework, we formulate Geometric-informative Prototypical VAE (GP-VAE) to handle varying geometric components and Class-specific Prototypical VAE (CP-VAE) to handle varying object categories. In the first stage, we harness GP-VAE to aid feature extraction from the input scene. In the second stage, we cluster the geometric-informative features into per-instance features and use CP-VAE to refine each instance feature with category-level guidance. Experimental results show the top performance of our approach over the state of the arts on two FS3D benchmarks. Quantitative ablations and qualitative prototype analysis further demonstrate that our probabilistic modeling can significantly boost prototype learning for FS3D.

$p$-Poisson surface reconstruction in curl-free flow from point clouds
Yesom Park Taekyung Lee Jooyoung Hahn Myungjoo Kang



Research question: This paper aims to reconstruct a smooth surface from an unorganized point cloud while preserving geometric shapes, without relying on any additional information.
Motivation: The reconstruction quality of existing methods depends on ground-truth implicit function values or surface normal vectors, whereas the new method can robustly reconstruct high-quality surfaces by solving a partial differential equation and exploiting fundamental properties of differential vector fields.
Method: Cast the p-Poisson equation to learn a signed distance function (SDF), with the reconstructed surface implicitly represented by the zero-level set of the SDF. For efficient training, develop a variable-splitting structure that introduces the gradient of the SDF as an auxiliary variable and imposes the p-Poisson equation directly on the auxiliary variable as a hard constraint. Based on the curl-free property of gradient fields, a curl-free constraint is imposed on the auxiliary variable, leading to a more faithful reconstruction.
Results: Experiments on standard benchmark datasets show that the proposed implicit neural representation provides superior and robust reconstruction. The code is available at https://github.com/Yebbi/PINC.

The aim of this paper is the reconstruction of a smooth surface from an unorganized point cloud sampled by a closed surface, with the preservation of geometric shapes, without any further information other than the point cloud. Implicit neural representations (INRs) have recently emerged as a promising approach to surface reconstruction. However, the reconstruction quality of existing methods relies on ground truth implicit function values or surface normal vectors. In this paper, we show that proper supervision of partial differential equations and fundamental properties of differential vector fields are sufficient to robustly reconstruct high-quality surfaces. We cast the $p$-Poisson equation to learn a signed distance function (SDF) and the reconstructed surface is implicitly represented by the zero-level set of the SDF. For efficient training, we develop a variable splitting structure by introducing a gradient of the SDF as an auxiliary variable and impose the $p$-Poisson equation directly on the auxiliary variable as a hard constraint. Based on the curl-free property of the gradient field, we impose a curl-free constraint on the auxiliary variable, which leads to a more faithful reconstruction. Experiments on standard benchmark datasets show that the proposed INR provides a superior and robust reconstruction. The code is available at https://github.com/Yebbi/PINC.
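
In standard $p$-Laplacian notation, the constraints described above can be stated compactly. This is a schematic summary under common conventions (for instance, the solution of the $p$-Poisson problem with unit right-hand side approaches the distance function as $p \to \infty$); the paper's exact weighting and boundary terms may differ:

```latex
% Schematic constraints: u is the signed distance function, G the auxiliary
% variable standing in for its gradient.
\begin{aligned}
  G &= \nabla u, \\
  \nabla \cdot \big( \lVert G \rVert^{\,p-2} G \big) &= -1
    && \text{($p$-Poisson equation, imposed on $G$ as a hard constraint),} \\
  \nabla \times G &= 0
    && \text{(curl-free constraint: $G$ must be a gradient field),} \\
  u &= 0 \ \text{on the point cloud}
    && \text{(the surface is the zero-level set of $u$).}
\end{aligned}
```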

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
Junkun Yuan Xinyu Zhang Hao Zhou Jian Wang Zhongwei Qiu Zhiyin Shao Shaofeng Zhang Sifan Long Kun Kuang Kun Yao Junyu Han Errui Ding Lanfen Lin Fei Wu Jingdong Wang



Research question: This paper investigates the importance of model pre-training for human-centric perception tasks and proposes a new pre-training method.
Motivation: Revisiting the training strategy of masked image modeling (MIM) reveals that human structure priors hold great potential. Motivated by this, we further incorporate an intuitive human structure prior, human body parts, into pre-training.
Method: Specifically, we use this prior to guide the mask sampling process: image patches corresponding to human part regions have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces close alignment, guided by the human part prior, between different masked views of the same image. We call the entire method HAP.
Results: Using only a plain ViT as the encoder, HAP establishes new state-of-the-art performance on 11 human-centric benchmarks and an on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.

Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Specifically, we employ this prior to guide the mask sampling process. Image patches, corresponding to human part regions, have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image. We term the entire method as HAP. HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.
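
The part-guided mask sampling amounts to biasing the masking distribution toward patches inside body-part regions. An illustrative sketch (the `part_boost` factor is a hypothetical knob, not a value from the paper):

```python
# Illustrative sketch of part-prior-guided mask sampling: patches overlapping
# human-part regions get a higher probability of being masked out.
import numpy as np

def sample_mask(num_patches, part_patch_ids, mask_ratio=0.75, part_boost=3.0):
    """Returns indices of patches to mask, biased toward human-part patches."""
    weights = np.ones(num_patches)
    weights[part_patch_ids] *= part_boost  # prioritize body-part patches
    probs = weights / weights.sum()
    num_masked = int(mask_ratio * num_patches)
    return np.random.choice(num_patches, size=num_masked, replace=False, p=probs)

masked = sample_mask(196, part_patch_ids=np.arange(40, 120))  # 14x14 ViT patches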

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
Rongkun Zheng Lu Qi Xi Chen Yi Wang Kun Wang Yu Qiao Hengshuang Zhao



Research question: How to leverage multiple datasets to improve video instance segmentation while addressing the difficulty of scaling up annotated datasets.
Motivation: Because of the high labor cost of annotation, what we possess are numerous isolated, field-specific datasets, making it appealing to jointly train models across aggregated datasets to increase data volume and diversity. However, due to the heterogeneity of category spaces, naively combining multiple datasets dilutes the model's attention across different taxonomies. It is therefore important to enlarge the data scale and enrich the taxonomy space while improving classification precision.
Method: We propose Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS), which uses a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder.
Results: We conduct extensive experimental evaluations on four popular and challenging benchmarks: YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model achieves significant improvements on all of them and sets new state-of-the-art records, demonstrating that the proposed approach is effective and general.

Training on large-scale datasets can boost the performance of video instance segmentation, while annotated datasets for VIS are hard to scale up due to the high labor cost. What we possess are numerous isolated, field-specific datasets; thus, it is appealing to jointly train models across an aggregation of datasets to enhance data volume and diversity. However, due to the heterogeneity in category space, as mask precision increases with data volume, simply utilizing multiple datasets dilutes the models' attention across different taxonomies. Thus, increasing the data scale and enriching the taxonomy space while improving classification precision is important. In this work, we show that providing extra taxonomy information can help models concentrate on a specific taxonomy, and propose our model named Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this vital challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all these benchmarks. These appealing and encouraging results demonstrate the effectiveness and generality of our proposed approach. The code and trained models will be publicly available.

Reducing Shape-Radiance Ambiguity in Radiance Fields with a Closed-Form Color Estimation Method
Qihang Fang Yafei Song Keqiang Li Liefeng Bo



Research question: Existing neural radiance field (NeRF) models suffer from the shape-radiance ambiguity during training, i.e., they cannot correctly decouple the shape and radiance of a scene.
Motivation: To address this issue, this paper proposes a more adaptive method for reducing the shape-radiance ambiguity.
Method: The key is a rendering method based only on the density field. We first estimate the color field from the density field and the posed images in closed form, then proceed with NeRF's rendering process. We also address the problems that arise when estimating the color field, including occlusion and non-uniformly distributed views. Finally, the method is applied to regularize NeRF's density field.
Results: Experiments show that our method improves NeRF's density field both qualitatively and quantitatively.

Neural radiance field (NeRF) enables the synthesis of cutting-edge realistic novel view images of a 3D scene. It includes density and color fields to model the shape and radiance of a scene, respectively. Supervised by the photometric loss in an end-to-end training manner, NeRF inherently suffers from the shape-radiance ambiguity problem, i.e., it can perfectly fit training views but does not guarantee decoupling the two fields correctly. To deal with this issue, existing works have incorporated prior knowledge to provide an independent supervision signal for the density field, including total variation loss, sparsity loss, distortion loss, etc. These losses are based on general assumptions about the density field, e.g., it should be smooth, sparse, or compact, which are not adaptive to a specific scene. In this paper, we propose a more adaptive method to reduce the shape-radiance ambiguity. The key is a rendering method that is only based on the density field. Specifically, we first estimate the color field based on the density field and posed images in a closed form. Then NeRF's rendering process can proceed. We address the problems in estimating the color field, including occlusion and non-uniformly distributed views. Afterward, it is applied to regularize NeRF's density field. As our regularization is guided by photometric loss, it is more adaptive compared to existing ones. Experimental results show that our method improves the density field of NeRF both qualitatively and quantitatively. Our code is available at https://github.com/qihangGH/Closed-form-color-field.
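
One simple estimator in the spirit of this closed form is a rendering-weight-weighted average of the pixel colors observing each sample point; the paper's exact closed form, and its handling of occlusion and non-uniformly distributed views, may well differ. A toy sketch:

```python
# Toy sketch: with the density field fixed, estimate each sample point's color
# as an average of the observing rays' pixel colors, weighted by the rendering
# weights derived from the density field. Occlusion handling is omitted.
import numpy as np

def closed_form_colors(render_weights, pixel_colors, eps=1e-8):
    """render_weights: (R, K) weight of sample k on ray r; pixel_colors:
    (R, 3) observed colors. Returns (K, 3) estimated point colors."""
    num = render_weights.T @ pixel_colors               # (K, 3)
    den = render_weights.sum(axis=0, keepdims=True).T   # (K, 1)
    return num / (den + eps)
```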

Neural-Logic Human-Object Interaction Detection
Liulei Li Jianan Wei Wenguan Wang Yi Yang



Research question: Existing Transformer-based HOI detectors typically accept pre-composed human-object pairs as input and cannot explore novel combinations of entities during decoding.
Motivation: This paper presents LogicHOI, a new HOI detector that leverages neural-logic reasoning and Transformers to infer feasible interactions between entities.
Method: Specifically, we modify the self-attention mechanism of the vanilla Transformer so that it can reason over ⟨human, action, object⟩ triplets and compose novel interactions. This reasoning process is guided by two key properties for understanding HOI: affordances (the potential actions an object can facilitate) and proxemics (the spatial relations between humans and objects). We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process, improving performance and zero-shot generalization.
Results: We evaluate LogicHOI on V-COCO and HICO-DET under both normal and zero-shot setups, achieving significant improvements over existing methods.

The interaction decoder utilized in prevalent Transformer-based HOI detectors typically accepts pre-composed human-object pairs as inputs. Though achieving remarkable performance, such a paradigm lacks feasibility and cannot explore novel combinations over entities during decoding. We present LogicHOI, a new HOI detector that leverages neural-logic reasoning and Transformer to infer feasible interactions between entities. Specifically, we modify the self-attention mechanism in the vanilla Transformer, enabling it to reason over the ⟨human, action, object⟩ triplet and constitute novel interactions. Meanwhile, such a reasoning process is guided by two crucial properties for understanding HOI: affordances (the potential actions an object can facilitate) and proxemics (the spatial relations between humans and objects). We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities. We evaluate LogicHOI on V-COCO and HICO-DET under both normal and zero-shot setups, achieving significant improvements over existing methods.

Binary Radiance Fields
Seungjoo Shin Jaesik Park



Research question: Propose a storage-efficient binary radiance field (BiRF) representation that encodes local features with binary encoding parameters in the format of either $+1$ or $-1$.
Motivation: Existing radiance field representations require large amounts of storage; a binarization strategy and a 2D-3D hybrid feature grid design can reduce storage requirements and improve representation efficiency.
Method: Encode local features with binary parameters of $+1$ or $-1$, producing a highly compact feature encoding that dramatically reduces storage size. The 2D-3D hybrid feature grid design further compacts the encoding, with the 3D grid holding the main components and the 2D grids capturing details.
Results: Experiments show that the binary radiance field representation outperforms state-of-the-art efficient radiance field models in reconstruction performance with lower storage allocation. In particular, for static scene reconstruction, the model achieves a PSNR of 32.03 dB on Synthetic-NeRF scenes, 34.48 dB on Synthetic-NSVF scenes, and 28.20 dB on Tanks and Temples scenes while using only 0.5 MB of storage. We hope the proposed binary radiance field representation makes radiance fields more accessible without a storage bottleneck.

In this paper, we propose binary radiance fields (BiRF), a storage-efficient radiance field representation employing binary feature encoding that encodes local features using binary encoding parameters in a format of either $+1$ or $-1$. This binarization strategy lets us represent the feature grid with highly compact feature encoding and a dramatic reduction in storage size. Furthermore, our 2D-3D hybrid feature grid design enhances the compactness of feature encoding as the 3D grid includes main components while 2D grids capture details. In our experiments, binary radiance field representation successfully outperforms the reconstruction performance of state-of-the-art (SOTA) efficient radiance field models with lower storage allocation. In particular, our model achieves impressive results in static scene reconstruction, with a PSNR of 32.03 dB for Synthetic-NeRF scenes, 34.48 dB for Synthetic-NSVF scenes, and 28.20 dB for Tanks and Temples scenes, while utilizing only 0.5 MB of storage space. We hope the proposed binary radiance field representation will make radiance fields more accessible without a storage bottleneck.
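
Training $\{-1, +1\}$ parameters typically relies on a straight-through estimator: binarize in the forward pass, pass (clipped) gradients in the backward pass. A sketch of this standard trick (BiRF's exact scheme may differ):

```python
# Sketch of binary feature encoding with a straight-through estimator (STE),
# the standard trick for optimizing {-1, +1} parameters with gradient descent.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # clipped identity gradient

real_grid = torch.randn(32, 128, 128, requires_grad=True)  # latent real values
binary_grid = BinarizeSTE.apply(real_grid)  # storable at 1 bit per parameter
```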

UE4-NeRF: Neural Radiance Field for Real-Time Rendering of Large-Scale Scene
Jiaming Gu Minchao Jiang Hongsheng Li Xiaoyuan Lu Guangming Zhu Syed Afaq Ali Shah Liang Zhang Mohammed Bennamoun



Research question: This paper addresses the performance limitations of Neural Radiance Fields (NeRF) in real-time rendering of large-scale scenes.
Motivation: Although current NeRF methods can reconstruct 3D scenes from photographs, they remain significantly limited in real-time rendering of large-scale scenes.
Method: This paper proposes UE4-NeRF, a novel neural rendering system designed for real-time rendering of large-scale scenes. Each large scene is partitioned into different sub-NeRFs, and polygonal meshes are initialized by constructing multiple regular octahedra within the scene, with vertices continuously optimized during training. Drawing on Level of Detail (LOD) techniques, meshes of varying levels of detail are trained for different observation levels.
Results: Combined with the rasterization pipeline of Unreal Engine 4 (UE4), the method achieves real-time rendering of large-scale scenes at 4K resolution with frame rates up to 43 FPS. Experiments also show that the rendering quality is comparable to state-of-the-art approaches.

Neural Radiance Fields (NeRF) is a novel implicit 3D reconstruction method that shows immense potential and has been gaining increasing attention. It enables the reconstruction of 3D scenes solely from a set of photographs. However, its real-time rendering capability, especially for interactive real-time rendering of large-scale scenes, still has significant limitations. To address these challenges, in this paper, we propose a novel neural rendering system called UE4-NeRF, specifically designed for real-time rendering of large-scale scenes. We partitioned each large scene into different sub-NeRFs. In order to represent the partitioned independent scene, we initialize polygonal meshes by constructing multiple regular octahedra within the scene and the vertices of the polygonal faces are continuously optimized during the training process. Drawing inspiration from Level of Detail (LOD) techniques, we trained meshes of varying levels of detail for different observation levels. Our approach combines with the rasterization pipeline in Unreal Engine 4 (UE4), achieving real-time rendering of large-scale scenes at 4K resolution with a frame rate of up to 43 FPS. Rendering within UE4 also facilitates scene editing in subsequent stages. Furthermore, through experiments, we have demonstrated that our method achieves rendering quality comparable to state-of-the-art approaches. Project page: https://jamchaos.github.io/UE4-NeRF/.

Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu Jaemin Cho Prateek Yadav Mohit Bansal



Research question: How to effectively use pre-trained image-language models for video question answering without missing important visual cues when only part of the video input is relevant to the language query.
Motivation: Current image-language models for video question answering typically take uniformly sampled video frames as visual input, without explicit language-aware temporal modeling. When only part of a video is relevant to the query, uniform frame sampling often misses important visual cues.
Method: Propose the Self-Chained Video Localization-Answering (SeViLA) framework, which uses a single image-language model (BLIP-2) to tackle both temporal keyframe localization and question answering on videos. SeViLA consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
Results: Experiments show that SeViLA outperforms several strong baselines/previous works on five challenging video question answering and event prediction benchmarks, achieving state-of-the-art results in both fine-tuning (NExT-QA and STAR) and zero-shot (NExT-QA, STAR, How2QA, and VLEP) settings.

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and question answering on videos. SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines/previous works on five challenging video question answering and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA and STAR) and zero-shot (NExT-QA, STAR, How2QA, and VLEP) settings. We show a comprehensive analysis of our framework, including the impact of Localizer, comparisons of Localizer with other temporal localization models, pre-training/self-refinement of Localizer, and varying the number of keyframes.

Depth-discriminative Metric Learning for Monocular 3D Object Detection
Wonhyeok Choi Mingyu Shin Sunghoon Im



Research question: Monocular 3D object detection poses a significant challenge due to the lack of depth information in RGB images.
Motivation: Many existing methods improve object depth estimation by allocating extra parameters, modules, or data, whereas we propose a new metric learning scheme that encourages the model to extract depth-discriminative features regardless of visual attributes, without increasing inference time or model size.
Method: Our method employs a distance-preserving function to organize the feature-space manifold according to ground-truth object depth. The proposed $(K,B,\epsilon)$-quasi-isometric loss uses predetermined pairwise distance restrictions as guidance for adjusting distances among object descriptors without disrupting the non-linearity of the natural feature manifold. We also introduce an auxiliary head for object-wise depth estimation, which improves depth quality while maintaining inference time.
Results: Experiments on various baselines demonstrate the broad applicability of the method; it improves performance by 23.51% and 5.78% on average across KITTI and Waymo, respectively.

Monocular 3D object detection poses a significant challenge due to the lack of depth information in RGB images. Many existing methods strive to enhance the object depth estimation performance by allocating additional parameters for object depth estimation, utilizing extra modules or data. In contrast, we introduce a novel metric learning scheme that encourages the model to extract depth-discriminative features regardless of the visual attributes without increasing inference time and model size. Our method employs the distance-preserving function to organize the feature space manifold in relation to ground-truth object depth. The proposed $(K,B,\epsilon)$-quasi-isometric loss leverages predetermined pairwise distance restriction as guidance for adjusting the distance among object descriptors without disrupting the non-linearity of the natural feature manifold. Moreover, we introduce an auxiliary head for object-wise depth estimation, which enhances depth quality while maintaining the inference time. The broad applicability of our method is demonstrated through experiments that show improvements in overall performance when integrated into various baselines. The results show that our method consistently improves the performance of various baselines by 23.51\% and 5.78\% on average across KITTI and Waymo, respectively.
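
The $(K,B,\epsilon)$-quasi-isometric idea translates into a hinge penalty keeping pairwise feature distances within a band around ground-truth depth differences, $d_z/K - B \le d_f \le K\,d_z + B$ with slack $\epsilon$. A sketch with illustrative constants (not the paper's implementation):

```python
# Sketch of a (K, B, eps)-quasi-isometric penalty on pairwise distances
# between object descriptors, anchored to ground-truth depth differences.
import torch

def quasi_isometric_loss(feats, depths, K=1.5, B=0.1, eps=0.0):
    """feats: (N, D) object descriptors; depths: (N,) ground-truth depths."""
    d_f = torch.cdist(feats, feats)                  # (N, N) feature distances
    d_z = (depths[:, None] - depths[None, :]).abs()  # (N, N) depth differences
    upper = torch.relu(d_f - (K * d_z + B) - eps)    # pairs pushed too far apart
    lower = torch.relu((d_z / K - B) - d_f - eps)    # pairs pulled too close
    return (upper + lower).mean()
```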

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation
Siyu Jiao Yunchao Wei Yaowei Wang Yao Zhao Humphrey Shi



Research question: How to improve the performance of pre-trained vision-language models on the zero-shot segmentation task.
Motivation: Current pre-trained vision-language models for zero-shot segmentation typically generate mask proposals and classify them with CLIP, but this approach yields numerous false positives.
Method: Propose a simple yet effective method named Mask-aware Fine-tuning (MAFT). First, design an Image-Proposals CLIP Encoder (IP-CLIP Encoder) that handles arbitrary numbers of images and mask proposals simultaneously. Then, design a *mask-aware loss* and a *self-distillation loss* to fine-tune the IP-CLIP Encoder, ensuring that CLIP responds to different mask proposals without sacrificing transferability.
Results: On popular zero-shot benchmarks, MAFT promotes the performance of existing methods by a large margin: 50.4% (+8.2%) mIoU on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K for unseen classes.

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain the CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. This issue mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of image and mask proposals simultaneously. Then, *mask-aware loss* and *self-distillation loss* are designed to fine-tune IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while not sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is promoted by a large margin: 50.4\% (+ 8.2\%) on COCO, 81.8\% (+ 3.2\%) on Pascal-VOC, and 8.7\% (+4.3\%) on ADE20K in terms of mIoU for unseen classes. Codes will be provided for reproducibility. Code is available at https://github.com/jiaosiyu1999/MAFT.git .

ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding
Lunhao Duan Shanshan Zhao Nan Xue Mingming Gong Gui-Song Xia Dacheng Tao



Research question: How to effectively apply Transformers to 3D point cloud understanding, especially when processing large point clouds.
Motivation: Existing methods are hampered by high computational cost and an inability to effectively capture local geometric structure when processing large point clouds.
Method: This paper proposes ConDaFormer, a new Transformer block that disassembles the cubic window into three orthogonal 2D planes, reducing the number of points involved in attention, and introduces depth-wise convolutions to capture local geometric information.
Results: Experiments show that ConDaFormer effectively captures both long-range contextual information and local priors, achieving strong results on several 3D point cloud understanding benchmarks.

Transformers have been recently explored for 3D point cloud understanding with impressive progress achieved. A large number of points, over 0.1 million, make the global self-attention infeasible for point cloud data. Thus, most methods propose to apply the transformer in a local region, e.g., spherical or cubic window. However, it still contains a large number of Query-Key pairs, which requires high computational costs. In addition, previous methods usually learn the query, key, and value using a linear projection without modeling the local 3D geometric structure. In this paper, we attempt to reduce the costs and model the local geometry prior by developing a new transformer block, named ConDaFormer. Technically, ConDaFormer disassembles the cubic window into three orthogonal 2D planes, leading to fewer points when modeling the attention in a similar range. The disassembling operation is beneficial to enlarging the range of attention without increasing the computational complexity, but ignores some contexts. To provide a remedy, we develop a local structure enhancement strategy that introduces a depth-wise convolution before and after the attention. This scheme can also capture the local geometric information. Taking advantage of these designs, ConDaFormer captures both long-range contextual information and local priors. The effectiveness is demonstrated by experimental results on several 3D point cloud understanding benchmarks. Our code will be available.

Cross-Scale MAE: A Tale of Multiscale Exploitation in Remote Sensing
Maofeng Tang Andrei Liviu Cozma Konstantinos Georgiou Hairong Qi



Research question: Remote sensing image analysis faces unique challenges such as extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem under a self-supervised learning framework for remote sensing image understanding.
Motivation: Because of the distinctive nature of remote sensing images, conventional pre-trained models struggle with them. This paper therefore proposes Cross-Scale MAE, a self-supervised model based on the Masked Auto-Encoder (MAE), to address multi-scale representation learning for remote sensing imagery.
Method: During pre-training, Cross-Scale MAE employs scale augmentation and enforces cross-scale consistency constraints through both contrastive and generative losses, ensuring consistent, meaningful representations suited to a wide range of downstream tasks. The implementation also leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining representation quality.
Results: Experimental evaluations show that Cross-Scale MAE outperforms the standard MAE and other state-of-the-art remote sensing MAE methods.

Remote sensing images present unique challenges to image analysis due to the extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem but under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses, to ensure consistent and meaningful representations well-suited for a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.

STREAMER: Streaming Representation Learning and Event Segmentation in a Hierarchical Manner
Ramy Mounir Sujal Vijayaraghavan Sudeep Sarkar



Research question: How to semantically group and segment streaming perceptual inputs in a hierarchical manner.
Motivation: Address how to semantically group streaming inputs into chunks at different levels of a hierarchy while simultaneously learning a robust global representation for each chunk.
Method: Propose STREAMER, an architecture trained layer by layer, adapting to the complexity of the input domain. Each layer has two main objectives: making accurate predictions into the future and providing the information other levels need to achieve the same objective. The event hierarchy is built by detecting prediction-error peaks at different levels, where a detected boundary triggers a bottom-up information flow; at an event boundary, one layer's input representation becomes the input to the layer above. A communication module additionally facilitates the exchange of top-down and bottom-up information during prediction.
Results: Trained in a self-supervised, streaming manner, the model needs only a single pass over the training data. Experiments on the EPIC-KITCHENS dataset show strong performance on temporal event segmentation, and event retrieval experiments with the learned representations demonstrate the high quality of the video event representations.

We present a novel self-supervised approach for hierarchical representation learning and segmentation of perceptual inputs in a streaming fashion. Our research addresses how to semantically group streaming inputs into chunks at various levels of a hierarchy while simultaneously learning, for each chunk, robust global representations throughout the domain. To achieve this, we propose STREAMER, an architecture that is trained layer-by-layer, adapting to the complexity of the input domain. In our approach, each layer is trained with two primary objectives: making accurate predictions into the future and providing necessary information to other levels for achieving the same objective. The event hierarchy is constructed by detecting prediction error peaks at different levels, where a detected boundary triggers a bottom-up information flow. At an event boundary, the encoded representation of inputs at one layer becomes the input to a higher-level layer. Additionally, we design a communication module that facilitates top-down and bottom-up exchange of information during the prediction process. Notably, our model is fully self-supervised and trained in a streaming manner, enabling a single pass on the training data. This means that the model encounters each input only once and does not store the data. We evaluate the performance of our model on the egocentric EPIC-KITCHENS dataset, specifically focusing on temporal event segmentation. Furthermore, we conduct event retrieval experiments using the learned representations to demonstrate the high quality of our video event representations. Illustration videos and code are available on our project page: https://ramymounir.com/publications/streamer
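
The boundary rule at the heart of the hierarchy construction is a peak test on the prediction-error signal. A toy sketch with an illustrative windowed-statistic threshold (STREAMER's exact rule may differ):

```python
# Toy sketch: declare an event boundary where the layer's prediction error
# peaks above a running statistic of recent errors. Window size and threshold
# are illustrative knobs, not values from the paper.
import numpy as np

def detect_boundaries(errors, window=16, num_std=2.0):
    """errors: (T,) per-step prediction errors. Returns boundary timesteps."""
    boundaries = []
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        if errors[t] > recent.mean() + num_std * recent.std():
            boundaries.append(t)  # a peak triggers bottom-up information flow
    return boundaries
```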

Temporal Continual Learning with Prior Compensation for Human Motion Prediction
Jianwei Tang Jiangxin Sun Xiaotong Lin Lifang Zhang Wei-Shi Zheng Jian-Fang Hu



Research question: This paper addresses two problems in human motion prediction caused by previous methods treating the prediction of all moments equally: the learning of short-term predictions is hindered, and the use of prior information from past predictions in subsequent predictions is limited.
Motivation: To solve these problems, the authors propose a new multi-stage training framework, Temporal Continual Learning (TCL), and introduce a Prior Compensation Factor (PCF) to better preserve prior information.
Method: Through theoretical derivation, the authors obtain a more reasonable optimization objective and incorporate the PCF into model training to compensate for lost prior information. The TCL framework can be easily integrated with different human motion prediction backbone models and adapted to various datasets and applications.
Results: Extensive experiments on four human motion prediction benchmark datasets demonstrate the effectiveness and flexibility of TCL.

Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at https://github.com/hyqlat/TCL.

Keypoint-Augmented Self-Supervised Learning for Medical Image Segmentation with Limited Annotation
Zhangsihao Yang Mengwei Ren Kaize Ding Guido Gerig Yalin Wang



Research question: How to improve medical image segmentation under low-annotation regimes by pre-training CNN models such as UNet.
Motivation: Although contrastive learning methods have made progress in extracting global and local features, they are limited in capturing the long-range spatial dependencies essential in biological anatomy.
Method: Propose a keypoint-augmented fusion layer that extracts representations preserving both short- and long-range self-attention. Specifically, the CNN feature map is augmented at multiple scales with an additional input that learns long-range spatial self-attention among localized keypoint features. Both global and local self-supervised pre-training are introduced for the framework.
Results: Experiments show that the method outperforms both CNN- and Transformer-based UNets on MRI and CT segmentation tasks when all architectures are trained with randomly initialized weights, and with the proposed pre-training it further surpasses existing self-supervised methods by producing more robust self-attention and achieving state-of-the-art segmentation results.

Pretraining CNN models (i.e., UNet) through self-supervision has become a powerful approach to facilitate medical image segmentation under low annotation regimes. Recent contrastive learning methods encourage similar global representations when the same image undergoes different transformations, or enforce invariance across different image/patch features that are intrinsically correlated. However, CNN-extracted global and local features are limited in capturing long-range spatial dependencies that are essential in biological anatomy. To this end, we present a keypoint-augmented fusion layer that extracts representations preserving both short- and long-range self-attention. In particular, we augment the CNN feature map at multiple scales by incorporating an additional input that learns long-range spatial self-attention among localized keypoint features. Further, we introduce both global and local self-supervised pretraining for the framework. At the global scale, we obtain global representations from both the bottleneck of the UNet, and by aggregating multiscale keypoint features. These global features are subsequently regularized through image-level contrastive objectives. At the local scale, we define a distance-based criterion to first establish correspondences among keypoints and encourage similarity between their features. Through extensive experiments on both MRI and CT segmentation tasks, we demonstrate the architectural advantages of our proposed method in comparison to both CNN and Transformer-based UNets, when all architectures are trained with randomly initialized weights. With our proposed pretraining strategy, our method further outperforms existing SSL methods by producing more robust self-attention and achieving state-of-the-art segmentation results. The code is available at https://github.com/zshyang/kaf.git.

Self-Supervised Motion Magnification by Backpropagating Through Optical Flow
Zhaoying Pan Daniel Geng Andrew Owens



Research question: This paper proposes a simple, self-supervised method for magnifying subtle motions in video.
Motivation: Existing methods require training on synthetic magnification datasets; the proposed approach avoids this need by leveraging the capabilities of off-the-shelf optical flow estimators.
Method: Given an input video and a magnification factor, the video is manipulated so that its new optical flow is scaled by the desired amount. The model is trained with a proposed loss function that estimates the optical flow of the generated video and penalizes its deviation from the given magnification factor.
Results: Evaluations of visual quality and quantitative metrics on a range of real-world and synthetic videos demonstrate the effectiveness of the method, which works with both supervised and unsupervised optical flow methods.

This paper presents a simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, we manipulate the video such that its new optical flow is scaled by the desired amount. To train our model, we propose a loss function that estimates the optical flow of the generated video and penalizes how far it deviates from the given magnification factor. Thus, training involves differentiating through a pretrained optical flow network. Since our model is self-supervised, we can further improve its performance through test-time adaptation, by finetuning it on the input video. It can also be easily extended to magnify the motions of only user-selected objects. Our approach avoids the need for synthetic magnification datasets that have been used to train prior learning-based approaches. Instead, it leverages the existing capabilities of off-the-shelf motion estimators. We demonstrate the effectiveness of our method through evaluations of both visual quality and quantitative metrics on a range of real-world and synthetic videos, and we show our method works for both supervised and unsupervised optical flow methods.
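
As a rough illustration of the training signal, here is a minimal numpy sketch of such a flow-based magnification loss; the function name, the L1 penalty, and the toy flows are assumptions, since the abstract specifies only that deviation from the target scaled flow is penalized.

import numpy as np

def magnification_loss(flow_generated, flow_input, alpha):
    # Penalize how far the generated video's flow deviates from the
    # input flow scaled by the magnification factor alpha.
    return float(np.mean(np.abs(flow_generated - alpha * flow_input)))

# Toy usage: flows are (H, W, 2) arrays of per-pixel displacements.
rng = np.random.default_rng(0)
flow_in = rng.normal(size=(4, 4, 2))
flow_gen = 3.0 * flow_in              # a perfectly magnified flow
print(magnification_loss(flow_gen, flow_in, alpha=3.0))  # -> 0.0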

FLSL: Feature-level Self-supervised Learning
Qing Su Anton Netchaev Hai Li Shihao Ji



Research question: Existing self-supervised learning methods mainly target instance-level representations and do not generalize well to dense prediction tasks such as object detection and segmentation.
Motivation: To align self-supervised learning with dense prediction, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuff).
Method: By employing a transformer for joint embedding and clustering, a bi-level feature-clustering self-supervised method is proposed, coined Feature-Level Self-supervised Learning (FLSL).
Results: Experiments show that FLSL yields significant improvements on dense prediction tasks, achieving 44.9% (+2.8%) AP and 46.5% AP in object detection, as well as 40.8% (+2.3%) AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 backbones, respectively. FLSL consistently outperforms existing self-supervised methods on further benchmarks, including UAV object detection on UAVDT and video instance segmentation on DAVIS 2017.

Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MOCOv3) primarily target instance-level representations and do not generalize well to dense prediction tasks, such as object detection and segmentation. Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuffs). By employing a transformer for joint embedding and clustering, we propose a bi-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)% AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbone, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV object detection on UAVDT, and video instance segmentation on DAVIS 2017. We conclude by presenting visualization and various ablation studies to better understand the success of FLSL. The source code is available at https://github.com/ISL-CV/FLSL.
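
The claimed link between attention and mean shift can be seen in a few lines: with a Gaussian kernel, one mean-shift iteration is a kernel-weighted mean of all features, which for L2-normalized features matches one softmax self-attention step with Q = K = V. A minimal numpy sketch (the temperature tau and the toy clusters are assumptions, not FLSL's objective):

import numpy as np

def mean_shift_step(X, tau=0.5):
    # Gaussian-kernel mean shift; for L2-normalized features this matches
    # softmax dot-product attention, since ||xi - xj||^2 = 2 - 2 xi.xj.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / tau)
    W /= W.sum(axis=1, keepdims=True)
    return W @ X

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.1, (8, 2)),   # two feature clusters
                    rng.normal(3, 0.1, (8, 2))])
for _ in range(10):
    X = mean_shift_step(X)
print(np.round(X[:2], 2), np.round(X[-2:], 2))    # points collapse toward cluster modes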

OBJECT 3DIT: Language-guided 3D-aware Image Editing
Oscar Michel Anand Bhattad Eli VanderBilt Ranjay Krishna Aniruddha Kembhavi Tanmay Gupta



Research question: Existing image editing tools typically disregard the underlying 3D geometry of an image, so edits may become detached from the geometric and lighting conditions under which the image was formed.
Motivation: To address this, the task of language-guided 3D-aware editing is formulated: objects in an image are edited according to a language instruction while remaining consistent with the underlying 3D scene.
Method: A benchmark dataset, OBJect, of 400K editing examples is created, and 3DIT, single- and multi-task models for four editing tasks, is developed.
Results: Experiments show that the models understand the 3D composition of entire scenes, factoring in surrounding objects, surfaces, lighting conditions, shadows, and physically plausible object configurations. Surprisingly, the editing capabilities of 3DIT, trained only on synthetic scenes, generalize to real-world images.

Existing image editing tools, while powerful, typically disregard the underlying 3D geometry from which the image is projected. As a result, edits made using these tools may become detached from the geometry and lighting conditions that are at the foundation of the image formation process; such edits break the portrayal of a coherent 3D world. 3D-aware generative models are a promising solution, but currently only succeed on small datasets or at the level of a single object. In this work, we formulate the new task of language-guided 3D-aware editing, where objects in an image should be edited according to a language instruction while remaining consistent with the underlying 3D scene. To promote progress towards this goal, we release OBJect: a benchmark dataset of 400K editing examples created from procedurally generated 3D scenes. Each example consists of an input image, editing instruction in language, and the edited image. We also introduce 3DIT: single and multi-task models for four editing tasks. Our models show impressive abilities to understand the 3D composition of entire scenes, factoring in surrounding objects, surfaces, lighting conditions, shadows, and physically-plausible object configurations. Surprisingly, despite being trained only on synthetic scenes from OBJect, the editing capabilities of 3DIT generalize to real-world images.

Density of States Prediction of Crystalline Materials via Prompt-guided Multi-Modal Transformer
Namkyeong Lee Heewoong Noh Sungwon Kim Dongmin Hyun Gyoung S. Na Chanyoung Park



Research question: How to predict the density of states (DOS) of crystalline materials from the obtained representations.
Motivation: Existing methods focus mainly on obtaining high-quality representations of crystalline materials for DOS prediction, while neglecting the influence of energy levels on the DOS.
Method: A multi-modal transformer integrates heterogeneous information obtained from crystalline materials and energies, modeling the complex relationships between the atoms in a crystal and various energy levels for DOS prediction.
Results: Extensive experiments on two types of DOS (phonon DOS and electron DOS) and various real-world scenarios demonstrate the superiority of DOSTransformer.

The density of states (DOS) is a spectral property of crystalline materials, which provides fundamental insights into various characteristics of the materials. While previous works mainly focus on obtaining high-quality representations of crystalline materials for DOS prediction, we focus on predicting the DOS from the obtained representations by reflecting the nature of DOS: DOS determines the general distribution of states as a function of energy. That is, DOS is not solely determined by the crystalline material but also by the energy levels, which has been neglected in previous works. In this paper, we propose to integrate heterogeneous information obtained from the crystalline materials and the energies via a multi-modal transformer, thereby modeling the complex relationships between the atoms in the crystalline materials and various energy levels for DOS prediction. Moreover, we propose to utilize prompts to guide the model to learn the crystal structural system-specific interactions between crystalline materials and energies. Extensive experiments on two types of DOS, i.e., Phonon DOS and Electron DOS, with various real-world scenarios demonstrate the superiority of DOSTransformer. The source code for DOSTransformer is available at https://github.com/HeewoongNoh/DOSTransformer.

GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization
Vicente Vivanco Cepeda Gaurav Kumar Nayak Mubarak Shah



Research question: Worldwide geo-localization aims to pinpoint the precise location of images taken anywhere on Earth, a task made highly challenging by the immense variation in geographic landscapes.
Motivation: Existing image-retrieval-based methods cannot solve the problem at a global scale, since constructing a gallery of images covering the entire world is infeasible. Methods that divide the Earth into discrete geographic cells turn the problem into classification, but their performance is limited by the predefined classes and often yields inaccurate localization when an image's location deviates significantly from its class center.
Method: GeoCLIP, a novel CLIP-inspired image-to-GPS retrieval approach, enforces alignment between an image and its corresponding GPS location. GeoCLIP's location encoder models the Earth as a continuous function via positional encoding with random Fourier features, building a hierarchical representation that captures information at varying resolutions and yields a semantically rich high-dimensional feature usable even beyond geo-localization.
Results: Extensive experiments and ablations on benchmark datasets demonstrate the effectiveness of the method. Competitive performance is achieved with only 20% of the training data, showing effectiveness even in limited-data settings. Geo-localization from text queries is also demonstrated qualitatively by leveraging the CLIP backbone of the image encoder.

Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. This task has considerable challenges due to the immense variation in geographic landscapes. The image-to-image retrieval-based approaches fail to solve this problem on a global scale as it is not feasible to construct a large gallery of images covering the entire world. Instead, existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task. However, their performance is limited by the predefined classes and often results in inaccurate localizations when an image's location significantly deviates from its class center. To overcome these limitations, we propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations. GeoCLIP's location encoder models the Earth as a continuous function by employing positional encoding through random Fourier features and constructing a hierarchical representation that captures information at varying resolutions to yield a semantically rich high-dimensional feature suitable to use even beyond geo-localization. To the best of our knowledge, this is the first work employing GPS encoding for geo-localization. We demonstrate the efficacy of our method via extensive experiments and ablations on benchmark datasets. We achieve competitive performance with just 20% of training data, highlighting its effectiveness even in limited-data settings. Furthermore, we qualitatively demonstrate geo-localization using a text query by leveraging the CLIP backbone of our image encoder. The project webpage is available at: https://vicentevivan.github.io/GeoCLIP
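
A minimal sketch of the random Fourier feature idea behind such a location encoder; the frequency scale sigma, the feature count, and the plain (lat, lon) input are assumptions here, and GeoCLIP's actual encoder is hierarchical, combining several frequency scales.

import numpy as np

def rff_encode(coords, num_features=64, sigma=1.0, seed=0):
    # gamma(x) = [cos(2*pi*x@B), sin(2*pi*x@B)] with a random Gaussian
    # projection B; sigma controls the spatial resolution of the encoding.
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, sigma, size=(coords.shape[-1], num_features))
    proj = 2.0 * np.pi * coords @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

gps = np.array([[40.7128, -74.0060]])   # (lat, lon) in degrees
print(rff_encode(gps).shape)            # -> (1, 128)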

ConRad: Image Constrained Radiance Fields for 3D Generation from a Single Image
Senthil Purushwalkam Nikhil Naik



Research question: How to reconstruct a 3D object from a single RGB image.
Motivation: Existing methods achieve impressive results in generating 3D models from text prompts but provide no easy way to condition on input RGB data.
Method: Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields, is proposed together with a training algorithm that combines a pretrained diffusion model with the single RGB image to optimize the parameters of a ConRad representation.
Results: Experiments show that ConRad representations simplify the preservation of image details while producing realistic 3D reconstructions; compared with state-of-the-art baselines, the reconstructions remain more faithful to the input and show significantly improved quantitative performance on a ShapeNet object benchmark.

We present a novel method for reconstructing 3D objects from a single RGB image. Our method leverages the latest image generation models to infer the hidden 3D structure while remaining faithful to the input image. While existing methods obtain impressive results in generating 3D models from text prompts, they do not provide an easy approach for conditioning on input RGB data. Naive extensions of these methods often lead to improper alignment in appearance between the input image and the 3D reconstructions. We address these challenges by introducing Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields. ConRad is an efficient 3D representation that explicitly captures the appearance of an input image in one viewpoint. We propose a training algorithm that leverages the single RGB image in conjunction with pretrained Diffusion Models to optimize the parameters of a ConRad representation. Extensive experiments show that ConRad representations can simplify preservation of image details while producing a realistic 3D reconstruction. Compared to existing state-of-the-art baselines, we show that our 3D reconstructions remain more faithful to the input and produce more consistent 3D models while demonstrating significantly improved quantitative performance on a ShapeNet object benchmark.

LART: Neural Correspondence Learning with Latent Regularization Transformer for 3D Motion Transfer
Haoyu Chen Hao Tang Radu Timofte Luc Van Gool Guoying Zhao



Research question: This paper addresses transferring the motion of a dynamic input sequence to a static 3D object with high-fidelity, realistic visual effects.
Motivation: Existing methods require keypoint annotations or predefined correspondences between source and target meshes, and cannot handle large, unseen, full-detail 3D targets.
Method: LART, a novel 3D Transformer framework for 3D motion transfer, is proposed. With carefully designed architectures, LART implicitly learns correspondences without keypoint annotations or predefined correspondences and can handle large, unseen, full-detail 3D targets. A novel latent metric regularization is also introduced to improve motion generation.
Results: Experiments show that LART generates motions with plausible visual effects from only a few samples of the AMASS dataset, demonstrating high learning efficiency. The method shows potential in motion transfer, content generation, temporal interpolation, and motion denoising.

3D motion transfer aims at transferring the motion from a dynamic input sequence to a static 3D object and outputs an identical motion of the target with high-fidelity and realistic visual effects. In this work, we propose a novel 3D Transformer framework called LART for 3D motion transfer. With carefully-designed architectures, LART is able to implicitly learn the correspondence via a flexible geometry perception. Thus, unlike other existing methods, LART does not require any key point annotations or pre-defined correspondence between the motion source and target meshes and can also handle large-size full-detailed unseen 3D targets. Besides, we introduce a novel latent metric regularization on the Transformer for better motion generation. Our rationale lies in the observation that the decoded motions can be approximately expressed as linearly geometric distortion at the frame level. The metric preservation of motions could be translated to the formation of linear paths in the underlying latent space as a rigorous constraint to control the synthetic motions occurring in the construction of the latent space. The proposed LART shows a high learning efficiency with the need for a few samples from the AMASS dataset to generate motions with plausible visual effects. The experimental results verify the potential of our generative model in applications of motion transfer, content generation, temporal interpolation, and motion denoising. The code is made available: https://github.com/mikecheninoulu/LART.

Hyper-HMM: aligning human brains and semantic features in a common latent event space
Caroline Lee Jane Han Ma Feilong Guo Jiahui James Haxby Christopher Baldassano



Research question: Existing alignment methods focus on either spatial hyperalignment (assuming exact temporal correspondence) or temporal alignment (assuming exact spatial correspondence); this work proposes a hybrid model that aligns temporal and spatial brain features simultaneously.
Motivation: Naturalistic stimuli evoke complex neural responses whose spatial and temporal properties differ across individuals; current alignment methods cannot account for both aspects at once, so a new model is needed.
Method: The proposed Hyper-HMM simultaneously aligns temporal and spatial features across brains. It linearly projects voxels to a reduced-dimension latent space, in which timecourses are segmented into corresponding temporal events. This allows tracking each individual's mental trajectory through an event sequence, as well as alignment with other feature spaces such as stimulus content.
Results: Using an fMRI dataset of students watching class-lecture videos, the Hyper-HMM maps all participants and the videos' semantic content into a common low-dimensional space, and these mappings generalize to held-out data. The model offers a new window into individual cognitive dynamics evoked by complex naturalistic stimuli.

Naturalistic stimuli evoke complex neural responses with spatial and temporal properties that differ across individuals. Current alignment methods focus on either spatial hyperalignment (assuming exact temporal correspondence) or temporal alignment (assuming exact spatial correspondence). Here, we propose a hybrid model, the Hyper-HMM, that simultaneously aligns both temporal and spatial features across brains. The model learns to linearly project voxels to a reduced-dimension latent space, in which timecourses are segmented into corresponding temporal events. This approach allows tracking of each individual's mental trajectory through an event sequence, and also allows for alignment with other feature spaces such as stimulus content. Using an fMRI dataset in which students watch videos of class lectures, we demonstrate that the Hyper-HMM can be used to map all participants and the semantic content of the videos into a common low-dimensional space, and that these mappings generalize to held-out data. Our model provides a new window into individual cognitive dynamics evoked by complex naturalistic stimuli.

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Kumar Ashutosh Santhosh Kumar Ramakrishnan Triantafyllos Afouras Kristen Grauman



Research question: How to perceive human actions in terms of a broader task, where multiple keysteps are performed across a long video to reach a final goal state.
Motivation: Prior work largely treats keystep recognition in isolation from this broader structure, or rigidly confines keysteps to a particular sequential script.
Method: A task graph is automatically discovered from how-to videos, representing probabilistically how people tend to execute keysteps; this graph is then used to regularize keystep recognition in novel videos.
Results: On multiple real-world instructional video datasets, the impact is demonstrated: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.

Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state---such as the steps of a recipe or the steps of a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.

CorresNeRF: Image Correspondence Priors for Neural Radiance Fields
Yixing Lao Xiaogang Xu zhipeng cai Xihui Liu Hengshuang Zhao



Research question: Existing neural radiance field (NeRF) models degrade under challenging scenarios with sparse input views.
Motivation: Supervising NeRF training with image correspondence priors is proposed to improve performance in sparse-view settings.
Method: Correspondence priors computed by off-the-shelf methods are injected into the training process by adding loss terms on the reprojection error and depth error of the correspondence points.
Results: Experiments show that the method improves NeRF performance in sparse-view settings across datasets, for both density-based and SDF-based neural implicit representations, outperforming previous methods in both photometric and geometric metrics.

Neural implicit representations in Neural Radiance Fields (NeRF) have achieved impressive results in novel view synthesis and surface reconstruction tasks. However, their performance suffers under challenging scenarios with sparse input views. We present CorresNeRF, a method to leverage image correspondence priors computed by off-the-shelf methods to supervise the training of NeRF. These correspondence priors are first augmented and filtered with our adaptive algorithm. Then they are injected into the training process by adding loss terms on the reprojection error and depth error of the correspondence points. We evaluate our methods on novel view synthesis and surface reconstruction tasks with density-based and SDF-based neural implicit representations across different datasets. We show that this simple yet effective technique can be applied as a plug-and-play module to improve the performance of NeRF under sparse-view settings across different NeRF variants. Our experiments show that we outperform previous methods in both photometric and geometric metrics. The source code is available at https://github.com/yxlao/corres-nerf.
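
To make the loss terms concrete, here is a minimal numpy sketch of a reprojection-error term for a single correspondence pair; the pinhole intrinsics K, the cam1-to-cam2 transform T_12, and the toy geometry are assumptions, not the paper's exact formulation.

import numpy as np

def reprojection_error(u1, d1, u2, K, T_12):
    # Lift pixel u1 with its rendered depth d1 to a 3D point in camera 1,
    # transform into camera 2, project, and compare with its match u2.
    x1 = np.linalg.inv(K) @ np.array([u1[0], u1[1], 1.0]) * d1
    x2 = T_12[:3, :3] @ x1 + T_12[:3, 3]
    p = K @ x2
    return float(np.linalg.norm(p[:2] / p[2] - np.asarray(u2)))

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
T_12 = np.eye(4); T_12[0, 3] = 0.1       # points shift 10 cm along x in camera 2's frame
u1, d1 = (320.0, 240.0), 2.0             # principal-axis pixel at 2 m depth
x1 = np.array([0.0, 0.0, 2.0])           # the lifted point, for reference
u2 = (K @ (x1 + T_12[:3, 3]))[:2] / 2.0  # its exact projection in view 2
print(reprojection_error(u1, d1, u2, K, T_12))  # -> ~0.0 for a consistent pair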

EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding
Shuhan Tan Tushar Nagarajan Kristen Grauman



Research question: How to reduce the computational cost of heavy egocentric video understanding models for broader real-world use.
Motivation: Recent egocentric video understanding models are promising, but their heavy computational expense is a barrier to many real-world applications.
Method: EgoDistill, a distillation-based approach, reconstructs heavy egocentric video clip features by combining semantics from sparse video frames with head motion from lightweight IMU readings; an IMU-based self-supervised pretraining strategy is further devised.
Results: The method yields significant efficiency gains, requiring 200× fewer GFLOPs than equivalent video models, and outperforms state-of-the-art efficient video understanding methods on the Ego4D and EPIC-Kitchens datasets.

Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier to many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with head motion from lightweight IMU readings. We further devise a novel IMU-based self-supervised pretraining strategy. Our method leads to significant improvements in efficiency, requiring 200× fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.

FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow
Cameron Omid Smith Yilun Du Ayush Tewari Vincent Sitzmann



Research question: How to reconstruct 3D neural fields from posed images to enable self-supervised representation learning.
Motivation: Deploying existing 3D scene learners on large-scale video data is limited by their dependence on precise camera poses obtained via structure-from-motion, which is prohibitively expensive at scale.
Method: A method is proposed that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. Frame-to-frame optical flow is first lifted to 3D scene flow via differentiable rendering, preserving the locality and shift-equivariance of the image processing backbone; SE(3) camera poses are then estimated via a weighted least-squares fit to the scene flow field.
Results: Experiments on diverse real-world video datasets show that the method performs robustly, notably on sequences that are traditionally challenging for optimization-based pose estimation techniques.

Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving locality and shift-equivariance of the image processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation via re-rendering the input video, and thus, train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences traditionally challenging to optimization-based pose estimation techniques.
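
The weighted least-squares pose step admits a closed-form (Kabsch-style) solution; here is a minimal numpy sketch, assuming the scene flow has already been turned into 3D point pairs (src, dst) with confidence weights w. This is a simplification for illustration, not FlowCam's actual differentiable implementation.

import numpy as np

def weighted_rigid_fit(src, dst, w):
    # Closed-form weighted least squares for R, t minimizing
    # sum_i w_i * || R @ src_i + t - dst_i ||^2  (Kabsch / Procrustes).
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(0)
    mu_d = (w[:, None] * dst).sum(0)
    H = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

theta = 0.1
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1.0]])
t_true = np.array([0.2, -0.1, 0.05])
src = np.random.default_rng(1).normal(size=(100, 3))
dst = src @ R_true.T + t_true
R, t = weighted_rigid_fit(src, dst, np.ones(100))
print(np.allclose(R, R_true), np.allclose(t, t_true))  # -> True True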

Asynchrony-Robust Collaborative Perception via Bird's Eye View Flow
Sizhe Wei Yuxi Wei Yue Hu Yifan Lu Yiqi Zhong Siheng Chen Ya Zhang



Research question: In multi-agent systems, each agent's perception ability is limited by communication delays, interruptions, and clock misalignment.
Motivation: To address this, CoBEVFlow, an asynchrony-robust collaborative perception system based on bird's eye view (BEV) flow, is proposed.
Method: The system compensates for motion to align asynchronous collaboration messages sent by multiple agents. Specifically, BEV flow models the motion in a scene, and asynchronous perceptual features are reassigned to appropriate positions to mitigate the impact of asynchrony.
Results: CoBEVFlow handles asynchronous collaboration messages sent at irregular, continuous timestamps without discretization, and transmits only the original perceptual features, avoiding additional noise. Extensive experiments on the synthetic IRV2V dataset, which simulates various real-world scenarios, and the real-world DAIR-V2X dataset show that CoBEVFlow consistently outperforms other baselines and is robust in extremely asynchronous settings.

Collaborative perception can substantially boost each agent's perception ability by facilitating communication among multiple agents. However, temporal asynchrony among agents is inevitable in the real world due to communication delays, interruptions, and clock misalignments. This issue causes information mismatch during multi-agent fusion, seriously shaking the foundation of collaboration. To address this issue, we propose CoBEVFlow, an asynchrony-robust collaborative perception system based on bird's eye view (BEV) flow. The key intuition of CoBEVFlow is to compensate motions to align asynchronous collaboration messages sent by multiple agents. To model the motion in a scene, we propose BEV flow, which is a collection of the motion vector corresponding to each spatial location. Based on BEV flow, asynchronous perceptual features can be reassigned to appropriate positions, mitigating the impact of asynchrony. CoBEVFlow has two advantages: (i) CoBEVFlow can handle asynchronous collaboration messages sent at irregular, continuous time stamps without discretization; and (ii) with BEV flow, CoBEVFlow only transports the original perceptual features, instead of generating new perceptual features, avoiding additional noises. To validate CoBEVFlow's efficacy, we create IRregular V2V(IRV2V), the first synthetic collaborative perception dataset with various temporal asynchronies that simulate different real-world scenarios. Extensive experiments conducted on both IRV2V and the real-world dataset DAIR-V2X show that CoBEVFlow consistently outperforms other baselines and is robust in extremely asynchronous settings. The code is available at https://github.com/MediaBrain-SJTU/CoBEVFlow.

ViSt3D: Video Stylization with 3D CNN
Ayush Pande Gaurav Sharma



Research question: How to stylize videos effectively.
Motivation: While image stylization has advanced rapidly in recent years, video stylization, being more complex, remains relatively underexplored.
Method: A method for video stylization directly with a 3D CNN is proposed: motion and appearance are first disentangled, the appearance component is stylized, and the motion component is then added back and decoded to obtain the final stylized video.
Results: The first successful video stylization with a 3D CNN is demonstrated, with better texture stylization than existing 2D methods.

Visual stylization has been a very popular research area in recent times. While image stylization has seen a rapid advancement in the recent past, video stylization, while being more challenging, is relatively less explored. The immediate method of stylizing videos by stylizing each frame independently has been tried with some success. To the best of our knowledge, we present the first approach to video stylization using 3D CNN directly, building upon insights from 2D image stylization. Stylizing video is highly challenging, as the appearance and video motion, which includes both camera and subject motions, are inherently entangled in the representations learnt by a 3D CNN. Hence, a naive extension of 2D CNN stylization methods to 3D CNN does not work. To perform stylization with 3D CNN, we propose to explicitly disentangle motion and appearance, stylize the appearance part, and then add back the motion component and decode the final stylized video. In addition, we propose a dataset, curated from existing datasets, to train video stylization networks. We also provide an independently collected test set to study the generalization of video stylization methods. We provide results on this test dataset comparing the proposed method with 2D stylization methods applied frame by frame. We show successful stylization with 3D CNN for the first time, and obtain better stylization in terms of texture compared to the existing 2D methods.

Color Equivariant Convolutional Networks
Attila Lengyel Ombretta Strafforello Robert-Jan Bruintjes Alexander Gielisse Jan van Gemert



Research question: How to make convolutional neural networks equivariant to color transformations while retaining color information.
Motivation: Existing CNNs struggle with color variations introduced by accidental recording conditions; color invariance addresses this but removes all color information, sacrificing discriminative power.
Method: Color Equivariant Convolutions (CEConvs), a novel deep learning building block, share shape features across the color spectrum while retaining important color information, extending the notion of equivariance from geometric to photometric transformations by incorporating parameter sharing over hue shifts in the network.
Results: Experiments show that CEConvs improve downstream performance on various tasks and robustness to color changes, including train-test distribution shifts. The approach integrates seamlessly into existing architectures such as ResNets and offers a promising solution to color-based domain shifts in CNNs.

Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does so at the cost of removing all color information, which sacrifices discriminative power. In this paper, we propose Color Equivariant Convolutions (CEConvs), a novel deep learning building block that enables shape feature sharing across the color spectrum while retaining important color information. We extend the notion of equivariance from geometric to photometric transformations by incorporating parameter sharing over hue-shifts in a neural network. We demonstrate the benefits of CEConvs in terms of downstream performance to various tasks and improved robustness to color changes, including train-test distribution shifts. Our approach can be seamlessly integrated into existing architectures, such as ResNets, and offers a promising solution for addressing color-based domain shifts in CNNs.
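
A rough sketch of hue-shift parameter sharing: each learned RGB filter is replicated under a set of hue rotations, so the same shape is detected regardless of hue. Below, the rotations are the three 120-degree cyclic channel permutations, the simplest discrete special case; CEConvs support finer hue rotations about the gray axis, and the names and shapes here are assumptions.

import numpy as np

def hue_shifted_copies(filt):
    # filt: (3, kH, kW) RGB filter. Cyclically permuting the color
    # channels is a 120-degree hue rotation, so the three copies share
    # one set of shape parameters across three hues.
    return np.stack([np.roll(filt, k, axis=0) for k in range(3)])

base = np.zeros((3, 3, 3)); base[0, :, 1] = 1.0    # a "red vertical edge" filter
bank = hue_shifted_copies(base)                    # red, green, blue variants
red_edge = np.zeros((3, 3, 3)); red_edge[0, :, 1] = 1.0
green_edge = np.roll(red_edge, 1, axis=0)          # same edge, hue-shifted input
resp = lambda f, x: float((f * x).sum())
print(resp(bank[0], red_edge), resp(bank[1], green_edge))  # equal responses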

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection
Haibao Yu Yingjuan Tang Enze Xie Jilei Mao Ping Luo Zaiqing Nie



Research question: How to overcome temporal asynchrony and limited communication conditions in traffic environments to improve autonomous driving perception.
Motivation: Current vehicle-infrastructure cooperative 3D (VIC3D) object detection methods suffer from fusion misalignment, limiting the exploitation of infrastructure data.
Method: Feature Flow Net (FFNet), a flow-based feature fusion framework, predicts future features to compensate for temporal asynchrony and transmits feature flow by exploiting the temporal coherence of sequential infrastructure frames; a self-supervised training approach further enables FFNet to generate feature flow with feature prediction ability from raw infrastructure sequences.
Results: Experiments show that the method outperforms existing cooperative detection methods while requiring only about 1/100 of the transmission cost of raw data, and covers all latencies with one model on the DAIR-V2X dataset.

Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, the uncertain temporal asynchrony and limited communication conditions that are present in traffic environments can lead to fusion misalignment and constrain the exploitation of infrastructure data. To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we propose the Feature Flow Net (FFNet), a novel cooperative detection framework. FFNet is a flow-based feature fusion framework that uses a feature flow prediction module to predict future features and compensate for asynchrony. Instead of transmitting feature maps extracted from still images, FFNet transmits feature flow, leveraging the temporal coherence of sequential infrastructure frames. Furthermore, we introduce a self-supervised training approach that enables FFNet to generate feature flow with feature prediction ability from raw infrastructure sequences. Experimental results demonstrate that our proposed method outperforms existing cooperative detection methods while only requiring about 1/100 of the transmission cost of raw data, and covers all latencies with one model on the DAIR-V2X dataset. The code is available at https://github.com/haibao-yu/FFNet-VIC3D.

Semantic segmentation of sparse irregular point clouds for leaf/wood discrimination
Yuchen BAI Jean-Baptiste Durand Grégoire Laurent Vincent Florence Forbes



Research question: How to accurately discriminate leaves from wood in sparse point clouds acquired by UAVs.
Motivation: Leaf area strongly affects models of gas exchange between vegetation and the atmosphere, so accurate measurement of forest leaf area is needed. UAVs can revisit frequently to track the response of vegetation to climate change, but the miniature sensors they carry usually provide point clouds of limited density, further degraded by occlusion, with density dropping sharply from the top to the bottom of the canopy.
Method: A neural network model based on the PointNet++ architecture uses point geometry only (no spectral information). To cope with local data sparsity, an innovative sampling scheme preserves important local geometric information, and a loss function adapted to the severe class imbalance is proposed.
Results: Experiments show that the model outperforms state-of-the-art alternatives on UAV point clouds. Future improvements may consider denser point clouds acquired from below the canopy.

Lidar (Light Detection and Ranging) has become an essential part of the remote sensing toolbox used for biosphere monitoring. In particular, Lidar provides the opportunity to map forest leaf area with unprecedented accuracy, while leaf area has remained an important source of uncertainty affecting models of gas exchanges between the vegetation and the atmosphere. Unmanned Aerial Vehicles (UAV) are easy to mobilize and therefore allow frequent revisits to track the response of vegetation to climate change. However, miniature sensors carried by UAVs usually provide point clouds of limited density, which are further affected by a strong decrease in density from top to bottom of the canopy due to progressively stronger occlusion. In such a context, discriminating leaf points from wood points presents a significant challenge due in particular to strong class imbalance and spatially irregular sampling intensity. Here we introduce a neural network model based on the PointNet++ architecture which makes use of point geometry only (excluding any spectral information). To cope with local data sparsity, we propose an innovative sampling scheme which strives to preserve important local geometric information. We also propose a loss function adapted to the severe class imbalance. We show that our model outperforms state-of-the-art alternatives on UAV point clouds. We discuss future possible improvements, particularly regarding much denser point clouds acquired from below the canopy.
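
The abstract does not spell out the imbalance-adapted loss, so as a generic illustration only, here is an inverse-frequency-weighted cross-entropy in numpy, a common remedy for severe class imbalance such as a leaf/wood split; the function name, weighting scheme, and toy counts are all assumptions.

import numpy as np

def weighted_cross_entropy(logits, labels, class_counts):
    # Weight each class inversely to its frequency so the rare class
    # (e.g., wood points) is not drowned out by the dominant one.
    w = class_counts.sum() / (len(class_counts) * class_counts)
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = -np.log(p[np.arange(len(labels)), labels])
    return float((w[labels] * nll).mean())

logits = np.array([[2.0, -1.0], [1.5, 0.0], [-0.5, 1.0]])
labels = np.array([0, 0, 1])
counts = np.array([950.0, 50.0])      # e.g., 95% leaf vs 5% wood
print(weighted_cross_entropy(logits, labels, counts))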

Learning Visual Prior via Generative Pre-Training
Jinheng Xie Kai Ye Yudong Li Yuexiang Li Kevin Qinghong Lin Yefeng Zheng Linlin Shen Mike Zheng Shou



Research question: How to learn and explicitly represent properties of visual data, such as object location and shape, as a visual prior with deep models, and how such a prior impacts many vision tasks.
Motivation: Such priors are only implicitly represented in trained models; in conditional image synthesis, for example, spatial conditions that fail to adhere to the prior can yield visually inaccurate results, which motivates learning the visual prior explicitly and enabling customizable sampling.
Method: Inspired by advances in language modeling, the visual prior is learned via generative pre-training, dubbed VisorGPT. By discretizing visual locations (e.g., bounding boxes, human poses, and instance masks) into sequences, VisorGPT models the visual prior through likelihood maximization. Prompt engineering is also investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior.
Results: Experimental results demonstrate that VisorGPT effectively models the visual prior and extrapolates to novel scenes, suggesting that discrete visual locations could be integrated into the learning paradigm of current language models to further perceive the visual world.

Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such a prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions failing to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable the customization of sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model the visual prior through likelihood maximization. Besides, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate the effectiveness of VisorGPT in modeling the visual prior and extrapolating to novel scenes, suggesting that discrete visual locations can be integrated into the learning paradigm of current language models to further perceive the visual world. Code is available at https://sierkinhane.github.io/visor-gpt.
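
A toy sketch of the discretization step: continuous box coordinates are quantized into location tokens and serialized, after which a GPT-style model can maximize the likelihood of such sequences. The token format, bin count, and category tag below are hypothetical, not VisorGPT's actual vocabulary.

def box_to_tokens(box, image_size=512, num_bins=256):
    # Quantize (x1, y1, x2, y2) into discrete location tokens.
    return [f"<loc_{int(v / image_size * (num_bins - 1))}>" for v in box]

seq = ["<category:person>"] + box_to_tokens([100, 60, 260, 400])
print(" ".join(seq))
# -> <category:person> <loc_49> <loc_29> <loc_129> <loc_199>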

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars
Yukun Huang Jianan Wang Ailing Zeng He CAO Xianbiao Qi Yukai Shi Zheng-Jun Zha Lei Zhang



Research question: How to generate and animate high-quality 3D avatars.
Motivation: While existing methods show encouraging results for text-to-3D generation of common objects, creating high-quality, animatable 3D avatars remains challenging.
Method: DreamWaltz, a novel framework, generates and animates complex 3D avatars given text guidance and parametric human body priors. It optimizes implicit neural representations in canonical poses with 3D-consistent occlusion-aware Score Distillation Sampling (SDS), and provides view-aligned supervision via 3D-aware skeleton conditioning, enabling complex avatar generation without artifacts or multiple faces. For animation, the method learns an animatable 3D avatar representation from abundant image priors of a pose-conditioned diffusion model, which can animate complex non-rigged avatars for arbitrary poses without retraining.
Results: Extensive evaluations show that DreamWaltz is an effective and robust approach for creating 3D avatars with complex shapes and appearances as well as novel poses for animation. The framework further enables complex scenes with diverse compositions, including avatar-avatar, avatar-object, and avatar-scene interactions.

We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning which enables complex avatar generation without artifacts and multiple faces. For animation, our method learns an animatable 3D avatar representation from abundant image priors of diffusion model conditioned on various poses, which could animate complex non-rigged avatars given arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results.

Learning Dictionary for Visual Attention
Yingjie Liu Xuan Liu Hui Yu XUAN TANG Xian Wei



Research question: How to use attention mechanisms to capture global structure and long-range relationships in data and improve deep vision models on various computer vision tasks.
Motivation: Attention mechanisms have shown outstanding competence in capturing global structure information and long-range relationships within data, enhancing deep vision models across many computer vision tasks.
Method: A dictionary learning-based attention module (Dic-Attn) is proposed, which models attention as a decomposition-and-reconstruction problem with a sparsity prior, inspired by sparse coding in the human visual perception system. The module decomposes the input into a dictionary and corresponding sparse representations, disentangling latent nonlinear structural information in visual data and reconstructing attention embeddings. Transformation operations in the spatial and channel domains dynamically select the dictionary's atoms and sparse representations; the updated dictionary and sparse representations then capture global contextual information and reconstruct the attention maps.
Results: Extensive experiments on various computer vision tasks, such as image and point cloud classification, validate that the method achieves promising performance and compares strongly with state-of-the-art attention methods.

Recently, the attention mechanism has shown outstanding competence in capturing global structure information and long-range relationships within data, thus enhancing the performance of deep vision models on various computer vision tasks. In this work, we propose a novel dictionary learning-based attention (Dic-Attn) module, which models this issue as a decomposition and reconstruction problem with the sparsity prior, inspired by sparse coding in the human visual perception system. The proposed Dic-Attn module decomposes the input into a dictionary and corresponding sparse representations, allowing for the disentanglement of underlying nonlinear structural information in visual data and the reconstruction of an attention embedding. By applying transformation operations in the spatial and channel domains, the module dynamically selects the dictionary's atoms and sparse representations. Finally, the updated dictionary and sparse representations capture the global contextual information and reconstruct the attention maps. The proposed Dic-Attn module is designed with plug-and-play compatibility, allowing for integration into deep attention encoders. Our approach offers an intuitive and elegant means to exploit the discriminative information from data, promoting visual attention construction. Extensive experimental results on various computer vision tasks, e.g., image and point cloud classification, validate that our method achieves promising performance, and shows a strong competitive comparison with state-of-the-art attention methods.

3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
Haotian Xue Antonio Torralba Joshua B. Tenenbaum Daniel LK Yamins Yunzhu Li Hsiao-Yu Tung



Research question: How to learn 3D-grounded visual intuitive physics models from videos of complex scenes.
Motivation: Humans have strong intuitions about how a scene evolves over time under given actions. This intuition, often termed visual intuitive physics, is a critical ability for making effective plans to manipulate a scene toward desired outcomes.
Method: A framework is presented that learns 3D-grounded visual intuitive physics models from videos of complex scenes with fluids. It consists of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, through which strong relational and structural inductive biases capture the structure of the underlying environment.
Results: Experiments show that the model learns from raw images, makes long-horizon future predictions, and generalizes strongly to complex scenes under extrapolation settings.

Given a visual scene, humans have strong intuitions about how a scene can evolve over time under given actions. The intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene to achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, using which we can impose strong relational and structural inductive bias to capture the structure of the underlying environment. Unlike existing intuitive point-based dynamics works that rely on the supervision of dense point trajectories from simulators, we relax the requirements and only assume access to multi-view RGB images and (imperfect) instance masks acquired using a color prior. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We generate datasets including three challenging scenarios involving fluid, granular materials, and rigid objects in the simulation. The datasets do not include any dense particle information, so most previous 3D-based intuitive physics pipelines can barely handle them. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that once trained, our model can achieve strong generalization in complex scenarios under extrapolation settings.

VPP: Efficient Conditional 3D Generation via Voxel-Point Progressive Representation
Zekun Qi Muzhou Yu Runpei Dong Kaisheng Ma



Research question: This paper addresses the low inference efficiency, limited generation categories, and restricted downstream applications of conditional 3D generation.
Motivation: Current 3D generation methods are limited in inference efficiency, generation categories, and application scope, and need improvement.
Method: A progressive generation method via Voxel-Point Progressive Representation (VPP) is proposed, which leverages the structured voxel representation in the proposed Voxel Semantic Generator and the sparsity of the unstructured point representation in the Point Upsampler, enabling efficient generation of multi-category objects.
Results: Experiments show that VPP can generate high-quality 8K point clouds within 0.2 seconds and exhibits excellent representation transfer performance on various 3D downstream tasks.

Conditional 3D generation is undergoing a significant advancement, enabling the free creation of 3D content from inputs such as text or 2D images. However, previous approaches have suffered from low inference efficiency, limited generation categories, and restricted downstream applications. In this work, we revisit the impact of different 3D representations on generation quality and efficiency. We propose a progressive generation method through Voxel-Point Progressive Representation (VPP). VPP leverages structured voxel representation in the proposed Voxel Semantic Generator and the sparsity of unstructured point representation in the Point Upsampler, enabling efficient generation of multi-category objects. VPP can generate high-quality 8K point clouds within 0.2 seconds. Additionally, the masked generation Transformer allows for various 3D downstream tasks, such as generation, editing, completion, and pre-training. Extensive experiments demonstrate that VPP efficiently generates high-fidelity and diverse 3D shapes across different categories, while also exhibiting excellent representation transfer performance. Codes will be released at https://github.com/qizekun/VPP.

Triangulation Residual Loss for Data-efficient 3D Pose Estimation
Jiachen Zhao Tao Yu Liang An Yipeng Huang Fang Deng Qionghai Dai



Research question: How to effectively exploit unlabeled multiview data for 3D pose estimation.
Motivation: Existing 3D supervised models require large-scale 3D annotated datasets, yet the amount of available data is insufficient to train supervised models to ideal performance, especially for animal pose estimation.
Method: The Triangulation Residual loss (TR loss) is proposed for data-efficient multiview 3D pose estimation. Starting from initial 2D keypoint estimates, the TR loss fine-tunes the 2D detector without 3D supervision by simply minimizing the smallest singular value of the triangulation matrix.
Results: On the Human3.6M dataset, the method achieves a state-of-the-art 25.8mm MPJPE, and a competitive 28.7mm MPJPE with only 5% of the 2D labeled training data, demonstrating its data-efficient training ability.

This paper presents Triangulation Residual loss (TR loss) for multiview 3D pose estimation in a data-efficient manner. Existing 3D supervised models usually require large-scale 3D annotated datasets, but the amount of existing data is still insufficient to train supervised models to achieve ideal performance, especially for animal pose estimation. To employ unlabeled multiview data for training, previous epipolar-based consistency provides a self-supervised loss that considers only the local consistency in pairwise views, resulting in limited performance and heavy calculations. In contrast, TR loss enables self-supervision with global multiview geometric consistency. Starting from initial 2D keypoint estimates, the TR loss can fine-tune the corresponding 2D detector without 3D supervision by simply minimizing the smallest singular value of the triangulation matrix in an end-to-end fashion. Our method achieves the state-of-the-art 25.8mm MPJPE and competitive 28.7mm MPJPE with only 5% 2D labeled training data on the Human3.6M dataset. Experiments on animals such as mice demonstrate our TR loss's data-efficient training ability.
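
The core of the TR loss is a few lines of linear algebra: stack the DLT triangulation constraints from all views and take the smallest singular value, which is zero exactly when a single consistent 3D point explains every 2D detection. A minimal numpy sketch under that standard DLT formulation; the camera matrices and toy point are assumptions.

import numpy as np

def triangulation_residual(points_2d, projections):
    # Smallest singular value of the DLT matrix built from one 2D
    # detection (x, y) per view and its 3x4 projection matrix P.
    rows = []
    for (x, y), P in zip(points_2d, projections):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)                 # (2 * n_views, 4)
    return float(np.linalg.svd(A, compute_uv=False)[-1])

X = np.array([0.5, -0.2, 3.0, 1.0])    # a 3D point (homogeneous)
Ps = [np.hstack([np.eye(3), np.zeros((3, 1))]),
      np.hstack([np.eye(3), np.array([[0.1], [0.0], [0.0]])])]
uvs = [(P @ X)[:2] / (P @ X)[2] for P in Ps]
print(triangulation_residual(uvs, Ps))  # -> ~0 for consistent detections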

RayDF: Neural Ray-surface Distance Fields with Multi-view Consistency
Zhuoman Liu Bo Yang Yan Luximon Ajay Kumar Jinxi Li



Research question: This paper studies continuous representations of 3D shapes.
Motivation: Most existing successful methods are coordinate-based implicit neural representations, but they are inefficient at rendering novel views or recovering explicit surface points. A few works formulate 3D shapes as ray-based neural functions, but the learned structures are inferior due to the lack of multi-view geometry consistency.
Method: RayDF, a new framework, consists of three major components: 1) a simple ray-surface distance field, 2) a novel dual-ray visibility classifier, and 3) a multi-view consistency optimization module that drives the learned ray-surface distances to be multi-view geometry consistent.
Results: Extensive evaluation on three public datasets shows remarkable performance in 3D surface point reconstruction on both synthetic and challenging real-world scenes, clearly surpassing existing coordinate-based and ray-based baselines. Most notably, the method renders an 800x800 depth image 1000x faster than coordinate-based methods, showing its superiority for 3D shape representation. Code and data are available at https://github.com/vLAR-group/RayDF.

In this paper, we study the problem of continuous 3D shape representations. The majority of existing successful methods are coordinate-based implicit neural representations. However, they are inefficient to render novel views or recover explicit surface points. A few works start to formulate 3D shapes as ray-based neural functions, but the learned structures are inferior due to the lack of multi-view geometry consistency. To tackle these challenges, we propose a new framework called RayDF. It consists of three major components: 1) the simple ray-surface distance field, 2) the novel dual-ray visibility classifier, and 3) a multi-view consistency optimization module to drive the learned ray-surface distances to be multi-view geometry consistent. We extensively evaluate our method on three public datasets, demonstrating remarkable performance in 3D surface point reconstruction on both synthetic and challenging real-world 3D scenes, clearly surpassing existing coordinate-based and ray-based baselines. Most notably, our method renders an 800x800 depth image 1000x faster than coordinate-based methods, showing the superiority of our method for 3D shape representation. Our code and data are available at https://github.com/vLAR-group/RayDF.

Mask Propagation for Efficient Video Semantic Segmentation
Yuetian Weng Mingfei Han Haoyu He Mingjie Li Lina Yao Xiaojun Chang Bohan Zhuang



Research question: Video semantic segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
Motivation: Existing methods achieve promising results by extending image semantic segmentation models to exploit temporal relationships across video frames, but often at high computational cost.
Method: MPVSS, an efficient mask propagation framework, is proposed. A strong query-based image segmentor is first applied to sparse key frames to generate accurate binary masks and class predictions. A flow estimation module then uses the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, the mask-flow pairs are warped to serve as mask predictions for the non-key frames. Reusing key-frame predictions avoids processing a large volume of frames individually, alleviating temporal redundancy and significantly reducing computational cost.
Results: Extensive experiments on VSPW and Cityscapes show that the mask propagation framework achieves state-of-the-art accuracy-efficiency trade-offs. For example, the model with a Swin-L backbone outperforms MRCFA with MiT-B5 by 4.0% mIoU on VSPW while requiring only 26% of the FLOPs; compared with the per-frame Mask2Former baseline, the framework reduces FLOPs by up to 4x with at most 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS.

Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, the mask-flow pairs are warped to serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we circumvent the need to process a large volume of video frames individually with resource-intensive segmentors, alleviating temporal redundancy and significantly reducing computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves SOTA accuracy and efficiency trade-offs. For instance, our best model with Swin-L backbone outperforms the SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW dataset. Moreover, our framework reduces up to 4× FLOPs compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS.
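
To illustrate the propagation step, here is a minimal numpy sketch of backward-warping a key-frame mask to a non-key frame with a dense flow field, using nearest-neighbor sampling; MPVSS's actual warping operates on segment-aware flow maps per mask query, so this is only the basic mechanism.

import numpy as np

def warp_mask(key_mask, flow):
    # flow[y, x] points from a non-key-frame pixel back to its source
    # location in the key frame; sample the nearest neighbor there.
    H, W = key_mask.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sx = np.clip(np.round(xs + flow[..., 0]), 0, W - 1).astype(int)
    sy = np.clip(np.round(ys + flow[..., 1]), 0, H - 1).astype(int)
    return key_mask[sy, sx]

mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:4, 2:5] = 1                       # an object in the key frame
flow = np.full((6, 8, 2), -1.0)          # object moved one pixel down-right
print(warp_mask(mask, flow))             # mask appears shifted by (1, 1)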

Learning Dense Flow Field for Highly-accurate Cross-view Camera Localization
Zhenbo Song XiangHui Ze Jianfeng Lu Yujiao Shi



Research question: This paper addresses estimating the 3-DoF camera pose of a ground-level image with respect to a satellite image that covers the local surroundings.
Motivation: Existing methods do not fully exploit pixel-level feature metrics; a novel end-to-end method is therefore proposed that computes the camera pose by learning dense pixel-wise flow fields between ground and satellite image pairs.
Method: The method constructs the feature metric at the pixel level, enabling full-image supervision for learning distinctive geometric configurations and visual appearances across views. Two separate convolutional networks extract ground and satellite features. The ground feature map is projected to the bird's eye view (BEV) using a fixed camera-height assumption for preliminary geometric alignment; a residual convolution block refines the projected BEV features to establish content association with the satellite features; optical flow is estimated between the refined BEV and satellite feature maps with RAFT-based flow decoders; and, given dense flow correspondences, a least squares method filters matching inliers and regresses the ground camera pose.
Results: Extensive experiments show significant improvements over state-of-the-art methods: the median localization error is reduced by 89%, 19%, 80%, and 35% on the KITTI, Ford multi-AV, VIGOR, and Oxford RobotCar datasets, respectively.

This paper addresses the problem of estimating the 3-DoF camera pose for a ground-level image with respect to a satellite image that encompasses the local surroundings. We propose a novel end-to-end approach that leverages the learning of dense pixel-wise flow fields in pairs of ground and satellite images to calculate the camera pose. Our approach differs from existing methods by constructing the feature metric at the pixel level, enabling full-image supervision for learning distinctive geometric configurations and visual appearances across views. Specifically, our method employs two distinct convolution networks for ground and satellite feature extraction. Then, we project the ground feature map to the bird's eye view (BEV) using a fixed camera height assumption to achieve preliminary geometric alignment. To further establish the content association between the BEV and satellite features, we introduce a residual convolution block to refine the projected BEV feature. Optical flow estimation is performed on the refined BEV feature map and the satellite feature map using flow decoder networks based on RAFT. After obtaining dense flow correspondences, we apply the least square method to filter matching inliers and regress the ground camera pose. Extensive experiments demonstrate significant improvements compared to state-of-the-art methods. Notably, our approach reduces the median localization error by 89%, 19%, 80%, and 35% on the KITTI, Ford multi-AV, VIGOR, and Oxford RobotCar datasets, respectively.

Volume Feature Rendering for Fast Neural Radiance Field Reconstruction
Kang Han Wei Xiang Lu Yu



Research question: This paper addresses the high computational complexity of NeRF rendering caused by the large number of color neural network evaluations.
Motivation: In the current NeRF rendering pipeline, repeated evaluation of the color neural network is the main source of computational complexity and limits rendering speed.
Method: Volume feature rendering (VFR) is proposed, which integrates the queried feature vectors along a ray into a single feature vector that is then transformed into the final pixel color by the color neural network, reducing the number of color network evaluations to one per pixel.
Results: Experiments show state-of-the-art rendering quality on both synthetic and real-world datasets while requiring less training time than existing methods.

Neural radiance fields (NeRFs) are able to synthesize realistic novel views from multi-view images captured from distinct positions and perspectives. In NeRF's rendering pipeline, neural networks are used to represent a scene independently or transform the queried learnable feature vector of a point to the expected color or density. With the aid of geometry guides either in the form of occupancy grids or proposal networks, the number of color neural network evaluations can be reduced from hundreds to dozens in the standard volume rendering framework. However, many evaluations of the color neural network are still a bottleneck for fast NeRF reconstruction. This paper revisits volume feature rendering (VFR) for the purpose of fast NeRF reconstruction. The VFR integrates the queried feature vectors of a ray into one feature vector, which is then transformed to the final pixel color by a color neural network. This fundamental change to the standard volume rendering framework requires only a single color neural network evaluation to render a pixel, which substantially lowers the high computational complexity of the rendering framework attributed to a large number of color neural network evaluations. Consequently, we can use a comparably larger color neural network to achieve a better rendering quality while maintaining the same training and rendering time costs. This approach achieves the state-of-the-art rendering quality on both synthetic and real-world datasets while requiring less training time compared with existing methods.
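
The change to the rendering pipeline is small but consequential: blend per-sample features into one vector first, then evaluate the color network once per pixel instead of once per sample. A minimal numpy sketch; the tiny stand-in color head is hypothetical.

import numpy as np

def render_weights(sigmas, deltas):
    # Standard volume rendering weights: w_i = T_i * (1 - exp(-sigma_i * delta_i)).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    return trans * alphas

def volume_feature_rendering(features, sigmas, deltas, color_net):
    # VFR: blend per-sample features into ONE vector, then evaluate the
    # color network a single time per pixel.
    w = render_weights(sigmas, deltas)
    return color_net((w[:, None] * features).sum(0))

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 16))       # 64 samples along a ray, 16-dim features
sig = rng.uniform(0.0, 5.0, size=64)
dlt = np.full(64, 0.02)
color_net = lambda f: 1 / (1 + np.exp(-f[:3]))   # stand-in feature-to-RGB head
print(volume_feature_rendering(feats, sig, dlt, color_net))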

LICO: Explainable Models with Language-Image COnsistency
Yiming Lei Zilong Li Yangyang Li Junping Zhang Hongming Shan



Research question: How to interpret the decision process of deep learning models.
Motivation: Existing interpretation methods such as Grad-CAM generate attention maps that depend only on categorical labels, so the correspondence between images and saliency maps is often incomplete.
Method: LICO, a Language-Image COnsistency model for explainable image classification, correlates learnable linguistic prompts with corresponding visual features in a coarse-to-fine manner.
Results: Extensive experiments on eight benchmark datasets show that LICO significantly improves the explainability of attention maps in conjunction with existing interpretation methods such as Grad-CAM. Notably, LICO improves the classification performance of existing models without introducing any computational overhead during inference.

Interpreting the decisions of deep learning models has been actively studied since the explosion of deep neural networks. One of the most convincing interpretation approaches is salience-based visual interpretation, such as Grad-CAM, where the generation of attention maps depends merely on categorical labels. Although existing interpretation methods can provide explainable decision clues, they often yield partial correspondence between image and saliency maps due to the limited discriminative information from one-hot labels. This paper develops a Language-Image COnsistency model for explainable image classification, termed LICO, by correlating learnable linguistic prompts with corresponding visual features in a coarse-to-fine manner. Specifically, we first establish a coarse global manifold structure alignment by minimizing the distance between the distributions of image and language features. We then achieve fine-grained saliency maps by applying optimal transport (OT) theory to assign local feature maps with class-specific prompts. Extensive experimental results on eight benchmark datasets demonstrate that the proposed LICO achieves a significant improvement in generating more explainable attention maps in conjunction with existing interpretation methods such as Grad-CAM. Remarkably, LICO improves the classification performance of existing models without introducing any computational overhead during inference.
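
The fine-grained assignment step uses optimal transport. As a generic illustration of that machinery (not LICO's exact objective), here is a minimal Sinkhorn iteration producing a soft transport plan between local feature maps and class prompts, with uniform marginals assumed.

import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=200):
    # Entropy-regularized optimal transport between uniform marginals;
    # returns a soft assignment (transport plan) for the given cost matrix.
    K = np.exp(-cost / eps)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    v = np.ones(cost.shape[1])
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
cost = rng.uniform(size=(49, 10))    # e.g., 7x7 local features vs 10 prompts
plan = sinkhorn(cost)
print(plan.shape, np.allclose(plan.sum(0), 1.0 / 10))  # columns match marginals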

FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models
Lihe Yang Xiaogang Xu Bingyi Kang Yinghuan Shi Hengshuang Zhao



Research question: Semantic segmentation requires fine-grained annotations, making data collection and annotation laborious and expensive.
Motivation: FreeMask is proposed, which uses synthetic images from generative models to ease the burden of both data collection and annotation.
Method: Abundant training images are first synthesized conditioned on the semantic masks provided by realistic datasets, yielding extra well-aligned image-mask training pairs for segmentation models. The role of synthetic images is then investigated via joint training with real images or pre-training for real images. A robust filtering principle suppresses incorrectly synthesized regions, and different semantic masks are treated unequally, prioritizing harder ones and sampling more corresponding synthetic images for them.
Results: Experiments show that segmentation models can be greatly enhanced either by joint training or by pre-training with the filtered and re-sampled synthetic images, e.g., from 48.7 to 52.0 mIoU on ADE20K.

Semantic segmentation has witnessed tremendous progress due to the proposal of various advanced network architectures. However, these architectures are extremely hungry for delicate annotations to train on, and such annotations are laborious and expensive to acquire. Therefore, we present FreeMask in this work, which resorts to synthetic images from generative models to ease the burden of both data collection and annotation procedures. Concretely, we first synthesize abundant training images conditioned on the semantic masks provided by realistic datasets. This yields extra well-aligned image-mask training pairs for semantic segmentation models. We surprisingly observe that, solely trained with synthetic images, we already achieve comparable performance with real ones (e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we investigate the role of synthetic images by joint training with real images, or pre-training for real images. Meanwhile, we design a robust filtering principle to suppress incorrectly synthesized regions. In addition, we propose to treat different semantic masks unequally, prioritizing the harder ones and sampling more corresponding synthetic images for them. As a result, either jointly trained or pre-trained with our filtered and re-sampled synthesized images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 mIoU on ADE20K.

Recaptured Raw Screen Image and Video Demoiréing via Channel and Spatial Modulations
Huanjing Yue Yijia Cheng Xin Liu Jingyu Yang



Research question: Capturing screen content with smartphone cameras has become a common way to share information, but the resulting images and videos are often degraded by moiré patterns caused by frequency aliasing between the camera filter array and the digital display grid.
Motivation: Moiré patterns in the raw domain are observed to be simpler than those in the sRGB domain, and the moiré patterns across raw color channels have different properties; an image and video demoiréing network tailored to raw inputs is therefore proposed.
Method: A color-separated feature branch is introduced and fused with the traditional feature-mixed branch via channel and spatial modulations. The channel modulation uses modulated color-separated features to enhance the color-mixed features; the spatial modulation uses features with a large receptive field to modulate features with a small receptive field. In addition, the first well-aligned raw video demoiréing (RawVDemoiré) dataset is built, and an efficient temporal alignment method based on inserting alternating patterns is proposed.
Results: Experiments show that the method achieves state-of-the-art performance for both image and video demoiréing. The dataset and code will be released upon acceptance.

Capturing screen contents by smartphone cameras has become a common way for information sharing. However, these images and videos are often degraded by moiré patterns, which are caused by frequency aliasing between the camera filter array and digital display grids. We observe that the moiré patterns in the raw domain are simpler than those in the sRGB domain, and that the moiré patterns in the raw color channels have different properties. Therefore, we propose an image and video demoiréing network tailored for raw inputs. We introduce a color-separated feature branch, and it is fused with the traditional feature-mixed branch via channel and spatial modulations. Specifically, the channel modulation utilizes modulated color-separated features to enhance the color-mixed features. The spatial modulation utilizes features with a large receptive field to modulate features with a small receptive field. In addition, we build the first well-aligned raw video demoiréing (RawVDemoiré) dataset and propose an efficient temporal alignment method by inserting alternating patterns. Experiments demonstrate that our method achieves state-of-the-art performance for both image and video demoiréing. Our dataset and code will be released after the acceptance of this work.

Activity Grammars for Temporal Action Segmentation
Dayoung Gong Joonseok Lee Deunsol Jung Suha Kwak Minsu Cho



Research question: How to temporally segment actions in untrimmed activity videos.
Motivation: Existing methods fail to understand the compositional structure of multi-level semantics, which makes temporal action segmentation challenging.
Method: An effective activity grammar is introduced to guide neural predictions for temporal action segmentation. A novel grammar induction algorithm, KARI, extracts a powerful context-free grammar from action sequence data, and an efficient generalized parser, BEP, transforms frame-level probability distributions into a reliable action sequence according to the induced grammar with recursive rules.
Results: On two standard benchmarks, Breakfast and 50 Salads, the method significantly improves temporal action segmentation in both performance and interpretability.

Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties of parts. For this reason, the task of temporal action segmentation, which aims to translate an untrimmed activity video into a sequence of action segments, remains challenging. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm, dubbed KARI, that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser, dubbed BEP, that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.

GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Mingzhen Sun Weining Wang Zihan Qin Jiahui Sun Sihan Chen Jing Liu



Research question: This paper addresses global coherence and local realism in video generation.
Motivation: Current non-autoregressive methods often fail to guarantee both global coherence and local realism when generating videos.
Method: GLOBER, a novel non-autoregressive method, first encodes videos into global features via a video encoder and then synthesizes video frames non-autoregressively with a video decoder built on a diffusion model, conditioned on those global features.
Results: Experiments show new state-of-the-art results on multiple benchmarks, effectively improving both the global coherence and the local realism of generated videos.

Video generation necessitates both global coherence and local realism. This work presents a novel non-autoregressive method GLOBER, which first generates global features to obtain comprehensive global guidance and then synthesizes video frames based on the global features to generate coherent videos. Specifically, we propose a video auto-encoder, where a video encoder encodes videos into global features, and a video decoder, built on a diffusion model, decodes the global features and synthesizes video frames in a non-autoregressive manner. To achieve maximum flexibility, our video decoder perceives temporal information through normalized frame indexes, which enables it to synthesize arbitrary sub video clips with predetermined starting and ending frame indexes. Moreover, a novel adversarial loss is introduced to improve the global coherence and local realism between the synthesized video frames. Finally, we employ a diffusion-based video generator to fit the global features outputted by the video encoder for video generation. Extensive experimental results demonstrate the effectiveness and efficiency of our proposed method, and new state-of-the-art results have been achieved on multiple benchmarks.

Learning Adaptive Tensorial Density Fields for Clean Cryo-ET Reconstruction
YUANHAO WANG Ramzi Idoughi Wolfgang Heidrich



Research question: How to reconstruct 3D structures from tilt-series cryo-electron tomography (cryo-ET) data.
Motivation: Cryo-ET is a powerful imaging technique, but it faces challenges such as missing-wedge acquisition, large data sizes, and high noise levels.
Method: A learning-based framework represents the 3D density field of the scanned sample with an adaptive tensorial representation: a quadtree structure is optimized to partition the volume of interest, a vector-matrix factorization of the tensor representing the density field is learned in each node, and the loss function combines a differentiable tomographic formation model with three regularization terms (total variation, a boundary consistency constraint, and an isotropic Fourier prior).
Results: On synthetic and real data, the framework outperforms existing methods, improving reconstruction quality while reducing computation time and memory footprint.

We present a novel learning-based framework for reconstructing 3D structures from tilt-series cryo-Electron Tomography (cryo-ET) data. Cryo-ET is a powerful imaging technique that can achieve near-atomic resolutions. Still, it suffers from challenges such as missing-wedge acquisition, large data size, and high noise levels. Our framework addresses these challenges by using an adaptive tensorial-based representation for the 3D density field of the scanned sample. First, we optimize a quadtree structure to partition the volume of interest. Then, we learn a vector-matrix factorization of the tensor representing the density field in each node. Moreover, we use a loss function that combines a differentiable tomographic formation model with three regularization terms: total variation, boundary consistency constraint, and an isotropic Fourier prior. Our framework allows us to query the density at any location using the learned representation and obtain a high-quality 3D tomogram. We demonstrate the superiority of our framework over existing methods using synthetic and real data. Thus, our framework boosts the quality of the reconstruction while reducing the computation time and the memory footprint. The code is available at https://github.com/yuanhaowang1213/adaptivetensordf.

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator
Xiaolong Wang Runsen Xu Zhuofan Cui Zeyu Wan Yu Zhang



Research question: This paper proposes a new method for fine-grained cross-view geo-localization.
Motivation: Existing methods struggle with occlusion, small overlapping ranges, and seasonal variations when aligning ground and satellite images.
Method: A differentiable spherical transform first aligns the perspective of the ground image with the satellite map, placing ground and aerial images in the same view and on the same plane, which reduces the task to an image alignment problem. A robust correlation-aware homography estimator then aligns the similar parts of the transformed ground image and the satellite image.
Results: By mapping the center point of the transformed ground image to the satellite image with a homography matrix and determining the orientation of the ground camera, the method achieves sub-pixel resolution and meter-level GPS accuracy. On the VIGOR benchmark, it reduces the mean metric localization error by 21.3% and 32.4% in same-area and cross-area generalization tasks, respectively, and by 34.4% on the KITTI benchmark in same-area evaluation.

In this paper, we introduce a novel approach to fine-grained cross-view geo-localization. Our method aligns a warped ground image with a corresponding GPS-tagged satellite image covering the same area using homography estimation. We first employ a differentiable spherical transform, adhering to geometric principles, to accurately align the perspective of the ground image with the satellite map. This transformation effectively places ground and aerial images in the same view and on the same plane, reducing the task to an image alignment problem. To address challenges such as occlusion, small overlapping range, and seasonal variations, we propose a robust correlation-aware homography estimator to align similar parts of the transformed ground image with the satellite image. Our method achieves sub-pixel resolution and meter-level GPS accuracy by mapping the center point of the transformed ground image to the satellite image using a homography matrix and determining the orientation of the ground camera using a point above the central axis. Operating at a speed of 30 FPS, our method outperforms state-of-the-art techniques, reducing the mean metric localization error by 21.3% and 32.4% in same-area and cross-area generalization tasks on the VIGOR benchmark, respectively, and by 34.4% on the KITTI benchmark in same-area evaluation.
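
The final localization step described above reduces to mapping one point through an estimated homography. The snippet below is a minimal numpy illustration of that projective mapping; the matrix values and image size are hypothetical.

```python
import numpy as np

def warp_point(H: np.ndarray, pt: np.ndarray) -> np.ndarray:
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    p = np.array([pt[0], pt[1], 1.0])
    q = H @ p
    return q[:2] / q[2]  # perspective divide

# Hypothetical homography and image center (for a 512x512 transformed ground image).
H = np.array([[1.02, 0.01, 30.0],
              [0.00, 0.98, -12.0],
              [1e-5, 2e-5, 1.0]])
center = np.array([256.0, 256.0])
print(warp_point(H, center))  # pixel location of the camera in the satellite image
```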

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
Shuhuai Ren Aston Zhang Yi Zhu Shuai Zhang Shuai Zheng Mu Li Alex Smola Xu Sun



Research question: This paper proposes POMP, a prompt pre-training method for vision-language models.
Motivation: Improving visual recognition requires a visual-concept prompt that condenses semantic information effectively and transfers strongly.
Method: POMP pre-trains a prompt in a memory- and computation-efficient way so that it condenses semantic information for a rich set of over twenty thousand visual concepts; once pre-trained, the prompt can be plugged directly into a variety of visual recognition tasks to boost performance in a zero-shot manner.
Results: Experiments show that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% over CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 over ZSSeg).

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with its strong transferability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).

From ViT Features to Training-free Video Object Segmentation via Streaming-data Mixture Models
Roy Uziel Or Dinari Oren Freifeld



Research question: This paper addresses semi-supervised video object segmentation: given the binary mask of an object in the first frame, predict the object's masks in subsequent frames.
Motivation: Existing leading solutions have two main drawbacks: 1) expensive and typically supervised training on videos; 2) a large memory footprint during inference.
Method: A training-free solution with a small memory footprint that achieves state-of-the-art results, combining pre-trained deep features (trained on still images) with more classical streaming-data clustering methods.
Results: The method performs strongly on key benchmarks such as the DAVIS-2017 and YouTube-VOS 2018 validation sets, and thanks to the low memory footprint of its compact cluster-based representation, it scales well to high-resolution ViT features.

In the task of semi-supervised video object segmentation, the input is the binary mask of an object in the first frame, and the desired output consists of the corresponding masks of that object in the subsequent frames. Existing leading solutions have two main drawbacks: 1) an expensive and typically-supervised training on videos; 2) a large memory footprint during inference. Here we present a training-free solution, with a low-memory footprint, that yields state-of-the-art results. The proposed method combines pre-trained deep learning-based features (trained on still images) with more classical methods for streaming-data clustering. Designed to adapt to temporal concept drifts and generalize to diverse video content without relying on annotated images or videos, the method eliminates the need for additional training or fine-tuning, ensuring fast inference and immediate applicability to new videos. Concretely, we represent an object via a dynamic ensemble of temporally- and spatially-coherent mixtures over a representation built from pre-trained ViT features and positional embeddings. A convolutional conditional random field further improves spatial coherence and helps reject outliers. We demonstrate the efficacy of the method on key benchmarks: the DAVIS-2017 and YouTube-VOS 2018 validation datasets. Moreover, by virtue of the low-memory footprint of the compact cluster-based representation, the method scales gracefully to high-resolution ViT features. Our code is available at https://github.com/BGU-CS-VIL/Training-Free-VOS
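
To make the streaming-clustering idea concrete, the sketch below propagates a first-frame mask by fitting Gaussian mixtures to foreground and background patch features and scoring the next frame's features. It is a simplified stand-in, using sklearn and random features, for the paper's temporally and spatially coherent mixtures over ViT features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def propagate_mask(feat_t0, mask_t0, feat_t1, n_components=5):
    """feat_*: (H*W, D) patch features; mask_t0: (H*W,) boolean first-frame mask.
    Fits one mixture to foreground and one to background features, then labels
    the next frame's patches by which mixture explains them better."""
    fg = GaussianMixture(n_components, covariance_type="diag").fit(feat_t0[mask_t0])
    bg = GaussianMixture(n_components, covariance_type="diag").fit(feat_t0[~mask_t0])
    return fg.score_samples(feat_t1) > bg.score_samples(feat_t1)

# Toy example with random "features" standing in for ViT patch embeddings.
rng = np.random.default_rng(0)
f0 = np.vstack([rng.normal(0, 1, (100, 16)), rng.normal(3, 1, (100, 16))])
m0 = np.array([True] * 100 + [False] * 100)
f1 = np.vstack([rng.normal(0, 1, (50, 16)), rng.normal(3, 1, (50, 16))])
print(propagate_mask(f0, m0, f1).mean())  # fraction labeled foreground (~0.5 here)
```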

NeRF-IBVS: Visual Servo Based on NeRF for Visual Localization and Navigation
Yuanze Wang Yichao Yan Dianxi Shi Wenhan Zhu Jianqiang Xia Tan Jeff Songchang Jin KE GAO XIAOBO LI Xiaokang Yang



Research question: How to achieve accurate visual localization using only a few posed images?
Motivation: Acquiring large numbers of posed images and dense 3D labels in the real world is challenging and costly.
Method: A coordinate regression network is trained with a few posed images and coarse pseudo-3D labels provided by NeRF; a coarse pose is then estimated from the regression network with PnP; finally, image-based visual servoing (IBVS) with the scene prior provided by NeRF optimizes the pose.
Results: Extensive experiments on the 7-Scenes and 12-Scenes datasets show that the method outperforms state-of-the-art methods under the same setting while using only 5% to 25% of the training data. The method also extends naturally to IBVS-based visual navigation, and its effectiveness is verified in simulation experiments.

Visual localization is a fundamental task in computer vision and robotics. Training existing visual localization methods requires a large number of posed images to generalize to novel views, while state-of-the-art methods generally require dense ground truth 3D labels for supervision. However, acquiring a large number of posed images and dense 3D labels in the real world is challenging and costly. In this paper, we present a novel visual localization method that achieves accurate localization while using only a few posed images compared to other localization methods. To achieve this, we first use a few posed images with coarse pseudo-3D labels provided by NeRF to train a coordinate regression network. Then a coarse pose is estimated from the regression network with PnP. Finally, we use image-based visual servoing (IBVS) with the scene prior provided by NeRF for pose optimization. Furthermore, our method can provide an effective navigation prior, which enables IBVS-based navigation without custom markers or a depth sensor. Extensive experiments on the 7-Scenes and 12-Scenes datasets demonstrate that our method outperforms state-of-the-art methods under the same setting, with only 5% to 25% of the training data. Furthermore, our framework can be naturally extended to the visual navigation task based on IBVS, and its effectiveness is verified in simulation experiments.

Incomplete Multimodality-Diffused Emotion Recognition
Yuanzhi Wang Yong Li Zhen Cui



Research question: This paper addresses the degradation of multimodal emotion recognition (MER) performance caused by missing modalities in real-world scenarios.
Motivation: Compared with a single modality, multimodal information is complementary and helps in understanding human emotions; in practice, however, missing modalities hinder multimodal understanding and degrade MER performance.
Method: An Incomplete Multimodality-Diffused emotion recognition (IMDer) method handles MER under missing modalities. IMDer exploits a score-based diffusion model that maps input Gaussian noise into the desired distribution space of the missing modalities and recovers the missing data in accordance with their original distributions. In particular, to reduce semantic ambiguity between missing and recovered modalities, the available modalities serve as conditions that guide and refine the diffusion-based recovery.
Results: Experiments show that IMDer obtains state-of-the-art MER accuracy under various missing-modality patterns.

Human multimodal emotion recognition (MER) aims to perceive and understand human emotions via various heterogeneous modalities, such as language, vision, and audio. Compared with unimodality, the complementary information in the multimodalities facilitates robust emotion understanding. Nevertheless, in real-world scenarios, missing modalities hinder multimodal understanding and result in degraded MER performance. In this paper, we propose an Incomplete Multimodality-Diffused emotion recognition (IMDer) method to mitigate the challenge of MER under incomplete multimodalities. To recover the missing modalities, IMDer exploits a score-based diffusion model that maps the input Gaussian noise into the desired distribution space of the missing modalities and recovers missing data in accordance with their original distributions. Specifically, to reduce semantic ambiguity between the missing and the recovered modalities, the available modalities are embedded as the condition to guide and refine the diffusion-based recovering process. In contrast to previous work, the diffusion-based modality recovery mechanism in IMDer allows it to simultaneously achieve both distribution consistency and semantic disambiguation. Feature visualization of the recovered modalities illustrates the consistent modality-specific distribution and semantic alignment. Besides, quantitative experimental results verify that IMDer obtains state-of-the-art MER accuracy under various missing modality patterns.

Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities
Andrii Zadaianchuk Maximilian Seitzer Georg Martius



Research question: How to exploit large, unlabeled video collections for object-centric learning of structured representations.
Motivation: Unsupervised video-based object-centric learning is a promising avenue for learning structured representations from large, unlabeled video collections, but previous approaches only scale to real-world datasets in restricted domains.
Method: We propose a novel way to use pre-trained self-supervised features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery.
Results: The loss achieves state-of-the-art performance on the challenging synthetic MOVi datasets; combined with a feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS. https://martius-lab.github.io/videosaur/
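
The loss target can be pictured as an affinity matrix between patch features of consecutive frames: patches that move together stay similar over time, which is the motion cue. The sketch below computes such softmax-normalized affinities; shapes and the temperature are illustrative, and the full method uses affinities like these as prediction targets rather than as a loss by themselves.

```python
import torch
import torch.nn.functional as F

def temporal_similarity_targets(feat_t, feat_t1, tau=0.1):
    """Affinities between patch features of frame t and frame t+1.
    feat_t, feat_t1: (N, D) patch features from a frozen self-supervised ViT
    (random tensors stand in for them here)."""
    f0 = F.normalize(feat_t, dim=-1)
    f1 = F.normalize(feat_t1, dim=-1)
    return F.softmax(f0 @ f1.t() / tau, dim=-1)  # (N, N) row-normalized affinities

targets = temporal_similarity_targets(torch.randn(196, 384), torch.randn(196, 384))
print(targets.shape, float(targets.sum(dim=-1)[0]))  # (196, 196), each row sums to 1
```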

XAGen: 3D Expressive Human Avatars Generation
Zhongcong Xu Jianfeng Zhang Jun Hao Liew Jiashi Feng Mike Zheng Shou



Research question: Existing 3D-aware GAN models focus on controlling the major body joints and neglect the manipulation of expressive attributes such as facial expressions, jaw poses, and hand poses.
Motivation: To address this, we propose XAGen, the first 3D generative model for human avatars with expressive control over the body, face, and hands.
Method: We devise a multi-scale, multi-part 3D representation that models the fine details of the face and hands, and based on it propose a multi-part rendering technique that disentangles the synthesis of body, face, and hands to ease model training and improve geometric quality. We also design multi-part discriminators that evaluate the generated avatars with respect to appearance and fine-grained control capability.
Results: Experiments show that XAGen surpasses state-of-the-art methods in realism, diversity, and expressive control.

Recent advances in 3D-aware GAN models have enabled the generation of realistic and controllable human body images. However, existing methods focus on the control of major body joints, neglecting the manipulation of expressive attributes, such as facial expressions, jaw poses, hand poses, and so on. In this work, we present XAGen, the first 3D generative model for human avatars capable of expressive control over body, face, and hands. To enhance the fidelity of small-scale regions like face and hands, we devise a multi-scale and multi-part 3D representation that models fine details. Based on this representation, we propose a multi-part rendering technique that disentangles the synthesis of body, face, and hands to ease model training and enhance geometric quality. Furthermore, we design multi-part discriminators that evaluate the quality of the generated avatars with respect to their appearance and fine-grained control capabilities. Experiments show that XAGen surpasses state-of-the-art methods in terms of realism, diversity, and expressive control abilities. Code and data will be made available at https://showlab.github.io/xagen.

MG-ViT: A Multi-Granularity Method for Compact and Efficient Vision Transformers
Yu Zhang Yepeng Liu Duoqian Miao Qi Zhang Yiwei Shi Liang Hu



Research question: How to reduce the computational cost of Vision Transformers (ViT).
Motivation: Most existing work on compressing ViT splits images at a single granularity, overlooking that the important information in an image often concentrates in a few regions and calls for multi-granularity attention allocation.
Method: We propose a simple yet effective multi-granularity strategy for compressing ViT and design a two-stage multi-granularity framework, MG-ViT, to balance ViT's performance and computational cost. In the single-granularity inference stage, the input image is split into a small number of patches for simple inference; if necessary, a multi-granularity inference stage is triggered, in which the important patches are further split into finer-grained patches for subsequent inference. We also extend the multi-granularity strategy to hierarchical ViT for downstream tasks such as detection and segmentation.
Results: Extensive experiments demonstrate the effectiveness of the multi-granularity strategy; for example, on ImageNet, MG-ViT reduces the FLOPs of LV-ViT-S by 47% and of DeiT-S by 56% without any loss of performance.

Vision Transformer (ViT) faces obstacles in wide application due to its huge computational cost. Almost all existing studies on compressing ViT split an image at a single granularity, with little exploration of multi-granularity splitting. As important information often concentrates in a few regions of an image, multi-granularity attention allocation is needed. Motivated by this, we introduce a simple but effective multi-granularity strategy to compress ViT. We propose a two-stage multi-granularity framework, MG-ViT, to balance ViT's performance and computational cost. In the single-granularity inference stage, an input image is split into a small number of patches for simple inference. If necessary, a multi-granularity inference stage is triggered, where the important patches are further split into finer-grained patches for subsequent inference. Moreover, while prior studies compress ViT only for classification, we extend the multi-granularity strategy to hierarchical ViT for downstream tasks such as detection and segmentation. Extensive experiments prove the effectiveness of the multi-granularity strategy. For instance, on ImageNet, without any loss of performance, MG-ViT reduces 47% FLOPs of LV-ViT-S and 56% FLOPs of DeiT-S.
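
The two-stage idea can be sketched as: embed coarse patches, score their importance, and re-split only the top-scoring ones. The toy code below uses patch variance as a stand-in importance score (MG-ViT derives importance from the network itself); all names and sizes are illustrative.

```python
import torch

def two_stage_patching(image, patch=32, keep_ratio=0.25):
    """Toy sketch: coarse patches first; the most 'important' ones are re-split
    into four finer sub-patches. image: (C, H, W), H and W divisible by patch."""
    C, H, W = image.shape
    coarse = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, nH, nW, p, p)
    coarse = coarse.permute(1, 2, 0, 3, 4).reshape(-1, C, patch, patch)
    importance = coarse.var(dim=(1, 2, 3))          # proxy score, not MG-ViT's
    k = max(1, int(keep_ratio * coarse.shape[0]))
    top = importance.topk(k).indices
    fine = []
    for i in top:                                   # four finer sub-patches each
        p, half = coarse[i], patch // 2
        fine += [p[:, :half, :half], p[:, :half, half:],
                 p[:, half:, :half], p[:, half:, half:]]
    return coarse, torch.stack(fine)

coarse, fine = two_stage_patching(torch.randn(3, 224, 224))
print(coarse.shape, fine.shape)  # torch.Size([49, 3, 32, 32]) torch.Size([48, 3, 16, 16])
```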

Exploiting Contextual Objects and Relations for 3D Visual Grounding
Li Yang Chunfeng Yuan Ziqi Zhang Zhongang Qi Yan Xu Wei Liu Ying Shan Bing Li Weiping Yang Peng Li Yan Wang Weiming Hu



Research question: How to identify visual objects in 3D scenes from natural language inputs, a key task for enabling machines to understand and engage with real-world environments.
Motivation: The task is challenging because 3D contextual information must be captured to distinguish target objects within complex 3D scenes, and the absence of annotations for contextual objects and relations further exacerbates the difficulty.
Method: We propose a novel model, CORE-3DVG, that explicitly learns contextual objects and relations. It performs 3D visual grounding via three sequential modular networks: a text-guided object detection network, a relation matching network, and a target identification network. During training, we introduce a pseudo-label self-generation strategy and a weakly supervised method to facilitate the learning of contextual objects and relations, respectively.
Results: We validate the model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance.

3D visual grounding, the task of identifying visual objects in 3D scenes based on natural language inputs, plays a critical role in enabling machines to understand and engage with the real-world environment. However, this task is challenging due to the necessity to capture 3D contextual information to distinguish target objects from complex 3D scenes. The absence of annotations for contextual objects and relations further exacerbates the difficulties. In this paper, we propose a novel model, CORE-3DVG, to address these challenges by explicitly learning about contextual objects and relations. Our method accomplishes 3D visual grounding via three sequential modular networks, including a text-guided object detection network, a relation matching network, and a target identification network. During training, we introduce a pseudo-label self-generation strategy and a weakly-supervised method to facilitate the learning of contextual objects and relations, respectively. The proposed techniques allow the networks to focus more effectively on referred objects within 3D scenes by understanding their context better. We validate our model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance. Our code will be public at https://github.com/yangli18/CORE-3DVG.

SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation
Haobo Jiang Mathieu Salzmann Zheng Dang Jin Xie Jian Yang



Research question: This paper proposes an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios.
Motivation: Existing methods struggle to estimate 6D object poses accurately in real-world scenes, motivating a new registration framework built on an SE(3) diffusion model.
Method: The framework formulates 3D registration as a denoising diffusion process that progressively refines the pose of the source point cloud toward a precise alignment with the model point cloud. Training involves two operations: an SE(3) diffusion process that gradually perturbs the optimal rigid transformation by injecting noise (perturbation transformations), and an SE(3) reverse process that learns to denoise the transformation step by step.
Results: Experiments show that the diffusion registration framework delivers outstanding 6D object pose estimation on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.

In this paper, we introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios. Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud to obtain a precise alignment with the model point cloud. Training our framework involves two operations: An SE(3) diffusion process and an SE(3) reverse process. The SE(3) diffusion process gradually perturbs the optimal rigid transformation of a pair of point clouds by continuously injecting noise (perturbation transformation). By contrast, the SE(3) reverse process focuses on learning a denoising network that refines the noisy transformation step-by-step, bringing it closer to the optimal transformation for accurate pose estimation. Unlike standard diffusion models used in linear Euclidean spaces, our diffusion model operates on the SE(3) manifold. This requires exploiting the linear Lie algebra $\mathfrak{se}(3)$ associated with SE(3) to constrain the transformation transitions during the diffusion and reverse processes. Additionally, to effectively train our denoising network, we derive a registration-specific variational lower bound as the optimization objective for model learning. Furthermore, we show that our denoising network can be constructed with a surrogate registration model, making our approach applicable to different deep registration networks. Extensive experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
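
A single noise-injection step on a pose can be sketched by sampling rotational noise in the Lie algebra and mapping it to the group with the exponential map. The snippet below is a simplified stand-in that treats rotation and translation noise independently, unlike the paper's full $\mathfrak{se}(3)$ formulation; the noise scales are arbitrary.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def perturb_pose(R, t, sigma_rot=0.1, sigma_trans=0.05, rng=None):
    """Inject a small random rigid-motion 'noise step' into a pose (R, t).
    Rotation noise is sampled in so(3) as an axis-angle vector and mapped to
    SO(3) with the exponential map; translation noise is Gaussian."""
    rng = rng or np.random.default_rng()
    omega = rng.normal(0.0, sigma_rot, 3)         # so(3) noise (axis-angle)
    dR = Rotation.from_rotvec(omega).as_matrix()  # exp map to SO(3)
    dt = rng.normal(0.0, sigma_trans, 3)
    return dR @ R, dR @ t + dt                    # left-compose the perturbation

R0, t0 = np.eye(3), np.zeros(3)
for step in range(5):                             # gradually diffuse the pose
    R0, t0 = perturb_pose(R0, t0)
print(np.round(R0, 3), np.round(t0, 3))
```

The reverse process would train a network to predict and undo such perturbations step by step, which is the part this sketch leaves out.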

DiViNeT: 3D Reconstruction from Disparate Views using Neural Template Regularization
Aditya Vora Akshay Gadi Patil Hao Zhang



Research question: This paper proposes a volume rendering-based neural surface reconstruction method that takes as few as three disparate RGB images as input.
Motivation: Existing surface reconstruction methods degrade severely with sparse views, leaving significant gaps in the reconstruction; we therefore regularize the reconstruction by learning a set of neural templates that act as surface priors.
Method: Our method, named DiViNet, operates in two stages. The first stage learns the templates, in the form of 3D Gaussian functions, across different scenes and without 3D supervision. In the reconstruction stage, the predicted templates serve as anchors that help "stitch" surfaces together over sparse regions.
Results: Experiments show that the method not only completes surface geometry but also reconstructs surface details to a reasonable extent from sparse input views. On the DTU and BlendedMVS datasets, it achieves the best reconstruction quality under such sparse views and performs on par with, if not better than, competing methods when dense views are used as input.

We present a volume rendering-based neural surface reconstruction method that takes as few as three disparate RGB images as input. Our key idea is to regularize the reconstruction, which is severely ill-posed and leaves significant gaps between the sparse views, by learning a set of neural templates that act as surface priors. Our method, coined DiViNet, operates in two stages. The first stage learns the templates, in the form of 3D Gaussian functions, across different scenes, without 3D supervision. In the reconstruction stage, our predicted templates serve as anchors to help “stitch” the surfaces over sparse regions. We demonstrate that our approach is not only able to complete the surface geometry but also reconstructs surface details to a reasonable extent from few disparate input views. On the DTU and BlendedMVS datasets, our approach achieves the best reconstruction quality among existing methods in the presence of such sparse views and performs on par, if not better, with competing methods when dense views are employed as inputs.

Unleash the Potential of Image Branch for Cross-modal 3D Object Detection
Yifan Zhang Qijian Zhang Junhui Hou Yixuan Yuan Guoliang Xing



Research question: Existing cross-modal 3D detectors do not fully utilize image information to address the bottlenecks of LiDAR-based detectors.
Motivation: To achieve reliable and precise scene understanding, autonomous vehicles typically combine multiple sensing modalities to capitalize on their complementary attributes.
Method: This paper presents a new cross-modal 3D object detector, UPIDet, which unleashes the potential of the image branch in two ways. First, UPIDet introduces a new 2D auxiliary task, normalized local coordinate map estimation, which enables learning local spatial-aware features from the image modality to complement sparse point clouds. Second, the representational capability of the point cloud backbone is enhanced through gradients backpropagated from the training objectives of the image branch via a succinct and effective point-to-pixel module.
Results: Extensive experiments and ablation studies validate the method; notably, it achieved the top rank in the highly competitive cyclist class of the KITTI benchmark. The source code is available at https://github.com/Eaphan/UPIDet.

To achieve reliable and precise scene understanding, autonomous vehicles typically incorporate multiple sensing modalities to capitalize on their complementary attributes. However, existing cross-modal 3D detectors do not fully utilize the image domain information to address the bottleneck issues of the LiDAR-based detectors. This paper presents a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects. First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation. This approach enables the learning of local spatial-aware features from the image modality to supplement sparse point clouds. Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch, utilizing a succinct and effective point-to-pixel module. Extensive experiments and ablation studies validate the effectiveness of our method. Notably, we achieved the top rank in the highly competitive cyclist class of the KITTI benchmark at the time of submission. The source code is available at https://github.com/Eaphan/UPIDet.

Weakly Supervised 3D Open-vocabulary Segmentation
Kunhao Liu Fangneng Zhan Jiahui Zhang MUYU XU Yingchen Yu Abdulmotaleb El Saddik Christian Theobalt Eric Xing Shijian Lu



Research question: How to overcome the lack of large-scale, diverse 3D open-vocabulary segmentation datasets and still train robust, generalizable models.
Motivation: 3D open-vocabulary segmentation is a fundamental capability for computer vision research, but progress has been slow due to the lack of large-scale training data.
Method: Knowledge from the pre-trained 2D open-vocabulary models CLIP and DINO is exploited in a weakly supervised manner. Specifically, given only open-vocabulary text descriptions of the objects in a scene, the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO are distilled into a neural radiance field (NeRF), effectively lifting 2D features into view-consistent 3D segmentation.
Results: Experiments show that the method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs.

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at https://github.com/Kunhao-Liu/3D-OVS.

LightSpeed: Light and Fast Neural Light Fields on Mobile Devices
Aarush Gupta Junli Cao Chaoyang Wang Ju Hu Sergey Tulyakov Jian Ren Laszlo Attila Jeni



Research question: How to synthesize novel-view images in real time on mobile devices?
Motivation: Limited computational power and storage make real-time novel-view synthesis on mobile devices challenging.
Method: The classic light slab (two-plane) representation is used to learn a direct mapping from a ray representation to pixel color, enabling efficient neural light field learning.
Results: The method provides better rendering quality than existing light field methods and a significantly better trade-off between rendering quality and speed.

Real-time novel-view image synthesis on mobile devices is prohibitive due to the limited computational power and storage. Using volumetric rendering methods, such as NeRF and its derivatives, on mobile devices is not suitable due to the high computational cost of volumetric rendering. On the other hand, recent advances in neural light field representations have shown promising real-time view synthesis results on mobile devices. Neural light field methods learn a direct mapping from a ray representation to the pixel color. The current choice of ray representation is either stratified ray sampling or Plücker coordinates, overlooking the classic light slab (two-plane) representation, the preferred representation to interpolate between light field views. In this work, we find that the light slab representation is an efficient representation for learning a neural light field. More importantly, it is a lower-dimensional ray representation enabling us to learn the 4D ray space using feature grids which are significantly faster to train and render. Although mostly designed for frontal views, we show that the light-slab representation can be further extended to non-frontal scenes using a divide-and-conquer strategy. Our method provides better rendering quality than prior light field methods as well as a significantly better trade-off between rendering quality and speed.
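
The light slab parameterization itself is classical: a ray is encoded by its intersections with two fixed parallel planes, giving a 4D coordinate (u, v, s, t). A minimal sketch, with arbitrary plane positions:

```python
import numpy as np

def light_slab_coords(origin, direction, z0=0.0, z1=1.0):
    """Parameterize a ray by its intersections (u, v) with the plane z = z0 and
    (s, t) with the plane z = z1. Assumes the ray is not parallel to the planes."""
    o, d = np.asarray(origin, float), np.asarray(direction, float)
    a0 = (z0 - o[2]) / d[2]          # ray parameter at the first plane
    a1 = (z1 - o[2]) / d[2]          # ray parameter at the second plane
    u, v = (o + a0 * d)[:2]
    s, t = (o + a1 * d)[:2]
    return np.array([u, v, s, t])

# A ray from a camera at z = -2 looking toward +z.
print(light_slab_coords(origin=[0.3, -0.1, -2.0], direction=[0.05, 0.02, 1.0]))
```

The 4D coordinate is what gets fed to feature grids, which is why the lower-dimensional representation trains and renders faster than 6D alternatives.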

Fine-Grained Visual Prompting
Lingfeng Yang Yueze Wang Xiang Li Xinlong Wang Jian Yang



Research question: Existing vision-language models show limited performance on instance-level tasks that demand precise localization and recognition.
Motivation: Although existing visual prompt designs such as colorful boxes or circles can improve a model's awareness of objects of interest, visual prompting remains far less explored than language prompting.
Method: This paper carefully studies visual prompting designs by exploring finer-grained markings, such as segmentation masks and their variations, and introduces a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting.
Results: A straightforward application of blur outside the target mask, called the Blur Reverse Mask, proves exceptionally effective: the precise mask annotations reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Fine-Grained Visual Prompting (FGVP) demonstrates superior zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming prior methods by 3.0% to 4.6% on average, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Part detection experiments on the PACO dataset further validate the advantage of FGVP over existing visual prompting techniques.

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. Code is available at https://github.com/ylingfeng/FGVP.
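
The Blur Reverse Mask is straightforward to express: blur the whole image and composite the sharp pixels back inside the mask. A minimal sketch using scipy; the sigma and sizes are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_reverse_mask(image: np.ndarray, mask: np.ndarray, sigma: float = 8.0):
    """Blur everything *outside* the target mask, keeping the object sharp.
    image: (H, W, 3) float array; mask: (H, W) in [0, 1] (soft or binary)."""
    blurred = np.stack(
        [gaussian_filter(image[..., c], sigma) for c in range(image.shape[-1])],
        axis=-1,
    )
    m = mask[..., None]
    return m * image + (1.0 - m) * blurred  # sharp target, blurred context

# Toy example: keep a centered square sharp in a random image.
img = np.random.rand(64, 64, 3)
msk = np.zeros((64, 64)); msk[16:48, 16:48] = 1.0
out = blur_reverse_mask(img, msk)
print(out.shape)  # (64, 64, 3)
```

The prompted image would then be fed to the VLM in place of the original, steering its attention toward the unblurred region.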

DSR: Dynamical Surface Representation as Implicit Neural Networks for Protein
Daiwen Sun He Huang Yao Li Xinqi Gong Qiwei Ye



Research question: This paper proposes a new neural network-based method that models protein dynamics through an implicit representation of a protein's surface in 3D and time.
Motivation: Existing approaches to modeling protein dynamics have limitations, and a more efficient and scalable alternative is needed.
Method: The method uses the zero-level set of signed distance functions (SDFs) to represent protein surfaces, enabling temporally and spatially continuous representations of protein dynamics.
Results: Experiments show that the method accurately captures protein dynamic trajectories and can interpolate and extrapolate in 3D and time. It is the first study to successfully model large-scale protein dynamics in this way, offering a promising new approach for protein dynamics research.

We propose a novel neural network-based approach to modeling protein dynamics using an implicit representation of a protein’s surface in 3D and time. Our method utilizes the zero-level set of signed distance functions (SDFs) to represent protein surfaces, enabling temporally and spatially continuous representations of protein dynamics. Our experimental results demonstrate that our model accurately captures protein dynamic trajectories and can interpolate and extrapolate in 3D and time. Importantly, this is the first study to introduce this method and successfully model large-scale protein dynamics. This approach offers a promising alternative to current methods, overcoming the limitations of first-principles-based and deep learning methods, and provides a more scalable and efficient approach to modeling protein dynamics. Additionally, our surface representation approach simplifies calculations and allows identifying movement trends and amplitudes of protein domains, making it a useful tool for protein dynamics research. Codes are available at https://github.com/Sundw-818/DSR, and we have a project webpage that shows some video results, https://sundw-818.github.io/DSR/.
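
The representation can be pictured as a coordinate network taking (x, y, z, t) and returning a signed distance, with the surface read off as the zero-level set. The sketch below is a minimal stand-in, not the paper's architecture; layer sizes and activations are illustrative.

```python
import torch
import torch.nn as nn

class DynamicSDF(nn.Module):
    """Tiny coordinate network f(x, t) -> signed distance. The surface at time t
    is the zero-level set {x : f(x, t) = 0}, continuous in space and time."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([xyz, t], dim=-1)).squeeze(-1)

sdf = DynamicSDF()
xyz = torch.randn(1024, 3)            # query points
t = torch.full((1024, 1), 0.5)        # normalized time
d = sdf(xyz, t)                       # signed distances, shape (1024,)
surface_pts = xyz[d.abs() < 1e-2]     # approximate zero-level-set samples
print(d.shape, surface_pts.shape)
```

Interpolation and extrapolation in time amount to evaluating the same network at unseen values of t, which is what makes the representation continuous rather than frame-based.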

Achieving Cross Modal Generalization with Multimodal Unified Representation
Yan Xia Hai Huang Jieming Zhu Zhou Zhao



Research question: This paper introduces a new task, Cross Modal Generalization (CMG), which addresses the challenge of learning a unified discrete representation from paired multimodal data during pre-training.
Motivation: Existing multimodal representation learning methods focus mostly on coarse-grained alignment or rely on the assumption that information from different modalities is completely aligned, which is unrealistic in real-world scenarios.
Method: To overcome this limitation, we propose Uni-Code, with two key contributions: the Dual Cross-modal Information Disentangling (DCID) module and Multi-Modal Exponential Moving Average (MM-EMA). These facilitate bidirectional supervision between modalities and align semantically equivalent information in a shared discrete latent space, enabling fine-grained unified representation of multimodal sequences.
Results: During pre-training, we investigate various modality combinations, including audio-visual, audio-text, and the tri-modal combination of audio-visual-text. Extensive experiments on various downstream tasks, such as cross-modal event classification, localization, retrieval, query-based video segmentation, and cross-dataset event localization, demonstrate the effectiveness of the proposed methods. The code is available at https://github.com/haihuangcode/CMG.

This paper introduces a novel task called Cross Modal Generalization (CMG), which addresses the challenge of learning a unified discrete representation from paired multimodal data during pre-training. Then in downstream tasks, the model can achieve zero-shot generalization ability in other modalities when only one modal is labeled. Existing approaches in multimodal representation learning focus more on coarse-grained alignment or rely on the assumption that information from different modalities is completely aligned, which is impractical in real-world scenarios. To overcome this limitation, we propose Uni-Code, which contains two key contributions: the Dual Cross-modal Information Disentangling (DCID) module and the Multi-Modal Exponential Moving Average (MM-EMA). These methods facilitate bidirectional supervision between modalities and align semantically equivalent information in a shared discrete latent space, enabling fine-grained unified representation of multimodal sequences. During pre-training, we investigate various modality combinations, including audio-visual, audio-text, and the tri-modal combination of audio-visual-text. Extensive experiments on various downstream tasks, i.e., cross-modal event classification, localization, cross-modal retrieval, query-based video segmentation, and cross-dataset event localization, demonstrate the effectiveness of our proposed methods. The code is available at https://github.com/haihuangcode/CMG.

Text Promptable Surgical Instrument Segmentation with Vision-Language Models
Zijian Zhou Oluwatosin Alabi Meng Wei Tom Vercauteren Miaojing Shi



Research question: This paper addresses the challenges posed by the diversity and differentiation of surgical instruments in minimally invasive surgery, focusing on surgical instrument segmentation.
Motivation: Conventional segmentation methods struggle with the diversity and differentiation of surgical instruments; inspired by recent advances in vision-language models, the authors propose a novel text promptable segmentation approach.
Method: Pretrained image and text encoders serve as the model backbone, and a text promptable mask decoder consisting of attention- and convolution-based prompting schemes predicts the instrument segmentation. Through a new mixture-of-prompts mechanism, the model leverages multiple text prompts per instrument to improve segmentation performance, and a hard instrument area reinforcement module further improves image feature comprehension and segmentation precision.
Results: Extensive experiments on several surgical instrument segmentation datasets demonstrate superior performance and promising generalization. To the authors' knowledge, this is the first promptable approach to surgical instrument segmentation, offering significant practical potential for robotic-assisted surgery.

In this paper, we propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments in minimally invasive surgeries. We redefine the task as text promptable, thereby enabling a more nuanced comprehension of surgical instruments and adaptability to new instrument types. Inspired by recent advancements in vision-language models, we leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder consisting of attention- and convolution-based prompting schemes for surgical instrument segmentation prediction. Our model leverages multiple text prompts for each surgical instrument through a new mixture of prompts mechanism, resulting in enhanced segmentation performance. Additionally, we introduce a hard instrument area reinforcement module to improve image feature comprehension and segmentation precision. Extensive experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability. To our knowledge, this is the first implementation of a promptable approach to surgical instrument segmentation, offering significant potential for practical application in the field of robotic-assisted surgery.

Towards A Richer 2D Understanding of Hands at Scale
Tianyi Cheng Dandan Shan Ayda Sultan Hassen Richard Ely Locke Higgins David Fouhey



Research question: How can AI systems better understand hand interactions?
Motivation: Humans learn a great deal about interacting with the world by observing how others use their hands; to help AI systems gain a deeper understanding of hand interactions, we introduce a new model.
Method: Our system produces richer outputs than past systems at a larger scale, including boxes and segments for hands, in-contact objects, and second objects touched by tools, as well as contact and grasp types. The method is supported by annotations of 257K images, 401K hands, 288K objects, and 19K second objects spanning four datasets.
Results: Our results show that the method provides rich information and performs and generalizes well.

As humans, we learn a lot about how to interact with the world by observing others interacting with their hands. To help AI systems obtain a better understanding of hand interactions, we introduce a new model that produces a rich understanding of hand interaction. Our system produces a richer output than past systems at a larger scale. Our outputs include boxes and segments for hands, in-contact objects, and second objects touched by tools as well as contact and grasp type. Supporting this method are annotations of 257K images, 401K hands, 288K objects, and 19K second objects spanning four datasets. We show that our method provides rich information and performs and generalizes well.

InfoCD: A Contrastive Chamfer Distance Loss for Point Cloud Completion
Fangzhou Lin Yun Yue Ziming Zhang Songlin Hou Kazunori Yamada Vijaya B Kolachalama Venkatesh Saligrama



Research question: How to effectively measure and learn similarities between 3D point clouds while addressing the outlier sensitivity of existing methods.
Motivation: The Chamfer distance (CD) is a popular metric and training loss for measuring distances between point clouds, but it is sensitive to outliers.
Method: This paper proposes InfoCD, a novel contrastive Chamfer distance loss that learns to spread the matched points for better distribution alignment between point clouds while accounting for a surface similarity estimator.
Results: Experiments on point cloud completion show consistent, significant improvements over all popular baselines trained with CD-based losses, yielding new state-of-the-art results on several benchmark datasets.

A point cloud is a discrete set of data points sampled from a 3D geometric surface. Chamfer distance (CD) is a popular metric and training loss to measure the distances between point clouds, but also well known to be sensitive to outliers. To address this issue, in this paper we propose InfoCD, a novel contrastive Chamfer distance loss to learn to spread the matched points for better distribution alignments between point clouds as well as accounting for a surface similarity estimator. We show that minimizing InfoCD is equivalent to maximizing a lower bound of the mutual information between the underlying geometric surfaces represented by the point clouds, leading to a regularized CD metric which is robust and computationally efficient for deep learning. We conduct comprehensive experiments for point cloud completion using InfoCD and observe significant improvements consistently over all the popular baseline networks trained with CD-based losses, leading to new state-of-the-art results on several benchmark datasets. Demo code is available at https://github.com/Zhang-VISLab/NeurIPS2023-InfoCD.
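
For reference, the standard Chamfer distance that InfoCD builds on looks as follows; InfoCD's contrastive reweighting and surface similarity estimator are not reproduced here, only the base metric whose outlier sensitivity it targets.

```python
import torch

def chamfer_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Standard (squared) Chamfer distance between point sets x: (N, 3), y: (M, 3).
    Each point is matched to its nearest neighbor in the other set; this hard
    nearest-neighbor matching is what makes the metric sensitive to outliers."""
    d = torch.cdist(x, y) ** 2  # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.rand(2048, 3)
gt = torch.rand(2048, 3)
print(chamfer_distance(pred, gt))
```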

Jigsaw: Learning to Assemble Multiple Fractured Objects
Jiaxin Lu Yifan Sun Qixing Huang



Research question: This paper develops a new framework for assembling physically broken 3D objects from multiple fragments.
Motivation: Automated assembly of 3D fractures is essential in orthopedics, archaeology, and daily life.
Method: The proposed framework, Jigsaw, leverages hierarchical features of global and local geometry to match and align fracture surfaces. It consists of four components: (1) a front-end point feature extractor with attention layers; (2) surface segmentation that separates fracture and original parts; (3) multi-part matching that finds correspondences among fracture surface points; and (4) robust global alignment that recovers the global poses of the pieces.
Results: Evaluated on the Breaking Bad dataset, Jigsaw outperforms state-of-the-art methods and generalizes well to diverse fracture modes, objects, and unseen instances. It is the first learning-based method designed specifically for multi-piece 3D fracture assembly.

Automated assembly of 3D fractures is essential in orthopedics, archaeology, and our daily life. This paper presents Jigsaw, a novel framework for assembling physically broken 3D objects from multiple pieces. Our approach leverages hierarchical features of global and local geometry to match and align the fracture surfaces. Our framework consists of four components: (1) front-end point feature extractor with attention layers, (2) surface segmentation to separate fracture and original parts, (3) multi-parts matching to find correspondences among fracture surface points, and (4) robust global alignment to recover the global poses of the pieces. We show how to jointly learn segmentation and matching and seamlessly integrate feature matching and rigidity constraints. We evaluate Jigsaw on the Breaking Bad dataset and achieve superior performance compared to state-of-the-art methods. Our method also generalizes well to diverse fracture modes, objects, and unseen instances. To the best of our knowledge, this is the first learning-based method designed specifically for 3D fracture assembly over multiple pieces. Our code is available at https://jiaxin-lu.github.io/Jigsaw/.

PrObeD: Proactive Object Detection Wrapper
Vishal Asnani Abhinav Kumar Suya You Xiaoming Liu



Research question: For both generic and camouflaged images, the global minima reached when training a neural network are not guaranteed to be optimal, so trained 2D object detectors may perform sub-optimally.
Motivation: To rectify this, the paper proposes PrObeD, a proactive wrapper that improves object detectors by learning a signal.
Method: PrObeD consists of an encoder-decoder architecture: the encoder network generates an image-dependent signal (a template) that encrypts the input image, and the decoder recovers this template from the encrypted image. Learning the optimal template yields an object detector with improved detection performance.
Results: Experiments on the MS-COCO, CAMO, COD10K, and NC4K datasets show improvements for different detectors after applying PrObeD.

Previous research in 2D object detection focuses on various tasks, including detecting objects in generic and camouflaged images. These works are regarded as passive works for object detection as they take the input image as is. However, convergence to global minima is not guaranteed to be optimal in neural networks; therefore, we argue that the trained weights in the object detector are not optimal. To rectify this problem, we propose a wrapper based on proactive schemes, PrObeD, which enhances the performance of these object detectors by learning a signal. PrObeD consists of an encoder-decoder architecture, where the encoder network generates an image-dependent signal termed a template to encrypt the input images, and the decoder recovers this template from the encrypted images. We propose that learning the optimum template results in an object detector with improved detection performance. The template acts as a mask on the input images to highlight semantics useful for the object detector. Finetuning the object detector with these encrypted images enhances the detection performance for both generic and camouflaged objects. Our experiments on the MS-COCO, CAMO, COD10K, and NC4K datasets show improvement over different detectors after applying PrObeD. Our models/codes are available at https://github.com/vishal3477/Proactive-Object-Detection.

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
Junyi Zhang Charles Herrmann Junhwa Hur Luisa Polania Cabrera Varun Jampani Deqing Sun Ming-Hsuan Yang



Research question: What do diffusion model features reveal when applied beyond a single image, across multiple, different images and objects?
Motivation: Much is known about using diffusion model features to understand and process single images; their behavior across multiple images and objects is far less explored.
Method: We exploit Stable Diffusion (SD) features for semantic and dense correspondence and find that, with simple post-processing, SD features perform quantitatively on par with state-of-the-art representation learning features.
Results: A simple fusion of the two kinds of features significantly outperforms existing methods and yields substantial performance gains on benchmark datasets.

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images. Project page: https://sd-complements-dino.github.io/.

Locality-Aware Generalizable Implicit Neural Representation
Doyup Lee Chiheon Kim Minsu Cho Wook-Shin Han



Research question: How to improve the expressive power of generalizable implicit neural representations (INRs) so that they can localize and capture fine-grained details of data entities such as specific pixels and rays.
Motivation: The state-of-the-art modulation schemes for generalizable INRs cannot localize fine-grained details, which limits their expressive power.
Method: A novel framework combines a transformer encoder with a locality-aware INR decoder: the encoder predicts a set of latent tokens that encode local information, and the decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for each coordinate input, then decodes coarse-to-fine through multiple frequency bandwidths.
Results: The framework significantly outperforms previous generalizable INRs, and the locality-aware latents prove useful for downstream tasks such as image generation.

Generalizable implicit neural representation (INR) enables a single continuous function, i.e., a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.

Task-aware Distributed Source Coding under Dynamic Bandwidth
Po-han Li Sravan Kumar Ankireddy Ruihan Zhao Hossein Nourkhiz Mahjoub Ehsan Moradi Pari ufuk topcu Sandeep P. Chinchali Hyeji Kim



Research question: How to efficiently compress correlated data in multi-sensor networks to minimize communication overhead.
Motivation: In multi-sensor networks, each sensor independently compresses its data and transmits it to a central node. Because communication bandwidth is limited, the compressor must learn only task-relevant features, and final performance depends heavily on the total available bandwidth, which often varies in practice; the compressor must therefore dynamically exploit the maximum bandwidth available at any instant.
Method: We propose a new distributed compression framework composed of independent encoders and a joint decoder, called neural distributed principal component analysis (NDPCA). By learning low-rank task representations and efficiently allocating bandwidth among sensors, NDPCA flexibly compresses data from multiple sources to any available bandwidth with a single model, reducing compute and storage overhead.
Results: Experiments show that, compared with an autoencoder using uniform bandwidth allocation, NDPCA improves the success rate of multi-view robotic arm manipulation by 9% and the accuracy of object detection on satellite imagery by 14%.

Efficient compression of correlated data is essential to minimize communication overload in multi-sensor networks. In such networks, each sensor independently compresses the data and transmits them to a central node. A decoder at the central node decompresses and passes the data to a pre-trained machine learning-based task model to generate the final output. Due to limited communication bandwidth, it is important for the compressor to learn only the features that are relevant to the task. Additionally, the final performance depends heavily on the total available bandwidth. In practice, it is common to encounter varying availability in bandwidth. Since higher bandwidth results in better performance, it is essential for the compressor to dynamically take advantage of the maximum available bandwidth at any instant. In this work, we propose a novel distributed compression framework composed of independent encoders and a joint decoder, which we call neural distributed principal component analysis (NDPCA). NDPCA flexibly compresses data from multiple sources to any available bandwidth with a single model, reducing compute and storage overhead. NDPCA achieves this by learning low-rank task representations and efficiently distributing bandwidth among sensors, thus providing a graceful trade-off between performance and bandwidth. Experiments show that NDPCA improves the success rate of multi-view robotic arm manipulation by 9% and the accuracy of object detection tasks on satellite imagery by 14% compared to an autoencoder with uniform bandwidth allocation.
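
The low-rank intuition can be sketched with plain linear algebra: concatenate per-sensor latents, compute one shared low-rank basis, and transmit only as many coefficients as the current bandwidth allows, so a single basis serves any bandwidth. This omits NDPCA's learned, task-aware encoders and is only a conceptual stand-in.

```python
import numpy as np

def low_rank_compress(latents, bandwidth):
    """latents: list of (N, d_i) arrays from independent sensor encoders;
    bandwidth: number of scalars transmittable per sample. Returns the
    transmitted codes and the joint decoder's reconstruction."""
    Z = np.concatenate(latents, axis=1)        # (N, sum d_i)
    mu = Z.mean(axis=0)
    U, S, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    k = min(bandwidth, Vt.shape[0])            # truncate basis to the budget
    codes = (Z - mu) @ Vt[:k].T                # (N, k) transmitted codes
    recon = codes @ Vt[:k] + mu                # decoder side
    return codes, recon

rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=(256, 8)), rng.normal(size=(256, 8))
for bw in (2, 8, 16):                          # varying available bandwidth
    codes, recon = low_rank_compress([z1, z2], bw)
    err = np.mean((np.concatenate([z1, z2], 1) - recon) ** 2)
    print(bw, round(float(err), 4))            # error shrinks as bandwidth grows
```

The printout illustrates the graceful performance-bandwidth trade-off the abstract describes: more transmitted components, lower reconstruction error, with no retraining.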

AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Susan Liang Chao Huang Yapeng Tian Anurag Kumar Chenliang Xu



Research question: Can machines that record an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions?
Motivation: We answer this by studying a new task, real-world audio-visual scene synthesis, and a first-of-its-kind NeRF-based approach for multimodal learning.
Method: We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, implicitly associating audio generation with the 3D geometry and material properties of the visual environment. We also present a coordinate transformation module that expresses view directions relative to the sound source, enabling the model to learn sound-source-centric acoustic fields.
Results: We demonstrate the advantages of the method on the high-quality Real-World Audio-Visual Scene (RWAVS) dataset and also achieve notable results on the simulation-based SoundSpaces dataset.

Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task---real-world audio-visual scene synthesis---and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audios along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and the simulation-based SoundSpaces dataset. Notably, we refer readers to view our demo videos for convincing comparisons.

CP-SLAM: Collaborative Neural Point-based SLAM System
Jiarui Hu Mao Mao Hujun Bao Guofeng Zhang Zhaopeng Cui



Research question: This paper presents a collaborative implicit neural simultaneous localization and mapping (SLAM) system that processes RGB-D image sequences.
Motivation: Enabling all the required modules within a unified framework calls for a new point-based 3D scene representation, together with a distributed-to-centralized learning strategy that improves the system's consistency and cooperation.
Method: The proposed system comprises complete front-end and back-end modules, including odometry, loop detection, sub-map fusion, and global optimization; each point maintains a learnable neural feature for scene encoding and is associated with a certain keyframe.
Results: Experiments show that the method outperforms existing approaches in both camera tracking and mapping.

This paper presents a collaborative implicit neural simultaneous localization and mapping (SLAM) system with RGB-D image sequences, which consists of complete front-end and back-end modules including odometry, loop detection, sub-map fusion, and global refinement. In order to enable all these modules in a unified framework, we propose a novel neural point based 3D scene representation in which each point maintains a learnable neural feature for scene encoding and is associated with a certain keyframe. Moreover, a distributed-to-centralized learning strategy is proposed for the collaborative implicit SLAM to improve consistency and cooperation. A novel global optimization framework is also proposed to improve the system accuracy, analogous to traditional bundle adjustment. Experiments on various datasets demonstrate the superiority of the proposed method in both camera tracking and mapping.

CAST: Cross-Attention in Space and Time for Video Action Recognition
Dongho Lee Jongseo Lee Jinwoo Choi



Research question: How to recognize human actions in videos through spatial and temporal understanding.
Motivation: Existing action recognition models suffer from an imbalanced spatio-temporal understanding of videos.
Method: We propose a novel two-stream architecture, Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input.
Results: Extensive experiments on public benchmarks with different characteristics, EPIC-Kitchens-100, Something-Something-V2, and Kinetics-400, demonstrate the favorable performance of the method.

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-Kitchens-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics. The code is available at https://github.com/KHU-VLL/CAST.
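
The core exchange can be pictured as cross-attention in which one expert's tokens query the other's. The sketch below shows one direction of such a bottleneck exchange with standard PyTorch attention; dimensions and the residual arrangement are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckCrossAttention(nn.Module):
    """One direction of a CAST-style exchange: tokens from the spatial expert
    attend to tokens from the temporal expert (queries from one stream,
    keys/values from the other). A minimal sketch of the mechanism only."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial_tokens, temporal_tokens):
        fused, _ = self.attn(query=spatial_tokens,
                             key=temporal_tokens,
                             value=temporal_tokens)
        return self.norm(spatial_tokens + fused)  # residual update

xattn = BottleneckCrossAttention()
s = torch.randn(2, 196, 256)  # spatial tokens (B, N, C)
t = torch.randn(2, 16, 256)   # temporal tokens (B, T, C)
print(xattn(s, t).shape)      # torch.Size([2, 196, 256])
```

In a full two-stream model the symmetric direction (temporal queries attending to spatial keys/values) would run alongside this one, which is how the experts exchange information.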

Improving Graph Matching with Positional Reconstruction Encoder-Decoder Network
Yixiao Zhou Ruiqi Jia Hongxiang Lin Hefeng Quan Yumeng Zhao Xiaoqing Lyu



Research question: This paper addresses the inability of existing deep graph matching methods to adequately capture the spatial relations of semantic keypoints.
Motivation: With current graph construction approaches, existing methods cannot sufficiently capture the relative spatial relations among keypoints from the locations of semantic keypoints.
Method: We introduce a positional reconstruction encoder-decoder (PR-EnDec) that models the intrinsic spatial structure of graphs, and present PREGM, an end-to-end graph matching network based on PR-EnDec.
Results: Extensive experiments on three public keypoint matching datasets demonstrate the effectiveness of the proposed PREGM.

Deriving from image matching and understanding, semantic keypoint matching aims at establishing correspondence between keypoint sets in images. As graphs are powerful tools to represent points and their complex relationships, graph matching provides an effective way to find desired semantic keypoint correspondences. Recent deep graph matching methods have shown excellent performance, but there is still a lack of exploration and utilization of spatial information of keypoints as nodes in graphs. More specifically, existing methods are insufficient to capture the relative spatial relations through current graph construction approaches from the locations of semantic keypoints. To address these issues, we introduce a positional reconstruction encoder-decoder (PR-EnDec) to model intrinsic graph spatial structure, and present an end-to-end graph matching network PREGM based on PR-EnDec. Our PR-EnDec consists of a positional encoder that learns effective node spatial embedding with the affine transformation invariance, and a spatial relation decoder that further utilizes the high-order spatial information by reconstructing the locational structure of graphs contained in the node coordinates. Extensive experimental results on three public keypoint matching datasets demonstrate the effectiveness of our proposed PREGM.

Query-based Temporal Fusion with Explicit Motion for 3D Object Detection
Jinghua Hou Zhe Liu dingkang liang Zhikang Zou Xiaoqing Ye Xiang Bai



Research question: How to effectively exploit temporal information to improve the 3D detection performance of autonomous driving vehicles.
Motivation: Existing methods fuse temporal information either on dense BEV features or on sparse 3D proposal features. The former pays insufficient attention to foreground objects, leading to higher computation cost and sub-optimal performance; the latter requires time-consuming operations to generate sparse 3D proposal features, and its performance is limited by the quality of the 3D proposals.
Method: This paper proposes a simple yet effective Query-based Temporal Fusion Network (QTNet). The main idea is to exploit object queries from previous frames to enhance the representation of current object queries through the proposed Motion-guided Temporal Modeling (MTM) module, which uses the spatial positions of object queries along the temporal dimension to reliably build correlations between adjacent frames.
Results: Experiments show that QTNet outperforms BEV-based and proposal-based approaches on the nuScenes dataset. Moreover, MTM is a plug-and-play module that can be integrated into advanced LiDAR-only or multi-modality 3D detectors and even brings new SOTA performance with negligible computation cost and latency on nuScenes. The code is available at https://github.com/AlmoonYsl/QTNet.

Effectively utilizing temporal information to improve 3D detection performance is vital for autonomous driving vehicles. Existing methods conduct temporal fusion based on either dense BEV features or sparse 3D proposal features. However, the former pays insufficient attention to foreground objects, leading to more computation costs and sub-optimal performance. The latter requires time-consuming operations to generate sparse 3D proposal features, and its performance is limited by the quality of the 3D proposals. In this paper, we propose a simple and effective Query-based Temporal Fusion Network (QTNet). The main idea is to exploit the object queries in previous frames to enhance the representation of current object queries by the proposed Motion-guided Temporal Modeling (MTM) module, which utilizes the spatial position information of object queries along the temporal dimension to reliably construct their relevance between adjacent frames. Experimental results show our proposed QTNet outperforms BEV-based and proposal-based manners on the nuScenes dataset. Besides, the MTM is a plug-and-play module, which can be integrated into some advanced LiDAR-only or multi-modality 3D detectors and even brings new SOTA performance with negligible computation cost and latency on the nuScenes dataset. These experiments clearly demonstrate the superiority and generalization of our method. The code is available at https://github.com/AlmoonYsl/QTNet.

SLIBO-Net: Floorplan Reconstruction via Slicing Box Representation with Local Geometry Regularization
Jheng-Wei Su Kuei-Yu Tung Chi-Han Peng Peter Wonka Hung-Kuo Chu



Research question: This paper aims to improve the reconstruction of 2D floorplans from unstructured 3D point clouds.
Motivation: Current methods fall short in semantic quality, efficient representation, and local geometric detail.
Method: We propose SLIBO-Net, an innovative approach to reconstructing 2D floorplans from unstructured 3D point clouds. It employs a novel transformer-based architecture with an efficient floorplan representation that provides improved room-shape supervision and keeps token numbers manageable, and it incorporates geometric priors as a regularization mechanism and post-processing step to better capture local geometric details.
Results: The method achieves a new state of the art on the Structure3D dataset; the resulting floorplans exhibit enhanced semantic plausibility, substantially improving the overall quality and realism of the reconstructions. The code and dataset are available online.

This paper focuses on improving the reconstruction of 2D floorplans from unstructured 3D point clouds. We identify opportunities for enhancement over the existing methods in three main areas: semantic quality, efficient representation, and local geometric details. To address these, we present SLIBO-Net, an innovative approach to reconstructing 2D floorplans from unstructured 3D point clouds. We propose a novel transformer-based architecture that employs an efficient floorplan representation, providing improved room shape supervision and allowing for manageable token numbers. By incorporating geometric priors as a regularization mechanism and post-processing step, we enhance the capture of local geometric details. We also propose a scale-independent evaluation metric, correcting the discrepancy in error treatment between varying floorplan sizes. Our approach notably achieves a new state-of-the-art on the Structure3D dataset. The resultant floorplans exhibit enhanced semantic plausibility, substantially improving the overall quality and realism of the reconstructions. Our code and dataset are available online.

Connecting Multi-modal Contrastive Representations
Zehan Wang Yang Zhao Xize Cheng Haifeng Huang Jiageng Liu Aoxiong Yin Li Tang Linjun Li Yongqi Wang Ziang Zhang Zhou Zhao



Research question: This paper addresses the reliance of multi-modal contrastive representation (MCR) learning on massive high-quality data pairs by proposing a new method that learns without paired data.
Motivation: Current MCR methods depend on large numbers of high-quality data pairs, which limits their extension to more modalities.
Method: We propose Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on the $(\mathcal{A}$, $\mathcal{B})$ and $(\mathcal{B}$, $\mathcal{C})$ modality pairs, we project them into a new space and use data from the overlapping modality $\mathcal{B}$ to align the two MCRs there. Meanwhile, since the modality pairs $(\mathcal{A}$, $\mathcal{B})$ and $(\mathcal{B}$, $\mathcal{C})$ are already aligned within each MCR, the connection learned through the overlapping modality also transfers to the non-overlapping modality pair $(\mathcal{A}$, $\mathcal{C})$.
Results: Experiments show that, without using any paired data, C-MCR achieves state-of-the-art audio-visual performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition, and also attains advanced zero-shot 3D point cloud classification accuracy for 3D-language learning.

Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on $(\mathcal{A}$, $\mathcal{B})$ and $(\mathcal{B}$, $\mathcal{C})$ modality pairs, we project them to a new space and use the data from the overlapping modality $\mathcal{B}$ to align the two MCRs in the new space. Meanwhile, since the modality pairs $(\mathcal{A}$, $\mathcal{B})$ and $(\mathcal{B}$, $\mathcal{C})$ are already aligned within each MCR, the connection learned by the overlapping modality can also be transferred to the non-overlapping modality pair $(\mathcal{A}$, $\mathcal{C})$. To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completion of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. To demonstrate the effectiveness of C-MCR, we take the field of audio-visual and 3D-language learning as examples. Specifically, we connect CLIP and CLAP via texts to derive audio-visual representations, and integrate CLIP and ULIP via images for 3D-language representations. Remarkably, without using any paired data, C-MCR for audio-visual achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks. Furthermore, C-MCR for 3D-language also attains advanced zero-shot 3D point cloud classification accuracy on ModelNet40. Our project page is available at https://c-mcr.github.io/C-MCR/
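
The connection step can be sketched as training two small projectors so that both MCRs' embeddings of the same modality-B inputs coincide in a new space. The snippet below uses a symmetric InfoNCE objective as an illustrative choice; the dimensions and loss details are assumptions, not C-MCR's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projectors into the new shared space (the two MCRs stay frozen).
proj1 = nn.Linear(512, 256)  # projector for the (A, B) space
proj2 = nn.Linear(768, 256)  # projector for the (B, C) space

def connection_loss(emb_b_mcr1, emb_b_mcr2, tau=0.07):
    """Symmetric InfoNCE between the two projections of the same modality-B inputs."""
    z1 = F.normalize(proj1(emb_b_mcr1), dim=-1)
    z2 = F.normalize(proj2(emb_b_mcr2), dim=-1)
    logits = z1 @ z2.t() / tau            # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0])    # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy batch: embeddings of the same 32 modality-B samples from each frozen MCR.
loss = connection_loss(torch.randn(32, 512), torch.randn(32, 768))
loss.backward()                           # trains only the two projectors
print(float(loss))
```

Because each frozen MCR already aligns its own pair internally, pulling the B-embeddings together in the new space implicitly connects A to C without any (A, C) pairs.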

SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding
Paul-Edouard Sarlin Eduard Trulls Marc Pollefeys Jan Hosang Simon Lynen



Research question: Can we use raw imagery to automatically create better maps that are easy for both humans and machines to interpret?
Motivation: Existing semantic 2D maps are limited in detail and accuracy and are difficult to create and maintain automatically.
Method: We introduce SNAP, a deep network that learns rich 2D neural maps from ground-level and aerial images. The model is trained to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images.
Results: SNAP resolves the location of challenging image queries beyond the reach of traditional methods and outperforms the state of the art in localization by a large margin. Beyond geometry and appearance, the neural maps also capture high-level semantics discovered without explicit supervision, enabling effective pre-training for data-efficient semantic scene understanding and potentially unlocking low-cost creation of more detailed maps.

Semantic 2D maps are commonly used by humans and machines for navigation purposes, whether it's walking or driving. However, these maps have limitations: they lack detail, often contain inaccuracies, and are difficult to create and maintain, especially in an automated fashion. Can we use _raw imagery_ to automatically create _better maps_ that can be easily interpreted by both humans and machines? We introduce SNAP, a deep network that learns rich 2D _neural_ maps from ground-level and overhead images. We train our model to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images. SNAP can resolve the location of challenging image queries beyond the reach of traditional methods, outperforming the state of the art in localization by a large margin. Moreover, our neural maps encode not only geometry and appearance but also high-level semantics, discovered without explicit supervision. This enables effective pre-training for data-efficient semantic scene understanding, with the potential to unlock cost-efficient creation of more detailed maps.

SwiFT: Swin 4D fMRI Transformer
Peter Yongho Kim Junbeom Kwon Sunghwan Joo Sangyoon Bae Donggyu Lee Yoonho Jung Shinjae Yoo Jiook Cha Taesup Moon



Research question: How can brain dynamics be learned directly from high-dimensional data such as functional Magnetic Resonance Imaging (fMRI)?
Motivation: Existing fMRI analysis methods rely on hand-crafted features, and the feature-extraction step risks discarding essential information in fMRI scans.
Method: SwiFT, a Swin Transformer architecture that learns brain dynamics directly from fMRI volumes in a memory- and computation-efficient manner, implemented via a 4D window multi-head self-attention mechanism and absolute positional embeddings.
Results: Evaluated on multiple large-scale resting-state fMRI datasets, including the Human Connectome Project (HCP), Adolescent Brain Cognitive Development (ABCD), and UK Biobank (UKB) datasets, for predicting sex, age, and cognitive intelligence, SwiFT consistently outperforms recent state-of-the-art models. Leveraging its end-to-end learning capability, contrastive-loss-based self-supervised pre-training further improves downstream performance.

Modeling spatiotemporal brain dynamics from high-dimensional data, such as functional Magnetic Resonance Imaging (fMRI), is a formidable task in neuroscience. Existing approaches for fMRI analysis utilize hand-crafted features, but the process of feature extraction risks losing essential information in fMRI scans. To address this challenge, we present SwiFT (Swin 4D fMRI Transformer), a Swin Transformer architecture that can learn brain dynamics directly from fMRI volumes in a memory and computation-efficient manner. SwiFT achieves this by implementing a 4D window multi-head self-attention mechanism and absolute positional embeddings. We evaluate SwiFT using multiple large-scale resting-state fMRI datasets, including the Human Connectome Project (HCP), Adolescent Brain Cognitive Development (ABCD), and UK Biobank (UKB) datasets, to predict sex, age, and cognitive intelligence. Our experimental outcomes reveal that SwiFT consistently outperforms recent state-of-the-art models. Furthermore, by leveraging its end-to-end learning capability, we show that contrastive loss-based self-supervised pre-training of SwiFT can enhance performance on downstream tasks. Additionally, we employ an explainable AI method to identify the brain regions associated with sex classification. To our knowledge, SwiFT is the first Swin Transformer architecture to process 4D spatiotemporal brain functional data in an end-to-end fashion. Our work holds substantial potential in facilitating scalable learning of functional brain imaging in neuroscience research by reducing the hurdles associated with applying Transformer models to high-dimensional fMRI.
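
As a rough illustration of the 4D windowing idea (our own sketch; window sizes, tensor layout, and the attention itself are not the authors' code):

```python
import torch

def window_partition_4d(x, ws):
    """Split a 4D fMRI feature volume into non-overlapping 4D windows, so that
    self-attention can be computed per window.
    x: (B, H, W, D, T, C); ws: (wh, ww, wd, wt)."""
    B, H, W, D, T, C = x.shape
    wh, ww, wd, wt = ws
    x = x.view(B, H // wh, wh, W // ww, ww, D // wd, wd, T // wt, wt, C)
    # reorder to (B, nH, nW, nD, nT, wh, ww, wd, wt, C), then flatten windows
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8, 9).contiguous()
    return x.view(-1, wh * ww * wd * wt, C)   # (num_windows * B, tokens, C)

tokens = window_partition_4d(torch.randn(1, 8, 8, 8, 4, 32), (4, 4, 4, 2))
print(tokens.shape)  # torch.Size([16, 128, 32]): 16 windows of 128 tokens each
```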

Self-Adaptive Motion Tracking against On-body Displacement of Flexible Sensors
Chengxu Zuo Jiawei Fang Shihui Guo Yipeng Qin



Research question: How to cope with on-body displacement of sensors so that human status can be sensed ubiquitously.
Motivation: Thanks to their flexibility and easy integration into wearable systems, flexible sensors are promising for ubiquitous sensing of human status. However, on-body displacement is inevitable because the device cannot be worn at exactly the same position across sessions, and this displacement creates complicated patterns and significant challenges for downstream machine learning algorithms.
Method: A novel self-adaptive motion tracking network with three new components: (1) a lightweight learnable Affine Transformation layer whose parameters can be tuned to efficiently adapt to unknown displacements; (2) a Fourier-encoded LSTM network for better pattern identification; and (3) a novel sequence discrepancy loss, equipped with auxiliary regressors, for unsupervised tuning of the Affine Transformation parameters.
Results: Experiments show that the method effectively handles on-body sensor displacement and improves the accuracy of ubiquitous human-status sensing.

Flexible sensors are promising for ubiquitous sensing of human status due to their flexibility and easy integration as wearable systems. However, on-body displacement of sensors is inevitable since the device cannot be firmly worn at a fixed position across different sessions. This displacement issue causes complicated patterns and significant challenges to subsequent machine learning algorithms. Our work proposes a novel self-adaptive motion tracking network to address this challenge. Our network consists of three novel components: i) a light-weight learnable Affine Transformation layer whose parameters can be tuned to efficiently adapt to unknown displacements; ii) a Fourier-encoded LSTM network for better pattern identification; iii) a novel sequence discrepancy loss equipped with auxiliary regressors for unsupervised tuning of Affine Transformation parameters.
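
A minimal sketch of such a displacement-absorbing affine layer, assuming a per-channel scale and shift (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class AffineAdapter(nn.Module):
    """Lightweight learnable affine transform applied to raw sensor channels.
    Only `scale` and `shift` are tuned per wearing session to absorb an unknown
    on-body displacement; the tracking backbone stays frozen."""
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_channels))
        self.shift = nn.Parameter(torch.zeros(num_channels))
    def forward(self, x):                      # x: (batch, time, channels)
        return x * self.scale + self.shift

adapter = AffineAdapter(num_channels=6)
# per-session adaptation would optimize only these few parameters
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)
out = adapter(torch.randn(2, 100, 6))
```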

Towards Consistent Video Editing with Text-to-Image Diffusion Models
Zicheng Zhang Bonan Li Xuecheng Nie Congying Han Tiande Guo Luoqi Liu



Research question: Existing text-to-image diffusion models suffer from consistency and temporal-coherence problems in video editing.
Motivation: These problems arise because the modules newly added for learning temporal information cause covariate shift in the feature space, which harms editing capability.
Method: A new EI$^2$ model with a Shift-restricted Temporal Attention Module (STAM) and a Fine-coarse Frame Attention Module (FFAM). STAM replaces Layer Normalization with an instance centering layer to preserve the distribution of temporal features, and uses an attention layer with normalized mapping to transform temporal features while constraining variance shift. FFAM leverages fine-coarse spatial information of all frames to further improve temporal consistency.
Results: Experiments demonstrate the superior performance of the proposed EI$^2$ model.

Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner. Despite their low requirements of data and computation, these methods might produce results of unsatisfactory consistency with the text prompt as well as the temporal sequence, limiting their applications in the real world. In this paper, we propose to address the above issues with a novel EI$^2$ model towards Enhancing vIdeo Editing consIstency of TTI-based frameworks. Specifically, we analyze and find that the inconsistency problem is caused by newly added modules into TTI models for learning temporal information. These modules lead to covariate shift in the feature space, which harms the editing capability. Thus, we design EI$^2$ to tackle the above drawbacks with two classical modules: Shift-restricted Temporal Attention Module (STAM) and Fine-coarse Frame Attention Module (FFAM). First, through theoretical analysis, we demonstrate that covariate shift is highly related to Layer Normalization, thus STAM employs an Instance Centering layer in its place to preserve the distribution of temporal features. In addition, STAM employs an attention layer with normalized mapping to transform temporal features while constraining the variance shift. As the second part, we incorporate STAM with a novel FFAM, which efficiently leverages fine-coarse spatial information of overall frames to further enhance temporal consistency. Extensive experiments demonstrate the superiority of the proposed EI$^2$ model.
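
A minimal sketch of an instance-centering substitute for LayerNorm as described (centering over the feature dimension without variance rescaling is our assumption):

```python
import torch
import torch.nn as nn

class InstanceCentering(nn.Module):
    """Centers features per instance without rescaling their variance, unlike
    LayerNorm, which also divides by the standard deviation."""
    def forward(self, x):                       # x: (batch, frames, dim)
        return x - x.mean(dim=-1, keepdim=True)

x = torch.randn(2, 8, 64)
y = InstanceCentering()(x)
print(y.mean(dim=-1).abs().max())               # ~0: centered, variance kept
```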

Opening the Vocabulary of Egocentric Actions
Dibyadip Chatterjee Fadime Sener Shugao Ma Angela Yao



Research question: Action recognition in egocentric videos, in particular recognizing actions on novel objects in an open-vocabulary setting.
Motivation: Despite their scale, existing egocentric video datasets still suffer from two limitations: sparsity of action compositions and a closed set of interacting objects.
Method: A new open-vocabulary action recognition task. Verb and object predictions are decoupled via an object-agnostic verb encoder and a prompt-based object encoder; the prompting leverages CLIP representations to predict an open vocabulary of interacting objects.
Results: Open-vocabulary benchmarks are created on the EPIC-KITCHENS-100 and Assembly101 datasets. Whereas closed-action methods fail to generalize, the proposed method is highly effective, and the object encoder significantly outperforms existing open-vocabulary visual recognition methods at recognizing novel interacting objects.

Human actions in egocentric videos often feature hand-object interactions composed of a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations — sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic _verb encoder_ and a prompt-based _object encoder_. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.

Point Cloud Completion with Pretrained Text-to-Image Diffusion Models
Yoni Kasten Ohad Rahamim Gal Chechik



Research question: How to effectively complete incomplete point clouds collected in the real world.
Motivation: Existing completion methods rely on datasets of specific predefined objects to guide the completion of incomplete, possibly noisy point clouds, but they perform poorly on out-of-distribution (OOD) objects.
Method: SDS-Complete, which uses a pre-trained text-to-image diffusion model and leverages the text semantics of a given object's incomplete point cloud to obtain a complete surface representation.
Results: Evaluated on incomplete objects captured by real-world depth sensors and LiDAR scanners, SDS-Complete proves effective on objects that are typically absent from common datasets.

Point cloud data collected in real-world applications are often incomplete. This is because they are observed from partial viewpoints, which capture only a specific perspective or angle, or due to occlusion and low resolution. Existing completion approaches rely on datasets of specific predefined objects to guide the completion of incomplete, and possibly noisy, point clouds. However, these approaches perform poorly with Out-Of-Distribution (OOD) objects, which are either absent from the dataset or poorly represented. In recent years, the field of text-guided image generation has made significant progress, leading to major breakthroughs in text-guided shape generation. We describe an approach called SDS-Complete that uses a pre-trained text-to-image diffusion model and leverages the text semantics of a given incomplete point cloud of an object, to obtain a complete surface representation. SDS-Complete can complete a variety of objects via test-time optimization without the need for an expensive collection of 3D information. We evaluate SDS-Complete on incomplete scanned objects, captured by real-world depth sensors and LiDAR scanners, and demonstrate that it is effective in handling objects which are typically absent from common datasets.

Language-driven Scene Synthesis using Multi-conditional Diffusion Model
An Dinh Vuong Minh Nhat VU Toan Tien Nguyen Baoru Huang Dzung Nguyen Thieu Vo Anh Nguyen



Research question: Scene synthesis, in particular multi-modal synthesis that combines text prompts, human motion, and existing objects.
Motivation: Although prior studies synthesize scenes from human motion, room layouts, or spatial graphs, scene synthesis conditioned on text prompts has rarely been explored.
Method: A language-driven scene synthesis task that integrates text prompts, human motion, and existing objects. To process multiple conditions and encode them into a unified space, a multi-conditional diffusion model is proposed that explicitly predicts guiding points for the original data distribution, unlike the implicit unification used in other diffusion literature.
Results: Theoretical analysis and extensive experiments show that the method outperforms state-of-the-art benchmarks and enables natural scene-editing applications.

Scene synthesis is a challenging problem with several industrial applications. Recently, substantial efforts have been directed to synthesize the scene using human motions, room layouts, or spatial graphs as the input. However, few studies have addressed this problem from multiple modalities, especially combining text prompts. In this paper, we propose a language-driven scene synthesis task, which is a new task that integrates text prompts, human motion, and existing objects for scene synthesis. Unlike other single-condition synthesis tasks, our problem involves multiple conditions and requires a strategy for processing and encoding them into a unified space. To address the challenge, we present a multi-conditional diffusion model, which differs from the implicit unification approach of other diffusion literature by explicitly predicting the guiding points for the original data distribution. We demonstrate that our approach is theoretically supported. Extensive experimental results illustrate that our method outperforms state-of-the-art benchmarks and enables natural scene editing applications. The source code and dataset can be accessed at https://lang-scene-synth.github.io/.

SceneScape: Text-Driven Consistent Scene Generation
Rafail Fridman Amit Abecasis Yoni Kasten Tali Dekel



Research question: How to generate long-term videos of arbitrary scenes given only an input text prompt and camera poses.
Motivation: Current models can generate videos only in limited domains and lack broad applicability to diverse scenes.
Method: A novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of 3D consistency, i.e., synthesizing videos that depict geometrically plausible scenes, online test-time training encourages the predicted depth map of the current frame to be geometrically consistent with the synthesized scene.
Results: The method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles, whereas previous works are limited to narrow domains.

We present a method for text-driven perpetual view generation -- synthesizing long-term videos of various scenes given only an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically-plausible scenes, we deploy an online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a \emph{unified} mesh representation of the scene, which is progressively built along the video generation process. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.

Inner-Outer Aware Reconstruction Model for Monocular 3D Scene Reconstruction
Yu-Kun Qiu Guohao Xu Wei-Shi Zheng



Research question: Monocular 3D scene reconstruction aims to reconstruct the 3D structure of a scene from posed images.
Motivation: Existing volumetric methods directly predict the truncated signed distance function (TSDF) volume and achieve promising results. However, non-surface voxels have diverse features; in particular, voxels on the inner side of the surface differ markedly from those on the outer side because of an intrinsic gap between them. Grouping inner-surface and outer-surface voxels into the same class therefore forces the classifier to spend its capacity bridging this gap, whereas distinguishing inner-surface from outer-surface voxels is comparatively easy precisely because of that gap.
Method: The inner-outer aware reconstruction (IOAR) model, which explores a new coarse-to-fine strategy that classifies outer-surface, inner-surface, and surface voxels, and separates the occupancy branch from the TSDF branch to avoid mutual interference. Because the model better classifies surface, outer-surface, and inner-surface voxels, it predicts more precise meshes than existing methods.
Results: Experimental results on the ScanNet, ICL-NUIM, and TUM-RGBD datasets demonstrate the effectiveness and generalization of the model. Code is available at https://github.com/YorkQiu/InnerOuterAwareReconstruction.

Monocular 3D scene reconstruction aims to reconstruct the 3D structure of scenes based on posed images. Recent volumetric-based methods directly predict the truncated signed distance function (TSDF) volume and have achieved promising results. The memory cost of volumetric-based methods will grow cubically as the volume size increases, so a coarse-to-fine strategy is necessary for saving memory. Specifically, the coarse-to-fine strategy distinguishes surface voxels from non-surface voxels, and only potential surface voxels are considered in the succeeding procedure. However, the non-surface voxels have various features, and in particular, the voxels on the inner side of the surface are quite different from those on the outer side since there exists an intrinsic gap between them. Therefore, grouping inner-surface and outer-surface voxels into the same class will force the classifier to spend its capacity to bridge the gap. By contrast, it is relatively easy for the classifier to distinguish inner-surface and outer-surface voxels due to the intrinsic gap. Inspired by this, we propose the inner-outer aware reconstruction (IOAR) model. IOAR explores a new coarse-to-fine strategy to classify outer-surface, inner-surface and surface voxels. In addition, IOAR separates occupancy branches from TSDF branches to avoid mutual interference between them. Since our model can better classify the surface, outer-surface and inner-surface voxels, it can predict more precise meshes than existing methods. Experimental results on ScanNet, ICL-NUIM and TUM-RGBD datasets demonstrate the effectiveness and generalization of our model. The code is available at https://github.com/YorkQiu/InnerOuterAwareReconstruction.

Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Hao Li Jingkuan Song Lianli Gao Xiaosu Zhu Heng Tao Shen



Research question: Cross-modal retrieval predictions are often unreliable because of aleatoric uncertainty induced by low-quality data.
Motivation: Existing cross-modal retrieval methods produce unreliable predictions on low-quality data such as corrupted images, fast-paced videos, and non-detailed texts.
Method: A novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework, which constructs a set of learnable prototypes for each modality to represent the entire semantic subspace, and builds an evidential framework, using Dempster-Shafer Theory and Subjective Logic Theory, that associates evidence with the parameters of a Dirichlet distribution, yielding accurate uncertainty estimates and reliable cross-modal retrieval predictions.
Results: Extensive experiments on four major benchmarks, MSR-VTT, MSVD, DiDeMo, and MS-COCO, show that the method effectively improves the accuracy and reliability of predictions.

Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from the inherent data ambiguity. Concretely, we first construct a set of various learnable prototypes for each modality to represent the entire semantics subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet Distribution parameters. The PAU model induces accurate uncertainty and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
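
As background for the evidential step, a generic subjective-logic sketch (the standard construction $\alpha = e + 1$, $u = K / \sum_k \alpha_k$; the PAU-specific prototype machinery is omitted):

```python
import torch

def dirichlet_uncertainty(evidence):
    """Map non-negative per-class evidence to Dirichlet parameters, beliefs,
    and a scalar vacuity-style uncertainty."""
    alpha = evidence + 1.0                        # Dirichlet concentration
    strength = alpha.sum(dim=-1, keepdim=True)    # total Dirichlet mass S
    belief = evidence / strength                  # per-class belief b_k = e_k / S
    uncertainty = evidence.size(-1) / strength    # u = K / S: high when evidence is scarce
    return belief, uncertainty

evidence = torch.tensor([[4.0, 1.0, 0.0], [0.1, 0.1, 0.1]])
b, u = dirichlet_uncertainty(evidence)
print(u)   # the second row (little evidence) gets much higher uncertainty
```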

Orthogonal Non-negative Tensor Factorization based Multi-view Clustering
Jing Li Quanxue Gao QIANQIAN WANG Ming Yang Wei Xia



Research question: Existing NMF-based multi-view clustering methods factorize each view separately, ignoring between-view interactions and thus failing to exploit the within-view spatial structure and the between-view complementary information.
Motivation: To resolve this, orthogonal non-negative tensor factorization (Orth-NTF) is proposed, together with a novel multi-view clustering method based on Orth-NTF with a one-side orthogonal constraint.
Method: The model performs Orth-NTF directly on a 3rd-order tensor composed of the anchor graphs of all views, thereby modeling between-view relationships directly. Tensor Schatten $p$-norm regularization serves as a rank approximation of this tensor, characterizing the cluster structure of the multi-view data and exploiting between-view complementary information.
Results: Extensive experiments on various benchmark datasets show that the proposed method achieves satisfactory clustering performance.

Multi-view clustering (MVC) based on non-negative matrix factorization (NMF) and its variants have attracted much attention due to their advantages in clustering interpretability. However, existing NMF-based multi-view clustering methods perform NMF on each view respectively and ignore the between-view relationships. Thus, they cannot fully exploit the within-view spatial structure and between-view complementary information. To resolve this issue, we present orthogonal non-negative tensor factorization (Orth-NTF) and develop a novel multi-view clustering based on Orth-NTF with one-side orthogonal constraint. Our model directly performs Orth-NTF on the 3rd-order tensor which is composed of anchor graphs of views. Thus, our model directly considers the between-view relationship. Moreover, we use the tensor Schatten $p$-norm regularization as a rank approximation of the 3rd-order tensor which characterizes the cluster structure of multi-view data and exploits the between-view complementary information. In addition, we provide an optimization algorithm for the proposed method and prove mathematically that the algorithm always converges to the stationary KKT point. Extensive experiments on various benchmark datasets indicate that our proposed method is able to achieve satisfactory clustering performance.
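
For readers unfamiliar with the regularizer, a brief note (ours, not the paper's wording): for a matrix $X$ with singular values $\sigma_1(X) \ge \sigma_2(X) \ge \dots$, the Schatten $p$-norm is $\|X\|_{S_p} = \left(\sum_i \sigma_i(X)^p\right)^{1/p}$; with $p=1$ it reduces to the nuclear norm, and for $0 < p < 1$ the quantity $\sum_i \sigma_i(X)^p$ is a tighter surrogate of the matrix rank. In t-SVD-based multi-view clustering, this is typically applied slice-wise to the 3rd-order tensor in a transform (e.g., DFT) domain; the paper's exact slice weighting may differ.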

3D Copy-Paste: Physically Plausible Object Insertion for Monocular 3D Detection
Yunhao Ge Hong-Xing Yu Cheng Zhao Yuliang Guo Xinyu Huang Liu Ren Laurent Itti Jiajun Wu



Research question: In monocular 3D object detection, real datasets offer limited object diversity and quantity.
Motivation: Although inserting virtual objects into real scenes could increase both diversity and quantity, this remains elusive because there is no effective method for 3D object insertion in complex real captured scenes.
Method: A physically plausible method for inserting virtual objects into complex real indoor scenes to augment monocular 3D object detection. The main challenge is automatically identifying plausible physical properties of virtual assets (e.g., locations, appearances, sizes) in cluttered real scenes.
Results: Experiments show that this augmentation significantly improves existing monocular 3D object detection models and achieves state-of-the-art performance. This is the first demonstration that physically plausible 3D object insertion, serving as a generative data augmentation technique, can substantially improve discriminative downstream tasks such as monocular 3D object detection.

A major challenge in monocular 3D object detection is the limited diversity and quantity of objects in real datasets. While augmenting real scenes with virtual objects holds promise to improve both the diversity and quantity of the objects, it remains elusive due to the lack of an effective 3D object insertion method in complex real captured scenes. In this work, we study augmenting complex real indoor scenes with virtual objects for monocular 3D object detection. The main challenge is to automatically identify plausible physical properties for virtual assets (e.g., locations, appearances, sizes, etc.) in cluttered real scenes. To address this challenge, we propose a physically plausible indoor 3D object insertion approach to automatically copy virtual objects and paste them into real scenes. The resulting objects in scenes have 3D bounding boxes with plausible physical locations and appearances. In particular, our method first identifies physically feasible locations and poses for the inserted objects to prevent collisions with the existing room layout. Subsequently, it estimates spatially-varying illumination for the insertion location, enabling the immersive blending of the virtual objects into the original scene with plausible appearances and cast shadows. We show that our augmentation method significantly improves existing monocular 3D object detection models and achieves state-of-the-art performance. For the first time, we demonstrate that a physically plausible 3D object insertion, serving as a generative data augmentation technique, can lead to significant improvements for discriminative downstream tasks such as monocular 3D object detection. Code: https://github.com/gyhandy/3D-Copy-Paste.

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
Zhuoyan Luo Yicheng Xiao Yong Liu Shuyan Li Yitong Wang Yansong Tang Xiu Li Yujiu Yang



Research question: In referring video object segmentation (RVOS), the lack of a global view of video content prevents effective use of inter-frame relationships and understanding of textual descriptions of object temporal variations.
Motivation: Current RVOS methods model the task as sequence prediction and perform multi-modal interaction and segmentation for each frame separately, ignoring the global video content and struggling with descriptions of temporal changes.
Method: Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps; multi-modal contrastive supervision further helps construct a well-aligned joint space.
Results: Extensive experiments on popular RVOS benchmarks show that the method outperforms state-of-the-art competitors on all benchmarks and markedly improves segmentation stability and adaptability.

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC.

NeuroGF: A Neural Representation for Fast Geodesic Distance and Path Queries
Qijian Zhang Junhui Hou Yohanes Yudhi Adikusuma Wenping Wang Ying He



Research question: Traditional algorithms for computing geodesics on 3D mesh models are inefficient, making them impractical for scenarios that require extensive querying of arbitrary point-to-point geodesics.
Motivation: Although deep implicit functions are popular for 3D geometry representation, neural implicit representations of geodesics have not yet been studied.
Method: A first attempt to represent geodesics with an implicit learning framework: the neural geodesic field (NeuroGF), which learns to encode the all-pairs geodesics of a given 3D mesh model.
Results: Evaluations on common 3D object models and real-captured scene-level meshes show excellent representation accuracy and query efficiency. NeuroGF also offers a convenient way to jointly encode 3D geometry and geodesics in a unified representation.

Geodesics play a critical role in many geometry processing applications. Traditional algorithms for computing geodesics on 3D mesh models are often inefficient and slow, which makes them impractical for scenarios requiring extensive querying of arbitrary point-to-point geodesics. Recently, deep implicit functions have gained popularity for 3D geometry representation, yet there is still no research on neural implicit representation of geodesics. To bridge this gap, we make the first attempt to represent geodesics using implicit learning frameworks. Specifically, we propose the neural geodesic field (NeuroGF), which learns to encode the all-pairs geodesics of a given 3D mesh model, enabling efficient and accurate queries of arbitrary point-to-point geodesic distances and paths. Evaluations on common 3D object models and real-captured scene-level meshes demonstrate exceptional performance in terms of representation accuracy and querying efficiency. Besides, NeuroGF also provides a convenient way of jointly encoding both 3D geometry and geodesics in a unified representation. Moreover, the working mode of per-model overfitting is further extended to generalizable learning frameworks that can work on various input formats such as unstructured point clouds, which also show satisfactory performance for unseen shapes and categories. Our code and data are available at https://github.com/keeganhk/NeuroGF.
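
A toy sketch of what a neural geodesic field's interface might look like (layer sizes and the symmetrized encoding are our assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class NeuroGFSketch(nn.Module):
    """Toy neural geodesic field: an MLP overfit to one mesh that maps a
    source/target point pair to an approximate geodesic distance."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())    # distances are non-negative
    def forward(self, p, q):                        # p, q: (N, 3) surface points
        d_pq = self.mlp(torch.cat([p, q], dim=-1))
        d_qp = self.mlp(torch.cat([q, p], dim=-1))
        return 0.5 * (d_pq + d_qp).squeeze(-1)      # symmetrize: d(p,q) == d(q,p)

field = NeuroGFSketch()
dist = field(torch.rand(1024, 3), torch.rand(1024, 3))  # 1024 queries in one batch
```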

NeRF Revisited: Fixing Quadrature Instability in Volume Rendering
Mikaela Angelina Uy Kiyohiro Nakayama Guandao Yang Rahul Krishna Thomas Leonidas Guibas Ke Li



Research question: Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views, but the numerical approximation of the rendering integral causes quadrature instability.
Motivation: To resolve issues in existing NeRF methods: conflicts between samples along different rays, imprecise hierarchical sampling, and non-differentiability of the quantiles of ray termination distances with respect to model parameters.
Method: Reformulate the sample-based rendering equation so that it corresponds to the exact integral under a piecewise linear volume density, which resolves all of these issues simultaneously.
Results: The method outperforms the classical sample-based rendering equation in texture sharpness, geometric reconstruction, and depth supervision, and can serve as a drop-in replacement in existing NeRF-based methods.

Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views. Volume rendering requires evaluating an integral along each ray, which is numerically approximated with a finite sum that corresponds to the exact integral along the ray under piecewise constant volume density. As a consequence, the rendered result is unstable w.r.t. the choice of samples along the ray, a phenomenon that we dub quadrature instability. We propose a mathematically principled solution by reformulating the sample-based rendering equation so that it corresponds to the exact integral under piecewise linear volume density. This simultaneously resolves multiple issues: conflicts between samples along different rays, imprecise hierarchical sampling, and non-differentiability of quantiles of ray termination distances w.r.t. model parameters. We demonstrate several benefits over the classical sample-based rendering equation, such as sharper textures, better geometric reconstruction, and stronger depth supervision. Our proposed formulation can also be used as a drop-in replacement to the volume rendering equation of existing NeRF-based methods. Our project page can be found at pl-nerf.github.io.
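
To make the reformulation concrete (a sketch of its effect, not the paper's full derivation): with ray samples $t_1 < \dots < t_N$, densities $\sigma_i$, and spacings $\delta_i = t_{i+1} - t_i$, the classical piecewise-constant assumption yields transmittance $T_i = \exp(-\sum_{j<i} \sigma_j \delta_j)$, which is sensitive to exactly where the samples fall. Exactly integrating a piecewise-linear density instead replaces each term with its trapezoid, $T_i = \exp\left(-\sum_{j<i} \tfrac{\sigma_j + \sigma_{j+1}}{2} \delta_j\right)$, so the accumulated opacity varies smoothly with the sample locations; the paper additionally reworks the per-interval opacity and the quantile computations under this model.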

Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes
Yihong Sun Bharath Hariharan



Research question: Current monocular depth estimation techniques perform poorly in dynamical scenes, because apparent object motion can be explained equally well by hypothesizing the object's independent motion or by altering its depth.
Motivation: To resolve this ambiguity, Dynamo-Depth disambiguates dynamical motion by jointly learning monocular depth, a 3D independent flow field, and motion segmentation.
Method: The key insight is that, despite the fundamental underlying ambiguity, a good initial estimate of motion segmentation is sufficient for jointly learning depth and independent motion.
Results: The method achieves state-of-the-art monocular depth estimation on the Waymo Open and nuScenes datasets, with significant improvement in the depth of moving objects.

Unsupervised monocular depth estimation techniques have demonstrated encouraging results but typically assume that the scene is static. These techniques suffer when trained on dynamical scenes, where apparent object motion can equally be explained by hypothesizing the object's independent motion, or by altering its depth. This ambiguity causes depth estimators to predict erroneous depth for moving objects. To resolve this issue, we introduce Dynamo-Depth, a unifying approach that disambiguates dynamical motion by jointly learning monocular depth, 3D independent flow field, and motion segmentation from unlabeled monocular videos. Specifically, we offer our key insight that a good initial estimation of motion segmentation is sufficient for jointly learning depth and independent motion despite the fundamental underlying ambiguity. Our proposed method achieves state-of-the-art performance on monocular depth estimation on the Waymo Open and nuScenes datasets with significant improvement in the depth of moving objects. Code and additional results are available at https://dynamo-depth.github.io.

E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning
Xiuhong Lin Changjie Qiu zhipeng cai Siqi Shen Yu Zang Weiquan Liu Xuesheng Bian Matthias Müller Cheng Wang



Research question: 2D-3D registration for event cameras, which have emerged as promising vision sensors in recent years thanks to their unparalleled temporal resolution and dynamic range.
Motivation: Although registering 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work has studied it for event cameras; hence E2PNet, the first learning-based method for event-to-point cloud registration.
Method: The core of E2PNet is Event-Points-to-Tensor (EP2T), a feature representation network that encodes event data into a 2D grid-shaped feature tensor, enabling mature RGB-based frameworks to be used for event-to-point cloud registration without changing hyper-parameters or the training procedure. EP2T treats the event input as a spatio-temporal point cloud; unlike standard 3D learning architectures that treat all dimensions equally, its novel sampling and information-aggregation modules handle the inhomogeneity of the spatial and temporal dimensions.
Results: Experiments on the MVSEC and VECtor datasets show that E2PNet outperforms hand-crafted and other learning-based methods. Compared to RGB-based registration, E2PNet is more robust to extreme illumination and fast motion thanks to the event data. Beyond 2D-3D registration, EP2T also shows potential for other vision tasks such as flow estimation, event-to-image reconstruction, and object recognition.

Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, we propose E2PNet, the first learning-based method for event-to-point cloud registration. The core of E2PNet is a novel feature representation network called Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor. This grid-shaped feature enables matured RGB-based frameworks to be easily used for event-to-point cloud registration, without changing hyper-parameters and the training procedure. EP2T treats the event input as spatio-temporal point clouds. Unlike standard 3D learning architectures that treat all dimensions of point clouds equally, the novel sampling and information aggregation modules in EP2T are designed to handle the inhomogeneity of the spatial and temporal dimensions. Experiments on the MVSEC and VECtor datasets demonstrate the superiority of E2PNet over hand-crafted and other learning-based methods. Compared to RGB-based registration, E2PNet is more robust to extreme illumination or fast motion due to the use of event data. Beyond 2D-3D registration, we also show the potential of EP2T for other vision tasks such as flow estimation, event-to-image reconstruction and object recognition. The source code can be found at: https://github.com/Xmu-qcj/E2PNet.

GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks
Zhonghang Li Lianghao Xia Yong Xu Chao Huang



Research question: Spatio-temporal prediction has developed rapidly with growing demands in traffic management and travel planning, yet integrating and extending advanced end-to-end models poses significant challenges.
Motivation: To address these challenges with a spatio-temporal pre-training framework that integrates seamlessly with downstream baselines and boosts their performance.
Method: The framework rests on two key designs: (i) a spatio-temporal masked autoencoder as the pre-training model, incorporating customized parameter learners and hierarchical spatial pattern encoding networks to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which existing approaches often neglect; (ii) an adaptive mask strategy that guides the masked autoencoder to learn robust spatio-temporal representations and to model relationships from intra-cluster to inter-cluster in an easy-to-hard fashion.
Results: Extensive experiments on representative benchmarks demonstrate the effectiveness of the method. The implementation is publicly available at https://github.com/HKUDS/GPT-ST.

In recent years, there has been a rapid development of spatio-temporal prediction techniques in response to the increasing demands of traffic management and travel planning. While advanced end-to-end models have achieved notable success in improving predictive performance, their integration and expansion pose significant challenges. This work aims to address these challenges by introducing a spatio-temporal pre-training framework that seamlessly integrates with downstream baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. The model incorporates customized parameter learners and hierarchical spatial pattern encoding networks. These modules are specifically designed to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which have often been neglected in existing approaches. (ii) We introduce an adaptive mask strategy as part of the pre-training mechanism. This strategy guides the mask autoencoder in learning robust spatio-temporal representations and facilitates the modeling of different relationships, ranging from intra-cluster to inter-cluster, in an easy-to-hard training manner. Extensive experiments conducted on representative benchmarks demonstrate the effectiveness of our proposed method. We have made our model implementation publicly available at https://github.com/HKUDS/GPT-ST.

Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos
Matthew Chang Aditya Prakash Saurabh Gupta



Research question: How to effectively exploit egocentric videos for robotics tasks while handling occlusion and the visual mismatch between human hands and robot end-effectors.
Motivation: Past work treats the human hand as a nuisance and removes it from the scene, yet the hand also provides a valuable learning signal.
Method: A method for extracting a factored scene representation that separates the agent (human hand) from the environment, alleviating occlusion and mismatch while preserving the signal, thereby simplifying model design for downstream robotics tasks. At the heart of this factorization is the Video Inpainting via Diffusion Model (VIDM), which leverages a prior on real-world images (through a large-scale pre-trained diffusion model) and the object's appearance in earlier frames of the video (through attention).
Results: Experiments show that VIDM improves inpainting quality in egocentric videos and that the factored representation benefits numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.

The analysis and use of egocentric videos for robotics tasks is made challenging by occlusion and the visual mismatch between the human hand and a robot end-effector. Past work views the human hand as a nuisance and removes it from the scene. However, the hand also provides a valuable signal for learning. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving the in-painting quality in egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.

Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
Tianhang Cheng Wei-Chiu Ma Kaiyu Guan Antonio Torralba Shenlong Wang



Research question: How to reconstruct an object's geometry, material, and illumination from a single image.
Motivation: The real world is full of identical objects, which, when seen together, provide strong cues for effective 3D reasoning.
Method: Structure from Duplicates (SfD), a novel inverse graphics framework that identifies multiple identical object instances within an image, jointly estimates their 6-DoF poses, and then employs an inverse graphics pipeline to jointly reason about shape, material, and environment light while enforcing shared geometry and material constraints across instances.
Results: Object duplicates serve as a robust prior for single-image inverse graphics, complemented by an in-plane rotation-robust Structure-from-Motion (SfM) formulation for joint 6-DoF pose estimation. By exploiting multi-view cues within a single image, SfD produces more realistic and detailed 3D reconstructions, significantly outperforming existing single-image reconstruction models and multi-view reconstruction methods.

Our world is full of identical objects (\emph{e.g.}, cans of coke, cars of same model). These duplicates, when seen together, provide additional and strong cues for us to effectively reason about 3D. Inspired by this observation, we introduce Structure from Duplicates (SfD), a novel inverse graphics framework that reconstructs geometry, material, and illumination from a single image containing multiple identical objects. SfD begins by identifying multiple instances of an object within an image, and then jointly estimates the 6DoF pose for all instances. An inverse graphics pipeline is subsequently employed to jointly reason about the shape, material of the object, and the environment light, while adhering to the shared geometry and material constraint across instances. Our primary contributions involve utilizing object duplicates as a robust prior for single-image inverse graphics and proposing an in-plane rotation-robust Structure from Motion (SfM) formulation for joint 6-DoF object pose estimation. By leveraging multi-view cues from a single image, SfD generates more realistic and detailed 3D reconstructions, significantly outperforming existing single image reconstruction models and multi-view reconstruction approaches with a similar or greater number of observations.

Deep Non-line-of-sight Imaging from Under-scanning Measurements
Yue Li Yueyi Zhang Juntian Ye Feihu Xu Zhiwei Xiong



Research question: How to reconstruct satisfactory results from sparse (under-scanning) measurements.
Motivation: Existing traditional algorithms either produce unsatisfactory reconstructions or require long computing times, so a more effective approach is needed.
Method: A deep-learning-based non-line-of-sight imaging method composed of two main components: a transient recovery network (TRN) and a volume reconstruction network (VRN).
Results: The method achieves superior performance on both synthetic data and public real-world data, remains impressively robust at an extremely low scanning grid (8×8), and offers high-speed inference (50 times faster than the existing iterative solution).

Active confocal non-line-of-sight (NLOS) imaging has successfully enabled seeing around corners relying on high-quality transient measurements. However, acquiring spatial-dense transient measurement is time-consuming, raising the question of how to reconstruct satisfactory results from under-scanning measurements (USM). The existing solutions, involving the traditional algorithms, however, are hindered by unsatisfactory results or long computing times. To this end, we propose the first deep-learning-based approach to NLOS imaging from USM. Our proposed end-to-end network is composed of two main components: the transient recovery network (TRN) and the volume reconstruction network (VRN). Specifically, TRN takes the under-scanning measurements as input, utilizes a multiple kernel feature extraction module and a multiple feature fusion module, and outputs sufficient-scanning measurements at the high-spatial resolution. Afterwards, VRN incorporates the linear physics prior of the light-path transport model and reconstructs the hidden volume representation. Besides, we introduce regularized constraints that enhance the perception of more local details while suppressing smoothing effects. The proposed method achieves superior performance on both synthetic data and public real-world data, as demonstrated by extensive experimental results with different under-scanning grids. Moreover, the proposed method delivers impressive robustness at an extremely low scanning grid (i.e., 8$\times$8) and offers high-speed inference (i.e., 50 times faster than the existing iterative solution).

HASSOD: Hierarchical Adaptive Self-Supervised Object Detection
Shengcao Cao Dhiraj Joshi Liangyan Gui Yu-Xiong Wang



Research question: How to detect objects and understand their compositions without human supervision.
Motivation: Inspired by the human visual system's ability to learn without explicit supervision and to understand the part-to-whole composition of objects.
Method: Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), which groups regions into object masks via a hierarchical adaptive clustering strategy over self-supervised visual representations, adaptively determining the number of objects per image, and identifies the hierarchical levels of objects by analyzing coverage relations between masks and constructing tree structures.
Results: Extensive experiments on widely used image datasets show that HASSOD surpasses existing methods, advancing self-supervised object detection. HASSOD improves Mask AR from 20.2 to 22.5 on LVIS and from 17.0 to 26.0 on SA-1B.

The human visual perception system demonstrates exceptional capabilities in learning without explicit supervision and understanding the part-to-whole composition of objects. Drawing inspiration from these two abilities, we propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a novel approach that learns to detect objects and understand their compositions without human supervision. HASSOD employs a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations, adaptively determining the number of objects per image. Furthermore, HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures. This additional self-supervised learning task leads to improved detection performance and enhanced interpretability. Lastly, we abandon the inefficient multi-round self-training process utilized in prior methods and instead adapt the Mean Teacher framework from semi-supervised learning, which leads to a smoother and more efficient training process. Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B. Code is available at https://github.com/Shengcao-Cao/HASSOD.
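
A rough sketch of adaptive clustering of self-supervised patch features into per-image masks (scikit-learn, cosine distance, and the threshold value are our illustrative choices, not the authors' pipeline):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2

def adaptive_masks(patch_features, grid_h, grid_w, threshold=0.15):
    """Group self-supervised patch features into region masks. Using a distance
    threshold instead of a fixed cluster count lets the number of objects adapt
    per image, in the spirit of hierarchical adaptive clustering."""
    clusterer = AgglomerativeClustering(n_clusters=None,
                                        distance_threshold=threshold,
                                        metric="cosine", linkage="average")
    labels = clusterer.fit_predict(patch_features)    # (grid_h * grid_w,)
    return labels.reshape(grid_h, grid_w)             # one region label per patch

masks = adaptive_masks(np.random.rand(196, 768), 14, 14)  # e.g., ViT 14x14 patches
```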

CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
Anthony Fuller Koreen Millard James R Green



Research question: How to perform effective self-supervised learning on vast, sparsely labeled multimodal remote sensing data.
Motivation: Remote sensing offers spatially aligned data with rich modalities but sparse labels, calling for self-supervised methods that can exploit such data.
Method: The CROMA framework combines contrastive and reconstruction objectives: masked multispectral optical and synthetic aperture radar samples are encoded separately with cross-modal contrastive learning, and a lightweight decoder predicts the masked patches.
Results: Experiments show that CROMA outperforms the current state-of-the-art multispectral model on a range of remote sensing tasks, including classification and segmentation benchmarks, and its representations can be leveraged broadly across remote sensing applications.

A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples—aligned in space and time—and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially biases our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to $17.6\times$ larger at test-time. CROMA outperforms the current SoTA multispectral model, evaluated on: four classification benchmarks—finetuning (avg.$\uparrow$ 1.8%), linear (avg.$\uparrow$ 2.4%) and nonlinear (avg.$\uparrow$ 1.4%) probing, $k$NN classification (avg.$\uparrow$ 3.5%), and $K$-means clustering (avg.$\uparrow$ 8.4%); and three segmentation benchmarks (avg.$\uparrow$ 6.4%). CROMA’s rich, optionally multimodal representations can be widely leveraged across remote sensing applications.
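
A sketch of a 2D distance-based attention bias in the spirit of 2D-ALiBi (the per-head slope schedule is borrowed from the original ALiBi and is our assumption; the X-ALiBi cross-attention variant is omitted):

```python
import torch

def alibi_2d_bias(grid_h, grid_w, num_heads):
    """Additive attention bias for a 2D patch grid: each head gets a slope, and
    the bias decays linearly with Euclidean distance between patch positions."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w),
                            indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()   # (N, 2)
    dist = torch.cdist(pos, pos)                                     # (N, N)
    slopes = 2.0 ** (-8.0 * torch.arange(1, num_heads + 1) / num_heads)
    return -slopes.view(-1, 1, 1) * dist                             # (heads, N, N)

bias = alibi_2d_bias(4, 4, num_heads=8)  # added to attention logits before softmax
```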

DELIFFAS: Deformable Light Fields for Fast Avatar Synthesis
YoungJoong Kwon Lingjie Liu Henry Fuchs Marc Habermann Christian Theobalt



Research question: How to generate controllable and photorealistic digital human avatars.
Motivation: Existing methods make notable progress in either photorealism or inference speed, but achieving both properties simultaneously remains unsolved.
Method: A novel method called DELIFFAS, which parameterizes human appearance as a surface light field attached to a controllable, deforming human mesh model.
Results: The carefully designed human representation and supervision strategy yields state-of-the-art synthesis results and inference time; video results and code are available on the project page.

Generating controllable and photorealistic digital human avatars is a long-standing and important problem in Vision and Graphics. Recent methods have shown great progress in terms of either photorealism or inference speed while the combination of the two desired properties still remains unsolved. To this end, we propose a novel method, called DELIFFAS, which parameterizes the appearance of the human as a surface light field that is attached to a controllable and deforming human mesh model. At the core, we represent the light field around the human with a deformable two-surface parameterization, which enables fast and accurate inference of the human appearance. This allows perceptual supervision on the full image compared to previous approaches that could only supervise individual pixels or small patches due to their slow runtime. Our carefully designed human representation and supervision strategy leads to state-of-the-art synthesis results and inference time. The video results and code are available at https://vcai.mpi-inf.mpg.de/projects/DELIFFAS.

Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements
Gaurav Shrivastava Ser-Nam Lim Abhinav Shrivastava



Research question: A novel robust framework for low-level vision tasks such as denoising, object removal, frame interpolation, and super-resolution, without any external training data.
Motivation: Current methods depend on large external training corpora, whereas the proposed approach learns the weights of neural modules directly by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of videos.
Method: A new spatial pyramid loss that exploits spatio-temporal patch recurrence across the different scales of a video, enhancing robustness to unstructured noise and to degradation of input frames.
Results: Qualitative and quantitative evaluations on standard video datasets such as DAVIS, UCF-101, and VIMEO90K-T show state-of-the-art results on downstream tasks including denoising, object removal, and frame interpolation.

In this paper, we present a novel robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that does not require any external training data corpus. Our proposed approach directly learns the weights of neural modules by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of videos. Furthermore, we introduce a novel spatial pyramid loss that leverages the property of spatio-temporal patch recurrence in a video across the different scales of the video. This loss enhances robustness to unstructured noise in both the spatial and temporal domains. This further results in our framework being highly robust to degradation in input frames and yields state-of-the-art results on downstream tasks such as denoising, object removal, and frame interpolation. To validate the effectiveness of our approach, we conduct qualitative and quantitative evaluations on standard video datasets such as DAVIS, UCF-101, and VIMEO90K-T.
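
A generic sketch of a spatial pyramid reconstruction loss (uniform level weighting and L1 per level are our assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_loss(pred, target, num_levels=4):
    """Sum a reconstruction loss over a spatial pyramid of both frames, so that
    patch statistics are matched across scales."""
    loss = 0.0
    for level in range(num_levels):
        loss = loss + F.l1_loss(pred, target)
        if level < num_levels - 1:               # downsample for the next level
            pred = F.avg_pool2d(pred, kernel_size=2)
            target = F.avg_pool2d(target, kernel_size=2)
    return loss / num_levels

loss = spatial_pyramid_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```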

Coupled Reconstruction of Cortical Surfaces by Diffeomorphic Mesh Deformation
Hao Zheng Hongming Li Yong Fan



Research question: How to accurately reconstruct cortical surfaces from brain magnetic resonance images.
Motivation: Accurate cortical surface reconstruction from brain MRIs remains challenging due to the notorious partial volume effect and the cerebral cortex's thin, highly folded patterns.
Method: A new deep learning framework that jointly reconstructs the inner (white matter) and outer (pial) cortical surfaces together with the in-between (midthickness) surface, and estimates cortical thickness directly from 3D MRIs.
Results: Evaluated on two large-scale neuroimaging datasets, ADNI and OASIS, the method achieves state-of-the-art cortical surface reconstruction in accuracy, surface regularity, and computational efficiency.

Accurate reconstruction of cortical surfaces from brain magnetic resonance images (MRIs) remains a challenging task due to the notorious partial volume effect in brain MRIs and the cerebral cortex's thin and highly folded patterns. Although many promising deep learning-based cortical surface reconstruction methods have been developed, they typically fail to model the interdependence between inner (white matter) and outer (pial) cortical surfaces, which can help generate cortical surfaces with spherical topology. To robustly reconstruct the cortical surfaces with topological correctness, we develop a new deep learning framework to jointly reconstruct the inner, outer, and their in-between (midthickness) surfaces and estimate cortical thickness directly from 3D MRIs. Our method first estimates the midthickness surface and then learns three diffeomorphic flows jointly to optimize the midthickness surface and deform it inward and outward to the inner and outer cortical surfaces respectively, regularized by topological correctness. Our method also outputs a cortex thickness value for each surface vertex, estimated from its diffeomorphic deformation trajectory. Our method has been evaluated on two large-scale neuroimaging datasets, including ADNI and OASIS, achieving state-of-the-art cortical surface reconstruction performance in terms of accuracy, surface regularity, and computation efficiency.
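
A minimal sketch of diffeomorphic-style mesh deformation by integrating a learned velocity field (plain Euler steps on a toy stationary field; the authors' solver, image conditioning, and topology regularization are omitted):

```python
import torch
import torch.nn as nn

def integrate_flow(vertices, velocity_net, num_steps=10, step_size=0.1):
    """Deform vertices by integrating a velocity field; many small steps keep
    the resulting map close to diffeomorphic (no self-intersections)."""
    v = vertices
    for _ in range(num_steps):
        v = v + step_size * velocity_net(v)      # Euler step along the flow
    return v

# toy velocity field over 3D coordinates; a real model conditions on the MRI
velocity_net = nn.Sequential(nn.Linear(3, 128), nn.Tanh(), nn.Linear(128, 3))
deformed = integrate_flow(torch.rand(5000, 3), velocity_net)
```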

NVFi: Neural Velocity Fields for 3D Physics Learning from Dynamic Videos
Jinxi Li Ziyang Song Bo Yang



Research question: Modeling 3D scene dynamics from multi-view videos.
Motivation: Unlike most existing works that focus on novel view synthesis within the training time period, the goal is to learn the geometry, appearance, and physical velocity of 3D scenes from video frames alone, so as to support applications such as future frame extrapolation, unsupervised 3D semantic scene decomposition, and dynamic motion transfer.
Method: Three major components: 1) a keyframe dynamic radiance field; 2) an interframe velocity field; and 3) a joint keyframe and interframe optimization module, the core of the framework, which effectively trains both networks.
Results: Two dynamic 3D datasets are introduced for validation: a Dynamic Object dataset and a Dynamic Indoor Scene dataset. Extensive experiments on multiple datasets show superior performance over all baselines, particularly on the key tasks of future frame extrapolation and unsupervised 3D semantic scene decomposition.

In this paper, we aim to model 3D scene dynamics from multi-view videos. Unlike the majority of existing works which usually focus on the common task of novel view synthesis within the training time period, we propose to simultaneously learn the geometry, appearance, and physical velocity of 3D scenes only from video frames, such that multiple desirable applications can be supported, including future frame extrapolation, unsupervised 3D semantic scene decomposition, and dynamic motion transfer. Our method consists of three major components, 1) the keyframe dynamic radiance field, 2) the interframe velocity field, and 3) a joint keyframe and interframe optimization module which is the core of our framework to effectively train both networks. To validate our method, we further introduce two dynamic 3D datasets: 1) Dynamic Object dataset, and 2) Dynamic Indoor Scene dataset. We conduct extensive experiments on multiple datasets, demonstrating the superior performance of our method over all baselines, particularly in the critical tasks of future frame extrapolation and unsupervised 3D semantic scene decomposition.

Tame a Wild Camera: In-the-Wild Monocular Camera Calibration
Shengjie Zhu Abhinav Kumar Masa Hu Xiaoming Liu



Research question: 3D sensing for monocular in-the-wild images, such as depth estimation and 3D object detection, is increasingly important, but unknown intrinsic parameters hinder its development and deployment.
Motivation: Previous monocular camera calibration methods rely on specific 3D objects or strong geometric priors, such as checkerboards or the Manhattan World assumption; this work instead calibrates the intrinsics by exploiting monocular 3D priors.
Method: Given an undistorted image, the method calibrates the complete 4 degree-of-freedom (DoF) intrinsic parameters. The intrinsics are first shown to be determined by two well-studied monocular priors, the depth map and the surface normal map, but this route requires a low-bias, low-variance depth estimate. Alternatively, an incidence field is introduced, defined as the incidence rays between points in 3D space and pixels on the 2D imaging plane.
Results: It is shown that 1) the incidence field is a pixel-wise parametrization of the intrinsics that is invariant to image cropping and resizing; 2) the incidence field is a learnable monocular 3D prior, determined pixel-wise by the up-to-scale monocular depth map and surface normals; 3) with the estimated incidence field, a robust RANSAC algorithm recovers the intrinsics. Effectiveness is demonstrated on synthetic and zero-shot testing datasets, as well as in downstream applications including image manipulation detection and restoration, uncalibrated two-view pose estimation, and 3D sensing.

3D sensing for monocular in-the-wild images, e.g., depth estimation and 3D object detection, has become increasingly important. However, the unknown intrinsic parameters hinder their development and deployment. Previous methods for monocular camera calibration rely on specific 3D objects or strong geometry priors, such as using a checkerboard or imposing a Manhattan World assumption. This work instead calibrates the intrinsics by exploiting monocular 3D priors. Given an undistorted image as input, our method calibrates the complete 4 Degree-of-Freedom (DoF) intrinsic parameters. First, we show that the intrinsics are determined by two well-studied monocular priors: the monocular depthmap and the surface normal map. However, this solution necessitates a low-bias and low-variance depth estimation. Alternatively, we introduce the incidence field, defined as the incidence rays between points in 3D space and pixels in the 2D imaging plane. We show that: 1) The incidence field is a pixel-wise parametrization of the intrinsics, invariant to image cropping and resizing. 2) The incidence field is a learnable monocular 3D prior, determined pixel-wise by the up-to-scale monocular depthmap and surface normals. With the estimated incidence field, a robust RANSAC algorithm recovers the intrinsics. We show the effectiveness of our method through superior performance on synthetic and zero-shot testing datasets. Beyond calibration, we demonstrate downstream applications in image manipulation detection \& restoration, uncalibrated two-view pose estimation, and 3D sensing.
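
To see why an incidence field pins down the 4-DoF intrinsics, note that each pixel $(u, v)$ observes a ray $d \propto K^{-1}(u, v, 1)^{\top}$, so $u = f_x \, (d_x/d_z) + c_x$ and $v = f_y \, (d_y/d_z) + c_y$: two independent linear fits. A minimal non-robust sketch of this recovery, under our own derivation (the paper wraps a solver of this kind in RANSAC):

```python
import numpy as np

def intrinsics_from_incidence(pixels, rays):
    """Recover (fx, fy, cx, cy) from pixel coordinates and their incidence
    rays via two 1D least-squares fits."""
    x, y = rays[:, 0] / rays[:, 2], rays[:, 1] / rays[:, 2]
    Ax = np.stack([x, np.ones_like(x)], axis=1)
    Ay = np.stack([y, np.ones_like(y)], axis=1)
    (fx, cx), _, _, _ = np.linalg.lstsq(Ax, pixels[:, 0], rcond=None)
    (fy, cy), _, _, _ = np.linalg.lstsq(Ay, pixels[:, 1], rcond=None)
    return fx, fy, cx, cy

# synthetic check: build rays from known intrinsics, then recover them
fx, fy, cx, cy = 500.0, 480.0, 320.0, 240.0
uv = np.random.rand(200, 2) * [640, 480]
rays = np.stack([(uv[:, 0] - cx) / fx, (uv[:, 1] - cy) / fy, np.ones(200)], axis=1)
print(intrinsics_from_incidence(uv, rays))  # ~ (500, 480, 320, 240)
```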

Described Object Detection: Liberating Object Detection with Flexible Expressions
Chi Xie Zhao Zhang Yixuan Wu Feng Zhu Rui Zhao Shuang Liang



Research question: To extend object detection beyond Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC) to Described Object Detection (DOD), a more practical setting.
Motivation: Current detection tasks fall short: OVD cannot handle flexible language expressions, and REC is limited to grounding a pre-existing object; hence the more practical DOD task.
Method: A research foundation for DOD is established by constructing the Description Detection Dataset ($D^3$), which features flexible language expressions, from short category names to long descriptions, and annotates all described objects in all images without omission. Existing SOTA methods are evaluated on it, revealing and addressing the failure modes of REC, OVD, and bi-functional methods on DOD.
Results: Building on these findings, a baseline is proposed that substantially improves REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming existing methods. Data and code are available at https://github.com/shikras/d-cube, and related works are tracked at https://github.com/Charles-Xie/awesome-described-object-detection.

Detecting objects based on language information is a popular task that includes Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC). In this paper, we advance them to a more practical setting called *Described Object Detection* (DOD) by expanding category names to flexible language expressions for OVD and overcoming the limitation of REC only grounding the pre-existing object. We establish the research foundation for DOD by constructing a *Description Detection Dataset* ($D^3$). This dataset features flexible language expressions, whether short category names or long descriptions, and annotating all described objects on all images without omission. By evaluating previous SOTA methods on $D^3$, we find some troublemakers that fail current REC, OVD, and bi-functional methods. REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios, while OVD methods face constraints with long and complex descriptions. Recent bi-functional methods also do not work well on DOD due to their separated training procedures and inference strategies for REC and OVD tasks. Building upon the aforementioned findings, we propose a baseline that largely improves REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming existing methods. Data and code are available at https://github.com/shikras/d-cube and related works are tracked in https://github.com/Charles-Xie/awesome-described-object-detection.

CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs
Guangyao Zhai Evin Pinar Örnek Shun-Cheng Wu Yan Di Federico Tombari Nassir Navab Benjamin Busam



Research question: To address the limitations of existing scene-graph-driven generation methods, which overlook scene-object and object-object relationships and thus produce inconsistent results.
Motivation: To improve the consistency, quality, and diversity of scene generation from scene graphs with a fully generative model, CommonScenes.
Method: CommonScenes generates with two branches: one predicts the overall scene layout via a variational auto-encoder, and the other generates compatible shapes via latent diffusion, preserving shape diversity.
Results: Experiments show that CommonScenes outperforms other methods in generation consistency, quality, and diversity.

Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to lacking a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Codes and the dataset will be released upon acceptance.

RH-BrainFS: Regional Heterogeneous Multimodal Brain Networks Fusion Strategy
Hongting Ye Yalu Zheng Yueying Li Ke Zhang Youyong Kong Yonggui Yuan



Research question: To address the regional heterogeneity between structural connectivity (SC) and functional connectivity (FC) in multimodal brain network research, and the inefficiency of existing methods that fuse the modalities via "simple patterns".
Motivation: Existing multimodal brain network studies focus on the SC and FC modalities, but their relationship is complex and their coupling at the regional level is heterogeneous; previous studies have neglected this modal regional heterogeneity and fused representations via "simple patterns", which hurts overall model performance.
Method: A novel Regional Heterogeneous multimodal Brain networks Fusion Strategy (RH-BrainFS): a brain subgraph networks module extracts regional characteristics of brain networks, and a new Transformer-based fusion bottleneck module alleviates the regional heterogeneity between SC and FC.
Results: Experiments show that the method outperforms several state-of-the-art methods across a variety of neuroscience tasks.

Multimodal fusion has become an important research technique in neuroscience that completes downstream tasks by extracting complementary information from multiple modalities. Existing multimodal research on brain networks mainly focuses on two modalities, structural connectivity (SC) and functional connectivity (FC). Recently, extensive literature has shown that the relationship between SC and FC is complex and not a simple one-to-one mapping. The coupling of structure and function at the regional level is heterogeneous. However, all previous studies have neglected the modal regional heterogeneity between SC and FC and fused their representations via "simple patterns", which are inefficient ways of multimodal fusion and affect the overall performance of the model. In this paper, to alleviate the issue of regional heterogeneity of multimodal brain networks, we propose a novel Regional Heterogeneous multimodal Brain networks Fusion Strategy (RH-BrainFS). Briefly, we introduce a brain subgraph networks module to extract regional characteristics of brain networks, and further use a new transformer-based fusion bottleneck module to alleviate the issue of regional heterogeneity between SC and FC. To the best of our knowledge, this is the first paper to explicitly state the issue of structural-functional modal regional heterogeneity and to propose a solution. Extensive experiments demonstrate that the proposed method outperforms several state-of-the-art methods in a variety of neuroscience tasks.

DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field
Chenyangguang Zhang Yan Di Ruida Zhang Guangyao Zhai Fabian Manhardt Federico Tombari Xiangyang Ji



Research question: How to reconstruct hand-held objects from a single RGB image.
Motivation: Existing methods based on Signed Distance Fields (SDF) are limited in capturing complex hand-object interactions, since SDF is reliable only in the proximity of the target and thus cannot simultaneously encode local hand and object cues.
Method: DDF-HO, a novel approach that adopts the Directed Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in 3D space, consisting of an origin and a direction, to the corresponding DDF values: a binary visibility signal indicating whether the ray intersects the object, and a distance value measuring the origin-to-target distance along the given direction. Hand-object interaction is modeled by combining 2D-3D features through a novel 2D ray-based feature aggregation scheme and a 3D intersection-aware hand pose embedding.
Results: Extensive experiments on synthetic and real-world datasets show that DDF-HO outperforms all baselines by a large margin, with roughly an 80% improvement under Chamfer Distance. Code is available at https://github.com/ZhangCYG/DDFHO.

Reconstructing hand-held objects from a single RGB image is an important and challenging problem. Existing works utilizing Signed Distance Fields (SDF) reveal limitations in comprehensively capturing the complex hand-object interactions, since SDF is only reliable within the proximity of the target, and hence, infeasible to simultaneously encode local hand and object cues. To address this issue, we propose DDF-HO, a novel approach leveraging Directed Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in 3D space, consisting of an origin and a direction, to corresponding DDF values, including a binary visibility signal determining whether the ray intersects the objects and a distance value measuring the distance from origin to target in the given direction. We randomly sample multiple rays and collect local to global geometric features for them by introducing a novel 2D ray-based feature aggregation scheme and a 3D intersection-aware hand pose embedding, combining 2D-3D features to model hand-object interactions. Extensive experiments on synthetic and real-world datasets demonstrate that DDF-HO consistently outperforms all baseline methods by a large margin, especially under Chamfer Distance, where it achieves roughly an 80% improvement. Codes are available at https://github.com/ZhangCYG/DDFHO.
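
A toy sketch of the DDF interface (conditioning on image features and the hand-pose embedding is omitted; layer sizes are our assumptions):

```python
import torch
import torch.nn as nn

class DirectedDistanceField(nn.Module):
    """Toy DDF head: maps a ray (origin o, unit direction d) to a hit
    probability and a distance along the ray."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.visibility = nn.Linear(hidden, 1)   # does the ray hit the object?
        self.distance = nn.Linear(hidden, 1)     # distance from origin along d

    def forward(self, origin, direction):
        h = self.trunk(torch.cat([origin, direction], dim=-1))
        return torch.sigmoid(self.visibility(h)), torch.relu(self.distance(h))

ddf = DirectedDistanceField()
dirs = torch.nn.functional.normalize(torch.randn(16, 3), dim=-1)
vis, dist = ddf(torch.rand(16, 3), dirs)
```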

PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas
Zheng Chen Yan-Pei Cao Yuan-Chen Guo Chen Wang Ying Shan Song-Hai Zhang



Research question: How to synthesize novel views from wide-baseline panoramas.
Motivation: When dealing with wide-baseline panoramas, existing neural radiance field methods tend to overfit the training views because accurate geometry is difficult to learn from sparse 360° views.
Method: We propose PanoGRF, generalizable spherical radiance fields for wide-baseline panoramas, which construct spherical radiance fields that incorporate 360° scene priors. Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss of panorama-to-perspective conversion and directly aggregates the geometry and appearance features of 3D sample points from each panoramic view based on spherical projection. Moreover, since some regions are visible from only one view under wide-baseline settings, PanoGRF incorporates 360° monocular depth priors into spherical depth estimation to improve the geometry features.
Results: Experiments on multiple panoramic datasets show that PanoGRF significantly outperforms state-of-the-art generalizable view synthesis methods designed for wide-baseline panoramas (e.g., OmniSyn) and for perspective images (e.g., IBRNet, NeuRay).

Achieving an immersive experience enabling users to explore virtual environments with six degrees of freedom (6DoF) is essential for various applications such as virtual reality (VR). Wide-baseline panoramas are commonly used in these applications to reduce network bandwidth and storage requirements. However, synthesizing novel views from these panoramas remains a key challenge. Although existing neural radiance field methods can produce photorealistic views under narrow-baseline and dense image captures, they tend to overfit the training views when dealing with wide-baseline panoramas due to the difficulty in learning accurate geometry from sparse $360^{\circ}$ views. To address this problem, we propose PanoGRF, Generalizable Spherical Radiance Fields for Wide-baseline Panoramas, which construct spherical radiance fields incorporating $360^{\circ}$ scene priors. Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss from panorama-to-perspective conversion and directly aggregates geometry and appearance features of 3D sample points from each panoramic view based on spherical projection. Moreover, as some regions of the panorama are only visible from one view while invisible from others under wide baseline settings, PanoGRF incorporates $360^{\circ}$ monocular depth priors into spherical depth estimation to improve the geometry features. Experimental results on multiple panoramic datasets demonstrate that PanoGRF significantly outperforms state-of-the-art generalizable view synthesis methods for wide-baseline panoramas (e.g., OmniSyn) and perspective images (e.g., IBRNet, NeuRay).
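
The spherical projection that PanoGRF's feature aggregation relies on can be illustrated with a small sketch: mapping a 3D point in a panorama camera's frame to equirectangular pixel coordinates. The function below is a generic equirectangular projection under assumed axis conventions, not code from the paper.

```python
import numpy as np

def project_to_equirect(points, width, height):
    """Project 3D points (in the panorama camera's frame) onto an
    equirectangular image: longitude maps to x, latitude to y."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    lon = np.arctan2(x, z)                                  # [-pi, pi]
    lat = np.arcsin(y / np.linalg.norm(points, axis=-1))    # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return np.stack([u, v], axis=-1)

pts = np.array([[0.0, 0.0, 1.0],    # straight ahead -> image center
                [1.0, 0.0, 0.0]])   # 90 deg to the right -> 3/4 across
print(project_to_equirect(pts, width=1024, height=512))
# [[512. 256.]
#  [768. 256.]]
```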

Toward Re-Identifying Any Animal
Bingliang Jiao Lingqiao Liu Liying Gao Ruiqi Wu Guosheng Lin PENG WANG Yanning Zhang



Research question: Current re-identification (ReID) models are designed and trained mainly for specific categories such as persons or vehicles, limiting their applicability in the open world.
Motivation: Given the importance of ReID for tracking wildlife populations and migration patterns, we propose a new task, Re-identify Any Animal in the Wild (ReID-AW).
Method: We build a comprehensive dataset named Wildlife-71 containing ReID data from 71 different wildlife categories, and develop a universal re-identification model, UniReID, that can handle any unseen animal category it encounters. A dynamic prompting mechanism based on visual prompts generated from pre-selected images of the target category enhances the model's adaptability, and explicit semantic knowledge obtained from the large-scale pre-trained language model GPT-4 lets UniReID focus on the regions that distinguish individuals within the target category.
Results: Experiments show that UniReID generalizes remarkably well, performing strongly on arbitrary wildlife categories and offering significant advances in ReID for wildlife conservation and research.

The current state of re-identification (ReID) models poses limitations to their applicability in the open world, as they are primarily designed and trained for specific categories like person or vehicle. In light of the importance of ReID technology for tracking wildlife populations and migration patterns, we propose a new task called ``Re-identify Any Animal in the Wild'' (ReID-AW). This task aims to develop a ReID model capable of handling any unseen wildlife category it encounters. To address this challenge, we have created a comprehensive dataset called Wildlife-71, which includes ReID data from 71 different wildlife categories. This dataset is the first of its kind to encompass multiple object categories in the realm of ReID. Furthermore, we have developed a universal re-identification model named UniReID specifically for the ReID-AW task. To enhance the model's adaptability to the target category, we employ a dynamic prompting mechanism using category-specific visual prompts. These prompts are generated based on knowledge gained from a set of pre-selected images within the target category. Additionally, we leverage explicit semantic knowledge derived from the large-scale pre-trained language model, GPT-4. This allows UniReID to focus on regions that are particularly useful for distinguishing individuals within the target category. Extensive experiments have demonstrated the remarkable generalization capability of our UniReID model. It showcases promising performance in handling arbitrary wildlife categories, offering significant advancements in the field of ReID for wildlife conservation and research purposes.

Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai Ruiping Wang Xilin CHEN



Research question: Video question answering (VideoQA) is an important tool for evaluating how well machines understand human daily behaviors, but reasoning over complex video situations remains challenging.
Motivation: Humans use a series of episodic memories as anchors to quickly locate question-related key moments for reasoning, an effective strategy that existing models struggle to realize.
Method: We propose a Glance-Focus model: at the glancing stage, an encoder-decoder is trained to generate a set of dynamic event memories; at the focusing stage, these event memories act as a bridge between the question, high-level event concepts, and low-level lengthy video content.
Results: Extensive experiments on four multi-event VideoQA benchmarks (STAR, EgoTaskQA, AGQA, NExT-QA) show that the model achieves state-of-the-art results, surpassing current large models on various challenging reasoning tasks.

Video Question Answering (VideoQA) has emerged as a vital tool to evaluate agents’ ability to understand human daily behaviors. Despite the recent success of large vision language models in many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events still remains challenging. In contrast, humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning. To mimic this effective reasoning strategy, we propose the Glance-Focus model. One simple way is to apply an action detection model to predict a set of actions as key memories. However, these actions within a closed set vocabulary are hard to generalize to various video domains. Instead of that, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage. Apart from using supervised bipartite matching to obtain the event memories, we further design an unsupervised memory generation method to get rid of dependence on event annotations. Next, at the focusing stage, these event memories act as a bridge to establish the correlation between the questions with high-level event concepts and low-level lengthy video content. Given the question, the model first focuses on the generated key event memory, then focuses on the most relevant moment for reasoning through our designed multi-level cross-attention mechanism. We conduct extensive experiments on four Multi-Event VideoQA benchmarks including STAR, EgoTaskQA, AGQA, and NExT-QA. Our proposed model achieves state-of-the-art results, surpassing current large models in various challenging reasoning tasks. The code and models are available at https://github.com/ByZ0e/Glance-Focus.
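
The supervised bipartite matching between generated event memories and annotated events can be sketched with the Hungarian algorithm, as in DETR-style set prediction. The negative-cosine-similarity cost below is an assumption for illustration, not the paper's exact matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_memories_to_events(memories, events):
    """One-to-one assignment between generated event memories and
    annotated event embeddings. The cost is negative cosine similarity,
    which the Hungarian algorithm minimizes."""
    m = memories / np.linalg.norm(memories, axis=1, keepdims=True)
    e = events / np.linalg.norm(events, axis=1, keepdims=True)
    cost = -m @ e.T                              # (num_memories, num_events)
    mem_idx, evt_idx = linear_sum_assignment(cost)
    return list(zip(mem_idx, evt_idx))

rng = np.random.default_rng(0)
memories = rng.normal(size=(6, 128))   # 6 generated event memories
events = rng.normal(size=(4, 128))     # 4 annotated events in the video
print(match_memories_to_events(memories, events))  # 4 matched pairs
```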

PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation
Yuhan Ding Fukun Yin Jiayuan Fan Hui Li Xin Chen Wen Liu Chongshan Lu Gang YU Tao Chen



Research question: How to build an effective neural representation of large-scale outdoor scenes.
Motivation: Existing implicit neural representations sample and fuse individual points in the sampling space, but due to the explosive growth of that space, finely representing and synthesizing detailed textures for unbounded large outdoor scenes remains a challenge.
Method: We propose a Point Diffusion implicit Function (PDF) for large-scale scene neural representation. Its core is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from a few training images into a dense point cloud serving as an explicit prior. In the rendering stage, only sampling points that have prior points within the sampling radius are retained, i.e., the sampling space shrinks from unbounded space to the scene surface. Meanwhile, to fill in the scene background that point clouds cannot provide, region sampling based on Mip-NeRF 360 is adopted to model the background representation.
Results: Experiments demonstrate that the method is highly effective for novel view synthesis of large scenes, outperforming relevant state-of-the-art baselines.

Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along sampling rays in the sampling space. However, due to the explosively growing sampling space, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the samplable space and propose a Point Diffusion implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, the region sampling based on Mip-NeRF 360 is employed to model the background representation. Extensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines.
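
The sampling-space reduction step can be sketched independently of the diffusion module: keep only the ray samples that lie within the sampling radius of some point in the dense prior cloud. A minimal KD-tree sketch; the function name, radius, and stand-in point clouds are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_samples(samples, prior_points, radius):
    """Keep only ray sample points within `radius` of at least one point
    of the dense prior cloud, shrinking the sampling space from unbounded
    3D space to a shell around the scene surface."""
    tree = cKDTree(prior_points)
    dist, _ = tree.query(samples, k=1)   # nearest prior point per sample
    keep = dist <= radius
    return samples[keep], keep

rng = np.random.default_rng(0)
prior = rng.uniform(-1, 1, size=(5000, 3))        # dense prior (stand-in)
ray_samples = rng.uniform(-4, 4, size=(1000, 3))  # samples along camera rays
kept, mask = prune_samples(ray_samples, prior, radius=0.1)
print(f"kept {kept.shape[0]} / {mask.size} samples")
```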

CluB: Cluster Meets BEV for LiDAR-Based 3D Object Detection
Yingjie Wang Jiajun Deng Yuenan Hou Yao Li Yu Zhang Jianmin Ji Wanli Ouyang Yanyong Zhang



Research question: How to effectively combine two complementary representations, BEV-based detectors and cluster-based detectors, in a unified framework.
Motivation: Current LiDAR 3D detectors fall into two groups, BEV-based and cluster-based, each with its own strengths; combining them effectively remains a challenge.
Method: This paper proposes CluB, a new 3D object detection framework that incorporates an auxiliary cluster-based branch into a BEV-based detector, enriching object representations at both the feature and query levels. Concretely, CluB consists of two steps: first, a cluster feature diffusion module establishes the association between cluster features and BEV features in a subtle and adaptive fashion; second, a cluster query generation module leverages voting centers directly from the cluster branch to enrich the diversity of object queries.
Results: Extensive experiments on the Waymo and nuScenes datasets show that CluB achieves state-of-the-art performance on both benchmarks.

Currently, LiDAR-based 3D detectors are broadly categorized into two groups, namely, BEV-based detectors and cluster-based detectors. BEV-based detectors capture the contextual information from the Bird's Eye View (BEV) and fill their center voxels via feature diffusion with a stack of convolution layers, which, however, weakens the capability of presenting an object with the center point. On the other hand, cluster-based detectors exploit the voting mechanism and aggregate the foreground points into object-centric clusters for further prediction. In this paper, we explore how to effectively combine these two complementary representations into a unified framework. Specifically, we propose a new 3D object detection framework, referred to as CluB, which incorporates an auxiliary cluster-based branch into the BEV-based detector by enriching the object representation at both feature and query levels. Technically, CluB is comprised of two steps. First, we construct a cluster feature diffusion module to establish the association between cluster features and BEV features in a subtle and adaptive fashion. Based on that, an imitation loss is introduced to distill object-centric knowledge from the cluster features to the BEV features. Second, we design a cluster query generation module to leverage the voting centers directly from the cluster branch, thus enriching the diversity of object queries. Meanwhile, a direction loss is employed to encourage a more accurate voting center for each cluster. Extensive experiments are conducted on Waymo and nuScenes datasets, and our CluB achieves state-of-the-art performance on both benchmarks.

Lightweight Vision Transformer with Bidirectional Interaction
Qihang Fan Huaibo Huang Xiaoqiang Zhou Ran He



Research question: This paper addresses the bidirectional interaction between local and global context in vision backbones.
Motivation: Although vision backbones have made remarkable progress in simultaneously modeling the local and global contexts of images, the bidirectional interaction between the two contexts has not been well explored and exploited.
Method: We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers that models local and global information, as well as the bidirectional interaction between them, in a context-aware manner. Specifically, FASA uses adaptive convolution to extract local representations and self-attention in down-sampled space to extract global representations, then runs a bidirectional adaptation process between the two to model their interaction. We further introduce a fine-grained downsampling strategy to improve the fine-grained global perception of the down-sampled self-attention. Based on FASA, we develop a family of lightweight vision backbones, the FAT family.
Results: Extensive experiments on multiple vision tasks show that FAT achieves impressive performance. Notably, FAT reaches 77.6% accuracy on ImageNet-1K with only 4.5M parameters and 0.7G FLOPs, surpassing the most advanced ConvNets and transformers of similar model size and computational cost. Moreover, our models run faster than others on modern GPUs.

Recent advancements in vision backbones have significantly improved their performance by simultaneously modeling images’ local and global contexts. However, the bidirectional interaction between these two contexts has not been well explored and exploited, which is important in the human visual system. This paper proposes a **F**ully **A**daptive **S**elf-**A**ttention (FASA) mechanism for vision transformer to model the local and global information as well as the bidirectional interaction between them in context-aware ways. Specifically, FASA employs self-modulated convolutions to adaptively extract local representation while utilizing self-attention in down-sampled space to extract global representation. Subsequently, it conducts a bidirectional adaptation process between local and global representation to model their interaction. In addition, we introduce a fine-grained downsampling strategy to enhance the down-sampled self-attention mechanism for finer-grained global perception capability. Based on FASA, we develop a family of lightweight vision backbones, **F**ully **A**daptive **T**ransformer (FAT) family. Extensive experiments on multiple vision tasks demonstrate that FAT achieves impressive performance. Notably, FAT accomplishes a **77.6%** accuracy on ImageNet-1K using only **4.5M** parameters and **0.7G** FLOPs, which surpasses the most advanced ConvNets and Transformers with similar model size and computational costs. Moreover, our model exhibits faster speed on modern GPU compared to other models.
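
A rough PyTorch sketch of the local/global split with a bidirectional gate may help fix ideas: a depthwise-convolutional local branch, self-attention over a down-sampled grid as the global branch, and sigmoid gates through which each branch modulates the other. The module layout and gating form are assumptions for illustration, not the paper's exact FASA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalInteraction(nn.Module):
    """Illustrative local/global block with bidirectional adaptation."""

    def __init__(self, dim, pool=8, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise
        self.pool = nn.AdaptiveAvgPool2d(pool)                      # downsample
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_l = nn.Conv2d(dim, dim, 1)   # global modulates local
        self.gate_g = nn.Conv2d(dim, dim, 1)   # local modulates global

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        loc = self.local(x)
        g = self.pool(x).flatten(2).transpose(1, 2)   # (B, pool*pool, C)
        g, _ = self.attn(g, g, g)                     # global self-attention
        side = int(g.shape[1] ** 0.5)
        g = g.transpose(1, 2).reshape(B, C, side, side)
        g = F.interpolate(g, size=(H, W), mode="bilinear",
                          align_corners=False)
        # bidirectional adaptation: each branch gates the other
        loc_out = loc * torch.sigmoid(self.gate_l(g))
        glob_out = g * torch.sigmoid(self.gate_g(loc))
        return loc_out + glob_out

x = torch.randn(2, 64, 32, 32)
print(BidirectionalInteraction(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```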

Single-Stage Visual Query Localization in Egocentric Videos
Hanwen Jiang Santhosh Kumar Ramakrishnan Kristen Grauman



Research question: This paper addresses visual query localization (VQL) in long-form egocentric videos, which requires spatio-temporal search and localization of visually specified objects and is vital for building episodic memory systems.
Motivation: Prior work performs VQL via complex multi-stage pipelines built on mature object detection and tracking methods, but each stage is trained independently and the pipeline complexity leads to slow inference.
Method: We propose VQLoC, a novel single-stage, end-to-end trainable VQL framework. The key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single shot. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame and frame-to-frame correspondences between nearby video frames.
Results: Our experiments show that the method improves accuracy over prior VQL approaches by 20% while achieving a 10x speedup in inference. VQLoC also ranks first on the Ego4D VQ2D challenge leaderboard.

Visual Query Localization on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital to build episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is independently trained and the complexity of the pipeline results in slow inference speeds. We propose VQLoC, a novel single-stage VQL framework that is end-to-end trainable. Our key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single shot manner. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame and frame-to-frame correspondences between nearby video frames. Our experiments demonstrate that our approach outperforms prior VQL methods by $20$% accuracy while obtaining a $10\times$ improvement in inference speed. VQLoC is also the top entry on the Ego4D VQ2D challenge leaderboard.

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
Yingying Fan Yu Wu Bo Du Yutian Lin



Research question: This paper addresses the weakly-supervised audio-visual video parsing (AVVP) task, i.e., identifying and localizing all events in the audio/visual modalities.
Motivation: Previous works focus only on denoising video-level overall labels across modalities, overlooking segment-level label noise: adjacent video segments may contain different events. Recognizing segment-level events is challenging because a segment's label could be any combination of events occurring in the video.
Method: We tackle AVVP from the language perspective, designing language prompts that describe how various events appear in each video. The similarity between language prompts and segments is then computed, and the event of the most similar prompt is taken as the segment-level label. In addition, to handle mislabeled segments, we propose dynamic re-weighting of unreliable segments to adjust their labels.
Results: Experiments show that this simple yet effective approach outperforms state-of-the-art methods by a large margin.

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook the segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events on the segment is challenging because its label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, where the event of the most similar prompt is regarded as the segment-level label. In addition, to deal with the mislabeled segments, we propose to perform dynamic re-weighting on the unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.
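
The prompt-to-segment labeling step reduces to a cosine-similarity argmax between segment features and prompt features (e.g., from a CLIP-style encoder). A minimal sketch, with random toy feature tensors and hypothetical event sets standing in for real encodings:

```python
import torch
import torch.nn.functional as F

def segment_labels_from_prompts(seg_feats, prompt_feats, prompt_events):
    """Assign each video segment the event set of its most similar
    language prompt (features assumed to come from a shared
    vision-language embedding space)."""
    seg = F.normalize(seg_feats, dim=-1)       # (num_segments, D)
    txt = F.normalize(prompt_feats, dim=-1)    # (num_prompts, D)
    sim = seg @ txt.T                          # cosine similarities
    best = sim.argmax(dim=-1)                  # most similar prompt
    return [prompt_events[i] for i in best.tolist()]

# Toy example: 3 one-second segments, 4 prompts covering event combinations.
torch.manual_seed(0)
seg_feats = torch.randn(3, 512)
prompt_feats = torch.randn(4, 512)
prompt_events = [{"speech"}, {"dog"}, {"speech", "dog"}, set()]
print(segment_labels_from_prompts(seg_feats, prompt_feats, prompt_events))
```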

Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs
Peng Jin Yang Wu Yanbo Fan Zhongqian Sun Yang Wei Li Yuan



Research question: Most text-driven human motion generation methods rely on sequential modeling, but such compact text representations may overemphasize action names, neglect other important properties, and lack the fine-grained details needed to guide subtly distinct motion.
Motivation: To address these issues, this paper proposes hierarchical semantic graphs for fine-grained control over human motion generation.
Method: We disentangle motion descriptions into hierarchical semantic graphs with three levels: motions, actions, and specifics. This global-to-local structure facilitates a comprehensive understanding of the motion description and fine-grained control of motion generation.
Results: Extensive experiments on two benchmark human motion datasets, HumanML3D and KIT, demonstrate the superiority of our method. More encouragingly, by modifying the edge weights of the hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community.

Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-training weights are available at https://github.com/jpthu17/GraphMotion.

ReTR: Modeling Rendering Via Transformer for Generalizable Neural Surface Reconstruction
Yixun Liang Hao He Ying-Cong Chen



Research question: Existing generalizable neural surface reconstruction techniques suffer from low-confidence depth distributions and inaccurate surface reasoning due to the oversimplified volume rendering process they employ.
Motivation: This paper proposes Reconstruction TRansformer (ReTR), a new framework that leverages the transformer architecture to redesign the rendering process and model complex rendering interactions.
Method: ReTR introduces a learnable meta-ray token and uses cross-attention to simulate the interaction between the rendering process and the sampled points, rendering the observed color. By operating in a high-dimensional feature space rather than the color space, ReTR also mitigates sensitivity to projected colors in source views.
Results: Experiments show that the method performs well across various datasets, outperforming current state-of-the-art approaches in both reconstruction quality and generalization ability.

Generalizable neural surface reconstruction techniques have attracted great attention in recent years. However, they encounter limitations of low confidence depth distribution and inaccurate surface reasoning due to the oversimplified volume rendering process employed. In this paper, we present Reconstruction TRansformer (ReTR), a novel framework that leverages the transformer architecture to redesign the rendering process, enabling complex render interaction modeling. It introduces a learnable $\textit{meta-ray token}$ and utilizes the cross-attention mechanism to simulate the interaction of rendering process with sampled points and render the observed color. Meanwhile, by operating within a high-dimensional feature space rather than the color space, ReTR mitigates sensitivity to projected colors in source views. Such improvements result in accurate surface assessment with high confidence. We demonstrate the effectiveness of our approach on various datasets, showcasing how our method outperforms the current state-of-the-art approaches in terms of reconstruction quality and generalization ability. $\textit{Our code is available at }$ https://github.com/YixunLiang/ReTR.

HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork
Bipasha Sen Gaurav Singh Aditya Agarwal Rohith Agaram Madhava Krishna Srinath Sridhar



Research question: How to learn generalizable NeRF priors over the high-dimensional network weight space that capture high-quality appearance and shape of scenes and objects.
Motivation: Existing work is limited in generalization, multi-view consistency, and quality, so we propose HyP-NeRF, a latent conditioning method that learns generalizable category-level NeRF priors using hypernetworks.
Method: We use a hypernetwork to estimate not only the NeRF weights but also the multi-resolution hash encodings, which yields significant quality gains. We further introduce a denoise-and-finetune strategy that denoises images rendered from the hypernetwork-estimated NeRF and finetunes it while preserving multi-view consistency.
Results: These improvements allow HyP-NeRF to serve as a generalizable prior for multiple downstream tasks, including NeRF reconstruction from single views or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating state-of-the-art results.

Neural Radiance Fields (NeRF) have become an increasingly popular representation to capture high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of network weight space. To address the limitations of existing work on generalization, multi-view consistency and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using hypernetworks to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings resulting in significant quality gains. To improve quality even further, we incorporate a denoise and finetune strategy that denoises images rendered from NeRFs estimated by the hypernetwork and finetunes it while retaining multiview consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating our state-of-the-art results.

Learning Unseen Modality Interaction
Yunhua Zhang Hazel Doughty Cees G. M. Snoek



Research question: This paper challenges the modality-completeness assumption in multimodal learning, namely that all modality combinations of interest are available during training, and strives to generalize to unseen combinations at inference time.
Motivation: To handle unseen modality interaction, we introduce a solution that projects the multidimensional features of different modalities into a common space preserving rich information, so that information can be accumulated across all available modalities with a simple summation.
Method: We further reduce overfitting to less discriminative modality combinations during training through pseudo-supervision that indicates the reliability of each modality's prediction.
Results: Evaluations on multimodal video classification, robot state regression, and multimedia retrieval show that the approach is effective across diverse tasks and modalities.

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality’s prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.
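
The projection-then-summation idea can be sketched in a few lines: one projection per modality into a shared space, a sum over whatever modalities are present, and a shared prediction head. Module names and dimensions below are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CommonSpaceFusion(nn.Module):
    """Per-modality projection into a shared space, then summation over
    the modalities that happen to be available at inference time."""

    def __init__(self, dims, common_dim=256, num_classes=10):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, common_dim) for name, d in dims.items()}
        )
        self.head = nn.Linear(common_dim, num_classes)

    def forward(self, feats):                 # feats: {modality: tensor}
        z = sum(self.proj[name](x) for name, x in feats.items())
        return self.head(z)

model = CommonSpaceFusion({"video": 1024, "audio": 128, "flow": 512})
# Train on e.g. video+audio, yet infer on an unseen combination (audio+flow):
out = model({"audio": torch.randn(2, 128), "flow": torch.randn(2, 512)})
print(out.shape)  # torch.Size([2, 10])
```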

Face Reconstruction from Facial Templates by Learning Latent Space of a Generator Network
Hatef Otroshi Shahreza Sébastien Marcel



Research question: This paper focuses on template inversion attacks against face recognition systems and proposes a new method to reconstruct face images from facial templates.
Motivation: Within a generative adversarial network (GAN)-based framework, a mapping from facial templates to the intermediate latent space of a pre-trained face generation network allows generating high-resolution, realistic reconstructed face images.
Method: We learn the mapping from facial templates into the intermediate latent space of a pre-trained face generation network via adversarial training, and reconstruct face images from it.
Results: Our experiments show that the method succeeds against face recognition systems in both whitebox and blackbox attack scenarios, and that the reconstructed face images are transferable and can be used to attack other face recognition systems.

In this paper, we focus on the template inversion attack against face recognition systems and propose a new method to reconstruct face images from facial templates. Within a generative adversarial network (GAN)-based framework, we learn a mapping from facial templates to the intermediate latent space of a pre-trained face generation network, from which we can generate high-resolution realistic reconstructed face images. We show that our proposed method can be applied in whitebox and blackbox attacks against face recognition systems. Furthermore, we evaluate the transferability of our attack when the adversary uses the reconstructed face image to impersonate the underlying subject in an attack against another face recognition system. Considering the adversary's knowledge and the target face recognition system, we define five different attacks and evaluate the vulnerability of state-of-the-art face recognition systems. Our experiments show that our proposed method achieves high success attack rates in whitebox and blackbox scenarios. Furthermore, the reconstructed face images are transferable and can be used to enter target face recognition systems with a different feature extractor model. We also explore important areas in the reconstructed face images that can fool the target face recognition system.

SpatialRank: Urban Event Ranking with NDCG Optimization on Spatiotemporal Data
BANG AN Xun Zhou Yongjian Zhong Tianbao Yang



Research question: Urban event ranking aims to predict the top-k most risky locations of future events such as traffic accidents and crimes.
Motivation: The problem is challenging due to complex, dynamic spatio-temporal correlations between locations, the uneven spatial distribution of urban events, and the difficulty of correctly ranking nearby locations with similar features.
Method: We propose SpatialRank, a novel spatial event ranking approach that optimizes an NDCG loss while dynamically learning the spatio-temporal dependencies among locations from data.
Results: Experiments show that SpatialRank effectively identifies the riskiest locations for crimes and traffic accidents, outperforming state-of-the-art methods by up to 12.7% in NDCG.

The problem of urban event ranking aims at predicting the top-$k$ most risky locations of future events such as traffic accidents and crimes. This problem is of fundamental importance to public safety and urban administration especially when limited resources are available. The problem is, however, challenging due to complex and dynamic spatio-temporal correlations between locations, uneven distribution of urban events in space, and the difficulty of correctly ranking nearby locations with similar features. Prior works on event forecasting mostly aim at accurately predicting the actual risk score or counts of events for all the locations. Rankings obtained as such usually have low quality due to prediction errors. Learning-to-rank methods directly optimize measures such as Normalized Discounted Cumulative Gain (NDCG), but cannot handle the spatiotemporal autocorrelation existing among locations, due to the common assumption that items are independent. In this paper, we bridge the gap by proposing a novel spatial event ranking approach named SpatialRank. SpatialRank features adaptive graph convolution layers that dynamically learn the spatiotemporal dependencies across locations from data. In addition, the model optimizes through surrogates a hybrid NDCG loss with a spatial component to better rank neighboring spatial locations. We design an importance-sampling with a spatial filtering algorithm to effectively evaluate the loss during training. Comprehensive experiments on three real-world datasets demonstrate that SpatialRank can effectively identify the top riskiest locations of crimes and traffic accidents and outperform state-of-the-art methods in terms of NDCG by up to 12.7%.
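
Since the method optimizes surrogates of NDCG, it may help to recall what NDCG@k measures for event ranking. A standard NumPy implementation of the metric itself (not the paper's surrogate loss); the toy risk data are illustrative:

```python
import numpy as np

def ndcg_at_k(true_risk, pred_score, k=10):
    """NDCG@k: how well do predicted scores rank the truly riskiest
    locations? 1.0 means the predicted top-k ordering is ideal."""
    order = np.argsort(pred_score)[::-1][:k]          # predicted ranking
    gains = 2.0 ** true_risk[order] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, k + 2))    # positions 1..k
    dcg = np.sum(gains * discounts)
    ideal = np.sort(true_risk)[::-1][:k]              # best possible ranking
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0

rng = np.random.default_rng(0)
true_risk = rng.poisson(1.0, size=100).astype(float)  # e.g., accident counts
pred = true_risk + rng.normal(0, 0.5, size=100)       # noisy model scores
print(round(ndcg_at_k(true_risk, pred, k=10), 3))
```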

Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation
Zibo Zhao Wen Liu Xin Chen Xianfang Zeng Rui Wang Pei Cheng BIN FU Tao Chen Gang YU Shenghua Gao



Research question: How to effectively generate 3D shapes from 2D images or text.
Motivation: Directly learning a conditional generative model from images or text to 3D shapes tends to produce results inconsistent with the conditions, because 3D shapes have an extra dimension whose distribution differs significantly from that of 2D images and text.
Method: We propose a novel alignment-before-generation approach that represents 3D shapes in a shape-image-text-aligned space, bridging the domain gap among the three modalities and facilitating multi-modal-conditioned 3D shape generation. The approach comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM).
Results: Experiments show that the method generates higher-quality, more diverse 3D shapes that better conform to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.

We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing inconsistent results with the conditions because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textural conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.

HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds
Gang Zhang Chen Junnan Guohuan Gao Jianmin Li Xiaolin Hu



Research question: This paper addresses 3D object detection for autonomous driving systems, whose primary challenge stems from the sparse distribution of points in 3D scenes.
Motivation: Existing high-performance methods typically use 3D sparse convolutional neural networks with small kernels to extract features, which prevents information exchange between spatially disconnected features. Newer approaches introduce large-kernel convolutions or self-attention mechanisms, but they either achieve limited accuracy gains or incur excessive computational costs.
Method: We propose HEDNet, a hierarchical encoder-decoder network for 3D object detection that uses encoder-decoder blocks to capture long-range dependencies among features in space, particularly for large and distant objects.
Results: Extensive experiments on the Waymo Open and nuScenes datasets show that HEDNet outperforms previous state-of-the-art methods on both, with competitive efficiency.

3D object detection in point clouds is important for autonomous driving systems. A primary challenge in 3D object detection stems from the sparse distribution of points within the 3D scene. Existing high-performance methods typically employ 3D sparse convolutional neural networks with small kernels to extract features. To reduce computational costs, these methods resort to submanifold sparse convolutions, which prevent the information exchange among spatially disconnected features. Some recent approaches have attempted to address this problem by introducing large-kernel convolutions or self-attention mechanisms, but they either achieve limited accuracy improvements or incur excessive computational costs. We propose HEDNet, a hierarchical encoder-decoder network for 3D object detection, which leverages encoder-decoder blocks to capture long-range dependencies among features in the spatial space, particularly for large and distant objects. We conducted extensive experiments on the Waymo Open and nuScenes datasets. HEDNet achieved superior detection accuracy on both datasets than previous state-of-the-art methods with competitive efficiency. The code is available at https://github.com/zhanggang001/HEDNet.

Slot-guided Volumetric Object Radiance Fields
DI QI Tong Yang Xiangyu Zhang



Research question: How to effectively decompose complex scenes into individual objects from a single image, enabling unsupervised 3D object-centric representation learning.
Motivation: Existing methods for complex scene decomposition often require large amounts of annotated data and computation, falling short of truly unsupervised learning.
Method: We propose sVORF (slot-guided Volumetric Object Radiance Fields), a new framework that composes volumetric object radiance fields with object slots as guidance, achieving unsupervised 3D scene decomposition.
Results: The method achieves top results on scene decomposition and generation tasks for complex synthetic datasets (e.g., Room-Diverse) and also performs well on object segmentation in real-world scenes (e.g., the LLFF dataset).

We present a novel framework for 3D object-centric representation learning. Our approach effectively decomposes complex scenes into individual objects from a single image in an unsupervised fashion. This method, called \underline{s}lot-guided \underline{V}olumetric \underline{O}bject \underline{R}adiance \underline{F}ields~(sVORF), composes volumetric object radiance fields with object slots as a guidance to implement unsupervised 3D scene decomposition. Specifically, sVORF obtains object slots from a single image via a transformer module, maps these slots to volumetric object radiance fields with a hypernetwork and composes object radiance fields with the guidance of object slots at a 3D location. Moreover, sVORF significantly reduces memory requirement due to small-sized pixel rendering during training. We demonstrate the effectiveness of our approach by showing top results in scene decomposition and generation tasks of complex synthetic datasets (e.g., Room-Diverse). Furthermore, we also confirm the potential of sVORF to segment objects in real-world scenes (e.g., the LLFF dataset). We hope our approach can provide preliminary understanding of the physical world and help ease future research in 3D object-centric representation learning.

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection
Lingchen Meng Xiyang Dai Jianwei Yang Dongdong Chen Yinpeng Chen Mengchen Liu Yi-Ling Chen Zuxuan Wu Lu Yuan Yu-Gang Jiang



Research question: This paper addresses long-tailed object detection on real-world datasets with extreme data imbalance, where many tail classes have scarce instances.
Motivation: Existing long-tailed detection methods mainly explore extra data with image-level labels to counter the imbalance, but such labels yield limited gains due to semantic ambiguity and location sensitivity.
Method: We propose RichSem, which learns rich semantics from coarse locations without requiring accurate bounding boxes. RichSem leverages rich semantics from images as additional "soft supervision" for training detectors; specifically, a semantic branch is added to the detector to learn these soft semantics and enhance feature representations.
Results: Experiments show that RichSem achieves consistent improvements on both the overall and rare categories of LVIS, reaching state-of-the-art performance without complex training and testing procedures. Additional experiments on other long-tailed datasets further demonstrate the method's effectiveness.

Long-tailed object detection (LTOD) aims to handle the extreme data imbalance in real-world datasets, where many tail classes have scarce instances. One popular strategy is to explore extra data with image-level labels, yet it produces limited results due to (1) semantic ambiguity---an image-level label only captures a salient part of the image, ignoring the remaining rich semantics within the image; and (2) location sensitivity---the label highly depends on the locations and crops of the original image, which may change after data transformations like random cropping. To remedy this, we propose RichSem, a simple but effective method, which is robust to learn rich semantics from coarse locations without the need of accurate bounding boxes. RichSem leverages rich semantics from images, which are then served as additional ``soft supervision'' for training detectors. Specifically, we add a semantic branch to our detector to learn these soft semantics and enhance feature representations for long-tailed object detection. The semantic branch is only used for training and is removed during inference. RichSem achieves consistent improvements on both overall and rare-category of LVIS under different backbones and detectors. Our method achieves state-of-the-art performance without requiring complex training and testing procedures. Moreover, we show the effectiveness of our method on other long-tailed datasets with additional experiments.

Uni3DETR: Unified 3D Detection Transformer
Zhenyu Wang Ya-Li Li Xi Chen Hengshuang Zhao Shengjin Wang



Research question: Current point-cloud-based 3D detectors are designed for specific scenes (indoor or outdoor), differ substantially from one another, and lack a unified network architecture.
Motivation: Because of differences in object distribution and point density across point clouds collected in various environments, along with the intricacy of 3D metrics, no unified network architecture yet accommodates diverse scenes.
Method: This paper proposes Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ a detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and is resistant to data discrepancies. We then propose a mixture of query points, which exploits global information for dense, small-range indoor scenes and local information for large-range, sparse outdoor ones. Furthermore, the proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the $xy$ and $z$ spaces.
Results: Extensive experiments show that Uni3DETR performs consistently well on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may do well on particular datasets but degrade substantially in different scenes, Uni3DETR demonstrates strong generalization under heterogeneous conditions.

Existing point cloud based 3D detectors are designed for the particular scene, either indoor or outdoor ones. Because of the substantial differences in object distribution and point density within point clouds collected from various environments, coupled with the intricate nature of 3D metrics, there is still a lack of a unified network architecture that can accommodate diverse scenes. In this paper, we propose Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ the detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and behaves resistant to the discrepancies from data. We then propose the mixture of query points, which sufficiently exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor ones. Furthermore, our proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the $xy$ and $z$ space. Extensive experiments validate that Uni3DETR exhibits excellent performance consistently on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may perform well on some particular datasets but suffer a substantial degradation on different scenes, Uni3DETR demonstrates the strong generalization ability under heterogeneous conditions (Fig. 1).
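
The decoupled IoU can be sketched for axis-aligned boxes: compute a BEV ($xy$) IoU and a height ($z$) IoU independently and combine them, so the two subspaces do not entangle during optimization. The averaging combination below is one plausible choice under that reading; the paper's exact formulation may differ.

```python
def decoupled_iou(box_a, box_b):
    """IoU of axis-aligned 3D boxes, decoupled into a BEV (xy) term and a
    height (z) term. Boxes are (cx, cy, cz, dx, dy, dz)."""
    def overlap_1d(c1, d1, c2, d2):
        lo = max(c1 - d1 / 2, c2 - d2 / 2)
        hi = min(c1 + d1 / 2, c2 + d2 / 2)
        return max(hi - lo, 0.0)

    ox = overlap_1d(box_a[0], box_a[3], box_b[0], box_b[3])
    oy = overlap_1d(box_a[1], box_a[4], box_b[1], box_b[4])
    oz = overlap_1d(box_a[2], box_a[5], box_b[2], box_b[5])

    inter_xy = ox * oy
    union_xy = box_a[3] * box_a[4] + box_b[3] * box_b[4] - inter_xy
    iou_xy = inter_xy / union_xy                 # BEV term

    union_z = box_a[5] + box_b[5] - oz
    iou_z = oz / union_z                         # height term
    return 0.5 * (iou_xy + iou_z)                # decoupled combination

a = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
b = (1.0, 0.0, 0.2, 4.0, 2.0, 1.5)
print(round(decoupled_iou(a, b), 3))
```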

Neural Lighting Simulation for Urban Scenes
Ava Pun Gary Sun Jingkang Wang Yun Chen Ze Yang Sivabalan Manivasagam Wei-Chiu Ma Raquel Urtasun



Research question: Changes in outdoor illumination drastically alter the appearance of urban scenes and can harm image-based robot perception systems if such variation is not seen during training.
Motivation: To address this, we propose LightSim, a neural lighting camera simulation system for generating large image datasets under diverse lighting conditions.
Method: LightSim automatically builds lighting-aware digital twins at scale from collected raw sensor data, decomposing each scene into dynamic actors and a static background with accurate geometry, appearance, and estimated scene lighting. It then combines physically based and learnable deferred rendering to realistically relight modified scenes, e.g., changing the sun position, modifying shadows, or adjusting sun brightness, producing spatially and temporally consistent camera videos.
Results: Experiments show that LightSim produces more realistic relighting than prior work; more importantly, training perception models on LightSim-generated data significantly improves their performance.

Different outdoor illumination conditions drastically alter the appearance of urban scenes, and they can harm the performance of image-based robot perception systems if not seen during training. Camera simulation provides a cost-effective solution to create a large dataset of images captured under different lighting conditions. Towards this goal, we propose LightSim, a neural lighting camera simulation system that enables diverse, realistic, and controllable data generation. LightSim automatically builds lighting-aware digital twins at scale from collected raw sensor data and decomposes the scene into dynamic actors and static background with accurate geometry, appearance, and estimated scene lighting. These digital twins enable actor insertion, modification, removal, and rendering from a new viewpoint, all in a lighting-aware manner. LightSim then combines physically-based and learnable deferred rendering to perform realistic relighting of modified scenes, such as altering the sun location and modifying the shadows or changing the sun brightness, producing spatially- and temporally-consistent camera videos. Our experiments show that LightSim generates more realistic relighting results than prior work. Importantly, training perception models on data generated by LightSim can significantly improve their performance. Our project page is available at https://waabi.ai/lightsim/.

A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains
Minkyu Choi Kuan Han Xiaokai Wang Yizhen Zhang Zhongming Liu



Research question: Computer vision systems typically use a single feedforward pathway, making them less robust, adaptive, and efficient than human vision.
Motivation: To bridge this gap, we developed a dual-stream vision model inspired by the human eyes and brain.
Method: At the input level, the model mimics how the human eyes use magnocellular and parvocellular retinal ganglion cells to separate retinal inputs to the brain; at the backend, two parallel convolutional neural network branches process the separated input patterns, mimicking how the human brain uses the dorsal and ventral cortical pathways for parallel visual processing.
Results: Comparing the model with human brains processing the same movie, we find that the WhereCNN and WhatCNN branches match the dorsal and ventral pathways of the visual cortex, respectively, primarily because of their different learning objectives rather than differences in retinal sampling or sensitivity to attention-driven eye movements. This dual-stream model takes a further step in brain-inspired computer vision, enabling parallel neural networks to actively explore and understand the visual environment.

The human visual system uses two parallel pathways for spatial processing and object recognition. In contrast, computer vision systems tend to use a single feedforward pathway, rendering them less robust, adaptive, or efficient than human vision. To bridge this gap, we developed a dual-stream vision model inspired by the human eyes and brain. At the input level, the model samples two complementary visual patterns to mimic how the human eyes use magnocellular and parvocellular retinal ganglion cells to separate retinal inputs to the brain. At the backend, the model processes the separate input patterns through two branches of convolutional neural networks (CNN) to mimic how the human brain uses the dorsal and ventral cortical pathways for parallel visual processing. The first branch (WhereCNN) samples a global view to learn spatial attention and control eye movements. The second branch (WhatCNN) samples a local view to represent the object around the fixation. Over time, the two branches interact recurrently to build a scene representation from moving fixations. We compared this model with human brains processing the same movie and evaluated their functional alignment by linear transformation. The WhereCNN and WhatCNN branches were found to differentially match the dorsal and ventral pathways of the visual cortex, respectively, primarily due to their different learning objectives, rather than their distinctions in retinal sampling or sensitivity to attention-driven eye movements. These model-based results lead us to speculate that the distinct responses and representations of the ventral and dorsal streams are more influenced by their distinct goals in visual attention and object recognition than by their specific bias or selectivity in retinal inputs. This dual-stream model takes a further step in brain-inspired computer vision, enabling parallel neural networks to actively explore and understand the visual surroundings.

MAViL: Masked Audio-Video Learners
Po-Yao Huang Vasu Sharma Hu Xu Chaitanya Ryali Haoqi Fan Yanghao Li Shang-Wen Li Gargi Ghosh Jitendra Malik Christoph Feichtenhofer



Research question: How to learn representations from audio and video via self-supervised learning.
Motivation: Existing methods leave room for improvement on multimodal classification and retrieval tasks, and rely on information from other modalities during uni-modal fine-tuning or inference.
Method: We propose Masked Audio-Video Learners (MAViL), trained with three complementary forms of self-supervision: reconstructing masked raw audio and video inputs; intra-modal and inter-modal contrastive learning with masking; and self-training to predict aligned, contextualized audio-video representations learned from the first two objectives.
Results: Experiments show that MAViL achieves state-of-the-art audio-video classification performance on AudioSet and VGGSound, surpassing recent self-supervised models and supervised models that use external labeled data. Moreover, pre-training with MAViL not only improves multimodal classification and retrieval but also improves each modality's representation in isolation, without relying on information from the other modality during uni-modal fine-tuning or inference.

We present Masked Audio-Video Learners (MAViL) to learn audio-visual representations with three complementary forms of self-supervision: (1) reconstructing masked raw audio and video inputs, (2) intra-modal and inter-modal contrastive learning with masking, and (3) self-training to predict aligned and contextualized audio-video representations learned from the first two objectives. Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1\% accuracy), surpassing recent self-supervised models and supervised models that utilize external labeled data. Notably, pre-training with MAViL not only enhances performance in multimodal classification and retrieval tasks, but it also improves the representations of each modality in isolation, without relying on information from the other modality during uni-modal fine-tuning or inference. The code and models are available at https://github.com/facebookresearch/MAViL.
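
The inter-modal contrastive objective is well captured by a symmetric InfoNCE between per-clip audio and video embeddings: matched audio-video pairs are positives, all other pairings in the batch are negatives. A generic sketch, not MAViL's exact loss (which additionally involves masking, intra-modal terms, and self-training):

```python
import torch
import torch.nn.functional as F

def inter_modal_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE between audio and video clip embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0))        # i-th audio matches i-th video
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

audio = torch.randn(8, 256)   # stand-in outputs of a masked audio encoder
video = torch.randn(8, 256)   # stand-in outputs of a masked video encoder
print(inter_modal_contrastive_loss(audio, video).item())
```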

FourierHandFlow: Neural 4D Hand Representation Using Fourier Query Flow
Jihyun Lee Junbong Jang Donghwan Kim Minhyuk Sung Tae-Kyun Kim



Research question: Existing 4D shape representations fail to effectively capture implicit correspondences between articulated shapes or to regularize jittery temporal deformations.
Motivation: To address these issues, this paper proposes FourierHandFlow, a spatio-temporally continuous hand representation that combines a 3D occupancy field with articulation-aware query flows represented as Fourier series.
Method: Given an input RGB sequence, the model learns a fixed number of Fourier coefficients for each query flow, guaranteeing smooth and continuous temporal shape dynamics. Spatio-temporal deformations of the articulated hand are modeled with two types of Fourier query flow: a pose flow and a shape flow.
Results: Experiments show that the method achieves state-of-the-art results on video-based 4D reconstruction while being computationally more efficient than existing 3D/4D implicit shape representations. We also show motion interpolation and extrapolation as well as texture transfer using the learned implicit shape correspondences. To our knowledge, FourierHandFlow is the first neural 4D continuous hand representation learned from RGB videos.

Recent 4D shape representations model continuous temporal evolution of implicit shapes by (1) learning query flows without leveraging shape and articulation priors or (2) decoding shape occupancies separately for each time value. Thus, they do not effectively capture implicit correspondences between articulated shapes or regularize jittery temporal deformations. In this work, we present FourierHandFlow, which is a spatio-temporally continuous representation for human hands that combines a 3D occupancy field with articulation-aware query flows represented as Fourier series. Given an input RGB sequence, we aim to learn a fixed number of Fourier coefficients for each query flow to guarantee smooth and continuous temporal shape dynamics. To effectively model spatio-temporal deformations of articulated hands, we compose our 4D representation based on two types of Fourier query flow: (1) pose flow that models query dynamics influenced by hand articulation changes via implicit linear blend skinning and (2) shape flow that models query-wise displacement flow. In the experiments, our method achieves state-of-the-art results on video-based 4D reconstruction while being computationally more efficient than the existing 3D/4D implicit shape representations. We additionally show our results on motion inter- and extrapolation and texture transfer using the learned correspondences of implicit shapes. To the best of our knowledge, FourierHandFlow is the first neural 4D continuous hand representation learned from RGB videos. The code will be publicly accessible.
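
The core representational idea, a query flow parameterized by a fixed number of Fourier coefficients so that temporal dynamics are smooth by construction, fits in a few lines. A sketch with randomly initialized coefficients standing in for learned ones:

```python
import math
import torch

def fourier_flow(coeffs, t, period=1.0):
    """Evaluate a Fourier-series query flow at continuous time t.

    coeffs: (K, 2, 3) tensor of (a_k, b_k) pairs per harmonic for a 3D
    displacement; the flow is
        f(t) = sum_k a_k * cos(2 pi k t / T) + b_k * sin(2 pi k t / T),
    smooth and continuous in t by construction."""
    K = coeffs.shape[0]
    k = torch.arange(1, K + 1, dtype=coeffs.dtype).view(K, 1)
    phase = 2 * math.pi * k * t / period                     # (K, 1)
    return (coeffs[:, 0] * torch.cos(phase)
            + coeffs[:, 1] * torch.sin(phase)).sum(dim=0)    # (3,)

coeffs = torch.randn(4, 2, 3) * 0.1   # 4 harmonics, learned in practice
for t in (0.0, 0.25, 0.5):            # query the flow at arbitrary times
    print(t, fourier_flow(coeffs, t))
```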

NeuralGF: Unsupervised Point Normal Estimation by Learning Neural Gradient Function
Qing Li Huifang Feng Kanle Shi Yue Gao Yi Fang Yu-Shen Liu Zhizhong Han



Research question: This paper addresses normal estimation for 3D point clouds, i.e., how to estimate oriented normals directly from point cloud data without supervision.
Motivation: State-of-the-art methods rely on priors of fitting local surfaces learned from normal supervision, which is usually unavailable for real scans, limiting their applicability. Moreover, normal orientation consistency across shapes is hard to achieve without a separate post-processing step.
Method: We propose a new paradigm for learning neural gradient functions that encourages the network to fit the input point cloud and produce unit-norm gradients at the points, so that oriented normals can be estimated directly from point clouds. Specifically, we introduce loss functions that drive query points to iteratively reach moving targets and aggregate onto the approximated surface, learning a global surface representation of the data. We also incorporate gradients into the surface approximation to measure the minimum signed deviation of queries, yielding a consistent gradient field associated with the surface.
Results: Experiments show that the method is robust to noise, outliers, and density variations, and learns more accurate normals than the latest methods on both unoriented and oriented normal estimation. The source code and pre-trained models are publicly available.

Normal estimation for 3D point clouds is a fundamental task in 3D geometry processing. The state-of-the-art methods rely on priors of fitting local surfaces learned from normal supervision. However, normal supervision in benchmarks comes from synthetic shapes and is usually not available from real scans, thereby limiting the learned priors of these methods. In addition, normal orientation consistency across shapes remains difficult to achieve without a separate post-processing procedure. To resolve these issues, we propose a novel method for estimating oriented normals directly from point clouds without using ground truth normals as supervision. We achieve this by introducing a new paradigm for learning neural gradient functions, which encourages the neural network to fit the input point clouds and yield unit-norm gradients at the points. Specifically, we introduce loss functions to facilitate query points to iteratively reach the moving targets and aggregate onto the approximated surface, thereby learning a global surface representation of the data. Meanwhile, we incorporate gradients into the surface approximation to measure the minimum signed deviation of queries, resulting in a consistent gradient field associated with the surface. These techniques lead to our deep unsupervised oriented normal estimator that is robust to noise, outliers and density variations. Our excellent results on widely used benchmarks demonstrate that our method can learn more accurate normals for both unoriented and oriented normal estimation tasks than the latest methods. The source code and pre-trained model are publicly available.
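
The unit-norm-gradient paradigm resembles an eikonal-style constraint: fit an implicit function to the points while pushing its autograd gradients toward unit norm, then read normals off the gradient field. A generic sketch of that idea, not the paper's full objective (which also includes the moving-target losses); the network size and loss weight are illustrative:

```python
import torch
import torch.nn as nn

# A tiny implicit network f(x); its gradient field serves as the normal
# estimator.
net = nn.Sequential(nn.Linear(3, 64), nn.Softplus(beta=100),
                    nn.Linear(64, 64), nn.Softplus(beta=100),
                    nn.Linear(64, 1))

points = torch.randn(128, 3, requires_grad=True)   # input point batch
f = net(points)

# Gradient of f at the points via autograd.
grad = torch.autograd.grad(f.sum(), points, create_graph=True)[0]

fit_loss = f.abs().mean()                            # points lie on f = 0
unit_loss = ((grad.norm(dim=-1) - 1.0) ** 2).mean()  # unit-norm gradients
loss = fit_loss + 0.1 * unit_loss                    # backprop as usual

normals = torch.nn.functional.normalize(grad, dim=-1)  # oriented normals
print(loss.item(), normals.shape)
```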

LuminAIRe: Illumination-Aware Conditional Image Repainting for Lighting-Realistic Generation
Jiajun Tang Haofeng Zhong Shuchen Weng Boxin Shi



Research question: This paper addresses the unrealistic lighting effects found in recent conditional image repainting (CIR) methods.
Motivation: Existing CIR methods handle lighting implausibly, so a new approach is needed.
Method: Environment lighting and 3D geometry conditions are explicitly estimated from the given background image and parsing mask using a parametric lighting representation and learning-based priors. These 3D conditions are then converted into illumination images through the proposed physically based illumination rendering and illumination attention module. Injecting the illumination images into the generation process yields repainted images with harmonized lighting effects in both the foreground and background regions.
Results: Experiments show that the repainted images surpass existing methods in lighting quality, validated on the newly collected Car-LuminAIRe dataset with lighting annotations and rich appearance variants.

We present the ilLumination-Aware conditional Image Repainting (LuminAIRe) task to address the unrealistic lighting effects in recent conditional image repainting (CIR) methods. The environment lighting and 3D geometry conditions are explicitly estimated from given background images and parsing masks using a parametric lighting representation and learning-based priors. These 3D conditions are then converted into illumination images through the proposed physically-based illumination rendering and illumination attention module. With the injection of illumination images, physically-correct lighting information is fed into the lighting-realistic generation process and repainted images with harmonized lighting effects in both foreground and background regions can be acquired, whose superiority over the results of state-of-the-art methods is confirmed through extensive experiments. For facilitating and validating the LuminAIRe task, a new dataset Car-LuminAIRe with lighting annotations and rich appearance variants is collected.

Generative Category-level Object Pose Estimation via Diffusion Models
Jiyao Zhang Mingdong Wu Hao Dong



Research question: This paper addresses the multi-hypothesis issue in category-level object pose estimation from partially observed point clouds.
Motivation: Although existing methods can estimate category-level object pose from partially observed point clouds, they face challenges; we therefore reframe the task as conditional generative modeling, departing from traditional point-to-point regression.
Method: We estimate object poses with a score-based diffusion model, sampling pose candidates from the model and aggregating them in two steps: filtering out outliers via likelihood estimation, then mean-pooling the remaining candidates. To avoid the costly integration required for likelihood estimation, we distill an energy-based model from the original score-based model, enabling end-to-end likelihood estimation.
Results: The method achieves state-of-the-art performance on the REAL275 dataset, surpassing 50% and 60% on the strict 5°2cm and 5°5cm metrics, respectively. It also generalizes to novel categories without fine-tuning and readily adapts to object pose tracking, matching current state-of-the-art baselines.

Object pose estimation plays a vital role in embodied AI and computer vision, enabling intelligent agents to comprehend and interact with their surroundings. Despite the practicality of category-level pose estimation, current approaches encounter challenges with partially observed point clouds, known as the multi-hypothesis issue. In this study, we propose a novel solution by reframing category-level object pose estimation as conditional generative modeling, departing from traditional point-to-point regression. Leveraging score-based diffusion models, we estimate object poses by sampling candidates from the diffusion model and aggregating them through a two-step process: filtering out outliers via likelihood estimation and subsequently mean-pooling the remaining candidates. To avoid the costly integration process when estimating the likelihood, we introduce an alternative method that distils an energy-based model from the original score-based model, enabling end-to-end likelihood estimation. Our approach achieves state-of-the-art performance on the REAL275 dataset, surpassing 50% and 60% on strict 5°2cm and 5°5cm metrics, respectively. Furthermore, our method demonstrates strong generalization to novel categories without the need for fine-tuning and can readily adapt to object pose tracking tasks, yielding comparable results to the current state-of-the-art baselines. Our checkpoints and demonstrations can be found at https://sites.google.com/view/genpose.

Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions
Ruihai Wu Kai Cheng Yan Zhao Chuanruo Ning Guanqi Zhan Hao Dong



Research question: How to enable home-assistant robots to perceive and manipulate 3D articulated objects in diverse environments.
Motivation: Existing studies mainly consider single-object scenarios, ignoring the realistic constraints imposed by the environment and the robot's morphology, such as occlusions and physical limitations.
Method: We propose an environment-aware affordance framework that combines object-level actionable priors with environment constraints. To combat combinatorial explosion and improve data efficiency, we introduce a novel contrastive affordance learning framework that trains on scenes containing a single occluder and generalizes to scenes with complex occluder combinations.
Results: Experiments show that the method effectively learns affordance under environment constraints.

Perceiving and manipulating 3D articulated objects in diverse environments is essential for home-assistant robots. Recent studies have shown that point-level affordance provides actionable priors for downstream manipulation tasks. However, existing works primarily focus on single-object scenarios with homogeneous agents, overlooking the realistic constraints imposed by the environment and the agent's morphology, e.g., occlusions and physical limitations. In this paper, we propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints. Unlike object-centric affordance approaches, learning environment-aware affordance faces the challenge of combinatorial explosion due to the complexity of various occlusions, characterized by their quantities, geometries, positions and poses. To address this and enhance data efficiency, we introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations. Experiments demonstrate the effectiveness of our proposed approach in learning affordance considering environment constraints.

A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation
Qitao Zhao Ce Zheng Mengyuan Liu Chen Chen



Research question: Existing 3D human pose estimation relies heavily on long-term temporal cues (i.e., a daunting number of video frames) for accuracy, incurring performance saturation, intractable computation, and non-causality.
Motivation: Plain 2D joint coordinates carry no visual cues and cannot perceive spatial context, which is the root of the problem.
Method: We leverage the intermediate visual representations produced by off-the-shelf 2D pose detectors, with no fine-tuning on the 3D task. Thanks to the regional operations in backbone networks, such representations (e.g., feature maps) implicitly encode joint-centric spatial context.
Results: Without using any temporal information, the method significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods that use up to hundreds of video frames, in both speed and accuracy.

The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues (i.e., using a daunting number of video frames) for improved accuracy, which incurs performance saturation, intractable computation and the non-causal problem. This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues. To address this issue, we propose a straightforward yet powerful solution: leveraging the $\textit{readily available}$ intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors -- no finetuning on the 3D task is even needed. The key observation is that, while the pose detector learns to localize 2D joints, such representations (e.g., feature maps) implicitly encode the joint-centric spatial context thanks to the regional operations in backbone networks. We design a simple baseline named $\textbf{Context-Aware PoseFormer}$ to showcase its effectiveness. $\textit{Without access to any temporal information}$, the proposed method significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods using up to $\textit{hundreds of}$ video frames regarding both speed and precision. $\textit{Project page:}$ https://qitaozhao.github.io/ContextAware-PoseFormer
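
The key operation, reading joint-centric context out of a frozen detector's feature maps, amounts to bilinear sampling at the 2D joint locations. A minimal sketch using grid_sample; the feature map here is a random stand-in for a real detector's intermediate representation:

```python
import torch
import torch.nn.functional as F

def sample_joint_context(feature_map, joints_2d, hw):
    """Bilinearly sample a frozen pose detector's feature map at 2D joint
    locations, yielding one context feature per joint. `joints_2d` is in
    pixel coordinates of an image of size hw = (H, W)."""
    H, W = hw
    # normalize pixel coords to [-1, 1], as grid_sample expects
    norm = torch.stack([joints_2d[..., 0] / (W - 1) * 2 - 1,
                        joints_2d[..., 1] / (H - 1) * 2 - 1], dim=-1)
    grid = norm.unsqueeze(2)                  # (B, J, 1, 2)
    out = F.grid_sample(feature_map, grid, align_corners=True)
    return out.squeeze(-1).transpose(1, 2)    # (B, J, C)

feat = torch.randn(1, 256, 64, 48)            # detector feature map (stand-in)
joints = torch.tensor([[[100.0, 150.0], [120.0, 200.0]]])  # 2 joints, pixels
print(sample_joint_context(feat, joints, hw=(256, 192)).shape)  # (1, 2, 256)
```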

Transitivity Recovering Decompositions: Interpretable and Robust Fine-Grained Relationships
Abhra Chaudhuri Massimiliano Mancini Zeynep Akata Anjan Dutta



Research question: This paper aims to deconstruct the abstraction of relational representations by expressing them as interpretable graphs over image views.
Motivation: Recent advances in fine-grained representation learning achieve state-of-the-art results by exploiting local-to-global (emergent) relationships, but the relational representations these methods rely on are abstract.
Method: We design Transitivity Recovering Decompositions (TRD), a graph-space search algorithm that identifies interpretable equivalents of abstract emergent relationships at both instance and class levels, with no post-hoc computation.
Results: Experiments show that TRD matches or exceeds state-of-the-art performance while being fully interpretable.

Recent advances in fine-grained representation learning leverage local-to-global (emergent) relationships for achieving state-of-the-art results. The relational representations relied upon by such methods, however, are abstract. We aim to deconstruct this abstraction by expressing them as interpretable graphs over image views. We begin by theoretically showing that abstract relational representations are nothing but a way of recovering transitive relationships among local views. Based on this, we design Transitivity Recovering Decompositions (TRD), a graph-space search algorithm that identifies interpretable equivalents of abstract emergent relationships at both instance and class levels, and with no post-hoc computations. We additionally show that TRD is provably robust to noisy views, with empirical evidence also supporting this finding. The latter allows TRD to perform at par or even better than the state-of-the-art, while being fully interpretable. Implementation is available at https://github.com/abhrac/trd.

Rank-DETR for High Quality Object Detection
Yifan Pu Weicong Liang Yiduo Hao Yuhui Yuan Yukang Yang Chao Zhang Han Hu Gao Huang



Research question: In existing DETR models, the misalignment between classification scores and localization accuracy leaves top-ranked bounding-box predictions with low localization quality.
Motivation: Improving the localization accuracy of DETR models and reducing false positives calls for a rank-oriented detector design.
Method: We propose Rank-DETR, a simple and highly performant DETR-based object detector comprising (i) a rank-oriented architecture design that promotes positive predictions and suppresses negative ones, and (ii) a rank-oriented loss function and matching cost design that prioritizes predictions with more accurate localization during ranking, boosting AP under high IoU thresholds.
Results: Applying the method to recent SOTA methods (e.g., H-DETR and DINO-DETR) yields strong COCO object detection results across different backbones (ResNet-50, Swin-T, Swin-L), demonstrating its effectiveness.

Modern detection transformers (DETRs) use a set of object queries to predict a list of bounding boxes, sort them by their classification confidence scores, and select the top-ranked predictions as the final detection results for the given input image. A highly performant object detector requires accurate ranking for the bounding box predictions. For DETR-based detectors, the top-ranked bounding boxes suffer from less accurate localization quality due to the misalignment between classification scores and localization accuracy, thus impeding the construction of high-quality detectors. In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs, combinedly called Rank-DETR. Our key contributions include: (i) a rank-oriented architecture design that can prompt positive predictions and suppress the negative ones to ensure lower false positive rates, as well as (ii) a rank-oriented loss function and matching cost design that prioritizes predictions of more accurate localization accuracy during ranking to boost the AP under high IoU thresholds. We apply our method to improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong COCO object detection results when using different backbones such as ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our approach. Code is available at \url{https://github.com/LeapLabTHU/Rank-DETR}.

Greatness in Simplicity: Unified Self-Cycle Consistency for Parser-Free Virtual Try-On
Chenghu Du junyin Wang Shuqing Liu Shengwu Xiong



Research question: Image-based virtual try-on is challenging because of the complexity of modeling non-rigid garment deformation and the strong entanglement of clothing features within human body features.
Motivation: Existing methods disentangle clothing from body features through auxiliary tasks (e.g., leveraging "teacher knowledge" and dual generators), where irresponsible prior knowledge can bottleneck the main generator in the downstream task. Moreover, existing garment deformation methods cannot perceive the real-world correlation between garment and body, leading to unrealistic alignment.
Method: We propose USC-PFN, a parser-free virtual try-on network based on unified self-cycle consistency that achieves robust translation between different garments with just a single generator, faithfully reproducing real-life non-rigid geometric garment deformation. Specifically, we first propose a self-cycle consistency architecture with a circular mode that trains exclusively on real unpaired garment-person images, effectively eliminating irresponsible prior knowledge at the model's input end. We additionally formulate a Markov Random Field to simulate more natural and realistic garment deformation.
Results: Experiments show that the method achieves state-of-the-art performance on a popular virtual try-on benchmark.

Image-based virtual try-on tasks remain challenging, primarily due to inherent complexities associated with non-rigid garment deformation modeling and strong feature entanglement of clothing within human body. Recent groundbreaking formulations, such as in-painting, cycle consistency, and knowledge distillation, have facilitated self-supervised generation of try-on images. However, these paradigms necessitate the disentanglement of garment features within human body features through auxiliary tasks, such as leveraging 'teacher knowledge' and dual generators. The potential presence of irresponsible prior knowledge in the auxiliary task can serve as a significant bottleneck for the main generator (e.g., 'student model') in the downstream task. Moreover, existing garment deformation methods lack the ability to perceive the correlation between the garment and the human body in the real world, leading to unrealistic alignment effects. To tackle these limitations, we present a new parser-free virtual try-on network based on unified self-cycle consistency (USC-PFN), which enables robust translation between different garments using just a single generator, faithfully replicating non-rigid geometric deformation of garments in real-life scenarios. Specifically, we first propose a self-cycle consistency architecture with a circular mode. It utilizes real unpaired garment-person images exclusively as input for training, effectively eliminating the impact of irresponsible prior knowledge at the model input end. Additionally, we formulate a Markov Random Field to simulate a more natural and realistic garment deformation. Furthermore, USC-PFN can leverage a general generator for self-supervised cycle training. Experiments demonstrate that our method achieves state-of-the-art performance on a popular virtual try-on benchmark.

FaceComposer: A Unified Model for Versatile Facial Content Creation
Jiayu Wang Kang Zhao Yifeng Ma Shiwei Zhang Yingya Zhang Yujun Shen Deli Zhao Jingren Zhou



Research question: Develop a unified generative model for a variety of facial content creation tasks.
Motivation: Existing face generation models cannot cover diverse facial content creation needs such as text-conditioned face synthesis, text-guided face editing, and face animation.
Method: Built on the latent diffusion framework, FaceComposer adopts a compositional generation paradigm and employs multiple face-specific conditions (e.g., identity features and Projected Normalized Coordinate Code) to unleash the model's creativity. Existing face image datasets are cleaned and about 500 hours of talking-face video are collected, forming a high-quality, large-scale multi-modal face database. A temporal self-attention module is introduced into the U-Net structure so the model can learn the denoising process on a mixture of images and videos.
Results: Experiments show the method matches or surpasses the state of the art on each individual task and supports some combined tasks in a single forward pass, demonstrating its potential as a foundation generative model for the face domain. An interface is also developed so users can create, edit, and animate their own characters with a one-stop service. Code, dataset, models, and the interface will be publicly released.

This work presents FaceComposer, a unified generative model that accomplishes a variety of facial content creation tasks, including text-conditioned face synthesis, text-guided face editing, face animation, etc. Based on the latent diffusion framework, FaceComposer follows the paradigm of compositional generation and employs diverse face-specific conditions, e.g., Identity Feature and Projected Normalized Coordinate Code, to unleash the model's creativity as much as possible. To support text control and animation, we clean up some existing face image datasets and collect around 500 hours of talking-face videos, forming a high-quality large-scale multi-modal face database. A temporal self-attention module is incorporated into the U-Net structure, which allows learning the denoising process on a mixture of images and videos. Extensive experiments suggest that our approach not only achieves comparable or even better performance than the state of the art on each single task, but also facilitates some combined tasks in a single forward pass, demonstrating its potential to serve as a foundation generative model in the face domain. We further develop an interface such that users can enjoy our one-step service to create, edit, and animate their own characters. Code, dataset, model, and interface will be made publicly available.

IEBins: Iterative Elastic Bins for Monocular Depth Estimation
Shuwei Shao Zhongcai Pei Xingming Wu Zhong Liu Weihai Chen Zhengguo Li



Research question: Monocular depth estimation is a fundamental topic in computer vision and a core technique for many downstream applications.
Motivation: Recent methods reformulate monocular depth estimation as a classification-regression problem, predicting depth as a linear combination of a probability distribution and bin centers.
Method: Propose a novel concept of iterative elastic bins (IEBins) for classification-regression-based monocular depth estimation. IEBins search for high-quality depth by progressively optimizing the search range over multiple stages, each performing a finer-grained depth search within the target bin of the previous stage.
Results: Extensive experiments on KITTI, NYU-Depth-v2, and SUN RGB-D show the method surpasses prior state-of-the-art competitors. Source code is publicly available at https://github.com/ShuweiShao/IEBins.

Monocular depth estimation (MDE) is a fundamental topic of geometric computer vision and a core technique for many downstream applications. Recently, several methods reframe the MDE as a classification-regression problem where a linear combination of probabilistic distribution and bin centers is used to predict depth. In this paper, we propose a novel concept of iterative elastic bins (IEBins) for classification-regression-based MDE. The proposed IEBins search for high-quality depth by progressively optimizing the search range over multiple stages, each performing a finer-grained depth search within the target bin of the previous stage. To alleviate possible error accumulation during the iterative process, we utilize a novel elastic target bin to replace the original target bin, whose width is adjusted elastically based on the depth uncertainty. Furthermore, we develop a dedicated framework composed of a feature extractor and an iterative optimizer that has powerful temporal context modeling capabilities benefiting from the GRU-based architecture. Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is publicly available at https://github.com/ShuweiShao/IEBins.
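
The staged search with an uncertainty-widened ("elastic") target bin can be made concrete with a small sketch; the bin counts, depth range, and the widening rule below are our assumptions, with `prob_fn` standing in for the network's per-bin probabilities:

```python
import numpy as np

def iterative_elastic_bins(prob_fn, d_min=1e-3, d_max=80.0, n_bins=16, n_stages=3):
    # Each stage discretizes the current search range into bins, estimates
    # depth as the probability-weighted sum of bin centers, then zooms into
    # a target interval whose half-width is relaxed ("elastic") to the larger
    # of the bin half-width and the predicted depth uncertainty.
    lo, hi = d_min, d_max
    depth = 0.5 * (lo + hi)
    for _ in range(n_stages):
        centers = np.linspace(lo, hi, n_bins)
        p = prob_fn(centers)                                         # per-bin probabilities
        depth = float((p * centers).sum())                           # classification-regression
        sigma = float(np.sqrt((p * (centers - depth) ** 2).sum()))   # depth uncertainty
        half = max((hi - lo) / n_bins / 2, sigma)                    # elastic half-width
        lo, hi = max(d_min, depth - half), min(d_max, depth + half)
    return depth
```

In the paper the per-bin probabilities come from the network and a GRU-based iterative optimizer carries state across stages; the sketch only shows the bin-refinement logic.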

Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment
Zihui Xue Kristen Grauman



Research question: How to represent human activity consistently across viewpoints, for applications in robotics and augmented reality.
Motivation: Existing methods require synchronized, paired viewpoint data to learn view-invariant features, which limits their applicability.
Method: Propose AE2, a self-supervised embedding approach that temporally aligns egocentric and exocentric videos, learning fine-grained action features even when the videos were not captured simultaneously or in the same environment.
Results: Evaluations on four datasets show that AE2 outperforms prior work on a variety of fine-grained downstream tasks, in both regular and cross-view settings.

The egocentric and exocentric viewpoints of a human activity look dramatically different, yet invariant representations to link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time, even when not captured simultaneously or in the same environment. To this end, we propose AE2, a self-supervised embedding approach with two key designs: (1) an object-centric encoder that explicitly focuses on regions corresponding to hands and active objects; (2) a contrastive-based alignment objective that leverages temporally reversed frames as negative samples. For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets---including an ego tennis forehand dataset we collected, along with dense per-frame labels we annotated for each dataset. On the four datasets, our AE2 method strongly outperforms prior work in a variety of fine-grained downstream tasks, both in regular and cross-view settings.
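
The second design, a contrastive alignment objective that uses temporally reversed frames as negatives, can be sketched as follows; the loss form, shapes, and normalization are our assumptions in the spirit of the abstract, not the released objective:

```python
import torch
import torch.nn.functional as F

def reversed_frame_contrastive_loss(ego_feats, exo_feats, tau=0.1):
    # ego_feats, exo_feats: (T, D) per-frame embeddings of temporally
    # aligned ego/exo videos, assumed L2-normalized. Aligned pairs are
    # positives; the time-reversed exo sequence supplies negatives.
    pos = (ego_feats * exo_feats).sum(-1) / tau            # aligned pairs
    neg = (ego_feats * exo_feats.flip(0)).sum(-1) / tau    # reversed-time negatives
    logits = torch.stack([pos, neg], dim=-1)               # (T, 2)
    labels = torch.zeros(len(ego_feats), dtype=torch.long) # positive at index 0
    return F.cross_entropy(logits, labels)
```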

CoDet: Co-occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Chuofan Ma Yi Jiang Xin Wen Zehuan Yuan XIAOJUAN QI



Research question: How to reliably derive region-word alignment from image-text pairs in order to learn object-level vision-language representations for open-vocabulary object detection.
Motivation: Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which limits localization accuracy or generalization.
Method: Propose CoDet, a novel approach that removes the dependence on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem.
Results: Extensive experiments show superior performance and compelling scalability in open-vocabulary detection; e.g., by scaling up the visual backbone, CoDet achieves 37.0 $AP^m_{novel}$ and 44.7 $AP^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $AP^m_{novel}$ and 9.8 $AP^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $AP^m_{novel}$ and 44.7 $AP^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $AP^m_{novel}$ and 9.8 $AP^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
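
The grouping step described above is easy to picture with a toy sketch; `concept_vocab`, the whitespace tokenization, and the dictionary layout are illustrative assumptions:

```python
from collections import defaultdict

def group_by_shared_concept(captions, concept_vocab):
    # captions: {image_id: caption string}; concept_vocab: assumed set of
    # concept words. Images whose captions mention the same concept are
    # bucketed together, so objects of that concept co-occur in each group;
    # visual similarity within a group then localizes the shared object.
    groups = defaultdict(list)
    for img_id, caption in captions.items():
        for word in set(caption.lower().split()):
            if word in concept_vocab:
                groups[word].append(img_id)
    return groups
```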

NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF
Stefan Lionar Xiangyu Xu Min Lin Gim Hee Lee



Research question: Address two main limitations of current single-view RGB-D 3D reconstruction: 1) Transformer decoders are inefficient when handling a large number of query points; 2) the 3D representation struggles to recover high-fidelity details.
Motivation: MCC, the current state-of-the-art method, achieves unprecedented success in single-view RGB-D 3D reconstruction by combining vision Transformers with large-scale training, but it suffers from the two limitations above.
Method: Propose NU-MCC with two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). The Neighborhood decoder introduces center points as an efficient proxy for the input visual features so that each query point attends only to a small neighborhood; this design yields much faster inference and exploits finer-scale visual features for improved recovery of 3D textures. The Repulsive UDF is a novel alternative to the occupancy field used in MCC and significantly improves the quality of 3D object reconstruction.
Results: Experiments show that NU-MCC learns a strong 3D representation and substantially advances the state of the art in single-view 3D reconstruction. On the CO3D-v2 dataset, NU-MCC outperforms MCC by 9.7% in F1-score while running more than 5x faster.

Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC is the current state-of-the-art method in this field, which achieves unprecedented success by combining vision Transformers with large-scale training. However, we identified two key limitations of MCC: 1) The Transformer decoder is inefficient in handling large number of query points; 2) The 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy of input visual features, allowing each query point to only attend to a small neighborhood. This design not only results in much faster inference speed but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs that suffer from holes in results, our proposed Repulsive UDF can achieve more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. Particularly, it outperforms MCC by 9.7% in terms of the F1-score on the CO3D-v2 dataset with more than 5x faster running speed.

D$^2$CSG: Unsupervised Learning of Compact CSG Trees with Dual Complements and Dropouts
Fenggen Yu Qimin Chen Maham Tanveer Ali Mahdavi Amiri Hao Zhang



Research question: How to learn compact constructive solid geometry (CSG) representations of 3D CAD shapes with a neural network, in an unsupervised manner.
Motivation: Existing neural CSG models cannot handle complex and high-genus CAD shapes effectively; a more capable way to learn and reconstruct such shapes is needed.
Method: Propose D$^2$CSG, a novel neural model composed of two complementary network branches that reconstructs a 3D shape via a fixed-order assembly of quadric primitives. The model has a dedicated residual branch to assemble the potentially complex shape complement, and weight dropout further removes redundant primitives to improve the compactness of the CSG tree.
Results: Experiments show that D$^2$CSG produces more compact, higher-quality, and more natural CSG reconstructions, clearly outperforming all existing methods, especially on complex and high-genus CAD shapes.

We present D$^2$CSG, a neural model composed of two dual and complementary network branches, with dropouts, for unsupervised learning of compact constructive solid geometry (CSG) representations of 3D CAD shapes. Our network is trained to reconstruct a 3D shape by a fixed-order assembly of quadric primitives, with both branches producing a union of primitive intersections or inverses. A key difference between D$^2$CSG and all prior neural CSG models is its dedicated residual branch to assemble the potentially complex shape complement, which is subtracted from an overall shape modeled by the cover branch. With the shape complements, our network is provably general, while the weight dropout further improves compactness of the CSG tree by removing redundant primitives. We demonstrate both quantitatively and qualitatively that D$^2$CSG produces compact CSG reconstructions with superior quality and more natural primitives than all existing alternatives, especially over complex and high-genus CAD shapes.

FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
Mingyuan Zhang Huirong Li Zhongang Cai Jiawei Ren Lei Yang Ziwei Liu



Research question: Existing text-driven motion generation methods struggle to produce complex motion sequences from fine-grained descriptions, limiting their use by a broader audience.
Motivation: To address these challenges, FineMoGen is proposed, a diffusion-based motion generation and editing framework that synthesizes fine-grained motion from fine-grained user instructions.
Method: FineMoGen builds on a novel transformer architecture, Spatio-Temporal Mixture Attention (SAMI), which optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) using sparsely-activated mixture-of-experts to adaptively extract fine-grained features.
Results: Extensive experiments validate that FineMoGen surpasses state-of-the-art methods in motion generation quality. Notably, with the help of modern large language models (LLMs), FineMoGen enables zero-shot motion editing that faithfully manipulates motion sequences according to fine-grained instructions.

Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatio-temporal composition according to user instructions. Specifically, FineMoGen builds upon a diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) utilizing sparsely-activated mixture-of-experts to adaptively extract fine-grained features. To facilitate a large-scale study on this new fine-grained motion generation task, we contribute the HuMMan-MoGen dataset, which consists of 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. Extensive experiments validate that FineMoGen exhibits superior motion generation quality over state-of-the-art methods. Notably, FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models (LLMs), which faithfully manipulates motion sequences with fine-grained instructions.

VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang Hangjie Yuan Shiwei Zhang Dayou Chen Jiuniu Wang Yingya Zhang Yujun Shen Deli Zhao Jingren Zhou



Research question: How to achieve controllable video synthesis, particularly under large variations in temporal dynamics and the requirement of cross-frame temporal consistency.
Motivation: Customizable image synthesis has made remarkable progress toward more controllable visual content creation, but video synthesis remains challenging.
Method: Propose VideoComposer, built on the compositional generation paradigm, which allows users to flexibly compose a video with textual conditions, spatial conditions, and, more importantly, temporal conditions. Specifically, given the characteristics of video data, motion vectors from compressed videos are introduced as an explicit control signal that guides temporal dynamics. A Spatio-Temporal Condition encoder (STC-encoder) is further developed as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, so the model makes better use of temporal conditions and achieves higher inter-frame consistency.
Results: Experiments show that VideoComposer can simultaneously control the spatial and temporal patterns of a synthesized video through text descriptions, sketch sequences, reference videos, or even simple hand-crafted motions. Code and models are publicly available at https://videocomposer.github.io.

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer that allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions. Specifically, considering the characteristic of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model could make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. The code and models are publicly available at https://videocomposer.github.io.

DäRF: Boosting Radiance Fields from Sparse Input Views with Monocular Depth Adaptation
Jiuhn Song Seonghoon Park Honggyu An Seokju Cho Min-Seop Kwak Sungjin Cho Seungryong Kim



Research question: Existing neural radiance field (NeRF) models degrade severely when the number of known viewpoints is drastically reduced.
Motivation: Prior attempts address this with external priors, which work only for certain types of scenes or datasets. A monocular depth estimation (MDE) network pretrained on large-scale RGB-D datasets may therefore be the key to solving the problem.
Method: Propose DäRF, a novel framework that combines the strengths of NeRF and monocular depth estimation through online complementary training, achieving robust NeRF reconstruction from a handful of real-world images.
Results: Experiments show state-of-the-art results on both indoor and outdoor real-world datasets, quantitatively and qualitatively, with consistent and reliable performance.

Neural radiance field (NeRF) shows powerful performance in novel view synthesis and 3D geometry reconstruction, but it suffers from critical performance degradation when the number of known viewpoints is drastically reduced. Existing works attempt to overcome this problem by employing external priors, but their success is limited to certain types of scenes or datasets. Employing monocular depth estimation (MDE) networks, pretrained on large-scale RGB-D datasets, with powerful generalization capability may be a key to solving this problem: however, using MDE in conjunction with NeRF comes with a new set of challenges due to various ambiguity problems exhibited by monocular depths. In this light, we propose a novel framework, dubbed DäRF, that achieves robust NeRF reconstruction with a handful of real-world images by combining the strengths of NeRF and monocular depth estimation through online complementary training. Our framework imposes the MDE network's powerful geometry prior on the NeRF representation at both seen and unseen viewpoints to enhance its robustness and coherence. In addition, we overcome the ambiguity problems of monocular depths through patch-wise scale-shift fitting and geometry distillation, which adapts the MDE network to produce depths aligned accurately with NeRF geometry. Experiments show our framework achieves state-of-the-art results both quantitatively and qualitatively, demonstrating consistent and reliable performance in both indoor and outdoor real-world datasets.

Segment Anything in 3D with NeRFs
Jiazhong Cen Zanwei Zhou Jiemin Fang chen yang Wei Shen Lingxi Xie Dongsheng Jiang XIAOPENG ZHANG Qi Tian



Research question: How to extend SAM, a powerful 2D vision foundation model, to 3D object segmentation.
Motivation: Avoid replicating the costly data acquisition and annotation process in 3D by leveraging the Neural Radiance Field (NeRF) as a cheap, off-the-shelf prior that connects multi-view 2D images to 3D space.
Method: Propose SA3D (Segment Anything in 3D). Only a manual segmentation prompt (e.g., rough points) for the target object in a single view is required; SAM uses it to generate the object's 2D mask in that view. SA3D then alternates between mask inverse rendering and cross-view self-prompting across views to iteratively complete the target object's 3D mask, built on voxel grids: the former projects the 2D mask obtained by SAM in the current view onto the 3D mask using the density distribution learned by the NeRF; the latter automatically extracts reliable prompts from the NeRF-rendered 2D mask in another view as input to SAM.
Results: Experiments show that SA3D adapts to various scenes and completes 3D segmentation within minutes. The study offers a generic, efficient way to lift 2D vision foundation models to 3D, as long as the 2D model can stably handle promptable segmentation across multiple views.

Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model capable of segmenting anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution, leveraging the Neural Radiance Field (NeRF) as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. It only requires a manual segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its 2D mask in that view with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively complete the 3D mask of the target object constructed with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto the 3D mask with the guidance of the density distribution learned by the NeRF; the latter automatically extracts reliable prompts, as the input to SAM, from the NeRF-rendered 2D mask in another view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research offers a generic and efficient methodology to lift a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views.

MultiModN—Multimodal, Multi-Task, Interpretable Modular Networks
Vinitra Swamy Malika Satayeva Jibril Frej Thierry Bossy Thijs Vogels Martin Jaggi Tanja Käser Mary-Anne Hartley



Research question: How to predict multiple real-world tasks in a single model, which requires a particularly diverse feature space.
Motivation: Current multimodal (MM) models are limited when fusing representations of different data types, e.g., by poor interpretability and dependence on modality availability.
Method: Propose MultiModN, a multimodal, modular network that sequentially fuses latent representations over any number, combination, or type of modalities while providing fine-grained, real-time predictive feedback for any number or combination of prediction tasks.
Results: Experiments on benchmark multimodal datasets over 10 real-world tasks show that MultiModN's sequential fusion does not fall behind parallel-fusion baselines. By simulating the challenging bias of missing not-at-random (MNAR) data, the work shows that, unlike parallel-fusion baselines, MultiModN does not erroneously learn MNAR and is far more robust when faced with different MNAR patterns. It is the first inherently MNAR-resistant approach to multimodal modeling.

Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (i.e. images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance.
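
A minimal sketch of the sequential fusion idea, assuming linear per-modality encoders and one-dimensional task heads (the sizes and module choices are ours, not the released MultiModN code):

```python
import torch
import torch.nn as nn

class SequentialFusion(nn.Module):
    # One small encoder per modality updates a shared state in sequence;
    # task heads read the state after every step, so predictions are
    # available after any prefix of modalities.
    def __init__(self, mod_dims, state_dim=64, n_tasks=2):
        super().__init__()
        self.state_dim = state_dim
        self.encoders = nn.ModuleList(
            nn.Linear(d + state_dim, state_dim) for d in mod_dims)
        self.heads = nn.ModuleList(
            nn.Linear(state_dim, 1) for _ in range(n_tasks))

    def forward(self, inputs, batch_size):
        # inputs: list aligned with encoders; None marks a missing modality
        state = torch.zeros(batch_size, self.state_dim)
        per_step = []
        for enc, x in zip(self.encoders, inputs):
            if x is not None:  # a missing modality leaves the state unchanged
                state = torch.tanh(enc(torch.cat([x, state], dim=-1)))
            per_step.append(torch.cat([h(state) for h in self.heads], dim=-1))
        return per_step  # list of (batch, n_tasks) predictions, one per step
```

Because the state is updated one modality at a time, a missing input simply skips its encoder, which is what makes this style of design robust to modality availability.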

Towards Label-free Scene Understanding by Vision Foundation Models
Runnan Chen Youquan Liu Lingdong Kong Nenglun Chen Xinge ZHU Yuexin Ma Tongliang Liu Wenping Wang



Research question: Explore the use of vision foundation models for label-free scene understanding.
Motivation: Although vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) excel at image classification and segmentation, their potential for label-free scene understanding has not yet been explored.
Method: Propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. A prediction consistency regularization co-trains the 2D and 3D networks, and SAM's robust feature representation is then used to further enforce consistency of the networks' latent spaces.
Results: Experiments on diverse indoor and outdoor datasets show superior performance in understanding open 2D and 3D environments. The 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and 7.9% respectively; on the nuImages and nuScenes datasets, performance reaches 22.1% and 26.8%, improvements of 3.5% and 6.0% respectively.

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, the incorporation of CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and further exacerbated during the propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train 2D and 3D networks, then further impose the networks' latent space consistency using the SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4\% and 33.5\% mIoU on ScanNet, improvements of 4.7\% and 7.9\%, respectively. For the nuImages and nuScenes datasets, the performance is 22.1\% and 26.8\%, with improvements of 3.5\% and 6.0\%, respectively. Code is available at https://github.com/runnanchen/Label-Free-Scene-Understanding.

Aligning Gradient and Hessian for Neural Signed Distance Function
Ruian Wang Zixiong Wang Yunxiao Zhang Shuangmin Chen Shiqing Xin Changhe Tu Wenping Wang



Research question: How to learn a signed distance function (SDF) directly from unorganized point clouds in order to reconstruct a watertight surface.
Motivation: For a smooth surface there exists a thin-shell space in which the SDF is differentiable everywhere, such that the gradient of the SDF is an eigenvector of its Hessian matrix with a corresponding eigenvalue of zero. The method builds on this observation: aligning the gradient and the Hessian of the SDF offers a more effective mechanism to govern gradient directions, so that gradient changes reflect the true variations in shape more accurately.
Method: Propose learning the SDF directly from unorganized point clouds without normal information. By aligning the gradient and the Hessian, gradient directions are controlled more effectively, ensuring that gradient changes accurately reflect the true shape variations.
Results: Extensive experiments show the method accurately recovers the underlying shape while effectively suppressing ghost geometry.

The Signed Distance Function (SDF), as an implicit surface representation, provides a crucial method for reconstructing a watertight surface from unorganized point clouds. The SDF has a fundamental relationship with the principles of surface vector calculus. Given a smooth surface, there exists a thin-shell space in which the SDF is differentiable everywhere such that the gradient of the SDF is an eigenvector of its Hessian matrix, with a corresponding eigenvalue of zero. In this paper, we introduce a method to directly learn the SDF from point clouds in the absence of normals. Our motivation is grounded in a fundamental observation: aligning the gradient and the Hessian of the SDF provides a more efficient mechanism to govern gradient directions. This, in turn, ensures that gradient changes more accurately reflect the true underlying variations in shape. Extensive experimental results demonstrate its ability to accurately recover the underlying shape while effectively suppressing the presence of ghost geometry.
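
The alignment condition stated above, the SDF gradient being a zero-eigenvalue eigenvector of its Hessian, suggests a loss that penalizes the Hessian-vector product H·g. The following PyTorch sketch is our reading of that condition under assumed shapes, not the authors' training code:

```python
import torch

def hessian_alignment_loss(sdf, points):
    # points: (N, 3) samples in the thin-shell space around the surface;
    # sdf: a differentiable network mapping (N, 3) -> (N,) signed distances.
    points = points.detach().requires_grad_(True)
    f = sdf(points)
    g = torch.autograd.grad(f.sum(), points, create_graph=True)[0]   # gradient (N, 3)
    # Hessian-vector product H g: differentiate g . v with v = g detached,
    # i.e. treated as a constant vector.
    Hg = torch.autograd.grad((g * g.detach()).sum(), points,
                             create_graph=True)[0]                   # (N, 3)
    return Hg.norm(dim=-1).mean()  # zero when g is a zero-eigenvalue eigenvector
```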

GNeSF: Generalizable Neural Semantic Fields
Hanlin Chen Chen Li Mengqi Guo Zhiwen Yan Gim Hee Lee



Research question: Existing 3D scene segmentation methods require expensive per-scene optimization, which prevents generalization to novel scenes at inference time.
Motivation: To address this, a generalizable 3D segmentation framework based on implicit representation is proposed.
Method: The framework takes multi-view image features and semantic maps as input, rather than relying only on spatial information, to avoid overfitting to scene-specific geometry and semantics. A novel soft voting mechanism is proposed to aggregate 2D semantic information from different views.
Results: Experiments show performance comparable to scene-specific methods, in some cases even surpassing strongly supervised methods while using only 2D annotations.

3D scene segmentation based on neural implicit representation has emerged recently with the advantage of training only on 2D supervision. However, existing approaches still require expensive per-scene optimization, which prohibits generalization to novel scenes during inference. To circumvent this problem, we introduce a \textit{generalizable} 3D segmentation framework based on implicit representation. Specifically, our framework takes in multi-view image features and semantic maps as the inputs instead of only spatial information to avoid overfitting to scene-specific geometric and semantic information. We propose a novel soft voting mechanism to aggregate the 2D semantic information from different views for each 3D point. In addition to the image features, view difference information is also encoded in our framework to predict the voting scores. Intuitively, this allows the semantic information from nearby views to contribute more than that from distant ones. Furthermore, a visibility module is also designed to detect and filter out detrimental information from occluded views. Due to the generalizability of our proposed method, we can synthesize semantic maps or conduct 3D semantic segmentation for novel scenes with solely 2D semantic supervision. Experimental results show that our approach achieves comparable performance with scene-specific approaches. More importantly, our approach can even outperform existing strong supervision-based approaches with only 2D annotations.
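
The soft voting mechanism, including the visibility filtering, reduces to a masked, weighted average; the shapes below are illustrative assumptions:

```python
import torch

def soft_vote(view_logits, vote_scores, visible):
    # view_logits: (V, N, C) per-view semantic logits for N 3D points;
    # vote_scores: (V, N) scores predicted from image features and view
    #              difference information;
    # visible:     (V, N) boolean mask from the visibility module.
    # At least one view is assumed visible per point.
    w = vote_scores.masked_fill(~visible, float('-inf')).softmax(dim=0)
    return (w.unsqueeze(-1) * view_logits).sum(dim=0)  # (N, C) fused logits
```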

Type-to-Track: Retrieve Any Object via Prompt-based Tracking
Pha Nguyen Kha Gia Quach Kris M. Kitani Khoa Luu



Research question: How to track objects in videos through natural-language descriptions.
Motivation: Overcome the limitations of traditional methods that rely on bounding boxes or category annotations.
Method: Propose Type-to-Track, a new paradigm in which users track objects in videos by typing natural-language descriptions. A new dataset, GroOT, is constructed, containing videos with various types of objects and detailed textual captions describing their appearance and actions. A transformer-based eMbed-ENcoDE-extRact framework (MENDER) is developed using third-order tensor decomposition.
Results: Experiments in five scenarios show that MENDER outperforms a two-stage design in accuracy and efficiency, with up to 14.7% higher accuracy and 4x faster speed.

One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for that Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions describing their appearance and action in detail. Additionally, we introduce two new evaluation protocols and formulate evaluation metrics specifically for this task. We develop a new efficient method that models a transformer-based eMbed-ENcoDE-extRact framework (MENDER) using the third-order tensor decomposition. The experiments in five scenarios show that our MENDER approach outperforms a two-stage design in terms of accuracy and efficiency, with up to 14.7\% higher accuracy and $4\times$ faster speed.

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization
Minghua Liu Chao Xu Haian Jin Linghao Chen Mukund Varma T Zexiang Xu Hao Su



Research question: Single-image 3D reconstruction is an important but challenging task, requiring extensive knowledge of the natural world.
Motivation: Many existing methods optimize a neural radiance field under the guidance of 2D diffusion models, but they suffer from long optimization times, inconsistent 3D results, and poor geometry.
Method: Propose a novel method that takes a single image of any object as input and generates a complete 360-degree 3D textured mesh in a single feed-forward pass. For the input view, the view-conditioned 2D diffusion model Zero123 first generates multi-view images, which are then lifted to 3D space. Because traditional reconstruction methods struggle with inconsistent multi-view predictions, the 3D reconstruction module is built on an SDF-based generalizable neural surface reconstruction method, with several key training strategies that enable the reconstruction of 360-degree meshes.
Results: Without costly optimization, the method reconstructs 3D shapes in significantly less time than existing methods. It also produces better geometry and more consistent 3D results, and adheres more closely to the input image. Evaluated on both synthetic data and in-the-wild images, it shows superiority in mesh quality and runtime, and it seamlessly supports the text-to-3D task by integrating off-the-shelf text-to-image diffusion models.

Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization times, 3D-inconsistent results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
Minghua Liu Ruoxi Shi Kaiming Kuang Yinhao Zhu Xuanlin Li Shizhong Han Hong Cai Fatih Porikli Hao Su



Research question: How to learn multi-modal joint representations from large-scale text, image, and point-cloud data for open-world 3D shape understanding.
Motivation: Existing multi-modal contrastive learning frameworks struggle to scale up 3D representations and have limited ability to handle noisy text descriptions.
Method: OpenShape scales up training data by ensembling multiple 3D datasets, proposes strategies to automatically filter and enrich noisy text descriptions, and explores ways to scale 3D backbone networks. A novel hard-negative mining module is also introduced for more efficient training.
Results: OpenShape shows superior capability on zero-shot 3D classification benchmarks, e.g., 46.8% zero-shot accuracy on the Objaverse-LVIS benchmark versus less than 10% for existing methods. On ModelNet40 it reaches 85.3% accuracy, 20% above previous zero-shot baselines and on par with some fully supervised methods. The learned embeddings further encode a wide range of visual and semantic concepts and enable fine-grained text-3D and image-3D interactions.

We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.
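
The representation-alignment backbone the abstract refers to is the standard multi-modal contrastive framework; a minimal CLIP-style sketch, assuming batch-index labels and L2-normalized embeddings, looks like this (OpenShape's contributions such as dataset ensembling, text enrichment, and hard-negative mining sit on top of it):

```python
import torch
import torch.nn.functional as F

def trimodal_contrastive_loss(pc_emb, img_emb, txt_emb, tau=0.07):
    # pc_emb, img_emb, txt_emb: (B, D) L2-normalized embeddings of the same
    # B shapes. Each 3D embedding is aligned with the image and text
    # embeddings of its own shape via symmetric InfoNCE.
    labels = torch.arange(pc_emb.shape[0])
    loss = 0.0
    for other in (img_emb, txt_emb):
        logits = pc_emb @ other.T / tau      # (B, B) similarity matrix
        loss = loss + 0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.T, labels))
    return loss / 2  # average over the two modality pairs
```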

TransHP: Image Classification with Hierarchical Prompting
Wenhao Wang Yifan Sun Wei Li Yi Yang



Research question: Explore a hierarchical prompting mechanism for the hierarchical image classification (HIC) task.
Motivation: Unlike prior HIC methods, this hierarchical prompting is the first to explicitly inject ancestor-class information into the model as a tokenized hint that benefits descendant-class discrimination; this imitation of human visual recognition may be more conducive to classification accuracy.
Method: Model the prompting mechanism as a Transformer with Hierarchical Prompting (TransHP), which has three steps: 1) learn a set of prompt tokens to represent the coarse (ancestor) classes; 2) predict the coarse class of the input image on the fly at an intermediate block; 3) inject the prompt token of the predicted coarse class into the intermediate features. Although TransHP keeps the same parameters for all input images, the injected coarse-class prompt conditions and modifies the subsequent feature extraction, encouraging a dynamic focus on the relatively subtle differences among descendant classes.
Results: Extensive experiments show that TransHP improves accuracy (e.g., +2.83% ImageNet classification accuracy for ViT-B/16), training-data efficiency (e.g., +12.69% improvement with only 10% of the ImageNet training data), and model explainability. TransHP also compares favorably with prior HIC methods, demonstrating good use of hierarchical information.

This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Different from prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits the descendant-class discrimination. We think it well imitates human visual recognition, i.e., humans may use the ancestor class as a prompt to draw focus on the subtle differences among descendant classes. We model this prompting mechanism into a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) on-the-fly predicting the coarse class of the input image at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Though the parameters of TransHP maintain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on relatively subtle differences among the descendant classes. Extensive experiments show that TransHP improves image classification on accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and model explainability. Moreover, TransHP also performs favorably against prior HIC methods, showing that TransHP well exploits the hierarchical information.
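
The three steps can be condensed into a small module; the hard argmax selection below is a simplification for clarity, and all names and shapes are our assumptions rather than the released TransHP code:

```python
import torch
import torch.nn as nn

class HierarchicalPromptBlock(nn.Module):
    # At one intermediate block: (1) hold learnable prompt tokens for the
    # coarse classes, (2) predict the coarse class from the CLS token,
    # (3) append the selected prompt token so later blocks are conditioned.
    def __init__(self, dim, n_coarse):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_coarse, dim) * 0.02)
        self.coarse_head = nn.Linear(dim, n_coarse)

    def forward(self, tokens):               # tokens: (B, L, dim), CLS at index 0
        coarse_logits = self.coarse_head(tokens[:, 0])
        prompt = self.prompts[coarse_logits.argmax(-1)]        # (B, dim), hard pick
        tokens = torch.cat([tokens, prompt.unsqueeze(1)], dim=1)
        return tokens, coarse_logits         # coarse_logits also receives a loss
```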

Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives
Tom Monnier Jake Austin Angjoo Kanazawa Alexei A Efros Mathieu Aubry



Research question: How to produce a simple, compact, and actionable 3D world representation from a calibrated image collection by means of 3D primitives.
Motivation: Many methods focus on recovering high-fidelity 3D scenes; this work instead parses a scene into a mid-level 3D representation made of a small set of textured primitives.
Method: Model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss.
Results: The method accurately reconstructs the input images and the visible 3D points while providing amodal shape completion of unseen object regions. Comparisons with the state of the art demonstrate strong robustness.

Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations. Code and video results are available at https://www.tmonnier.com/DBW.

PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation
Zhaoxi Chen Fangzhou Hong Haiyi Mei Guangcong Wang Lei Yang Ziwei Liu



Research question: Develop a diffusion-based framework for 3D human generation.
Motivation: Designing diffusion models for 3D humans is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D human bodies.
Method: Run the denoising diffusion process directly on a set of volumetric primitives that model the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering.
Results: Experiments show that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Compared with GAN-based methods, once the denoising process is done, PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of $512\times512$. The framework is also flexible enough for training-free conditional generation such as texture transfer and 3D inpainting.

We present PrimDiffusion, the first diffusion-based framework for 3D human generation. Devising diffusion models for 3D human generation is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D humans. To tackle these challenges, our key insight is operating the denoising diffusion process directly on a set of volumetric primitives, which models the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering. Our PrimDiffusion framework has three appealing properties: **1)** compact and expressive parameter space for the diffusion model, **2)** flexible representation that incorporates human prior, and **3)** decoder-free rendering for efficient novel-view and novel-pose synthesis. Extensive experiments validate that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Notably, compared to GAN-based methods, our PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of $512\times512$ once the denoising process is done. We also demonstrate the flexibility of our framework on training-free conditional generation such as texture transfer and 3D inpainting.

Weakly-Supervised Audio-Visual Segmentation
Shentong Mo Bhiksha Raj



Research question: Solve the audio-visual segmentation task, i.e., predicting pixel-level masks for sound sources in a video.
Motivation: Previous work relies on heavily hand-designed architectures and pixel-wise accurate masks as supervision, but such masks are expensive and not available in all cases.
Method: Propose WS-AVS, a novel weakly-supervised audio-visual segmentation framework that learns multi-scale audio-visual alignment through multi-scale multiple-instance contrastive learning.
Results: Extensive experiments on AVSBench show that WS-AVS is effective for weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.

Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotation, $\textit{i.e.}$, weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.

DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model
Xiuye Gu Yin Cui Jonathan Huang Abdullah Rashwan Xuan Yang Xingyi Zhou Golnaz Ghiasi Weicheng Kuo Huizhong Chen Liang-Chieh Chen David A Ross



Research question: Exploit the close relationship among panoptic, semantic, and instance segmentation by proposing DaTaSeg, a universal multi-dataset multi-task segmentation model.
Motivation: Existing segmentation models cannot fully exploit knowledge sharing across datasets, and gains on small datasets are limited.
Method: Train all tasks with a shared representation (mask proposals with class predictions), adopt different merge operations and post-processing for different tasks, leverage weak supervision, and share network parameters to enable knowledge sharing across datasets.
Results: Trained on the ADE semantic, COCO panoptic, and Objects365 detection datasets, DaTaSeg improves performance on all datasets, especially small ones, achieving 54.0 mIoU on ADE semantic and 53.5 PQ on COCO panoptic. DaTaSeg also enables weakly-supervised knowledge transfer to ADE panoptic and Objects365 instance segmentation. Experiments show that DaTaSeg scales with the number of training datasets and enables open-vocabulary segmentation through direct transfer.

Observing the close relationship among panoptic, semantic and instance segmentation tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg. We use a shared representation (mask proposals with class predictions) for all tasks. To tackle task discrepancy, we adopt different merge operations and post-processing for different tasks. We also leverage weak-supervision, allowing our segmentation model to benefit from cheaper bounding box annotations. To share knowledge across datasets, we use text embeddings from the same semantic embedding space as classifiers and share all network parameters among datasets. We train DaTaSeg on ADE semantic, COCO panoptic, and Objects365 detection datasets. DaTaSeg improves performance on all datasets, especially small-scale datasets, achieving 54.0 mIoU on ADE semantic and 53.5 PQ on COCO panoptic. DaTaSeg also enables weakly-supervised knowledge transfer on ADE panoptic and Objects365 instance segmentation. Experiments show DaTaSeg scales with the number of training datasets and enables open-vocabulary segmentation through direct transfer. In addition, we annotate an Objects365 instance segmentation set of 1,000 images and release it as a public evaluation benchmark on https://laoreja.github.io/dataseg.

Autodecoding Latent 3D Diffusion Models
Evangelos Ntavelis Aliaksandr Siarohin Kyle Olszewski Chaoyang Wang Luc Van Gool Sergey Tulyakov



Research question: Diffusion models excel in the text-to-image domain, but their use for 3D generation is limited by the scarcity of target-domain data.
Motivation: To address data scarcity in 3D generation, a new approach with a 3D autodecoder at its core is proposed.
Method: Diffusion-based pipelines first learn a latent space with an autoencoder and then run a denoising process on the bottleneck to generate new samples. Here, the 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry.
Results: Experiments show the method outperforms state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale real video dataset of static objects.

Diffusion-based methods have shown impressive visual results in the text-to-image domain. They first learn a latent space using an autoencoder, then run a denoising process on the bottleneck to generate new samples. However, learning an autoencoder requires substantial data in the target domain. Such data is scarce for 3D generation, prohibiting the learning of large-scale diffusion models for 3D synthesis. We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. Our approach is flexible enough to use either existing camera supervision or no camera information at all -- instead efficiently learning it during training. Our evaluations demonstrate that our generation results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.

Emergent Correspondence from Image Diffusion
Luming Tang Menglin Jia Qianqian Wang Cheng Perng Phoo Bharath Hariharan



Research question: How to find correspondences between images through image diffusion models, without any explicit supervision.
Motivation: Image correspondence is a fundamental problem in computer vision, and existing methods require substantial task-specific data or annotations for fine-tuning or supervision.
Method: Propose DIFT, a strategy that extracts the implicit knowledge inside diffusion networks as image features and uses them to establish correspondences between real images.
Results: Experiments show that, without additional fine-tuning or supervision on task-specific data or annotations, DIFT outperforms weakly-supervised methods and off-the-shelf features on semantic, geometric, and temporal correspondence. On semantic correspondence in particular, DIFT surpasses DINO and OpenCLIP and beats state-of-the-art supervised methods on 9 of the 18 categories of the SPair-71k benchmark.

Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge out of diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on the task-specific data or annotations, DIFT is able to outperform both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. Particularly for semantic correspondence, DIFT from Stable Diffusion is able to outperform DINO and OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k benchmark. It even outperforms the state-of-the-art supervised methods on 9 out of 18 categories while remaining on par for the overall performance. Project page: https://diffusionfeatures.github.io.
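
A hedged sketch of the extraction recipe follows, with `unet`, `add_noise`, the timestep, and the tapped layer all standing in for a concrete diffusion model rather than a fixed API:

```python
import torch

def dift_features(unet, add_noise, latents, t=261):
    # Noise the image latents to one chosen timestep t, run a single
    # denoising pass, and read out an intermediate feature map via a hook.
    # `unet` is assumed to be an nn.Module with a `mid_block` submodule.
    noisy = add_noise(latents, torch.randn_like(latents), t)
    grabbed = {}
    hook = unet.mid_block.register_forward_hook(
        lambda mod, inp, out: grabbed.update(feat=out))
    _ = unet(noisy, t)          # one forward pass; no iterative sampling
    hook.remove()
    return grabbed['feat']      # (B, C, H, W) features

# Correspondence: for a query pixel's feature vector, the match in another
# image is the argmax of cosine similarity over that image's feature map.
```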

Modeling Human Visual Motion Processing with Trainable Motion Energy Sensing and a Self-attention Network
Zitang Sun Yen-Ju Chen Yung-Hao Yang Shin'ya Nishida



Research question: Build an image-computable model of how humans perceive and interact with dynamic environments, extracting informative motion flow from natural scenes.
Motivation: Despite extensive research in cognitive neuroscience, no image-computable model has been established that extracts informative motion flow from natural scenes in a manner consistent with human visual processing.
Method: Propose a two-stage approach that combines trainable motion energy sensing with a recurrent self-attention network for adaptive motion integration and segregation. The architecture is designed to capture the computations of V1-MT, the core structure for motion perception in the biological visual system, while producing informative motion flow for a wide range of stimuli.
Results: Experiments show the model predicts human responses better than the ground truth does, whereas state-of-the-art CV models show the opposite. Although its physiological correspondence may not be exact, the model provides a computational architecture consistent with human visual motion processing.

Visual motion processing is essential for humans to perceive and interact with dynamic environments. Despite extensive research in cognitive neuroscience, image-computable models that can extract informative motion flow from natural scenes in a manner consistent with human visual processing have yet to be established. Meanwhile, recent advancements in computer vision (CV), propelled by deep learning, have led to significant progress in optical flow estimation, a task closely related to motion perception. Here we propose an image-computable model of human motion perception by bridging the gap between biological and CV models. Specifically, we introduce a novel two-stage approach that combines trainable motion energy sensing with a recurrent self-attention network for adaptive motion integration and segregation. This model architecture aims to capture the computations in V1-MT, the core structure for motion perception in the biological visual system, while providing the ability to derive informative motion flow for a wide range of stimuli, including complex natural scenes. In silico neurophysiology reveals that our model's unit responses are similar to mammalian neural recordings regarding motion pooling and speed tuning. The proposed model can also replicate human responses to a range of stimuli examined in past psychophysical studies. The experimental results on the Sintel benchmark demonstrate that our model predicts human responses better than the ground truth, whereas the state-of-the-art CV models show the opposite. Our study provides a computational architecture consistent with human visual motion processing, although the physiological correspondence may not be exact.

OpenMask3D: Open-Vocabulary 3D Instance Segmentation
Ayça Takmaz Elisabetta Fedele Robert Sumner Marc Pollefeys Federico Tombari Francis Engelmann



Research question: Address the facts that current 3D instance segmentation methods can only recognize predefined categories and that existing open-vocabulary approaches cannot separate multiple object instances.
Motivation: Current 3D instance segmentation methods typically recognize only a closed set of categories annotated in the training datasets, a serious limitation for real-world applications that require novel, open-vocabulary queries about a wide variety of objects.
Method: Propose OpenMask3D, a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, the model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
Results: Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase its ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D’s ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

Reading Relevant Feature from Global Representation Memory for Visual Object Tracking
Xinyu Zhou Pinxue Guo Lingyi Hong Jinglun Li Wei Zhang Weifeng Ge Wenqiang Zhang



Research question: How to effectively exploit reference features for visual object tracking.
Motivation: Because videos are dynamic, the reference historical information required by different search regions at different time steps is inconsistent; using all features from the template and memory can cause redundancy and impair tracking performance.
Method: Propose a novel tracking paradigm consisting of a relevance attention mechanism and a global representation memory, which adaptively helps the search region select the most relevant historical information from the reference features.
Results: Extensive experiments on five challenging datasets validate the effectiveness of the method, achieving competitive performance at 71 FPS.

Reference features from a template or historical frames are crucial for visual object tracking. Prior works utilize all features from a fixed template or memory for visual object tracking. However, due to the dynamic nature of videos, the required reference historical information for different search regions at different time steps is also inconsistent. Therefore, using all features in the template and memory can lead to redundancy and impair tracking performance. To alleviate this issue, we propose a novel tracking paradigm, consisting of a relevance attention mechanism and a global representation memory, which can adaptively assist the search region in selecting the most relevant historical information from reference features. Specifically, the proposed relevance attention mechanism in this work differs from previous approaches in that it can dynamically choose and build the optimal global representation memory for the current frame by accessing cross-frame information globally. Moreover, it can flexibly read the relevant historical information from the constructed memory to reduce redundancy and counteract the negative effects of harmful information. Extensive experiments validate the effectiveness of the proposed method, achieving competitive performance on five challenging datasets while running at 71 FPS.

GMSF: Global Matching Scene Flow
Yushan Zhang Johan Edstedt Bastian Wandt Per-Erik Forssen Maria Magnusson Michael Felsberg



Research question: Estimate scene flow from point clouds.
Motivation: Dominant existing scene flow estimation methods require complicated multi-stage refinement, such as coarse-to-fine or recurrent architectures.
Method: Propose a significantly simpler single-scale one-shot global matching approach. The feature extraction step is decomposed via a hybrid local-global-cross transformer architecture to obtain accurate and robust feature representations.
Results: Experiments show that the proposed Global Matching Scene Flow (GMSF) sets a new state of the art on multiple scene flow estimation benchmarks.

We tackle the task of scene flow estimation from point clouds. Given a source and a target point cloud, the objective is to estimate a translation from each point in the source point cloud to the target, resulting in a 3D motion vector field. Previous dominant scene flow estimation methods require complicated coarse-to-fine or recurrent architectures as a multi-stage refinement. In contrast, we propose a significantly simpler single-scale one-shot global matching to address the problem. Our key finding is that reliable feature similarity between point pairs is essential and sufficient to estimate accurate scene flow. We thus propose to decompose the feature extraction step via a hybrid local-global-cross transformer architecture which is crucial to accurate and robust feature representations. Extensive experiments show that the proposed Global Matching Scene Flow (GMSF) sets a new state-of-the-art on multiple scene flow estimation benchmarks. On FlyingThings3D, with the presence of occlusion points, GMSF reduces the outlier percentage from the previous best performance of 27.4% to 5.6%. On KITTI Scene Flow, without any fine-tuning, our proposed method shows state-of-the-art performance. On the Waymo-Open dataset, the proposed method outperforms previous methods by a large margin. The code is available at https://github.com/ZhangYushan3/GMSF.
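
The one-shot global matching itself is compact enough to sketch; assuming L2-normalized point features, soft correspondences and flow follow directly (an illustration of the idea, not the GMSF code):

```python
import torch

def global_matching_flow(src_feat, tgt_feat, src_xyz, tgt_xyz, tau=0.07):
    # src_feat: (N, D), tgt_feat: (M, D) L2-normalized point features;
    # src_xyz: (N, 3), tgt_xyz: (M, 3) point coordinates.
    sim = src_feat @ tgt_feat.T / tau    # (N, M) all-pairs feature similarity
    w = sim.softmax(dim=-1)              # soft correspondence weights
    return w @ tgt_xyz - src_xyz         # (N, 3) 3D motion vector field
```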

Learning Neural Implicit through Volume Rendering with Attentive Depth Fusion Priors
Pengchong Hu Zhizhong Han



Research question: Under depth supervision, current multi-view 3D reconstruction methods suffer from incomplete depth at holes in rendered views and are unaware of occluded structures, which severely affects the accuracy of geometry inference via volume rendering.
Motivation: To resolve this, the work learns neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior.
Method: The prior allows neural networks to sense coarse 3D structure for rendering from a Truncated Signed Distance Function (TSDF) fused from all available depth images. The TSDF provides access to the depth missing at holes of a single depth image and to occluded parts invisible from the current view. A novel attention mechanism lets the network directly use the depth fusion prior, with the inferred occupancy, as the learned implicit function.
Results: Evaluations on widely used synthetic and real-world scan benchmarks show superiority over the latest neural implicit methods.

Learning neural implicit representations has achieved remarkable performance in 3D reconstruction from multi-view images. Current methods use volume rendering to render implicit representations into either RGB or depth images that are supervised by the multi-view ground truth. However, rendering a view each time suffers from incomplete depth at holes and unawareness of occluded structures from the depth supervision, which severely affects the accuracy of geometry inference via volume rendering. To resolve this issue, we propose to learn neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior. Our prior allows neural networks to sense coarse 3D structures from the Truncated Signed Distance Function (TSDF) fused from all available depth images for rendering. The TSDF enables accessing the missing depth at holes on one depth image and the occluded parts that are invisible from the current view. By introducing a novel attention mechanism, we allow neural networks to directly use the depth fusion prior with the inferred occupancy as the learned implicit function. Our attention mechanism works with either a one-time fused TSDF that represents a whole scene or an incrementally fused TSDF that represents a partial scene in the context of Simultaneous Localization and Mapping (SLAM). Our evaluations on widely used benchmarks including synthetic and real-world scans show our superiority over the latest neural implicit methods.

ClusterFomer: Clustering As A Universal Visual Learner
James Chenhao Liang Yiming Cui Qifan Wang Tong Geng Wenguan Wang Dongfang Liu



Research question: Propose ClusterFormer, a universal vision model based on the clustering paradigm with Transformer.
Motivation: Existing vision models are limited in performance and explainability when handling heterogeneous vision tasks.
Method: ClusterFormer contains two novel designs: 1) recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer so that cluster centers can be updated recursively, strengthening representation learning; and 2) feature dispatching, which uses the updated cluster centers to redistribute image features through similarity metrics, forming a transparent pipeline.
Results: Experiments show that ClusterFormer outperforms various well-known specialized architectures, with strong results on ImageNet-1K image classification, MS COCO object detection and instance segmentation, ADE20K semantic segmentation, and COCO Panoptic panoptic segmentation.

This paper presents ClusterFormer, a universal vision model that is based on the Clustering paradigm with TransFormer. It comprises two novel designs: 1) recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2) feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that ClusterFormer outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. This work aims to initiate a paradigm shift in universal visual understanding and to benefit the broader field.

Unified 3D Segmenter As Prototypical Classifiers
Zheyun Qin Cheng Han Qifan Wang Xiushan Nie Yilong Yin Xiankai Lu



Research question: Point cloud segmentation, including semantic, instance, and panoptic segmentation, is usually addressed with task-specific network architectures that lack cross-task flexibility, fragmenting the research landscape.
Motivation: Existing methods design architectures for specific tasks and lack flexibility and generality. This work proposes ProtoSEG, a prototype-based model that unifies semantic, instance, and panoptic segmentation.
Method: Treat the three homogeneous tasks as classification problems at different levels of granularity. A Transformer architecture extracts point embeddings to optimize prototype-class distances, and class prototypes are learned dynamically to accommodate the end tasks.
Results: Experiments show that ProtoSEG outperforms concurrent well-known specialized architectures on 3D point cloud benchmarks, with 72.3%, 76.4%, and 74.2% mIoU for semantic segmentation on S3DIS, ScanNet V2, and SemanticKITTI, 66.8% mCov and 51.2% mAP for instance segmentation on S3DIS and ScanNet V2, and 62.4% PQ for panoptic segmentation on SemanticKITTI, validating the concept and the effectiveness of the algorithm. Code and models are available at https://github.com/zyqin19/PROTOSEG.

The task of point cloud segmentation, comprising semantic, instance, and panoptic segmentation, has been mainly tackled by designing task-specific network architectures, which often lack the flexibility to generalize across tasks, thus resulting in a fragmented research landscape. In this paper, we introduce ProtoSEG, a prototype-based model that unifies semantic, instance, and panoptic segmentation tasks. Our approach treats these three homogeneous tasks as a classification problem with different levels of granularity. By leveraging a Transformer architecture, we extract point embeddings to optimize prototype-class distances and dynamically learn class prototypes to accommodate the end tasks. Our prototypical design enjoys simplicity and transparency, powerful representational learning, and ad-hoc explainability. Empirical results demonstrate that ProtoSEG outperforms concurrent well-known specialized architectures on 3D point cloud benchmarks, achieving 72.3%, 76.4% and 74.2% mIoU for semantic segmentation on S3DIS, ScanNet V2 and SemanticKITTI, 66.8% mCov and 51.2% mAP for instance segmentation on S3DIS and ScanNet V2, 62.4% PQ for panoptic segmentation on SemanticKITTI, validating the strength of our concept and the effectiveness of our algorithm. The code and models are available at https://github.com/zyqin19/PROTOSEG.
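
The prototype view of classification reduces inference to a nearest-prototype assignment; a minimal sketch under assumed shapes:

```python
import torch

def nearest_prototype_labels(point_embeds, prototypes):
    # point_embeds: (N, D) embeddings from the Transformer backbone;
    # prototypes:   (K, D) learned class prototypes. Semantic, instance,
    # and panoptic labels differ only in the granularity at which the
    # prototypes are defined.
    dist = torch.cdist(point_embeds, prototypes)  # (N, K) point-prototype distances
    return dist.argmin(dim=-1)                    # nearest prototype per point
```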

Learning Motion Refinement for Unsupervised Face Animation
Jiale Tao Shuhang Gu Wen Li Lixin Duan



Research question: This paper addresses the limitations of existing unsupervised face animation methods in capturing fine facial motions.
Motivation: Existing unsupervised face animation methods usually adopt prior-based global motion models, which fall short in capturing motion in local areas (e.g., lips and eyes) and cannot model fine facial motions accurately.
Method: A new unsupervised face animation approach that learns global and local facial motions simultaneously. Specifically, a local affine motion model learns the global coarse facial motion, and a newly designed motion refinement module compensates for its shortcomings in modeling fine facial motions. The motion refinement is learned from dense correlations between keypoint features of the source and driving images.
Results: Experiments show that the method outperforms the existing state of the art on widely used benchmarks.

Unsupervised face animation aims to generate a human face video based on the appearance of a source image, mimicking the motion from a driving video. Existing methods typically adopt a prior-based motion model (e.g., the local affine motion model or the local thin-plate-spline motion model). While such models are able to capture the coarse facial motion, artifacts can often be observed around the tiny motions in local areas (e.g., lips and eyes), due to their limited ability to model finer facial motions. In this work, we design a new unsupervised face animation approach to learn the coarse and finer motions simultaneously. In particular, while exploiting the local affine motion model to learn the global coarse facial motion, we design a novel motion refinement module to compensate for the limitations of the local affine motion model in modeling finer face motions in local areas. The motion refinement is learned from the dense correlation between the source and driving images. Specifically, we first construct a structure correlation volume based on the keypoint features of the source and driving images. Then, we train a model to generate the tiny facial motions iteratively from low to high resolution. The learned motion refinements are combined with the coarse motion to generate the new image. Extensive experiments on widely used benchmarks demonstrate that our method achieves the best results among state-of-the-art baselines.

Unsupervised Optical Flow Estimation with Dynamic Timing Representation for Spike Camera
Lujie Xia Ziluo Ding Rui Zhao Jiyuan Zhang Lei Ma Zhaofei Yu Tiejun Huang Ruiqin Xiong



Research question: How to efficiently select an appropriate length of spike stream data to extract precise information, which is key to spike-based vision tasks.
Motivation: To address this issue, we propose a dynamic timing representation for spike streams.
Method: Built on a multi-layer architecture, dilated convolutions are applied along the temporal dimension to extract multi-timescale features, and a layer-attention mechanism fuses these features dynamically. We further propose a spike-based unsupervised learning method for optical flow estimation to break the dependence on labeled data.
Results: Experiments show that the method predicts optical flow from spike streams in various high-speed scenes, including real ones. For example, on the PHM dataset, our errors are reduced by 15% and 19% compared to the best spike-based baseline, SCFlow.

Efficiently selecting an appropriate spike stream data length to extract precise information is the key to spike vision tasks. To address this issue, we propose a dynamic timing representation for spike streams. Built on a multi-layer architecture, it applies dilated convolutions along the temporal dimension to extract multi-timescale features with few parameters, and we design layer attention to fuse these features dynamically. Moreover, we propose an unsupervised learning method for optical flow estimation in a spike-based manner to break the dependence on labeled data. In addition, to verify the robustness, we also build a spike-based synthetic validation dataset for extreme scenarios in autonomous driving, denoted as the SSES dataset. It consists of various corner cases. Experiments show that our method can predict optical flow from spike streams in different high-speed scenes, including real scenes. For instance, our method achieves $15\%$ and $19\%$ error reduction on the PHM dataset compared to the best spike-based work, SCFlow, in $\Delta t=10$ and $\Delta t=20$ respectively, using the same settings as in previous works. The source code and dataset are available at \href{https://github.com/Bosserhead/USFlow}{https://github.com/Bosserhead/USFlow}.
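The combination of dilated temporal convolutions and layer attention can be sketched compactly: each layer widens the temporal receptive field, and a learned weight per layer fuses the multi-timescale features. The sketch below is a minimal PyTorch illustration; the channel counts, ReLU, and the pooled-feature attention head are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DynamicTimingRepresentation(nn.Module):
    """Sketch: stacked temporal convolutions with growing dilation extract
    multi-timescale features, and layer attention fuses them dynamically."""

    def __init__(self, channels: int = 32, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(num_layers)
        )
        # one attention logit per layer, predicted from globally pooled input
        self.attn = nn.Linear(channels, num_layers)

    def forward(self, spikes: torch.Tensor) -> torch.Tensor:
        # spikes: (B, C, T) spike-stream features along the temporal axis
        feats, x = [], spikes
        for layer in self.layers:
            x = torch.relu(layer(x))   # padding keeps the temporal length T
            feats.append(x)
        stacked = torch.stack(feats, dim=1)                       # (B, L, C, T)
        weights = self.attn(spikes.mean(dim=-1)).softmax(dim=-1)  # (B, L)
        return (weights[:, :, None, None] * stacked).sum(dim=1)   # (B, C, T)

model = DynamicTimingRepresentation()
out = model(torch.randn(2, 32, 128))  # fused multi-timescale features
```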

Online Map Vectorization for Autonomous Driving: A Rasterization Perspective
Gongjie Zhang Jiahao Lin Shuang Wu Yilin Song Zhipeng Luo Yang Xue Shijian Lu Zuoguan Wang



Research question: How to improve the precision and sensitivity of map vectorization so that it better suits autonomous driving environments.
Motivation: Current map vectorization methods exhibit deviations, and existing evaluation metrics are insufficiently sensitive to these deviations.
Method: The philosophy of rasterization is brought into map vectorization through a rasterization-based evaluation metric and a vectorization framework, MapVR. MapVR applies differentiable rasterization to vectorized outputs and performs precise, geometry-aware supervision on the rasterized HD maps.
Results: Experiments show that incorporating rasterization into map vectorization markedly improves performance with no extra computational cost at inference, helping produce more accurate map perception and promoting safer autonomous driving.

High-definition (HD) vectorized map is essential for autonomous driving, providing detailed and precise environmental information for advanced perception and planning. However, current map vectorization methods often exhibit deviations, and the existing evaluation metric for map vectorization lacks sufficient sensitivity to detect these deviations. To address these limitations, we propose integrating the philosophy of rasterization into map vectorization. Specifically, we introduce a new rasterization-based evaluation metric, which has superior sensitivity and is better suited to real-world autonomous driving scenarios. Furthermore, we propose MapVR (Map Vectorization via Rasterization), a novel framework that applies differentiable rasterization to vectorized outputs and then performs precise and geometry-aware supervision on rasterized HD maps. Notably, MapVR designs tailored rasterization strategies for various geometric shapes, enabling effective adaptation to a wide range of map elements. Experiments show that incorporating rasterization into map vectorization greatly enhances performance with no extra computational cost during inference, leading to more accurate map perception and ultimately promoting safer autonomous driving. Codes are available at https://github.com/ZhangGongjie/MapVR. A standalone map vectorization evaluation toolkit is available at https://github.com/jiahaoLjh/MapVectorizationEvalToolkit.

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Qihang Yu Ju He Xueqing Deng Xiaohui Shen Liang-Chieh Chen



Research question: Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories in diverse environments.
Motivation: Existing methods usually adopt a two-stage framework in which the input first passes through a mask generator and then, together with the predicted masks, through a CLIP model. This extracts features from the raw image multiple times, which can be ineffective and inefficient. By contrast, we build a single-stage framework on a shared frozen convolutional CLIP backbone, which not only greatly simplifies the current two-stage pipeline but also markedly improves the accuracy-cost trade-off.
Method: We build a single-stage system named FC-CLIP, which benefits from the following observations: a frozen CLIP backbone retains its open-vocabulary classification ability and can also serve as a strong mask generator, and a convolutional CLIP generalizes well to input resolutions larger than the one used during contrastive image-text pretraining. Surprisingly, FC-CLIP achieves state-of-the-art results on various benchmarks while running quite fast.
Results: When trained only on COCO panoptic data and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ and 27.9 mIoU on Mapillary Vistas, and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art under the same setting by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. Moreover, FC-CLIP trains and tests 7.5x and 6.6x faster than the same prior art while using 5.9x fewer total model parameters, and it also sets new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code and models are available at https://github.com/bytedance/fc-clip.

Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories in diverse environments. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which effectively bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from raw images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a _shared **F**rozen **C**onvolutional **CLIP** backbone_, which not only significantly simplifies the current two-stage pipeline, but also remarkably yields a better accuracy-cost trade-off. The resulting single-stage system, called FC-CLIP, benefits from the following observations: the _frozen_ CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the _convolutional_ CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. Surprisingly, FC-CLIP advances state-of-the-art results on various benchmarks, while running practically fast. Specifically, when training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2 mIoU on Cityscapes, outperforming the prior art under the same setting by +4.2 PQ, +2.4 AP, +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes, respectively. Additionally, the training and testing times of FC-CLIP are 7.5x and 6.6x faster than the same prior art, respectively, while using 5.9x fewer total model parameters. Meanwhile, FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code and models are available at https://github.com/bytedance/fc-clip.
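One ingredient of the single-stage recipe described above, classifying predicted masks against category text embeddings by pooling frozen CLIP features inside each mask, can be sketched in a few lines. The shapes, soft-mask pooling, and temperature below are illustrative assumptions, not FC-CLIP's exact formulation.

```python
import torch
import torch.nn.functional as F

def open_vocab_mask_classification(clip_features, masks, text_embeddings, tau=0.07):
    """Sketch: mask-pool frozen CLIP features and match against text embeddings.

    clip_features:   (D, H, W) frozen convolutional CLIP feature map
    masks:           (M, H, W) predicted (soft) masks
    text_embeddings: (C, D) CLIP text embeddings of category names
    """
    d, h, w = clip_features.shape
    flat_feat = clip_features.reshape(d, h * w)                # (D, HW)
    flat_mask = masks.reshape(masks.shape[0], h * w)           # (M, HW)
    weights = flat_mask / flat_mask.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    pooled = weights @ flat_feat.T                             # (M, D) mask-pooled features
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeddings, dim=-1)
    return (pooled @ text.T) / tau                             # (M, C) class logits

logits = open_vocab_mask_classification(
    torch.randn(512, 32, 32), torch.rand(10, 32, 32), torch.randn(150, 512))
```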

Self-supervised Object-Centric Learning for Videos
Görkay Aydemir Weidi Xie Fatma Guney



Research question: This paper addresses multi-object segmentation in real-world video sequences.
Motivation: Although unsupervised multi-object segmentation has achieved remarkable results on synthetic sequences, the performance gains do not carry over to more challenging real-world scenarios.
Method: A fully unsupervised approach: an object-centric learning framework binds objects to slots on each frame and relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space.
Results: The method successfully segments multiple instances of complex, high-variety classes in YouTube videos.

Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate the segmentation in video sequences. However, the performance improvements observed in synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios. In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization. Additionally, we address over-clustering by merging slots based on similarity. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.

GenS: Generalizable Neural Surface Reconstruction from Multi-View Images
Rui Peng Xiaodong Gu Luyang Tang Shihe Shen Fanqi Yu Ronggang Wang



Research question: How to reconstruct surfaces from multi-view images without 3D supervision.
Motivation: Existing methods require long per-scene optimization and cannot generalize to new scenes.
Method: GenS, an end-to-end generalizable neural surface reconstruction model that performs well in both sparse and dense settings. A generalized multi-scale volume directly encodes all scenes, avoiding the per-scene networks trained by coordinate-based methods.
Results: Compared with existing solutions, the representation is more powerful, recovering high-frequency details while maintaining global smoothness. A multi-scale feature-metric consistency further imposes multi-view consistency in a more discriminative multi-scale feature space and is robust to failures of photometric consistency. Experiments show the model generalizes well to new scenes and outperforms state-of-the-art methods on popular benchmarks, even those using ground-truth depth supervision.

Combining the signed distance function (SDF) and differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods are impeded by requiring long-time per-scene optimizations and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model, which performs well in both sparse and dense settings. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more powerful, and can recover high-frequency details while maintaining global smoothness. Meanwhile, we introduce a multi-scale feature-metric consistency to impose the multi-view consistency in a more discriminative multi-scale feature space, which is robust to the failures of the photometric consistency. Moreover, the learnable features can be self-enhanced to continuously improve the matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss to force the model to be robust to those regions covered by few viewpoints through distilling the geometric prior from dense input to sparse input. Extensive experiments on popular benchmarks show that our model can generalize well to new scenes and outperform existing state-of-the-art methods even those employing ground-truth depth supervision. Code will be available at https://github.com/prstrive/GenS.

DiffComplete: Diffusion-based Generative 3D Shape Completion
Ruihang Chu Enze Xie Shentong Mo Zhenguo Li Matthias Nießner Chi-Wing Fu Jiaya Jia



Research question: This paper proposes a new diffusion-based method for shape completion on 3D range scans.
Motivation: Compared with existing deterministic and probabilistic methods, we seek a balance between realism, multi-modality, and high fidelity.
Method: Shape completion is cast as a conditional generation task in the proposed DiffComplete model. The key designs are two-fold: a hierarchical feature aggregation mechanism injects conditional features in a spatially consistent manner, capturing both local details and broader contexts of the conditional inputs to control shape completion; and an occupancy-aware fusion strategy enables completion from multiple partial shapes and introduces higher flexibility on the input conditions.
Results: DiffComplete sets new state-of-the-art performance on two large-scale 3D shape completion benchmarks (e.g., a 40% reduction in $l_1$ error). The completed shapes not only look more realistic than those of deterministic methods but also resemble the ground truth more closely than probabilistic alternatives. Furthermore, DiffComplete generalizes well to objects of entirely unseen classes on both synthetic and real data, eliminating the need to re-train the model for different applications.

We introduce a new diffusion-based approach for shape completion on 3D range scans. Compared with prior deterministic and probabilistic methods, we strike a balance between realism, multi-modality, and high fidelity. We propose DiffComplete by casting shape completion as a generative task conditioned on the incomplete shape. Our key designs are two-fold. First, we devise a hierarchical feature aggregation mechanism to inject conditional features in a spatially-consistent manner. So, we can capture both local details and broader contexts of the conditional inputs to control the shape completion. Second, we propose an occupancy-aware fusion strategy in our model to enable the completion of multiple partial shapes and introduce higher flexibility on the input conditions. DiffComplete sets a new SOTA performance (e.g., 40% decrease on $l_1$ error) on two large-scale 3D shape completion benchmarks. Our completed shapes not only have a realistic outlook compared with the deterministic methods but also exhibit high similarity to the ground truths compared with the probabilistic alternatives. Further, DiffComplete has strong generalizability on objects of entirely unseen classes for both synthetic and real data, eliminating the need for model re-training in various applications.

AiluRus: A Scalable ViT Framework for Dense Prediction
Jin Li Yaoming Wang XIAOPENG ZHANG Bowen Shi Dongsheng Jiang Chenglin Li Wenrui Dai Hongkai Xiong Qi Tian



Research question: How to improve the efficiency of vision transformers (ViTs) on long token sequences and dense prediction tasks.
Motivation: Owing to their remarkable performance, ViTs have become the dominant architecture for vision tasks. However, their complexity grows dramatically when handling the long token sequences required by high-resolution inputs.
Method: An adaptive-resolution strategy adjusts the resolution of image regions according to their importance. Specifically, at an intermediate ViT layer, anchors are selected from the token sequence with the proposed spatial-aware density-based clustering algorithm. Tokens adjacent to anchors are merged to form low-resolution regions, while the remaining tokens are preserved independently at high resolution. This significantly reduces the number of tokens, and subsequent layers only process the reduced token sequence for acceleration.
Results: Evaluated on three different datasets with promising performance. For example, "Segmenter ViT-L" can be accelerated by 48% FPS without fine-tuning while maintaining performance. The method can also accelerate fine-tuning: experiments show 52% of training time can be saved while accelerating FPS by 2.46x with only a 0.09% performance drop.

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity dramatically increases when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, emphasize more on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while others are preserved independently as high-resolution. This strategy could significantly reduce the number of tokens, and the following layers only handle the reduced token sequence for acceleration. At the output end, the resolution of the feature map is recovered by unfolding merged tokens for task prediction. Consequently, we can considerably accelerate ViTs for dense prediction tasks. The proposed method is evaluated across three different datasets and demonstrates promising performance. For instance, "Segmenter ViT-L" can be accelerated by 48\% FPS without fine-tuning, while maintaining the performance. Moreover, our method can also be applied to accelerate fine-tuning. Experiments indicate that we can save 52\% training time while accelerating 2.46$\times$ FPS with only a 0.09\% performance drop.
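The token-reduction step described above, merging tokens into their nearest anchors so later layers process a shorter sequence, can be sketched as follows. Note the paper selects anchors with a spatial-aware density-based clustering algorithm; the random anchors and plain nearest-anchor averaging below are simplifying assumptions for illustration.

```python
import torch

def merge_tokens_to_anchors(tokens, anchor_idx):
    """Simplified sketch of anchor-based token merging.

    tokens:     (N, D) token sequence from an intermediate ViT layer
    anchor_idx: (K,) indices of the anchor tokens
    """
    anchors = tokens[anchor_idx]                              # (K, D)
    assign = torch.cdist(tokens, anchors).argmin(dim=-1)      # (N,) nearest anchor
    merged = torch.zeros_like(anchors)
    counts = torch.zeros(len(anchor_idx), 1)
    merged.index_add_(0, assign, tokens)                      # sum tokens per cluster
    counts.index_add_(0, assign, torch.ones(len(tokens), 1))
    return merged / counts.clamp(min=1.0), assign             # cluster means + assignment

tokens = torch.randn(196, 384)
anchor_idx = torch.randperm(196)[:49]                         # keep 25% of tokens
reduced, assign = merge_tokens_to_anchors(tokens, anchor_idx)
# at the output end, the assignment lets merged tokens be unfolded back to (196, D)
restored = reduced[assign]
```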

RangePerception: Taming LiDAR Range View for Efficient and Accurate 3D Object Detection
Yeqi BAI Ben Fei Youquan Liu Tao MA Yuenan Hou Botian Shi Yikang LI



Research question: How to improve the performance of LiDAR-based 3D detection methods while preserving their efficiency.
Motivation: Current bird's-eye-view (BEV) and range-view (RV) 3D detection methods trade performance against efficiency.
Method: This paper presents RangePerception, an efficient and accurate RV-based 3D object detection framework. Careful analysis identifies two key challenges impeding the performance of existing RV-based methods, and two new algorithms, Range Aware Kernel (RAK) and Vision Restoration Module (VRM), are proposed to address them.
Results: Experiments show that RangePerception achieves 3.25/4.18 higher average L1/L2 AP than the previous state-of-the-art RV-based method, RangeDet, on the Waymo Open Dataset. For the first time for an RV-based 3D detector, its average AP slightly surpasses that of the well-known BEV-based method CenterPoint, while its inference is 1.3 times as fast as CenterPoint.

LiDAR-based 3D detection methods currently use bird's-eye view (BEV) or range view (RV) as their primary basis. The former relies on voxelization and 3D convolutions, resulting in inefficient training and inference processes. Conversely, RV-based methods demonstrate higher efficiency due to their compactness and compatibility with 2D convolutions, but their performance still trails behind that of BEV-based methods. To eliminate this performance gap while preserving the efficiency of RV-based methods, this study presents an efficient and accurate RV-based 3D object detection framework termed RangePerception. Through meticulous analysis, this study identifies two critical challenges impeding the performance of existing RV-based methods: 1) there exists a natural domain gap between the 3D world coordinates used in the output and the 2D range image coordinates used in the input, making information extraction from range images difficult; 2) native range images suffer from a vision corruption issue, affecting the detection accuracy of objects located on the margins of the range images. To address the key challenges above, we propose two novel algorithms named Range Aware Kernel (RAK) and Vision Restoration Module (VRM), which facilitate the flow of information from the range image representation to world-coordinate 3D detection results. With the help of RAK and VRM, our RangePerception achieves 3.25/4.18 higher averaged L1/L2 AP compared to the previous state-of-the-art RV-based method RangeDet, on Waymo Open Dataset. For the first time as an RV-based 3D detection method, RangePerception achieves slightly superior averaged AP compared with the well-known BEV-based method CenterPoint, and the inference speed of RangePerception is 1.3 times as fast as CenterPoint.

NAP: Neural 3D Articulated Object Prior
Jiahui Lei Congyue Deng Bokui Shen Leonidas Guibas Kostas Daniilidis



Research question: This paper proposes Neural 3D Articulated object Prior (NAP), the first 3D deep generative model for synthesizing 3D articulated object models.
Motivation: Despite extensive research on generating 3D static objects, compositions, or scenes, hardly any approach captures the distribution of articulated objects, a common object category for human and robot interaction.
Method: A novel articulation tree/graph parameterization is designed first, and a diffusion-denoising probabilistic model is then applied over this representation, generating articulated objects via denoising from random complete graphs. To capture both the geometry and the motion structure, whose distributions affect each other, a graph denoising network is designed to learn the reverse diffusion process.
Results: Experiments demonstrate strong performance on articulated object generation as well as applications to conditioned generation, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.

We propose Neural 3D Articulated object Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D static objects, compositions, or scenes, there are hardly any approaches for capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation, where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure, whose distributions affect each other, we design a graph denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality. Experiments demonstrate our high performance in articulated object generation as well as its applications on conditioned generation, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.

Segment Everything Everywhere All at Once
Xueyan Zou Jianwei Yang Hao Zhang Feng Li Linjie Li Jianfeng Wang Lijuan Wang Jianfeng Gao Yong Jae Lee



Research question: How to develop a generalizable, interactive model for segmenting everything everywhere all at once in an image.
Motivation: Existing image segmentation models require task-specific designs and lack universality and interactivity.
Method: The proposed SEEM model unifies different spatial queries by introducing a new visual prompt and learns a joint visual-semantic space between text and visual prompts, so the model can dynamically compose the two prompt types to accomplish various segmentation tasks. Learnable memory prompts are also introduced to retain segmentation history.
Results: Experiments show that SEEM learns and composes different types of prompts in a unified representation space, effectively handling various segmentation tasks and achieving competitive performance with minimal supervision.

In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image. In SEEM, we propose a novel and versatile decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles, and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of two prompt types required for various segmentation tasks, as shown in Fig. 1; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from the decoder to image features; iv) Semantic awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. The results demonstrate that SEEM exhibits robust generalization to unseen user intents as it learns to compose prompts of different types in a unified representation space. Our approach achieves competitive performance on interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with minimum 1/100 supervision in a single set of weights.

Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation
Jia-Xing Zhong Ta-Ying Cheng Yuhang He Kai Lu Kaichen Zhou Andrew Markham Niki Trigoni



Research question: How to achieve a generalizable approach to rigid segmentation and motion estimation for 3D understanding of articulated objects and moving scenes.
Motivation: Segmentation and motion estimation are closely intertwined, and we tackle the problem in an unsupervised manner.
Method: We design an SE(3)-equivariant architecture and a training strategy. The architecture consists of two interconnected lightweight heads, which predict segmentation masks from point-level invariant features and estimate motion from SE(3)-equivariant features, without category information. The training strategy is unified and can be implemented online, jointly optimizing the predicted segmentation and motion by exploiting the relationships among scene flow, segmentation masks, and rigid transformations.
Results: Experiments on four datasets show that the method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To our knowledge, this is the first work on category-agnostic part-level SE(3) equivariance in dynamic point clouds.

A truly generalizable approach to rigid segmentation and motion estimation is fundamental to 3D understanding of articulated objects and moving scenes. In view of the closely intertwined relationship between segmentation and motion estimates, we present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our architecture is composed of two interconnected, lightweight heads. These heads predict segmentation masks using point-level invariant features and estimate motion from SE(3) equivariant features, all without the need for category information. Our training strategy is unified and can be implemented online, which jointly optimizes the predicted segmentation and motion by leveraging the interrelationships among scene flow, segmentation mask, and rigid transformations. We conduct experiments on four datasets to demonstrate the superiority of our method. The results show that our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To the best of our knowledge, this is the first work designed for category-agnostic part-level SE(3) equivariance in dynamic point clouds.

Semantic Image Synthesis with Unconditional Generator
JungWoo Chae Hyunin Cho Sooyeon Go Kyungmook Choi Youngjung Uh



Research question: How to achieve fine spatial control over a pretrained unconditional generator with user-specified semantic masks to generate realistic images.
Motivation: Current semantic image synthesis methods require expensive pixel-level annotation of training images, whereas manipulating intermediate feature maps in a pretrained unconditional generator (e.g., StyleGAN) offers coarse spatial control without heavy annotation.
Method: This paper proposes a new approach in which a semantic mapper converts a user-specified guiding mask into a proxy mask, and the proxy mask then influences the generated image through a cross-attention-based rearranging network. The proxy mask is a simple clustering of intermediate feature maps in the generator. The semantic mapper and rearranging network are easy to train (less than half an hour).
Results: The method is useful for many tasks, including semantic image synthesis, spatial editing of real images, and unaligned local transplantation. It is also broadly applicable to various datasets, such as human faces, animal faces, and churches.

Semantic image synthesis (SIS) aims to generate realistic images according to semantic masks given by a user. Although recent methods produce high quality results with fine spatial control, SIS requires expensive pixel-level annotation of the training images. On the other hand, manipulating intermediate feature maps in a pretrained unconditional generator such as StyleGAN supports coarse spatial control without heavy annotation. In this paper, we introduce a new approach for reflecting a user's detailed guiding mask on a pretrained unconditional generator. Our method converts the user's guiding mask to a proxy mask through a semantic mapper. Then the proxy mask conditions the resulting image through a rearranging network based on a cross-attention mechanism. The proxy mask is a simple clustering of intermediate feature maps in the generator. The semantic mapper and the rearranging network are easy to train (less than half an hour). Our method is useful for many tasks: semantic image synthesis, spatially editing real images, and unaligned local transplantation. Last but not least, it is generally applicable to various datasets such as human faces, animal faces, and churches.
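The proxy mask described above is just a clustering of an intermediate feature map, which is easy to reproduce. The sketch below uses a tiny hand-rolled k-means purely for illustration; the feature shapes, k, and initialization are assumptions, not the paper's exact recipe.

```python
import torch

def proxy_mask_from_features(feature_map, k=8, iters=10):
    """Sketch: k-means clustering of an intermediate generator feature map,
    producing a coarse semantic layout (the proxy mask).

    feature_map: (C, H, W) intermediate StyleGAN-like features
    returns:     (H, W) integer cluster map
    """
    c, h, w = feature_map.shape
    x = feature_map.reshape(c, h * w).T                    # (HW, C) one row per pixel
    centers = x[torch.randperm(h * w)[:k]]                 # random initialization
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=-1)    # (HW,) nearest center
        for j in range(k):                                 # recompute each center
            members = x[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(dim=0)
    return assign.reshape(h, w)

proxy = proxy_mask_from_features(torch.randn(512, 16, 16))
```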

Non-Rigid Shape Registration via Deep Functional Maps Prior
Puhua Jiang Mingze Sun Ruqi Huang



Research question: This paper proposes a learning-based framework for non-rigid shape registration without correspondence supervision.
Motivation: Traditional shape registration techniques typically rely on correspondences induced by extrinsic proximity and can therefore fail in the presence of large intrinsic deformations.
Method: The source mesh is deformed towards the target point cloud, guided by correspondences induced by high-dimensional embeddings learned from deep functional maps (DFM). In particular, the correspondences are dynamically updated according to the intermediate registrations and filtered by a consistency prior, which prominently stabilizes the whole pipeline.
Results: Experiments show that, with as few as dozens of training shapes of limited variability, the pipeline achieves state-of-the-art results on several non-rigid point cloud matching benchmarks while delivering high-quality correspondences for unseen, challenging shape pairs undergoing significant extrinsic and intrinsic deformations.

In this paper, we propose a learning-based framework for non-rigid shape registration without correspondence supervision. Traditional shape registration techniques typically rely on correspondences induced by extrinsic proximity, therefore can fail in the presence of large intrinsic deformations. Spectral mapping methods overcome this challenge by embedding shapes into, geometric or learned, high-dimensional spaces, where shapes are easier to align. However, due to the dependency on abstract, non-linear embedding schemes, the latter can be vulnerable with respect to perturbed or alien input. In light of this, our framework takes the best of both worlds. Namely, we deform the source mesh towards the target point cloud, guided by correspondences induced by high-dimensional embeddings learned from deep functional maps (DFM). In particular, the correspondences are dynamically updated according to the intermediate registrations and filtered by a consistency prior, which prominently robustifies the overall pipeline. Moreover, in order to alleviate the requirement of extrinsically aligned input, we train an orientation regressor on a set of aligned synthetic shapes independent of the training shapes for DFM. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline not only achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work. The code is available at https://github.com/rqhuang88/DFR.

Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping
Chunming He Kai Li Yachao Zhang Guoxia Xu Longxiang Tang Yulun Zhang Zhenhua Guo Xiu Li



Research question: This paper addresses the difficulty of distinguishing concealed objects, which closely resemble the background, when training with sparsely annotated data.
Motivation: The high similarity between concealed objects and their surroundings, together with the weak supervision provided by sparsely annotated training data, makes accurate segmentation of objects blended into the environment challenging.
Method: A new weakly-supervised concealed object segmentation (WSCOS) method is proposed. A multi-scale feature grouping module groups similar features and aggregates the grouping results, encouraging segmentation coherence and helping obtain complete segmentation for both single- and multiple-object images. Meanwhile, the recently proposed vision foundation model "Segment Anything Model (SAM)" is used, with the sparse annotations as prompts, to generate segmentation masks for training the model.
Results: A series of strategies, such as multi-augmentation result ensemble, entropy-based pixel-level weighting, and entropy-based image-level selection, mitigates the impact of low-quality segmentation masks and provides more reliable supervision for the segmentation model. Experiments demonstrate state-of-the-art performance on various WSCOS tasks.

Weakly-Supervised Concealed Object Segmentation (WSCOS) aims to segment objects well blended with surrounding environments using sparsely-annotated data for model training. It remains a challenging task since (1) it is hard to distinguish concealed objects from the background due to the intrinsic similarity and (2) the sparsely-annotated training data only provide weak supervision for model learning. In this paper, we propose a new WSCOS method to address these two challenges. To tackle the intrinsic similarity challenge, we design a multi-scale feature grouping module that first groups features at different granularities and then aggregates these grouping results. By grouping similar features together, it encourages segmentation coherence, helping obtain complete segmentation results for both single and multiple-object images. For the weak supervision challenge, we utilize the recently-proposed vision foundation model, ``Segment Anything Model (SAM)'', and use the provided sparse annotations as prompts to generate segmentation masks, which are used to train the model. To alleviate the impact of low-quality segmentation masks, we further propose a series of strategies, including multi-augmentation result ensemble, entropy-based pixel-level weighting, and entropy-based image-level selection. These strategies help provide more reliable supervision to train the segmentation model. We verify the effectiveness of our method on various WSCOS tasks, and experiments demonstrate that our method achieves state-of-the-art performance on these tasks.
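Among the strategies above, entropy-based pixel-level weighting is easy to make concrete: pixels where the SAM pseudo-mask is uncertain (probability near 0.5, i.e., entropy near its maximum) are down-weighted in the loss. The normalization below is an assumption for illustration.

```python
import torch

def entropy_pixel_weights(prob_mask, eps=1e-6):
    """Sketch of entropy-based pixel-level weighting for pseudo-mask supervision.

    prob_mask: (H, W) foreground probabilities of a pseudo mask in [0, 1]
    returns:   (H, W) per-pixel loss weights in [0, 1]
    """
    p = prob_mask.clamp(eps, 1 - eps)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())  # binary entropy per pixel
    max_entropy = torch.log(torch.tensor(2.0))          # entropy at p = 0.5
    return 1.0 - entropy / max_entropy                  # confident pixels -> weight 1

weights = entropy_pixel_weights(torch.rand(64, 64))
# usage sketch: loss = (weights * per_pixel_bce).sum() / weights.sum()
```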

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
Yang Cao Yihan Zeng Hang Xu Dan Xu



Research question: This paper addresses the two fundamental problems in open-vocabulary 3D object detection (OV-3DDet): localizing and classifying novel objects.
Motivation: Solving localization and classification of novel objects simultaneously under the condition of limited base categories remains little explored in the literature.
Method: Within a unified framework, an effective 3D novel object discovery strategy uses 3D box geometry priors and 2D semantic open-vocabulary priors to generate pseudo box labels for novel objects. To classify the novel object boxes, a cross-modal alignment module based on the discovered novel boxes is further developed, aligning the feature spaces of the point cloud and image/text modalities.
Results: Extensive experiments on the two challenging datasets SUN-RGBD and ScanNet demonstrate the effectiveness of the method, with a significant 80% mAP improvement over the best-performing alternative.

Open-vocabulary 3D Object Detection (OV-3DDet) aims to detect objects from an arbitrary list of categories within a 3D scene, which remains seldom explored in the literature. There are primarily two fundamental problems in OV-3DDet, *i.e.*, localizing and classifying novel objects. This paper aims at addressing the two problems simultaneously via a unified framework, under the condition of limited base categories. To localize novel 3D objects, we propose an effective 3D Novel Object Discovery strategy, which utilizes both the 3D box geometry priors and 2D semantic open-vocabulary priors to generate pseudo box labels of the novel objects. To classify novel object boxes, we further develop a cross-modal alignment module based on discovered novel boxes, to align feature spaces between 3D point cloud and image/text modalities. Specifically, the alignment process contains a class-agnostic and a class-discriminative alignment, incorporating not only the base objects with annotations but also the increasingly discovered novel objects, resulting in an iteratively enhanced alignment. The novel box discovery and cross-modal alignment are jointly learned to collaboratively benefit each other. The novel object discovery can directly impact the cross-modal alignment, while a better feature alignment can, in turn, boost the localization capability, leading to a unified OV-3DDet framework, named **CoDA**, for simultaneous novel object localization and classification. Extensive experiments on two challenging datasets (*i.e.*, SUN-RGBD and ScanNet) demonstrate the effectiveness of our method and also show a significant mAP improvement of 80% over the best-performing alternative method. Codes and pre-trained models are released on [the project page](https://yangcaoai.github.io/publications/CoDA.html).

DynPoint: Dynamic Neural Point For View Synthesis
Kaichen Zhou Jia-Xing Zhong Sangyun Shin Kai Lu Yiyuan Yang Andrew Markham Niki Trigoni



Research question: Existing view synthesis algorithms for monocular videos struggle with uncontrolled or long scenarios and require extensive training for each new scene.
Motivation: To address these problems, we propose DynPoint, an algorithm for rapidly synthesizing novel views of unconstrained monocular videos.
Method: Rather than encoding the entire scene into a latent representation, DynPoint focuses on predicting explicit 3D correspondences between neighboring frames to aggregate information. Specifically, the correspondences are obtained by estimating consistent depth and scene flow across frames; hierarchical neural point clouds are then constructed, using the acquired correspondences to aggregate information from multiple reference frames to the target frame.
Results: Experiments show that the method greatly accelerates training, typically by an order of magnitude, while producing results comparable to prior methods. It also exhibits strong robustness on long videos without learning a canonical representation of the video content.

The introduction of neural radiance fields has greatly improved the effectiveness of view synthesis for monocular videos. However, existing algorithms face difficulties when dealing with uncontrolled or lengthy scenarios, and require extensive training time specific to each new scenario. To tackle these limitations, we propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos. Rather than encoding the entirety of the scenario information into a latent representation, DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation. Specifically, this correspondence prediction is achieved through the estimation of consistent depth and scene flow information across frames. Subsequently, the acquired correspondence is utilized to aggregate information from multiple reference frames to a target frame, by constructing hierarchical neural point clouds. The resulting framework enables swift and accurate view synthesis for desired views of target frames. Experimental results demonstrate that our method accelerates training considerably (typically by an order of magnitude) while yielding results comparable to prior approaches. Furthermore, our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.

Enhancing Motion Deblurring in High-Speed Scenes with Spike Streams
Shiyan Chen Jiyuan Zhang Yajing Zheng Tiejun Huang Zhaofei Yu



Research question: Conventional cameras suffer from motion blur in high-speed scenes due to long exposure windows, and existing frame-based deblurring algorithms struggle to extract useful motion cues from severely blurred images.
Motivation: The spike camera, an emerging bio-inspired vision sensor, achieves an extremely high frame rate while preserving rich spatial details thanks to its novel sampling mechanism, but its typical binary spike streams are relatively low-resolution and lack color information, making them unfriendly to human vision.
Method: A new two-branch approach is proposed that integrates the two modalities, leveraging spike streams as auxiliary visual cues to guide deblurring in high-speed motion scenes.
Results: Experiments show that the method effectively recovers clear RGB images from highly blurred scenes and outperforms state-of-the-art deblurring algorithms in multiple settings.

Traditional cameras produce desirable vision results but struggle with motion blur in high-speed scenes due to long exposure windows. Existing frame-based deblurring algorithms face challenges in extracting useful motion cues from severely blurred images. Recently, an emerging bio-inspired vision sensor known as the spike camera has achieved an extremely high frame rate while preserving rich spatial details, owing to its novel sampling mechanism. However, typical binary spike streams are relatively low-resolution, degraded image signals devoid of color information, making them unfriendly to human vision. In this paper, we propose a novel approach that integrates the two modalities from two branches, leveraging spike streams as auxiliary visual cues for guiding deblurring in high-speed motion scenes. We propose the first spike-based motion deblurring model with bidirectional information complementarity. We introduce a content-aware motion magnitude attention module that utilizes learnable mask to extract relevant information from blurry images effectively, and we incorporate a transposed cross-attention fusion module to efficiently combine features from both spike data and blurry RGB images. Furthermore, we build two extensive synthesized datasets for training and validation purposes, encompassing high-temporal-resolution spikes, blurry images, and corresponding sharp images. The experimental results demonstrate that our method effectively recovers clear RGB images from highly blurry scenes and outperforms state-of-the-art deblurring algorithms in multiple settings.

PyNeRF: Pyramidal Neural Radiance Fields
Haithem Turki Michael Zollhöfer Christian Richardt Deva Ramanan



Research question: How to improve the spatial-grid representations of neural radiance fields (NeRFs) to handle scale when reconstructing scenes captured at different camera distances.
Motivation: Current accelerated NeRF methods do not reason about scale, while scale-aware approaches such as Mip-NeRF rely on positional encodings incompatible with grid methods and train slowly.
Method: A simple modification: train model heads at different spatial grid resolutions, and at render time use coarser grids to render samples that cover larger volumes.
Results: The method is easy to apply to existing accelerated NeRF methods and significantly improves rendering quality (reducing error rates by 20-90% across synthetic and unbounded real-world scenes) with minimal performance overhead, as each model head is quick to evaluate. Compared to Mip-NeRF, error rates drop by 20% while training over 60x faster.

Neural Radiance Fields (NeRFs) can be dramatically accelerated by spatial grid representations. However, they do not explicitly reason about scale and so introduce aliasing artifacts when reconstructing scenes captured at different camera distances. Mip-NeRF and its extensions propose scale-aware renderers that project volumetric frustums rather than point samples. But such approaches rely on positional encodings that are not readily compatible with grid methods. We propose a simple modification to grid-based models by training model heads at different spatial grid resolutions. At render time, we simply use coarser grids to render samples that cover larger volumes. Our method can be easily applied to existing accelerated NeRF methods and significantly improves rendering quality (reducing error rates by 20–90% across synthetic and unbounded real-world scenes) while incurring minimal performance overhead (as each model head is quick to evaluate). Compared to Mip-NeRF, we reduce error rates by 20% while training over 60x faster.
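The render-time rule described above, routing samples that cover larger volumes to heads trained on coarser grids, amounts to a small piece of arithmetic. The log-scale mapping and parameter names below are illustrative assumptions, not PyNeRF's exact formula.

```python
import math

def select_grid_head(sample_radius: float, base_radius: float, num_heads: int) -> int:
    """Sketch: pick a model head by how large a volume the sample covers.

    sample_radius: footprint of the current sample (grows with camera distance)
    base_radius:   footprint matched by the finest head (level 0)
    """
    level = int(math.log2(max(sample_radius / base_radius, 1.0)))
    return min(level, num_heads - 1)  # clamp to the coarsest available head

# a nearby sample uses the finest head; a distant one a coarser head
assert select_grid_head(0.01, 0.01, num_heads=4) == 0
assert select_grid_head(0.09, 0.01, num_heads=4) == 3
```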

Generalizable One-shot 3D Neural Head Avatar
Xueting Li Shalini De Mello Sifei Liu Koki Nagano Umar Iqbal Jan Kautz



Research question: How to reconstruct and animate a 3D head avatar from a single-view portrait image.
Motivation: Existing methods either require time-consuming optimization over multiple images for a specific person, or struggle to synthesize intricate appearance details beyond the facial region.
Method: A framework is proposed that not only generalizes to unseen identities from a single-view image without person-specific optimization, but also captures characteristic details within and beyond the face (e.g., hairstyle, accessories). At its core are three branches that produce three tri-planes representing the coarse 3D geometry and the detailed appearance of the source image, as well as the expression of the target image. Applying volumetric rendering to the combination of the three tri-planes, followed by a super-resolution module, yields high-fidelity images of the desired identity, expression, and pose.
Results: Experiments show that the method generalizes well to unseen validation datasets, surpassing state-of-the-art baselines on head avatar reconstruction and animation by a large margin.

We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based on a single-view image without requiring person-specific optimization, but also captures characteristic details within and beyond the face area (e.g. hairstyle, accessories, etc.). At the core of our method are three branches that produce three tri-planes representing the coarse 3D geometry, detailed appearance of a source image, as well as the expression of a target image. By applying volumetric rendering to the combination of the three tri-planes followed by a super-resolution module, our method yields a high fidelity image of the desired identity, expression and pose. Once trained, our model enables efficient 3D head avatar reconstruction and animation via a single forward pass through a network. Experiments show that the proposed approach generalizes well to unseen validation datasets, surpassing SOTA baseline methods by a large margin on head avatar reconstruction and animation.

HeadSculpt: Crafting 3D Head Avatars with Text
Xiao Han Yukang Cao Kai Han Xiatian Zhu Jiankang Deng Yi-Zhe Song Tao Xiang Kwan-Yee K. Wong



Research question: Existing text-guided 3D generation methods face two main problems when creating high-fidelity 3D head avatars: they rely heavily on pre-trained text-to-image diffusion models that lack the necessary 3D awareness and head priors, making the generated avatars prone to inconsistency and geometric distortion; and they fall short in fine-grained editing.
Motivation: To address these problems, this paper proposes HeadSculpt, a versatile coarse-to-fine pipeline for generating and editing 3D head avatars from text prompts.
Method: First, the diffusion model is equipped with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back-view appearance of heads, enabling 3D-consistent head avatar generation. Second, a novel identity-aware editing score distillation strategy is proposed, optimizing a textured mesh with a high-resolution differentiable rendering technique so that identity is preserved while the editing instruction is followed.
Results: Comprehensive experiments and comparisons with existing methods demonstrate HeadSculpt's superior fidelity and editing capabilities.

Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two aspects: (1) They rely mostly on a pre-trained text-to-image diffusion model whilst missing the necessary 3D awareness and head priors. This makes them prone to inconsistency and geometric distortions in the generated avatars. (2) They fall short in fine-grained editing. This is primarily due to the inherited limitations from the pre-trained 2D image diffusion models, which become more pronounced when it comes to 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back view appearance of heads, enabling 3D-consistent head avatar generations. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This enables identity preservation while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.

Multi-modal Queried Object Detection in the Wild
Yifan Xu Mengdan Zhang Chaoyou Fu Peixian Chen Xiaoshan Yang Ke Li Changsheng Xu



Research question: How to use both textual descriptions and visual exemplars for multi-modal queried object detection.
Motivation: Existing language-queried object detectors cannot handle both open-vocabulary categories and detection tasks of various granularities.
Method: The proposed MQ-Det incorporates visual queries into existing language-queried object detectors, adding a scalable perceiver module that augments category text with class-wise visual information. A vision-conditioned masked language prediction strategy is further proposed to address the learning inertia introduced by the frozen detector.
Results: Experiments show that multi-modal queries largely boost open-world detection. For example, MQ-Det improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries, and by +6.3% AP on average across 13 few-shot downstream tasks, with merely 3% additional modulating time.

We introduce MQ-Det, an efficient architecture and pre-training strategy designed to utilize both textual description with open-set generalization and visual exemplars with rich description granularity as category queries, namely, Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and various granularity. MQ-Det incorporates vision queries into existing well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module upon the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem brought by the frozen detector, a vision-conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy design is compatible with most language-queried object detectors, thus yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by an average of +6.3% AP on 13 few-shot downstream tasks, with merely 3% additional modulating time over GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.
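The gated class-scalable perceiver described above can be sketched as a cross-attention layer from category text embeddings to visual exemplar features, with a zero-initialized gate so the frozen detector starts unchanged. Dimensions and the tanh gate below are illustrative assumptions, not MQ-Det's exact module.

```python
import torch
import torch.nn as nn

class GatedVisualPerceiver(nn.Module):
    """Sketch: augment category text features with class-wise visual information
    through gated cross-attention over visual exemplars."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero gate: starts as the identity

    def forward(self, text_emb: torch.Tensor, vision_queries: torch.Tensor):
        # text_emb: (B, C, D) category text features from the frozen detector
        # vision_queries: (B, V, D) features of visual exemplars
        attended, _ = self.cross_attn(text_emb, vision_queries, vision_queries)
        return text_emb + torch.tanh(self.gate) * attended  # gated augmentation

perceiver = GatedVisualPerceiver()
augmented = perceiver(torch.randn(2, 80, 256), torch.randn(2, 5, 256))
```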

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds
Hao Yang Haiyang Wang Di Dai Liwei Wang



Research question: Existing point cloud pre-training methods overlook the incompleteness of point clouds: LiDAR captures only a fraction of the points, causing ambiguity during training. Images offer more comprehensive information and richer semantics that can help address this incompleteness, but incorporating them into point cloud pre-training faces problems such as occlusion.
Motivation: To address these problems, this paper proposes PRED, a novel image-assisted pre-training framework that pre-trains on outdoor point clouds in an occlusion-aware manner.
Method: The main ingredient of PRED is a bird's-eye-view (BEV) feature-map-conditioned semantic rendering that leverages image semantics for supervision through neural rendering. Model performance is further enhanced by point-wise masking with a high mask ratio (95%).
Results: Experiments show that PRED outperforms prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks.

Pre-training is crucial in 3D-related fields such as autonomous driving where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Birds-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Codes will be available at https://github.com/PRED4pc/PRED.

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models
Yichao Cao Qingfei Tang Xiu Su Song Chen Shan You Xiaobo Lu Chang Xu



Research question: This paper tackles universal human-object interaction recognition in an open-world setting.
Motivation: The complexity and diversity of real-world human-object interactions pose major challenges for both annotation and recognition, particularly for interaction recognition in the open world.
Method: Using vision-language foundation models and large language models, the proposed method, dubbed UniHOI, includes an HO Prompt-guided Decoder (HOPD) that associates high-level relation representations in the foundation model with different HO pairs in the image, and leverages a large language model (e.g., GPT) for interaction interpretation, generating richer linguistic understanding.
Results: Under both supervised and zero-shot settings, UniHOI's efficient architecture design and learning methods effectively unleash the potential of vision-language foundation models and large language models, surpassing all existing methods by a substantial margin.

Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects, predicting triplets, and serving as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly in recognizing interactions within an open world context. This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). The proposed method is dubbed as UniHOI. We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. Furthermore, we utilize a LLM (i.e. GPT) for interaction interpretation, generating a richer linguistic understanding for complex HOIs. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of the VL foundation models and LLMs, allowing UniHOI to surpass all existing methods with a substantial margin, under both supervised and zero-shot settings. The code and pre-trained weights will be made publicly available.

Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation
Yuan Wang Naisong Luo Tianzhu Zhang



Research question: How to segment objects of new categories given only a handful of annotated samples.
Motivation: Existing few-shot segmentation (FSS) methods focus mainly on exploring support information while paying insufficient attention to mining the critical query branch.
Method: A new query-centric FSS model, the Adversarial Mining Transformer (AMFormer), is proposed, achieving accurate query image segmentation with only rough support guidance or even weak support labels. An object mining transformer (G) and a detail mining transformer (D) are designed, and G and D are trained via an adversarial process.
Results: State-of-the-art results on the commonly used Pascal-5i and COCO-20i benchmarks; under the query-centric paradigm, satisfactory performance is achieved even with weak support labels.

Few-shot segmentation (FSS) aims to segment objects of new categories given only a handful of annotated samples. Previous works focus their efforts on exploring the support information while paying less attention to the mining of the critical query branch. In this paper, we rethink the importance of support information and propose a new query-centric FSS model Adversarial Mining Transformer (AMFormer), which achieves accurate query image segmentation with only rough support guidance or even weak support labels. The proposed AMFormer enjoys several merits. First, we design an object mining transformer (G) that can achieve the expansion of incomplete region activated by support clue, and a detail mining transformer (D) to discriminate the detailed local difference between the expanded mask and the ground truth. Second, we propose to train G and D via an adversarial process, where G is optimized to generate more accurate masks approaching ground truth to fool D. We conduct extensive experiments on commonly used Pascal-5i and COCO-20i benchmarks and achieve state-of-the-art results across all settings. In addition, the decent performance with weak support labels in our query-centric paradigm may inspire the development of more general FSS models.
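The adversarial process between G and D described above follows the familiar alternating update of GAN training. The sketch below is a minimal PyTorch illustration with 1x1-conv stand-ins for the two transformers; all module shapes, losses, and signatures are assumptions, not AMFormer's actual interfaces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMaskNet(nn.Module):
    """Stand-in for the object/detail mining transformers (the real G and D
    are transformers; a 1x1-conv stub is used here for brevity)."""
    def __init__(self, in_ch):
        super().__init__()
        self.head = nn.Conv2d(in_ch, 1, kernel_size=1)
    def forward(self, x):
        return torch.sigmoid(self.head(x))

# G maps query features + a support-clue channel to a mask;
# D scores (features, mask) pairs, discriminating predictions from ground truth.
G, D = TinyMaskNet(65), TinyMaskNet(65)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def adversarial_step(query_feat, support_clue, gt_mask):
    """One adversarial mining step: D learns to separate real from generated
    masks, then G is updated to match the ground truth and fool D."""
    # update D: ground-truth masks -> 1, generated masks -> 0
    with torch.no_grad():
        pred = G(torch.cat([query_feat, support_clue], dim=1))
    d_real = D(torch.cat([query_feat, gt_mask], dim=1))
    d_fake = D(torch.cat([query_feat, pred], dim=1))
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # update G: approach the ground truth and fool D
    pred = G(torch.cat([query_feat, support_clue], dim=1))
    d_fake = D(torch.cat([query_feat, pred], dim=1))
    loss_g = F.binary_cross_entropy(pred, gt_mask) + \
             F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

q = torch.randn(2, 64, 32, 32)
clue = torch.rand(2, 1, 32, 32)    # rough support-activated region
gt = torch.randint(0, 2, (2, 1, 32, 32)).float()
adversarial_step(q, clue, gt)
```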

Shape Non-rigid Kinematics (SNK): A Zero-Shot Method for Non-Rigid Shape Matching via Unsupervised Functional Map Regularized Reconstruction
Souhaib Attaiki Maks Ovsjanikov



Research question: This paper proposes Shape Non-rigid Kinematics (SNK), a new non-rigid shape matching method that eliminates the need for extensive training or ground-truth data.
Motivation: Traditional non-rigid shape matching methods require extensive training or ground-truth data, whereas SNK simplifies the matching process by predicting and converting an unsupervised functional map, while maintaining accuracy.
Method: SNK adopts a reconstruction-based strategy with an encoder-decoder architecture, deforming the source shape to closely match the target shape. During this process, the predicted functional map is converted into a point-to-point map, serving as a supervisory mechanism for the reconstruction. To aid training, a new decoder architecture is designed that generates smooth, realistic deformations.
Results: Experiments show that SNK is competitive on traditional benchmarks, simplifying the shape-matching process without sacrificing accuracy.

We present Shape Non-rigid Kinematics (SNK), a novel zero-shot method for non-rigid shape matching that eliminates the need for extensive training or ground-truth data. SNK operates on a single pair of shapes and employs a reconstruction-based strategy using an encoder-decoder architecture, which deforms the source shape to closely match the target shape. During the process, an unsupervised functional map is predicted and converted into a point-to-point map, serving as a supervisory mechanism for the reconstruction. To aid in training, we have designed a new decoder architecture that generates smooth, realistic deformations. SNK demonstrates competitive results on traditional benchmarks, simplifying the shape-matching process without compromising accuracy. Our code can be found online: https://github.com/pvnieo/SNK
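The functional-map-to-point-to-point conversion used as the supervisory signal above is a standard spectral operation: transport the source eigenbasis through the functional map, then match each target vertex to its nearest transported source row. A minimal NumPy sketch, with illustrative basis sizes:

```python
import numpy as np

def functional_map_to_p2p(C, evecs_src, evecs_tgt):
    """Standard conversion of a functional map into a point-to-point map via
    nearest neighbors between spectral embeddings.

    C:         (k, k) functional map from the source to the target spectral basis
    evecs_src: (n_src, k) Laplace-Beltrami eigenvectors of the source shape
    evecs_tgt: (n_tgt, k) eigenvectors of the target shape
    returns:   (n_tgt,) index of the matched source vertex for each target vertex
    """
    transported = evecs_src @ C.T  # source spectral embeddings mapped into the target basis
    # nearest transported source row for each target embedding
    d2 = ((evecs_tgt[:, None, :] - transported[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

k, n_src, n_tgt = 20, 500, 480
p2p = functional_map_to_p2p(np.random.randn(k, k),
                            np.random.randn(n_src, k),
                            np.random.randn(n_tgt, k))
```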

Segment Anything in High Quality
Lei Ke Mingqiao Ye Martin Danelljan Yifan liu Yu-Wing Tai Chi-Keung Tang Fisher Yu



Research question: Existing large-scale segmentation models often produce masks of insufficient quality on objects with intricate structures.
Motivation: To address this, we propose HQ-SAM, which improves mask prediction quality while preserving the original model's promptable design, efficiency, and zero-shot generalizability.
Method: A learnable High-Quality Output Token is introduced into the SAM model to predict high-quality masks, and this token is fused with early and final ViT features to improve mask details.
Results: Tested on 10 diverse segmentation datasets across multiple downstream tasks, 8 of them evaluated under zero-shot transfer protocols, HQ-SAM performs strongly.

The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is trained only on the introduced dataset of 44K masks, which takes only 4 hours on 8 GPUs. We show the efficacy of HQ-SAM in a suite of 10 diverse segmentation datasets across different downstream tasks, where 8 out of them are evaluated in a zero-shot transfer protocol. Our code and pretrained models are at https://github.com/SysCV/SAM-HQ.
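The core idea above, a single learnable token decoded against ViT features fused from early and final layers, can be sketched compactly. The layer choices, additive fusion, and MLP below are assumptions for illustration, not HQ-SAM's exact decoder.

```python
import torch
import torch.nn as nn

class HQTokenHead(nn.Module):
    """Sketch: a learnable High-Quality Output Token produces mask logits by
    a per-pixel dot product with features fused from early and final ViT layers."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.hq_token = nn.Parameter(torch.randn(1, dim))
        self.fuse_early = nn.Conv2d(dim, dim, kernel_size=1)
        self.fuse_final = nn.Conv2d(dim, dim, kernel_size=1)
        self.token_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                       nn.Linear(dim, dim))

    def forward(self, early_feat, final_feat):
        # early_feat, final_feat: (B, D, H, W) ViT features from the image encoder
        fused = self.fuse_early(early_feat) + self.fuse_final(final_feat)
        token = self.token_mlp(self.hq_token)               # (1, D) decoded token
        # dot product between the token and each pixel's fused feature
        return torch.einsum("td,bdhw->bthw", token, fused)  # (B, 1, H, W) mask logits

head = HQTokenHead()
mask_logits = head(torch.randn(2, 256, 64, 64), torch.randn(2, 256, 64, 64))
```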

Echoes Beyond Points: Unleashing the Power of Raw Radar Data in Multi-modality Fusion
Yang Liu Feng Wang Naiyan Wang Zhaoxiang Zhang



Research question: How to improve radar detection performance in autonomous driving systems.
Motivation: Radar is widely used in autonomous driving systems thanks to its low cost and good adaptability to bad weather. However, its detection performance is usually inferior because its point cloud is sparse and inaccurate.
Method: This paper proposes EchoFusion, a new method that skips the existing radar signal processing pipeline and fuses raw radar data directly with data from other sensors. Specifically, bird's-eye-view queries are first generated, and the corresponding spectrum features are then extracted from the radar and fused with the other sensors.
Results: With this approach, the method exploits both the rich, lossless range and speed cues from radar echoes and the rich semantic cues from images, surpassing all existing methods on the RADIal dataset and approaching the performance of LiDAR.

Radar is ubiquitous in autonomous driving systems due to its low cost and good adaptability to bad weather. Nevertheless, the radar detection performance is usually inferior because its point cloud is sparse and not accurate due to the poor azimuth and elevation resolution. Moreover, point cloud generation algorithms already drop weak signals to reduce the false targets which may be suboptimal for the use of deep fusion. In this paper, we propose a novel method named EchoFusion to skip the existing radar signal processing pipeline and then incorporate the radar raw data with other sensors. Specifically, we first generate the Bird's Eye View (BEV) queries and then take corresponding spectrum features from radar to fuse with other sensors. By this approach, our method could utilize both rich and lossless distance and speed clues from radar echoes and rich semantic clues from images, making our method surpass all existing methods on the RADIal dataset, and approach the performance of LiDAR. The code will be released on https://github.com/tusen-ai/EchoFusion.

3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Xingrui Wang Wufei Ma Zhuowan Li Adam Kortylewski Alan Yuille



Research question: Existing visual question answering (VQA) datasets and models mainly focus on reasoning about 2D scenes, yet understanding the 3D structure of visual scenes is needed to support tasks such as navigation or manipulation.
Motivation: To address this, this paper proposes the task of 3D-aware VQA, which focuses on challenging questions requiring compositional reasoning over the 3D structure of visual scenes.
Method: 3D-aware VQA is addressed from both the dataset and the model perspective. First, Super-CLEVR-3D is introduced, a compositional reasoning dataset with questions about object parts, 3D poses, and occlusions. Second, PO3D-VQA is proposed, a model combining the strengths of probabilistic neural symbolic program execution and deep neural networks with 3D generative object representations for robust visual recognition.
Results: Experiments show that PO3D-VQA significantly outperforms existing methods, but a significant performance gap to 2D VQA benchmarks remains, indicating that 3D-aware VQA is still an important open research area.

Despite rapid progress in Visual question answering (\textit{VQA}), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks like navigation or manipulation. This includes an understanding of the 3D object pose, their parts and occlusions. In this work, we introduce the task of 3D-aware VQA, which focuses on challenging questions that require a compositional reasoning over the 3D structure of visual scenes. We address 3D-aware VQA from both the dataset and the model perspective. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. Second, we propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition. Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but we still observe a significant performance gap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area.

Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Zhimin Chen Longlong Jing Yingwei Li Bing Li



Research question: How to leverage foundation models to enrich 3D scene representation learning and overcome the domain gap.
Motivation: Although foundation models have achieved remarkable results on 2D and language tasks, their potential for 3D scene representation learning remains largely untapped, mainly due to the domain gap.
Method: An innovative method named Bridge3D pre-trains 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, semantic masks from foundation models guide the masking and reconstruction process of a masked autoencoder, focusing the model on foreground representations. The 3D-text gap is further bridged with image captioning foundation models, facilitating scene-level knowledge distillation.
Results: Bridge3D significantly surpasses existing state-of-the-art methods on 3D object detection and semantic segmentation tasks. For example, on the ScanNet dataset, Bridge3D improves the baseline by 6.3%.

Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding. However, their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap. In this work, we propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, enabling more focused attention on foreground representations. Moreover, we bridge the 3D-text gap at the scene level using image captioning foundation models, thereby facilitating scene-level knowledge distillation. We further extend this bridging effort by introducing an innovative object-level knowledge distillation method that harnesses highly accurate object-level masks and semantic text data from foundation models. Our methodology significantly surpasses the performance of existing state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will be available at: https://github.com/Zhimin-C/Bridge3D

Hierarchical Open-vocabulary Universal Image Segmentation
Xudong Wang Shufan Li Konstantinos Kallidromitis Yusuke Kato Kazuki Kozuka Trevor Darrell



Research question: This paper tackles semantic-level, instance-level, and part-level tasks in open-vocabulary image segmentation.
Motivation: Existing methods typically sidestep segmentation ambiguity and treat it as an external factor, whereas our approach actively incorporates hierarchical representations spanning different semantic levels into the learning process.
Method: We propose a decoupled text-image fusion mechanism and representation learning modules for "things" and "stuff", and systematically examine the differences in textual and visual features between these category types.
Results: Our model, HIPIE, handles hierarchical, open-vocabulary, and universal segmentation tasks within a unified framework. Benchmarked on diverse datasets such as ADE20K, COCO, Pascal-VOC Part, and RefCOCO/RefCOCOg, HIPIE achieves state-of-the-art results at various levels of image understanding, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), and part-level (e.g., part/subpart segmentation) tasks.

Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, our approach actively incorporates a hierarchical representation encompassing different semantic-levels into the learning process. We propose a decoupled text-image fusion mechanism and representation learning modules for both “things” and “stuff”. Additionally, we systematically examine the differences that exist in the textual and visual features between these types of categories. Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework. Benchmarked on diverse datasets, e.g., ADE20K, COCO, Pascal-VOC Part, and RefCOCO/RefCOCOg, HIPIE achieves the state-of-the-art results at various levels of image comprehension, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), as well as part-level (e.g., part/subpart segmentation) tasks.

ISP: Multi-Layered Garment Draping with Implicit Sewing Patterns
Ren Li Benoît Guillard Pascal Fua



Research question: Existing garment-draping methods for human body models either cannot handle the multi-layered clothing common in everyday dress or are restricted to bodies in T-pose.
Motivation: To address these limitations, this paper introduces a parametric garment representation model.
Method: The model decomposes each garment into individual 2D panels whose 3D shape is defined via a 2D-to-3D mapping. The 2D parameterization makes potential collisions easy to detect, while the 3D parameterization handles complex shapes effectively.
Results: Experiments show this combination is faster and yields higher-quality reconstructions than purely implicit surface representations, and its differentiability makes it possible to recover multi-layered garments from images. It also supports rapid editing of garment shape and texture by modifying individual 2D panels.

Many approaches to draping individual garments on human body models are realistic, fast, and yield outputs that are differentiable with respect to the body shape on which they are draped. However, they are either unable to handle multi-layered clothing, which is prevalent in everyday dress, or restricted to bodies in T-pose. In this paper, we introduce a parametric garment representation model that addresses these limitations. As in models used by clothing designers, each garment consists of individual 2D panels. Their 2D shape is defined by a Signed Distance Function and 3D shape by a 2D to 3D mapping. The 2D parameterization enables easy detection of potential collisions and the 3D parameterization handles complex shapes effectively. We show that this combination is faster and yields higher quality reconstructions than purely implicit surface representations, and makes the recovery of layered garments from images possible thanks to its differentiability. Furthermore, it supports rapid editing of garment shapes and texture by modifying individual 2D panels.

STXD: Structural and Temporal Cross-Modal Distillation for Multi-View 3D Object Detection
Sujin Jang Dae Ung Jo Sung Ju Hwang Dongwook Lee Daehyun Ji



Research question: How to detect 3D objects from multi-view images as an alternative to expensive LiDAR-based detectors.
Motivation: Lacking precise spatial cues, 3D object detection from multi-view images is an extremely challenging task.
Method: Propose a novel structural and temporal cross-modal knowledge distillation (STXD) framework, which reduces redundancy among the student's feature components while maximizing cross-modal similarity, and transfers temporal knowledge by encoding the similarity of features across a sequence of frames, further improving distillation quality.
Results: Experiments show that STXD significantly improves the NDS and mAP of the base student detectors by 2.8%~4.5% on the nuScenes test set.

3D object detection (3DOD) from multi-view images is an economically appealing alternative to expensive LiDAR-based detectors, but also an extremely challenging task due to the absence of precise spatial cues. Recent studies have leveraged the teacher-student paradigm for cross-modal distillation, where a strong LiDAR-modality teacher transfers useful knowledge to a multi-view-based image-modality student. However, prior approaches have only focused on minimizing global distances between cross-modal features, which may lead to suboptimal knowledge distillation results. Based on these insights, we propose a novel structural and temporal cross-modal knowledge distillation (STXD) framework for multi-view 3DOD. First, STXD reduces redundancy of the feature components of the student by regularizing the cross-correlation of cross-modal features, while maximizing their similarities. Second, to effectively transfer temporal knowledge, STXD encodes temporal relations of features across a sequence of frames via similarity maps. Lastly, STXD also adopts a response distillation method to further enhance the quality of knowledge distillation at the output-level. Our extensive experiments demonstrate that STXD significantly improves the NDS and mAP of the base student detectors by 2.8%~4.5% on the nuScenes testing dataset.
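
The structural part of this objective is reminiscent of redundancy-reduction losses in self-supervised learning. A minimal sketch in that style, assuming (N, D) feature batches from the camera student and the LiDAR teacher; this illustrates the idea, not the authors' exact loss:

```python
import torch

def structural_distill_loss(f_student, f_teacher, lamb=5e-3):
    """Redundancy-reduction style cross-modal loss (illustrative sketch).

    f_student, f_teacher: (N, D) feature batches from the camera
    student and the LiDAR teacher. Not the paper's exact formulation.
    """
    # Standardize each feature dimension over the batch.
    zs = (f_student - f_student.mean(0)) / (f_student.std(0) + 1e-6)
    zt = (f_teacher - f_teacher.mean(0)) / (f_teacher.std(0) + 1e-6)

    # Cross-modal cross-correlation matrix, shape (D, D).
    c = zs.T @ zt / zs.shape[0]

    # Pull diagonal entries toward 1 (maximize cross-modal similarity)
    # and off-diagonal entries toward 0 (reduce feature redundancy).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lamb * off_diag
```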

RevColV2: Exploring Disentangled Representations in Masked Image Modeling
Qi Han Yuxuan Cai Xiangyu Zhang



Research question: Existing masked image modeling (MIM) methods suffer from inconsistent representations between pre-training and fine-tuning, which can hurt downstream performance.
Motivation: To address this, the paper proposes a new architecture, RevColV2, which keeps the full autoencoder architecture during both pre-training and fine-tuning.
Method: The main body of RevColV2 consists of bottom-up and top-down columns, between which information propagates reversibly and is gradually disentangled. This design keeps low-level and semantic information decoupled at the end of the network during MIM pre-training.
Results: Experiments show that a foundation model with disentangled features achieves competitive performance on downstream vision tasks such as image classification, semantic segmentation, and object detection. For example, after intermediate fine-tuning on ImageNet-22K, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation; with an extra teacher and a large-scale dataset, RevColV2-L reaches 62.1 APbox on COCO detection and 60.4 mIoU on ADE20K semantic segmentation.

Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning, which can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. Such design enables our architecture with the nice property: maintaining disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation and object detection. For example, after intermediate fine-tuning on ImageNet-22K dataset, RevColV2-L attains 88.4\% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation. With extra teacher and large scale dataset, RevColV2-L achieves 62.1 APbox on COCO detection and 60.4 mIoU on ADE20K semantic segmentation.

DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation
Shentong Mo Enze Xie Ruihang Chu Lanqing HONG Matthias Nießner Zhenguo Li



Research question: Existing 3D diffusion methods mostly adopt U-Net architectures; it remains unclear whether Transformer architectures are equally effective for 3D shape generation.
Motivation: To fill this gap, we propose DiT-3D, a novel diffusion Transformer for 3D shape generation that denoises voxelized point clouds directly.
Method: DiT-3D follows the design philosophy of DiT but introduces 3D positional and patch embeddings to aggregate input from voxelized point clouds, and incorporates 3D window attention to reduce the computational cost of self-attention in 3D shape generation.
Results: Experiments on the ShapeNet dataset show that DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation.

Recent Diffusion Transformers (i.e., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is unclear how the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, named DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, the DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into Transformer blocks, as the increased 3D token length resulting from the additional dimension of voxels can lead to high computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, we empirically observe that the pre-trained DiT-2D checkpoint on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation.

Unsupervised Polychromatic Neural Representation for CT Metal Artifact Reduction
Qing Wu Lixuan Chen Ce Wang Hongjiang Wei S Kevin Zhou Jingyi Yu Yuyao Zhang



Research question: This paper tackles the difficulty of human-body CT imaging in the presence of metallic implants.
Motivation: CT metal artifacts arise from the drastic variation of metal attenuation coefficients across the energy levels of the X-ray spectrum, producing a nonlinear metal effect in CT measurements; we therefore treat metal-affected CT image recovery as a nonlinear inverse problem.
Method: Propose a novel Polychromatic neural representation (Polyner). First, derive a polychromatic forward model that accurately simulates the nonlinear CT acquisition process; then incorporate this forward model into an implicit neural representation to perform reconstruction; finally, adopt a regularizer that preserves the physical properties of CT images across energy levels while effectively constraining the solution space.
Results: Experiments show that Polyner matches or exceeds supervised methods on in-domain datasets while delivering significant gains on out-of-domain datasets. To our knowledge, Polyner is the first unsupervised MAR method to outperform its supervised counterparts.

Emerging neural reconstruction techniques based on tomography (e.g., NeRF, NeAT, and NeRP) have started showing unique capabilities in medical imaging. In this work, we present a novel Polychromatic neural representation (Polyner) to tackle the challenging problem of CT imaging when metallic implants exist within the human body. CT metal artifacts arise from the drastic variation of metal's attenuation coefficients at various energy levels of the X-ray spectrum, leading to a nonlinear metal effect in CT measurements. Recovering CT images from metal-affected measurements hence poses a complicated nonlinear inverse problem where empirical models adopted in previous metal artifact reduction (MAR) approaches lead to signal loss and strongly aliased reconstructions. Polyner instead models the MAR problem from a nonlinear inverse problem perspective. Specifically, we first derive a polychromatic forward model to accurately simulate the nonlinear CT acquisition process. Then, we incorporate our forward model into the implicit neural representation to accomplish reconstruction. Lastly, we adopt a regularizer to preserve the physical properties of the CT images across different energy levels while effectively constraining the solution space. Our Polyner is an unsupervised method and does not require any external training data. Experimenting with multiple datasets shows that our Polyner achieves comparable or better performance than supervised methods on in-domain datasets while demonstrating significant performance improvements on out-of-domain datasets. To the best of our knowledge, our Polyner is the first unsupervised MAR method that outperforms its supervised counterparts. The code for this work is available at: https://github.com/iwuqing/Polyner.
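
In generic form (our notation, not necessarily the paper's), a polychromatic forward model integrates Beer-Lambert attenuation over a discretized source spectrum: $p_i = -\ln \sum_{k} S(E_k)\, \exp\big(-\sum_{j \in \mathrm{ray}_i} \mu(x_j, E_k)\, \delta_j\big)$, where $S(E_k)$ is the normalized spectrum weight at energy $E_k$, $\mu(x_j, E_k)$ is the attenuation coefficient the implicit network predicts at point $x_j$ and energy $E_k$, and $\delta_j$ is the ray-segment length through voxel $j$. The log-sum over energies is exactly the nonlinearity that single-energy empirical models used by earlier MAR approaches discard.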

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
Linyan Huang Zhiqi Li Chonghao Sima Wenhai Wang Jingdong Wang Yu Qiao Hongyang Li



Research question: How to improve the accuracy of camera-only 3D object detectors through knowledge transfer.
Motivation: The domain gap between LiDAR/multi-modal expert models and camera-only apprentice models, together with incompatibilities in temporal fusion, hinders distillation-based enhancement.
Method: Propose the VCD framework, comprising an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts the same structure as the camera-only apprentice to reduce feature disparity and uses LiDAR input as a depth prior to reconstruct the 3D scene, matching other heterogeneous multi-modal experts in performance. In addition, a fine-grained trajectory-based distillation module is introduced to individually rectify the motion misalignment of each object in the scene.
Results: The improved camera-only apprentice VCD-A sets a new state of the art on nuScenes with an NDS score of 63.1%.

Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the presence of the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features, while still achieving comparable performance to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts an identical structure as that of the camera-only apprentice in order to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving the performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced with the purpose of individually rectifying the motion misalignment for each object in the scene. With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS. The code will be released at https://github.com/OpenDriveLab/Birds-eye-view-Perception.

LEPARD: Learning Explicit Part Discovery for 3D Articulated Shape Reconstruction
Di Liu Anastasis Stathopoulos Qilong Zhangli Yunhe Gao Dimitris N. Metaxas



Research question: How to reconstruct the 3D articulated shape of an animal from a single in-the-wild image.
Motivation: Existing methods struggle with the pose variation caused by articulation, and overall animal shapes are complex; this motivates a part-based approach to 3D shape reconstruction, since parts are more robust and typically simpler.
Method: The LEPARD framework learns to discover semantically meaningful 3D parts and reconstructs 3D shapes in a part-based manner. Parts are explicitly represented as parameterized primitive surfaces with global and local 3D deformations that deform to match image evidence, and a kinematics-inspired optimization guides each primitive's deformation from 2D evidence.
Results: Experiments show that LEPARD outperforms existing methods in 3D animal shape reconstruction, improving overall reconstruction performance while discovering semantically meaningful and consistent parts.

Reconstructing the 3D articulated shape of an animal from a single in-the-wild image is a challenging task. We propose LEPARD, a learning-based framework that discovers semantically meaningful 3D parts and reconstructs 3D shapes in a part-based manner. This is advantageous as 3D parts are robust to pose variations due to articulations and their shape is typically simpler than the overall shape of the object. In our framework, the parts are explicitly represented as parameterized primitive surfaces with global and local deformations in 3D that deform to match the image evidence. We propose a kinematics-inspired optimization to guide each transformation of the primitive deformation given 2D evidence. Similar to recent approaches, LEPARD is only trained using off-the-shelf deep features from DINO and does not require any form of 2D or 3D annotations. Experiments on 3D animal shape reconstruction demonstrate significant improvement over existing alternatives in terms of both the overall reconstruction performance as well as the ability to discover semantically meaningful and consistent parts.

Compact Neural Volumetric Video Representations with Dynamic Codebooks
Haoyu Guo Sida Peng Yunzhi Yan Linzhan Mou Yujun Shen Hujun Bao Xiaowei Zhou



Research question: How to represent high-fidelity volumetric video at low storage cost.
Motivation: Existing feature-grid-based methods learn implicit neural representations quickly from input 2D images, but such explicit representations easily lead to oversized models when dynamic scenes are modeled.
Method: Propose a novel neural representation, the dynamic codebook, which compresses the model by merging similar features and compensates for the potential drop in rendering quality with a set of dynamic codes.
Results: Experiments on the NHR and DyNeRF datasets show that the method achieves state-of-the-art rendering quality while providing better storage efficiency.

This paper addresses the challenge of representing high-fidelity volumetric videos with low storage cost. Some recent feature grid-based methods have shown superior performance of fast learning implicit neural representations from input 2D images. However, such explicit representations easily lead to large model sizes when modeling dynamic scenes. To solve this problem, our key idea is reducing the spatial and temporal redundancy of feature grids, which intrinsically exist due to the self-similarity of scenes. To this end, we propose a novel neural representation, named dynamic codebook, which first merges similar features for the model compression and then compensates for the potential decline in rendering quality by a set of dynamic codes. Experiments on the NHR and DyNeRF datasets demonstrate that the proposed approach achieves state-of-the-art rendering quality, while being able to achieve more storage efficiency. The source code is available at https://github.com/zju3dv/compact_vv.

Hierarchical Adaptive Value Estimation for Multi-modal Visual Reinforcement Learning
Yangru Huang Peixi Peng Yifan Zhao Haoran Xu Mengyue Geng Yonghong Tian



Research question: Existing multi-modal visual reinforcement learning methods may overlook the distinct value of each modality during policy learning.
Motivation: To remedy this, the paper proposes a Local modality-customized Value Estimation (LVE) paradigm that dynamically estimates each modality's contribution and adjusts its importance weight at the value level.
Method: A task-contextual re-fusion process is developed to achieve task-level re-balancing at both feature and value levels, forming a Hierarchical Adaptive Value Estimation (HAVE) framework that adaptively coordinates the contributions of individual modalities and their collective efficacy.
Results: Using the CARLA benchmark with neuromorphic event and depth data, the authors demonstrate HAVE's capability and the effectiveness of its distinct components.

Integrating RGB frames with alternative modality inputs is gaining increasing traction in many vision-based reinforcement learning (RL) applications. Existing multi-modal vision-based RL methods usually follow a Global Value Estimation (GVE) pipeline, which uses a fused modality feature to obtain a unified global environmental description. However, such a feature-level fusion paradigm with a single critic may fall short in policy learning as it tends to overlook the distinct values of each modality. To remedy this, this paper proposes a Local modality-customized Value Estimation (LVE) paradigm, which dynamically estimates the contribution and adjusts the importance weight of each modality from a value-level perspective. Furthermore, a task-contextual re-fusion process is developed to achieve a task-level re-balance of estimations from both feature and value levels. To this end, a Hierarchical Adaptive Value Estimation (HAVE) framework is formed, which adaptively coordinates the contributions of individual modalities as well as their collective efficacy. Agents trained by HAVE are able to exploit the unique characteristics of various modalities while capturing their intricate interactions, achieving substantially improved performance. We specifically highlight the potency of our approach within the challenging landscape of autonomous driving, utilizing the CARLA benchmark with neuromorphic event and depth data to demonstrate HAVE's capability and the effectiveness of its distinct components.

H2RBox-v2: Incorporating Symmetry for Boosting Horizontal Box Supervised Oriented Object Detection
Yi Yu Xue Yang Qingyun Li Yue Zhou Feipeng Da Junchi Yan



Research question: How to learn rotated boxes (RBox) from the more readily available horizontal boxes (HBox) with the weakly-supervised detector H2RBox, to meet the rapidly growing demand for oriented object detection in autonomous driving and remote sensing.
Motivation: Existing oriented object detection methods require large amounts of rotated-box annotation; the weakly-supervised detector H2RBox learns rotated boxes from horizontal boxes, reducing the dependence on annotated data.
Method: The paper proposes H2RBox-v2, which exploits reflection symmetry via flip and rotate consistencies, pairing an H2RBox-style weakly-supervised branch with a novel self-supervised branch that learns orientation from the symmetry inherent in visual objects. Practical techniques are also adopted to stabilize and enhance the detector against peripheral issues such as angular periodicity.
Results: Experiments show that H2RBox-v2 is the first symmetry-aware self-supervised paradigm for oriented object detection. Compared with H2RBox, it is less susceptible to low-quality annotation and insufficient training data, and on multiple datasets it performs very close to its rotation-annotation-trained counterpart, Rotated FCOS.

With the rapidly increasing demand for oriented object detection, e.g. in autonomous driving and remote sensing, the recently proposed paradigm involving weakly-supervised detector H2RBox for learning rotated box (RBox) from the more readily-available horizontal box (HBox) has shown promise. This paper presents H2RBox-v2, to further bridge the gap between HBox-supervised and RBox-supervised oriented object detection. Specifically, we propose to leverage the reflection symmetry via flip and rotate consistencies, using a weakly-supervised network branch similar to H2RBox, together with a novel self-supervised branch that learns orientations from the symmetry inherent in visual objects. The detector is further stabilized and enhanced by practical techniques to cope with peripheral issues e.g. angular periodicity. To our best knowledge, H2RBox-v2 is the first symmetry-aware self-supervised paradigm for oriented object detection. In particular, our method shows less susceptibility to low-quality annotation and insufficient training data compared to H2RBox. Specifically, H2RBox-v2 achieves very close performance to a rotation annotation trained counterpart -- Rotated FCOS: 1) DOTA-v1.0/1.5/2.0: 72.31%/64.76%/50.33% vs. 72.44%/64.53%/51.77%; 2) HRSC: 89.66% vs. 88.99%; 3) FAIR1M: 42.27% vs. 41.25%.
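
The two consistencies can be read as constraints on a predicted orientation: a horizontal flip negates it, and a rotation shifts it by the rotation angle. A minimal sketch of such losses, with `predict_angle` a hypothetical stand-in for the detector's orientation head and angular periodicity ignored:

```python
import math
import torch

def symmetry_consistency_loss(predict_angle, images):
    """Illustrative flip/rotate consistency losses (not the authors' exact branch).

    predict_angle: a stand-in callable mapping an image batch to one
    orientation per image, in radians. Angle wrap-around, which the
    paper handles with dedicated techniques, is ignored here.
    """
    theta = predict_angle(images)

    # Flip consistency: horizontally flipping an image negates orientation.
    theta_flip = predict_angle(torch.flip(images, dims=[-1]))
    loss_flip = (theta_flip + theta).abs().mean()

    # Rotate consistency: rotating by 90 degrees shifts orientation by pi/2
    # (rot90 shown; in general the shift equals the rotation angle).
    theta_rot = predict_angle(torch.rot90(images, k=1, dims=[-2, -1]))
    loss_rot = (theta_rot - (theta + math.pi / 2)).abs().mean()

    return loss_flip + loss_rot
```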

PolyDiffuse: Polygonal Shape Reconstruction via Guided Set Diffusion Models
Jiacheng Chen Ruizhi Deng Yasutaka Furukawa



Research question: This paper addresses two fundamental challenges diffusion models face in structured reconstruction: 1) structured geometry is a "set" whose N elements admit N! different but equivalent representations, making denoising highly ambiguous; and 2) a reconstruction task has a single solution, so the initial noise must be chosen carefully, whereas any initial noise works for a generation task.
Motivation: Build a structured reconstruction algorithm that transforms visual sensor data into polygonal shapes, formulating reconstruction as a generation process conditioned on the sensor data.
Method: Propose a Guided Set Diffusion Model in which 1) the forward diffusion process learns "guidance networks" to control noise injection so that one representation of a sample stays distinct from its other permutation variants, resolving the denoising ambiguity; and 2) the reverse denoising process reconstructs polygonal shapes as a conditional generation process, initialized and directed by the guidance networks given the sensor data.
Results: Extensive experiments on standard benchmarks show that PolyDiffuse significantly advances the current state of the art and enables broader practical applications.

This paper presents \textit{PolyDiffuse}, a novel structured reconstruction algorithm that transforms visual sensor data into polygonal shapes with Diffusion Models (DM), an emerging machinery amid exploding generative AI, while formulating reconstruction as a generation process conditioned on sensor data. The task of structured reconstruction poses two fundamental challenges to DM: 1) A structured geometry is a ''set'' (e.g., a set of polygons for a floorplan geometry), where a sample of $N$ elements has $N!$ different but equivalent representations, making the denoising highly ambiguous; and 2) A ''reconstruction'' task has a single solution, where an initial noise needs to be chosen carefully, while any initial noise works for a generation task. Our technical contribution is the introduction of a Guided Set Diffusion Model where 1) the forward diffusion process learns \textit{guidance networks} to control noise injection so that one representation of a sample remains distinct from its other permutation variants, thus resolving denoising ambiguity; and 2) the reverse denoising process reconstructs polygonal shapes, initialized and directed by the guidance networks, as a conditional generation process subject to the sensor data. We have evaluated our approach for reconstructing two types of polygonal shapes: floorplan as a set of polygons and HD map for autonomous cars as a set of polylines. Through extensive experiments on standard benchmarks, we demonstrate that PolyDiffuse significantly advances the current state of the art and enables broader practical applications.

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
Haochen Wang Junsong Fan Yuxi Wang Kaiyou Song Tong Wang Zhaoxiang Zhang



Research question: Vision Transformers are insensitive to the order of input tokens, so an appropriate self-supervised pretext task is needed to enhance their location awareness.
Motivation: To address this, we propose DropPos, a new pretext task that improves position awareness by reconstructing dropped positions.
Method: We first drop a large random subset of positional embeddings; the model then classifies, for each non-overlapping patch, its actual position among all possible positions based solely on visual appearance. To increase the task's difficulty, we keep only a subset of patches visible. Since different patches may look visually similar, we also propose position smoothing and attentive reconstruction strategies to relax the classification problem.
Results: Experiments show DropPos is strong: it outperforms supervised pre-training and achieves competitive results against state-of-the-art self-supervised methods on a wide range of downstream benchmarks. This suggests that tasks like DropPos, which explicitly encourage spatial reasoning, indeed improve the location awareness of Vision Transformers.

As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
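
At its core the pretext task is an N-way position classification for each visible patch. A minimal sketch, assuming a hypothetical `encoder` that maps visible patches (fed without positional information) to per-patch logits over all N positions; position smoothing and attentive reconstruction are omitted:

```python
import torch
import torch.nn.functional as F

def droppos_loss(encoder, patches, keep_ratio=0.25):
    """Minimal DropPos-style pretext step (illustrative sketch).

    patches: (B, N, D) patch embeddings. A random subset of patches is
    kept visible and fed WITHOUT positional embeddings; the encoder
    (assumed to return (B, n_keep, N) logits) must classify each
    visible patch's true position among all N positions.
    """
    B, N, D = patches.shape
    n_keep = int(N * keep_ratio)

    # Randomly choose which patches stay visible.
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]          # (B, n_keep)
    visible = torch.gather(patches, 1, idx[..., None].expand(-1, -1, D))

    # Positional embeddings are dropped simply by not adding them.
    logits = encoder(visible)                                   # (B, n_keep, N)

    # Targets are the original positions; plain cross-entropy over N classes.
    return F.cross_entropy(logits.reshape(-1, N), idx.reshape(-1))
```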

DAC-DETR: Divide the Attention Layers and Conquer
Zhengdong Hu Yifan Sun Jingdong Wang Yi Yang



Research question: In DETR, cross-attention and self-attention have contrary effects on the object queries, which hurts training efficacy.
Motivation: To improve DETR's training efficiency, the opposing effects of cross-attention and self-attention on object queries need to be resolved.
Method: Propose Divide-And-Conquer DETR (DAC-DETR), which divides the cross-attention out from this conflict: an auxiliary decoder focuses on learning the cross-attention layers.
Results: Experiments show DAC-DETR brings notable gains over popular DETRs. For example, under the 12-epoch training scheme on MS-COCO, DAC-DETR improves Deformable DETR (ResNet-50) by +3.4 AP and reaches 50.9 (ResNet-50) / 58.1 AP (Swin-Large) when combined with popular methods such as DINO and an IoU-related loss.

This paper reveals a characteristic of DEtection Transformer (DETR) that negatively impacts its training efficacy, i.e., the cross-attention and self-attention layers in DETR decoder have contrary impacts on the object queries (though both impacts are important). Specifically, we observe the cross-attention tends to gather multiple queries around the same object, while the self-attention disperses these queries far away. To improve the training efficacy, we propose a Divide-And-Conquer DETR (DAC-DETR) that divides the cross-attention out from this contrary for better conquering. During training, DAC-DETR employs an auxiliary decoder that focuses on learning the cross-attention layers. The auxiliary decoder, while sharing all the other parameters, has NO self-attention layers and employs one-to-many label assignment to improve the gathering effect. Experiments show that DAC-DETR brings remarkable improvement over popular DETRs. For example, under the 12 epochs training scheme on MS-COCO, DAC-DETR improves Deformable DETR (ResNet-50) by +3.4 AP and achieves 50.9 (ResNet-50) / 58.1 AP (Swin-Large) based on some popular methods (i.e., DINO and an IoU-related loss). Our code will be made available at https://github.com/huzhengdongcs/DAC-DETR.

IDRNet: Intervention-Driven Relation Network for Semantic Segmentation
Zhenchao Jin Xiaowei Hu Lingting Zhu Luchuan Song Li Yuan Lequan Yu



Research question: This paper addresses the inadequate or ineffective contextual-information aggregation in existing context-modeling paradigms, caused by their reliance on large amounts of predetermined priors.
Motivation: Existing context-modeling paradigms, such as multi-scale-driven and similarity-driven context schemes, achieve impressive results yet often fail to aggregate contextual information effectively because they depend on many predetermined priors.
Method: The paper proposes a novel Intervention-Driven Relation Network (IDRNet) that uses a deletion diagnostics procedure to guide the modeling of contextual relations among pixels. Specifically, pixel-level representations are first grouped into semantic-level representations under pseudo-label guidance, and a feature enhancement module improves the distinguishability of the grouped representations. A deletion diagnostics procedure then models relations among these semantic-level representations, and the extracted relations guide their interaction. Finally, the interacted representations augment the original pixel-level representations for the final prediction.
Results: Extensive experiments validate IDRNet both quantitatively and qualitatively. Notably, the intervention-driven context scheme brings consistent improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmarks, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes.

Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks, which inspires the development of numerous context modeling paradigms, \emph{e.g.}, multi-scale-driven and similarity-driven context schemes. Despite the impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to reliance on large amounts of predetermined priors. To alleviate the issues, we propose a novel \textbf{I}ntervention-\textbf{D}riven \textbf{R}elation \textbf{Net}work (\textbf{IDRNet}), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations are utilized to augment original pixel-level representations for final predictions. Extensive experiments are conducted to validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes.

MonoUNI: A Unified Vehicle and Infrastructure-side Monocular 3D Object Detection Network with Sufficient Depth Clues
Jinrang Jia Zhenjia Li Yifeng Shi



Research question: How to build monocular 3D detection algorithms for both the vehicle side and the infrastructure side of autonomous driving, which rest on different prior knowledge.
Motivation: Because of diverse sensor installations and focal lengths, researchers face the challenge of constructing algorithms for the two settings from different priors.
Method: Taking into account the diversity of pitch angles and focal lengths, the paper proposes a unified optimization target named normalized depth, which unifies the 3D detection problem across both sides. To further improve monocular 3D detection accuracy, a 3D normalized cube depth of the obstacle is developed to promote the learning of depth information.
Results: Extensive experiments demonstrate the method's effectiveness. Without introducing any extra information, the method, named MonoUNI, achieves state-of-the-art performance on five widely used monocular 3D detection benchmarks, including Rope3D and DAIR-V2X-I for the infrastructure side, KITTI and Waymo for the vehicle side, and nuScenes for cross-dataset evaluation.

Monocular 3D detection of vehicle and infrastructure sides are two important topics in autonomous driving. Due to diverse sensor installations and focal lengths, researchers are faced with the challenge of constructing algorithms for the two topics based on different prior knowledge. In this paper, by taking into account the diversity of pitch angles and focal lengths, we propose a unified optimization target named normalized depth, which realizes the unification of 3D detection problems for the two sides. Furthermore, to enhance the accuracy of monocular 3D detection, 3D normalized cube depth of obstacle is developed to promote the learning of depth information. We posit that the richness of depth clues is a pivotal factor impacting the detection performance on both the vehicle and infrastructure sides. A richer set of depth clues facilitates the model to learn better spatial knowledge, and the 3D normalized cube depth offers sufficient depth clues. Extensive experiments demonstrate the effectiveness of our approach. Without introducing any extra information, our method, named MonoUNI, achieves state-of-the-art performance on five widely used monocular 3D detection benchmarks, including Rope3D and DAIR-V2X-I for the infrastructure side, KITTI and Waymo for the vehicle side, and nuScenes for the cross-dataset evaluation.

Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation
Chaofan Ma Yuhuan Yang Chen Ju Fei Zhang Ya Zhang Yanfeng Wang



Research question: Open-vocabulary semantic segmentation must segment novel object categories, but existing methods rest on assumptions that break in practice, such as low-quality textual category names.
Motivation: Existing vision-language pre-training approaches assume that new textual categories are provided accurately and completely and exist in the pre-training lexicon. These assumptions fail for ambiguous or incomplete names, new words absent from the pre-trained lexicon, and categories that users find hard to describe.
Method: The paper proposes a novel attribute decomposition-aggregation framework, inspired by how humans understand new concepts. In the decomposition stage, class names are decoupled into diverse attribute descriptions that complement semantic context from multiple perspectives, with two construction strategies: large language models for common categories and manual annotation for human-invented categories. In the aggregation stage, the diverse attributes are grouped into an integrated global description, forming a discriminative classifier that separates the target object from others; a hierarchical aggregation architecture with a carefully designed clustering module achieves multi-level aggregation. The final result is obtained by computing the similarity between the aggregated attributes and the image embeddings.
Results: To evaluate effectiveness, three datasets are annotated with attribute descriptions, and extensive experiments and ablation studies are conducted. The results show the superior performance of attribute decomposition-aggregation.

Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent works explore vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, \ie, low-quality textual category names. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and exist in lexicons during pre-training. However, exceptions often arise with ambiguous, brief, or incomplete names, new words that are not present in the pre-trained lexicons, and categories that users find difficult to describe. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and involving manual labelling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. One hierarchical aggregation architecture is further proposed to achieve multi-level aggregation, leveraging the meticulously designed clustering module. The final result is obtained by computing the similarity between aggregated attributes and image embeddings. To evaluate the effectiveness, we annotate three datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation. We refer readers to the latest arXiv version at https://arxiv.org/abs/2309.00096 .

How2comm: Communication-Efficient and Collaboration-Pragmatic Multi-Agent Perception
Dingkang Yang Kun Yang Yuzheng Wang Jing Liu Zhi Xu Rongbin Yin Peng Zhai Lihua Zhang



Research question: Multi-agent collaborative perception is receiving wide attention as an emerging application in driving scenarios, but various noises in the perception procedure (communication redundancy, transmission delay, and collaboration heterogeneity) remain challenging.
Motivation: To tackle these issues, we propose How2comm, a collaborative perception framework that seeks a balance between perception performance and communication bandwidth.
Method: Our novelties are threefold. First, we devise a mutual-information-aware communication mechanism that maximally sustains the informative features shared by collaborators. Second, we present a flow-guided delay compensation strategy that predicts future features and eliminates the feature misalignment caused by temporal asynchrony. Finally, we introduce a pragmatic collaboration transformer that integrates holistic spatial semantics and temporal context cues among agents.
Results: Comprehensive evaluations on several LiDAR-based collaborative detection datasets, covering both real-world and simulated scenarios, demonstrate the superiority of How2comm and all of its vital components. The code will be released at https://github.com/ydk122024/How2comm.

Multi-agent collaborative perception has recently received widespread attention as an emerging application in driving scenarios. Despite the advancements in previous efforts, challenges remain due to various noises in the perception procedure, including communication redundancy, transmission delay, and collaboration heterogeneity. To tackle these issues, we propose \textit{How2comm}, a collaborative perception framework that seeks a trade-off between perception performance and communication bandwidth. Our novelties lie in three aspects. First, we devise a mutual information-aware communication mechanism to maximally sustain the informative features shared by collaborators. The spatial-channel filtering is adopted to perform effective feature sparsification for efficient communication. Second, we present a flow-guided delay compensation strategy to predict future characteristics from collaborators and eliminate feature misalignment due to temporal asynchrony. Ultimately, a pragmatic collaboration transformer is introduced to integrate holistic spatial semantics and temporal context clues among agents. Our framework is thoroughly evaluated on several LiDAR-based collaborative detection datasets in real-world and simulated scenarios. Comprehensive experiments demonstrate the superiority of How2comm and the effectiveness of all its vital components. The code will be released at https://github.com/ydk122024/How2comm.

ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections
Chun-Han Yao Amit Raj Wei-Chih Hung Michael Rubinstein Yuanzhen Li Ming-Hsuan Yang Varun Jampani



Research question: How to estimate 3D articulated shapes, such as animal bodies, from monocular images?
Motivation: Ambiguities in camera viewpoint, pose, texture, lighting, and so on make estimating 3D articulated shapes from monocular images inherently challenging.
Method: We propose ARTIC3D, a self-supervised framework that reconstructs per-instance 3D shapes from sparse in-the-wild image collections. Specifically, ARTIC3D builds on a skeleton-based surface representation guided by 2D diffusion priors from Stable Diffusion. First, we enhance input images via 2D diffusion to obtain cleaner mask estimates and semantic features. Second, we perform diffusion-guided 3D optimization to estimate shape and texture that are high-fidelity and faithful to the input images. We also propose a novel technique for computing more stable image-level gradients through diffusion models than existing alternatives. Finally, we produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations.
Results: Extensive evaluations on multiple existing datasets, as well as newly introduced noisy web image collections with occlusions and truncation, show that ARTIC3D is more robust to noisy images, higher quality in shape and texture detail, and more realistic when animated.

Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging due to the ambiguities of camera viewpoint, pose, texture, lighting, etc. We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild. Specifically, ARTIC3D is built upon a skeleton-based surface representation and is further guided by 2D diffusion priors from Stable Diffusion. First, we enhance the input images with occlusions/truncation via 2D diffusion to obtain cleaner mask estimates and semantic features. Second, we perform diffusion-guided 3D optimization to estimate shape and texture that are of high-fidelity and faithful to input images. We also propose a novel technique to calculate more stable image-level gradients via diffusion models compared to existing alternatives. Finally, we produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations. Extensive evaluations on multiple existing datasets as well as newly introduced noisy web image collections with occlusions and truncation demonstrate that ARTIC3D outputs are more robust to noisy images, higher quality in terms of shape and texture details, and more realistic when animated.

topic-3

Topic words :  training,  model,  efficient,  performance,  memory,  methods,  based,  large

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Daniel Y Fu Simran Arora Jessica Grogan Isys Johnson Sabri Eyuboglu Armin W Thomas Benjamin Frederick Spector Michael Poli Atri Rudra Christopher Re



Research question: Existing pre-trained language and image-classification models scale quadratically in both sequence length and model dimension; is there an efficient architecture that scales sub-quadratically along both axes?
Motivation: To address this scaling problem, the paper proposes the Monarch Mixer (M2) model.
Method: M2 uses a single sub-quadratic primitive along both axes: Monarch matrices, an expressive class of structured matrices that captures many linear transforms and achieves high hardware efficiency on GPUs.
Results: Experiments show that M2 performs well in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. In non-causal BERT-style modeling, M2 matches BERT-base and BERT-large on downstream GLUE quality with up to 27% fewer parameters and achieves up to 9.1x higher throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in accuracy with only half the parameters. For causal GPT-style models, a new theoretical view based on multivariate polynomial evaluation and interpolation alleviates the quadratic bottleneck introduced by masking, keeping M2 causal while remaining sub-quadratic.

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these axes. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1$\times$ higher throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in accuracy, with only half the parameters. Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE—showing for the first time that it may be possible to match Transformer quality without attention or MLPs.
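
A Monarch matrix on n = m*m dimensions factors into two block-diagonal matrices interleaved with fixed permutations, so a matrix-vector product costs O(n^1.5) rather than O(n^2). A sketch of that product, with the permutations realized as reshape/transpose; conventions vary across implementations, so treat this as illustrative rather than the paper's kernel:

```python
import torch

def monarch_matvec(blocks1, blocks2, x):
    """Multiply by a Monarch-structured matrix (illustrative sketch).

    blocks1, blocks2: (m, m, m), i.e. m dense m-by-m blocks each,
    forming two block-diagonal factors. x: (..., n) with n = m*m.
    """
    m = blocks1.shape[0]
    *batch, n = x.shape
    assert n == m * m
    x = x.reshape(*batch, m, m)
    # First block-diagonal multiply: block k acts on chunk k of x.
    x = torch.einsum('kij,...kj->...ki', blocks1, x)
    # Fixed permutation, realized as a transpose of the factor axes.
    x = x.transpose(-1, -2)
    # Second block-diagonal multiply on the permuted layout.
    x = torch.einsum('kij,...kj->...ki', blocks2, x)
    # Undo the permutation and flatten back to length n.
    return x.transpose(-1, -2).reshape(*batch, n)
```

Each step touches m blocks of size m-by-m, giving 2*m*m^2 = 2*n^1.5 multiply-adds per vector instead of the n^2 of a dense layer.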

Fine-Tuning Language Models with Just Forward Passes
Sadhika Malladi Tianyu Gao Eshaan Nichani Alex Damian Jason D. Lee Danqi Chen Sanjeev Arora



Research question: Backpropagation for large language models requires a prohibitive amount of memory; how can fine-tuning be done effectively?
Motivation: Zeroth-order optimizers can in principle estimate gradients with just two forward passes, but in practice they are too slow for large models.
Method: Propose a memory-efficient zeroth-order optimizer (MeZO) that adapts the classical zeroth-order SGD method to operate in place, fine-tuning large language models with the same memory footprint as inference.
Results: Experiments show that MeZO outperforms in-context learning and linear probing on a variety of tasks and matches backpropagation-based fine-tuning, while reducing memory by up to 12x and GPU compute time by up to 2x.

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12× memory reduction and up to 2× GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
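
The in-place trick is that the random perturbation is never stored: it is replayed from a shared seed for the two forward passes and again for the update. A minimal sketch under that reading, with a hypothetical `loss_fn` that runs forward passes only:

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One MeZO-style update: a hedged sketch of in-place ZO-SGD.

    loss_fn(model, batch) returns a scalar loss from a forward pass
    only (call this whole function under torch.no_grad()). The
    perturbation z is replayed from a seed, so neither z nor any
    gradient is ever materialized.
    """
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)           # replay the same z each call
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    perturb(+1)
    loss_plus = loss_fn(model, batch)     # loss at theta + eps*z
    perturb(-2)
    loss_minus = loss_fn(model, batch)    # loss at theta - eps*z
    perturb(+1)                           # restore theta

    g = (loss_plus - loss_minus) / (2 * eps)   # projected-gradient estimate

    torch.manual_seed(seed)
    for p in model.parameters():
        z = torch.randn_like(p)
        p.data.add_(-lr * g * z)          # in-place SGD step along z
```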

Scaling Data-Constrained Language Models
Niklas Muennighoff Alexander M Rush Boaz Barak Teven Le Scao Nouamane Tazi Aleksandra Piktus Sampo Pyysalo Thomas Wolf Colin Raffel



Research question: This paper studies how to scale language models when data is constrained.
Motivation: Given the finite amount of text on the internet, training dataset size may soon be the limiting factor, motivating the study of scaling language models in data-constrained regimes.
Method: Through a large set of experiments varying the degree of data repetition and the compute budget, reaching up to 900 billion training tokens and 9-billion-parameter models, the authors find that at a fixed compute budget, training with repeated data (up to about 4 epochs) yields negligible loss changes compared to unique data, but with more repetition the value of added compute eventually decays to zero. They propose a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters, and experiment with mitigations for data scarcity, including augmenting the training dataset with code data and relaxing commonly used filters.
Results: The models and datasets from the 400 training runs are freely available at https://github.com/huggingface/datablations.

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

Bridging Discrete and Backpropagation: Straight-Through and Beyond
Liyuan Liu Chengyu Dong Xiaodong Liu Bin Yu Jianfeng Gao



Research question: Backpropagation, the cornerstone of deep learning, is limited when problems involve discrete latent variables.
Motivation: To handle discrete latent variables, we propose a new approach for approximating the gradient of the parameters that generate them.
Method: We first examine the widely used Straight-Through (ST) heuristic and show that it works as a first-order approximation of the gradient. Guided by this finding, we propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs. ReinMax requires no Hessian or other second-order derivatives, so its computational overhead is negligible.
Results: Extensive experiments on a variety of tasks show that ReinMax outperforms the state of the art.

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables. First, we examine the widely used Straight-Through (ST) heuristic and demonstrate that it works as a first-order approximation of the gradient. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun’s method, a second-order numerical method for solving ODEs. ReinMax does not require Hessian or other second-order derivatives, thus having negligible computation overheads. Extensive experimental results on various tasks demonstrate the superiority of ReinMax over the state of the art.
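
For reference, the Straight-Through heuristic the paper analyzes fits in a few lines: the forward pass emits a discrete one-hot sample while the backward pass differentiates through the softmax, which is exactly the first-order behaviour the paper identifies. A standard sketch (ReinMax's second-order correction is not reproduced here):

```python
import torch
import torch.nn.functional as F

def straight_through_sample(logits):
    """Straight-Through estimator for a categorical latent (sketch).

    logits: (N, C). Forward value: a discrete one-hot sample.
    Backward: gradients flow through the softmax probabilities, a
    first-order approximation of the true gradient.
    """
    probs = F.softmax(logits, dim=-1)
    index = torch.multinomial(probs, 1)                   # (N, 1) sample
    one_hot = F.one_hot(index.squeeze(-1), logits.shape[-1]).float()
    # Numerically equal to one_hot in the forward pass; in the backward
    # pass d(out)/d(logits) equals d(probs)/d(logits).
    return one_hot + probs - probs.detach()
```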

How to Scale Your EMA
Dan Busbridge Jason Ramapuram Pierre Ablin Tatiana Likhomanenko Eeshan Gunesh Dhekane Xavier Suau Russell Webb



Research question: How to trade off across batch sizes while preserving training dynamics?
Motivation: Preserving training dynamics across batch sizes is an important tool in practical machine learning, since it enables the trade-off between batch size and wall-clock time.
Method: Propose a new scaling rule for optimization in the presence of a model EMA, and demonstrate its validity across a range of architectures, optimizers, and data modalities.
Results: Experiments show that the rule also holds when the model EMA contributes to the optimization of the target model, enabling EMA-based pseudo-labeling and self-supervised learning to be trained at both small and large batch sizes. For self-supervised learning, training (BYOL) at batch size 24,576 is achieved without sacrificing performance, a 6x wall-clock time reduction under idealized hardware settings.

Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.
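
As we read the paper, the rule scales the EMA momentum exponentially with the batch-size factor, alongside the usual learning-rate scaling; treat the exact form below as our reading rather than a quotation:

```python
def scale_ema_momentum(rho, kappa):
    """EMA momentum under a batch-size scaling factor kappa (sketch).

    When the batch size is scaled by kappa (with the learning rate
    scaled per the usual rule), the EMA momentum rho is scaled
    exponentially, rho_hat = rho ** kappa, so the EMA trajectory is
    preserved. This is our reading of the paper's scaling rule.
    """
    return rho ** kappa

# Example: momentum 0.999 tuned at batch size 512, moving to 4096.
rho_hat = scale_ema_momentum(0.999, 4096 / 512)   # ~0.992
```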

Memory Efficient Optimizers with 4-bit States
Bingrui Li Jianfei Chen Jun Zhu



Research question: Optimizer states are a major source of memory consumption when training neural networks, limiting the largest model trainable within a given memory budget.
Motivation: Compressing optimizer states from 32-bit floating point to lower bitwidths promises to reduce the training memory footprint; the lowest bitwidth achieved so far is 8 bits.
Method: Through a detailed analysis of the first and second moments, we push the optimizer-state bitwidth down to 4 bits. Specifically, we find that the moments have complicated outlier patterns that current block-wise quantization cannot approximate accurately; we use a smaller block size and propose exploiting both row-wise and column-wise information for better quantization. We also identify a zero-point problem in quantizing the second moment and solve it with a linear quantizer that excludes the zero point.
Results: Our 4-bit optimizers are evaluated on a broad set of benchmarks, including natural language understanding, machine translation, image classification, and instruction tuning. On all tasks they achieve accuracy comparable to their full-precision counterparts while enjoying better memory efficiency.

Optimizer states are a major source of memory consumption for training neural networks, limiting the maximum trainable model within given memory budget. Compressing the optimizer states from 32-bit floating points to lower bitwidth is promising to reduce the training memory footprint, while the current lowest achievable bitwidth is 8-bit. In this work, we push optimizer states bitwidth down to 4-bit through a detailed empirical analysis of first and second moments. Specifically, we find that moments have complicated outlier patterns, that current block-wise quantization cannot accurately approximate. We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization. We further identify a zero point problem of quantizing the second moment, and solve this problem with a linear quantizer that excludes the zero point. Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all the tasks our optimizers can achieve comparable accuracy with their full-precision counterparts, while enjoying better memory efficiency.
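
The block-wise idea is that each small block of the state carries its own scale, so outliers corrupt only their own block. A plain absmax sketch with a hypothetical block size; the paper's actual quantizer differs (smaller blocks, row/column information, and the zero-point fix for the second moment):

```python
import torch

def blockwise_quantize(x, block_size=128, n_bits=4):
    """Block-wise absmax quantization of an optimizer state (sketch).

    Returns integer codes (carried in int8 for simplicity) plus one
    float scale per block. Plain linear absmax levels are used, not
    the paper's exact quantizer.
    """
    flat = x.flatten()
    pad = (-len(flat)) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)

    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    levels = 2 ** (n_bits - 1) - 1          # 7 signed levels for 4-bit
    q = torch.round(blocks / scale * levels).clamp(-levels, levels)
    return q.to(torch.int8), scale

def blockwise_dequantize(q, scale, shape, n_bits=4):
    levels = 2 ** (n_bits - 1) - 1
    blocks = q.float() / levels * scale
    return blocks.flatten()[: torch.Size(shape).numel()].view(shape)
```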

Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Zhen Qin Songlin Yang Yiran Zhong



Research question: How to model sequences both efficiently and effectively?
Motivation: Although Transformers surpass RNNs in parallel training and long-term dependency modeling, there has recently been renewed interest in using linear RNNs for efficient sequence modeling.
Method: This paper proposes the Hierarchically Gated Recurrent Neural Network (HGRN), a gated linear RNN that uses forget gates inside the recurrence itself (with learnable lower bounds that increase with depth), rather than gating only the output of the recurrent layer.
Results: Experiments on language modeling, image classification, and Long Range Arena benchmarks demonstrate the efficiency and effectiveness of HGRN.

Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.
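
Stripped to essentials, the layer is a linear recurrence whose forget gate is confined to [lower_bound, 1), with the bound growing with depth so that upper layers keep longer memories. A minimal sketch with hypothetical weights `w_f`, `w_i`; the paper's full layer has more structure:

```python
import torch

def hgrn_layer(x, w_f, w_i, lower_bound):
    """Gated linear recurrence with a lower-bounded forget gate (sketch).

    x: (T, B, D); w_f, w_i: (D, D). In HGRN the lower bound is
    learnable and increases monotonically when moving up layers.
    """
    f_raw = torch.sigmoid(x @ w_f)                 # (T, B, D), in (0, 1)
    f = lower_bound + (1 - lower_bound) * f_raw    # in (lower_bound, 1)
    i = (1 - f) * (x @ w_i)                        # input branch

    h = torch.zeros_like(x[0])
    out = []
    for t in range(x.shape[0]):                    # linear recurrence:
        h = f[t] * h + i[t]                        # no nonlinearity on h
        out.append(h)
    return torch.stack(out)
```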

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Sotiris Anagnostidis Dario Pavllo Luca Biggio Lorenzo Noci Aurelien Lucchi Thomas Hofmann



Research question: Autoregressive Transformers in large language models are hard to scale to long sequences; despite several attempts to reduce their computational cost, most language models still apply attention between all pairs of sequence tokens, incurring quadratic cost.
Motivation: This paper proposes a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, reducing memory and compute requirements during inference.
Method: The method employs a learnable mechanism that decides which uninformative tokens can be dropped from the context at any point in the generation process. Doing so not only addresses performance concerns but also enhances interpretability, offering valuable insight into the model's decision-making.
Results: Empirically, up to 80% of the context can be pruned without significant degradation on downstream tasks, providing a valuable tool for mitigating inference costs. The reference implementation achieves up to 2x higher inference throughput and even greater memory savings.

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most of LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point across the generation process. By doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to $2\times$ increase in inference throughput and even greater memory savings.
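
Mechanically, pruning amounts to gathering the subset of cached keys and values whose learned keep-scores survive. A sketch assuming such per-token scores have already been produced by the learnable mechanism (score computation itself is not reproduced here):

```python
import torch

def prune_context(keys, values, scores, sparsity=0.8):
    """Drop uninformative tokens from a KV cache (illustrative sketch).

    keys, values: (B, T, D); scores: (B, T) learned keep-scores.
    Keeps the top (1 - sparsity) fraction of tokens per sequence,
    preserving their original order.
    """
    B, T = scores.shape
    n_keep = max(1, int(T * (1 - sparsity)))
    keep = scores.topk(n_keep, dim=1).indices.sort(dim=1).values  # (B, n_keep)

    gather = keep[..., None].expand(-1, -1, keys.shape[-1])
    return keys.gather(1, gather), values.gather(1, gather)
```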

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
Runa Eschenhagen Alexander Immer Richard E Turner Frank Schneider Philipp Hennig



Research question: How to use the second-order optimization method K-FAC to speed up neural network training and reduce computational cost.
Motivation: The core components of modern architectures, such as transformers, convolutions, or graph neural networks, can all be expressed as linear layers with weight sharing. K-FAC is a promising second-order optimization method for speeding up training and reducing cost, but there is currently no framework that applies it to generic architectures, specifically those with linear weight-sharing layers.
Method: We identify two distinct settings of linear weight-sharing layers, corresponding to two K-FAC flavours: expand and reduce. We show that each is exact for deep linear networks with weight sharing in its respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we exploit to speed up automatic hyperparameter selection by optimizing the marginal likelihood of a Wide ResNet.
Results: When training a graph neural network and a vision transformer with the two K-FAC variants, we observe little difference between them. However, both variants reach a fixed validation-metric target within 50-75% of the steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.

The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with *weight-sharing*. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC -- *expand* and *reduce*. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via optimising the marginal likelihood for a Wide ResNet. Finally, we observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer. However, both variations are able to reach a fixed validation metric target in $50$-$75$\% of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.
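
For a single linear layer, K-FAC approximates the Fisher as a Kronecker product of an input-covariance factor and an output-gradient factor; the expand/reduce variants differ in how the weight-sharing axis is folded into the batch axis before these factors are formed. A minimal preconditioning sketch for one layer (damping and factor choice are illustrative):

```python
import torch

def kfac_precondition(grad_w, acts, grad_out, damping=1e-3):
    """K-FAC preconditioning for one linear layer (illustrative sketch).

    acts: (N, D_in) layer inputs; grad_out: (N, D_out) backpropagated
    output gradients; grad_w: (D_out, D_in) weight gradient. The
    Fisher is approximated as G kron A with the factors below.
    """
    N = acts.shape[0]
    A = acts.T @ acts / N                          # input factor (D_in, D_in)
    G = grad_out.T @ grad_out / N                  # output factor (D_out, D_out)

    A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0]))
    G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0]))

    # (G kron A)^-1 vec(grad) equals G^-1 @ grad_w @ A^-1,
    # computed without ever forming the Kronecker product.
    return G_inv @ grad_w @ A_inv
```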

WITRAN: Water-wave Information Transmission and Recurrent Acceleration Network for Long-range Time Series Forecasting
Yuxin Jia Youfang Lin Xinyan Hao Yan Lin Shengnan Guo Huaiyu Wan



Research question: How to accurately capture semantic information in long-range time series forecasting, including global and local correlations as well as long- and short-term repetitive patterns.
Motivation: Existing methods cannot address all of these issues simultaneously, and their time and memory complexities are still too high for long-range forecasting.
Method: Propose a novel Water-wave Information Transmission (WIT) framework, which captures long- and short-term repetitive patterns through bi-granular information transmission and models global and local correlations by recursively fusing and selecting information with a Horizontal Vertical Gated Selective Unit (HVGSU). To improve computational efficiency, a generic Recurrent Acceleration Network (RAN) is proposed that reduces the time complexity to O(√L) while keeping the memory complexity at O(L).
Results: The resulting Water-wave Information Transmission and Recurrent Acceleration Network (WITRAN) outperforms state-of-the-art methods by 5.80% on long-range and 14.28% on ultra-long-range time series forecasting, as validated on four benchmark datasets.

Capturing semantic information is crucial for accurate long-range time series forecasting, which involves modeling global and local correlations, as well as discovering long- and short-term repetitive patterns. Previous works have partially addressed these issues separately, but have not been able to address all of them simultaneously. Meanwhile, their time and memory complexities are still not sufficiently low for long-range forecasting. To address the challenge of capturing different types of semantic information, we propose a novel Water-wave Information Transmission (WIT) framework. This framework captures both long- and short-term repetitive patterns through bi-granular information transmission. It also models global and local correlations by recursively fusing and selecting information using Horizontal Vertical Gated Selective Unit (HVGSU). In addition, to improve the computing efficiency, we propose a generic Recurrent Acceleration Network (RAN) which reduces the time complexity to $\mathcal{O}(\sqrt{L})$ while maintaining the memory complexity at $\mathcal{O}(L)$. Our proposed method, called Water-wave Information Transmission and Recurrent Acceleration Network (WITRAN), outperforms the state-of-the-art methods by 5.80% and 14.28% on long-range and ultra-long-range time series forecasting tasks respectively, as demonstrated by experiments on four benchmark datasets. The code is available at: https://github.com/Water2sea/WITRAN.

Stable Nonconvex-Nonconcave Training via Linear Interpolation
Thomas Pethick Wanyun Xie Volkan Cevher



Research question: This paper develops a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
Motivation: Instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape; the paper shows how linear interpolation helps by leveraging the theory of nonexpansive operators.
Method: A new optimization scheme called relaxed approximate proximal point (RAPP) is constructed, the first explicit method to achieve last-iterate convergence rates for the full range of cohypomonotone problems; the construction extends to constrained and regularized settings.
Results: Experiments on generative adversarial networks corroborate the results, demonstrating the benefits of the linear interpolation present in both RAPP and Lookahead.

This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method to achieve last iterate convergence rates for the full range of cohypomonotone problems. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrates the benefits of the linear interpolation present in both RAPP and Lookahead.
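
The linear interpolation the paper analyzes is visible in Lookahead's slow/fast weight scheme: every k base-optimizer steps, the slow weights move a fraction alpha toward the fast weights. A minimal sketch (RAPP itself replaces the inner loop with an approximate proximal-point solve, not shown):

```python
import torch

def lookahead(fast_step, params, k=5, alpha=0.5, n_rounds=100):
    """Lookahead as linear interpolation (illustrative sketch).

    fast_step: hypothetical callable performing one base-optimizer
    update of `params` in place. Every k inner steps the slow weights
    interpolate toward the fast weights and the fast weights reset.
    """
    slow = [p.detach().clone() for p in params]
    for _ in range(n_rounds):
        for _ in range(k):
            fast_step()
        for p, s in zip(params, slow):
            s.add_(alpha * (p.detach() - s))   # slow += alpha*(fast - slow)
            p.data.copy_(s)                    # reset fast weights to slow
```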

Blockwise Parallel Transformers for Large Context Models
Hao Liu Pieter Abbeel



Research question: The large memory demands of self-attention and the feedforward network make it hard for existing Transformer models to handle tasks with long sequences and long-term dependencies.
Motivation: Propose Blockwise Parallel Transformer (BPT), a new method that lowers memory cost through blockwise computation of self-attention and feedforward network fusion.
Method: BPT computes self-attention blockwise and fuses the feedforward network into the same computation, processing longer input sequences while remaining memory-efficient.
Results: Experiments show that BPT can train on sequences up to 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods, while reducing memory requirements and improving performance.

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
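
Blockwise self-attention streams over key/value blocks with a running log-sum-exp, so the full T-by-T attention matrix is never materialized. A single-head sketch of that computation; BPT additionally fuses the feedforward network into the same blockwise loop, which is omitted here:

```python
import torch

def blockwise_attention(q, k, v, block_size=128):
    """Numerically stable blockwise attention (illustrative sketch).

    Computes softmax(q k^T / sqrt(d)) v one key/value block at a time.
    q, k, v: (T, d) for a single head.
    """
    T, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((T, 1), float('-inf'), dtype=q.dtype)  # running max
    l = torch.zeros(T, 1, dtype=q.dtype)                  # running denominator

    for s in range(0, T, block_size):
        kb, vb = k[s:s + block_size], v[s:s + block_size]
        logits = q @ kb.T * scale                 # (T, block)
        m_new = torch.maximum(m, logits.amax(dim=1, keepdim=True))
        correction = torch.exp(m - m_new)         # rescale old accumulators
        p = torch.exp(logits - m_new)
        l = l * correction + p.sum(dim=1, keepdim=True)
        out = out * correction + p @ vb
        m = m_new
    return out / l
```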

Grounding Neural Inference with Satisfiability Modulo Theories
Zifan Wang Saranya Vijayakumar Kaiji Lu Vijay Ganesh Somesh Jha Matt Fredrikson



Research question: How to integrate SMT solvers into deep neural networks to bridge the gap between inductive learning and symbolic reasoning techniques.
Motivation: Current deep learning models have limited symbolic reasoning ability; integrating SMT solvers into the network can effectively address this.
Method: Propose SMTLayer, a technique that integrates an SMT solver into the network's forward and backward passes. In the forward pass, the solver reasons over symbols produced by prior layers together with mathematical formulas; in the backward pass, the solver informs updates to the network, driving it toward representations compatible with the solver's theory.
Results: Experiments show that models using SMTLayer 1) require fewer training samples than conventional models, 2) are robust to certain types of covariate shift, and 3) ultimately learn representations consistent with symbolic knowledge, and are thus naturally interpretable.

Recent techniques that integrate solver layers into Deep Neural Networks (DNNs) have shown promise in bridging a long-standing gap between inductive learning and symbolic reasoning techniques. In this paper we present a set of techniques for integrating Satisfiability Modulo Theories (SMT) solvers into the forward and backward passes of a deep network layer, called SMTLayer. Using this approach, one can encode rich domain knowledge into the network in the form of mathematical formulas. In the forward pass, the solver uses symbols produced by prior layers, along with these formulas, to construct inferences; in the backward pass, the solver informs updates to the network, driving it towards representations that are compatible with the solver's theory. Notably, the solver need not be differentiable. We implement SMTLayer as a Pytorch module, and our empirical results show that it leads to models that 1) require fewer training samples than conventional models, 2) that are robust to certain types of covariate shift, and 3) that ultimately learn representations that are consistent with symbolic knowledge, and thus naturally interpretable.

Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions
Dongjie Wang Meng Xiao Min Wu pengfei wang Yuanchun Zhou Yanjie Fu



Research question: existing discrete feature-transformation methods suffer from an oversized search space and cannot reconcile efficiency with stability.
Motivation: methods such as exhaustive search, expansion-reduction, evolutionary algorithms, reinforcement learning, and iterative greedy all face an explosively large search space, and over-emphasizing efficiency in algorithm design usually sacrifices stability or robustness.
Method: reformulate discrete feature transformation as a continuous-space optimization task within an embedding-optimization-reconstruction framework of four steps: 1) reinforcement-enhanced data preparation, to collect high-quality transformation-accuracy training data; 2) feature-transformation operation-sequence embedding, to encapsulate the knowledge of the prepared data in a continuous space; 3) gradient-steered search for potentially superior embeddings in the learned space; 4) operation-sequence reconstruction, to recover the feature-transformation solution that pinpoints the optimal feature space.
Results: extensive experiments and case studies demonstrate the effectiveness and robustness of the method.

Feature transformation aims to generate new pattern-discriminative feature space from original features to improve downstream machine learning (ML) task performances. However, the discrete search space for the optimal feature explosively grows on the basis of combinations of features and operations from low-order forms to high-order forms. Existing methods, such as exhaustive search, expansion reduction, evolutionary algorithms, reinforcement learning, and iterative greedy, suffer from this large search space. Overly emphasizing efficiency in algorithm design usually sacrifices stability or robustness. To fundamentally fill this gap, we reformulate discrete feature transformation as a continuous space optimization task and develop an embedding-optimization-reconstruction framework. This framework includes four steps: 1) reinforcement-enhanced data preparation, aiming to prepare high-quality transformation-accuracy training data; 2) feature transformation operation sequence embedding, intending to encapsulate the knowledge of prepared training data within a continuous space; 3) gradient-steered optimal embedding search, dedicated to uncovering potentially superior embeddings within the learned space; 4) transformation operation sequence reconstruction, striving to reproduce the feature transformation solution to pinpoint the optimal feature space. Finally, extensive experiments and case studies are performed to demonstrate the effectiveness and robustness of the proposed method. The code and data are publicly accessible at https://www.dropbox.com/sh/imh8ckui7va3k5u/AACulQegVx0MuywYyoCqSdVPa?dl=0.

Randomized Sparse Neural Galerkin Schemes for Solving Evolution Equations with Deep Networks
Jules Berman Benjamin Peherstorfer



Research question: training neural networks sequentially in time to approximate solution fields of time-dependent partial differential equations is numerically challenging because training errors quickly accumulate and amplify over time.
Motivation: the paper proposes Neural Galerkin schemes that update a randomized sparse subset of network parameters at each time step, avoiding local overfitting and preventing errors from piling up during sequential-in-time training.
Method: train with Neural Galerkin schemes whose randomized updates avoid local overfitting and whose sparsity reduces the computational cost of training without losing expressiveness.
Results: in numerical experiments on a range of evolution equations, the randomized sparse update scheme is up to two orders of magnitude more accurate at a fixed computational budget, and up to two orders of magnitude faster at a fixed accuracy, than dense-update schemes.

Training neural networks sequentially in time to approximate solution fields of time-dependent partial differential equations can be beneficial for preserving causality and other physics properties; however, the sequential-in-time training is numerically challenging because training errors quickly accumulate and amplify over time. This work introduces Neural Galerkin schemes that update randomized sparse subsets of network parameters at each time step. The randomization avoids overfitting locally in time and so helps prevent the error from accumulating quickly over the sequential-in-time training, which is motivated by dropout that addresses a similar issue of overfitting due to neuron co-adaptation. The sparsity of the update reduces the computational costs of training without losing expressiveness because many of the network parameters are redundant locally at each time step. In numerical experiments with a wide range of evolution equations, the proposed scheme with randomized sparse updates is up to two orders of magnitude more accurate at a fixed computational budget and up to two orders of magnitude faster at a fixed accuracy than schemes with dense updates.
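
The following toy sketch shows the scheme's key ingredient in isolation: at each time step, only a randomly chosen sparse subset of parameters receives a gradient update. The "residual gradient" below is a stand-in for the Neural Galerkin residual, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, sparsity = 1000, 0.1

def residual_grad(theta, t):
    # Stand-in for the gradient of the Neural Galerkin residual at time t:
    # here the "solution" simply tracks a slowly moving target.
    target = np.sin(t + np.arange(n_params) / n_params)
    return theta - target

theta = rng.standard_normal(n_params)
for t in np.linspace(0.0, 2.0, 200):               # sequential-in-time loop
    idx = rng.choice(n_params, int(sparsity * n_params), replace=False)
    theta[idx] -= 0.5 * residual_grad(theta, t)[idx]  # sparse update: 10% of params
print("final residual norm:", np.linalg.norm(residual_grad(theta, 2.0)))
```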

SimFBO: Towards Simple, Flexible and Communication-efficient Federated Bilevel Learning
Yifan Yang Peiyao Xiao Kaiyi Ji



Research question: how to improve federated bilevel optimization (FBO) for machine learning and edge computing while cutting its heavy computation and communication overhead.
Motivation: existing FBO algorithms often involve complicated computations and require multiple sub-loops per iteration, incurring large communication costs.
Method: SimFBO, a simple and flexible FBO framework without sub-loops that improves communication efficiency via generalized server-side aggregation and updates; a variant, System-level heterogeneity robust FBO (ShroFBO), adds resilience to heterogeneous local computation.
Results: SimFBO and ShroFBO provably achieve a linear convergence speedup with partial client participation and client sampling without replacement, with improved sample and communication complexities, and experiments show they outperform existing FBO algorithms.

Federated bilevel optimization (FBO) has shown great potential recently in machine learning and edge computing due to the emerging nested optimization structure in meta-learning, fine-tuning, hyperparameter tuning, etc. However, existing FBO algorithms often involve complicated computations and require multiple sub-loops per iteration, each of which contains a number of communication rounds. In this paper, we propose a simple and flexible FBO framework named SimFBO, which is easy to implement without sub-loops, and includes a generalized server-side aggregation and update for improving communication efficiency. We further propose System-level heterogeneity robust FBO (ShroFBO) as a variant of SimFBO with stronger resilience to heterogeneous local computation. We show that SimFBO and ShroFBO provably achieve a linear convergence speedup with partial client participation and client sampling without replacement, as well as improved sample and communication complexities. Experiments demonstrate the effectiveness of the proposed methods over existing FBO algorithms.

Model Sparsity Can Simplify Machine Unlearning
Jinghan Jia Jiancheng Liu Parikshit Ram Yuguang Yao Gaowen Liu Yang Liu Pranay Sharma Sijia Liu



Research question: how to perform machine unlearning effectively, removing the influence of specific examples from a model.
Motivation: data regulation requirements call for efficient, approximate unlearning techniques that reduce the influence of specific training examples.
Method: a new model-based perspective: model sparsification via weight pruning, which narrows the gap between exact and approximate unlearning.
Results: in both theory and practice, model sparsity boosts the multi-criteria unlearning performance of approximate unlearners while remaining efficient, yielding consistent gains across diverse unlearning scenarios.

In response to recent data regulation requirements, machine unlearning (MU) has emerged as a critical process to remove the influence of specific examples from a given model. Although exact unlearning can be achieved through complete model retraining using the remaining dataset, the associated computational costs have driven the development of efficient, approximate unlearning techniques. Moving beyond data-centric MU approaches, our study introduces a novel model-based perspective: model sparsification via weight pruning, which is capable of reducing the gap between exact unlearning and approximate unlearning. We show in both theory and practice that model sparsity can boost the multi-criteria unlearning performance of an approximate unlearner, closing the approximation gap, while continuing to be efficient. This leads to a new MU paradigm, termed prune first, then unlearn, which infuses a sparse prior to the unlearning process. Building on this insight, we also develop a sparsity-aware unlearning method that utilizes sparsity regularization to enhance the training process of approximate unlearning. Extensive experiments show that our proposals consistently benefit MU in various unlearning scenarios. A notable highlight is the 77% unlearning efficacy gain of fine-tuning (one of the simplest approximate unlearning methods) when using our proposed sparsity-aware unlearning method. Furthermore, we showcase the practical impact of our proposed MU methods through two specific use cases: defending against backdoor attacks, and enhancing transfer learning through source class removal. These applications demonstrate the versatility and effectiveness of our approaches in addressing a variety of machine learning challenges beyond unlearning for data privacy. Codes are available at https://github.com/OPTML-Group/Unlearn-Sparse.
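
A minimal sketch of the prune first, then unlearn recipe on a linear model, under our own simplifying assumptions: magnitude pruning imposes sparsity, and approximate unlearning is then a few fine-tuning steps on the retained data with gradients masked to preserve the sparsity pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 50)), rng.standard_normal(200)
forget = np.zeros(200, dtype=bool)
forget[:20] = True                                  # examples to unlearn
Xr, yr = X[~forget], y[~forget]                     # retained data

w = np.linalg.lstsq(X, y, rcond=None)[0]            # model trained on all data

# 1) prune first: keep only the 30% largest-magnitude weights
mask = np.abs(w) >= np.quantile(np.abs(w), 0.7)
w = w * mask

# 2) then unlearn (approximately): fine-tune on retained data only,
#    masking gradients so the sparsity pattern is preserved
for _ in range(100):
    grad = Xr.T @ (Xr @ w - yr) / len(yr)
    w -= 0.05 * grad * mask
print("nonzeros:", int(mask.sum()), "retained-set MSE:", np.mean((Xr @ w - yr) ** 2))
```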

DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization
Zhiqing Sun Yiming Yang



Research question: broaden the current scope of neural solvers for NP-complete problems by introducing a new graph-based diffusion framework, DIFUSCO.
Motivation: current neural solvers have made progress on NP-complete problems without hand-crafted domain knowledge; this paper proposes a graph-based diffusion model to generate high-quality solutions.
Method: formulate NPC problems in a discrete {0, 1}-vector space and use graph-based denoising diffusion models to generate solutions; specifically, explore diffusion models with Gaussian and Bernoulli noise and introduce an effective inference schedule to improve generation quality.
Results: experiments show that DIFUSCO strongly outperforms previous state-of-the-art neural solvers, improving the performance gap to ground truth from 1.76% to 0.46% on TSP-500, from 2.46% to 1.17% on TSP-1000, and from 3.19% to 2.58% on TSP-10000; on MIS, DIFUSCO outperforms the previous state-of-the-art neural solver on the challenging SATLIB benchmark.

Neural network-based Combinatorial Optimization (CO) methods have shown promising results in solving various NP-complete (NPC) problems without relying on hand-crafted domain knowledge. This paper broadens the current scope of neural solvers for NPC problems by introducing a new graph-based diffusion framework, namely DIFUSCO. It formulates NPC problems into a discrete {0, 1}-vector space and uses graph-based denoising diffusion models to generate high-quality solutions. Specifically, we explore diffusion models with Gaussian and Bernoulli noise, respectively, and also introduce an effective inference schedule to improve the generation quality. We evaluate our methods on two well-studied combinatorial optimization problems: Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS). Experimental results show that DIFUSCO strongly outperforms the previous state-of-the-art neural solvers, improving the performance gap between ground-truth and neural solvers from 1.76% to 0.46% on TSP-500, from 2.46% to 1.17% on TSP-1000, and from 3.19% to 2.58% on TSP-10000. For the MIS problem, DIFUSCO outperforms the previous state-of-the-art neural solver on the challenging SATLIB benchmark. Our code is available at [this url](https://github.com/Edward-Sun/DIFUSCO).

Alternating Updates for Efficient Transformers
Cenk Baykal Dylan J Cutler Nishanth Dikkala Nikhil Ghosh Rina Panigrahy Xin Wang



Research question: how to improve deep model quality without increasing compute cost and inference latency.
Motivation: scaling up deep models improves quality and performance, but also inflates compute cost and inference latency.
Method: Alternating Updates (AltUp), which widens the learned representation and alternates updates over its subblocks, increasing model capacity with negligible added latency.
Results: AltUp is consistently effective across benchmark transformer models and language tasks, enabling up to 87% speedup on the SuperGLUE and SQuAD benchmarks relative to dense baselines at the same accuracy.

It has been well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation, i.e., the token embedding, while only incurring a negligible increase in latency. AltUp achieves this by working on a subblock of the widened representation at each layer and using a predict-and-correct mechanism to update the inactivated blocks. We present extensions of AltUp, such as its applicability to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity. Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios. Notably, on SuperGLUE and SQuAD benchmarks, AltUp enables up to $87\%$ speedup relative to the dense baselines at the same accuracy.
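
The sketch below conveys the predict-and-correct idea with assumed, illustrative weights: the widened representation is split into K subblocks, the expensive layer runs on a single activated subblock, a cheap learned mixer predicts all subblocks, and every subblock is then corrected toward the layer output. This is a schematic of the mechanism, not the released model.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 16                            # 4 subblocks, each of width d
blocks = rng.standard_normal((K, d))    # widened token representation
P = rng.standard_normal((K, K)) * 0.1   # learned prediction mixer (assumed)
g = rng.standard_normal(K) * 0.1        # learned correction gates (assumed)

def layer(x):                           # stand-in for an expensive transformer layer
    return np.tanh(x)

i = 0                                   # the activated subblock for this layer
pred = P @ blocks                       # predict: cheap mixing of all K subblocks
out = layer(blocks[i])                  # expensive computation on one subblock only
blocks = pred + np.outer(g, out - pred[i])   # correct every subblock toward `out`
```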

MeCo: Zero-Shot NAS with One Data and Single Forward Pass via Minimum Eigenvalue of Correlation
Tangyu Jiang Haodi Wang Rongfang Bie



Research question: existing zero-shot neural architecture search (NAS) methods require at least one backpropagation pass or depend heavily on the data and labels, which limits their applicability.
Motivation: to address this, the paper proposes a new zero-cost proxy that evaluates a network with a single forward pass on one random datum.
Method: first, reveal how the Pearson correlation matrix of the feature maps affects the convergence rate and generalization capacity of an over-parameterized neural network; then, propose a novel zero-cost proxy, $\mathsf{MeCo}$, and design an optimization variant $\mathsf{MeCo_{opt}}$ to improve its performance.
Results: $\mathsf{MeCo}$ achieves the highest correlation with the ground truth among all state-of-the-art proxies (e.g., 0.89 on NATS-Bench-TSS with CIFAR-10) while being fully independent of data and labels; integrating $\mathsf{MeCo}$ with an existing generation method yields a complete NAS pipeline that selects architectures with the highest accuracy at low search cost.

Neural Architecture Search (NAS) is a promising paradigm in automatic architecture engineering. Zero-shot NAS can evaluate the network without training via some specific metrics called zero-cost proxies. Though effective, the existing zero-cost proxies either invoke at least one backpropagation or depend highly on the data and labels. To alleviate the above issues, in this paper, we first reveal how the Pearson correlation matrix of the feature maps impacts the convergence rate and the generalization capacity of an over-parameterized neural network. Enlightened by the theoretical analysis, we propose a novel zero-cost proxy called $\mathsf{MeCo}$, which requires only one random data for a single forward pass. We further propose an optimization approach $\mathsf{MeCo_{opt}}$ to improve the performance of our method. We design comprehensive experiments and extensively evaluate $\mathsf{MeCo}$ on multiple popular benchmarks. $\mathsf{MeCo}$ achieves the highest correlation with the ground truth (e.g., 0.89 on NATS-Bench-TSS with CIFAR-10) among all the state-of-the-art proxies, which is also fully independent of the data and labels. Moreover, we integrate $\mathsf{MeCo}$ with the existing generation method to comprise a complete NAS. The experimental results illustrate that $\mathsf{MeCo}$-based NAS can select the architecture with the highest accuracy and a low search cost. For instance, the best network searched by $\mathsf{MeCo}$-based NAS achieves 97.31% on CIFAR-10, which is 0.04% higher than the baselines under the same settings. Our code is available at https://github.com/HamsterMimi/MeCo
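
As described, the proxy needs only one random input and a single forward pass; the sketch below scores a set of feature maps by the minimum eigenvalue of their Pearson correlation matrix, with random features standing in for a candidate architecture's activations.

```python
import numpy as np

def meco_score(feature_maps):
    corr = np.corrcoef(feature_maps)        # Pearson correlation across channels
    return np.linalg.eigvalsh(corr).min()   # minimum eigenvalue is the proxy

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 32 * 32))    # 8 flattened feature maps, one forward pass
print("MeCo score:", meco_score(feat))
```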

On quantum backpropagation, information reuse, and cheating measurement collapse
Amira Abbas Robbie King Hsin-Yuan Huang William J. Huggins Ramis Movassagh Dar Gilboa Jarrod Ryan McClean



Research question: whether parameterized quantum models can train as efficiently as classical neural networks.
Motivation: modern deep learning hinges on training neural networks at scale, and naively quantum measurement collapse seems to rule out the information reuse that backpropagation depends on; recent developments in shadow tomography challenge that view.
Method: assuming access to multiple copies of a quantum state, introduce an algorithm with foundations in shadow tomography that matches backpropagation scaling in quantum resources while reducing classical auxiliary computational costs.
Results: the results highlight the nuance of reusing quantum information for practical purposes and clarify the unique difficulties in training large quantum models, which could alter the course of quantum machine learning.

The success of modern deep learning hinges on the ability to train neural networks at scale. Through clever reuse of intermediate information, backpropagation facilitates training through gradient computation at a total cost roughly proportional to running the function, rather than incurring an additional factor proportional to the number of parameters -- which can now be in the trillions. Naively, one expects that quantum measurement collapse entirely rules out the reuse of quantum information as in backpropagation. But recent developments in shadow tomography, which assumes access to multiple copies of a quantum state, have challenged that notion. Here, we investigate whether parameterized quantum models can train as efficiently as classical neural networks. We show that achieving backpropagation scaling is impossible without access to multiple copies of a state. With this added ability, we introduce an algorithm with foundations in shadow tomography that matches backpropagation scaling in quantum resources while reducing classical auxiliary computational costs to open problems in shadow tomography. These results highlight the nuance of reusing quantum information for practical purposes and clarify the unique difficulties in training large quantum models, which could alter the course of quantum machine learning.

QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Jerry Chee Yaohui Cai Volodymyr Kuleshov Christopher De Sa



Research question: post-training parameter quantization of large language models.
Motivation: quantization is needed for efficient LLM inference, but existing methods degrade accuracy, especially at very low bit widths.
Method: a new method, quantization with incoherence processing (QuIP), built on the insight that quantization benefits from incoherent weight and Hessian matrices, which it ensures via multiplication by random orthogonal matrices together with an adaptive rounding procedure.
Results: experiments show that incoherence processing improves several existing quantization algorithms, and QuIP yields the first viable LLM quantization using only two bits per weight.

This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/jerry-chee/QuIP.
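
A minimal sketch of the incoherence-processing step only: multiplying a weight matrix by random orthogonal matrices spreads outlier entries out, so simple rounding loses less, and the rotation is undone afterwards. Plain nearest rounding stands in here for QuIP's adaptive rounding, and dense random orthogonal matrices stand in for the efficient structured ones used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[0, 0] = 25.0                                   # a single outlier weight

def rand_orth(n):
    return np.linalg.qr(rng.standard_normal((n, n)))[0]

def quantize(M, bits=4):
    s = np.abs(M).max() / (2 ** (bits - 1) - 0.5)   # symmetric uniform grid
    return np.round(M / s) * s                      # round-to-nearest

U, V = rand_orth(64), rand_orth(64)
direct = quantize(W)                             # quantize the raw weights
incoherent = U.T @ quantize(U @ W @ V) @ V.T     # rotate, quantize, rotate back
print("error (direct):    ", np.linalg.norm(W - direct))
print("error (incoherent):", np.linalg.norm(W - incoherent))
```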

Separable Physics-Informed Neural Networks
Junwoo Cho Seungtae Nam Hyunmo Yang Seok-Bae Yun Youngjoon Hong Eunbyung Park



Research question: training physics-informed neural networks (PINNs) to solve multi-dimensional PDEs and approximate very complex solution functions faces fundamental limitations.
Motivation: on challenging PDEs the number of required training (collocation) points grows substantially and is severely limited by expensive computational costs and heavy memory overhead.
Method: a new network architecture and training algorithm, separable PINN (SPINN). SPINN operates on a per-axis basis, replacing the point-wise processing of conventional PINNs and reducing the number of network propagations in multi-dimensional PDEs; forward-mode automatic differentiation lowers the cost of computing PDE residuals, enabling more than $10^7$ collocation points on a single commodity GPU.
Results: experiments show greatly reduced computational cost on multi-dimensional PDEs at the same accuracy (62x less wall-clock time and 1,394x fewer FLOPs at the same number of collocation points); SPINN solves a chaotic (2+1)-d Navier-Stokes equation much faster than the best-performing prior method (9 minutes vs. 10 hours on a single GPU) while maintaining accuracy, and accurately solves a highly nonlinear, multi-dimensional (3+1)-d Navier-Stokes equation.

Physics-informed neural networks (PINNs) have recently emerged as promising data-driven PDE solvers showing encouraging results on various PDEs. However, there is a fundamental limitation of training PINNs to solve multi-dimensional PDEs and approximate very complex solution functions. The number of training points (collocation points) required on these challenging PDEs grows substantially, and it is severely limited due to the expensive computational costs and heavy memory overhead. To overcome this limit, we propose a network architecture and training algorithm for PINNs. The proposed method, separable PINN (SPINN), operates on a per-axis basis to decrease the number of network propagations in multi-dimensional PDEs instead of point-wise processing in conventional PINNs. We also propose using forward-mode automatic differentiation to reduce the computational cost of computing PDE residuals, enabling a large number of collocation points ($>10^7$) on a single commodity GPU. The experimental results show significantly reduced computational costs ($62\times$ in wall-clock time, $1,394\times$ in FLOPs given the same number of collocation points) in multi-dimensional PDEs while achieving better accuracy. Furthermore, we present that SPINN can solve a chaotic (2+1)-d Navier-Stokes equation much faster than the best-performing prior method (9 minutes vs. 10 hours in a single GPU), maintaining accuracy. Finally, we showcase that SPINN can accurately obtain the solution of a highly nonlinear and multi-dimensional PDE, a (3+1)-d Navier-Stokes equation. For visualized results and code, please see https://jwcho5576.github.io/spinn.github.io/.
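
The separable ansatz can be sketched in a few lines: per-axis networks are evaluated on N points per axis and combined by an outer product over a rank dimension, so N + N network evaluations yield N^2 collocation values. The tiny stand-in MLPs and the rank below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
rank, N = 8, 256

def mlp(x, W1, W2):                     # stand-in per-axis network: R -> R^rank
    return np.tanh(x[:, None] @ W1) @ W2

Wx = (rng.standard_normal((1, 32)), rng.standard_normal((32, rank)))
Wy = (rng.standard_normal((1, 32)), rng.standard_normal((32, rank)))

x, y = np.linspace(0, 1, N), np.linspace(0, 1, N)
fx, gy = mlp(x, *Wx), mlp(y, *Wy)       # two size-N evaluations, not one of size N^2
u = fx @ gy.T                           # u[i, j] = sum_r fx[i, r] * gy[j, r]
print(u.shape)                          # (256, 256) collocation values on the grid
```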

Coop: Memory is not a Commodity
Jianhao Zhang Shihan Ma Peihong Liu Jinhui Yuan



Research question: existing tensor rematerialization techniques overlook the memory system of deep learning frameworks and wrongly assume that free memory blocks at different addresses are interchangeable, causing severe memory fragmentation and increasing the cost of potential rematerializations.
Motivation: to fix this, the paper evicts tensors within a sliding window, ensuring that all evicted tensors are contiguous and immediately usable.
Method: it further proposes cheap tensor partitioning and recomputable in-place operations, reducing rematerialization cost by optimizing tensor allocation.
Results: experiments show up to 2x memory saving over state-of-the-art baselines, with large reductions in compute overhead, search latency, and memory fragmentation.

Tensor rematerialization allows the training of deep neural networks (DNNs) under limited memory budgets by checkpointing the models and recomputing the evicted tensors as needed. However, the existing tensor rematerialization techniques overlook the memory system in deep learning frameworks and implicitly assume that free memory blocks at different addresses are identical. Under this flawed assumption, discontiguous tensors are evicted, among which some are not used to allocate the new tensor. This leads to severe memory fragmentation and increases the cost of potential rematerializations. To address this issue, we propose to evict tensors within a sliding window to ensure all evictions are contiguous and are immediately used. Furthermore, we propose cheap tensor partitioning and recomputable in-place operations to further reduce the rematerialization cost by optimizing the tensor allocation. We name our method Coop, as it is a co-optimization of tensor allocation and tensor rematerialization. We evaluate Coop on eight representative DNNs. The experimental results demonstrate that Coop achieves up to $2\times$ memory saving and hugely reduces compute overhead, search latency, and memory fragmentation compared to the state-of-the-art baselines.

Mitigating the Popularity Bias of Graph Collaborative Filtering: A Dimensional Collapse Perspective
Yifei Zhang Hao Zhu yankai Chen Zixing Song Piotr Koniusz Irwin King



Research question: graph-based collaborative filtering (GCF) is widely used in personalized recommendation systems, but it suffers from a fundamental problem: features occupy the embedding space inefficiently, so popular items come to dominate it.
Motivation: to counter the dominance of popular items and improve performance on unpopular ones, the paper proposes a decorrelation-enhanced GCF objective that exploits non-Euclidean geometry.
Method: analyze the shrinkage of the feature matrix's singular space caused by the simplified graph convolution used in GCF, and propose a new objective that promotes feature diversity via the principle of redundancy reduction in embeddings; unlike conventional methods that relax the hard decorrelation constraints with Euclidean geometry, non-Euclidean geometry is used to preserve the range space of the matrix and obtain a small condition number, preventing embedding-space degradation.
Results: the new method outperforms contrastive GCF baselines on several benchmark datasets and improves performance on unpopular items.

Graph-based Collaborative Filtering (GCF) is widely used in personalized recommendation systems. However, GCF suffers from a fundamental problem where features tend to occupy the embedding space inefficiently (by spanning only a low-dimensional subspace). Such an effect is characterized in GCF by the embedding space being dominated by a few popular items with the user embeddings highly concentrated around them. This enhances the so-called Matthew effect of the popularity bias where popular items are highly recommended whereas the remaining items are ignored. In this paper, we analyze the above effect in GCF and reveal that the simplified graph convolution operation (typically used in GCF) shrinks the singular space of the feature matrix. As typical approaches (i.e., optimizing the uniformity term) fail to prevent the embedding space degradation, we propose a decorrelation-enhanced GCF objective that promotes feature diversity by leveraging the so-called principle of redundancy reduction in embeddings. However, unlike conventional methods that use the Euclidean geometry to relax hard constraints for decorrelation, we exploit non-Euclidean geometry. Such a choice helps maintain the range space of the matrix and obtain a small condition number, which prevents the embedding space degradation. Our method outperforms contrastive-based GCF models on several benchmark datasets and improves the performance for unpopular items.
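
To make the objective concrete, the sketch below implements the plain redundancy-reduction penalty, i.e. the Euclidean relaxation that pushes off-diagonal entries of the embedding correlation matrix to zero. Note this is only the baseline intuition: the paper's contribution is replacing this Euclidean relaxation with a non-Euclidean treatment that preserves the range space and condition number, which is not reproduced here.

```python
import numpy as np

def decorrelation_penalty(E):
    Z = (E - E.mean(0)) / (E.std(0) + 1e-8)   # standardize each embedding dimension
    C = (Z.T @ Z) / len(Z)                    # d x d correlation matrix
    off = C - np.diag(np.diag(C))
    return (off ** 2).sum()                   # drive cross-correlations to zero

rng = np.random.default_rng(0)
collapsed = rng.standard_normal((1000, 1)) @ rng.standard_normal((1, 64))  # rank-1
diverse = rng.standard_normal((1000, 64))
print(decorrelation_penalty(collapsed), ">>", decorrelation_penalty(diverse))
```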

ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking
Yutong Kou Jin Gao Bing Li Gang Wang Weiming Hu Yizheng Wang Liang Li



Research question: how to approach or even match state-of-the-art tracking performance at high speed by shrinking the input size.
Motivation: transformer-based speed-oriented trackers with small input sizes or lightweight feature-extraction backbones have approached state-of-the-art performance, yet still lag well behind their performance-oriented counterparts.
Method: non-uniformly resize the cropped image to a smaller input size, raising the resolution of regions where the target is more likely to appear and lowering it elsewhere; this resolves the dilemma of attending to a larger visual field while retaining more raw target information despite a smaller input size.
Results: comprehensive experiments on two transformer trackers, OSTrack and TransT, show the method is effective; applied to the speed-oriented version of OSTrack, it even outperforms the performance-oriented counterpart by 0.6% AUC on TNL2K while running 50% faster and saving over 55% of MACs.

Recently, the transformer has enabled the speed-oriented trackers to approach state-of-the-art (SOTA) performance with high speed thanks to the smaller input size or the lighter feature extraction backbone, though they still substantially lag behind their corresponding performance-oriented versions. In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size. To this end, we non-uniformly resize the cropped image to have a smaller input size while the resolution of the area where the target is more likely to appear is higher and vice versa. This enables us to solve the dilemma of attending to a larger visual field while retaining more raw information for the target despite a smaller input size. Our formulation for the non-uniform resizing can be efficiently solved through quadratic programming (QP) and naturally integrated into most of the crop-based local trackers. Comprehensive experiments on five challenging datasets based on two kinds of transformer trackers, i.e., OSTrack and TransT, demonstrate consistent improvements over them. In particular, applying our method to the speed-oriented version of OSTrack even outperforms its performance-oriented counterpart by 0.6\% AUC on TNL2K, while running 50\% faster and saving over 55\% MACs. Codes and models are available at https://github.com/Kou-99/ZoomTrack.

Robust Model Reasoning and Fitting via Dual Sparsity Pursuit
Xingyu Jiang Jiayi Ma



Research question: a unified optimization model for outlier rejection, true model reasoning, and parameter estimation.
Motivation: pose the task as a sparse subspace recovery problem that searches for a maximum set of independent bases in an over-embedded data space.
Method: convert the objective into a continuous optimization paradigm that estimates sparse solutions for both bases and errors, with a fast and robust solver implemented via a proximal approximation method under an alternating optimization framework with "optimal" sub-gradient descent.
Results: extensive experiments on known and unknown model fitting with synthetic and challenging real datasets demonstrate superiority over the state of the art; applications to multi-class multi-model fitting and loop-closure detection achieve promising accuracy and efficiency. Code is released at: https://github.com/StaRainJ/DSP.

In this paper, we contribute to solving a threefold problem: outlier rejection, true model reasoning and parameter estimation with a unified optimization modeling. To this end, we first pose this task as a sparse subspace recovering problem, to search a maximum of independent bases under an over-embedded data space. Then we convert the objective into a continuous optimization paradigm that estimates sparse solutions for both bases and errors. Wherein a fast and robust solver is proposed to accurately estimate the sparse subspace parameters and error entries, which is implemented by a proximal approximation method under the alternating optimization framework with the ``optimal'' sub-gradient descent. Extensive experiments regarding known and unknown model fitting on synthetic and challenging real datasets have demonstrated the superiority of our method against the state-of-the-art. We also apply our method to multi-class multi-model fitting and loop closure detection, and achieve promising results both in accuracy and efficiency. Code is released at: https://github.com/StaRainJ/DSP.

Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective
Zeyuan Yin Eric Xing Zhiqiang Shen



Research question: how to condense datasets effectively across different dataset scales, model architectures, and image resolutions.
Motivation: existing dataset condensation methods are constrained by the bilevel optimization of model and synthetic data, which limits their flexibility across dataset scales, architectures, and resolutions.
Method: a new dataset condensation framework, Squeeze, Recover and Relabel (SRe$^2$L), that decouples the bilevel optimization of model and synthetic data during training, handling varying dataset scales, model architectures, and image resolutions for efficient condensation.
Results: extensive experiments on Tiny-ImageNet and full ImageNet-1K show the highest validation accuracies of 42.5% and 60.8% under 50 IPC, surpassing all previous state-of-the-art methods by 14.5% and 32.9% respectively; data synthesis is about 52x and 16x faster than MTT with 11.6x and 6.4x less memory consumption.

We present a new dataset condensation framework termed Squeeze, Recover and Relabel (SRe$^2$L) that decouples the bilevel optimization of model and synthetic data during training, to handle varying scales of datasets, model architectures and image resolutions for efficient dataset condensation. The proposed method demonstrates flexibility across diverse dataset scales and exhibits multiple advantages in terms of arbitrary resolutions of synthesized images, low training cost and memory consumption with high-resolution synthesis, and the ability to scale up to arbitrary evaluation network architectures. Extensive experiments are conducted on Tiny-ImageNet and full ImageNet-1K datasets. Under 50 IPC, our approach achieves the highest 42.5\% and 60.8\% validation accuracy on Tiny-ImageNet and ImageNet-1K, outperforming all previous state-of-the-art methods by margins of 14.5\% and 32.9\%, respectively. Our approach also surpasses MTT in terms of speed by approximately 52$\times$ (ConvNet-4) and 16$\times$ (ResNet-18) faster with less memory consumption of 11.6$\times$ and 6.4$\times$ during data synthesis. Our code and condensed datasets of 50, 200 IPC with 4K recovery budget are available at https://github.com/VILA-Lab/SRe2L.

Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers
Zixuan Jiang Jiaqi Gu Hanqing Zhu David Z. Pan



Research question: Transformers have achieved great success in machine learning, but there is no consensus between Layer Normalization and Root Mean Square Normalization as the preferred technique.
Motivation: although RMSNorm is more computationally efficient, it may compromise the representation ability of Transformers, and converting a model from one normalization to the other is challenging.
Method: remove the inherently redundant mean information in the main branch of Pre-LN Transformers, reducing LayerNorm to RMSNorm for higher efficiency; additionally propose Compressed RMSNorm (CRMSNorm) and the Pre-CRMSNorm Transformer based on a lossless compression of zero-mean vectors.
Results: experiments show the training and inference time of Pre-LN Transformers can be reduced by 1% - 10%.

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus regarding the preferred normalization technique, as some models employ LayerNorm while others utilize RMSNorm, especially in recent large language models. It is challenging to convert Transformers with one normalization to the other type. While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and Pre-CRMSNorm Transformer based on a lossless compression of the zero-mean vectors. We formally establish the equivalence of Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. It implies that Pre-LN Transformers can be substituted with Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by 1% - 10%.
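
The equivalence the paper builds on is easy to verify numerically: on zero-mean inputs, LayerNorm coincides with RMSNorm, so once the redundant mean is removed from the Pre-LN main branch, RMSNorm can replace LayerNorm exactly. A minimal check (without learned gain and bias, for clarity):

```python
import numpy as np

def layernorm(x):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(((x - mu) ** 2).mean(-1, keepdims=True))

def rmsnorm(x):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True))

x = np.random.default_rng(0).standard_normal((4, 512))
x_centered = x - x.mean(-1, keepdims=True)     # the redundant mean, removed once
assert np.allclose(layernorm(x), rmsnorm(x_centered))
```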

QuantSR: Accurate Low-bit Quantization for Efficient Image Super-Resolution
Haotong Qin Yulun Zhang Yifu Ding Yifan liu Xianglong Liu Martin Danelljan Fisher Yu



Research question: how to achieve accurate and efficient image super-resolution (SR) under low-bit quantization.
Motivation: low-bit quantization sharply reduces parameters and operations, but many quantized SR models lose accuracy relative to their full-precision counterparts, especially at ultra-low bit widths (2-4 bits), limiting practical applications.
Method: a new quantized image SR network, QuantSR, which introduces the Redistribution-driven Learnable Quantizer (RLQ) to overcome the representation homogeneity caused by quantization, and the Depth-dynamic Quantized Architecture (DQA) to trade efficiency against accuracy flexibly at inference time.
Results: experiments show that QuantSR outperforms existing state-of-the-art quantized SR networks in accuracy with more competitive computational efficiency; convolution and Transformer versions, QuantSR-C and QuantSR-T, demonstrate the scheme's good architecture generality.

Low-bit quantization in image super-resolution (SR) has attracted copious attention in recent research due to its ability to reduce parameters and operations significantly. However, many quantized SR models suffer from accuracy degradation compared to their full-precision counterparts, especially at ultra-low bit widths (2-4 bits), limiting their practical applications. To address this issue, we propose a novel quantized image SR network, called QuantSR, which achieves accurate and efficient SR processing under low-bit quantization. To overcome the representation homogeneity caused by quantization in the network, we introduce the Redistribution-driven Learnable Quantizer (RLQ). This is accomplished through an inference-agnostic efficient redistribution design, which adds additional information in both forward and backward passes to improve the representation ability of quantized networks. Furthermore, to achieve flexible inference and break the upper limit of accuracy, we propose the Depth-dynamic Quantized Architecture (DQA). Our DQA allows for the trade-off between efficiency and accuracy during inference through weight sharing. Our comprehensive experiments show that QuantSR outperforms existing state-of-the-art quantized SR networks in terms of accuracy while also providing more competitive computational efficiency. In addition, we demonstrate the scheme's satisfactory architecture generality by providing QuantSR-C and QuantSR-T for both convolution and Transformer versions, respectively. Our code and models are released at https://github.com/htqin/QuantSR .

k-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy
Chenglin Fan Ping Li Xiaoyun Li



Research question: in clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters.
Motivation: propose a new initialization scheme for the k-median problem in general metric spaces (e.g., discrete spaces induced by graphs).
Method: build a metric-embedding tree structure over the data and propose a novel, efficient search algorithm for good initial centers, which can then be fed to the local search algorithm.
Results: the HST initialization produces initial centers with lower error than the popular k-median++ and with higher efficiency when k is not too small; it also extends easily to the differential privacy (DP) setting to generate private initial centers. Experiments show that private HST initialization followed by DP local search improves the approximation error over previous results, approaching the lower bound within a small factor.

In clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. We propose a new initialization scheme for the $k$-median problem in the general metric space (e.g., discrete space induced by graphs), based on the construction of metric embedding tree structure of the data. We propose a novel and efficient search algorithm for good initial centers, which can subsequently be used by the local search algorithm. The so-called HST initialization method can produce initial centers achieving lower error than those from the popular $k$-median++ method, also with higher efficiency when $k$ is not too small. Our HST initialization can also be easily extended to the setting of differential privacy (DP) to generate private initial centers. We show that applying our private HST initialization followed by DP local search improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments demonstrate the effectiveness of our proposed methods.

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Jean Kaddour Oscar Key Piotr Nawrot Pasquale Minervini Matt Kusner



Research question: the computation needed to train Transformer-based language models has skyrocketed in recent years.
Motivation: this trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training.
Method: revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO-loss), and efficient optimizers (Lion, Sophia).
Results: when pre-training BERT and T5 with a fixed computation budget using these methods, their training, validation, and downstream gains all but vanish compared to a baseline with a fully decayed learning rate.

The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop., RHO-loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine which we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: https://github.com/JeanKaddour/NoTrainNoGain.

HiNeRV: Video Compression with Hierarchical Encoding-based Neural Representation
Ho Man Kwan Ge Gao Fan Zhang Andy Gower David Bull



Research question: how to use implicit neural representations (INRs) for video compression that is competitive with established standard video codecs.
Motivation: existing INR-based video compression falls short of state-of-the-art rate-quality performance, mainly because of the overly simple network architectures employed.
Method: HiNeRV, an INR combining lightweight layers with novel hierarchical positional encodings; depth-wise convolutional, MLP, and interpolation layers build a deep and wide high-capacity architecture, and HiNeRV provides a unified representation encoding videos in both frames and patches at the same time.
Results: experiments on the UVG and MCL-JCV datasets show significant improvement over all existing INR baselines and competitive performance against learning-based codecs (72.3% overall bit-rate saving over HNeRV and 43.4% over DCVC on the UVG dataset, measured in PSNR).

Learning-based video compression is currently a popular research topic, offering the potential to compete with conventional standard video codecs. In this context, Implicit Neural Representations (INRs) have previously been used to represent and compress image and video content, demonstrating relatively high decoding speed compared to other methods. However, existing INR-based methods have failed to deliver rate-quality performance comparable with the state of the art in video compression. This is mainly due to the simplicity of the employed network architectures, which limit their representation capability. In this paper, we propose HiNeRV, an INR that combines lightweight layers with novel hierarchical positional encodings. We employ depth-wise convolutional, MLP, and interpolation layers to build a deep and wide network architecture with high capacity. HiNeRV is also a unified representation encoding videos in both frames and patches at the same time, which offers higher performance and flexibility than existing methods. We further build a video codec based on HiNeRV and a refined pipeline for training, pruning and quantization that can better preserve HiNeRV's performance during lossy model compression. The proposed method has been evaluated on both UVG and MCL-JCV datasets for video compression, demonstrating significant improvement over all existing INRs baselines and competitive performance when compared to learning-based codecs (72.3\% overall bit rate saving over HNeRV and 43.4\% over DCVC on the UVG dataset, measured in PSNR).

Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!
Konstantin Makarychev Sayak Chakrabarty



Research question: memory-efficient single-pass (semi-streaming) algorithms for Correlation Clustering.
Motivation: recent results give a $(3+\epsilon)$-approximation with $O(n \log n)$ words of memory (Cambus, Kuhn, Lindy, Pai, and Uitto) or a 5-approximation with $O(n)$ words (Behnezhad, Charikar, Ma, and Tan); a simpler algorithm with better guarantees is desirable.
Method: a simple single-pass semi-streaming variant of the Pivot algorithm.
Results: a $(3+\epsilon)$-approximation using $O(n/\epsilon)$ words of memory, with an algorithm and analysis that are simple and easy to understand.

We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a $(3+\epsilon)$-approximation using $O(n/\epsilon)$ words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a $(3+\epsilon)$-approximation using $O(n \log n)$ words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using $O(n)$ words of memory. One of the main contributions of our paper is that the algorithm and its analysis are simple and easy to understand.
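
For reference, the classic (offline) Pivot algorithm is only a few lines, as the sketch below shows: visit nodes in random order, make each unclustered node a pivot, and assign its unclustered "+"-neighbors to it. The paper's contribution, the single-pass semi-streaming bookkeeping that preserves the $(3+\epsilon)$ guarantee, is not reproduced here.

```python
import random

def pivot(n, plus_edges):
    plus = {frozenset(e) for e in plus_edges}   # "+" edges; all other pairs are "-"
    order = list(range(n))
    random.shuffle(order)                       # random priorities
    assigned, clusters = set(), []
    for u in order:
        if u in assigned:
            continue
        clusters.append([u])                    # u becomes a new pivot
        assigned.add(u)
        for v in range(n):                      # grab unassigned "+"-neighbors
            if v not in assigned and frozenset((u, v)) in plus:
                clusters[-1].append(v)
                assigned.add(v)
    return clusters

print(pivot(6, [(0, 1), (1, 2), (3, 4)]))       # clustering depends on the order
```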

Distributed Personalized Empirical Risk Minimization
Yuyang Deng Mohammad Mahdi Kamani Pouria Mahdavinia Mehrdad Mahdavi



Research question: how to learn from heterogeneous data sources without imposing stringent constraints on the computational resources shared by participating devices.
Motivation: overcome the data heterogeneity issue and attain optimal statistical accuracy for all local distributions.
Method: a new paradigm, Personalized Empirical Risk Minimization (PERM), which personalizes the aggregation of local empirical losses by effectively estimating the statistical discrepancy among data distributions, together with a distributed algorithm that replaces standard model averaging with model shuffling to optimize PERM objectives for all devices.
Results: the algorithm learns personalized models at scale while conforming to the memory and compute resources of individual clients; experiments corroborate its effectiveness.

This paper advocates a new paradigm Personalized Empirical Risk Minimization (PERM) to facilitate learning from heterogeneous data sources without imposing stringent constraints on computational resources shared by participating devices. In PERM, we aim at learning a distinct model for each client by personalizing the aggregation of local empirical losses by effectively estimating the statistical discrepancy among data distributions, which entails optimal statistical accuracy for all local distributions and overcomes the data heterogeneity issue. To learn personalized models at scale, we propose a distributed algorithm that replaces the standard model averaging with model shuffling to simultaneously optimize PERM objectives for all devices. This also allows learning distinct model architectures (e.g., neural networks with different numbers of parameters) for different clients, thus conforming to the underlying memory and compute resources of individual clients. We rigorously analyze the convergence of the proposed algorithm and conduct experiments that corroborate the effectiveness of the proposed paradigm.

Small batch deep reinforcement learning
Johan Samir Obando Ceron Marc G Bellemare Pablo Samuel Castro



Research question: in value-based deep reinforcement learning, the batch size parameter specifies how many transitions to sample for each gradient update; although critical to the learning process, this value is typically not tuned when new algorithms are proposed.
Motivation: a broad empirical study shows that reducing the batch size can yield significant performance gains, which is surprising since the general tendency when training neural networks is toward larger batch sizes for better performance.
Method: a broad empirical study of the effect of smaller batch sizes in value-based deep RL, complemented by a set of empirical analyses.
Results: reducing the batch size yields significant performance gains, and the accompanying analyses shed light on this phenomenon.

In value-based deep reinforcement learning with replay memories, the batch size parameter specifies how many transitions to sample for each gradient update. Although critical to the learning process, this value is typically not adjusted when proposing new algorithms. In this work we present a broad empirical study that suggests reducing the batch size can result in a number of significant performance gains; this is surprising, as the general tendency when training neural networks is towards larger batch sizes for improved performance. We complement our experimental findings with a set of empirical analyses towards better understanding this phenomenon.

Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Tao Lei Junwen Bai Siddhartha Brahma Joshua Ainslie Kenton Lee Yanqi Zhou Nan Du Vincent Y Zhao Yuexin Wu Bo Li Yu Zhang Ming-Wei Chang



Research question: how to balance speed and accuracy through conditional computation and improve inference efficiency.
Motivation: existing adapter approaches leave room for improvement in inference efficiency.
Method: Conditional Adapter (CoDA), which starts from an existing dense pretrained model and adds sparse activation together with a small number of new parameters and a lightweight training phase to transfer knowledge.
Results: across a variety of language, vision, and speech tasks, CoDA achieves 2x to 8x inference speedups over state-of-the-art adapter approaches with moderate to no accuracy loss and the same parameter efficiency.

We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning method that also improves inference efficiency. CoDA generalizes beyond standard adapter approaches to enable a new way of balancing speed and accuracy using conditional computation. Starting with an existing dense pretrained model, CoDA adds sparse activation together with a small number of new parameters and a light-weight training phase. Our experiments demonstrate that the CoDA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CoDA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.

Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention
Matteo Pagliardini Daniele Paliotta Martin Jaggi François Fleuret



Research question: how to process sequences of increasing length effectively, reducing computational complexity and runtime.
Motivation: in Transformers, causal self-attention scales quadratically with sequence length, making long sequences slow and compute-hungry.
Method: extend FlashAttention to support a large class of attention sparsity patterns, including key/query dropping and hashing-based attention, with no computational complexity overhead and a multi-fold runtime speedup on top of FlashAttention.
Results: without sacrificing perplexity, the method increases the training speed of a transformer language model by 2.0x and 3.3x for sequences of 8k and 16k tokens, respectively.

Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention---which is the only component scaling quadratically w.r.t. the sequence length---becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementation concerns and end up imposing a simple and static structure over the attention matrix. Conversely, implementing more dynamic sparse attention often results in runtimes significantly slower than computing the full attention using the Flash implementation from Dao et al. (2022). We extend FlashAttention to accommodate a large class of attention sparsity patterns that, in particular, encompass key/query dropping and hashing-based attention. This leads to implementations with no computational complexity overhead and a multi-fold runtime speedup on top of FlashAttention. Even with relatively low degrees of sparsity, our method improves visibly upon FlashAttention as the sequence length increases. Without sacrificing perplexity, we increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of respectively $8k$ and $16k$ tokens.

$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning
Adel Nabli Eugene Belilovsky Edouard Oyallon



Research question: how to train deep learning models in a distributed fashion while avoiding the communication bottlenecks and synchronization locks that synchronous centralized algorithms suffer at scale.
Motivation: current training relies mainly on synchronous centralized algorithms, which induce major communication bottlenecks and synchronization locks at large scale; decentralized asynchronous algorithms are a potential alternative, but their practical applicability still lags.
Method: a principled asynchronous, randomized, gossip-based optimization algorithm driven by a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$, which lets each worker continuously process mini-batches without stopping and run a peer-to-peer averaging routine in parallel, reducing idle time.
Results: theory proves accelerated rates over previous asynchronous decentralized baselines, and experiments show the $\textbf{A}^2\textbf{CiD}^2$ momentum significantly reduces communication costs even in poorly connected networks, with consistent improvements on ImageNet using up to 64 asynchronous workers (A100 GPUs) and various communication network topologies.

Distributed training of Deep Learning models has been critical to many recent successes in the field. Current standard methods primarily rely on synchronous centralized algorithms which induce major communication bottlenecks and synchronization locks at scale. Decentralized asynchronous algorithms are emerging as a potential alternative but their practical applicability still lags. In order to mitigate the increase in communication cost that naturally comes with scaling the number of workers, we introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$. Our method allows each worker to continuously process mini-batches without stopping, and run a peer-to-peer averaging routine in parallel, reducing idle time. In addition to inducing a significant communication acceleration at no cost other than adding a local momentum variable, minimal adaptation is required to incorporate $\textbf{A}^2\textbf{CiD}^2$ to standard asynchronous approaches. Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines and we empirically show that using our $\textbf{A}^2\textbf{CiD}^2$ momentum significantly decreases communication costs in poorly connected networks. In particular, we show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers (A100 GPUs) and various communication network topologies.

Correlation Aware Sparsified Mean Estimation Using Random Projection
Shuli Jiang Pranay Sharma Gauri Joshi



Research question: communication-efficient distributed vector mean estimation, a commonly used subroutine in distributed optimization and federated learning.
Motivation: in practice, correlations may exist between clients, and the standard Rand-$k$ sparsification technique is agnostic to them. The recently proposed Rand-$k$-Spatial estimator exploits cross-client correlation information at the server to improve Rand-$k$, but its performance remains suboptimal, and better mean estimation is key to faster convergence in distributed optimization.
Method: the Rand-Proj-Spatial estimator, with a more flexible encoding-decoding procedure that generalizes the encoding of Rand-$k$ by projecting client vectors onto a random $k$-dimensional subspace; using the Subsampled Randomized Hadamard Transform (SRHT) as the projection matrix, Rand-Proj-Spatial provably outperforms Rand-$k$-Spatial by using the correlation information more efficiently. The paper also shows how to incorporate varying degrees of correlation, and proposes a practical variant of Rand-Proj-Spatial for when correlation information is unavailable.
Results: experiments on real-world distributed optimization tasks show the superior performance of Rand-Proj-Spatial over Rand-$k$-Spatial and other, more sophisticated sparsification techniques.

We study the problem of communication-efficient distributed vector mean estimation, which is a commonly used subroutine in distributed optimization and Federated Learning (FL). Rand-$k$ sparsification is a commonly used technique to reduce communication cost, where each client sends $k < d$ of its coordinates to the server. However, Rand-$k$ is agnostic to any correlations that might exist between clients in practical scenarios. The recently proposed Rand-$k$-Spatial estimator leverages the cross-client correlation information at the server to improve Rand-$k$'s performance. Yet, the performance of Rand-$k$-Spatial is suboptimal, and improving mean estimation is key to a faster convergence in distributed optimization. We propose the Rand-Proj-Spatial estimator with a more flexible encoding-decoding procedure, which generalizes the encoding of Rand-$k$ by projecting the client vectors to a random $k$-dimensional subspace. We utilize Subsampled Randomized Hadamard Transform (SRHT) as the projection matrix, and show that Rand-Proj-Spatial with SRHT outperforms Rand-$k$-Spatial, using the correlation information more efficiently. Furthermore, we propose an approach to incorporate varying degrees of correlation, and suggest a practical variant of Rand-Proj-Spatial when the correlation information is not available to the server. Finally, experiments on real-world distributed optimization tasks showcase the superior performance of Rand-Proj-Spatial compared to Rand-$k$-Spatial and other more sophisticated sparsification techniques.
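
For concreteness, here is the baseline Rand-$k$ estimator the paper generalizes: each client transmits $k$ of its $d$ coordinates and the server rescales by $d/k$ for unbiasedness. Rand-Proj-Spatial replaces the coordinate sampling with a projection onto a random $k$-dimensional subspace (e.g. via SRHT), which is not shown in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 100
clients = rng.standard_normal((n, d))           # one d-dimensional vector per client

est = np.zeros(d)
for x in clients:
    idx = rng.choice(d, k, replace=False)       # each client samples its own k coords
    sparse = np.zeros(d)
    sparse[idx] = x[idx]                        # the k transmitted coordinates
    est += (d / k) * sparse                     # rescale so the estimator is unbiased
est /= n
print("coordinate-wise MSE:", np.mean((est - clients.mean(axis=0)) ** 2))
```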

BayesTune: Bayesian Sparse Deep Model Fine-tuning
Minyoung Kim Timothy Hospedales



Research question: how to make sparse fine-tuning of pretrained models principled, i.e., how to select which parameters to update to improve downstream performance.
Motivation: current sparse fine-tuning methods mostly rely on hand-crafted heuristics or heavy approximation, lacking principled guidance and efficiency.
Method: a Bayesian sparse fine-tuning algorithm that places a sparse Laplace prior on each parameter of the pretrained model; the posterior means of the scale parameters indicate which parameters should be updated away from their initial values.
Results: on NLP benchmarks and the VTAB vision tasks, the method outperforms the state of the art, e.g., 1% point higher than the best prior result when fine-tuning RoBERTa on the GLUE and SuperGLUE benchmarks.

Deep learning practice is increasingly driven by powerful foundation models (FM), pre-trained at scale and then fine-tuned for specific tasks of interest. A key property of this workflow is the efficacy of performing sparse or parameter-efficient fine-tuning, meaning that by updating only a tiny fraction of the whole FM parameters on a downstream task can lead to surprisingly good performance, often even superior to a full model update. However, it is not clear what is the optimal and principled way to select which parameters to update. Although a growing number of sparse fine-tuning ideas have been proposed, they are mostly not satisfactory, relying on hand-crafted heuristics or heavy approximation. In this paper we propose a novel Bayesian sparse fine-tuning algorithm: we place a (sparse) Laplace prior for each parameter of the FM, with the mean equal to the initial value and the scale parameter having a hyper-prior that encourages small scale. Roughly speaking, the posterior means of the scale parameters indicate how important it is to update the corresponding parameter away from its initial value when solving the downstream task. Given the sparse prior, most scale parameters are small a posteriori, and the few large-valued scale parameters identify those FM parameters that crucially need to be updated away from their initial values. Based on this, we can threshold the scale parameters to decide which parameters to update or freeze, leading to a principled sparse fine-tuning strategy. To efficiently infer the posterior distribution of the scale parameters, we adopt the Langevin MCMC sampler, requiring only two times the complexity of the vanilla SGD. Tested on popular NLP benchmarks as well as the VTAB vision tasks, our approach shows significant improvement over the state of the art (e.g., 1% point higher than the best SOTA when fine-tuning RoBERTa for GLUE and SuperGLUE benchmarks).

Private Federated Frequency Estimation: Adapting to the Hardness of the Instance
Jingfeng Wu Wennan Zhu Peter Kairouz Vladimir Braverman



Research question: federated frequency estimation over multiple communication rounds, under the security constraint that the server can only access the sum of client-held vectors.
Motivation: for single-round federated frequency estimation, count sketch is known to be nearly information-theoretically optimal; with multiple rounds, better sketching algorithms are possible.
Method: a new sketch algorithm that is provably more accurate over multiple rounds than a naive adaptation of count sketch. Since both the new algorithm and count sketch achieve better accuracy on simpler instances, a two-phase approach allows a smaller sketch size for simpler problems; mechanisms are provided to make the algorithm differentially private.
Results: experiments on real datasets verify the performance of the proposed methods.

In federated frequency estimation (FFE), multiple clients work together to estimate the frequency of their local data by communicating with a server, while maintaining the security constraint of $\mathtt{secsum}$ where the server can only access the sum of client-held vectors. For FFE with a single communication round, it is known that count sketch is nearly information-theoretically optimal [Chen et al., 2022]. However, when multiple communication rounds are allowed, we propose a new sketch algorithm that is provably more accurate than a naive adaptation of count sketch. Furthermore, we show that both our sketch algorithm and count sketch can achieve better accuracy when the problem instance is simpler. Therefore, we propose a two-phase approach to enable the use of a smaller sketch size for simpler problems. Finally, we provide mechanisms to make our proposed algorithm differentially private. We verify the performance of our methods through experiments conducted on real datasets.
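
As background for the single-round baseline, the sketch below implements a plain count sketch for frequency estimation: each item is hashed to one bucket per row with a random sign, and the median of the signed bucket values estimates its frequency. The hash functions are ad-hoc stand-ins for proper pairwise-independent hashes, and none of the paper's multi-round or privacy machinery is included.

```python
import numpy as np

rows, width = 5, 64
rng = np.random.default_rng(0)
a = rng.integers(1, 2**31 - 1, size=(rows, 2))      # per-row hash parameters

def h(r, x):                                        # bucket hash (ad-hoc stand-in)
    return int(a[r, 0] * x + a[r, 1]) % (2**31 - 1) % width

def s(r, x):                                        # +/-1 sign hash (ad-hoc stand-in)
    return 1 if int(a[r, 1] * x + a[r, 0]) % (2**31 - 1) % 2 == 0 else -1

table = np.zeros((rows, width))
stream = [7] * 100 + [42] * 30 + list(range(1000))  # 7 and 42 occur 101 and 31 times
for x in stream:
    for r in range(rows):
        table[r, h(r, x)] += s(r, x)

def estimate(x):
    return np.median([s(r, x) * table[r, h(r, x)] for r in range(rows)])

print("freq(7) ~", estimate(7), "  freq(42) ~", estimate(42))
```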

Handling Data Heterogeneity via Architectural Design for Federated Visual Recognition
Sara Pieri Jose Renato Restom Samuel Horváth Hisham Cholakkal



Research question: how to collaboratively train machine learning models across multiple parties without exchanging sensitive information.
Motivation: federated learning (FL) is a promising paradigm for collaborative training without sensitive information exchange, but keeping data on individual clients poses fundamental challenges to matching the performance of centrally trained models.
Method: an extensive review and analysis of federated learning for visual recognition, highlighting the critical and often neglected role of thoughtful architectural design; through in-depth analysis of cutting-edge architectures such as convolutional neural networks, transformers, and MLP-mixers, the study experimentally demonstrates that architectural choices can substantially enhance FL performance, particularly with heterogeneous data.
Results: visual recognition models from five architectural families are studied on four challenging FL datasets; the inferior performance of convolution-based architectures in the FL setting is re-investigated and the influence of normalization layers analyzed. The findings underline the importance of architectural design for computer vision in practical scenarios, effectively narrowing the gap between federated and centralized learning.

Federated Learning (FL) is a promising research paradigm that enables the collaborative training of machine learning models among various parties without the need for sensitive information exchange. Nonetheless, retaining data in individual clients introduces fundamental challenges to achieving performance on par with centrally trained models. Our study provides an extensive review of federated learning applied to visual recognition. It underscores the critical role of thoughtful architectural design choices in achieving optimal performance, a factor often neglected in the FL literature. Many existing FL solutions are tested on shallow or simple networks, which may not accurately reflect real-world applications. This practice restricts the transferability of research findings to large-scale visual recognition models. Through an in-depth analysis of diverse cutting-edge architectures such as convolutional neural networks, transformers, and MLP-mixers, we experimentally demonstrate that architectural choices can substantially enhance FL systems' performance, particularly when handling heterogeneous data. We study visual recognition models from five different architectural families on four challenging FL datasets. We also re-investigate the inferior performance of convolution-based architectures in the FL setting and analyze the influence of normalization layers on the FL performance. Our findings emphasize the importance of architectural design for computer vision tasks in practical scenarios, effectively narrowing the performance gap between federated and centralized learning.

Hardware Resilience Properties of Text-Guided Image Classifiers
Syed Talal Wasim Kabila Haile Soboka Abdulrahman Mahmoud Salman Khan David Brooks Gu-Yeon Wei



Research question: how to make deployed image classification models more reliable in the face of transient hardware errors.
Motivation: enriched text embeddings derived from GPT-3 and a CLIP pretrained text encoder can serve as an initialization for the classification layer, improving reliability under transient hardware errors.
Method: generate enriched text embeddings using per-class question prompts to GPT-3 and a CLIP pretrained text encoder, and use them as the initial values of the classification layer.
Results: an average 5.5x (up to 14x) increase in hardware reliability in the most critical layer across various architectures, with only a 0.3% average accuracy drop compared to baseline PyTorch models. The method integrates seamlessly with any image classification backbone, works across network architectures with small parameter and FLOPs overhead, and follows a consistent training recipe, offering a practical and efficient way to harden image classifiers against hardware failures.

This paper presents a novel method to enhance the reliability of image classification models during deployment in the face of transient hardware errors. By utilizing enriched text embeddings derived from GPT-3 with question prompts per class and CLIP pretrained text encoder, we investigate their impact as an initialization for the classification layer. Our approach achieves a remarkable $5.5\times$ average increase in hardware reliability (and up to $14\times$) across various architectures in the most critical layer, with minimal accuracy drop ($0.3\%$ on average) compared to baseline PyTorch models. Furthermore, our method seamlessly integrates with any image classification backbone, showcases results across various network architectures, decreases parameter and FLOPs overhead, and follows a consistent training recipe. This research offers a practical and efficient solution to bolster the robustness of image classification models against hardware failures, with potential implications for future studies in this domain. Our code and models are released at https://github.com/TalalWasim/TextGuidedResilience.

Convergence Analysis of Sequential Federated Learning on Heterogeneous Data
Yipeng Li Xinchen Lyu



Research question: establish the missing convergence theory of sequential federated learning (SFL) on heterogeneous data.
Motivation: in contrast to parallel federated learning (PFL), the convergence theory of SFL on heterogeneous data has remained open.
Method: establish convergence guarantees of SFL for strongly convex, general convex, and non-convex objectives on heterogeneous data, and compare the convergence of SFL and PFL under both full and partial client participation.
Results: experimental results validate the counterintuitive finding that SFL outperforms PFL on extremely heterogeneous data in cross-device settings.

There are two categories of methods in Federated Learning (FL) for joint training across multiple clients: i) parallel FL (PFL), where clients train models in a parallel manner; and ii) sequential FL (SFL), where clients train models in a sequential manner. In contrast to that of PFL, the convergence theory of SFL on heterogeneous data is still lacking. In this paper, we establish the convergence guarantees of SFL for strongly/general/non-convex objectives on heterogeneous data. The convergence guarantees of SFL are better than that of PFL on heterogeneous data with both full and partial client participation. Experimental results validate the counterintuitive analysis result that SFL outperforms PFL on extremely heterogeneous data in cross-device settings.
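
The two orchestration patterns compared in the paper can be contrasted in a few lines: parallel FL (PFL) has all clients start a round from the same model and averages their updates, while sequential FL (SFL) passes the model from client to client within a round. Local training below is a single gradient step on synthetic least-squares clients, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
clients = [(rng.standard_normal((40, 10)), rng.standard_normal(40)) for _ in range(8)]

def local_step(w, X, y, lr=0.01):                 # one local least-squares step
    return w - lr * X.T @ (X @ w - y) / len(y)

w_pfl = np.zeros(10)
for _ in range(50):                               # PFL: all clients start a round
    updates = [local_step(w_pfl, X, y) for X, y in clients]   # from the same model
    w_pfl = np.mean(updates, axis=0)              # and the server averages

w_sfl = np.zeros(10)
for _ in range(50):                               # SFL: the model is passed from
    for X, y in clients:                          # client to client within a round
        w_sfl = local_step(w_sfl, X, y)
```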

CoPriv: Network/Protocol Co-Optimization for Communication-Efficient Private Inference
Wenxuan Zeng Meng Li Haichuan Yang Wen-jie Lu Runsheng Wang Ru Huang



Research question: Existing deep neural network (DNN) inference based on secure two-party computation (2PC) suffers huge latency overhead due to the massive amount of communication.
Motivation: Current methods rely on the proxy metric of ReLU counts to approximate communication overhead and focus on reducing ReLUs; we find that, for the latest 2PC protocols, this yields limited communication reduction because it ignores the other linear and non-linear operations.
Method: We propose CoPriv, a framework that jointly optimizes the 2PC inference protocol and the DNN architecture. CoPriv adopts a new 2PC convolution protocol based on the Winograd transformation with DNN-aware optimizations that substantially reduce inference communication, and further develops a 2PC-aware network optimization algorithm, compatible with the proposed protocol, that simultaneously reduces the communication of all linear and non-linear operations.
Results: Compared with the state-of-the-art 2PC protocol CrypTFlow2 on CIFAR-100, CoPriv achieves a 2.1x communication reduction on both ResNet-18 and ResNet-32. Compared with state-of-the-art network optimization methods such as SNL and MetaPruning, CoPriv achieves 9.98x online and 3.88x total communication reduction with higher accuracy than SNL, and 3.87x online communication reduction with more than 3% higher accuracy than MetaPruning.

Deep neural network (DNN) inference based on secure 2-party computation (2PC) can offer cryptographically-secure privacy protection but suffers from orders of magnitude latency overhead due to enormous communication. Previous works heavily rely on a proxy metric of ReLU counts to approximate the communication overhead and focus on reducing the ReLUs to improve communication efficiency. However, we observe these works achieve limited communication reduction for state-of-the-art (SOTA) 2PC protocols because they neglect other linear and non-linear operations, which now contribute to the majority of communication. In this work, we present CoPriv, a framework that jointly optimizes the 2PC inference protocol and the DNN architecture. CoPriv features a new 2PC protocol for convolution based on Winograd transformation and develops DNN-aware optimization to significantly reduce the inference communication. CoPriv further develops a 2PC-aware network optimization algorithm that is compatible with the proposed protocol and simultaneously reduces the communication for all the linear and non-linear operations. We compare CoPriv with the SOTA 2PC protocol, CrypTFlow2, and demonstrate 2.1× communication reduction for both ResNet-18 and ResNet-32 on CIFAR-100. We also compare CoPriv with SOTA network optimization methods, including SNL, MetaPruning, etc. CoPriv achieves 9.98× online and 3.88× total communication reduction, respectively, with higher accuracy compared to SNL. CoPriv also achieves 3.87× online communication reduction with more than 3% higher accuracy compared to MetaPruning.

MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates
Mohammad Mozaffari Sikan Li Zhao Zhang Maryam Mehri Dehnavi



Research question: How to improve the training speed and convergence of deep neural networks.
Motivation: Second-order techniques converge faster than first-order ones, but their cubic complexity in the model size or training batch size makes them perform poorly on transformer models such as large language models.
Method: Propose MKOR, a Kronecker-factor-based optimizer using rank-1 updates, whose complexity is quadratic in the model size, alleviating the computational bottleneck of second-order methods. By reducing the communication complexity of second-order updates and achieving linear communication complexity, MKOR increases the frequency of second-order updates.
Results: Experiments show that MKOR outperforms the state-of-the-art first-order LAMB optimizer and the best second-order implementations, KAISA/KFAC, by up to 2.57x and 1.85x respectively on BERT-Large-Uncased.

This work proposes a Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 updates, called MKOR, that improves the training time and convergence properties of deep neural networks (DNNs). Second-order techniques, while enjoying higher convergence rates than first-order counterparts, have cubic complexity with respect to either the model size and/or the training batch size. Hence they exhibit poor scalability and performance in transformer models, e.g., large language models (LLMs), because the batch sizes in these models scale with the attention mechanism sequence length, leading to large model sizes and batch sizes. MKOR's complexity is quadratic with respect to the model size, alleviating the computation bottlenecks in second-order methods. Because of their high computational complexity, state-of-the-art implementations of second-order methods can only afford to update the second-order information infrequently, and thus do not fully exploit the promise of better convergence from these updates. By reducing the communication complexity of the second-order updates as well as achieving linear communication complexity, MKOR increases the frequency of second-order updates. We also propose a hybrid version of MKOR (called MKOR-H) that falls back mid-training to a first-order optimizer if the second-order updates no longer accelerate convergence. Our experiments show that MKOR outperforms state-of-the-art first-order methods, e.g., the LAMB optimizer, and the best implementations of second-order methods, i.e., KAISA/KFAC, by up to 2.57x and 1.85x respectively on BERT-Large-Uncased on 64 GPUs.
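
MKOR's quadratic cost hinges on maintaining inverses through rank-1 updates rather than full refactorizations. The generic Sherman-Morrison primitive below shows why such an update costs only O(n^2); the surrounding Kronecker-factor bookkeeping of the actual optimizer is omitted.

```python
# Generic Sherman-Morrison rank-1 inverse update: the kind of O(n^2) primitive
# that avoids O(n^3) re-inversions. Illustrative only, not the authors' code.
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(n^2)."""
    Au = A_inv @ u                     # O(n^2)
    vA = v @ A_inv                     # O(n^2)
    denom = 1.0 + v @ Au
    return A_inv - np.outer(Au, vA) / denom

n = 4
A = np.eye(n) + 0.1 * np.random.randn(n, n)
A_inv = np.linalg.inv(A)
u, v = np.random.randn(n), np.random.randn(n)

fast = sherman_morrison_update(A_inv, u, v)
slow = np.linalg.inv(A + np.outer(u, v))   # O(n^3) reference
assert np.allclose(fast, slow)
```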

PDP: Parameter-free Differentiable Pruning is All You Need
Minsik Cho Saurabh Adya Devang Naik



Research question: How to effectively reduce DNN model size, improve inference latency, and minimize the power consumption of DNN accelerators.
Motivation: Existing approaches may be too complex, expensive, or ineffective to apply across the range of vision/language tasks and DNN architectures, or to honor structured pruning constraints.
Method: This paper proposes an efficient and effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art quality in model size, accuracy, and training cost. PDP uses a dynamic function of the weights during training to generate soft pruning masks for a given pruning target, in a parameter-free manner.
Results: For MobileNet-v1, PDP achieves 68.2% ImageNet1k top-1 accuracy at 86.6% sparsity, 1.7% higher than existing algorithms. For BERT, PDP achieves over 83.1% Multi-Genre Natural Language Inference accuracy at 90% sparsity, versus 81.5% for the best existing technique. PDP also applies to structured pruning, such as N:M and channel pruning: for 1:4 structured pruning of ResNet18, PDP improves top-1 ImageNet1k accuracy by over 3.6%, and for channel pruning of ResNet50 it reduces top-1 ImageNet1k accuracy by 0.6% from the state of the art.

DNN pruning is a popular way to reduce the size of a model, improve the inference latency, and minimize the power consumption on DNN accelerators. However, existing approaches might be too complex, expensive or ineffective to apply to a variety of vision/language tasks, DNN architectures and to honor structured pruning constraints. In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art qualities in model size, accuracy, and training cost. PDP uses a dynamic function of weights during training to generate soft pruning masks for the weights in a parameter-free manner for a given pruning target. While differentiable, the simplicity and efficiency of PDP make it universal enough to deliver state-of-the-art random/structured/channel pruning results on various vision and natural language tasks. For example, for MobileNet-v1, PDP can achieve 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 1.7% higher accuracy than those from the state-of-the-art algorithms. Also, PDP yields over 83.1% accuracy on Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the next best from the existing techniques shows 81.5% accuracy. In addition, PDP can be applied to structured pruning, such as N:M pruning and channel pruning. For 1:4 structured pruning of ResNet18, PDP improved the top-1 ImageNet1k accuracy by over 3.6% over the state-of-the-art. For channel pruning of ResNet50, PDP reduced the top-1 ImageNet1k accuracy by 0.6% from the state-of-the-art.
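
The sketch below shows one plausible instantiation of such a train-time soft mask: the threshold is the sparsity-target quantile of the weight magnitudes (hence parameter-free), and a temperature-controlled sigmoid makes the mask differentiable. The exact functional form used by PDP differs; this only illustrates the mechanism.

```python
# Plausible parameter-free soft pruning mask (illustrative, not PDP's exact form).
import torch

def soft_prune_mask(w, target_sparsity=0.9, temperature=1e-3):
    # Threshold derived from the weights themselves: the target-sparsity quantile.
    t = torch.quantile(w.abs().flatten(), target_sparsity)
    # Soft, differentiable mask; hardens as the temperature shrinks.
    return torch.sigmoid((w.abs() - t) / temperature)

w = torch.randn(64, 64, requires_grad=True)
mask = soft_prune_mask(w)
pruned_w = w * mask            # used in the forward pass during training
loss = pruned_w.pow(2).sum()
loss.backward()                # gradients flow to the weights through the soft mask
```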

Efficient Beam Tree Recursion
Jishnu Ray Chowdhury Cornelia Caragea



Research question: Reducing the excessive memory usage of the Beam Tree Recursive Neural Network (BT-RvNN).
Motivation: Although BT-RvNN outperforms previous approaches on ListOps, its high memory cost remains a problem.
Method: The authors identify the main memory bottleneck of BT-RvNN as the entanglement of the scorer function and the recursive cell function, and propose strategies to remove this bottleneck and further simplify its memory usage.
Results: These strategies not only reduce BT-RvNN's memory usage by 10-16x but also set a new state of the art on ListOps while maintaining similar performance on other tasks. The authors further propose a strategy that uses the induced tree-node representations produced by BT-RvNN to turn it from a sentence encoder of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$ into a token contextualizer of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times d}$. These proposals not only open a path to further scaling of RvNNs but also standardize a way to use BT-RvNNs as another building block in the deep learning toolkit that can be conveniently stacked or interfaced with popular models such as Transformers and structured state space models.

Beam Tree Recursive Neural Network (BT-RvNN) was recently proposed as an extension of Gumbel Tree RvNN and it was shown to achieve state-of-the-art length generalization performance in ListOps while maintaining comparable performance on other tasks. However, although better than previous approaches in terms of memory usage, BT-RvNN can still be exorbitantly expensive. In this paper, we identify the main bottleneck in BT-RvNN's memory usage to be the entanglement of the scorer function and the recursive cell function. We propose strategies to remove this bottleneck and further simplify its memory usage. Overall, our strategies not only reduce the memory usage of BT-RvNN by $10-16$ times but also create a new state-of-the-art in ListOps while maintaining similar performance in other tasks. In addition, we also propose a strategy to utilize the induced latent-tree node representations produced by BT-RvNN to turn BT-RvNN from a sentence encoder of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$ into a token contextualizer of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times d}$. Thus, our proposals not only open up a path for further scalability of RvNNs but also standardize a way to use BT-RvNNs as another building block in the deep learning toolkit that can be easily stacked or interfaced with other popular models such as Transformers and Structured State Space models. Our code is available at the link: https://github.com/JRC1995/BeamRecursionFamily.

Addressing the speed-accuracy simulation trade-off for adaptive spiking neurons
Luke Taylor Andrew J King Nicol Spencer Harper



Research question: How to balance speed and accuracy when simulating neurons of the brain.
Motivation: Current approaches either simulate neurons accurately with a small time step, which is slow, or quickly with a large time step, at the cost of simulation accuracy.
Method: Algorithmically reinterpret the adaptive leaky integrate-and-fire (ALIF) model to reduce the sequential simulation complexity and allow more efficient parallelization on GPUs.
Results: On synthetic benchmarks with small time steps, the implementation obtains over a 50x training speedup. Across different supervised classification tasks, the method matches the standard ALIF implementation in performance but trains in a fraction of the time. It also enables quick and accurate fitting of real electrophysiological recordings of cortical neurons, where very fine sub-millisecond time steps are crucial for capturing exact spike timing.

The adaptive leaky integrate-and-fire (ALIF) model is fundamental within computational neuroscience and has been instrumental in studying our brains $\textit{in silico}$. Due to the sequential nature of simulating these neural models, a commonly faced issue is the speed-accuracy trade-off: either accurately simulate a neuron using a small discretisation time-step (DT), which is slow, or more quickly simulate a neuron using a larger DT and incur a loss in simulation accuracy. Here we provide a solution to this dilemma, by algorithmically reinterpreting the ALIF model, reducing the sequential simulation complexity and permitting a more efficient parallelisation on GPUs. We computationally validate our implementation to obtain over a $50\times$ training speedup using small DTs on synthetic benchmarks. We also obtained a comparable performance to the standard ALIF implementation on different supervised classification tasks - yet in a fraction of the training time. Lastly, we showcase how our model makes it possible to quickly and accurately fit real electrophysiological recordings of cortical neurons, where very fine sub-millisecond DTs are crucial for capturing exact spike timing.

Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability
Jishnu Ray Chowdhury Cornelia Caragea



Research question: How to improve a neural model's ability to handle structure-sensitive tasks while preserving computational efficiency.
Motivation: Balanced binary tree recursive neural networks (BBT-RvNNs) are efficient on long-sequence tasks but cannot solve simple arithmetic tasks, whereas RvNN models that can (e.g., Beam Tree RvNN) are far more expensive in time and space.
Method: Propose a new framework, Recursion in Recursion (RIR), with two-level nested recursion: the outer recursion is a k-ary balanced tree model whose cell function is implemented by an inner recursive model, for which Beam Tree RvNNs are chosen, together with a beam alignment strategy to adapt them to RIR.
Results: The RIR model is the first to achieve high (above 90%) length-generalization performance on ListOps while being scalable enough to train on long-sequence inputs from the Long Range Arena (LRA). In terms of accuracy on the LRA language tasks, RIR is competitive with Structured State Space Models (SSMs) without any special initialization and outperforms Transformers.

Binary Balanced Tree Recursive Neural Networks (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth (which is the tree depth) is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-RvNNs cannot solve simple arithmetic tasks like ListOps. On the flip side, RvNN models (e.g., Beam Tree RvNN) that do succeed on ListOps (and other structure-sensitive tasks like formal logical inference) are generally several times more expensive (in time and space) than even Recurrent Neural Networks. In this paper, we introduce a novel framework, Recursion in Recursion (RIR), to strike a balance between the two sides and get some of the benefits of both worlds. In RIR, we use a form of two-level nested recursion, where the outer recursion is a $k$-ary balanced tree model with another recursive model (inner recursion) implementing its cell function. For the inner recursion, we choose Beam Tree RvNNs. To adjust Beam Tree RvNNs within RIR we also propose a novel strategy of beam alignment. Overall, this entails that the total recursive depth in RIR is upper-bounded by $k \log_k n$. Our best RIR-based model is the first model that demonstrates high ($\geq 90\%$) length-generalization performance on ListOps while at the same time being scalable enough to be trainable on long sequence inputs from LRA (it can reduce the memory usage of the original Beam Tree RvNN by hundreds of times). Moreover, in terms of accuracy in the LRA language tasks, it performs competitively with Structured State Space Models (SSMs) without any special initialization, outperforming Transformers by a large margin. On the other hand, while SSMs can marginally outperform RIR on LRA, they fail to length-generalize on ListOps. Our code is available at: https://github.com/JRC1995/BeamRecursionFamily/

Pruning vs Quantization: Which is Better?
Andrey Kuzmin Markus Nagel Mart Van Baalen Arash Behboodi Tijmen Blankevoort



Research question: This paper aims to answer which compression technique is better for neural networks: quantization or pruning.
Motivation: Although pruning and quantization have existed for a long time, only a few ad-hoc comparisons between the two have been published; answering this question can inform future neural network hardware design decisions.
Method: The authors conduct an extensive comparison of the two compression techniques for deep neural networks: first, an analytical comparison of expected quantization and pruning errors for general data distributions; then, lower and upper bounds on per-layer pruning and quantization errors in trained networks, compared against empirical errors after optimization; finally, extensive experiments on 8 large models trained on 3 tasks, with insights into the representations learned when fine-tuning with quantization and pruning in the loop.
Results: In most cases, quantization outperforms pruning; only in some scenarios with very high compression ratios can pruning be preferable from an accuracy standpoint.

Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date, only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower and upper bounds for the per-layer pruning and quantization error in trained networks and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison of 8 large-scale models trained on 3 tasks and provide insights into the representations learned during fine-tuning with quantization and pruning in the loop. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with a very high compression ratio might pruning be beneficial from an accuracy standpoint.
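
The flavor of the per-tensor analysis can be reproduced in a few lines: for a matched storage budget, compare the mean squared error of b-bit uniform quantization against magnitude pruning that keeps a b/16 fraction of fp16 weights. This toy setup is ours, not the paper's protocol.

```python
# Toy quantization-vs-pruning error comparison on a Gaussian weight tensor.
import torch

def quantize_error(w, bits):
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels            # uniform affine quantization
    q = torch.round((w - w.min()) / scale) * scale + w.min()
    return (w - q).pow(2).mean()

def prune_error(w, keep_frac):
    k = int(keep_frac * w.numel())                  # keep the k largest magnitudes
    thresh = w.abs().flatten().kthvalue(w.numel() - k).values
    return (w * (w.abs() <= thresh)).pow(2).mean()  # MSE = removed squared mass

w = torch.randn(10_000)
for bits in [2, 4, 8]:
    keep = bits / 16.0   # same storage budget as b-bit quantization of fp16 weights
    print(bits, quantize_error(w, bits).item(), prune_error(w, keep).item())
```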

Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks
Eeshaan Jain Tushar Nandy Gaurav Aggarwal Ashish V. Tendulkar Rishabh K Iyer Abir De



Research question: Existing subset selection methods mostly take discrete combinatorial and model-specific approaches and lack generality; for a new model, the algorithm must be run from scratch.
Motivation: To address this problem, the paper proposes SubSelNet, a non-adaptive subset selection framework.
Method: First, an attention-based neural gadget is introduced that exploits the graph structure of architectures and acts as a surrogate for trained deep neural networks to make quick model predictions. These predictions are then used to build subset samplers.
Results: Experiments show that the model outperforms several methods on multiple real datasets.

Existing subset selection methods for efficient learning predominantly employ discrete combinatorial and model-specific approaches, which lack generalizability: for each new model, the algorithm has to be executed from the beginning. Therefore, for an unseen architecture, one cannot use the subset chosen for a different model. In this work, we propose $\texttt{SubSelNet}$, a non-adaptive subset selection framework, which tackles these problems. Here, we first introduce an attention-based neural gadget that leverages the graph structure of architectures and acts as a surrogate to trained deep neural networks for quick model prediction. Then, we use these predictions to build subset samplers. This naturally provides us with two variants of $\texttt{SubSelNet}$. The first variant is transductive (called Transductive-$\texttt{SubSelNet}$), which computes the subset separately for each model by solving a small optimization problem. Such an optimization is still super fast, thanks to the replacement of explicit model training by the model approximator. The second variant is inductive (called Inductive-$\texttt{SubSelNet}$), which computes the subset using a trained subset selector, without any optimization. Our experiments show that our model outperforms several methods across several real datasets.

Learning to Search Feasible and Infeasible Regions of Routing Problems with Flexible Neural k-Opt
Yining Ma Zhiguang Cao Yeow Meng Chee



Research question: This paper proposes a new learning-to-search (L2S) algorithm for routing problems.
Motivation: Existing learning-to-search algorithms rely mainly on feasibility masking schemes and cannot autonomously explore both feasible and infeasible regions.
Method: Propose NeuOpt, a new L2S solver that performs flexible k-opt exchanges via a tailored action factorization method and a recurrent dual-stream decoder. A Guided Infeasible Region Exploration (GIRE) scheme is proposed to supplement the NeuOpt policy network with feasibility-related features and to use reward shaping to guide reinforcement learning more effectively. NeuOpt is additionally equipped with Dynamic Data Augmentation (D2A) for more diverse search during inference.
Results: Extensive experiments on the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) show that NeuOpt not only significantly surpasses existing (masking-based) L2S solvers but also outperforms learning-to-construct (L2C) and learning-to-predict (L2P) solvers.

In this paper, we present Neural k-Opt (NeuOpt), a novel learning-to-search (L2S) solver for routing problems. It learns to perform flexible k-opt exchanges based on a tailored action factorization method and a customized recurrent dual-stream decoder. As a pioneering work to circumvent the pure feasibility masking scheme and enable the autonomous exploration of both feasible and infeasible regions, we then propose the Guided Infeasible Region Exploration (GIRE) scheme, which supplements the NeuOpt policy network with feasibility-related features and leverages reward shaping to steer reinforcement learning more effectively. Additionally, we equip NeuOpt with Dynamic Data Augmentation (D2A) for more diverse searches during inference. Extensive experiments on the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that our NeuOpt not only significantly outstrips existing (masking-based) L2S solvers, but also showcases superiority over the learning-to-construct (L2C) and learning-to-predict (L2P) solvers. Notably, we offer fresh perspectives on how neural solvers can handle VRP constraints. Our code is available: https://github.com/yining043/NeuOpt.
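
For context, the classic 2-opt exchange that k-opt generalizes is shown below: remove two edges of a tour and reconnect by reversing the enclosed segment. NeuOpt learns which (k-opt) exchanges to apply and may traverse infeasible regions; this sketch covers only the move itself.

```python
# Classic 2-opt local search for TSP (the move that k-opt generalizes).
import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt_move(tour, i, j):
    """Reverse tour[i+1 .. j]; removes edges (i, i+1) and (j, j+1)."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]

def two_opt_local_search(tour, dist):
    improved = True
    while improved:
        improved = False
        for i in range(len(tour) - 2):
            for j in range(i + 2, len(tour) - (i == 0)):
                cand = two_opt_move(tour, i, j)
                if tour_length(cand, dist) < tour_length(tour, dist):
                    tour, improved = cand, True
    return tour

n = 8
pts = [(random.random(), random.random()) for _ in range(n)]
dist = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in pts]
        for ax, ay in pts]
print(tour_length(two_opt_local_search(list(range(n)), dist), dist))
```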

EvoFed: Leveraging Evolutionary Strategies for Communication-Efficient Federated Learning
Mohammad Mahdi Rahimi Hasnain Irshad Bhatti Younghyun Park Humaira Kousar Do-Yeon Kim Jaekyun Moon



Research question: How to train models across decentralized nodes without sharing data.
Motivation: Existing federated learning incurs high communication costs from transmitting large numbers of model parameters, which hinders its broad adoption.
Method: This paper proposes EvoFed, a new approach that integrates Evolutionary Strategies (ES) with federated learning (FL) to address these problems. EvoFed adopts fitness-based information sharing, which differs markedly from conventional model-based FL: instead of exchanging the actual updated model parameters, each node transmits a distance-based similarity measure between its locally updated model and each member of a noise-perturbed model population.
Results: Experiments show that, at the cost of increased local processing load, EvoFed maintains performance comparable to FedAvg while drastically reducing total communication requirements in various practical settings.

Federated Learning (FL) is a decentralized machine learning paradigm that enables collaborative model training across dispersed nodes without having to force individual nodes to share data. However, its broad adoption is hindered by the high communication costs of transmitting a large number of model parameters. This paper presents EvoFed, a novel approach that integrates Evolutionary Strategies (ES) with FL to address these challenges. EvoFed employs a concept of 'fitness-based information sharing', deviating significantly from the conventional model-based FL. Rather than exchanging the actual updated model parameters, each node transmits a distance-based similarity measure between the locally updated model and each member of the noise-perturbed model population. Each node, as well as the server, generates an identical population set of perturbed models in a completely synchronized fashion using the same random seeds. With properly chosen noise variance and population size, perturbed models can be combined to closely reflect the actual model updated using the local dataset, allowing the transmitted similarity measures (or fitness values) to carry nearly the complete information about the model parameters. As the population size is typically much smaller than the number of model parameters, the savings in communication load is large. The server aggregates these fitness values and is able to update the global model. This global fitness vector is then disseminated back to the nodes, each of which applies the same update to be synchronized to the global model. Our analysis shows that EvoFed converges, and our experimental results validate that at the cost of increased local processing loads, EvoFed achieves performance comparable to FedAvg while reducing overall communication requirements drastically in various practical settings.
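
The communication pattern is easy to sketch: server and clients regenerate an identical perturbed population from a shared seed, so each client uploads one fitness value per population member instead of the full parameter vector. The similarity measure and the server-side combination below are simplified stand-ins for the paper's choices.

```python
# Simplified EvoFed-style round: shared-seed population, fitness-only uploads.
import numpy as np

DIM, POP, SEED = 10_000, 32, 1234            # model size >> population size

def population(theta, seed, sigma=0.05):
    rng = np.random.default_rng(seed)         # identical on server and every node
    return theta + sigma * rng.standard_normal((POP, DIM))

def client_round(theta_global, local_update, seed):
    pop = population(theta_global, seed)
    theta_local = theta_global + local_update
    # Only POP numbers go up the wire, not DIM parameters.
    return -np.linalg.norm(pop - theta_local, axis=1)

def server_round(theta_global, fitness_sum, seed, lr=1.0):
    pop = population(theta_global, seed)      # regenerated, never transmitted
    w = np.exp(fitness_sum - fitness_sum.max())
    w /= w.sum()
    return (1 - lr) * theta_global + lr * (w @ pop)   # fitness-weighted recombination

theta = np.zeros(DIM)
fit = sum(client_round(theta, 0.1 * np.random.randn(DIM), SEED) for _ in range(4))
theta = server_round(theta, fit, SEED)
```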

Linear Time Algorithms for k-means with Multi-Swap Local Search
Junyu Huang Qilong Feng Ziyun Huang Jinhui Xu Jianxin Wang



Research question: Local search methods for clustering problems.
Motivation: On large-scale datasets, single-swap local search algorithms leave a considerable gap in approximation ratio compared with multi-swap local search algorithms.
Method: Propose a multi-swap local search algorithm for the k-means problem with running time linear in the data size. Given a swap size t, the algorithm achieves a $(50(1+\frac{1}{t})+\epsilon)$-approximation, improving the current best result.
Results: The method is the first multi-swap local search algorithm to achieve linear running time in the data size. Sampling is used to accelerate clustering-cost updates during swaps, and a recombination mechanism searches for potentially better solutions. Experiments show that the proposed algorithms outperform existing state-of-the-art local search algorithms and a branch-and-bound solver on both small and large datasets.

Local search methods have been widely used to solve clustering problems. In practice, local search algorithms for clustering problems mainly adopt the single-swap strategy, which enables them to handle large-scale datasets and achieve linear running time in the data size. However, compared with multi-swap local search algorithms, there is a considerable gap in the approximation ratios of the single-swap local search algorithms. Although the current multi-swap local search algorithms provide small constant approximations, the proposed algorithms tend to have large polynomial running times, which cannot be used to handle large-scale datasets. In this paper, we propose a multi-swap local search algorithm for the $k$-means problem with linear running time in the data size. Given a swap size $t$, our proposed algorithm can achieve a $(50(1+\frac{1}{t})+\epsilon)$-approximation, which improves the current best result of 509 (ICML 2019) while keeping linear running time in the data size. Our proposed method, compared with previous multi-swap local search algorithms, is the first one to achieve linear running time in the data size. To obtain a more practical algorithm for the problem with better clustering quality and running time, we propose a sampling-based method that accelerates the process of clustering cost updates during swaps. Besides, a recombination mechanism is proposed to find potentially better solutions. Empirical experiments show that our proposed algorithms achieve better performance compared with a branch-and-bound solver (NeurIPS 2022) and other existing state-of-the-art local search algorithms on both small and large datasets.
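
As a baseline for intuition, here is the bare single-swap local search the paper improves upon: repeatedly try replacing one center with one data point and keep the swap if the k-means cost drops. Multi-swap replaces t centers at once; the paper's sampling-based acceleration and recombination mechanism are omitted.

```python
# Bare-bones single-swap local search for k-means (illustrative baseline).
import numpy as np

def cost(X, centers):
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()           # sum of squared distances to nearest center

def single_swap_local_search(X, k, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    best = cost(X, C)
    for _ in range(iters):
        i, j = rng.integers(k), rng.integers(len(X))
        cand = C.copy()
        cand[i] = X[j]                   # swap one center for one data point
        c = cost(X, cand)
        if c < best:
            C, best = cand, c
    return C, best

X = np.random.default_rng(1).normal(size=(200, 2))
print(single_swap_local_search(X, k=5)[1])
```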

EvoPrompting: Language Models for Code-Level Neural Architecture Search
Angelica Chen David Dohan David So



Research question: Exploring language models as general adaptive mutation and crossover operators in an evolutionary neural architecture search (NAS) algorithm.
Motivation: Although NAS remains too difficult a task for LMs to solve through prompting alone, combining evolutionary prompt engineering with soft prompt-tuning, a method termed EvoPrompting, consistently finds diverse and well-performing models.
Method: The approach is first demonstrated on the computationally efficient MNIST-1D dataset, then applied to searching for graph neural networks on the CLRS Algorithmic Reasoning Benchmark, where EvoPrompting designs *novel* architectures that outperform current state-of-the-art models on 21 of 30 algorithmic reasoning tasks while maintaining similar model size.
Results: EvoPrompting successfully designs accurate and efficient neural network architectures across a variety of machine learning tasks, while also being general enough to adapt easily to tasks beyond neural network design.

Given the recent impressive accomplishments of language models (LMs) for code generation, we explore the use of LMs as general adaptive mutation and crossover operators for an evolutionary neural architecture search (NAS) algorithm. While NAS still proves too difficult a task for LMs to succeed at solely through prompting, we find that the combination of evolutionary prompt engineering with soft prompt-tuning, a method we term EvoPrompting, consistently finds diverse and high performing models. We first demonstrate that EvoPrompting is effective on the computationally efficient MNIST-1D dataset, where EvoPrompting produces convolutional architecture variants that outperform both those designed by human experts and naive few-shot prompting in terms of accuracy and model size. We then apply our method to searching for graph neural networks on the CLRS Algorithmic Reasoning Benchmark, where EvoPrompting is able to design *novel* architectures that outperform current state-of-the-art models on 21 out of 30 algorithmic reasoning tasks while maintaining similar model size. EvoPrompting is successful at designing accurate and efficient neural network architectures across a variety of machine learning tasks, while also being general enough for easy adaptation to other tasks beyond neural network design.

Structured State Space Models for In-Context Reinforcement Learning
Chris Lu Yannick Schroecker Albert Gu Emilio Parisotto Jakob Nicolaus Foerster Satinder Singh Feryal Behbahani



Research question: How to adapt S4 models to reinforcement learning tasks.
Motivation: Existing S4 models perform excellently on long-sequence modeling tasks and offer fast inference and parallelizable training, making them potentially useful in many reinforcement learning settings.
Method: Modify a variant of S4 so that the hidden state can be initialized and reset in parallel, enabling it to handle reinforcement learning tasks.
Results: Experiments show the modified model runs faster than Transformers with respect to sequence length and outperforms RNNs on a simple memory task. Evaluated on partially observable environments, it outperforms RNNs while running over five times faster. Leveraging the model's ability to handle long sequences, it also succeeds on a challenging meta-learning task with randomly sampled continuous control environments combined with random linear projections of the environment's observations and actions.

Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers in sequence length and performs better than RNNs on a simple memory-based task. We evaluate our modified architecture on a set of partially-observable environments and find that, in practice, our model outperforms RNNs while also running over five times faster. Then, by leveraging the model's ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper show that structured state space models are fast and performant for in-context reinforcement learning tasks. We provide code at https://github.com/luchris429/s5rl.

Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
Minsoo Kim Sihwa Lee Janghwan Lee Sukjin Hong Du-Seong Chang Wonyong Sung Jungwook Choi



Research question: How to reduce the model size of large generative language models (GLMs) for practical deployment.
Motivation: Large GLMs excel at text generation, understanding, and reasoning, but their large model size poses challenges for practical deployment.
Method: Propose a novel knowledge distillation method tailored to GLMs, token-scaled logit distillation, which prevents overfitting and learns better from both the teacher model and the ground truth.
Results: The method provides the first evaluation of ternary weight quantization-aware training for large-scale GLMs with less than 1.0 perplexity degradation, and achieves higher accuracy on tasks such as common-sense QA, arithmetic reasoning, and natural language understanding.

Generative Language Models (GLMs) have shown impressive performance in tasks such as text generation, understanding, and reasoning. However, the large model size poses challenges for practical deployment. To solve this problem, Quantization-Aware Training (QAT) has become increasingly popular. However, current QAT methods for generative models have resulted in a noticeable loss of accuracy. To counteract this issue, we propose a novel knowledge distillation method specifically designed for GLMs. Our method, called token-scaled logit distillation, prevents overfitting and provides superior learning from the teacher model and ground truth. This research marks the first evaluation of ternary weight quantization-aware training of large-scale GLMs with less than 1.0 degradation in perplexity and achieves enhanced accuracy in tasks like common-sense QA and arithmetic reasoning as well as natural language understanding. Our code is available at https://github.com/aiha-lab/TSLD.
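
One plausible shape for such a loss is sketched below: a standard per-token KL distillation term whose tokens are re-weighted by a teacher-derived scale (here, teacher confidence). The paper's exact scaling rule may differ; the sketch only shows where token-wise scaling enters.

```python
# Sketch of a token-scaled logit distillation loss (assumed scaling rule).
import torch
import torch.nn.functional as F

def token_scaled_kd_loss(student_logits, teacher_logits, T=2.0):
    # logits: (batch, seq_len, vocab)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    kl_per_token = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(-1)  # (B, S)
    scale = p_t.max(dim=-1).values            # assumption: teacher confidence per token
    scale = scale / scale.sum(dim=-1, keepdim=True)
    return (scale * kl_per_token).sum(-1).mean() * T * T

s = torch.randn(2, 16, 100, requires_grad=True)
t = torch.randn(2, 16, 100)
token_scaled_kd_loss(s, t).backward()
```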

Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone
Zeyinzi Jiang Chaojie Mao Ziyuan Huang Ao Ma Yiliang Lv Yujun Shen Deli Zhao Jingren Zhou



Research question: How to effectively transfer large foundation models to downstream applications.
Motivation: Existing methods typically embed lightweight tuners into the backbone network, making both the design and the learning of the tuners highly dependent on the base model.
Method: Propose a new tuning paradigm, Res-Tuning, which intentionally unbinds tuners from the backbone. With both theoretical and empirical evidence, popular tuning methods are shown to have equivalent counterparts under this decoupled formulation and can therefore be integrated into the framework seamlessly. Thanks to the structural disentanglement, the design of tuners is freed from the network architecture, enabling flexible combinations of tuning strategies.
Results: Extensive experiments on discriminative and generative tasks show the method outperforms existing alternatives in both efficacy and efficiency.

Parameter-efficient tuning has become a trend in transferring large-scale foundation models to downstream applications. Existing methods typically embed some light-weight tuners into the backbone, where both the design and the learning of the tuners are highly dependent on the base model. This work offers a new tuning paradigm, dubbed Res-Tuning, which intentionally unbinds tuners from the backbone. With both theoretical and empirical evidence, we show that popular tuning approaches have their equivalent counterparts under our unbinding formulation, and hence can be integrated into our framework effortlessly. Thanks to the structural disentanglement, we manage to free the design of tuners from the network architecture, facilitating flexible combination of various tuning strategies. We further propose a memory-efficient variant of Res-Tuning, where the bypass (i.e., the branch formed by a sequence of tuners) is effectively detached from the main branch, such that the gradients are back-propagated only to the tuners but not to the backbone. Such a detachment also allows one-time backbone forward for multi-task inference. Extensive experiments on both discriminative and generative tasks demonstrate the superiority of our method over existing alternatives from the perspectives of efficacy and efficiency. Project page: https://res-tuning.github.io/.

CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
Andres Potapczynski Marc Anton Finzi Geoff Pleiss Andrew Gordon Wilson



Research question: This paper targets the large-scale linear algebra problems that arise in machine learning and the sciences, such as eigendecomposition, solving linear systems, computing matrix exponentials, and trace estimation.
Motivation: The matrices involved often have Kronecker, convolutional, block-diagonal, sum, or product structure, which existing methods often fail to exploit effectively.
Method: The paper proposes CoLA (Compositional Linear Algebra), a simple but general framework that combines a linear operator abstraction with compositional dispatch rules to automatically construct memory- and runtime-efficient numerical algorithms.
Results: CoLA accelerates many algebraic operations while making it easy to prototype matrix structures and algorithms, offering an appealing drop-in tool for any computational task that requires linear algebra. Experiments show good results across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.

Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
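
A one-rule example of the kind of dispatch CoLA automates: a matrix-vector product with a Kronecker-structured operator A ⊗ B can be computed without ever forming the large matrix, via the identity (A ⊗ B) vec(X) = vec(B X A^T). This is a sketch of the idea, not the package's actual LinearOperator API.

```python
# Kronecker matvec without materializing the Kronecker product.
import numpy as np

def kron_matvec(A, B, x):
    n, m = A.shape[0], B.shape[0]
    X = x.reshape(n, m).T              # columns of X are the length-m blocks of x
    return (B @ X @ A.T).T.reshape(-1) # O(n*m*(n+m)) instead of O((n*m)^2)

A, B = np.random.randn(3, 3), np.random.randn(4, 4)
x = np.random.randn(12)
assert np.allclose(kron_matvec(A, B, x), np.kron(A, B) @ x)
```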

Accurate Interpolation for Scattered Data through Hierarchical Residual Refinement
Shizhe Ding Boyang Xia Dongbo Bu



Research question: How to perform more accurate interpolation with neural networks.
Motivation: Traditional numerical algorithms enforce exactly zero residuals at observed points, whereas neural-network-based interpolation methods exhibit non-zero residuals there; these residuals can guide the prediction of the interpolation function, but existing methods have not exploited them.
Method: Propose the Hierarchical INTerpolation Network (HINT), which uses the residuals at observed points to guide target-function estimation hierarchically. HINT consists of several sequentially arranged lightweight interpolation blocks: the first block estimates the main component of the target function, and each subsequent block predicts a residual component from the observed-point residuals of the preceding block; the main and residual components are accumulated to form the final interpolation result. In addition, under the assumption that finer residual prediction requires a more focused attention range over observed points, hierarchical local constraints are used when modeling correlations between observed and target points.
Results: Extensive experiments show that HINT significantly outperforms existing interpolation algorithms in accuracy across a wide variety of datasets, underscoring its potential for practical applications.

Accurate interpolation algorithms are highly desired in various theoretical and engineering scenarios. Unlike the traditional numerical algorithms that have exact zero-residual constraints on observed points, the neural network-based interpolation methods exhibit non-zero residuals at these points. These residuals, which provide observations of an underlying residual function, can guide predicting interpolation functions, but have not been exploited by the existing approaches. To fill this gap, we propose Hierarchical INTerpolation Network (HINT), which utilizes the residuals on observed points to guide target function estimation in a hierarchical fashion. HINT consists of several sequentially arranged lightweight interpolation blocks. The first interpolation block estimates the main component of the target function, while subsequent blocks predict the residual components using observed points residuals of the preceding blocks. The main component and residual components are accumulated to form the final interpolation results. Furthermore, under the assumption that finer residual prediction requires a more focused attention range on observed points, we utilize hierarchical local constraints in correlation modeling between observed and target points. Extensive experiments demonstrate that HINT outperforms existing interpolation algorithms significantly in terms of interpolation accuracy across a wide variety of datasets, which underscores its potential for practical scenarios.
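
The hierarchical-residual structure can be summarized in a few lines, with trivial RBF interpolators standing in for HINT's attention-based blocks: each block fits whatever residual the previous blocks left at the observed points, and the per-block components are summed. A sketch under these simplifying assumptions:

```python
# Hierarchical residual refinement with simple RBF blocks (structure-only sketch).
import numpy as np

def rbf_block(x_obs, r_obs, x_tgt, gamma):
    K = np.exp(-gamma * (x_tgt[:, None] - x_obs[None, :]) ** 2)
    K_oo = np.exp(-gamma * (x_obs[:, None] - x_obs[None, :]) ** 2)
    w = np.linalg.solve(K_oo + 1e-6 * np.eye(len(x_obs)), r_obs)
    return K @ w, K_oo @ w             # prediction at targets / at observed points

def hierarchical_interp(x_obs, y_obs, x_tgt, gammas=(0.5, 5.0, 50.0)):
    pred, residual = np.zeros_like(x_tgt), y_obs.copy()
    for gamma in gammas:               # later blocks: larger gamma, i.e. a more
        p_tgt, p_obs = rbf_block(x_obs, residual, x_tgt, gamma)   # local range
        pred += p_tgt                  # accumulate main + residual components
        residual -= p_obs              # next block sees the remaining residual
    return pred

x_obs = np.linspace(0, 1, 20)
y_obs = np.sin(8 * x_obs)
print(hierarchical_interp(x_obs, y_obs, np.linspace(0, 1, 5)))
```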

ZipLM: Inference-Aware Structured Pruning of Language Models
Eldar Kurtic Elias Frantar Dan Alistarh



Research question: The breakthrough performance of large language models comes with large computational footprints and high deployment costs.
Motivation: This paper proposes a novel structured compression method, ZipLM, to address the computation and deployment costs of large language models.
Method: ZipLM compresses a model by iteratively identifying and removing the components with the worst loss-runtime trade-off. The approach is not tied to a specific model family such as BERT (encoder) or GPT (decoder), and produces state-of-the-art compressed models across all of these settings.
Results: Experiments show that ZipLM outperforms existing distillation and pruning techniques at a fraction of the computational cost while meeting specified inference targets, producing families of smaller, faster, and more accurate models. In particular, when compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster.

The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the *post-training/one-shot* or the *gradual compression* setting, and only for specific families of models such as BERT (*encoder*) or GPT (*decoder*), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60\% smaller and 30\% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM.

Efficient Meta Neural Heuristic for Multi-Objective Combinatorial Optimization
Jinbiao Chen Jiahai Wang Zizhen Zhang Zhiguang Cao Te Ye Siyuan Chen



Research question: How to improve the learning efficiency and solution quality of deep-reinforcement-learning neural heuristics for multi-objective combinatorial optimization problems.
Motivation: Current neural heuristics for multi-objective combinatorial optimization still suffer from low learning efficiency and unsatisfactory solution quality.
Method: Propose an efficient meta neural heuristic (EMNH) that first trains a meta-model and then fine-tunes it with a few steps to solve the corresponding single-objective subproblems. Specifically, a multi-task model with a (partially) shared architecture enables parallel learning of the meta-model to speed up training; a scaled symmetric sampling method over the weight vectors stabilizes training; and during fine-tuning, an efficient hierarchical method systematically tackles all the subproblems.
Results: Experiments on the multi-objective traveling salesman problem (MOTSP), the multi-objective capacitated vehicle routing problem (MOCVRP), and the multi-objective knapsack problem (MOKP) show that EMNH outperforms state-of-the-art neural heuristics in solution quality and learning efficiency, and yields solutions competitive with strong traditional heuristics in far less time.

Recently, neural heuristics based on deep reinforcement learning have exhibited promise in solving multi-objective combinatorial optimization problems (MOCOPs). However, they are still struggling to achieve high learning efficiency and solution quality. To tackle this issue, we propose an efficient meta neural heuristic (EMNH), in which a meta-model is first trained and then fine-tuned with a few steps to solve corresponding single-objective subproblems. Specifically, for the training process, a (partial) architecture-shared multi-task model is leveraged to achieve parallel learning for the meta-model, so as to speed up the training; meanwhile, a scaled symmetric sampling method with respect to the weight vectors is designed to stabilize the training. For the fine-tuning process, an efficient hierarchical method is proposed to systematically tackle all the subproblems. Experimental results on the multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) show that, EMNH is able to outperform the state-of-the-art neural heuristics in terms of solution quality and learning efficiency, and yield competitive solutions to the strong traditional heuristics while consuming much shorter time.

Fast Projected Newton-like Method for Precision Matrix Estimation under Total Positivity
Jian-Feng CAI José Vinícius De Miranda Cardoso Daniel P. Palomar Jiaxi Ying



Research question: This paper addresses the estimation of precision matrices of Gaussian distributions that are multivariate totally positive of order two (MTP2) in high-dimensional settings.
Motivation: Current algorithms become computationally challenging in high dimensions because they must solve a large number of nonnegative quadratic programs or large-scale linear systems.
Method: We propose a new algorithm based on the two-metric projection method, incorporating a carefully designed search direction and a variable partitioning scheme.
Results: Experimental results show that the new algorithm significantly improves computational efficiency compared with state-of-the-art methods.

We study the problem of estimating precision matrices in Gaussian distributions that are multivariate totally positive of order two ($\mathrm{MTP}_2$). The precision matrix in such a distribution is an M-matrix. This problem can be formulated as a sign-constrained log-determinant program. Current algorithms are designed using the block coordinate descent method or the proximal point algorithm, which becomes computationally challenging in high-dimensional cases due to the requirement to solve numerous nonnegative quadratic programs or large-scale linear systems. To address this issue, we propose a novel algorithm based on the two-metric projection method, incorporating a carefully designed search direction and variable partitioning scheme. Our algorithm substantially reduces computational complexity, and its theoretical convergence is established. Experimental results on synthetic and real-world datasets demonstrate that our proposed algorithm provides a significant improvement in computational efficiency compared to the state-of-the-art methods.

Unleashing the Full Potential of Product Quantization for Large-Scale Image Retrieval
Yu Liang Shiliang Zhang Kenli Li Xiaoyu Wang



Research question: Current deep hashing methods suffer from high computational cost or unsatisfactory accuracy in large-scale real-world applications.
Motivation: A novel deep hashing framework based on product quantization (PQ) is proposed to address these problems.
Method: Use a softmax-based differentiable PQ branch to learn predefined PQ codes for the classes. The method is easy to implement, requires no large-scale matrix operations, and learns highly discriminative compact codes.
Results: The method is validated on multiple large-scale datasets, including ImageNet100, ImageNet1K, and Glint360K, and the experimental results demonstrate its superiority.

Due to its promising performance, deep hashing has become a prevalent method for approximate nearest neighbor search (ANNS). However, most current deep hashing methods are validated on relatively small-scale datasets, leaving potential threats when they are applied to large-scale real-world scenarios. Specifically, they can be constrained either by the computational cost arising from the large number of training categories and samples, or by unsatisfactory accuracy. To tackle those issues, we propose a novel deep hashing framework based on product quantization (PQ). It uses a softmax-based differentiable PQ branch to learn a set of predefined PQ codes of the classes. Our method is easy to implement, does not involve large-scale matrix operations, and learns highly discriminative compact codes. We validate our method on multiple large-scale datasets, including ImageNet100, ImageNet1K, and Glint360K, where the category size scales from 100 to 360K and the sample number scales from 10K to 17 million, respectively. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/yuleung/FPPQ.

Lookaround Optimizer: $k$ steps around, 1 step average
Jiangtao Zhang Shunyu Liu Jie Song Tongtian Zhu Zhengqi Xu Mingli Song



Research question: How to improve the generalization of deep networks by jointly training multiple networks and averaging their weights.
Motivation: Existing weight-averaging methods are usually applied post hoc along a single training trajectory, which greatly reduces the diversity between networks and thus impairs effectiveness.
Method: Propose Lookaround, a simple yet effective SGD-based optimizer that iterates an "around" step and an "average" step throughout training to reach flatter minima with better generalization.
Results: Theoretical analysis and extensive experiments show that Lookaround clearly outperforms the state of the art on popular benchmarks including CIFAR and ImageNet, for both CNNs and ViTs.

Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization. Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner (i.e., the weights are averaged after the entire training process is finished), which significantly degrades the diversity between networks and thus impairs the effectiveness. In this paper, inspired by weight average, we propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization. Specifically, Lookaround iterates two steps during the whole training period: the around step and the average step. In each iteration, 1) the around step starts from a common point and trains multiple networks simultaneously, each on transformed data by a different data augmentation, and 2) the average step averages these trained networks to get the averaged network, which serves as the starting point for the next iteration. The around step improves the functionality diversity while the average step guarantees the weight locality of these networks during the whole training, which is essential for WA to work. We theoretically explain the superiority of Lookaround by convergence analysis, and make extensive experiments to evaluate Lookaround on popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs, demonstrating clear superiority over state-of-the-arts. Our code is available at https://github.com/Ardcy/Lookaround.
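
A skeleton of the two alternating steps is shown below; the augmentations, schedules, and full training loop are elided, and all names are illustrative.

```python
# Skeleton of one Lookaround iteration: around step, then average step.
import copy
import torch

def lookaround_iteration(net, augmentations, data, lr=0.05, k=5):
    # Around step: from a common start, train one copy per data augmentation.
    copies = [copy.deepcopy(net) for _ in augmentations]
    for m, aug in zip(copies, augmentations):
        opt = torch.optim.SGD(m.parameters(), lr=lr)
        for _ in range(k):                        # k SGD steps "around" the start
            x, y = data()
            loss = torch.nn.functional.cross_entropy(m(aug(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    # Average step: the weight average becomes the next common start point.
    with torch.no_grad():
        for p, *qs in zip(net.parameters(), *[m.parameters() for m in copies]):
            p.copy_(torch.stack(qs).mean(0))
    return net

net = torch.nn.Linear(16, 4)
data = lambda: (torch.randn(8, 16), torch.randint(0, 4, (8,)))
augs = [lambda x: x, lambda x: x + 0.1 * torch.randn_like(x)]
net = lookaround_iteration(net, augs, data)
```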

Bringing regularized optimal transport to lightspeed: a splitting method adapted for GPUs
Jacob Lindbäck Zesen Wang Mikael Johansson



Research question: Developing an efficient algorithm for regularized optimal transport.
Motivation: In contrast to previous approaches, the Douglas-Rachford splitting technique is used to build an efficient solver that can handle a broad class of regularizers.
Method: The algorithm comes with strong global convergence guarantees and a low per-iteration cost, and can exploit GPU parallelization, making it considerably faster than the state of the art on many problems.
Results: Its competitiveness is demonstrated in several applications, including domain adaptation and the learning of generative models.

We present an efficient algorithm for regularized optimal transport. In contrast to previous methods, we use the Douglas-Rachford splitting technique to develop an efficient solver that can handle a broad class of regularizers. The algorithm has strong global convergence guarantees, low per-iteration cost, and can exploit GPU parallelization, making it considerably faster than the state-of-the-art for many problems. We illustrate its competitiveness in several applications, including domain adaptation and learning of generative models.
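
The Douglas-Rachford template itself is three lines per iteration; the toy below solves min f(x) + g(x) with a quadratic f and a nonnegativity constraint g. For regularized OT, f and g instead encode the transport cost with the marginal constraints and the regularizer, and the proximal steps reduce to cheap, GPU-friendly operations.

```python
# Generic Douglas-Rachford splitting on a toy problem: min 0.5||x - a||^2 s.t. x >= 0.
import numpy as np

a = np.array([1.0, -2.0, 0.5])
prox_f = lambda v, g=1.0: (v + g * a) / (1 + g)   # prox of 0.5*||x - a||^2
prox_g = lambda v: np.maximum(v, 0.0)             # projection onto x >= 0

z = np.zeros_like(a)
for _ in range(100):
    x = prox_f(z)
    y = prox_g(2 * x - z)
    z = z + y - x
print(x)   # converges to [1, 0, 0.5], the projection of a onto x >= 0
```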

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Yelysei Bondarenko Markus Nagel Tijmen Blankevoort



Research question: How to reduce the computational time and memory consumption of large neural networks.
Motivation: Modern transformer models learn strong outliers in their activations, which makes them difficult to quantize; maintaining acceptable performance then requires higher bitwidths, different numeric formats, extra fine-tuning, or other workarounds.
Method: Propose two simple (independent) modifications to the attention mechanism: _clipped softmax_ and _gated attention_.
Results: Models pre-trained with these methods learn significantly smaller outliers while maintaining, and sometimes improving, floating-point task performance, enabling full INT8 quantization of transformers without any extra effort. The methods are shown to be effective on both language models (BERT, OPT) and vision transformers.

Transformer models have been widely adopted in various domains over the last years and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways for reducing the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher-bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op", or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - _clipped softmax_ and _gated attention_. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
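
The clipped-softmax modification can be stated compactly: stretch the softmax output to the range [gamma, zeta] with gamma < 0 < 1 < zeta and clip back to [0, 1], so a head can emit exact zeros without driving its logits toward minus infinity. The parameter values below are illustrative.

```python
# Clipped softmax: exact zeros in attention weights with finite logits.
import torch

def clipped_softmax(logits, zeta=1.003, gamma=-0.003, dim=-1):
    s = torch.softmax(logits, dim=dim)
    return torch.clamp((zeta - gamma) * s + gamma, 0.0, 1.0)

att = clipped_softmax(torch.tensor([[4.0, 0.0, -2.0, -2.0]]))
print(att)   # the smallest probabilities are clipped to exactly 0
```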

LD2: Scalable Heterophilous Graph Neural Network with Decoupled Embeddings
Ningyi Liao Siqiang Luo Xiang Li Jieming Shi



Research question: This paper addresses the scalability of heterophilous graph neural networks when training on large graphs.
Motivation: Existing heterophilous GNN models are limited on large-scale graphs by their high computational cost and the difficulty of adopting minibatch schemes.
Method: We propose LD2, a scalable model that simplifies the learning process by decoupling graph propagation from training and generating expressive embeddings beforehand.
Results: Theoretical analysis shows that LD2 achieves optimal time complexity in training and a memory footprint independent of the graph size. Experiments show that the model supports lightweight minibatch training on large-scale heterophilous graphs with up to 15x speedup and efficient memory utilization, while maintaining performance comparable to or better than the baselines.

Heterophilous Graph Neural Network (GNN) is a family of GNNs that specializes in learning graphs under heterophily, where connected nodes tend to have different labels. Most existing heterophilous models incorporate iterative non-local computations to capture node relationships. However, these approaches have limited application to large-scale graphs due to their high computational costs and challenges in adopting minibatch schemes. In this work, we study the scalability issues of heterophilous GNN and propose a scalable model, LD2, which simplifies the learning process by decoupling graph propagation and generating expressive embeddings prior to training. Theoretical analysis demonstrates that LD2 achieves optimal time complexity in training, as well as a memory footprint that remains independent of the graph scale. We conduct extensive experiments to showcase that our model is capable of lightweight minibatch training on large-scale heterophilous graphs, with up to $15\times$ speed improvement and efficient memory utilization, while maintaining comparable or better performance than the baselines.

Direct Training of SNN using Local Zeroth Order Method
Bhaskar Mukhoty Velibor Bojkovic William de Vazelhes Xiaohan Zhao Giulia De Masi Huan Xiong Bin Gu



Research question: How to overcome the loss of gradient information and the non-differentiability caused by the Heaviside function in training spiking neural networks.
Motivation: Spiking neural networks offer low energy consumption on real-world tasks with accuracy comparable to traditional ANNs, but their training algorithms suffer from the above problems.
Method: Propose training SNNs with zeroth-order techniques applied at the local, i.e., neuron, level, and establish a theoretical connection between this approach and existing surrogate methods.
Results: Experimental validation on standard static and neuromorphic datasets shows improvements over state-of-the-art results, and the method admits efficient implementations that can provide a 3-4x overall speedup in training time. The code is available at \url{https://github.com/BhaskarMukhoty/LocalZO}.

Spiking neural networks are becoming increasingly popular for their low energy requirement in real-world tasks with accuracy comparable to traditional ANNs. SNN training algorithms face the loss of gradient information and non-differentiability due to the Heaviside function in minimizing the model loss over model parameters. To circumvent this problem, the surrogate method employs a differentiable approximation of the Heaviside function in the backward pass, while the forward pass continues to use the Heaviside as the spiking function. We propose to use the zeroth-order technique at the local or neuron level in training SNNs, motivated by its regularizing and potential energy-efficient effects and establish a theoretical connection between it and the existing surrogate methods. We perform experimental validation of the technique on standard static datasets (CIFAR-10, CIFAR-100, ImageNet-100) and neuromorphic datasets (DVS-CIFAR-10, DVS-Gesture, N-Caltech-101, NCARS) and obtain results that offer improvement over the state-of-the-art results. The proposed method also lends itself to efficient implementations of the back-propagation method, which could provide 3-4 times overall speedup in training time. The code is available at \url{https://github.com/BhaskarMukhoty/LocalZO}.

Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics
Anton Voronov Mikhail Khoroshikh Artem Babenko Max Ryabinin



Research question: How to speed up the adaptation of large text-to-image models to small datasets or new visual concepts.
Motivation: Many efficient adaptation methods currently have long training times, which limits their practical applications, slows down experimentation, and consumes excessive GPU resources.
Method: Based on the observation that most concepts are learned at early stages and do not improve in quality later, propose a simple early-stopping criterion that only requires computing the regular training objective on a fixed set of inputs across all training iterations.
Results: In experiments on Stable Diffusion with 48 different concepts and three personalization methods, the approach performs competitively, making adaptation up to 8 times faster with no significant drop in quality.

Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practical applications, slows down experiments, and spends excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate that. Instead, we propose a simple drop-in early stopping criterion that only requires computing the regular training objective on a fixed set of inputs for all training iterations. Our experiments on Stable Diffusion for 48 different concepts and three personalization methods demonstrate the competitive performance of our approach, which makes adaptation up to 8 times faster with no significant drops in quality.
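
The criterion is cheap to state in code: evaluate the ordinary training objective on one fixed batch at every iteration and stop when the curve flattens. The plateau test below (relative change over a window) is a simplification of the paper's actual rule, and the training loop is a placeholder.

```python
# Early stopping from the training objective on a fixed batch (simplified rule).
import torch

def should_stop(history, window=50, tol=1e-3):
    if len(history) < 2 * window:
        return False
    prev = sum(history[-2 * window:-window]) / window
    last = sum(history[-window:]) / window
    return abs(prev - last) / max(abs(prev), 1e-12) < tol

fixed_batch = torch.randn(16, 8)            # same inputs at every iteration
model = torch.nn.Linear(8, 8)               # placeholder for the adapted model
history = []
for step in range(1000):
    # ... one adaptation step on the real training data would go here ...
    with torch.no_grad():
        loss = (model(fixed_batch) - fixed_batch).pow(2).mean()
    history.append(loss.item())
    if should_stop(history):
        print(f"early stop at step {step}")
        break
```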

Temporal Dynamic Quantization for Diffusion Models
Junhyuk So Jungwon Lee Daehyun Ahn Hyungjun Kim Eunhyeok Park



Research question: Diffusion models perform impressively in vision applications, but the high storage and computation demands of their large size and iterative generation limit their use on mobile devices.
Motivation: Existing quantization techniques struggle to maintain performance even at 8-bit precision because of the diffusion model's unique property of temporal variation in activations.
Method: Propose a new quantization method that dynamically adjusts the quantization interval based on time-step information, significantly improving output quality. The approach adds no computational overhead during inference and is compatible with both post-training quantization (PTQ) and quantization-aware training (QAT).
Results: Extensive experiments show substantial improvements in the output quality of the quantized model across various configurations.

Diffusion model has gained popularity in vision applications due to its remarkable generative performance and versatility. However, its high storage and computation demands, resulting from the model size and iterative generation, hinder its use on mobile devices. Existing quantization techniques struggle to maintain performance even in 8-bit precision due to the diffusion model's unique property of temporal variation in activation. We introduce a novel quantization method that dynamically adjusts the quantization interval based on time step information, significantly improving output quality. Unlike conventional dynamic quantization techniques, our approach has no computational overhead during inference and is compatible with both post-training quantization (PTQ) and quantization-aware training (QAT). Our extensive experiments demonstrate substantial improvements in output quality with the quantized model across various configurations.

KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training
Thao Nguyen Truong Balazs Gerofi Edgar Josafat Martinez-Noriega François Trahay Mohamed Wahib



Research question: How to improve the efficiency of deep neural network training.
Motivation: Reduce training cost by hiding the samples that contribute least during training.
Method: Use loss and prediction-confidence information gathered during training to adaptively exclude samples that contribute little to the overall learning process, without significantly degrading accuracy.
Results: Empirical studies on multiple large-scale datasets and models for image classification and segmentation show that, whereas with-replacement importance sampling performs poorly on large datasets, the method reduces total training time by up to 22% while affecting accuracy by only 0.4%.

This paper proposes a method for hiding the least-important samples during the training of deep neural networks to increase efficiency, i.e., to reduce the cost of training. Using information about the loss and prediction confidence during training, we adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process, without significantly degrading accuracy. We explore the convergence properties when accounting for the reduction in the number of SGD updates. Empirical results on various large-scale datasets and models used directly in image classification and segmentation show that, while the with-replacement importance sampling algorithm performs poorly on large datasets, our method can reduce total training time by up to 22\% while impacting accuracy only by 0.4\% compared to the baseline.
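
A minimal version of the idea is sketched below, with a fixed hiding rate standing in for the paper's adaptive schedule, which also uses prediction confidence.

```python
# Hide the lowest-loss samples each epoch; train on the rest (simplified).
import torch

def visible_indices(per_sample_loss, hide_frac=0.2):
    n_hide = int(hide_frac * per_sample_loss.numel())
    order = per_sample_loss.argsort()            # ascending: easiest samples first
    return order[n_hide:]                        # keep the harder samples

model = torch.nn.Linear(10, 2)
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(3):
    with torch.no_grad():                        # per-sample losses as contribution proxy
        losses = torch.nn.functional.cross_entropy(model(X), y, reduction="none")
    keep = visible_indices(losses)
    loss = torch.nn.functional.cross_entropy(model(X[keep]), y[keep])
    opt.zero_grad(); loss.backward(); opt.step()
```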

Variational Monte Carlo on a Budget — Fine-tuning pre-trained Neural Wavefunctions
Michael Scherbela Leon Gerard Philipp Grohs



Research question: Obtaining accurate solutions to the Schrödinger equation is the key challenge in computational quantum chemistry.
Motivation: Although deep-learning-based variational Monte Carlo (DL-VMC) has surpassed conventional methods in accuracy, its computational cost is high.
Method: We propose a DL-VMC model pre-trained with self-supervised wavefunction optimization on a large, chemically diverse set of molecules. Applied to new molecules without any optimization, it yields wavefunctions and absolute energies that outperform established methods such as CCSD(T)-2Z.
Results: By combining an improved geometry embedding architecture with an existing SE(3)-equivariant model for molecular orbitals, we obtain a fully end-to-end machine-learned model. Together with continuous sampling of geometries, this improves zero-shot accuracy by two orders of magnitude over the state of the art. We extensively evaluate the accuracy, scalability, and limitations of our base model on a wide variety of test systems.

Obtaining accurate solutions to the Schrödinger equation is the key challenge in computational quantum chemistry. Deep-learning-based Variational Monte Carlo (DL-VMC) has recently outperformed conventional approaches in terms of accuracy, but only at large computational cost. Whereas in many domains models are trained once and subsequently applied for inference, accurate DL-VMC so far requires a full optimization for every new problem instance, consuming thousands of GPU hours even for small molecules. We instead propose a DL-VMC model which has been pre-trained using self-supervised wavefunction optimization on a large and chemically diverse set of molecules. Applying this model to new molecules without any optimization yields wavefunctions and absolute energies that outperform established methods such as CCSD(T)-2Z. To obtain accurate relative energies, only few fine-tuning steps of this base model are required. We accomplish this with a fully end-to-end machine-learned model, consisting of an improved geometry embedding architecture and an existing SE(3)-equivariant model to represent molecular orbitals. Combining this architecture with continuous sampling of geometries, we improve zero-shot accuracy by two orders of magnitude compared to the state of the art. We extensively evaluate the accuracy, scalability and limitations of our base model on a wide variety of test systems.

Operation-Level Early Stopping for Robustifying Differentiable NAS
Shen Jiang Zipeng Ji Guanghui Zhu Chunfeng Yuan Yihua Huang



Research question: DARTS is widely used across machine learning tasks but still suffers from robustness issues, chiefly the domination of skip connections.
Motivation: Existing methods hold that skip connections enjoy extra optimization advantages over other parametric operations and propose to alleviate their domination by eliminating these advantages.
Method: This paper analyzes the issue from a simple and direct perspective, proposing that skip-connection domination arises because parametric operations overfit the training data while the architecture parameters are trained on validation data, leading to undesired behavior. Based on this observation, operation-level early stopping (OLES) is proposed to address the issue and robustify DARTS without introducing any computational overhead.
Results: Extensive experimental results verify the hypothesis and the effectiveness of OLES.

Differentiable NAS (DARTS) is a simple and efficient neural architecture search method that has been extensively adopted in various machine learning tasks. Nevertheless, DARTS still encounters several robustness issues, mainly the domination of skip connections. The resulting architectures are full of parameter-free operations, leading to performance collapse. Existing methods suggest that the skip connection has additional advantages in optimization compared to other parametric operations and propose to alleviate the domination of skip connections by eliminating these additional advantages. In this paper, we analyze this issue from a simple and straightforward perspective and propose that the domination of skip connections results from parametric operations overfitting the training data while architecture parameters are trained on the validation data, leading to undesired behaviors. Based on this observation, we propose the operation-level early stopping (OLES) method to overcome this issue and robustify DARTS without introducing any computation overhead. Extensive experimental results can verify our hypothesis and the effectiveness of OLES.

Towards Data-Agnostic Pruning At Initialization: What Makes a Good Sparse Mask?
Hoang Pham The-Anh Ta Shiwei Liu Lichuan Xiang Dung D. Le Hongkai Wen Long Tran-Thanh



Research question: This paper addresses pruning at initialization (PaI) for training as well as inference efficiency, and the shortcomings of existing PaI methods in accuracy and computation reduction.
Motivation: Existing PaI methods outperform random pruning, but their performance still falls far short of post-training pruning, and the understanding of PaI remains unclear. For example, recent studies show that existing PaI methods can only find good layerwise sparsities rather than good weights, since the discovered subnetworks are surprisingly resilient to layerwise random mask shuffling and weight re-initialization.
Method: This paper studies PaI from a brand-new perspective: the topology of subnetworks. Specifically, it proposes a principled framework that analyzes the performance of pruning-at-initialization methods via two quantities, the number of effective paths and the number of effective nodes. These quantities allow a more comprehensive understanding of PaI methods and an accurate assessment of different subnetworks at initialization. Systematically analyzing various PaI methods through this framework reveals a guiding principle for constructing effective subnetworks: at a given sparsity, the top-performing subnetwork always maintains a good balance between the number of effective nodes and the number of effective paths.
Results: Inspired by this observation, the paper presents a novel data-agnostic pruning method obtained by solving a multi-objective optimization problem. Extensive experiments across architectures and datasets show that the approach outperforms state-of-the-art PaI methods while discovering subnetworks with much lower inference FLOPs (up to 3.4x). The code will be fully released.

Pruning at initialization (PaI) aims to remove weights of neural networks before training, in pursuit of training efficiency in addition to inference efficiency. While off-the-shelf PaI methods manage to find trainable subnetworks that outperform random pruning, their performance in terms of both accuracy and computational reduction is far from satisfactory compared to post-training pruning, and a principled understanding of PaI is still missing. For instance, recent studies show that existing PaI methods are only able to find good layerwise sparsities, not good weights, as the discovered subnetworks are surprisingly resilient against layerwise random mask shuffling and weight re-initialization. In this paper, we study PaI from a brand-new perspective -- the topology of subnetworks. In particular, we propose a principled framework for analyzing the performance of pruning-at-initialization (PaI) methods with two quantities, namely, the number of effective paths and the number of effective nodes. These quantities allow for a more comprehensive understanding of PaI methods, giving us an accurate assessment of different subnetworks at initialization. We systematically analyze the behavior of various PaI methods through our framework and observe a guiding principle for constructing effective subnetworks: *at a specific sparsity, the top-performing subnetwork always presents a good balance between the number of effective nodes and the number of effective paths.* Inspired by this observation, we present a novel data-agnostic pruning method by solving a multi-objective optimization problem. By conducting extensive experiments across different architectures and datasets, our results demonstrate that our approach outperforms state-of-the-art PaI methods while it is able to discover subnetworks that have much lower inference FLOPs (up to 3.4$\times$). Code will be fully released.
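
The two quantities of the framework are simple to compute for a masked MLP: the number of input-to-output paths through surviving weights is a product of the binary mask matrices, and a node is effective iff at least one such path passes through it. The sketch below implements these definitions under our own simplified formulation, not the authors' code.

```python
# Effective paths and effective nodes of a masked MLP.
import numpy as np

def effective_paths(masks):
    """masks[l] is the binary (out, in) mask of layer l."""
    v = np.ones(masks[0].shape[1])
    for M in masks:
        v = M @ v                     # v[i] = #paths from the inputs to unit i
    return v.sum()

def effective_nodes(masks):
    fwd = [np.ones(masks[0].shape[1])]
    for M in masks:
        fwd.append(M @ fwd[-1])       # paths reaching each unit from the input
    bwd = [np.ones(masks[-1].shape[0])]
    for M in reversed(masks):
        bwd.insert(0, M.T @ bwd[0])   # paths from each unit to the output
    # A node is effective iff it lies on at least one complete path.
    return sum(int(((f > 0) & (b > 0)).sum()) for f, b in zip(fwd, bwd))

rng = np.random.default_rng(0)
masks = [(rng.random((8, 8)) < 0.3).astype(float) for _ in range(3)]
print(effective_paths(masks), effective_nodes(masks))
```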

DeepPCR: Parallelizing Sequential Operations in Neural Networks
Federico Danieli Miguel Sarabia Xavier Suau Pau Rodriguez Luca Zappella



Research question: Although parallelization techniques are widely used to accelerate inference and training of deep neural networks, several operations are still performed sequentially, which can become a bottleneck as the number of steps grows.
Motivation: To address this, the paper introduces DeepPCR, a novel algorithm that parallelizes typically sequential operations to speed up inference and training of neural networks.
Method: DeepPCR interprets a sequence of L steps as the solution of a specific system of equations, which it recovers with the Parallel Cyclic Reduction algorithm, reducing the complexity of computing these sequential operations from O(L) to O(log_2 L).
Results: Parallelizing the forward and backward passes of multi-layer perceptrons and the training of diffusion models validates the theoretically lower complexity, reaching speedups of up to 30x for the forward pass and 200x for the backward pass.

Parallelization techniques have become ubiquitous for accelerating inference and training of deep neural networks. Despite this, several operations are still performed in a sequential manner. For instance, the forward and backward passes are executed layer-by-layer, and the output of diffusion models is produced by applying a sequence of denoising steps. This sequential approach results in a computational cost proportional to the number of steps involved, presenting a potential bottleneck as the number of steps increases. In this work, we introduce DeepPCR, a novel algorithm which parallelizes typically sequential operations in order to speed up inference and training of neural networks. DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm. This reduces the complexity of computing the sequential operations from $\mathcal{O}(L)$ to $\mathcal{O}(\log_2L)$, thus yielding a speedup for large $L$. To verify the theoretical lower complexity of the algorithm, and to identify regimes for speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward pass in multi-layer perceptrons, and reach speedups of up to $30\times$ for the forward and $200\times$ for the backward pass. We additionally showcase the flexibility of DeepPCR by parallelizing training of ResNets with as many as 1024 layers, and generation in diffusion models, enabling up to $7\times$ faster training and $11\times$ faster generation, respectively, when compared to the sequential approach.
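The key step is that many sequential operations are affine in the running state, so the whole sequence forms a bidiagonal linear system solvable in logarithmic depth. The sketch below uses prefix-doubling composition of affine maps, which has the same O(log_2 L) depth as the Parallel Cyclic Reduction solver DeepPCR employs (a simplified stand-in, not the paper's implementation):

```python
import numpy as np

def parallel_linear_recurrence(a, b, x0=0.0):
    """Solve x[i] = a[i] * x[i-1] + b[i] in O(log L) parallel sweeps.

    Each sweep composes adjacent affine maps (prefix doubling); a serial
    loop would need L dependent steps, this needs ceil(log2 L) sweeps.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    L = len(a)
    shift = 1
    while shift < L:
        # compose map i with map i-shift: x -> a[i]*(a[i-s]*x + b[i-s]) + b[i]
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_prev, a * b_prev + b
        shift *= 2
    return a * x0 + b  # x[i] expressed as a function of the initial state

# quick check against the serial recurrence:
# a=[2,3], b=[1,1], x0=1 -> x1 = 3, x2 = 10
print(parallel_linear_recurrence(np.array([2.0, 3.0]), np.array([1.0, 1.0]), 1.0))
```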

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models
Denis Kuznedelev Eldar Kurtic Elias Frantar Dan Alistarh



Research question: How can the compressibility of computer vision models be improved for deployment?
Motivation: Although accuracy on classic benchmarks such as ImageNet has improved dramatically, these highly accurate models are hard to deploy because they are difficult to compress with standard pruning techniques.
Method: The paper introduces the Correlation Aware Pruner (CAP), a new unstructured pruning framework that significantly pushes the compressibility limits of state-of-the-art architectures. It rests on two technical advances: a new theoretically justified pruner that handles complex weight correlations accurately and efficiently during pruning, and an efficient finetuning procedure for post-compression recovery.
Results: Extensive experiments on several modern vision models (Vision Transformers, modern CNNs, and ViT-CNN hybrids) show for the first time that these models can be pruned to high sparsity levels (e.g., >=75%) with little impact on accuracy (<=1% relative drop). The approach is also compatible with structured pruning and quantization, yielding practical speedups of 1.5 to 2.4x without accuracy loss. To further showcase CAP's accuracy and scalability, it is used to show for the first time that extremely accurate large vision models trained with self-supervised techniques can also be pruned to moderate sparsities with negligible accuracy loss.

Driven by significant improvements in architectural design and training pipelines, computer vision has recently experienced dramatic progress in terms of accuracy on classic benchmarks such as ImageNet. These highly-accurate models are challenging to deploy, as they appear harder to compress using standard techniques such as pruning. We address this issue by introducing the Correlation Aware Pruner (CAP), a new unstructured pruning framework which significantly pushes the compressibility limits for state-of-the-art architectures. Our method is based on two technical advancements: a new theoretically-justified pruner, which can handle complex weight correlations accurately and efficiently during the pruning process itself, and an efficient finetuning procedure for post-compression recovery. We validate our approach via extensive experiments on several modern vision models such as Vision Transformers (ViT), modern CNNs, and ViT-CNN hybrids, showing for the first time that these can be pruned to high sparsity levels (e.g. $\geq 75$%) with low impact on accuracy ($\leq 1$% relative drop). Our approach is also compatible with structured pruning and quantization, and can lead to practical speedups of 1.5 to 2.4x without accuracy loss. To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.

Facing Off World Model Backbones: RNNs, Transformers, and S4
Fei Deng Junyeong Park Sungjin Ahn



Research question: This paper explores alternative world model backbones for improving long-term memory.
Motivation: Existing world models predominantly use recurrent neural networks (RNNs) as their backbone, which have limited memory capacity.
Method: The paper investigates the effectiveness of Transformers and Structured State Space Sequence (S4) models and proposes S4WM, the first parallelizable world model compatible with S4 and its variants.
Results: Experiments show that S4WM outperforms Transformer-based world models in long-term memory while being more efficient during training and imagination. These results pave the way for stronger MBRL agents.

World models are a fundamental component in model-based reinforcement learning (MBRL). To perform temporally extended and consistent simulations of the future in partially observable environments, world models need to possess long-term memory. However, state-of-the-art MBRL agents, such as Dreamer, predominantly employ recurrent neural networks (RNNs) as their world model backbone, which have limited memory capacity. In this paper, we seek to explore alternative world model backbones for improving long-term memory. In particular, we investigate the effectiveness of Transformers and Structured State Space Sequence (S4) models, motivated by their remarkable ability to capture long-range dependencies in low-dimensional sequences and their complementary strengths. We propose S4WM, the first world model compatible with parallelizable SSMs including S4 and its variants. By incorporating latent variable modeling, S4WM can efficiently generate high-dimensional image sequences through latent imagination. Furthermore, we extensively compare RNN-, Transformer-, and S4-based world models across four sets of environments, which we have tailored to assess crucial memory capabilities of world models, including long-term imagination, context-dependent recall, reward prediction, and memory-based reasoning. Our findings demonstrate that S4WM outperforms Transformer-based world models in terms of long-term memory, while exhibiting greater efficiency during training and imagination. These results pave the way for the development of stronger MBRL agents.

Combinatorial Optimization with Policy Adaptation using Latent Space Search
Felix Chalumeau Shikha Surana Clément Bonnet Nathan Grinsztajn Arnu Pretorius Alexandre Laterre Thomas D Barrett



Research question: Designing effective algorithms for combinatorial optimization, a classically NP-hard problem.
Motivation: Despite notable progress in many domains, reinforcement learning has not yet supplanted industrial solvers.
Method: The paper proposes COMPASS, a novel reinforcement learning approach that parameterizes a distribution of diverse and specialized policies through a continuous latent space.
Results: Evaluated on three canonical problems, COMPASS's search strategy outperforms state-of-the-art approaches on 9 out of 11 standard benchmarking tasks and generalizes better across a set of 18 procedurally transformed instance distributions.

Combinatorial Optimization underpins many real-world applications and yet, designing performant algorithms to solve these complex, typically NP-hard, problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions, but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy, or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems - Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling - and demonstrate that our search strategy (i) outperforms state-of-the-art approaches in 9 out of 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.

Revisiting Adversarial Training for ImageNet: Architectures, Training and Generalization across Threat Models
Naman Deep Singh Francesco Croce Matthias Hein



Research question: How does adversarial training affect ViTs and ConvNeXts on ImageNet?
Motivation: While adversarial training has been studied extensively for ResNet architectures and low-resolution datasets such as CIFAR-10, much less is known for ImageNet. Given the recent debate on whether Transformers are more robust than convnets, the paper revisits adversarial training on ImageNet, comparing ViTs and ConvNeXts.
Method: Extensive experiments show that minor changes in architecture (most notably replacing the PatchStem with a ConvStem) and in the training scheme have a significant impact on the achieved robustness. These changes not only increase robustness in the seen l_inf threat model but, even more so, improve generalization to unseen l_1/l_2 attacks.
Results: The modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust l_inf models across different ranges of model parameters and FLOPs, while ViT + ConvStem generalizes best to unseen threat models.

While adversarial training has been extensively studied for ResNet architectures and low resolution datasets like CIFAR-10, much less is known for ImageNet. Given the recent debate about whether transformers are more robust than convnets, we revisit adversarial training on ImageNet comparing ViTs and ConvNeXts. Extensive experiments show that minor changes in architecture, most notably replacing PatchStem with ConvStem, and training scheme have a significant impact on the achieved robustness. These changes not only increase robustness in the seen $\ell_\infty$-threat model, but even more so improve generalization to unseen $\ell_1/\ell_2$-attacks. Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust $\ell_\infty$-models across different ranges of model parameters and FLOPs, while our ViT + ConvStem yields the best generalization to unseen threat models.

Efficient Hyper-parameter Optimization with Cubic Regularization
Zhenqian Shen Hansi Yang Yong Li James Kwok quanming yao



Research question: This paper addresses hyper-parameter optimization problems in which hyper-gradients are unavailable because the performance metric is non-differentiable or the hyper-parameters are discrete.
Motivation: Existing algorithms such as Bayesian optimization and reinforcement learning often get trapped in poor local optima on such problems.
Method: The paper proposes using cubic regularization to accelerate convergence and avoid saddle points. It first adopts stochastic relaxation, which provides gradient and Hessian information without hyper-gradients, then exploits the rich curvature information via cubic regularization. Theoretically, the method converges to approximate second-order stationary points, and convergence is guaranteed even when the lower-level problem is solved inexactly.
Results: Experiments on synthetic and real-world data demonstrate the effectiveness of the method.

As hyper-parameters are ubiquitous and can significantly affect the model performance, hyper-parameter optimization is extremely important in machine learning. In this paper, we consider a sub-class of hyper-parameter optimization problems, where the hyper-gradients are not available. Such problems frequently appear when the performance metric is non-differentiable or the hyper-parameter is not continuous. However, existing algorithms, like Bayesian optimization and reinforcement learning, often get trapped in local optima with poor performance. To address the above limitations, we propose to use cubic regularization to accelerate convergence and avoid saddle points. First, we adopt stochastic relaxation, which allows obtaining gradient and Hessian information without hyper-gradients. Then, we exploit the rich curvature information by cubic regularization. Theoretically, we prove that the proposed method can converge to approximate second-order stationary points, and the convergence is also guaranteed when the lower-level problem is inexactly solved. Experiments on synthetic and real-world data demonstrate the effectiveness of our proposed method.
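For intuition, the cubic-regularized step minimizes a quadratic model of the objective damped by a cubic term. A minimal sketch of that subproblem, solved here by plain gradient descent (g and H would come from the paper's stochastic relaxation rather than from hyper-gradients; M, iters, and lr are illustrative constants):

```python
import numpy as np

def cubic_step(g, H, M=1.0, iters=200, lr=0.01):
    """One cubic-regularized Newton step: minimize the local model
        m(s) = g.T s + 0.5 s.T H s + (M/6) ||s||^3
    over the step s, using simple gradient descent on the subproblem.
    """
    s = np.zeros_like(g)
    for _ in range(iters):
        # grad of (M/6)||s||^3 is (M/2)||s|| s
        grad_m = g + H @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * grad_m
    return s
```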

Causes and Effects of Unanticipated Numerical Deviations in Neural Network Inference Frameworks
Alexander Schlögl Nora Hofer Rainer Böhme



Research question: Hardware-specific optimizations in machine learning frameworks can cause numerical deviations in inference results.
Motivation: Despite using a fixed trained model and fixed input data, inference results are not consistent across platforms, and sometimes not even deterministic on the same platform.
Method: Convolutional neural networks (CNNs) are studied on realistic end-to-end inference pipelines and in isolated experiments to understand the causes of these numerical deviations.
Results: Results from 75 distinct platforms suggest that the main causes of deviations are differences in SIMD use on CPUs and the runtime selection of convolution algorithms on GPUs. The causes and propagation effects are linked to properties of the ML model, and potential mitigations are evaluated.

Hardware-specific optimizations in machine learning (ML) frameworks can cause numerical deviations of inference results. Quite surprisingly, despite using a fixed trained model and fixed input data, inference results are not consistent across platforms, and sometimes not even deterministic on the same platform. We study the causes of these numerical deviations for convolutional neural networks (CNN) on realistic end-to-end inference pipelines and in isolated experiments. Results from 75 distinct platforms suggest that the main causes of deviations on CPUs are differences in SIMD use, and the selection of convolution algorithms at runtime on GPUs. We link the causes and propagation effects to properties of the ML model and evaluate potential mitigations. We make our research code publicly available.

Suggesting Variable Order for Cylindrical Algebraic Decomposition via Reinforcement Learning
Fuqi Jia Yuhang Dong Minghao Liu Pei Huang Feifei Ma Jian Zhang



Research question: How to effectively determine the variable order for polynomials to improve the efficiency of symbolic computation.
Motivation: Existing approaches to determining variable order rely mainly on heuristics, and learning-based methods cannot cope with diverse polynomial sets.
Method: The paper proposes two reinforcement learning approaches combined with graph neural networks for suggesting variable order: a branching heuristic integrated with CAD, and a fast heuristic that directly provides a total order.
Results: Experiments show that both approaches outperform state-of-the-art learning-based heuristics and are competitive with the best expert-based heuristics. The models also show strong generalization, working well on various datasets even when trained only on a 3-variable random dataset.

Cylindrical Algebraic Decomposition (CAD) is one of the pillar algorithms of symbolic computation, and its worst-case complexity is double exponential to the number of variables. Researchers found that variable order dramatically affects efficiency and proposed various heuristics. The existing learning-based methods are all supervised learning methods that cannot cope with diverse polynomial sets. This paper proposes two Reinforcement Learning (RL) approaches combined with Graph Neural Networks (GNN) for Suggesting Variable Order (SVO). One is GRL-SVO(UP), a branching heuristic integrated with CAD. The other is GRL-SVO(NUP), a fast heuristic providing a total order directly. We generate a random dataset and collect a real-world dataset from SMT-LIB. The experiments show that our approaches outperform state-of-the-art learning-based heuristics and are competitive with the best expert-based heuristics. Interestingly, our models show a strong generalization ability, working well on various datasets even if they are only trained on a 3-var random dataset. The source code and data are available at https://github.com/dongyuhang22/GRL-SVO.

Training Transformers with 4-bit Integers
Haocheng Xi ChangHao Li Jianfei Chen Jun Zhu



Research question: How to accelerate neural network training with 4-bit quantization.
Motivation: Existing 4-bit training methods require custom numerical formats not supported by contemporary hardware.
Method: The paper proposes a training method for Transformers in which all matrix multiplications are implemented with INT4 arithmetic, designing dedicated quantizers for the specific structures of activations and gradients: a Hadamard quantizer suppresses outliers in forward propagation, and bit splitting with leverage score sampling quantizes gradients accurately in backpropagation.
Results: Competitive accuracy on natural language understanding, machine translation, and image classification; the prototype linear operators run up to 2.2x faster than their FP16 counterparts and speed up training of sufficiently large models by 17.8% on average.

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with the INT4 arithmetic. Training with an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activation and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototypical linear operator implementation is up to 2.2 times faster than the FP16 counterparts and speeds up the training by 17.8\% on average for sufficiently large models. Our code is available at https://github.com/xijiu9/Train\_Transformers\_with\_INT4.
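A rough sketch of the forward-pass idea: rotating activations with an orthonormal Hadamard matrix spreads a few large outlier entries across all coordinates before symmetric INT4 quantization. This illustrates the principle only (the paper's quantizer and kernels differ, and the sketch assumes the feature dimension is a power of two):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_quantize(x):
    """Symmetric per-tensor quantization to the INT4 levels -8..7."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def hadamard_quantize(X):
    """Rotate, then quantize; dequantize with (q * scale) @ H.T."""
    H = hadamard(X.shape[-1])
    q, scale = int4_quantize(X @ H)
    return q, scale, H
```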

Matrix Compression via Randomized Low Rank and Low Precision Factorization
Rajarshi Saha Varun Srivastava Mert Pilanci



Research question: How to store and process large matrices efficiently, exploiting their approximate low-rank structure.
Motivation: Modern matrices can contain billions of elements, making their storage and processing demanding in compute and memory; yet such matrices are often approximately low rank.
Method: The paper proposes an algorithm that obtains an approximate basis of the matrix's range space by randomly sketching its columns, quantizes the vectors constituting this basis, and then computes approximate projections of the matrix's columns onto the quantized basis, yielding a low-rank and low-precision factorization.
Results: Experiments show the algorithm is effective for image compression, nearest-neighbor classification of image and text embeddings, and compressing the layers of LlaMa-$7$b. Compression ratios as aggressive as one bit per matrix coordinate are achievable while surpassing or maintaining the performance of traditional compression techniques.

Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix $\mathbf{A}$ as $\mathbf{A} \approx \mathbf{L}\mathbf{R}$, where $\mathbf{L}$ and $\mathbf{R}$ are the low rank factors. The total number of elements in $\mathbf{L}$ and $\mathbf{R}$ can be significantly less than that in $\mathbf{A}$. Furthermore, the entries of $\mathbf{L}$ and $\mathbf{R}$ are quantized to low precision formats -- compressing $\mathbf{A}$ by giving us a low rank and low precision factorization. Our algorithm first computes an approximate basis of the range space of $\mathbf{A}$ by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of $\mathbf{A}$ onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-$7$b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.
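The three steps (sketch the range, quantize the basis, project) fit in a few lines of NumPy. A simplified illustration with uniform symmetric quantization; the paper analyzes the approximation error and bit-budget allocation much more carefully:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to the given bit-width."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels + 1e-12
    return np.round(x / scale) * scale

def low_rank_low_precision(A, rank, bits=4, seed=0):
    """A ~ L @ R with quantized factors (a sketch of the idea).

    1) randomly sketch the column space, 2) quantize the basis,
    3) project A's columns onto the quantized basis.
    """
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((A.shape[1], rank))
    Q, _ = np.linalg.qr(A @ S)                 # approximate range basis
    L = quantize(Q, bits)                      # low-precision left factor
    R = quantize(np.linalg.pinv(L) @ A, bits)  # projected right factor
    return L, R
```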

Evolving Connectivity for Recurrent Spiking Neural Networks
Guan Wang Yuhao Sun Sijie Cheng Sen Song



Research question: How to improve the training efficiency and accuracy of recurrent spiking neural networks (RSNNs), which model biological nervous systems and complex dynamics.
Motivation: The widely used surrogate gradient-based training methods for RSNNs are inaccurate and unfriendly to neuromorphic hardware.
Method: The paper proposes the evolving connectivity (EC) framework, an inference-only method for training RSNNs. It reformulates weight tuning as a search over parameterized connection probability distributions and optimizes these distributions with Natural Evolution Strategies (NES).
Results: Evaluated on standard robotic locomotion tasks, EC matches deep neural networks and outperforms gradient-trained RSNNs, even solving the complex 17-DoF humanoid task. It is also two to three times more efficient than directly evolving parameters. By providing a performant and hardware-friendly alternative, EC lays the groundwork for energy-efficient RSNN applications and the development of neuromorphic devices.

Recurrent spiking neural networks (RSNNs) hold great potential for advancing artificial general intelligence, as they draw inspiration from the biological nervous system and show promise in modeling complex dynamics. However, the widely-used surrogate gradient-based training methods for RSNNs are inherently inaccurate and unfriendly to neuromorphic hardware. To address these limitations, we propose the evolving connectivity (EC) framework, an inference-only method for training RSNNs. The EC framework reformulates weight-tuning as a search into parameterized connection probability distributions, and employs Natural Evolution Strategies (NES) for optimizing these distributions. Our EC framework circumvents the need for gradients and features hardware-friendly characteristics, including sparse boolean connections and high scalability. We evaluate EC on a series of standard robotic locomotion tasks, where it achieves comparable performance with deep neural networks and outperforms gradient-trained RSNNs, even solving the complex 17-DoF humanoid task. Additionally, the EC framework demonstrates a two to three fold speedup in efficiency compared to directly evolving parameters. By providing a performant and hardware-friendly alternative, the EC framework lays the groundwork for further energy-efficient applications of RSNNs and advances the development of neuromorphic devices. Our code is publicly available at https://github.com/imoneoi/EvolvingConnectivity.
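The core update is gradient-free: sample boolean connectivity masks from Bernoulli probabilities, score them, and move the probabilities toward high-fitness masks. A minimal sketch (for Bernoulli parameters, the NES natural gradient reduces to a fitness-weighted average of mask minus theta; fitness_fn stands in for an episode-return evaluation of the RSNN a mask defines):

```python
import numpy as np

def nes_step(theta, fitness_fn, pop=64, lr=0.1, rng=None):
    """One NES update on a flat vector of connection probabilities."""
    rng = rng or np.random.default_rng()
    masks = rng.random((pop, theta.size)) < theta       # Bernoulli samples
    f = np.array([fitness_fn(m) for m in masks])
    f = (f - f.mean()) / (f.std() + 1e-8)               # fitness shaping
    # natural gradient for Bernoulli mean parameters: E[f * (mask - theta)]
    grad = np.mean(f[:, None] * (masks - theta), axis=0)
    return np.clip(theta + lr * grad, 0.01, 0.99)
```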

Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation
Xin Yuan Pedro Henrique Pamplona Savarese Michael Maire



Research question: This paper develops a method to efficiently grow neural networks, designing parameterization and optimization strategies that account for their effects on the training dynamics.
Motivation: Existing growing methods follow simple replication heuristics or rely on auxiliary gradient-based local optimization, without considering how growth affects the training dynamics.
Method: A parameterization scheme that dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves while maintaining the network's inference functionality, combined with a learning rate adaptation mechanism that rebalances the gradient contributions of subnetworks fading in at different growth phases, addressing the optimization difficulty caused by imbalanced training effort.
Results: Experiments show the method achieves comparable or better accuracy than training large fixed-size models while saving a substantial portion of the original training computation budget, and these gains translate into real wall-clock training speedups.

We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering their effects on the training dynamics. Unlike existing growing methods, which follow simple replication heuristics or utilize auxiliary gradient-based local optimization, we craft a parameterization scheme which dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves, and maintains the inference functionality of the network. To address the optimization difficulty resulting from imbalanced training effort distributed to subnetworks fading in at different growth phases, we propose a learning rate adaptation mechanism that rebalances the gradient contribution of these separate subcomponents. Experiments show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original training computation budget. We demonstrate that these gains translate into real wall-clock training speedups.

PriorBand: Practical Hyperparameter Optimization in the Age of Deep Learning
Neeratyoy Mallik Eddie Bergman Carl Hvarfner Danny Stoll Maciej Janowski Marius Lindauer Luigi Nardi Frank Hutter



Research question: Hyperparameters of deep learning pipelines are crucial for downstream performance, but the cost of optimizing them is often untenable for modern deep learning.
Motivation: Although many hyperparameter optimization (HPO) methods have been developed, their incurred costs remain prohibitive for modern DL. Consequently, manual experimentation, relying on the researcher's intuition, domain knowledge, and cheap preliminary explorations, is still the most common approach.
Method: To resolve this mismatch between HPO algorithms and DL researchers, the paper proposes PriorBand, an HPO algorithm tailored to DL that can exploit both expert beliefs and cheap proxy tasks.
Results: Empirically, PriorBand is efficient across a range of DL benchmarks, showing gains under informative expert input and robustness against poor expert beliefs.

Hyperparameters of Deep Learning (DL) pipelines are crucial for their downstream performance. While a large number of methods for Hyperparameter Optimization (HPO) have been developed, their incurred costs are often untenable for modern DL. Consequently, manual experimentation is still the most prevalent approach to optimize hyperparameters, relying on the researcher's intuition, domain knowledge, and cheap preliminary explorations. To resolve this misalignment between HPO algorithms and DL researchers, we propose PriorBand, an HPO algorithm tailored to DL, able to utilize both expert beliefs and cheap proxy tasks. Empirically, we demonstrate PriorBand's efficiency across a range of DL benchmarks and show its gains under informative expert input and robustness against poor expert beliefs.

Landscape Surrogate: Learning Decision Losses for Mathematical Optimization Under Partial Information
Arman Zharmagambetov Brandon Amos Aaron M Ferber Taoan Huang Bistra Dilkina Yuandong Tian



Research question: How to accelerate the solution of partially observed optimization problems by learning an optimizer, particularly where general-purpose optimizers perform poorly without expert tuning.
Motivation: Recent work has shown that learning-integrated optimization works well when the optimization problem is only partially observed or when general-purpose optimizers perform poorly without expert tuning.
Method: The paper proposes a smooth and learnable Landscape Surrogate as a replacement for $f\circ \mathbf{g}$. This surrogate, learnable by neural networks, can be computed faster than the solver $\mathbf{g}$, provides dense and smooth gradients during training, generalizes to unseen problems, and is learned efficiently via alternating optimization.
Results: Tested on synthetic problems such as shortest path and multidimensional knapsack, and on real-world problems such as portfolio optimization, it achieves comparable or superior objective values to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$; notably, it outperforms existing methods on computationally expensive high-dimensional problems.

Recent works in learning-integrated optimization have shown promise in settings where the optimization problem is only partially observed or where general-purpose optimizers perform poorly without expert tuning. By learning an optimizer $\mathbf{g}$ to tackle these challenging problems with $f$ as the objective, the optimization process can be substantially accelerated by leveraging past experience. The optimizer can be trained with supervision from known optimal solutions or implicitly by optimizing the compound function $f\circ \mathbf{g}$. The implicit approach may not require optimal solutions as labels and is capable of handling problem uncertainty; however, it is slow to train and deploy due to frequent calls to optimizer $\mathbf{g}$ during both training and testing. The training is further challenged by sparse gradients of $\mathbf{g}$, especially for combinatorial solvers. To address these challenges, we propose using a smooth and learnable **Landscape Surrogate** $\mathcal{M}$ as a replacement for $f\circ \mathbf{g}$. This surrogate, learnable by neural networks, can be computed faster than the solver $\mathbf{g}$, provides dense and smooth gradients during training, can generalize to unseen optimization problems, and is efficiently learned via alternating optimization. We test our approach on both synthetic problems, including shortest path and multidimensional knapsack, and real-world problems such as portfolio optimization, achieving comparable or superior objective values compared to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$. Notably, our approach outperforms existing methods for computationally expensive high-dimensional problems.

Compressed Video Prompt Tuning
Bing Li Jiaxin Chen Xiuguo Bao Di Huang



Research question: How to effectively adapt pre-trained raw video models to compressed video understanding tasks.
Motivation: Current approaches to compressed video processing follow a resource-consuming pre-training and fine-tuning paradigm that does not exploit the properties of compressed videos, limiting widespread application.
Method: The paper proposes a prompt-based representation learning framework, Compressed Video Prompt Tuning (CVPT), which re-parameterizes compressed modalities (e.g., motion vectors and residuals) into conditional prompts with layer-wise refinement, resolving the inconsistency between pre-training and downstream data modalities.
Results: Extensive evaluations on HMDB-51, UCF-101, and Something-Something v2 show that CVPT clearly outperforms state-of-the-art counterparts, with a much better balance between accuracy and efficiency.

Compressed videos offer a compelling alternative to raw videos, showing the possibility to significantly reduce the on-line computational and storage cost. However, current approaches to compressed video processing generally follow the resource-consuming pre-training and fine-tuning paradigm, which does not fully take advantage of such properties, making them not favorable enough for widespread applications. Inspired by recent successes of prompt tuning techniques in computer vision, this paper presents the first attempt to build a prompt based representation learning framework, which enables effective and efficient adaptation of pre-trained raw video models to compressed video understanding tasks. To this end, we propose a novel prompt tuning approach, namely Compressed Video Prompt Tuning (CVPT), emphatically dealing with the challenging issue caused by the inconsistency between pre-training and downstream data modalities. Specifically, CVPT replaces the learnable prompts with compressed modalities (\emph{e.g.} Motion Vectors and Residuals) by re-parameterizing them into conditional prompts followed by layer-wise refinement. The conditional prompts exhibit improved adaptability and generalizability to instances compared to conventional individual learnable ones, and the Residual prompts enhance the noisy motion cues in the Motion Vector prompts for further fusion with the visual cues from I-frames. Additionally, we design Selective Cross-modal Complementary Prompt (SCCP) blocks. After inserting them into the backbone, SCCP blocks leverage semantic relations across diverse levels and modalities to improve cross-modal interactions between prompts and input flows. Extensive evaluations on HMDB-51, UCF-101 and Something-Something v2 demonstrate that CVPT remarkably outperforms the state-of-the-art counterparts, delivering a much better balance between accuracy and efficiency.

Towards Optimal Caching and Model Selection for Large Model Inference
Banghua Zhu Ying Sheng Lianmin Zheng Clark Barrett Michael Jordan Jiantao Jiao



Research question: Resource consumption and latency of large language models (LLMs) and other large foundation models during inference.
Motivation: Large-scale deployment of these models is hindered by their significant resource requirements and inference latency.
Method: Two approaches are studied for reducing inference cost: a cache that stores the results of past queries, and a model selector that chooses a model from an ensemble to process each query.
Results: Combining a caching algorithm such as GDSF or LEC with a model selector achieves optimal rates in both offline and online settings and greatly improves over baselines: up to a 50x improvement in simulations, a 4.3x reduction in FLOPs, and a 1.8x improvement in average latency on real datasets.

Large Language Models (LLMs) and other large foundation models have achieved impressive results, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model selector to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model selector, we achieve optimal rates in both offline and online settings. Empirically, simulations show that our caching and model selection algorithm greatly improves over the baselines, with up to $50\times$ improvement over the baseline when the ratio between the maximum cost and minimum cost is $100$. Experiments on real datasets show a $4.3\times$ improvement in FLOPs over the baseline when the ratio for FLOPs is $10$, and a $1.8\times$ improvement in latency when the ratio for average latency is $1.85$.
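For reference, the GDSF policy mentioned above ranks cache entries by priority = age + frequency * cost / size and raises the age floor on every eviction, so recently evicted priorities lift all future ones. A compact sketch of the caching half of the pipeline (field layout and cost units are illustrative):

```python
class GDSFCache:
    """Greedy Dual Size with Frequency eviction for query results."""

    def __init__(self, capacity):
        self.capacity, self.used, self.age = capacity, 0, 0.0
        self.entries = {}  # key -> [priority, freq, size, cost, value]

    def get(self, key):
        e = self.entries.get(key)
        if e is None:
            return None
        e[1] += 1                            # bump frequency
        e[0] = self.age + e[1] * e[3] / e[2] # refresh priority
        return e[4]

    def put(self, key, value, size, cost):
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            self.age = self.entries[victim][0]  # inflate future priorities
            self.used -= self.entries[victim][2]
            del self.entries[victim]
        self.entries[key] = [self.age + cost / size, 1, size, cost, value]
        self.used += size
```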

$S^3$: Increasing GPU Utilization during Generative Inference for Higher Throughput
Yunho Jin Chun-Feng Wu David Brooks Gu-Yeon Wei



Research question: Generating text with a large language model (LLM) consumes massive amounts of memory, especially the key/value (KV) cache that holds information about previous tokens in the sequence.
Motivation: Because current LLM serving frameworks do not know the output sequence length, they reserve KV-cache memory for the maximum sequence length, which restricts batch sizes and lowers GPU utilization and throughput.
Method: The paper proposes $S^3$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions.
Results: The method achieves 6.49x throughput over systems that assume the worst case for the output sequence length.

Generating text with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in current LLM serving frameworks, which reserve memory for the maximum sequence length in the KV cache to guarantee generating a complete sequence, as they do not know the output sequence length. This restricts us to smaller batch sizes, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem. To this end, we propose $S^3$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49× throughput over those systems that assume the worst case for the output sequence length.
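The scheduling benefit shows up even in a toy batcher: reserving KV memory for the predicted length, instead of the maximum length, packs more queries into each batch. A hypothetical sketch (the actual $S^3$ scheduler also handles mispredictions, e.g. by preempting and re-queuing over-length queries):

```python
def pack_batches(queries, pred_len, kv_mem_per_token, mem_budget):
    """Greedily pack queries into batches, reserving KV-cache memory for
    the predicted output length rather than the maximum sequence length."""
    batches, cur, used = [], [], 0.0
    for q, n in sorted(zip(queries, pred_len), key=lambda t: t[1]):
        need = n * kv_mem_per_token
        if cur and used + need > mem_budget:
            batches.append(cur)          # close the current batch
            cur, used = [], 0.0
        cur.append(q)
        used += need
    if cur:
        batches.append(cur)
    return batches
```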

Practical Differentially Private Hyperparameter Tuning with Subsampling
Antti Koskela Tejas Kulkarni



Research question: How to reduce the privacy leakage and computational cost of hyperparameter tuning for differentially private (DP) machine learning algorithms.
Motivation: Tuning hyperparameters on sensitive data currently leaks private information via the hyperparameter values, typically increases the privacy parameter epsilon considerably, and carries a large computational burden.
Method: The paper proposes using only a random subset of the sensitive data for hyperparameter tuning and appropriately extrapolating the optimal values to the larger dataset.
Results: A Rényi differential privacy analysis is carried out, and experiments show the method consistently achieves a better privacy-utility trade-off than the baseline of Papernot and Steinke.

Tuning the hyperparameters of differentially private (DP) machine learning (ML) algorithms often requires use of sensitive data and this may leak private information via hyperparameter values. Recently, Papernot and Steinke (2022) proposed a certain class of DP hyperparameter tuning algorithms, where the number of random search samples is randomized. Commonly, these algorithms still considerably increase the DP privacy parameter $\varepsilon$ over non-tuned DP ML model training and can be computationally heavy as evaluating each hyperparameter candidate requires a new training run. We focus on lowering both the DP bounds and the compute cost of these methods by using only a random subset of the sensitive data for the hyperparameter tuning and by appropriately extrapolating the optimal values to a larger dataset. We carry out a Rényi differential privacy analysis for the proposed method and experimentally show that it consistently leads to better privacy-utility trade-off than the baseline method by Papernot and Steinke.

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
LILI YU Daniel Simig Colin Flaherty Armen Aghajanyan Luke Zettlemoyer Mike Lewis



Research question: Existing autoregressive Transformer models scale poorly to long sequences such as high-resolution images, podcasts, code, or books.
Motivation: The paper proposes Megabyte, a multi-scale decoder architecture enabling end-to-end differentiable modeling of sequences of over one million bytes.
Method: Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches, enabling sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding.
Results: Experiments show that Megabyte allows byte-level models to compete with subword models on long-context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. These results establish the viability of tokenization-free autoregressive sequence modeling.

Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding---unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
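At the shape level, the decomposition is: embed bytes, concatenate each patch into one token, run a causal global model over patches, then let a local model predict the bytes inside each patch given the global context. A hypothetical PyTorch sketch, where embed, global_model, local_model, and head are placeholder modules (the paper's offsets for strict autoregressive shifting are elided):

```python
import torch

def megabyte_forward(byte_ids, P, embed, global_model, local_model, head):
    """Shape-level sketch of the patch decomposition.

    byte_ids: (B, T) with T divisible by the patch size P.
    """
    B, T = byte_ids.shape
    h = embed(byte_ids)                                  # (B, T, D)
    D = h.shape[-1]
    patches = h.view(B, T // P, P * D)                   # one token per patch
    ctx = global_model(patches)                          # causal over patches
    ctx = ctx.unsqueeze(2).expand(B, T // P, P, ctx.shape[-1])
    local_in = torch.cat([h.view(B, T // P, P, D), ctx], dim=-1)
    return head(local_model(local_in))                   # per-byte logits
```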

Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
Stefano Massaroli Michael Poli Daniel Y Fu Hermann Kumbong Rom Nishijima Parnichkun David W. Romero Aman Timalsina Quinn McIntyre Beidi Chen Atri Rudra Ce Zhang Christopher Re Stefano Ermon Yoshua Bengio



Research question: How to reduce the compute and memory cost of pre-trained long convolution architectures on generative tasks.
Motivation: Existing long convolution sequence models require a full pass over the input sequence for each generated token during auto-regressive inference, incurring significant compute and memory cost.
Method: Extract a low-dimensional linear state-space model from each convolution layer, building on rational interpolation and model-order reduction techniques, to reduce the per-token compute and memory cost; additionally, weight-tying filters across channels into heads improves pre-training quality and reduces the number of filters to distill.
Results: The resulting 1.3B-parameter model achieves 10x higher throughput than Transformers and 1.5x higher than Hyena, with no loss in quality after distillation.

Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads -- naively requiring a full pass (or caching of activations) over the input sequence for each generated token -- similarly to attention-based models. In this paper, we seek to enable $\mathcal O(1)$ compute and memory cost per token in any pre-trained long convolution architecture to reduce memory footprint and increase throughput during generation. Concretely, our methods consist in extracting low-dimensional linear state-space models from each convolution layer, building upon rational interpolation and model-order reduction techniques. We further introduce architectural improvements to convolution-based layers such as Hyena: by weight-tying the filters across channels into heads, we achieve higher pre-training quality and reduce the number of filters to be distilled. The resulting model achieves 10x higher throughput than Transformers and 1.5x higher than Hyena at 1.3B parameters, without any loss in quality after distillation.

MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
Jacob Portes Alexander R Trott Sam Havens DANIEL KING Abhinav Venigalla Moin Nadeem Nikhil Sardana Daya Khudia Jonathan Frankle



Research question: How to optimize BERT pretraining to lower training cost and improve efficiency.
Motivation: Although BERT models are widely used in NLP research, many researchers do not pretrain their own BERTs from scratch because of the high cost of training.
Method: The paper introduces MosaicBERT, a BERT-style encoder architecture and training recipe empirically optimized for fast pretraining. The architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput.
Results: Pretrained from scratch on the C4 dataset, the base model reaches an average GLUE score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. MosaicBERT base and large are consistently Pareto optimal compared to competitive BERT base and large models. This empirical speedup lets researchers and engineers pretrain custom BERT-style models at low cost instead of fine-tuning existing generic ones.

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and vocabulary size optimized for GPU throughput, in addition to best-practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open source our model weights and code.

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Zichang Liu Aditya Desai Fangshuo Liao Weitao Wang Victor Xie Zhaozhuo Xu Anastasios Kyrillidis Anshumali Shrivastava



Research question: Deploying large language models requires significant memory, and a crucial bottleneck is the size of the key-value embeddings stored during generation (the KV cache).
Motivation: The enormous KV cache constrains the inference batch size, which is crucial for high-throughput inference workloads. Scissorhands is therefore proposed to manage the KV cache by keeping only influential tokens.
Method: Scissorhands builds on the persistence-of-importance hypothesis derived from observed attention scores: only pivotal tokens that had a substantial influence at one step will significantly influence future generations. The system manages the KV cache by storing pivotal tokens with higher probability.
Results: Experiments show that Scissorhands reduces the inference memory usage of the KV cache by up to 5x without compromising model quality, and achieves up to 20x compression when combined with the 4-bit quantization commonly used for model weights.

Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for the deployment stems from the context window. It is commonly recognized that model weights are memory hungry; however, the size of key-value embedding stored during the generation process (KV cache) can easily surpass the model size. The enormous size of the KV cache puts constraints on the inference batch size, which is crucial for high throughput inference workload. Inspired by an interesting observation of the attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. Based on our empirical verification and theoretical analysis around this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5$\times$ without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20$\times$ compression.
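Under the persistence-of-importance hypothesis, eviction reduces to keeping the tokens with the largest accumulated attention once the budget is exceeded. A minimal sketch (the paper's system tracks importance over a recent history window and keeps the budget fixed throughout generation):

```python
import numpy as np

def compress_kv(keys, values, attn_history, budget):
    """Keep only the cached tokens whose accumulated attention is largest.

    keys, values: (T, d) cached tensors; attn_history: (T,) summed
    attention each cached token has received over recent decoding steps.
    """
    if len(keys) <= budget:
        return keys, values, attn_history
    keep = np.argsort(attn_history)[-budget:]  # most pivotal tokens
    keep.sort()                                # preserve positional order
    return keys[keep], values[keep], attn_history[keep]
```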

Evolutionary Neural Architecture Search for Transformer in Knowledge Tracing
Shangshang Yang Xiaoshan Yu Ye Tian Xueming Yan Haiping Ma Xingyi Zhang



Research question: Existing knowledge tracing models fall short in feature fusion and global context modelling, failing to accurately capture students' knowledge states and forgetting behavior.
Motivation: To address these issues, this paper augments the Transformer with convolution operations and uses evolutionary neural architecture search to automatically select input features and determine where to apply which operation.
Method: Convolution operations enhance the Transformer's local context modelling of students' forgetting behavior, while an evolutionary algorithm explores the search space to find the best architecture, with a search space reduction strategy to accelerate convergence.
Results: Experimental results on the two largest education datasets demonstrate that the discovered architecture effectively improves knowledge tracing accuracy.

Knowledge tracing (KT) aims to trace students' knowledge states by predicting whether students answer correctly on exercises. Despite the excellent performance of existing Transformer-based KT approaches, they are criticized for the manually selected input features for fusion and the defect of single global context modelling to directly capture students' forgetting behavior in KT, when the related records are distant from the current record in terms of time. To address the issues, this paper first considers adding convolution operations to the Transformer to enhance its local context modelling ability used for students' forgetting behavior, then proposes an evolutionary neural architecture search approach to automate the input feature selection and automatically determine where to apply which operation for achieving the balancing of the local/global context modelling. In the search space, the original global path containing the attention module in Transformer is replaced with the sum of a global path and a local path that could contain different convolutions, and the selection of input features is also considered. To search the best architecture, we employ an effective evolutionary algorithm to explore the search space and also suggest a search space reduction strategy to accelerate the convergence of the algorithm. Experimental results on the two largest and most challenging education datasets demonstrate the effectiveness of the architecture found by the proposed approach.

Searching for Optimal Per-Coordinate Step-sizes with Multidimensional Backtracking
Frederik Kunstner Victor S. Portella Mark Schmidt Nick Harvey



Research question: How to automatically tune step-sizes in smooth optimization.
Motivation: None of the existing methods for tuning per-coordinate step-sizes are provably competitive with the theoretically optimal per-coordinate step-sizes; better diagonal preconditioners are needed.
Method: The paper proposes multidimensional backtracking, which extends the backtracking line-search to find good diagonal preconditioners for smooth convex problems. The gradient with respect to the step-sizes (the hyper-gradient) yields separating hyperplanes, enabling a search with cutting-plane methods.
Results: Multidimensional backtracking is provably competitive with the best diagonal preconditioner, requires no manual tuning, and is computationally efficient.

The backtracking line-search is an effective technique to automatically tune the step-size in smooth optimization. It guarantees similar performance to using the theoretically optimal step-size. Many approaches have been developed to instead tune per-coordinate step-sizes, also known as diagonal preconditioners, but none of the existing methods are provably competitive with the optimal per-coordinate step-sizes. We propose multidimensional backtracking, an extension of the backtracking line-search to find good diagonal preconditioners for smooth convex problems. Our key insight is that the gradient with respect to the step-sizes, also known as hyper-gradients, yields separating hyperplanes that let us search for good preconditioners using cutting-plane methods. As black-box cutting-plane approaches like the ellipsoid method are computationally prohibitive, we develop an efficient algorithm tailored to our setting. Multidimensional backtracking is provably competitive with the best diagonal preconditioner and requires no manual tuning.
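The mechanism can be illustrated in two parts: an Armijo-style sufficient-decrease test for a candidate diagonal preconditioner and, on failure, the hyper-gradient direction that defines a halfspace of step-sizes to discard. A simplified sketch with sign conventions and the cutting-plane set-shrinking logic omitted (the paper's algorithm is considerably more careful):

```python
import numpy as np

def check_preconditioner(f, grad, x, p):
    """Test a diagonal preconditioner p (one step-size per coordinate).

    Returns (ok, h): ok is the sufficient-decrease verdict; on failure,
    h is the hyper-gradient of f(x - p*g) w.r.t. p, up to sign, which
    supplies a separating hyperplane over candidate preconditioners.
    """
    g = grad(x)
    step = p * g
    ok = f(x - step) <= f(x) - 0.5 * g @ step   # Armijo-style condition
    h = None if ok else grad(x - step) * g      # separating direction
    return ok, h
```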

ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
Haoran You Huihong Shi Yipin Guo Yingyan Celine Lin



Research question: Transformers excel at vision tasks, but the dense multiplications in their attention mechanisms and multi-layer perceptrons make training and inference costly.
Motivation: To address this, the paper proposes ShiftAddViT, which reparameterizes pre-trained ViTs with a mixture of multiplication primitives (e.g., bitwise shifts and additions) to achieve end-to-end GPU inference speedups without training from scratch.
Method: All MatMuls among queries, keys, and values are reparameterized with additive kernels after mapping queries and keys to binary codes in Hamming space; the remaining MLPs or linear layers are reparameterized with shift kernels. TVM is used to implement and optimize these custom kernels for practical GPU deployment.
Results: The reparameterized attention retains accuracy, but applying the same treatment to MLPs causes accuracy drops. To marry the best of both worlds, a new mixture-of-experts (MoE) framework treats multiplication and its primitives (e.g., multiplication and shift) as experts, with a latency-aware load-balancing loss that trains a generic router to assign dynamic numbers of input tokens to experts according to their latency. Experiments on 2D/3D Transformer vision tasks show up to 5.18x latency reductions on GPUs and 42.9% energy savings while maintaining accuracy comparable to original or efficient ViTs.

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed $\textbf{ShiftAddViT}$, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all $\texttt{MatMuls}$ among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization on (quadratic or linear) attention maintains model accuracy, while inevitably leading to accuracy drops when being applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router for assigning a dynamic amount of input tokens to different experts according to their latency. In principle, the faster the experts run, the more input tokens they are assigned. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to $\textbf{5.18$\times$}$ latency reductions on GPUs and $\textbf{42.9}$% energy savings, while maintaining a comparable accuracy as original or efficient ViTs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddViT.

Sparse Modular Activation for Efficient Sequence Modeling
Liliang Ren Yang Liu Shuohang Wang Yichong Xu Chenguang Zhu ChengXiang Zhai



Research question: Current hybrid models perform well on sequence modeling tasks, but existing approaches apply attention modules statically and uniformly to all elements of the input sequence, yielding sub-optimal quality-efficiency trade-offs.
Motivation: To address this, the paper proposes Sparse Modular Activation (SMA), a mechanism that lets neural networks sparsely and dynamically activate sub-modules in a differentiable manner.
Method: A new neural architecture, SeqBoat, employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned by an SSM. By constraining the GAU to local attention over the activated inputs, SeqBoat achieves linear inference complexity with a theoretically infinite attention span.
Results: Experiments show that SeqBoat sets new state-of-the-art results among linear-complexity hybrid models on long sequence modeling, speech classification, and language modeling, and its learned sparse activation patterns reveal how much attention each task needs.

Recent hybrid models combining Linear State Space Models (SSMs) with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. However, current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. To address this limitation, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption of neural networks at both training and inference stages. To validate the effectiveness of SMA on sequence modeling, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to only conduct local attention on the activated inputs, SeqBoat can achieve linear inference complexity with theoretically infinite attention span, and provide substantially better quality-efficiency trade-off than the chunking-based models. With experiments on a wide range of tasks, including long sequence modeling, speech classification and language modeling, SeqBoat brings new state-of-the-art results among hybrid models with linear complexity, and reveals the amount of attention needed for each task through the learned sparse activation patterns. Our code is publicly available at https://github.com/renll/SeqBoat.

QuadAttac$K$: A Quadratic Programming Approach to Learning Ordered Top-$K$ Adversarial Attacks
Thomas Paniagua Ryan Grainger Tianfu Wu



Research question: The adversarial vulnerability of deep neural networks.
Motivation: Existing adversarial attack methods mostly target top-1 attacks, whereas this paper performs significantly more aggressive ordered top-K attacks.
Method: A novel and rigorous quadratic programming (QP) method, QuadAttacK, learns ordered top-K attacks at low computational cost.
Results: Tested on ImageNet-1k classification with ResNet-50, DenseNet-121, and Vision Transformers, it pushes successful ordered top-K attacks from K=10 up to K=20 while retaining the K=1 attack success rate.

The adversarial vulnerability of Deep Neural Networks (DNNs) has been well-known and widely concerned, often under the context of learning top-$1$ attacks (e.g., fooling a DNN to classify a cat image as dog). This paper shows that the concern is much more serious by learning significantly more aggressive ordered top-$K$ clear-box targeted attacks proposed in~\citep{zhang2020learning}. We propose a novel and rigorous quadratic programming (QP) method of learning ordered top-$K$ attacks with low computing cost, dubbed as \textbf{QuadAttac$K$}. Our QuadAttac$K$ directly solves the QP to satisfy the attack constraint in the feature embedding space (i.e., the input space to the final linear classifier), which thus exploits the semantics of the feature embedding space (i.e., the principle of class coherence). With the optimized feature embedding vector perturbation, it then computes the adversarial perturbation in the data space via the vanilla one-step back-propagation. In experiments, the proposed QuadAttac$K$ is tested in the ImageNet-1k classification using ResNet-50, DenseNet-121, and Vision Transformers (ViT-B and DEiT-S). It successfully pushes the boundary of successful ordered top-$K$ attacks from $K=10$ up to $K=20$ at a cheap budget ($1\times 60$) and further improves attack success rates for $K=5$ for all tested models, while retaining the performance for $K=1$.

Dynamic Sparsity Is Channel-Level Sparsity Learner
Lu Yin Gen Li Meng Fang Li Shen Tianjin Huang Zhangyang Wang Vlado Menkovski Xiaolong Ma Mykola Pechenizkiy Shiwei Liu



Research question: How to translate unstructured dynamic sparsity into GPU-friendly channel-level sparsity for better training efficiency and inference speed.
Motivation: Existing dynamic sparse training methods target unstructured sparsity patterns that receive limited support on common hardware, hindering practical use.
Method: The paper proposes Channel-aware dynamic sparse (Chase), which progressively identifies and removes sparse channels during a single end-to-end training process, translating unstructured sparsity into channel-level sparsity.
Results: Experiments show that Chase achieves a 1.7x inference throughput speedup with ResNet-50 on ImageNet on common GPUs without compromising accuracy.

Sparse training has received an upsurging interest in machine learning due to its tantalizing saving potential for both the entire training process as well as the inference. Dynamic sparse training (DST) as a leading approach can train deep neural networks at high sparsity from scratch to match the performance of their dense counterparts. However, most if not all DST prior arts demonstrate their effectiveness on unstructured sparsity with highly irregular sparse patterns, which receives limited support in common hardware. This limitation hinders the usage of DST in practice. In this paper, we propose Channel-aware dynamic sparse (Chase), that for the first time seamlessly translates the promise of unstructured dynamic sparsity to GPU-friendly channel-level sparsity (not fine-grained N:M or group sparsity) during one end-to-end training process, without any ad-hoc operations. The resulting small sparse networks can be directly accelerated by commodity hardware, without using any particularly sparsity-aware hardware accelerators. This appealing outcome is partially motivated by a hidden phenomenon of dynamic sparsity: off-the-shelf unstructured DST implicitly involves biased parameter reallocation across channels, with a large fraction of channels (up to 60%) being sparser than others. By progressively identifying and removing these channels during training, our approach transfers unstructured sparsity to channel-wise sparsity. Our experimental results demonstrate that Chase achieves 1.7x inference throughput speedup on common GPU devices without compromising accuracy with ResNet-50 on ImageNet. We release our code in https://github.com/luuyin/chase.
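The enabling observation is that unstructured DST leaves some channels much sparser than others, so channel removal can piggyback on measured per-channel density. A small sketch of that measurement (when and how many channels to remove is the paper's contribution and is not shown):

```python
import numpy as np

def sparsest_channels(weight_mask, frac=0.1):
    """Find the output channels of a conv mask that DST has left sparsest.

    weight_mask: (out_ch, in_ch, kh, kw) binary mask; returns the indices
    of the frac lowest-density channels, candidates for removal.
    """
    density = weight_mask.reshape(weight_mask.shape[0], -1).mean(axis=1)
    k = max(1, int(frac * len(density)))
    return np.argsort(density)[:k]
```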

Guiding The Last Layer in Federated Learning with Pre-Trained Models
Gwen Legate Nicolas Bernier Lucas Caccia Edouard Oyallon Eugene Belilovsky



Research question: This paper explores the use of pre-trained models in federated learning and extends the problem to a set of computer vision transfer learning tasks.
Motivation: Existing federated learning approaches ignore the vast body of efficient transfer learning literature from the centralized learning setting.
Method: The paper first observes that simply fitting a linear classification head is efficient in many cases. It then shows that in the federated setting, fitting a classifier with Nearest Class Means (NCM) can be done exactly and far more efficiently than existing proposals while obtaining strong performance. Finally, a two-stage approach of obtaining the classifier and then fine-tuning the model yields rapid convergence and improved generalization in the federated setting.
Results: The method shows potential to reduce communication and compute costs while achieving better model performance.

Federated Learning (FL) is an emerging paradigm that allows a model to be trained across a number of participants without sharing data. Recent works have begun to consider the effects of using pre-trained models as an initialization point for existing FL algorithms; however, these approaches ignore the vast body of efficient transfer learning literature from the centralized learning setting. Here we revisit the problem of FL from a pre-trained model considered in prior work and expand it to a set of computer vision transfer learning problems. We first observe that simply fitting a linear classification head can be efficient in many cases. We then show that in the FL setting, fitting a classifier using the Nearest Class Means (NCM) can be done exactly and orders of magnitude more efficiently than existing proposals, while obtaining strong performance. Finally, we demonstrate that using a two-stage approach of obtaining the classifier and then fine-tuning the model can yield rapid convergence and improved generalization in the federated setting. We demonstrate the potential our method has to reduce communication and compute costs while achieving better model performance.
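The NCM step is exact in the federated setting because class means decompose into per-class sums and counts, which clients can send (or securely aggregate) without sharing raw data. A minimal sketch over frozen backbone features:

```python
import numpy as np

def federated_ncm(client_feats, client_labels, num_classes):
    """Fit a classifier head from Nearest Class Means, federated-style.

    client_feats: list of (n_i, d) feature arrays, one per client;
    client_labels: list of matching (n_i,) label arrays. Each client
    contributes only per-class sums and counts.
    """
    d = client_feats[0].shape[1]
    sums = np.zeros((num_classes, d))
    counts = np.zeros(num_classes)
    for feats, labels in zip(client_feats, client_labels):
        for c in range(num_classes):
            sums[c] += feats[labels == c].sum(axis=0)
            counts[c] += (labels == c).sum()
    means = sums / np.maximum(counts, 1)[:, None]

    def predict(x):
        # classify by cosine similarity to the aggregated class means
        m = means / np.linalg.norm(means, axis=1, keepdims=True)
        return (x @ m.T).argmax(axis=1)
    return predict
```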

The Grand Illusion: The Myth of Software Portability and Implications for ML Progress.
Fraser Mince Dzung Dinh Jonas Kgomo Neil Thompson Sara Hooker



Research question: This paper quantifies the portability of mainstream machine learning software frameworks across hardware types.
Motivation: The growing specialization of ML hardware and software limits the ability to explore different systems and may impede innovation, yet this portability problem has not been quantified.
Method: A large-scale study of mainstream ML frameworks across different hardware types, evaluating the portability of key functions and the performance degradation when they are ported.
Results: Frameworks can lose more than 40% of their key functions when ported to other hardware, and even portable functions can suffer extreme slowdowns. Specialization thus incurs an exploration cost that can impede innovation in machine learning research.

Pushing the boundaries of machine learning often requires exploring different hardware and software combinations. However, this ability to experiment with different systems can be at odds with the drive for efficiency, which has produced increasingly specialized AI hardware and incentivized consolidation around a narrow set of ML frameworks. Exploratory research can be further restricted if software and hardware are co-evolving, making it even harder to stray away from a given tooling stack. While this friction increasingly impacts the rate of innovation in machine learning, to our knowledge the lack of portability in tooling has not been quantified. In this work we ask: How portable are popular ML software frameworks? We conduct a large scale study of the portability of mainstream ML frameworks across different hardware types. Our findings paint an uncomfortable picture -- frameworks can lose more than 40% of their key functions when ported to other hardware. Worse, even when functions are portable, the slowdown in their performance can be extreme. Collectively, our results reveal how costly straying from a narrow set of hardware-software combinations can be - and thus how specialization incurs an exploration cost that can impede innovation in machine learning research.

StreamNet: Memory-Efficient Streaming Tiny Deep Learning Inference on the Microcontroller
Hong Sheng Zheng Yu-Yuan Liu Chen-Fong Hsu Tsung Tai Yeh



Research question: How to deploy TinyML models on resource-constrained microcontroller units (MCUs).
Motivation: MCU memory constraints, such as small flash memory, a tight SRAM budget, and slow CPU performance, make deploying TinyML models on MCUs challenging.
Method: The paper designs StreamNet, which uses stream buffers to eliminate the redundant computation of patch-based inference. StreamNet supports 1D and 2D streaming processing and provides a parameter selection algorithm that automatically improves the performance of patch-based inference with minimal demands on the MCU's SRAM.
Results: Across 10 TinyML models, StreamNet-2D achieves a geometric mean speedup of 7.3x and saves 81% of MACs over state-of-the-art patch-based inference.

With the emerging Tiny Machine Learning (TinyML) inference applications, there is growing interest in deploying TinyML models on the low-power Microcontroller Unit (MCU). However, deploying TinyML models on MCUs reveals several challenges due to the MCU's resource constraints, such as small flash memory, tight SRAM memory budget, and slow CPU performance. Unlike typical layer-wise inference, patch-based inference reduces the peak usage of SRAM memory on MCUs by saving small patches rather than the entire tensor in the SRAM memory. However, the processing of patch-based inference tremendously increases the amount of MACs against the layer-wise method. Thus, this notorious computational overhead makes patch-based inference undesirable on MCUs. This work designs StreamNet, which employs the stream buffer to eliminate the redundant computation of patch-based inference. StreamNet uses 1D and 2D streaming processing and provides a parameter selection algorithm that automatically improves the performance of patch-based inference with minimal requirements on the MCU's SRAM memory space. In 10 TinyML models, StreamNet-2D achieves a geometric mean of 7.3X speedup and saves 81\% of MACs over the state-of-the-art patch-based inference.

Block-State Transformers
Jonathan Pilault Mahan Fathi Orhan Firat Christopher Pal Pierre-Luc Bacon Ross Goroshin



Research question: How to combine Transformers and state space models to improve language modeling performance and generalize to longer sequences.
Motivation: Although state space models excel at tasks requiring long-range dependencies, they still lag Transformers on language modeling.
Method: The paper proposes a hybrid layer, the Block-State Transformer, which internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences.
Results: Experiments show the model outperforms similar Transformer architectures on language modeling perplexity and generalizes to longer sequences. Moreover, the Block-State Transformer is more than ten times faster at the layer level than the Block-Recurrent Transformer when model parallelization is employed.

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks, in vision and audio; however, SSMs still lag Transformer performance in Language Modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (*BST*), that internally combines an SSM sublayer for long-range contextualization, and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely *parallelizable*, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than *tenfold* increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.

Binarized Neural Machine Translation
Yichi Zhang Ankush Garg Yuan Cao Lukasz Lew Behrooz Ghorbani Zhiru Zhang Orhan Firat



Research question: How to scale language models using low-bitwidth quantization.
Motivation: The rapid scaling of language models is driving research into low-bitwidth quantization.
Method: Propose BMT, the first binarization technique for Transformers applied to machine translation. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations; specifically, BMT adds extra LayerNorms and residual connections to improve binarization quality.
Results: On the WMT dataset, a one-bit weight-only Transformer matches the quality of a float one at one sixteenth the size; one-bit activations incur quality drops that the proposed architectural changes mitigate. A scaling-law study on production-scale translation data shows that one-bit weight Transformers scale and generalize well both in-domain and out-of-domain. The JAX/Flax implementation will be open-sourced.

The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16$\times$ smaller in size. One-bit activations incur varying degrees of quality drop, but mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. Implementation in JAX/Flax will be open sourced.

FedNAR: Federated Optimization with Normalized Annealing Regularization
Junbo Li Ang Li Chong Tian Qirong Ho Eric Xing Hongyi Wang



Research question: How to choose weight decay in federated learning, and how it affects the convergence of existing FL algorithms.
Motivation: Weight decay is a standard technique for improving generalization in modern deep network optimization and is widely adopted in federated learning to prevent overfitting on local clients. However, weight decay can introduce an optimization goal that differs from the global objective, an effect further amplified in FL by multiple local updates and heterogeneous data distributions.
Method: Propose Federated optimization with Normalized Annealing Regularization (FedNAR), which regulates the magnitude of each update by co-clipping the gradient and the weight decay, and can be seamlessly integrated into any existing FL algorithm.
Results: Experiments show that integrating FedNAR into existing FL algorithms accelerates convergence and improves model accuracy. FedNAR is also resilient to hyperparameter choices: it self-adjusts the weight decay when the initial specification is suboptimal, whereas the accuracy of traditional FL algorithms declines markedly.

Weight decay is a standard technique to improve generalization performance in modern deep neural network optimization, and is also widely adopted in federated learning (FL) to prevent overfitting in local clients. In this paper, we first explore the choices of weight decay and identify that the weight decay value appreciably influences the convergence of existing FL algorithms. While preventing overfitting is crucial, weight decay can introduce a different optimization goal towards the global objective, which is further amplified in FL due to multiple local updates and heterogeneous data distribution. To address this challenge, we develop {\it Federated optimization with Normalized Annealing Regularization} (FedNAR), a simple yet effective and versatile algorithmic plug-in that can be seamlessly integrated into any existing FL algorithms. Essentially, we regulate the magnitude of each update by performing co-clipping of the gradient and weight decay. We provide a comprehensive theoretical analysis of FedNAR's convergence rate and conduct extensive experiments on both vision and language datasets with different backbone federated optimization algorithms. Our experimental results consistently demonstrate that incorporating FedNAR into existing FL algorithms leads to accelerated convergence and heightened model accuracy. Moreover, FedNAR exhibits resilience in the face of various hyperparameter configurations. Specifically, FedNAR has the ability to self-adjust the weight decay when the initial specification is not optimal, while the accuracy of traditional FL algorithms would markedly decline. Our codes are released at \href{https://anonymous.4open.science/r/fednar-BE8F}{https://anonymous.4open.science/r/fednar-BE8F}.
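
As an illustration of the co-clipping idea, here is a minimal PyTorch-style sketch; the function name `fednar_step` and the fixed clipping threshold are our own placeholders (the paper additionally anneals the bound over rounds), not the authors' code.

```python
import torch

def fednar_step(params, lr=0.01, weight_decay=1e-4, max_norm=1.0):
    """One local update with co-clipped gradient + weight decay: the two
    terms are clipped *together*, so the total update magnitude stays
    bounded no matter how the weight decay is configured."""
    for p in params:
        if p.grad is None:
            continue
        update = p.grad + weight_decay * p.detach()   # regularized gradient
        norm = update.norm()
        if norm > max_norm:                           # joint ("co-") clipping
            update = update * (max_norm / norm)
        p.data.add_(update, alpha=-lr)

model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).pow(2).mean().backward()
fednar_step(model.parameters())
```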

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
AJAY KUMAR JAISWAL Shiwei Liu Tianlong Chen Zhangyang Wang



Research question: A comprehensive study of the induced sparse patterns in large pre-trained vision and language transformers.
Motivation: With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their practicality for sparsifying such models, due to the growing computation and memory bottleneck of the repeated train-prune-retrain routine of iterative magnitude pruning (IMP).
Method: We directly remove the smallest-magnitude weights in one shot and propose the notion of essential sparsity, defined by a sharp dropping point beyond which performance degrades rapidly as sparsity rises. We also observe abrupt sparsification during BERT pre-training.
Results: Our observations indicate that BERT trained on more pre-training data condenses knowledge into comparatively fewer parameters. Moreover, self-supervised learning (SSL) objectives trigger stronger emergent sparsification than supervised learning (SL) objectives.

Large pre-trained transformers are the $\textit{show-stealers}$ in modern-day deep learning, and it becomes crucial to comprehend the parsimonious patterns that exist within them as they grow in scale. With exploding parameter counts, Lottery Ticket Hypothesis (LTH) and its variants, have lost their pragmatism in sparsifying them due to high computation and memory bottleneck of repetitive $\textit{train-prune-retrain}$ routine of iterative magnitude pruning (IMP) which worsens with increasing model size. In this paper, we comprehensively study $\textit{induced sparse patterns}$ across multiple large pre-trained vision and language transformers. We propose the existence of $\textbf{essential sparsity}$, defined with a $\textbf{sharp dropping point}$ beyond which the performance declines much faster w.r.t the rise of sparsity level, when we directly remove weights with the smallest magnitudes in $\textbf{one-shot}$. We also present an intriguing emerging phenomenon of $\textbf{abrupt sparsification}$ during the pre-training of BERT, i.e., BERT suddenly becomes heavily sparse in pre-training after certain iterations. Moreover, our observations also indicate a $\textbf{counter-intuitive}$ finding that BERT trained with a larger amount of pre-training data tends to have a better ability to condense knowledge in comparatively fewer parameters. Lastly, we investigate the effect of the pre-training loss on essential sparsity and discover that self-supervised learning (SSL) objectives trigger stronger emergent sparsification properties than supervised learning (SL). All our codes will be publicly available.
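
A minimal NumPy sketch of the one-shot magnitude pruning operation the study is built on; the function name and sparsity levels are illustrative placeholders, and real experiments would of course prune trained model weights rather than random matrices.

```python
import numpy as np

def one_shot_magnitude_prune(w, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights, in one
    shot (no train-prune-retrain cycles)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]   # k-th smallest
    return np.where(np.abs(w) > thresh, w, 0.0)

# Sweeping the sparsity level and re-evaluating the model would trace the
# curve whose sharp dropping point defines essential sparsity.
w = np.random.randn(256, 256)
for s in (0.5, 0.8, 0.95):
    print(s, np.count_nonzero(one_shot_magnitude_prune(w, s)) / w.size)
```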

Distributed Inference and Fine-tuning of Large Language Models Over The Internet
Alexander Borzunov Max Ryabinin Artem Chumachenko Dmitry Baranchuk Tim Dettmers Younes Belkada Pavel Samygin Colin Raffel



Research question: Large language models (LLMs) are useful for many NLP tasks, but growing model sizes demand high-end hardware that most researchers lack. This work investigates cost-efficient methods for LLM inference and fine-tuning, comparing local and distributed strategies.
Motivation: Existing 50B+ language models require high-end hardware that is inaccessible to most researchers. We aim to run large models efficiently by pooling the idle compute of multiple research groups and volunteers.
Method: We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize total system throughput. We showcase these algorithms in Petals, a system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading.
Results: Experiments show that our system performs well both in simulated conditions and in a real-world setup spanning two continents.

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLM efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals — a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to $10\times$ faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.

Mechanic: A Learning Rate Tuner
Ashok Cutkosky Aaron Defazio Harsh Mehta



Research question: This paper proposes Mechanic, a technique that automatically tunes the learning rate scale factor and schedule.
Motivation: Recent theoretical reductions accomplish a similar goal in online convex optimization but lack a practical realization.
Method: Rigorously evaluate Mechanic across a range of deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
Results: Experiments show that, depending on the problem, Mechanic comes very close to, matches, or even outperforms manual tuning of learning rates.

We introduce a technique for tuning the learning rate scale factor of any base optimization algorithm and schedule automatically, which we call Mechanic. Our method provides a practical realization of recent theoretical reductions for accomplishing a similar goal in online convex optimization. We rigorously evaluate Mechanic on a range of large scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms. These experiments demonstrate that depending on the problem, Mechanic either comes very close to, matches or even improves upon manual tuning of learning rates.

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher Re Clark Barrett Zhangyang Wang Beidi Chen



Research question: For long-content generation tasks, large language models store a large amount of transient state (the KV cache), leading to excessive memory usage.
Motivation: This paper proposes a new KV cache implementation that lowers memory usage by reducing the transient state kept in GPU memory.
Method: The authors observe that a small set of tokens (called Heavy Hitters) contributes most of the value when computing attention scores. Based on this, they propose the Heavy Hitter Oracle (H2O), a KV cache eviction policy that dynamically retains a balance of recent and Heavy Hitter tokens. They formulate KV cache eviction as a dynamic submodular problem and prove a theoretical guarantee for the new algorithm.
Results: Accuracy is validated on a wide range of tasks with OPT, LLaMA, and GPT-NeoX. With 20% Heavy Hitters, H2O improves throughput over DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29x, 29x, and 3x on OPT-6.7B and OPT-30B, and reduces latency by up to 1.9x at the same batch size.

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the $\mathsf{KV}$ $\mathsf{cache}$, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the $\mathsf{KV}$ $\mathsf{cache}$ which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters ($\mathsf{H_2}$). Through a comprehensive investigation, we find that ($i$) the emergence of $\mathsf{H_2}$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and ($ii$) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle ($\mathsf{H_2O}$), a $\mathsf{KV}$ $\mathsf{cache}$ eviction policy that dynamically retains a balance of recent and $\mathsf{H_2}$ tokens. We formulate the $\mathsf{KV}$ $\mathsf{cache}$ eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of $\mathsf{H_2O}$ with 20\% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to $29\times$, $29\times$, and $3\times$ on OPT-6.7B and OPT-30B. With the same batch size, $\mathsf{H_2O}$ can reduce the latency by up to $1.9\times$.
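
A simplified sketch of the eviction decision: keep the most recent tokens plus those with the largest accumulated attention mass. The function name and the split between the two budgets are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def h2o_keep(attn_scores, keep_recent, keep_heavy):
    """Return the cached positions to keep: the most recent tokens plus the
    'heavy hitters' with the largest accumulated attention mass."""
    num_cached = attn_scores.shape[1]
    acc = attn_scores.sum(axis=0)                 # accumulated score per token
    recent = set(range(max(0, num_cached - keep_recent), num_cached))
    ranked = [i for i in np.argsort(-acc) if i not in recent]
    return sorted(recent | set(ranked[:keep_heavy]))

scores = np.random.rand(4, 16)                    # (queries so far, cached keys)
print(h2o_keep(scores, keep_recent=4, keep_heavy=4))  # 8 of 16 slots survive
```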

Resetting the Optimizer in Deep RL: An Empirical Study
Kavosh Asadi Rasool Fakoor Shoham Sabach



Research question: This paper addresses the task of approximating the optimal value function in deep reinforcement learning.
Motivation: In deep RL, optimization proceeds by iteratively solving a sequence of problems whose loss function changes per iteration. The common approach uses modern stochastic gradient methods such as Adam, which maintain internal parameters, such as first- and second-moment estimates of the gradient, and update them over time. Because the optimization landscape can change arbitrarily from one iteration to the next, these moment estimates can become contaminated, which calls for a remedy.
Method: We propose a simple idea: reset the optimizer's internal parameters when starting each new iteration. To validate it, we run experiments with various optimizers in conjunction with the Rainbow algorithm.
Results: Experiments show that this simple modification significantly improves the performance of deep RL on the Atari benchmark.

We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain their own internal parameters such as estimates of the first-order and the second-order moments of the gradient, and update them over time. Therefore, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We demonstrate that this can contaminate the moment estimates because the optimization landscape can change arbitrarily from one iteration to the next one. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark.
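
The resetting idea is a one-line change in practice. A minimal PyTorch sketch, with a toy loss standing in for the per-iteration RL objective:

```python
import torch

net = torch.nn.Linear(8, 4)

for iteration in range(3):                 # e.g. each target-network refresh
    # The "reset": a fresh Adam whose first/second moment estimates start
    # from zero, so statistics from the previous loss cannot leak in.
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(10):                    # inner optimization of current loss
        loss = net(torch.randn(32, 8)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```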

(Amplified) Banded Matrix Factorization: A unified approach to private training
Christopher A. Choquette-Choo Arun Ganesh Ryan McKenna Hugh Brendan McMahan J Keith Rush Abhradeep Guha Thakurta Zheng Xu



Research question: How to use matrix factorization mechanisms to improve machine learning performance while preserving privacy.
Motivation: Although matrix factorization mechanisms have substantially improved the privacy-utility-computation trade-off in many scenarios, in both centralized and federated settings there remain cases where MF cannot be easily applied or other algorithms provide better trade-offs.
Method: Construct MF mechanisms with banded matrices (lower-triangular matrices with at most b nonzero bands including the main diagonal), which can surpass prior state-of-the-art algorithms in both federated and centralized settings.
Results: For cross-device federated learning, this approach supports multiple participations with a relaxed device participation schema compatible with practical FL infrastructure. In the centralized setting, we prove that banded matrices enjoy the same privacy amplification as the widely used DP-SGD algorithm while providing strictly better performance in most scenarios.

Matrix factorization (MF) mechanisms for differential privacy (DP) have substantially improved the state-of-the-art in privacy-utility-computation tradeoffs for ML applications in a variety of scenarios, but in both the centralized and federated settings there remain instances where either MF cannot be easily applied, or other algorithms provide better tradeoffs (typically, as $\epsilon$ becomes small). In this work, we show how MF can subsume prior state-of-the-art algorithms in both federated and centralized training settings, across all privacy budgets. The key technique throughout is the construction of MF mechanisms with banded matrices (lower-triangular matrices with at most $\hat{b}$ nonzero bands including the main diagonal). For cross-device federated learning (FL), this enables multiple-participations with a relaxed device participation schema compatible with practical FL infrastructure (as demonstrated by a production deployment). In the centralized setting, we prove that banded matrices enjoy the same privacy amplification results as the ubiquitous DP-SGD algorithm, but can provide strictly better performance in most scenarios---this lets us always at least match DP-SGD, and often outperform it.

Window-Based Distribution Shift Detection for Deep Neural Networks
Guy Bar-Shalom Yonatan Geifman Ran El-Yaniv



Research question: How to detect and assess the prediction quality of deep neural networks in production, in particular under input distributional deviations.
Motivation: Input distributional deviations, whether benign or malicious, can damage prediction quality, so the predictions of deployed deep networks must be monitored and assessed.
Method: Propose a distribution shift detection method for DNNs based on selective prediction principles. The method is derived from a tight coverage generalization bound computed over a sample drawn from the true underlying distribution; based on this bound, the detector continuously monitors the network over a test window and fires an alarm whenever a deviation is detected.
Results: The method performs on par with or better than the state of the art while requiring far less computation time and space. Unlike previous methods, it removes the dependence on the size of the source distribution, making it suitable for real-world applications.

To deploy and operate deep neural models in production, the quality of their predictions, which might be contaminated benignly or manipulated maliciously by input distributional deviations, must be monitored and assessed. Specifically, we study the case of monitoring the healthy operation of a deep neural network (DNN) receiving a stream of data, with the aim of detecting input distributional deviations over which the quality of the network's predictions is potentially damaged. Using selective prediction principles, we propose a distribution deviation detection method for DNNs. The proposed method is derived from a tight coverage generalization bound computed over a sample of instances drawn from the true underlying distribution. Based on this bound, our detector continuously monitors the operation of the network over a test window and fires off an alarm whenever a deviation is detected. Our novel detection method performs on-par or better than the state-of-the-art, while consuming substantially lower computation time (five orders of magnitude reduction) and space complexity. Unlike previous methods, which require at least linear dependence on the size of the source distribution for each detection, rendering them inapplicable to ``Google-Scale'' datasets, our approach eliminates this dependence, making it suitable for real-world applications. Code is available at [https://github.com/BarSGuy/Window-Based-Distribution-Shift-Detection](https://github.com/BarSGuy/Window-Based-Distribution-Shift-Detection).

FAMO: Fast Adaptive Multitask Optimization
Bo Liu Yihao Feng Peter Stone qiang liu



Research question: How to learn multiple different tasks from diverse data via multitask learning (MTL) while avoiding severe under-optimization of certain tasks.
Motivation: In practice, applying gradient descent to the average loss across all tasks can yield poor multitask performance.
Method: Propose Fast Adaptive Multitask Optimization (FAMO), a dynamic weighting method that decreases task losses in a balanced way using O(1) space and time.
Results: Experiments show that FAMO offers significant improvements in space and computational efficiency while matching or exceeding the performance of state-of-the-art gradient manipulation techniques.

One of the grand enduring goals of AI is to create generalist agents that can learn multiple different tasks from diverse data via multitask learning (MTL). However, in practice, applying gradient descent (GD) on the average loss across all tasks may yield poor multitask performance due to severe under-optimization of certain tasks. Previous approaches that manipulate task gradients for a more balanced loss decrease require storing and computing all task gradients ($\mathcal{O}(k)$ space and time where $k$ is the number of tasks), limiting their use in large-scale scenarios. In this work, we introduce Fast Adaptive Multitask Optimization (FAMO), a dynamic weighting method that decreases task losses in a balanced way using $\mathcal{O}(1)$ space and time. We conduct an extensive set of experiments covering multi-task supervised and reinforcement learning problems. Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques while offering significant improvements in space and computational efficiency. Code is available at \url{https://github.com/Cranial-XIX/FAMO}.

One-Pass Distribution Sketch for Measuring Data Heterogeneity in Federated Learning
Zichang Liu Zhaozhuo Xu Benjamin Coleman Anshumali Shrivastava



Research question: How to measure data heterogeneity in federated learning efficiently, especially in high-dimensional spaces.
Motivation: Since data in FL are distributed across clients with diverse distributions, data heterogeneity is inherent; mitigating its negative influence requires measuring it across clients.
Method: Propose a one-pass distribution sketch to represent a client's data distribution. The sketching algorithm requires only a single pass over the client data, saving both time and memory, and we show that the distance between two sketches represents the divergence between their corresponding distributions.
Results: Experiments show that the distribution sketch improves client selection in FL training, and that it provides an effective cold-start solution for newly joined clients with unlabeled data.

Federated learning (FL) is a machine learning paradigm where multiple client devices train models collaboratively without data exchange. The data heterogeneity problem is naturally inherent in FL, since data in different clients follow diverse distributions. To mitigate the negative influence of data heterogeneity, we need to start by measuring it across clients. However, the efficient measurement between distributions is a challenging problem, especially in high dimensionality. In this paper, we propose a one-pass distribution sketch to represent the client data distribution. Our sketching algorithm only requires a single pass of the client data, which is efficient in terms of time and memory. Moreover, we show in both theory and practice that the distance between two distribution sketches represents the divergence between their corresponding distributions. Furthermore, we demonstrate with extensive experiments that our distribution sketch improves the client selection in the FL training. We also showcase that our distribution sketch is an efficient solution to the cold start problem in FL for new clients with unlabeled data.
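
To make the single-pass idea concrete, here is a toy sketch built from a shared random projection; the class, the tanh feature map, and the Euclidean distance as a divergence proxy are all our illustrative assumptions, and the paper's construction (hash-based) differs in detail.

```python
import numpy as np

class OnePassSketch:
    """Toy one-pass distribution sketch: each client folds its data into a
    fixed-size vector in a single pass, and the distance between two
    sketches serves as a proxy for distributional divergence."""
    def __init__(self, dim, sketch_size, seed=0):
        rng = np.random.default_rng(seed)            # shared across clients
        self.proj = rng.standard_normal((dim, sketch_size)) / np.sqrt(sketch_size)
        self.sum, self.count = np.zeros(sketch_size), 0

    def update(self, batch):                         # one pass, O(sketch) memory
        self.sum += np.tanh(batch @ self.proj).sum(axis=0)
        self.count += len(batch)

    def vector(self):
        return self.sum / max(self.count, 1)

a, b = OnePassSketch(16, 64), OnePassSketch(16, 64)
a.update(np.random.randn(500, 16))
b.update(np.random.randn(500, 16) + 2.0)             # shifted distribution
print(np.linalg.norm(a.vector() - b.vector()))        # large: distributions differ
```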

Augmenting Language Models with Long-Term Memory
Weizhi Wang Li Dong Hao Cheng Xiaodong Liu Xifeng Yan Jianfeng Gao Furu Wei



Research question: Existing large language models can only process fixed-size inputs due to the input length limit, preventing them from exploiting rich long-context information.
Motivation: To address this, we propose Language Models Augmented with Long-Term Memory (LongMem), a framework that enables LLMs to memorize long histories.
Method: We design a novel decoupled architecture that freezes the original backbone LLM as a memory encoder and adds an adaptive residual side-network as a memory retriever and reader. This decoupled memory design makes it easy to cache and update long-term past contexts for retrieval without suffering from memory staleness. With memory-augmented adaptation training, LongMem memorizes long past contexts and uses long-term memory for language modeling.
Results: Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves notable gains in memory-augmented in-context learning, demonstrating that it effectively helps language models memorize and use long-form content.

Existing large language models (LLMs) can only afford fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long histories. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LongMem can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. Typically, LongMem can enlarge the long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping language models to memorize and utilize long-form content.

Lockdown: Backdoor Defense for Federated Learning with Isolated Subspace Training
Tiansheng Huang Sihao Hu Ka-Ho Chow Fatih Ilhan Selim Furkan Tekin Ling Liu



Research question: Federated learning (FL) is vulnerable to backdoor attacks due to its distributed computing nature.
Motivation: Existing defense solutions typically require extra computation, limiting their practicality in resource-constrained scenarios.
Method: This paper presents Lockdown, an isolated subspace training method that mitigates backdoor attacks in three key steps: modifying the training protocol to isolate the training subspaces of different clients; using randomness to initialize the isolated subspaces, then applying subspace pruning and recovery to segregate the subspaces of malicious and benign clients; and introducing quorum consensus to cure the global model by purging malicious/dummy parameters.
Results: Experiments show that Lockdown achieves superior and consistent defense against backdoor attacks while also improving communication efficiency and reducing model complexity, both critical for resource-constrained FL scenarios.

Federated learning (FL) is vulnerable to backdoor attacks due to its distributed computing nature. Existing defense solutions usually require a larger amount of computation in either the training or testing phase, which limits their practicality in resource-constrained scenarios. A more practical defense, i.e., a neural network (NN) pruning based defense, has been proposed in the centralized backdoor setting. However, our empirical study shows that traditional pruning-based solutions suffer from a \textit{poison-coupling} effect in FL, which significantly degrades the defense performance. This paper presents Lockdown, an isolated subspace training method to mitigate the poison-coupling effect. Lockdown follows three key procedures. First, it modifies the training protocol by isolating the training subspaces for different clients. Second, it utilizes randomness in initializing isolated subspaces, and performs subspace pruning and subspace recovery to segregate the subspaces between malicious and benign clients. Third, it introduces quorum consensus to cure the global model by purging malicious/dummy parameters. Empirical results show that Lockdown achieves \textit{superior} and \textit{consistent} defense performance compared to existing representative approaches against backdoor attacks. Another value-added property of Lockdown is the communication-efficiency and model complexity reduction, which are both critical for resource-constrained FL scenarios. Our code is available at \url{https://github.com/git-disl/Lockdown}.

A Unified Fast Gradient Clipping Framework for DP-SGD
Weiwei Kong Andres Munoz medina



Research question: In differentially private stochastic gradient descent (DP-SGD), computing the gradient norm of each example in a large input batch is a well-known numerical bottleneck.
Motivation: When the DP-SGD loss contains intermediate linear operations, existing methods decompose gradients in ways amenable to fast norm computation; this paper presents a framework that generalizes such approaches to arbitrary (possibly nonlinear) intermediate operations.
Method: We show that for certain operations, such as fully-connected and embedding layer computations, the runtime and storage costs of existing decompositions can be further reduced using components of our framework.
Results: Preliminary numerical experiments demonstrate the substantial effects of the above improvements.

A well-known numerical bottleneck in the differentially-private stochastic gradient descent (DP-SGD) algorithm is the computation of the gradient norm for each example in a large input batch. When the loss function in DP-SGD consists of an intermediate linear operation, existing methods in the literature have proposed decompositions of gradients that are amenable to fast norm computations. In this paper, we present a framework that generalizes the above approach to arbitrary (possibly nonlinear) intermediate operations. Moreover, we show that for certain operations, such as fully-connected and embedding layer computations, further improvements to the runtime and storage costs of existing decompositions can be deduced using certain components of our framework. Finally, preliminary numerical experiments are given to demonstrate the substantial effects of the aforementioned improvements.
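
The classic linear-layer decomposition that this framework generalizes is easy to state: for a linear layer, the per-example weight gradient is an outer product, so its Frobenius norm factorizes and no per-example gradient needs to be materialized. A small NumPy check (variable names are ours):

```python
import numpy as np

def per_example_weight_grad_norms(x, g):
    """Per-example gradient norms for a linear layer y = x @ W.T.
    The per-example gradient of W is the outer product g_i x_i^T, whose
    Frobenius norm factorizes as ||g_i|| * ||x_i||."""
    return np.linalg.norm(g, axis=1) * np.linalg.norm(x, axis=1)

x = np.random.randn(32, 128)      # per-example activations
g = np.random.randn(32, 64)       # per-example output gradients
fast = per_example_weight_grad_norms(x, g)
naive = np.linalg.norm(g[:, :, None] * x[:, None, :], axis=(1, 2))
print(np.allclose(fast, naive))   # True, without forming per-example grads
```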

Penguin: Parallel-Packed Homomorphic Encryption for Fast Graph Convolutional Network Inference
Ran Ran Nuo Xu Tao Liu Wei Wang Gang Quan Wujie Wen



Research question: How to make homomorphic-encryption-based graph convolutional network (HE-GCN) inference on the cloud practical while protecting client data privacy.
Motivation: HE operations carry enormous computation and memory overhead, and directly applying state-of-the-art HE secure matrix-matrix multiplication is inefficient because it ignores the unique two-dimensional node-feature aggregation pattern of GCN layer computation.
Method: Propose Penguin, a novel HE ciphertext packing technique combining (i) two-dimension parallel packing of feature ciphertexts with optimal graph node partitioning and feature interleaving, and (ii) an interleaved assembly technique that reuses blank slots to merge ciphertexts after feature reduction and cut costly rotation operations.
Results: Penguin achieves up to ~10x speedup and around 79% reduction in computational memory overhead, significantly outperforming state-of-the-art solutions.

The marriage of Graph Convolutional Network (GCN) and Homomorphic Encryption (HE) enables the inference of graph data on the cloud with significantly enhanced client data privacy. However, the tremendous computation and memory overhead associated with HE operations challenges the practicality of HE-based GCN inference. GCN inference involves a sequence of expensive matrix-matrix multiplications, and we observe that directly applying the state-of-the-art HE-based secure matrix-matrix multiplication solutions to accelerate HE-GCN inference is far less efficient as it does not exploit the unique aggregation mechanism of two-dimension graph node-features in GCN layer computation. As a result, in this paper, we propose a novel HE-based ciphertext packing technique, i.e., Penguin, that can take advantage of the unique computation pattern during the HE-GCN inference to significantly reduce the computation and memory overhead associated with HE operations. Specifically, Penguin employs (i) an effective two-dimension parallel packing technique for feature ciphertext with optimal graph node partitioning and graph feature interleaving, and (ii) an interleaved assembly technique that can effectively make use of the blank slots to merge ciphertexts after feature reduction and significantly reduce the costly rotation operation. We provide theoretical analysis and experimental validation to demonstrate the speedup achieved by Penguin in accelerating GCN inference using popular GCN models and datasets. Our results show that Penguin can achieve up to $\sim10\times$ speedup and around $\sim79$% reduction in computational memory overhead, significantly outperforming state-of-the-art solutions. To the best of our knowledge, this is the first work that can ensure the protection of both graph structure and features when accelerating HE-GCN inference on encrypted data. Our code is publicly available at https://github.com/ranran0523/Penguin.

Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model
Zirui Liu Guanchu Wang Shaochen Zhong Zhaozhuo Xu Daochen Zha Ruixiang Tang Zhimeng Jiang Kaixiong Zhou Vipin Chaudhary Shuai Xu Xia Hu



Research question: Fine-tuning large pre-trained language models is difficult because their many parameters lead to extensive memory usage.
Motivation: Prior work mostly reduces the number of trainable parameters, but the main memory bottleneck during training is actually storing feature maps (activations), which are essential for gradient computation.
Method: Propose a new family of unbiased, variance-reduced estimators for matrix products (WTACRS, Winner-Take-All Column Row Sampling), which only requires storing sub-sampled activations for computing gradients.
Results: Experiments show that, for tuning transformers, the proposed estimators exhibit lower variance than existing ones. Replacing linear operations with the approximated ones yields up to 2.7x peak memory reduction with almost no accuracy drop and enables larger batch sizes.

As the model size grows rapidly, fine-tuning the large pre-trained language model has become increasingly difficult due to its extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, machine learning models are typically trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose a new family of unbiased estimators, called WTACRS, for matrix products with reduced variance, which only requires storing the sub-sampled activations for calculating the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones. By replacing the linear operation with our approximated one in transformers, we can achieve up to 2.7X peak memory reduction with almost no accuracy drop, and enable up to a $6.4\times$ larger batch size. Under the same hardware, WTACRS enables better downstream task performance by applying larger models and/or faster training speed with larger batch sizes. The code is available at https://anonymous.4open.science/r/WTACRS-A5C5/.
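
For intuition, here is a generic unbiased column-row sampling estimator of a matrix product (Drineas-style importance sampling); it illustrates why sub-sampling activations can leave the gradient unbiased, but it is not the paper's winner-take-all estimator, and all names are ours.

```python
import numpy as np

def sampled_matmul(A, B, k, rng):
    """Unbiased estimator of A @ B that keeps only k column-row pairs,
    sampled proportionally to their norm product and rescaled by 1/(k*p)."""
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=k, replace=True, p=p)
    scale = 1.0 / (k * p[idx])
    return (A[:, idx] * scale) @ B[idx, :]    # only k activations needed

A, B = np.random.randn(64, 256), np.random.randn(256, 32)
est = np.mean([sampled_matmul(A, B, 64, np.random.default_rng(s))
               for s in range(200)], axis=0)
print(np.abs(est - A @ B).max())              # shrinks as samples grow: unbiased
```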

Unsupervised Learning for Solving the Travelling Salesman Problem
Yimeng Min Yiwei Bai Carla P Gomes



Research question: This paper proposes an unsupervised learning (UL) framework for solving the Travelling Salesman Problem (TSP).
Motivation: Existing data-driven TSP heuristics are inefficient in both parameters and data.
Method: We train a graph neural network (GNN) with a surrogate loss. The GNN outputs a heat map giving the probability of each edge belonging to the optimal tour, and local search is then applied to generate the final prediction from the heat map.
Results: Experiments show that UTSP outperforms existing data-driven TSP heuristics while using only about 10% of the parameters and about 0.2% of the training samples required by reinforcement learning or supervised learning methods.

We propose UTSP, an Unsupervised Learning (UL) framework for solving the Travelling Salesman Problem (TSP). We train a Graph Neural Network (GNN) using a surrogate loss. The GNN outputs a heat map representing the probability for each edge to be part of the optimal path. We then apply local search to generate our final prediction based on the heat map. Our loss function consists of two parts: one pushes the model to find the shortest path and the other serves as a surrogate for the constraint that the route should form a Hamiltonian Cycle. Experimental results show that UTSP outperforms the existing data-driven TSP heuristics. Our approach is parameter efficient as well as data efficient: the model takes $\sim$ 10\% of the number of parameters and $\sim$ 0.2\% of training samples compared with Reinforcement Learning or Supervised Learning methods.

LinGCN: Structural Linearized Graph Convolutional Network for Homomorphically Encrypted Inference
Hongwu Peng Ran Ran Yukui Luo Jiahui Zhao Shaoyi Huang Kiran Thorat Tong Geng Chenghong Wang Xiaolin Xu Wujie Wen Caiwen Ding



Research question: How to optimize the deployment of graph convolutional networks (GCNs) in the cloud while addressing both privacy protection and computational efficiency.
Motivation: As GCN models scale up, their applications in areas such as personal healthcare and financial systems have surpassed human performance, but cloud deployment exposes client data to potential adversarial attacks and raises privacy concerns.
Method: Propose LinGCN, a framework that reduces multiplication depth and optimizes the performance of homomorphic-encryption (HE) based GCN inference. It comprises three components: (1) a differentiable structural linearization algorithm; (2) a compact node-wise polynomial replacement policy; and (3) an enhanced HE solution.
Results: Experiments show that LinGCN outperforms solutions such as CryptoGCN in latency, accuracy, and scalability; for homomorphically encrypted inference it achieves a 14.2x latency speedup while preserving about 75% inference accuracy and notably reducing multiplication depth.

The growth of Graph Convolution Network (GCN) model sizes has revolutionized numerous applications, surpassing human performance in areas such as personal healthcare and financial systems. The deployment of GCNs in the cloud raises privacy concerns due to potential adversarial attacks on client data. To address security concerns, Privacy-Preserving Machine Learning (PPML) using Homomorphic Encryption (HE) secures sensitive client data. However, it introduces substantial computational overhead in practical applications. To tackle those challenges, we present LinGCN, a framework designed to reduce multiplication depth and optimize the performance of HE based GCN inference. LinGCN is structured around three key elements: (1) A differentiable structural linearization algorithm, complemented by a parameterized discrete indicator function, co-trained with model weights to meet the optimization goal. This strategy promotes fine-grained node-level non-linear location selection, resulting in a model with minimized multiplication depth. (2) A compact node-wise polynomial replacement policy with a second-order trainable activation function, steered towards superior convergence by a two-level distillation approach from an all-ReLU based teacher model. (3) an enhanced HE solution that enables finer-grained operator fusion for node-wise activation functions, further reducing multiplication level consumption in HE-based inference. Our experiments on the NTU-XVIEW skeleton joint dataset reveal that LinGCN excels in latency, accuracy, and scalability for homomorphically encrypted inference, outperforming solutions such as CryptoGCN. Remarkably, LinGCN achieves a 14.2× latency speedup relative to CryptoGCN, while preserving an inference accuracy of ~75\% and notably reducing multiplication depth. Additionally, LinGCN proves scalable for larger models, delivering a substantial 85.78\% accuracy with 6371s latency, a 10.47\% accuracy improvement over CryptoGCN.

Maximum Independent Set: Self-Training through Dynamic Programming
Lorenzo Brusca Lars C.P.M. Quaedvlieg Stratis Skoulakis Grigorios Chrysos Volkan Cevher



Research question: This paper proposes a graph neural network (GNN) based framework for solving the maximum independent set (MIS) problem.
Motivation: Inspired by dynamic programming (DP), we design a DP-like recursive algorithm that solves MIS by constructing sub-graphs and predicting which one has the larger MIS.
Method: At each step, construct two smaller sub-graphs, predict the one with the larger MIS, and use it in the next recursive call. Training compares the MIS sizes of different graphs; annotating the comparisons with the algorithm's own output yields a self-training process.
Results: Experiments show that the method outperforms prior methods on multiple synthetic and real-world datasets.

This work presents a graph neural network (GNN) framework for solving the maximum independent set (MIS) problem, inspired by dynamic programming (DP). Specifically, given a graph, we propose a DP-like recursive algorithm based on GNNs that firstly constructs two smaller sub-graphs, predicts the one with the larger MIS, and then uses it in the next recursive call. To train our algorithm, we require annotated comparisons of different graphs concerning their MIS size. Annotating the comparisons with the output of our algorithm leads to a self-training process that results in more accurate self-annotation of the comparisons and vice versa. We provide numerical evidence showing the superiority of our method vs prior methods in multiple synthetic and real-world datasets.
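
The exact recursion the learned model emulates is the textbook MIS branching rule. A tiny Python sketch (exponential time, toy graphs only; in the paper a GNN replaces the exact max by a prediction):

```python
def mis_size(adj, vertices=None):
    """Exact MIS size via MIS(G) = max(MIS(G - v), 1 + MIS(G - N[v])):
    either vertex v is excluded, or it is included and its closed
    neighborhood is removed."""
    if vertices is None:
        vertices = frozenset(adj)
    if not vertices:
        return 0
    v = next(iter(vertices))
    without_v = vertices - {v}
    without_nbhd = vertices - ({v} | adj[v])
    return max(mis_size(adj, without_v), 1 + mis_size(adj, without_nbhd))

# 5-cycle: the maximum independent set has size 2
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
print(mis_size(adj))  # 2
```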

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Mostafa Dehghani Basil Mustafa Josip Djolonga Jonathan Heek Matthias Minderer Mathilde Caron Andreas Peter Steiner Joan Puigcerver Robert Geirhos Ibrahim Alabdulmohsin Avital Oliver Piotr Padlewski Alexey A. Gritsenko Mario Lucic Neil Houlsby



Research question: Computer vision models commonly resize images to a fixed resolution before processing, which is demonstrably suboptimal.
Motivation: To address this, we propose NaViT (Native Resolution ViT), which exploits the Vision Transformer's flexible sequence-based modeling to process inputs of arbitrary resolutions and aspect ratios.
Method: We use sequence packing during training to handle arbitrary resolutions and aspect ratios, and demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining.
Results: Experiments show that NaViT transfers efficiently to standard tasks such as object detection and image and video classification, and improves results on robustness and fairness benchmarks. At inference time, input-resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe NaViT marks a departure from the CNN-designed input and modeling pipeline used by most computer vision models and represents a promising direction for ViTs.

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as object detection, image and video classification, and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
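
A toy sketch of the sequence-packing step, using scalar "tokens" for brevity (real patch tokens are vectors); the greedy strategy and all names are our illustrative assumptions, not NaViT's exact packing algorithm.

```python
import numpy as np

def pack_sequences(seqs, max_len):
    """Greedily concatenate variable-length token sequences into fixed-length
    rows, with an image-id array so attention can later be restricted to
    tokens from the same image (id -1 marks padding)."""
    rows, ids = [np.zeros((0,))], [np.zeros((0,), dtype=int)]
    for img_id, s in enumerate(seqs):
        if len(rows[-1]) + len(s) > max_len:      # open a new packed row
            rows.append(np.zeros((0,)))
            ids.append(np.zeros((0,), dtype=int))
        rows[-1] = np.concatenate([rows[-1], s])
        ids[-1] = np.concatenate([ids[-1], np.full(len(s), img_id)])
    tok = np.stack([np.pad(r, (0, max_len - len(r))) for r in rows])
    iid = np.stack([np.pad(i, (0, max_len - len(i)), constant_values=-1)
                    for i in ids])
    return tok, iid

seqs = [np.random.randn(n) for n in (7, 5, 9, 3)]   # 4 images, varying #patches
tok, iid = pack_sequences(seqs, max_len=12)
print(iid)  # a same-image attention mask would be iid[:,:,None] == iid[:,None,:]
```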

Every Parameter Matters: Ensuring the Convergence of Federated Learning with Dynamic Heterogeneous Models Reduction
Hanhan Zhou Tian Lan Guru Prasadh Venkataramani Wenbo Ding



Research question: Cross-device federated learning (FL) faces a significant challenge: low-end clients that could make unique contributions are excluded from training large models due to their resource bottlenecks.
Motivation: Recent work has focused on model-heterogeneous FL, extracting reduced-size models from the global model and applying them to local clients accordingly; despite empirical success, general theoretical guarantees of convergence for this approach remain an open question.
Method: This paper presents a unifying framework for heterogeneous FL algorithms with online model extraction and provides a general convergence analysis for the first time.
Results: We prove that, under certain sufficient conditions and for both IID and non-IID data, these algorithms converge to a stationary point of standard FL for general smooth cost functions. We further introduce the minimum coverage index which, together with model reduction noise, determines the convergence of heterogeneous FL; we therefore advocate a holistic approach that considers both factors to improve its efficiency.

Cross-device Federated Learning (FL) faces significant challenges where low-end clients that could potentially make unique contributions are excluded from training large models due to their resource bottlenecks. Recent research efforts have focused on model-heterogeneous FL, by extracting reduced-size models from the global model and applying them to local clients accordingly. Despite the empirical success, general theoretical guarantees of convergence on this method remain an open question. This paper presents a unifying framework for heterogeneous FL algorithms with online model extraction and provides a general convergence analysis for the first time. In particular, we prove that under certain sufficient conditions and for both IID and non-IID data, these algorithms converge to a stationary point of standard FL for general smooth cost functions. Moreover, we introduce the concept of minimum coverage index, together with model reduction noise, which will determine the convergence of heterogeneous federated learning, and therefore we advocate for a holistic approach that considers both factors to enhance the efficiency of heterogeneous federated learning.

Fantastic Weights and How to Find Them: Where to Prune in Dynamic Sparse Training
Aleksandra Nowak Bram Grooten Decebal Constantin Mocanu Jacek Tabor



Research question: Dynamic Sparse Training (DST) optimizes the sparse initialization of a neural network by adapting its connectivity during training; this study seeks a deeper understanding of how the pruning criterion affects DST performance.
Motivation: Although DST has been shown to outperform dense models under specific conditions, the influence of the pruning criterion on DST performance remains relatively overlooked.
Method: Design and perform an extensive empirical analysis of various pruning criteria and their impact on the dynamics of DST solutions.
Results: Most of the studied methods yield similar results; in the low-density regime, however, the simplest and most effective technique, magnitude-based pruning, shows a clear advantage.

Dynamic Sparse Training (DST) is a rapidly evolving area of research that seeks to optimize the sparse initialization of a neural network by adapting its topology during training. It has been shown that under specific conditions, DST is able to outperform dense models. The key components of this framework are the pruning and growing criteria, which are repeatedly applied during the training process to adjust the network’s sparse connectivity. While the growing criterion's impact on DST performance is relatively well studied, the influence of the pruning criterion remains overlooked. To address this issue, we design and perform an extensive empirical analysis of various pruning criteria to better understand their impact on the dynamics of DST solutions. Surprisingly, we find that most of the studied methods yield similar results. The differences become more significant in the low-density regime, where the best performance is predominantly given by the simplest technique: magnitude-based pruning.
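
A minimal sketch of one DST prune-and-grow update with the magnitude criterion the paper finds hardest to beat; random regrowth and all names are our simplifications (practical growth criteria are often gradient-based).

```python
import numpy as np

def prune_and_grow(weights, mask, prune_frac, rng=np.random.default_rng(0)):
    """Drop the smallest-magnitude fraction of *active* weights, then regrow
    the same number of connections elsewhere, keeping density constant."""
    active = np.flatnonzero(mask)
    n_prune = int(prune_frac * len(active))
    # magnitude-based pruning among currently active weights
    drop = active[np.argsort(np.abs(weights.ravel()[active]))[:n_prune]]
    mask.ravel()[drop] = False
    # random growth among inactive positions
    inactive = np.flatnonzero(~mask.ravel())
    grow = rng.choice(inactive, size=n_prune, replace=False)
    mask.ravel()[grow] = True
    weights.ravel()[grow] = 0.0            # new connections start at zero
    return weights, mask

w = np.random.randn(64, 64)
m = np.random.rand(64, 64) < 0.1           # ~10% density
prune_and_grow(w, m, prune_frac=0.3)
print(m.mean())                             # density preserved
```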

Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding
Zhejun Zhang Alexander Liniger Christos Sakaridis Fisher Yu Luc Van Gool



Research question: How to improve the real-time performance and scalability of the motion prediction module in autonomous driving systems.
Motivation: Existing agent-centric methods perform well on public benchmarks but suffer from high computational overhead and poor scalability as the number of agents to predict grows.
Method: Propose K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism that lets Transformers use pairwise-relative representations. Building on KNARPE, present the Heterogeneous Polyline Transformer with Relative pose encoding (HPTR), a hierarchical framework enabling asynchronous token updates during online inference. By sharing contexts among agents and reusing unchanged contexts, the approach is as efficient as scene-centric methods while performing on par with state-of-the-art agent-centric methods.
Results: Experiments on the Waymo and Argoverse-2 datasets show that HPTR performs best among end-to-end methods that avoid expensive post-processing or model ensembling. The code is available at https://github.com/zhejz/HPTR.

The real-world deployment of an autonomous driving system requires its components to run on-board and in real-time, including the motion prediction module that predicts the future trajectories of surrounding traffic participants. Existing agent-centric methods have demonstrated outstanding performance on public benchmarks. However, they suffer from high computational overhead and poor scalability as the number of agents to be predicted increases. To address this problem, we introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers. Then, based on KNARPE we present the Heterogeneous Polyline Transformer with Relative pose encoding (HPTR), a hierarchical framework enabling asynchronous token update during the online inference. By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods. Experiments on Waymo and Argoverse-2 datasets show that HPTR achieves superior performance among end-to-end methods that do not apply expensive post-processing or model ensembling. The code is available at https://github.com/zhejz/HPTR.

SyncTREE: Fast Timing Analysis for Integrated Circuit Design through a Physics-informed Tree-based Graph Neural Network
Yuting Hu Jiajie Li Florian Klemme Gi-Joon Nam Tengfei Ma Hussam Amrouch Jinjun Xiong



Research question: How to use AI to speed up complex analyses in integrated circuit design, such as timing, noise, and power.
Motivation: As IC designs grow more complex, traditional analyses cost substantial time and compute; advances in AI offer new possibilities for improving both speed and accuracy.
Method: This paper proposes SyncTREE, a tree-based graph neural network that accelerates timing analysis of IC interconnects by incorporating the structural and physical properties of circuits. It features two-pass (bottom-up and top-down) message passing for graph embedding, a tree contrastive loss to guide learning, and a closed-form, formula-based approach for fast timing computation.
Results: Experiments show that, compared with conventional GNN models, SyncTREE achieves the best timing predictions in terms of both delays and slews, in reference to the industry golden numerical analysis results.

Nowadays integrated circuits (ICs) are underpinning all major information technology innovations including the current trends of artificial intelligence (AI). Modern IC designs often involve analyses of complex phenomena (such as timing, noise, and power) for tens of billions of electronic components, like resistance (R), capacitance (C), transistors and gates, interconnected in various complex structures. Those analyses often need to strike a balance between accuracy and speed, as they need to be carried out many times throughout the entire IC design cycle. With the advancement of AI, researchers have also started to explore new ways of leveraging AI to improve those analyses. This paper focuses on one of the most important analyses, timing analysis for interconnects. Since IC interconnects can be represented as an RC-tree, a specialized tree-structured graph, we design a novel tree-based graph neural network, SyncTREE, to speed up the timing analysis by incorporating both the structural and physical properties of electronic circuits. Our major innovations include (1) a two-pass message-passing (bottom-up and top-down) for graph embedding, (2) a tree contrastive loss to guide learning, and (3) a closed-form, formula-based approach to conduct fast timing computation. Our experiments show that, compared to conventional GNN models, SyncTREE achieves the best timing prediction in terms of both delays and slews, all in reference to the industry golden numerical analysis results on real IC design data.

Neural Combinatorial Optimization with Heavy Decoder: Toward Large Scale Generalization
Fu Luo Xi Lin Fei Liu Qingfu Zhang Zhenkun Wang



Research question: Existing constructive neural combinatorial optimization methods cannot solve large-scale instances, limiting their usefulness in real-world applications.
Motivation: To address this, we propose a novel Light Encoder and Heavy Decoder (LEHD) model with strong generalization ability.
Method: The LEHD model learns to dynamically capture the relationships among all available nodes of varying sizes, which benefits generalization across problem scales. We also develop a data-efficient training scheme and a flexible solution construction mechanism for LEHD.
Results: Experiments show that, trained only on small-scale instances, LEHD generates nearly optimal solutions for the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) with up to 1000 nodes, and generalizes well to real-world TSPLib and CVRPLib problems. These results confirm that LEHD significantly improves the state of the art for constructive NCO.

Neural combinatorial optimization (NCO) is a promising learning-based approach for solving challenging combinatorial optimization problems without specialized algorithm design by experts. However, most constructive NCO methods cannot solve problems with large-scale instance sizes, which significantly diminishes their usefulness for real-world applications. In this work, we propose a novel Light Encoder and Heavy Decoder (LEHD) model with a strong generalization ability to address this critical issue. The LEHD model can learn to dynamically capture the relationships between all available nodes of varying sizes, which is beneficial for model generalization to problems of various scales. Moreover, we develop a data-efficient training scheme and a flexible solution construction mechanism for the proposed LEHD model. By training on small-scale problem instances, the LEHD model can generate nearly optimal solutions for the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) with up to 1000 nodes, and also generalizes well to solve real-world TSPLib and CVRPLib problems. These results confirm our proposed LEHD model can significantly improve the state-of-the-art performance for constructive NCO.

Self-Correcting Bayesian Optimization through Bayesian Active Learning
Carl Hvarfner Erik Orm Hellsten Frank Hutter Luigi Nardi



Research question: This paper addresses hyperparameter selection for Gaussian processes in Bayesian optimization and active learning.
Motivation: Gaussian process models depend heavily on well-chosen hyperparameters, yet the existing literature devotes little effort to finding good ones.
Method: Propose two acquisition functions that explicitly prioritize hyperparameter learning: statistical distance-based Active Learning (SAL) and Self-Correcting Bayesian Optimization (SCoreBO).
Results: Experiments show that SAL and SCoreBO outperform the latest methods on Bayesian active learning and Bayesian optimization tasks respectively, with strong results on several test functions and traditional benchmarks.

Gaussian processes are the model of choice in Bayesian optimization and active learning. Yet, they are highly dependent on cleverly chosen hyperparameters to reach their full potential, and little effort is devoted to finding good hyperparameters in the literature. We demonstrate the impact of selecting good hyperparameters for GPs and present two acquisition functions that explicitly prioritize hyperparameter learning. Statistical distance-based Active Learning (SAL) considers the average disagreement between samples from the posterior, as measured by a statistical distance. SAL outperforms the state-of-the-art in Bayesian active learning on several test functions. We then introduce Self-Correcting Bayesian Optimization (SCoreBO), which extends SAL to perform Bayesian optimization and active learning simultaneously. SCoreBO learns the model hyperparameters at improved rates compared to vanilla BO, while outperforming the latest Bayesian optimization methods on traditional benchmarks. Moreover, we demonstrate the importance of self-correction on atypical Bayesian optimization tasks.

Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions
Vladimir Feinberg Xinyi Chen Y. Jennifer Sun Rohan Anil Elad Hazan



Research question: How to reduce the memory and compute requirements of matrix preconditioners in deep learning training tasks.
Motivation: Current adaptive regularization methods perform well on many tasks but demand excessive memory and runtime. We find that the spectra of Kronecker-factored gradient covariance matrices in deep learning training concentrate on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach.
Method: We describe a generic method that uses the Frequent Directions (FD) sketch to reduce the memory and compute needed to maintain a matrix preconditioner.
Results: In the online convex optimization (OCO) setting over dimension d, we match full-matrix d^2-memory regret using only dk memory, up to additive error in the bottom d-k eigenvalues of the gradient covariance. Extending the method to Shampoo yields an optimizer competitive in quality with Shampoo and Adam while requiring only sub-linear memory for tracking second moments.

Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. While previous approaches have explored applying FD for second-order optimization, we present a novel analysis which allows efficient interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
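
For reference, a minimal implementation of the Frequent Directions primitive (Liberty, 2013) on which the sketching rests; the 2k-row buffer variant below is a standard textbook version, not the paper's optimizer code, and assumes the feature dimension is at least 2k.

```python
import numpy as np

def frequent_directions(A, k):
    """Stream the rows of A into a 2k-row sketch B with the deterministic
    guarantee ||A^T A - B^T B||_2 <= ||A||_F^2 / k."""
    B = np.zeros((2 * k, A.shape[1]))
    for row in A:
        free = np.where(~B.any(axis=1))[0]
        if len(free) == 0:                       # sketch full: shrink
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s2 = np.maximum(s ** 2 - s[k - 1] ** 2, 0.0)
            B = np.sqrt(s2)[:, None] * Vt        # at least k rows become zero
            free = np.where(~B.any(axis=1))[0]
        B[free[0]] = row
    return B

A = np.random.randn(2000, 32)
B = frequent_directions(A, k=8)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
print(err <= np.linalg.norm(A, "fro") ** 2 / 8)  # guarantee holds: True
```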

Expert load matters: operating networks at high accuracy and low manual effort
Sara Sangalli Ertunc Erdil Ender Konukoglu



Research question: In human-AI collaboration systems for critical applications, how to set the operating point on model confidence that decides when a decision is delegated to a human expert.
Motivation: To ensure minimal error, users set an operating point based on model confidence; samples whose confidence falls below it are analysed manually by experts to avoid mistakes.
Method: Propose a new complementary classification loss that accounts for both network accuracy and expert load by training deep networks to maximize the area under the confidence operating characteristic (COC) curve.
Results: Experiments show that the proposed loss improves classification accuracy while delegating fewer decisions to humans, and achieves better out-of-distribution sample detection with calibration on par with existing loss functions.

In human-AI collaboration systems for critical applications, in order to ensure minimal error, users should set an operating point based on model confidence to determine when the decision should be delegated to human experts. Samples for which model confidence is lower than the operating point would be manually analysed by experts to avoid mistakes. Such systems can become truly useful only if they consider two aspects: models should be confident only for samples for which they are accurate, and the number of samples delegated to experts should be minimized. The latter aspect is especially crucial for applications where available expert time is limited and expensive, such as healthcare. The trade-off between the model accuracy and the number of samples delegated to experts can be represented by a curve that is similar to an ROC curve, which we refer to as the confidence operating characteristic (COC) curve. In this paper, we argue that deep neural networks should be trained by taking into account both accuracy and expert load and, to that end, propose a new complementary loss function for classification that maximizes the area under this COC curve. This promotes simultaneously the increase in network accuracy and the reduction in number of samples delegated to humans. We perform experiments on multiple computer vision and medical image datasets for classification. Our results demonstrate that the proposed loss improves classification accuracy, delegates fewer decisions to experts, achieves better out-of-distribution sample detection, and attains calibration performance on par with existing loss functions.
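
A small NumPy sketch that *evaluates* a COC curve and its area for a toy model (the paper's contribution is a differentiable loss that maximizes this area during training; the function and variable names here are ours):

```python
import numpy as np

def coc_curve(confidence, correct):
    """Sweep the confidence threshold: for each fraction of samples delegated
    to the expert (least confident first), report the model's accuracy on
    the retained, most confident samples."""
    order = np.argsort(confidence)                 # least confident first
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    delegated = np.arange(n) / n                   # fraction sent to expert
    remaining = np.arange(n, 0, -1)
    accuracy = np.cumsum(correct[::-1])[::-1] / remaining
    return delegated, accuracy

rng = np.random.default_rng(0)
conf = rng.random(1000)
corr = rng.random(1000) < conf                     # toy, well-calibrated model
x, y = coc_curve(conf, corr)
aucoc = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)  # trapezoidal area under COC
print(round(aucoc, 3))
```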

DFRD: Data-Free Robustness Distillation for Heterogeneous Federated Learning
Kangyang Luo Shuai Wang Yexuan Fu Xiang Li Yunshi Lan Ming Gao



Research question: How to learn a robust global model in data-heterogeneous and model-heterogeneous federated learning scenarios.
Motivation: Enable collaborative training across clients while protecting user privacy.
Method: Propose DFRD, a new FL method that equips a conditional generator on the server to approximate the training space of the local models uploaded by clients, and systematically studies its training in terms of fidelity, transferability, and diversity.
Results: Extensive experiments on image classification tasks show that DFRD achieves significant performance gains over state-of-the-art baselines.

Federated Learning (FL) is a privacy-constrained decentralized machine learning paradigm in which clients enable collaborative training without compromising private data. However, how to learn a robust global model in the data-heterogeneous and model-heterogeneous FL scenarios is challenging. To address it, we resort to data-free knowledge distillation to propose a new FL method (namely DFRD). DFRD equips a conditional generator on the server to approximate the training space of the local models uploaded by clients, and systematically investigates its training in terms of fidelity, transferability and diversity. To overcome the catastrophic forgetting of the global model caused by the distribution shifts of the generator across communication rounds, we maintain an exponential moving average copy of the generator on the server. Additionally, we propose dynamic weighting and label sampling to accurately extract knowledge from local models. Finally, our extensive experiments on various image classification tasks illustrate that DFRD achieves significant performance gains compared to SOTA baselines.

Neural Modulation for Flash Memory: An Unsupervised Learning Framework for Improved Reliability
Jonathan Zedaka Elisha Halperin Evgeny Blaichman Amit Berman



Research question: The storage density of NAND flash memory has increased significantly in recent years, making it a critical component of modern electronic devices; however, higher capacity also raises the likelihood of errors in data storage and retrieval.
Motivation: The growing number of errors poses ongoing challenges for system designers and engineers in characterizing, modeling, and optimizing NAND-based systems.
Method: We present a novel approach that uses generative and unsupervised machine learning for error modeling and prevention. We construct and train a neural modulator that translates information bits into programming operations for each memory cell in NAND devices.
Results: The modulator, tailored to flash memory channels, provides a smart writing scheme that both reduces programming errors and compensates for data degradation over time. It is based on an auto-encoder architecture with a channel model embedded between the encoder and the decoder. Optimized for the end-of-life work point, the learned memory system improves raw bit error rate (RBER) by up to 56% over the prior art and extends flash block lifetime by up to 25%.

Recent years have witnessed a significant increase in the storage density of NAND flash memory, making it a critical component in modern electronic devices. However, with the rise in storage capacity comes an increased likelihood of errors in data storage and retrieval. The growing number of errors poses ongoing challenges for system designers and engineers, in terms of the characterization, modeling, and optimization of NAND-based systems. We present a novel approach for modeling and preventing errors by utilizing the capabilities of generative and unsupervised machine learning methods. As part of our research, we constructed and trained a neural modulator that translates information bits into programming operations on each memory cell in NAND devices. Our modulator, tailored explicitly for flash memory channels, provides a smart writing scheme that reduces programming errors as well as compensates for data degradation over time. Specifically, the modulator is based on an auto-encoder architecture with an additional channel model embedded between the encoder and the decoder. A conditional generative adversarial network (cGAN) was used to construct the channel model. Optimized for the end-of-life work-point, the learned memory system outperforms the prior art by up to 56\% in raw bit error rate (RBER) and extends the lifetime of the flash memory block by up to 25\%.

DeepACO: Neural-enhanced Ant Systems for Combinatorial Optimization
Haoran Ye Jiarui Wang Zhiguang Cao Helan Liang Yong Li



Research question: This paper proposes DeepACO, a generic framework that uses deep reinforcement learning to automate the heuristic design of Ant Colony Optimization (ACO).
Motivation: Traditional ACO algorithms require expert-designed, knowledge-driven heuristics; DeepACO automatically strengthens the heuristic measures of existing ACO algorithms and dispenses with laborious manual design in future ACO applications.
Method: Using a single neural model and a single set of hyperparameters, DeepACO consistently outperforms its ACO counterparts on eight combinatorial optimization problems; as a neural combinatorial optimization method, it performs better than or on par with problem-specific methods on canonical routing problems.
Results: Experiments show that DeepACO excels as a neural-enhanced meta-heuristic across multiple combinatorial optimization problems; the code is publicly available on GitHub.

Ant Colony Optimization (ACO) is a meta-heuristic algorithm that has been successfully applied to various Combinatorial Optimization Problems (COPs). Traditionally, customizing ACO for a specific problem requires the expert design of knowledge-driven heuristics. In this paper, we propose DeepACO, a generic framework that leverages deep reinforcement learning to automate heuristic designs. DeepACO serves to strengthen the heuristic measures of existing ACO algorithms and dispense with laborious manual design in future ACO applications. As a neural-enhanced meta-heuristic, DeepACO consistently outperforms its ACO counterparts on eight COPs using a single neural model and a single set of hyperparameters. As a Neural Combinatorial Optimization method, DeepACO performs better than or on par with problem-specific methods on canonical routing problems. Our code is publicly available at https://github.com/henry-yeh/DeepACO.

Robust low-rank training via approximate orthonormal constraints
Dayana Savostianova Emanuele Zangrando Gianluca Ceruti Francesco Tudisco



Research question: Design a pruning technique that reduces the resource demand of deep learning while retaining model performance.
Motivation: As models and datasets grow, reducing both the inference and training costs of deep learning becomes an important research direction.
Method: Represent network weights with low-rank matrix factorizations while enforcing approximate orthonormal constraints that keep the weights on the low-rank matrix manifold, reducing training and inference costs while ensuring well-conditioning.
Results: Extensive numerical evidence and a main approximation theorem show that the resulting robust low-rank networks closely approximate the ideal full model without sacrificing accuracy.

With the growth of model and data sizes, a broad effort has been made to design pruning techniques that reduce the resource demand of deep learning pipelines, while retaining model performance. In order to reduce both inference and training costs, a prominent line of work uses low-rank matrix factorizations to represent the network weights. Although able to retain accuracy, we observe that low-rank methods tend to compromise model robustness against adversarial perturbations. By modeling robustness in terms of the condition number of the neural network, we argue that this loss of robustness is due to the exploding singular values of the low-rank weight matrices. Thus, we introduce a robust low-rank training algorithm that maintains the network's weights on the low-rank matrix manifold while simultaneously enforcing approximate orthonormal constraints. The resulting model reduces both training and inference costs while ensuring well-conditioning and thus better adversarial robustness, without compromising model accuracy. This is shown by extensive numerical evidence and by our main approximation theorem that shows the computed robust low-rank network well-approximates the ideal full model, provided a highly performing low-rank sub-network exists.
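
A minimal NumPy sketch of the approximate orthonormal constraint: after each update (here a randomized stand-in for a gradient step), the low-rank factors are retracted to orthonormal columns via QR, so the singular values of W = U diag(s) V^T are exactly those of the core and conditioning stays under control. The loop and names are illustrative, not the paper's Riemannian algorithm.

```python
import numpy as np

def retract_orthonormal(U):
    """QR-based retraction toward orthonormal columns (sign-fixed)."""
    Q, R = np.linalg.qr(U)
    return Q * np.sign(np.diag(R))

d, r = 128, 8
U = np.linalg.qr(np.random.randn(d, r))[0]
V = np.linalg.qr(np.random.randn(d, r))[0]
s = np.abs(np.random.randn(r)) + 0.1           # well-conditioned core
for _ in range(10):                             # stand-in for gradient steps
    U = retract_orthonormal(U + 0.01 * np.random.randn(d, r))
    V = retract_orthonormal(V + 0.01 * np.random.randn(d, r))
W = (U * s) @ V.T                               # W = U diag(s) V^T
sv = np.linalg.svd(W, compute_uv=False)[:r]
print(np.allclose(np.sort(sv), np.sort(s)))     # singular values = s: True
```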

A Unified Solution for Privacy and Communication Efficiency in Vertical Federated Learning
Ganyu Wang Bin Gu Qingsong Zhang Xiang Li Boyu Wang Charles Ling



Research question: How to let multiple parties jointly train a model without sharing data while satisfying both privacy security and communication efficiency.
Motivation: Existing vertical federated learning (VFL) methods fall short on privacy protection and efficiency, and need further improvement.
Method: Propose a cascaded hybrid optimization approach that applies zeroth-order optimization to the most critical output layer of the clients and first-order optimization elsewhere, preserving ZOO's privacy-protecting properties while significantly improving convergence.
Results: Experiments show that the method achieves utility similar to the Gaussian mechanism under the same privacy budget, with significantly lower communication costs than state-of-the-art communication-efficient VFL frameworks.

Vertical Federated Learning (VFL) is a collaborative machine learning paradigm that enables multiple participants to jointly train a model on their private data without sharing it. To make VFL practical, privacy security and communication efficiency should both be satisfied. Recent research has shown that Zero-Order Optimization (ZOO) in VFL can effectively conceal the internal information of the model without adding costly privacy protective add-ons, making it a promising approach for privacy and efficiency. However, there are still two key problems that have yet to be resolved. First, the convergence rate of ZOO-based VFL is significantly slower compared to gradient-based VFL, resulting in low efficiency in model training and more communication rounds, which hinders its application on large neural networks. Second, although ZOO-based VFL has demonstrated resistance to state-of-the-art (SOTA) attacks, its privacy guarantee lacks a theoretical explanation. To address these challenges, we propose a novel cascaded hybrid optimization approach that employs a zeroth-order (ZO) gradient on the most critical output layer of the clients, with other parts utilizing the first-order (FO) gradient. This approach preserves the privacy protection of ZOO while significantly enhancing convergence. Moreover, we theoretically prove that applying ZOO to the VFL is equivalent to adding Gaussian Mechanism to the gradient information, which offers an implicit differential privacy guarantee. Experimental results demonstrate that our proposed framework achieves similar utility as the Gaussian mechanism under the same privacy budget, while also having significantly lower communication costs compared with SOTA communication-efficient VFL frameworks.
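
The zeroth-order building block is the standard two-point gradient estimator: only function values cross the party boundary, never analytic gradients, which is what conceals the model's internals. A generic sketch on a toy objective (not the paper's cascaded scheme):

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, num_dirs=32, rng=np.random.default_rng(0)):
    """Two-point zeroth-order gradient estimate: average finite differences
    along random Gaussian directions. Unbiased for the smoothed objective."""
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

f = lambda x: np.sum(x ** 2)                  # toy objective; true grad = 2x
x = np.ones(8)
print(zo_gradient(f, x)[:3], "vs", (2 * x)[:3])
```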

Sparse Parameterization for Epitomic Dataset Distillation
Xing Wei Anjia Cao Funing Yang Zhiheng Ma



Research question: How to store, preprocess, and train on large-scale deep learning datasets efficiently.
Motivation: The storage, preprocessing, and training of large-scale datasets present significant challenges and call for more efficient methods.
Method: A Sparse Parameterization for Epitomic Dataset Distillation (SPEED) framework that leverages dictionary learning and sparse coding to distill epitomes representing the pivotal information of a dataset.
Results: Experiments demonstrate SPEED's superiority on high-resolution datasets, with state-of-the-art performance on multiple benchmarks and downstream applications; the framework is compatible with various dataset matching methods and generally improves their performance.

The success of deep learning relies heavily on large and diverse datasets, but the storage, preprocessing, and training of such data present significant challenges. To address these challenges, dataset distillation techniques have been proposed to obtain smaller synthetic datasets that capture the essential information of the originals. In this paper, we introduce a Sparse Parameterization for Epitomic datasEt Distillation (SPEED) framework, which leverages the concept of dictionary learning and sparse coding to distill epitomes that represent pivotal information of the dataset. SPEED prioritizes proper parameterization of the synthetic dataset and introduces techniques to capture spatial redundancy within and between synthetic images. We propose Spatial-Agnostic Epitomic Tokens (SAETs) and Sparse Coding Matrices (SCMs) to efficiently represent and select significant features. Additionally, we build a Feature-Recurrent Network (FReeNet) to generate hierarchical features with high compression and storage efficiency. Experimental results demonstrate the superiority of SPEED in handling high-resolution datasets, achieving state-of-the-art performance on multiple benchmarks and downstream applications. Our framework is compatible with a variety of dataset matching approaches, generally enhancing their performance. This work highlights the importance of proper parameterization in epitomic dataset distillation and opens avenues for efficient representation learning. Source code is available at https://github.com/MIV-XJTU/SPEED.

Towards Efficient and Accurate Winograd Convolution via Full Quantization
Chen Tianqi Weixiang Xu Weihan Chen Peisong Wang Jian Cheng



Research question: How to improve the computational efficiency of Winograd convolution.
Motivation: Although post-training quantization has the advantage of low computational cost, applying it to Winograd convolution causes a severe accuracy drop; moreover, most existing methods quantize only the element-wise multiplication stage, leaving a considerable portion of calculations in full precision.
Method: PTQ-Aware Winograd (PAW), which optimizes the different transformation procedures collaboratively under a unified objective, explores full quantization of faster Winograd (tile size ≥ 4) for the first time, and adds a hardware-friendly Factorized Scale Quantization (FSQ) that effectively balances the significant range differences in the Winograd domain.
Results: Experiments demonstrate the method's effectiveness; for example, with 8-bit quantization and a tile size of 6, it outperforms the previous Winograd PTQ method by 8.27% and 5.38% top-1 accuracy on ResNet-18 and ResNet-34, respectively.

The Winograd algorithm is an efficient convolution implementation, which performs calculations in the transformed domain. To further improve the computation efficiency, recent works propose to combine it with model quantization. Although Post-Training Quantization has the advantage of low computational cost and has been successfully applied in many other scenarios, a severe accuracy drop exists when utilizing it in Winograd convolution. Besides, despite the Winograd algorithm consisting of four stages, most existing methods only quantize the element-wise multiplication stage, leaving a considerable portion of calculations in full precision. In this paper, observing the inconsistency among different transformation procedures, we present PTQ-Aware Winograd (PAW) to optimize them collaboratively under a unified objective function. Moreover, we explore the full quantization of faster Winograd (tile size $\geq4$) for the first time. We further propose a hardware-friendly method called Factorized Scale Quantization (FSQ), which can effectively balance the significant range differences in the Winograd domain. Experiments demonstrate the effectiveness of our method, e.g., with 8-bit quantization and a tile size of 6, our method outperforms the previous Winograd PTQ method by 8.27\% and 5.38\% in terms of the top-1 accuracy on ResNet-18 and ResNet-34, respectively.

Understanding Neural Network Binarization with Forward and Backward Proximal Quantizers
Yiwei Lu Yaoliang Yu Xinlin Li Vahid Partovi Nia



Research question: Examine, from an optimization perspective, the issues in BinaryConnect (BC) and its variants, the standard methods for neural network binarization.
Motivation: The derivative of the sign function is zero wherever it is defined, which freezes training, so implementations usually replace it with the identity or other approximate gradients when updating weights; although this works well in practice, it is largely a heuristic or "training trick."
Method: Building on existing ProxConnect theory (PC, a generalization of BC), the paper (1) equips PC with different forward-backward quantizers to obtain ProxConnect++ (PC++), which includes existing binarization techniques as special cases; (2) derives a principled way to synthesize forward-backward quantizers with automatic theoretical guarantees; (3) illustrates the theory with an enhanced binarization algorithm, BNN++; and (4) conducts image classification experiments on CNNs and vision transformers.
Results: Experiments empirically verify that BNN++ generally achieves competitive results when binarizing CNNs and vision transformers.

In neural network binarization, BinaryConnect (BC) and its variants are considered the standard. These methods apply the sign function in their forward pass and their respective gradients are backpropagated to update the weights. However, the derivative of the sign function is zero whenever defined, which consequently freezes training. Therefore, implementations of BC (e.g., BNN) usually replace the derivative of sign in the backward computation with identity or other approximate gradient alternatives. Although such practice works well empirically, it is largely a heuristic or ``training trick.'' We aim at shedding some light on these training tricks from the optimization perspective. Building from existing theory on ProxConnect (PC, a generalization of BC), we (1) equip PC with different forward-backward quantizers and obtain ProxConnect++ (PC++) that includes existing binarization techniques as special cases; (2) derive a principled way to synthesize forward-backward quantizers with automatic theoretical guarantees; (3) illustrate our theory by proposing an enhanced binarization algorithm BNN++; (4) conduct image classification experiments on CNNs and vision transformers, and empirically verify that BNN++ generally achieves competitive results on binarizing these models.
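
A minimal sketch of a forward-backward quantizer pair in this family: sign as the forward quantizer and the classic clipped-identity straight-through surrogate as the backward quantizer. PC++'s principled way of synthesizing such pairs is in the paper; this block only illustrates the mechanism.

```python
import torch

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)  # forward quantizer

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Backward quantizer: clipped identity. Gradients flow where |w| <= 1,
        # sidestepping the zero derivative of sign that would freeze training.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

w = torch.randn(4, requires_grad=True)
SignSTE.apply(w).sum().backward()
print(w.grad)  # 1 where |w| <= 1, else 0
```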

Model-enhanced Vector Index
Hailin Zhang Yujing Wang Qi Chen Ruiheng Chang Ting Zhang Ziming Miao Yingyan Hou Yang Ding Xupeng Miao Haonan Wang Bochen Pang Yuefeng Zhan Hao Sun Weiwei Deng Qi Zhang Fan Yang Xing Xie Mao Yang Bin CUI



Research question: How to improve the performance of embedding-based retrieval while keeping serving efficiency acceptable.
Motivation: Current deep retrieval solutions offer better model quality but are hindered by unacceptable serving latency and the inability to support document updates.
Method: Model-enhanced Vector Index (MEVI), which leverages the differentiable advantages of deep retrieval models while maintaining desirable serving efficiency. MEVI bridges sequence-to-sequence deep retrieval and embedding-based models with a Residual Quantization (RQ) codebook. To substantially reduce inference time, it first generates semantic virtual cluster ids of candidate documents in a small number of steps, then uses well-adapted embedding vectors for a fine-grained search of relevant documents within the candidate virtual clusters.
Results: Experiments show better performance on the commonly used academic benchmarks MSMARCO Passage and Natural Questions, with serving latency comparable to dense retrieval solutions.

Embedding-based retrieval methods construct vector indices to search for document representations that are most similar to the query representations. They are widely used in document retrieval due to low latency and decent recall performance. Recent research indicates that deep retrieval solutions offer better model quality, but are hindered by unacceptable serving latency and the inability to support document updates. In this paper, we aim to enhance the vector index with end-to-end deep generative models, leveraging the differentiable advantages of deep retrieval models while maintaining desirable serving efficiency. We propose Model-enhanced Vector Index (MEVI), a differentiable model-enhanced index empowered by a twin-tower representation model. MEVI leverages a Residual Quantization (RQ) codebook to bridge the sequence-to-sequence deep retrieval and embedding-based models. To substantially reduce the inference time, instead of decoding the unique document ids in long sequential steps, we first generate some semantic virtual cluster ids of candidate documents in a small number of steps, and then leverage the well-adapted embedding vectors to further perform a fine-grained search for the relevant documents in the candidate virtual clusters. We empirically show that our model achieves better performance on the commonly used academic benchmarks MSMARCO Passage and Natural Questions, with comparable serving latency to dense retrieval solutions.
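
The RQ codebook idea can be sketched as follows: each embedding is greedily encoded as a short sequence of codeword indices, one per level, which is exactly the kind of discrete "virtual cluster id" a sequence model can generate. Codebooks here are random placeholders; MEVI learns them.

```python
import numpy as np

rng = np.random.default_rng(0)
levels, K, d = 2, 16, 8
codebooks = rng.standard_normal((levels, K, d))  # placeholders; MEVI learns these

def rq_encode(x):
    """Greedy residual quantization: one codeword index per level."""
    residual, code = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))  # nearest codeword
        code.append(idx)
        residual -= cb[idx]
    return code, residual

x = rng.standard_normal(d)
code, residual = rq_encode(x)
print(code, np.linalg.norm(residual))  # short discrete id + remaining error
```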

Learning Large-scale Neural Fields via Context Pruned Meta-Learning
Jihoon Tack Subin Kim Sihyun Yu Jaeho Lee Jinwoo Shin Jonathan Richard Schwarz



Research question: An efficient optimization-based meta-learning technique for large-scale neural field training.
Motivation: Automated online selection of context points yields significant memory savings while improving model quality.
Method: Focus each learning step on the subset of data with the highest expected immediate improvement in model quality, enabling almost instantaneous modeling of global structure and rapid refinement of high-frequency details.
Results: Gradient re-scaling at meta-test time enables learning extremely high-quality neural fields in significantly shortened optimization procedures, with excellent reconstruction; an extensive empirical evaluation across multiple datasets achieves state-of-the-art results.

We introduce an efficient optimization-based meta-learning technique for large-scale neural field training by realizing significant memory savings through automated online context point selection. This is achieved by focusing each learning step on the subset of data with the highest expected immediate improvement in model quality, resulting in the almost instantaneous modeling of global structure and subsequent refinement of high-frequency details. We further improve the quality of our meta-learned initialization by introducing a bootstrap correction resulting in the minimization of any error introduced by reduced context sets while simultaneously mitigating the well-known myopia of optimization-based meta-learning. Finally, we show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields in significantly shortened optimization procedures. Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals. We provide an extensive empirical evaluation on nine datasets across multiple modalities, demonstrating state-of-the-art results while providing additional insight through careful analysis of the algorithmic components constituting our method. Code is available at https://github.com/jihoontack/GradNCP

Reusing Pretrained Models by Multi-linear Operators for Efficient Training
Yu Pan Ye Yuan Yichun Yin Zenglin Xu Lifeng Shang Xin Jiang Qun Liu



Research question: How to use pretrained models effectively to accelerate the training of large models.
Motivation: Existing initialization methods map only partial weights from pretrained models, ignoring potential correlations across the entire model, which leads to incomplete information and inadequate training gains.
Method: A new method that linearly correlates each weight of the target model with all weights of the pretrained model to enhance acceleration, using multi-linear operators to reduce computational and spatial complexity.
Results: Experiments show the method significantly speeds up training at acceptable resource cost and outperforms existing methods on several tasks.

Training large models from scratch usually costs a substantial amount of resources. Towards this problem, recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model (termed the ``target model''), leading to a considerable acceleration in training. Despite the successes of these previous studies, they grew pretrained models by mapping partial weights only, ignoring potential correlations across the entire model. As we show in this paper, there are inter- and intra-interactions among the weights of both the pretrained and the target models. As a result, the partial mapping may not capture the complete information and lead to inadequate growth. In this paper, we propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model to further enhance acceleration ability. We utilize multi-linear operators to reduce computational and spatial complexity, enabling acceptable resource requirements. Experiments demonstrate that our method can save 76\% computational costs on DeiT-base transferred from DeiT-small, which outperforms bert2BERT by +12\% and LiGO by +21\%, respectively.

An Efficient and Robust Framework for Approximate Nearest Neighbor Search with Attribute Constraint
Mengzhao Wang Lingwei Lv Xiaoliang Xu Yuxiang Wang Qiang Yue Jiongkang Ni



Research question: An efficient and robust framework for hybrid query (HQ) processing that combines approximate nearest neighbor search (ANNS) with attribute constraints.
Motivation: Existing methods handle ANNS and attribute filtering separately, leading to inefficiency and inaccuracy.
Method: The proposed native hybrid query (NHQ) framework builds a composite index based on a proximity graph (PG) and applies joint pruning for HQ processing; two new navigable PGs (NPGs) with optimized edge selection and routing further improve overall ANNS performance.
Results: Five HQ methods based on the proposed NPGs and existing PGs are implemented within NHQ and outperform state-of-the-art methods on 10 real-world datasets (up to 315x faster at the same accuracy).

This paper introduces an efficient and robust framework for hybrid query (HQ) processing, which combines approximate nearest neighbor search (ANNS) with attribute constraint. HQ aims to find objects that are similar to a feature vector and match some structured attributes. Existing methods handle ANNS and attribute filtering separately, leading to inefficiency and inaccuracy. Our framework, called native hybrid query (NHQ), builds a composite index based on proximity graph (PG) and applies joint pruning for HQ. We can easily adapt existing PGs to this framework for efficient HQ processing. We also propose two new navigable PGs (NPGs) with optimized edge selection and routing, which improve the overall ANNS performance. We implement five HQ methods based on the proposed NPGs and existing PGs in NHQ, and show that they outperform the state-of-the-art methods on 10 real-world datasets (up to 315$\times$ faster with the same accuracy).

MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition
Nicolas Menet Michael Hersche Geethan Karunaratne Luca Benini Abu Sebastian Abbas Rahimi



Research question: How to lower the inference cost of deep learning while handling multiple inputs at once.
Motivation: Exploit the capacity of large deep neural networks by computing in superposition, reducing the computational burden per input.
Method: Multiple-Input-Multiple-Output Neural Networks (MIMONets) process many inputs at once. By augmenting various deep architectures with variable binding mechanisms, MIMONets represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Nonlinear neural transformations are then adapted to process this data structure holistically, yielding a speedup nearly proportional to the number of superposed inputs; after processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic accuracy-throughput trade-off by switching on demand among operating points within a single set of fixed parameters.
Results: Applying the concept to CNN and Transformer architectures yields MIMOConv and MIMOFormer, respectively. MIMOConv achieves roughly 2-4x speedup over WideResNet CNNs on CIFAR10 and CIFAR100 at an accuracy delta within [+0.68, -3.18]%; MIMOFormer handles 2-4 inputs at once while keeping average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Mathematical bounds on the interference between superposition channels in MIMOFormer are also provided.

With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves $\approx 2$–$4\times$ speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle $2$–$4$ inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
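
A toy NumPy sketch of computation in superposition with holographic reduced representations: two inputs are bound to random keys by circular convolution, added into one fixed-width vector, and later unbound with an approximate inverse. This binding choice is one standard option, not necessarily the paper's; the crosstalk between channels shows up in the recovered correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
bind = lambda a, b: np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=d)
inv = lambda k: np.roll(k[::-1], 1)               # approximate inverse of a key
unbind = lambda s, k: bind(s, inv(k))

keys = rng.standard_normal((2, d)) / np.sqrt(d)   # ~unit-norm random keys
x1, x2 = rng.standard_normal((2, d))
s = bind(x1, keys[0]) + bind(x2, keys[1])         # two inputs, one vector

x1_hat = unbind(s, keys[0])                       # x1 plus crosstalk noise
print(np.corrcoef(x1, x1_hat)[0, 1])              # well above chance
```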

FedL2P: Federated Learning to Personalize
Royson Lee Minyoung Kim Da Li Xinchi Qiu Timothy Hospedales Ferenc Huszár Nicholas Donald Lane



Research question: How to learn a personalization strategy for each client in federated learning.
Motivation: Different federated learning problems may require different personalization strategies, and no single effective one-size-fits-all strategy can be defined for all clients.
Method: Meta-nets infer per-client batch-norm and learning-rate parameters from local data statistics, and these meta-nets are themselves learned via federated learning.
Results: Empirical results show the framework outperforms a range of standard hand-crafted personalization baselines under both label and feature shift.

Federated learning (FL) research has made progress in developing algorithms for distributed learning of global models, as well as algorithms for local personalization of those common models to the specifics of each client’s local data distribution. However, different FL problems may require different personalization strategies, and it may not even be possible to define an effective one-size-fits-all personalization strategy for all clients: Depending on how similar each client’s optimal predictor is to that of the global model, different personalization strategies may be preferred. In this paper, we consider the federated meta-learning problem of learning personalization strategies. Specifically, we consider meta-nets that induce the batch-norm and learning rate parameters for each client given local data statistics. By learning these meta-nets through FL, we allow the whole FL network to collaborate in learning a customized personalization strategy for each client. Empirical results show that this framework improves on a range of standard hand-crafted personalization baselines in both label and feature shift situations.

CamoPatch: An Evolutionary Strategy for Generating Camoflauged Adversarial Patches
Phoenix Neale Williams Ke Li



Research question: The vulnerability of deep neural networks (DNNs) to adversarial examples raises concerns about their reliability in safety-critical applications.
Motivation: While most existing methods generate adversarial examples by modifying the entire image, recent research shows that adversarial patches are a more practical and effective alternative.
Method: A new method for constructing adversarial patches that approximates the appearance of the area it covers using a set of semi-transparent, RGB-valued circles, thereby minimizing the patch's visibility.
Results: The method achieves better or comparable performance to state-of-the-art methods on ImageNet DNN classifiers while achieving a smaller distance from the original image, further highlighting the vulnerability of DNNs to adversarial patches.

Deep neural networks (DNNs) have demonstrated vulnerabilities to adversarial examples, which raises concerns about their reliability in safety-critical applications. While the majority of existing methods generate adversarial examples by making small modifications to the entire image, recent research has proposed a practical alternative known as adversarial patches. Adversarial patches have shown to be highly effective in causing DNNs to misclassify by distorting a localized area (patch) of the image. However, existing methods often produce clearly visible distortions since they do not consider the visibility of the patch. To address this, we propose a novel method for constructing adversarial patches that approximates the appearance of the area it covers. We achieve this by using a set of semi-transparent, RGB-valued circles, drawing inspiration from the computational art community. We utilize an evolutionary strategy to optimize the properties of each shape, and employ a simulated annealing approach to optimize the patch's location. Our approach achieves better or comparable performance to state-of-the-art methods on ImageNet DNN classifiers while achieving a lower $l_2$ distance from the original image. By minimizing the visibility of the patch, this work further highlights the vulnerabilities of DNNs to adversarial patches.
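
The patch representation is easy to sketch: a list of (center, radius, colour, alpha) tuples alpha-composited onto a canvas, which is the genome an evolutionary strategy would mutate and select on. The bounds and the number of circles below are illustrative.

```python
import numpy as np

def render(circles, size=32):
    """Alpha-composite semi-transparent RGB circles onto a blank canvas."""
    canvas = np.zeros((size, size, 3))
    yy, xx = np.mgrid[:size, :size]
    for cx, cy, rad, r, g, b, alpha in circles:
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= rad ** 2
        canvas[mask] = (1 - alpha) * canvas[mask] + alpha * np.array([r, g, b])
    return canvas

rng = np.random.default_rng(0)
# Genome: 10 circles, each (cx, cy, radius, r, g, b, alpha).
circles = rng.uniform(0, 1, (10, 7)) * [32, 32, 8, 1, 1, 1, 1]
print(render(circles).shape)
```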

Boosting Learning for LDPC Codes to Improve the Error-Floor Performance
Hee-Youl Kwak Dae-Young Yun Yongjune Kim Sang-Hyo Kim Jong-Seon No



Research question: How to eliminate the error-floor phenomenon of low-density parity-check (LDPC) codes, so as to reach extremely low error rates and enable deployment in scenarios demanding ultra-high reliability.
Motivation: Although LDPC codes have been successfully commercialized in communication systems thanks to their strong error correction and simple decoding, their error floor remains an obstacle to extremely low error rates and ultra-reliable deployment.
Method: Training methods for neural min-sum (NMS) decoders that remove the error-floor effect. First, leveraging the boosting learning technique of ensemble networks, the decoding network is divided into two neural decoders and the post decoder is trained to specialize on uncorrected words the first decoder fails to correct. Second, to address vanishing gradients in training, a block-wise training schedule locally trains one block of weights while retraining the preceding block. Finally, assigning different weights to unsatisfied check nodes effectively lowers the error floor with a minimal number of weights.
Results: Applied to standard LDPC codes, these training methods achieve the best error-floor performance among decoding methods; the proposed NMS decoder, optimized solely through the novel training methods without additional modules, can be integrated into existing LDPC decoders without extra hardware cost.

Low-density parity-check (LDPC) codes have been successfully commercialized in communication systems due to their strong error correction capabilities and simple decoding process. However, the error-floor phenomenon of LDPC codes, in which the error rate stops decreasing rapidly at a certain level, presents challenges for achieving extremely low error rates and deploying LDPC codes in scenarios demanding ultra-high reliability. In this work, we propose training methods for neural min-sum (NMS) decoders to eliminate the error-floor effect. First, by leveraging the boosting learning technique of ensemble networks, we divide the decoding network into two neural decoders and train the post decoder to be specialized for uncorrected words that the first decoder fails to correct. Secondly, to address the vanishing gradient issue in training, we introduce a block-wise training schedule that locally trains a block of weights while retraining the preceding block. Lastly, we show that assigning different weights to unsatisfied check nodes effectively lowers the error-floor with a minimal number of weights. By applying these training methods to standard LDPC codes, we achieve the best error-floor performance compared to other decoding methods. The proposed NMS decoder, optimized solely through novel training methods without additional modules, can be integrated into existing LDPC decoders without incurring extra hardware costs. The source code is available at https://github.com/ghy1228/LDPC_Error_Floor.

Softmax Output Approximation for Activation Memory-Efficient Training of Attention-based Networks
Changhyeon Lee Seulki Lee



Research question: How to reduce the activation memory of attention modules when training attention-based networks such as Transformers.
Motivation: Most attention-based models rely heavily on the softmax-based attention module, which usually takes one of the largest portions of the network, so reducing its memory requirement effectively lowers training cost.
Method: Approximate the softmax output: store only a small fraction of the full softmax output needed for back-propagation and evict the rest from memory; during the backward pass, approximate the evicted softmax activations to compose the gradient for model training.
Results: Experiments on machine translation, text classification, and sentiment analysis show the method cuts the activation memory of the softmax-based attention module by up to 84% (6.2x less training memory) while achieving comparable or better performance, e.g., up to 5.4% higher classification accuracy.

In this paper, we propose to approximate the softmax output, which is the key product of the attention mechanism, to reduce its activation memory usage when training attention-based networks (aka Transformers). During the forward pass of the network, the proposed softmax output approximation method stores only a small fraction of the entire softmax output required for back-propagation and evicts the rest of the softmax output from memory. Then, during the backward pass, the evicted softmax activation output is approximated to compose the gradient to perform back-propagation for model training. Considering most attention-based models heavily rely on the softmax-based attention module that usually takes one of the biggest portions of the network, approximating the softmax activation output can be a simple yet effective way to decrease the training memory requirement of many attention-based networks. The experiment with various attention-based models and relevant tasks, i.e., machine translation, text classification, and sentiment analysis, shows that it curtails the activation memory usage of the softmax-based attention module by up to 84% (6.2× less memory) in model training while achieving comparable or better performance, e.g., up to 5.4% higher classification accuracy.
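
A minimal sketch of the eviction idea: keep only the largest softmax outputs plus a single scalar of evicted probability mass, and spread that mass uniformly when reconstructing for the backward pass. The keep ratio and the uniform fill-in are our illustrative choices, not necessarily the paper's approximation.

```python
import numpy as np

def compress_softmax(p, keep=0.25):
    """Keep the top fraction of softmax outputs; summarize the rest as one scalar."""
    k = max(1, int(keep * p.size))
    idx = np.argpartition(p, -k)[-k:]            # indices of kept outputs
    return idx, p[idx], p.sum() - p[idx].sum()   # evicted probability mass

def reconstruct(idx, vals, evicted_mass, n):
    p_hat = np.full(n, evicted_mass / (n - len(idx)))  # spread evicted mass
    p_hat[idx] = vals
    return p_hat

p = np.random.default_rng(0).dirichlet(np.ones(16))
idx, vals, rest = compress_softmax(p)
print(np.abs(p - reconstruct(idx, vals, rest, p.size)).max())  # small error
```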

A Computationally Efficient Sparsified Online Newton Method
Fnu Devvrit Sai Surya Duvvuri Rohan Anil Vineet Gupta Cho-Jui Hsieh Inderjit S Dhillon



Research question: How to make second-order optimization practical for training large models despite its heavy memory and computational demands.
Motivation: Second-order methods promise better convergence in deep neural network training, but their large memory and computational requirements limit their practicality, so scalable second-order methods are needed.
Method: The Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that produces a sparsified yet effective preconditioner. It emerges from a novel use of the LogDet matrix divergence measure, combined with sparsity constraints to minimize regret in the online convex optimization framework.
Results: On large benchmarks of up to 1B parameters, the method converges up to 30% faster than memory-efficient optimizers including first-order methods, with 3.4% relative improvement in validation performance and 80% relative improvement in training loss; it is also straightforward to implement and as parallelizable as first-order methods.

Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton~(SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to $30\%$ faster convergence, $3.4\%$ relative improvement in validation performance, and $80\%$ relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. Powering the method is a surprising fact -- imposing structured sparsity patterns, like tridiagonal and banded structure, requires little to no overhead, making it as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about $3\%$ slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew

LambdaBeam: Neural Program Search with Higher-Order Functions and Lambdas
Kensen Shi Hanjun Dai Wen-Ding Li Kevin Ellis Charles Sutton



Research question: Existing program synthesis search methods cannot handle iterative loops, higher-order functions, or lambda functions.
Motivation: Neural models are effective at guiding program synthesis searches, but their inability to handle complex function structure limits them to shorter, less general programs.
Method: A search algorithm called LambdaBeam that can construct arbitrary lambda functions composing operations within a given DSL. It creates semantic vector representations of the lambda functions' execution behavior and trains a neural policy network to choose which lambdas to construct during search, passing them as arguments to higher-order functions to perform looping computations.
Results: Experiments show LambdaBeam outperforms neural, symbolic, and LLM-based techniques in an integer list manipulation domain.

Search is an important technique in program synthesis that allows for adaptive strategies such as focusing on particular search directions based on execution results. Several prior works have demonstrated that neural models are effective at guiding program synthesis searches. However, a common drawback of those approaches is the inability to handle iterative loops, higher-order functions, or lambda functions, thus limiting prior neural searches from synthesizing longer and more general programs. We address this gap by designing a search algorithm called LambdaBeam that can construct arbitrary lambda functions that compose operations within a given DSL. We create semantic vector representations of the execution behavior of the lambda functions and train a neural policy network to choose which lambdas to construct during search, and pass them as arguments to higher-order functions to perform looping computations. Our experiments show that LambdaBeam outperforms neural, symbolic, and LLM-based techniques in an integer list manipulation domain.

Accelerated On-Device Forward Neural Network Training with Module-Wise Descending Asynchronism
Xiaohan Zhao Hualin Zhang Zhouyuan Huo Bin Gu



Research question: How to overcome memory constraints when optimizing or fine-tuning deep learning models on edge devices.
Motivation: Training deep models on edge devices currently relies heavily on backpropagation, whose high memory usage calls for a reassessment of its dominance.
Method: Forward gradient descent (FGD) is proposed as a potential solution to the memory capacity limits of on-device learning. To overcome FGD's inter-layer dependencies, which hinder parallel computation, the AsyncFGD framework decouples dependencies, uses module-wise stale parameters, and maximizes parallel computation.
Results: Empirical evaluation on NVIDIA's AGX Orin, a popular embedded device, shows that AsyncFGD reduces memory consumption and improves hardware efficiency, offering a new approach to on-device learning.

On-device learning faces memory constraints when optimizing or fine-tuning on edge devices with limited resources. Current techniques for training deep models on edge devices rely heavily on backpropagation. However, its high memory usage calls for a reassessment of its dominance. In this paper, we propose forward gradient descent (FGD) as a potential solution to overcome the memory capacity limitation in on-device learning. However, FGD's dependencies across layers hinder parallel computation and can lead to inefficient resource utilization. To mitigate this limitation, we propose AsyncFGD, an asynchronous framework that decouples dependencies, utilizes module-wise stale parameters, and maximizes parallel computation. We demonstrate its convergence to critical points through rigorous theoretical analysis. Empirical evaluations conducted on NVIDIA's AGX Orin, a popular embedded device, show that AsyncFGD reduces memory consumption and enhances hardware efficiency, offering a novel approach to on-device learning.
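
The forward-gradient primitive underlying FGD can be sketched with PyTorch's forward-mode autodiff (assuming PyTorch >= 2.0 for `torch.func.jvp`): sample a random tangent, compute one exact directional derivative in a single forward pass, and use it to form an unbiased gradient estimate without backpropagation.

```python
import torch
from torch.func import jvp  # forward-mode autodiff, PyTorch >= 2.0

def loss_fn(w, x, y):
    return ((x @ w - y) ** 2).mean()

w = torch.randn(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

v = torch.randn_like(w)                                 # random tangent
_, dir_deriv = jvp(lambda w_: loss_fn(w_, x, y), (w,), (v,))
g_hat = dir_deriv * v      # E[g_hat] = true gradient when v ~ N(0, I)
w = w - 0.1 * g_hat        # SGD-style update with no backward pass
```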

Block Low-Rank Preconditioner with Shared Basis for Stochastic Optimization
Jui-Nan Yen Sai Surya Duvvuri Inderjit S Dhillon Cho-Jui Hsieh



Research question: How to reduce the computational complexity and memory requirements of adaptive methods so that they scale to modern neural network architectures.
Motivation: Adaptive methods with non-diagonal preconditioning achieve excellent results on various tasks, but their high computational and memory costs limit their applicability.
Method: Approximate each diagonal block of the second moment matrix by low-rank matrices and enforce a shared basis for the blocks within each layer, reducing time and memory complexity.
Results: On a deep autoencoder and a transformer benchmark, the method outperforms first-order methods with slightly more time and memory usage, while achieving competitive or superior performance compared to other second-order methods at lower cost.

Adaptive methods with non-diagonal preconditioning have shown state-of-the-art results on various tasks. However, their computational complexity and memory requirement makes it challenging to scale these methods to modern neural network architectures. To address this challenge, some previous works have adopted block-diagonal preconditioners. However, the memory cost of storing the block-diagonal matrix remains substantial, leading to the use of smaller block sizes and ultimately resulting in suboptimal performance. To reduce the time and memory complexity without sacrificing performance, we propose approximating each diagonal block of the second moment matrix by low-rank matrices and enforcing the same basis for the blocks within each layer. We provide theoretical justification for such sharing and design an algorithm to efficiently maintain this shared-basis block low-rank approximation during training. Our results on a deep autoencoder and a transformer benchmark demonstrate that the proposed method outperforms first-order methods with slightly more time and memory usage, while also achieving competitive or superior performance compared to other second-order methods with less time and memory usage.

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks
Ryan Sullivan Akarsh Kumar Shengyi Huang John P Dickerson Joseph Suarez



Research question: Most reinforcement learning methods rely heavily on dense, well-normalized environment rewards; DreamerV3 introduced a model-based method with a number of tricks that mitigate these limitations and achieved state-of-the-art performance across a wide range of benchmarks.
Motivation: Whether DreamerV3's tricks generalize to other reinforcement learning algorithms has been a subject of discussion; this work applies them to PPO in the first such empirical study.
Method: A high-quality PPO reference implementation is used, with extensive ablation studies totaling over 10,000 A100 hours on the Arcade Learning Environment and the DeepMind Control Suite.
Results: The tricks do not universally outperform PPO, but the study identifies cases where they succeed and offers insight into the relationships among implementation tricks. In particular, PPO with these tricks performs comparably to PPO on Atari games with reward clipping and significantly outperforms PPO without reward clipping.

Most reinforcement learning methods rely heavily on dense, well-normalized environment rewards. DreamerV3 recently introduced a model-based method with a number of tricks that mitigate these limitations, achieving state-of-the-art on a wide range of benchmarks with a single set of hyperparameters. This result sparked discussion about the generality of the tricks, since they appear to be applicable to other reinforcement learning algorithms. Our work applies DreamerV3's tricks to PPO and is the first such empirical study outside of the original work. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We use a high quality PPO reference implementation and present extensive ablation studies totaling over 10,000 A100 hours on the Arcade Learning Environment and the DeepMind Control Suite. Though our experiments demonstrate that these tricks do not generally outperform PPO, we identify cases where they succeed and offer insight into the relationship between the implementation tricks. In particular, PPO with these tricks performs comparably to PPO on Atari games with reward clipping and significantly outperforms PPO without reward clipping.
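
One of the DreamerV3 ingredients discussed in this line of work is the symlog squashing of wide-ranging targets; as a point of reference, the transform and its inverse are one-liners (this is the generic transform only, not the paper's full trick set):

```python
import numpy as np

def symlog(x):
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):  # inverse of symlog
    return np.sign(x) * np.expm1(np.abs(x))

for v in [0.1, 10.0, -1000.0]:
    print(v, symlog(v), symexp(symlog(v)))  # round-trips across scales
```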

Towards a fuller understanding of neurons with Clustered Compositional Explanations
Biagio La Rosa Leilani H. Gilpin Roberto Capobianco



Research question: How to explain neuron behavior more completely than existing compositional explanation methods.
Motivation: Compositional Explanations are tied to the small spectrum of neuron activations (the highest ones) used to check alignment, and thus lack completeness.
Method: Clustered Compositional Explanations, a generalization that combines Compositional Explanations with clustering and a novel search heuristic to approximate a broader spectrum of neuron behavior.
Results: The method yields more comprehensive explanations; the paper analyzes the insights retrievable with the algorithm and proposes desiderata qualities for studying the explanations returned by different algorithms.

Compositional Explanations is a method for identifying logical formulas of concepts that approximate the neurons' behavior. However, these explanations are linked to the small spectrum of neuron activations (i.e., the highest ones) used to check the alignment, thus lacking completeness. In this paper, we propose a generalization, called Clustered Compositional Explanations, that combines Compositional Explanations with clustering and a novel search heuristic to approximate a broader spectrum of the neuron behavior. We define and address the problems connected to the application of these methods to multiple ranges of activations, analyze the insights retrievable by using our algorithm, and propose desiderata qualities that can be used to study the explanations returned by different algorithms.

A Hierarchical Spatial Transformer for Massive Point Samples in Continuous Space
Wenchong He Zhe Jiang Tingsong Xiao Zelin Xu Shigang Chen Ronald Fick MILES D MEDINA Christine Angelini



Research question: How to design a transformer model for massive point samples in continuous space.
Motivation: Existing transformers mostly target sequences, images or videos, and graphs; designing one for the massive continuous-space point data common in environmental science, numerical simulation, and location-based services is challenging.
Method: A new hierarchical spatial transformer with multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation, plus an uncertainty quantification branch that estimates prediction confidence under input feature noise and point sparsity.
Results: Experiments show the method outperforms multiple baselines in prediction accuracy, and the model scales to one million points on a single NVIDIA A100 GPU.

Transformers are widely used deep learning architectures. Existing transformers are mostly designed for sequences (texts or time series), images or videos, and graphs. This paper proposes a novel transformer model for massive (up to a million) point samples in continuous space. Such data are ubiquitous in environment sciences (e.g., sensor observations), numerical simulations (e.g., particle-laden flow, astrophysics), and location-based services (e.g., POIs and trajectories). However, designing a transformer for massive spatial points is non-trivial due to several challenges, including implicit long-range and multi-scale dependency on irregular points in continuous space, a non-uniform point distribution, the potential high computational costs of calculating all-pair attention across massive points, and the risks of over-confident predictions due to varying point density. To address these challenges, we propose a new hierarchical spatial transformer model, which includes multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation. We also design an uncertainty quantification branch to estimate prediction confidence related to input feature noise and point sparsity. We provide a theoretical analysis of computational time complexity and memory costs. Extensive experiments on both real-world and synthetic datasets show that our method outperforms multiple baselines in prediction accuracy and our model can scale up to one million points on one NVIDIA A100 GPU. The code is available at https://github.com/spatialdatasciencegroup/HST

Accelerating Monte Carlo Tree Search with Probability Tree State Abstraction
Yangqing Fu Ming Sun Buqing Nie Yue Gao



Research question: How to improve the search efficiency of Monte Carlo Tree Search (MCTS) algorithms.
Motivation: The computational complexity of MCTS-based algorithms depends on the size of the search space, so improvements are needed for efficiency.
Method: A probability tree state abstraction (PTSA) algorithm: a general tree state abstraction with path transitivity is defined, and the probabilistic variant reduces mistakes during the aggregation step.
Results: Integrated with state-of-the-art MCTS-based algorithms such as Sampled MuZero and Gumbel MuZero, the method accelerates their training across different tasks with a 10%-45% reduction in search space.

Monte Carlo Tree Search (MCTS) algorithms such as AlphaGo and MuZero have achieved superhuman performance in many challenging tasks. However, the computational complexity of MCTS-based algorithms is influenced by the size of the search space. To address this issue, we propose a novel probability tree state abstraction (PTSA) algorithm to improve the search efficiency of MCTS. A general tree state abstraction with path transitivity is defined. In addition, the probability tree state abstraction is proposed for fewer mistakes during the aggregation step. Furthermore, the theoretical guarantees of the transitivity and aggregation error bound are justified. To evaluate the effectiveness of the PTSA algorithm, we integrate it with state-of-the-art MCTS-based algorithms, such as Sampled MuZero and Gumbel MuZero. Experimental results on different tasks demonstrate that our method can accelerate the training process of state-of-the-art algorithms with 10%-45% search space reduction.

Mnemosyne: Learning to Train Transformers with Transformers
Deepali Jain Krzysztof Marcin Choromanski Kumar Avinava Dubey Sumeet Singh Vikas Sindhwani Tingnan Zhang Jie Tan



Research question: Whether a learnable optimizer can train entire neural network architectures without any task-specific optimizer tuning.
Motivation: Mnemosyne, a new class of learnable optimizers, is built on novel spatio-temporal low-rank implicit attention Transformers that can learn to train whole architectures, including other Transformers.
Method: With simple meta-training strategies, Mnemosyne successfully trains Transformers, while its space complexity is comparable to that of hand-designed first-order counterparts, allowing it to scale to training larger parameter sets.
Results: Experiments show strong performance on fine-tuning a wide range of vision Transformers, pre-training BERT models, and soft prompt-tuning large 11B+ T5XXL models.

In this work, we propose a new class of learnable optimizers, called Mnemosyne. It is based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature engineering to mitigate catastrophic forgetting of LSTMs), (b) can successfully train Transformers while using simple meta-training strategies that require minimal computational resources, (c) matches accuracy-wise SOTA hand-designed optimizers with carefully tuned hyper-parameters (often producing top performing models). Furthermore, Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters. We conduct an extensive empirical evaluation of Mnemosyne on: (a) fine-tuning a wide range of Vision Transformers (ViTs) from medium-size architectures to massive ViT-Hs (36 layers, 16 heads), (b) pre-training BERT models and (c) soft prompt-tuning large 11B+ T5XXL models. We complement our results with a comprehensive theoretical analysis of the compact associative memory used by Mnemosyne which we believe was never done before.

Leveraging Early-Stage Robustness in Diffusion Models for Efficient and High-Quality Image Synthesis
Yulhwa Kim Dongwon Jo Hyesung Jeon Taesu Kim Daehyun Ahn Hyungjun Kim jae-joon kim



Research question: How to address the heavy computation and slow sampling of diffusion models in image generation.
Motivation: Despite their excellent generative quality, diffusion models require a compute-intensive iterative noise estimation process with slow sampling, limiting practical deployment.
Method: A new approach that exploits the robustness of early-stage diffusion models to accelerate the noise estimation network: combined with post-training quantization (PTQ), low-bit activations are used during the early reverse diffusion process while high-bit activations are kept for the later stages.
Results: Experiments show the method accelerates early-stage computation without sacrificing the quality of the generated images.

While diffusion models have demonstrated exceptional image generation capabilities, the iterative noise estimation process required for these models is compute-intensive and their practical implementation is limited by slow sampling speeds. In this paper, we propose a novel approach to speed up the noise estimation network by leveraging the robustness of early-stage diffusion models. Our findings indicate that inaccurate computation during the early-stage of the reverse diffusion process has minimal impact on the quality of generated images, as this stage primarily outlines the image while later stages handle the finer details that require more sensitive information. To improve computational efficiency, we combine our findings with post-training quantization (PTQ) to introduce a method that utilizes low-bit activation for the early reverse diffusion process while maintaining high-bit activation for the later stages. Experimental results show that the proposed method can accelerate the early-stage computation without sacrificing the quality of the generated images.

An Efficient Dataset Condensation Plugin and Its Application to Continual Learning
Enneng Yang Li Shen Zhenyi Wang Tongliang Liu Guibing Guo



Research question: How to condense a large real-world dataset into a small synthetic one, such that a network trained on the latter performs similarly to one trained on the former.
Motivation: Existing dataset condensation methods ignore that natural images are locally connected and have low intrinsic dimension, resulting in low condensation efficiency.
Method: A simple yet effective dataset condensation plugin that matches the raw and synthetic datasets in a low-dimensional manifold; specifically, the plugin condenses raw images into two low-rank matrices instead of parameterized image matrices.
Results: Experiments show that combining the plugin with state-of-the-art DC methods significantly improves the performance of networks trained on synthetic data over traditional DC methods. Applied as a plugin to continual learning, the approach effectively mitigates catastrophic forgetting of old tasks under limited memory buffer constraints and avoids raw-data privacy leakage.

Dataset condensation (DC) distills a large real-world dataset into a small synthetic dataset, with the goal of training a network from scratch on the latter that performs similarly to the former. State-of-the-art (SOTA) DC methods have achieved satisfactory results through techniques such as accuracy, gradient, training trajectory, or distribution matching. However, these works all perform matching in the high-dimension pixel spaces, ignoring that natural images are usually locally connected and have lower intrinsic dimensions, resulting in low condensation efficiency. In this work, we propose a simple-yet-efficient dataset condensation plugin that matches the raw and synthetic datasets in a low-dimensional manifold. Specifically, our plugin condenses raw images into two low-rank matrices instead of parameterized image matrices. Our plugin can be easily incorporated into existing DC methods, thereby containing richer raw dataset information at limited storage costs to improve the downstream applications' performance. We verify on multiple public datasets that when the proposed plugin is combined with SOTA DC methods, the performance of the network trained on synthetic data is significantly improved compared to traditional DC methods. Moreover, when applying the DC methods as a plugin to continual learning tasks, we observed that our approach effectively mitigates catastrophic forgetting of old tasks under limited memory buffer constraints and avoids the problem of raw data privacy leakage.
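
The low-rank parameterization is simple enough to sketch directly: optimize two factors instead of the pixel matrix and hand the synthesized product to whatever condensation objective is in use. Sizes and names are illustrative.

```python
import torch

H, W, R = 32, 32, 4
A = torch.randn(H, R, requires_grad=True)  # tall factor
B = torch.randn(R, W, requires_grad=True)  # wide factor

def synthesize():
    return A @ B  # rank-R synthetic image, differentiable in the factors

# Any existing condensation objective can be applied to synthesize() while
# the optimizer updates only the factors:
loss = (synthesize() - torch.randn(H, W)).pow(2).mean()
loss.backward()
print(f"stored params: {R * (H + W)} vs {H * W} for a raw image")
```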

Efficient Low-rank Backpropagation for Vision Transformer Adaptation
Yuedong Yang Hung-Yueh Chiang Guihong Li Diana Marculescu Radu Marculescu



Research question: How to fine-tune vision transformers (ViT) efficiently for specific needs.
Motivation: Efficiently fine-tuning large ViT models across applications is a significant challenge, because backpropagation through linear layers requires computationally demanding matrix multiplications.
Method: A new Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method, which projects the gradient into a low-rank space and carries out backpropagation there, greatly reducing the computation needed to adapt ViT.
Results: Extensive experiments on different models and datasets confirm the method's effectiveness; for example, when adapting an EfficientFormer-L1 model on CIFAR100, LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline while requiring 9 MFLOPs less computation.

The increasing scale of vision transformers (ViT) has made the efficient fine-tuning of these large models for specific needs a significant challenge in various applications. This issue originates from the computationally demanding matrix multiplications required during the backpropagation process through linear layers in ViT. In this paper, we tackle this problem by proposing a new Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method. Intuitively, LBP-WHT projects the gradient into a low-rank space and carries out backpropagation. This approach substantially reduces the computation needed for adapting ViT, as matrix multiplication in the low-rank space is far less resource-intensive. We conduct extensive experiments with different models (ViT, hybrid convolution-ViT model) on multiple datasets to demonstrate the effectiveness of our method. For instance, when adapting an EfficientFormer-L1 model on CIFAR100, our LBP-WHT achieves 10.4\% higher accuracy than the state-of-the-art baseline, while requiring 9 MFLOPs less computation. As the first work to accelerate ViT adaptation with low-rank backpropagation, our LBP-WHT method is complementary to many prior efforts and can be combined with them for better performance.
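
A rough sketch of the projection step, assuming SciPy for the Hadamard matrix: gradients are projected onto a few orthonormal Walsh-Hadamard rows, backpropagation happens in that low-rank space, and the result is lifted back. The rank and shapes are illustrative, and the real method uses fast transforms rather than explicit matrices.

```python
import numpy as np
from scipy.linalg import hadamard

n, r = 64, 8
H = hadamard(n) / np.sqrt(n)     # orthonormal Walsh-Hadamard basis
g = np.random.default_rng(0).standard_normal((n, 32))  # upstream gradient

g_low = H[:r] @ g                # project: backprop happens in r dims
g_back = H[:r].T @ g_low         # lift back: low-rank approximation of g
print(np.linalg.norm(g - g_back) / np.linalg.norm(g))  # approximation error
```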

Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism
Chengcheng Wang Wei He Ying Nie Jianyuan Guo Chuanjian Liu Yunhe Wang Kai Han



Research question: How to improve the YOLO series, the dominant models in real-time object detection, by solving the information fusion problem.
Motivation: Although FPN and PANet alleviate the information fusion problem, existing models still suffer from it.
Method: An advanced Gather-and-Distribute (GD) mechanism, realized with convolution and self-attention operations. The resulting Gold-YOLO model boosts multi-scale feature fusion and achieves an ideal balance between latency and accuracy.
Results: Gold-YOLO-N attains 39.9% AP on the COCO val2017 dataset and 1030 FPS on a T4 GPU, outperforming the previous SOTA model YOLOv6-3.0-N by 2.4% at similar FPS.

In the past years, YOLO-series models have emerged as the leading approaches in the area of real-time object detection. Many studies pushed up the baseline to a higher level by modifying the architecture, augmenting data and designing new losses. However, we find previous models still suffer from the information fusion problem, although Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) have alleviated this. Therefore, this study provides an advanced Gather-and-Distribute (GD) mechanism, which is realized with convolution and self-attention operations. The newly designed model, named Gold-YOLO, boosts the multi-scale feature fusion capabilities and achieves an ideal balance between latency and accuracy across all model scales. Additionally, we implement MAE-style pretraining in the YOLO-series for the first time, allowing YOLO-series models to benefit from unsupervised pretraining. Gold-YOLO-N attains an outstanding 39.9% AP on the COCO val2017 dataset and 1030 FPS on a T4 GPU, which outperforms the previous SOTA model YOLOv6-3.0-N with similar FPS by +2.4%. The PyTorch code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/Gold_YOLO.

Unbiased Compression Saves Communication in Distributed Optimization: When and How Much?
Yutong He Xinmeng Huang Kun Yuan



Research question: Whether unbiased compression can reduce the total communication cost in distributed optimization.
Motivation: Compression alleviates communication overhead but introduces information distortion, slowing convergence and requiring more communication rounds to reach a desired solution, so it is unclear whether compression truly lowers the total communication cost.
Method: The first theoretical formulation characterizing the total communication cost of distributed optimization with communication compression, showing how and by how much independent unbiased compressors reduce it.
Results: Unbiased compression reduces the total communication cost if the compressors used by all workers are independent; experimental results support this finding.

Communication compression is a common technique in distributed optimization that can alleviate communication overhead by transmitting compressed gradients and model parameters. However, compression can introduce information distortion, which slows down convergence and incurs more communication rounds to achieve desired solutions. Given the trade-off between lower per-round communication costs and additional rounds of communication, it is unclear whether communication compression reduces the total communication cost. This paper explores the conditions under which unbiased compression, a widely used form of compression, can reduce the total communication cost, as well as the extent to which it can do so. To this end, we present the first theoretical formulation for characterizing the total communication cost in distributed optimization with communication compression. We demonstrate that unbiased compression alone does not necessarily save the total communication cost, but this outcome can be achieved if the compressors used by all workers are further assumed independent. We establish lower bounds on the communication rounds required by algorithms using independent unbiased compressors to minimize smooth convex functions, and show that these lower bounds are tight by refining the analysis for ADIANA. Our results reveal that using independent unbiased compression can reduce the total communication cost by a factor of up to $\Theta(\\sqrt{\\min\\{n, \\kappa\\}})$, where $n$ is the number of workers and $\\kappa$ is the condition number of the functions being minimized. These theoretical findings are supported by experimental results.
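
A rand-k compressor makes the unbiasedness condition concrete: each worker keeps k random coordinates and rescales them so the compressed vector matches the original in expectation, and the independence assumption in the analysis corresponds to each worker drawing its own mask. A toy check:

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased sparsifier: keep k random coordinates, rescale by d/k."""
    idx = rng.choice(x.size, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (x.size / k)   # rescaling makes E[C(x)] = x
    return out

rng = np.random.default_rng(0)  # each worker would hold its own rng
x = np.arange(8.0)
avg = np.mean([rand_k(x, 2, rng) for _ in range(20000)], axis=0)
print(avg.round(2))  # approximately x, confirming unbiasedness
```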

On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective
Zeke Xie zhiqiang xu Jingzhao Zhang Issei Sato Masashi Sugiyama



Research question: Identify and mitigate overlooked pitfalls of weight decay in training deep neural networks.
Motivation: Although weight decay is widely used, it can lead to large gradient norms at the final phase of training, which often indicates bad convergence and poor generalization.
Method: Scheduled Weight Decay (SWD), the first practical scheduler for weight decay, which dynamically adjusts the weight decay strength according to the gradient norm and significantly penalizes large gradient norms during training.
Results: Experiments show that SWD effectively mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy when used with the Adam optimizer.

Weight decay is a simple yet powerful regularization technique that has been very widely used in the training of deep neural networks (DNNs). While weight decay has attracted much attention, previous studies have failed to uncover some overlooked pitfalls concerning the large gradient norms caused by weight decay. In this paper, we discover that weight decay can unfortunately lead to large gradient norms at the final phase (or the terminated solution) of training, which often indicates bad convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method, which can dynamically adjust the weight decay strength according to the gradient norm and significantly penalize large gradient norms during training. Our experiments also support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
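
As a toy illustration only (the paper defines its own SWD schedule), one can picture a decay coefficient modulated by a running gradient-norm statistic, weakening the decay when gradient norms drift above their history; all constants and the rule below are made up:

```python
import numpy as np

class ToyScheduledDecay:
    """Toy only: couples the decay coefficient to a gradient-norm EMA."""
    def __init__(self, lam=1e-2, beta=0.9):
        self.lam, self.beta, self.avg = lam, beta, None

    def coeff(self, grad):
        gn = np.linalg.norm(grad)
        self.avg = gn if self.avg is None else self.beta * self.avg + (1 - self.beta) * gn
        # Weaken the decay when the current norm runs above its history,
        # easing the large-terminal-gradient-norm pitfall (made-up rule).
        return self.lam * self.avg / (gn + 1e-12)

sched = ToyScheduledDecay()
w, g = np.ones(4), np.full(4, 0.5)
w -= 0.1 * (g + sched.coeff(g) * w)  # decoupled decay with a scheduled lambda
```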

Two-Stage Predict+Optimize for MILPs with Unknown Parameters in Constraints
Xinyi HU Jasper C.H. Lee Jimmy H.M. Lee



Research question: Optimization problems in which some parameters are unknown at solving time and must be predicted from relevant features.
Motivation: Almost all existing frameworks handle unknowns only in the optimization objective, not in the constraints; the authors therefore propose a new framework, Two-Stage Predict+Optimize.
Method: The framework incorporates information about the optimization problem into the training process to yield better predictions, and a training algorithm applicable to all mixed integer linear programs vastly generalizes its applicability.
Results: Experiments demonstrate the superior prediction performance of the training framework over all classical and state-of-the-art methods.

Consider the setting of constrained optimization, with some parameters unknown at solving time and requiring prediction from relevant features. Predict+Optimize is a recent framework for end-to-end training supervised learning models for such predictions, incorporating information about the optimization problem in the training process in order to yield better predictions in terms of the quality of the predicted solution under the true parameters. Almost all prior works have focused on the special case where the unknowns appear only in the optimization objective and not the constraints. Hu et al. proposed the first adaptation of Predict+Optimize to handle unknowns appearing in constraints, but the framework has somewhat ad-hoc elements, and they provided a training algorithm only for covering and packing linear programs. In this work, we give a new simpler and more powerful framework called Two-Stage Predict+Optimize, which we believe should be the canonical framework for the Predict+Optimize setting. We also give a training algorithm usable for all mixed integer linear programs, vastly generalizing the applicability of the framework. Experimental results demonstrate the superior prediction performance of our training framework over all classical and state-of-the-art methods.

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms
Shenao Zhang Boyi Liu Zhaoran Wang Tuo Zhao



Research question: Model-based reparameterization policy gradient methods can face optimization difficulties in long-horizon reinforcement learning, such as exploding gradient variance and slow convergence.
Motivation: Although reparameterization is believed to yield low-variance gradient estimates in problems such as training deep generative models, it underperforms in long-horizon reinforcement learning; this motivates a deeper theoretical analysis of model-based reparameterization policy gradient methods.
Method: The convergence of model-based RP PGMs is analyzed, and the smoothness of the function approximators is identified as a major factor affecting the quality of gradient estimation; based on this analysis, a spectral normalization method is proposed to mitigate the exploding variance caused by long model unrolls.
Results: Experiments show that proper normalization significantly reduces the gradient variance of model-based RP PGMs, yielding performance comparable or superior to other gradient estimators, such as the likelihood ratio gradient estimator.

ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics. However, recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes with exploding gradient variance, which leads to slow convergence. This is in contrast to the conventional belief that reparameterization methods have low gradient estimation variance in problems such as training deep generative models. To comprehend this phenomenon, we conduct a theoretical examination of model-based RP PGMs and search for solutions to the optimization difficulties. Specifically, we analyze the convergence of the model-based RP PGMs and pinpoint the smoothness of function approximators as a major factor that affects the quality of gradient estimation. Based on our analysis, we propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls. Our experimental results demonstrate that proper normalization significantly reduces the gradient variance of model-based RP PGMs. As a result, the performance of the proposed method is comparable or superior to other gradient estimators, such as the Likelihood Ratio (LR) gradient estimator. Our code is available at https://github.com/agentification/RP_PGM.
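
The smoothness control can be sketched as standard spectral normalization with power iteration, capping a weight matrix's top singular value (and hence its layer's Lipschitz constant) at 1; the iteration count and the cap are illustrative choices.

```python
import numpy as np

def spectral_normalize(W, n_iters=20):
    """Divide W by its top singular value (if > 1), via power iteration."""
    u = np.random.default_rng(0).standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u; v /= np.linalg.norm(v)
        u = W @ v;  u /= np.linalg.norm(u)
    sigma = u @ W @ v                 # estimated leading singular value
    return W / max(sigma, 1.0)        # caps the layer's Lipschitz constant

W = np.random.default_rng(1).standard_normal((16, 16))
print(np.linalg.norm(spectral_normalize(W), 2))  # <= 1 (up to estimation error)
```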

Train 'n Trade: Foundations of Parameter Markets
Tzu-Heng Huang Harit Vishwakarma Frederic Sala



Research question: How to improve large-model training through parameter trading, lowering training cost and time.
Motivation: Training large models individually is costly and time-consuming; could others' expertise be leveraged by trading the constituent parts of models, i.e., sets of weights, as market commodities?
Method: A framework containing the infrastructure necessary for market operations, a study of strategies for exchanging parameters, and means for agents to monetize parameters.
Results: Experiments show that agents who use the market can mutually gain even in competitive settings, suggesting that parameter markets may be a useful paradigm for improving large-scale model training in the future.

Organizations typically train large models individually. This is costly and time-consuming, particularly for large-scale foundation models. Such vertical production is known to be suboptimal. Inspired by this economic insight, we ask whether it is possible to leverage others' expertise by trading the constituent parts in models, i.e., sets of weights, as if they were market commodities. While recent advances in aligning and interpolating models suggest that doing so may be possible, a number of fundamental questions must be answered to create viable parameter markets. In this work, we address these basic questions, propose a framework containing the infrastructure necessary for market operations to take place, study strategies for exchanging parameters, and offer means for agents to monetize parameters. Excitingly, compared to agents who train siloed models from scratch, we show that it is possible to mutually gain by using the market, even in competitive settings. This suggests that the notion of parameter markets may be a useful paradigm for improving large-scale model training in the future.

Model-Based Control with Sparse Neural Dynamics
Ziang Liu Genggeng Zhou Jeff He Tobia Marcucci Li Fei-Fei Jiajun Wu Yunzhu Li



Research question: How to learn predictive models from observations that are effective for many real-world planning and control problems.
Motivation: Common deep neural networks are too unstructured for effective planning, and current control methods typically rely on extensive sampling or local gradient descent.
Method: A new framework for integrated model learning and predictive control that is amenable to efficient optimization algorithms. Specifically, the system dynamics are first modeled with a ReLU neural network, which is then gradually sparsified by removing redundant neurons with minimal loss in prediction accuracy. This discrete sparsification process is approximated as a continuous problem, enabling end-to-end optimization of both the model architecture and the weight parameters. The sparsified model is subsequently used by a mixed-integer predictive controller that represents neuron activations as binary variables and employs efficient branch-and-bound algorithms.
Results: Experiments show that, despite the aggressive sparsification, the framework delivers better closed-loop performance than existing state-of-the-art methods.

Learning predictive models from observations using deep neural networks (DNNs) is a promising new approach to many real-world planning and control problems. However, common DNNs are too unstructured for effective planning, and current control methods typically rely on extensive sampling or local gradient descent. In this paper, we propose a new framework for integrated model learning and predictive control that is amenable to efficient optimization algorithms. Specifically, we start with a ReLU neural model of the system dynamics and, with minimal losses in prediction accuracy, we gradually sparsify it by removing redundant neurons. This discrete sparsification process is approximated as a continuous problem, enabling an end-to-end optimization of both the model architecture and the weight parameters. The sparsified model is subsequently used by a mixed-integer predictive controller, which represents the neuron activations as binary variables and employs efficient branch-and-bound algorithms. Our framework is applicable to a wide variety of DNNs, from simple multilayer perceptrons to complex graph neural dynamics. It can efficiently handle tasks involving complicated contact dynamics, such as object pushing, compositional object sorting, and manipulation of deformable objects. Numerical and hardware experiments show that, despite the aggressive sparsification, our framework can deliver better closed-loop performance than existing state-of-the-art methods.

Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training
Hanyang Peng Shuang Qin Yue Yu Jin Wang Hui Wang Ge Li



Research question: How to alleviate the communication bottleneck in distributed learning.
Motivation: Existing gradient compression algorithms have low theoretical communication complexity, but in practice their performance and efficiency fall short of uncompressed SGD-momentum and adaptive optimizers such as Adam.
Method: Birder, a novel 1-bit adaptive optimizer whose quantization is simple and light to compute and requires no warmup with an uncompressed version at the start; a Hierarchical-1-bit-All-Reduce is also devised to further lower the communication volume.
Results: Experiments show that Birder matches the inference performance of uncompressed SGDM/Adam while speeding up training by up to 2.5x for ResNet-50 and 6.3x for BERT-Base.

Various gradient compression algorithms have been proposed to alleviate the communication bottleneck in distributed learning, and they have demonstrated effectiveness in terms of high compression ratios and theoretical low communication complexity. However, when it comes to practically training modern deep neural networks (DNNs), these algorithms have yet to match the inference performance of uncompressed SGD-momentum (SGDM) and adaptive optimizers (e.g., Adam). More importantly, recent studies suggest that these algorithms actually offer no speed advantages over SGDM/Adam when used with common distributed DNN training frameworks (e.g., DistributedDataParallel (DDP)) in the typical settings, due to heavy compression/decompression computation or incompatibility with the efficient All-Reduce or the requirement of uncompressed warmup at the early stage. For these reasons, we propose a novel 1-bit adaptive optimizer, dubbed Binary randomization adaptive optimizer (Birder). The quantization of Birder can be easily and lightly computed, and it does not require warmup with its uncompressed version in the beginning. Also, we devise Hierarchical-1-bit-All-Reduce to further lower the communication volume. We theoretically prove that it promises the same convergence rate as Adam. Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to $2.5\times$ speedup for training ResNet-50 and $6.3\times$ speedup for training BERT-Base. Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.
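
Not the paper's quantizer, but a sketch of the generic "binary randomization" flavor: one shared scale plus a probabilistic sign per coordinate, chosen so the quantizer is unbiased.

```python
import numpy as np

def one_bit(x, rng):
    """One sign bit per coordinate plus one shared scale, unbiased."""
    s = np.abs(x).max() + 1e-12            # shared scale, a single float
    p = (x / s + 1) / 2                    # P(+1), so E[s * (2b - 1)] = x
    b = rng.random(x.shape) < p
    return s * (2 * b.astype(x.dtype) - 1)

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.05, -0.4])
est = np.mean([one_bit(x, rng) for _ in range(20000)], axis=0)
print(est.round(2))  # approximately x
```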

Fast Rank-1 Lattice Targeted Sampling for Black-box Optimization
Yueming Lyu



Research question: How to improve query efficiency on high-dimensional black-box optimization problems.
Motivation: For existing black-box optimization methods, query efficiency on high-dimensional problems remains a challenge.
Method: A novel Rank-1 Lattice Targeted Sampling (RLTS) technique: random rank-1 lattice Quasi-Monte Carlo enables fast, local, exact Gaussian process training and inference, and a fast coordinate search method further improves query efficiency.
Results: Experiments show that RLTS is more query-efficient than Bayesian optimization on high-dimensional problems and performs well on black-box prompt fine-tuning for large language models.

Black-box optimization has gained great attention for its success in recent applications. However, scaling up to high-dimensional problems with good query efficiency remains challenging. This paper proposes a novel Rank-1 Lattice Targeted Sampling (RLTS) technique to address this issue. Our RLTS benefits from random rank-1 lattice Quasi-Monte Carlo, which enables us to perform fast local exact Gaussian processes (GP) training and inference with $O(n \log n)$ complexity w.r.t. $n$ batch samples. Furthermore, we developed a fast coordinate searching method with $O(n \log n)$ time complexity for fast targeted sampling. The fast computation enables us to plug our RLTS into the sampling phase of stochastic optimization methods. This improves the query efficiency while scaling up to higher dimensional problems than Bayesian optimization. Moreover, to construct rank-1 lattices efficiently, we proposed a closed-form construction. Extensive experiments on challenging benchmark test functions and black-box prompt fine-tuning for large language models demonstrate the query efficiency of our RLTS technique.
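
A rank-1 lattice itself is a one-liner: n points in [0,1)^d generated from a single integer vector z via x_i = frac(i * z / n). The paper contributes a closed-form construction of good generating vectors; the z below is fixed by hand purely for illustration.

```python
import numpy as np

n, d = 64, 3
z = np.array([1, 19, 27])        # generating vector, chosen by hand here
i = np.arange(n)[:, None]
points = (i * z % n) / n         # rank-1 lattice: x_i = frac(i * z / n)
print(points[:4])
```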

Making Scalable Meta Learning Practical
Sang Keun Choe Sanket Vaibhav Mehta Hwijeen Ahn Willie Neiswanger Pengtao Xie Emma Strubell Eric Xing



Research question: Meta learning (learning to learn) offers the flexibility to learn diverse inductive biases in machine learning programs, but has long been considered hard to scale due to its enormous compute/memory costs, training instability, and lack of efficient distributed training support.
Motivation: This work addresses the scalability of meta learning by introducing SAMA, which combines advances in implicit differentiation algorithms and systems.
Method: SAMA is designed to flexibly support a broad range of adaptive optimizers at the base level of meta learning programs, while reducing the computational burden by avoiding explicit computation of second-order gradient information and exploiting efficient distributed training techniques built for first-order gradients.
Results: On several large-scale meta learning benchmarks, SAMA achieves up to 1.7x/4.8x higher throughput and 2.0x/3.8x lower memory consumption on single-/multi-GPU setups compared with baseline meta learning algorithms. SAMA-based data optimization also consistently improves text classification accuracy with BERT and RoBERTa, and achieves state-of-the-art results on small- and large-scale data pruning for image classification, demonstrating the practical applicability of scalable meta learning across language and vision domains.

Despite its flexibility to learn diverse inductive biases in machine learning programs, meta learning (i.e., learning to learn) has long been recognized to suffer from poor scalability due to its tremendous compute/memory costs, training instability, and a lack of efficient distributed training support. In this work, we focus on making scalable meta learning practical by introducing SAMA, which combines advances in both implicit differentiation algorithms and systems. Specifically, SAMA is designed to flexibly support a broad range of adaptive optimizers in the base level of meta learning programs, while reducing computational burden by avoiding explicit computation of second-order gradient information, and exploiting efficient distributed training techniques implemented for first-order gradients. Evaluated on multiple large-scale meta learning benchmarks, SAMA showcases up to 1.7/4.8x increase in throughput and 2.0/3.8x decrease in memory consumption respectively on single-/multi-GPU setups compared to other baseline meta learning algorithms. Furthermore, we show that SAMA-based data optimization leads to consistent improvements in text classification accuracy with BERT and RoBERTa large language models, and achieves state-of-the-art results in both small- and large-scale data pruning on image classification tasks, demonstrating the practical applicability of scalable meta learning across language and vision domains.

Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing
Wei Dong Dawei Yan Zhijun Lin Peng Wang



Research question: How to effectively adapt large pre-trained models to downstream tasks.
Motivation: Existing solutions focus on designing lightweight adapters and their interaction with pre-trained models, aiming to minimize the number of parameters that must be updated.
Method: Proposes an Adapter Re-Composing (ARC) strategy that considers the reusability of adaptation parameters and introduces a parameter-sharing scheme: symmetric down-/up-projections form bottleneck operations shared across layers, and learned low-dimensional re-scaling coefficients effectively re-compose layer-adaptive adapters.
Results: Experiments on 24 downstream image classification tasks show compelling transfer learning performance with a reduced parameter count.

The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to further reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at https://github.com/DavidYanAnDe/ARC.
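
A minimal PyTorch sketch of the parameter-sharing idea: one bottleneck projection shared across all layers, with each layer learning only a cheap low-dimensional re-scaling vector and a bias. The class name, bias handling, and initialization are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class ARCAdapter(nn.Module):
    """Sketch of an Adapter Re-Composing style adapter: a single
    down-projection is shared across layers (the up-projection is its
    transpose, i.e. symmetric), so each layer only adds a small
    re-scaling vector in the bottleneck plus a bias."""
    def __init__(self, dim, bottleneck, num_layers):
        super().__init__()
        self.shared_down = nn.Parameter(torch.randn(dim, bottleneck) * 0.02)
        self.scales = nn.Parameter(torch.ones(num_layers, bottleneck))
        self.biases = nn.Parameter(torch.zeros(num_layers, dim))

    def forward(self, x, layer_idx):
        h = x @ self.shared_down              # down-project (shared weights)
        h = h * self.scales[layer_idx]        # layer-specific re-scaling
        return x + h @ self.shared_down.t() + self.biases[layer_idx]

adapter = ARCAdapter(dim=768, bottleneck=32, num_layers=12)
out = adapter(torch.randn(4, 768), layer_idx=3)   # residual adapter output
```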

DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method
Ahmed Khaled Konstantin Mishchenko Chi Jin



Research question: This paper proposes DoWG (Distance over Weighted Gradients), an easy-to-implement, parameter-free gradient-based optimizer.
Motivation: Existing optimization algorithms require manual parameter tuning, whereas DoWG attains optimal convergence rates without tuning any parameters.
Method: DoWG maintains a distance-based weighted running average of the squared gradients, which is the key to achieving the desired properties.
Results: Experiments show that DoWG trains at the edge of stability, and its effectiveness is validated on practical machine learning tasks.

This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient---matching the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal---automatically adapting to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieve the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks.
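
Based on the paper's description, a minimal sketch of the DoWG update: track the maximum distance travelled from the initial point and a distance-weighted running sum of squared gradient norms, then step with their ratio. The initial distance estimate eps is an illustrative constant, and details may differ from the paper.

```python
import numpy as np

def dowg(grad_fn, x0, steps=1000, eps=1e-4):
    """Sketch of DoWG (Distance over Weighted Gradients):
    r_t = max(r_{t-1}, ||x_t - x0||),  v_t = sum_k r_k^2 ||g_k||^2,
    step size = r_t^2 / sqrt(v_t)."""
    x, r, v = x0.copy(), eps, 0.0
    for _ in range(steps):
        g = grad_fn(x)
        r = max(r, float(np.linalg.norm(x - x0)))   # running distance
        v += r**2 * float(np.dot(g, g))             # weighted gradient sum
        x = x - (r**2 / np.sqrt(v)) * g
    return x

# Example: minimize f(x) = 0.5 ||x||^2 without tuning any step size.
x_star = dowg(lambda x: x, np.ones(10))
```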

Stable and low-precision training for large-scale vision-language models
Mitchell Wortsman Tim Dettmers Luke Zettlemoyer Ari S. Morcos Ali Farhadi Ludwig Schmidt



Research question: How to accelerate and stabilize the training of large vision-language models.
Motivation: Training of large vision-language models is slow and unstable.
Method: Proposes the SwitchBack linear layer for int8 quantized training, giving a 13-25% speedup while matching bfloat16 training performance. An analysis of loss spikes shows they typically occur 1-8 iterations after the AdamW second-moment estimator underestimates the squared gradients, motivating an AdamW-Adafactor hybrid to avoid loss spikes.
Results: SwitchBack also proves effective for float8 training, and standard techniques succeed if the network is trained and initialized so that large feature magnitudes are discouraged, which is accomplished via zero-initialized layer-scale. The AdamW-Adafactor hybrid avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales tested.

We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge---the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.

AdANNS: A Framework for Adaptive Semantic Search
Aniket Rege Aditya Kusupati Sharan Ranjit S Alan Fan Qingqing Cao Sham M. Kakade Prateek Jain Ali Farhadi



Research question: This paper addresses the accuracy-compute trade-off in web-scale search systems.
Motivation: Current search systems typically represent queries and data points with rigid, high-dimensional vectors, which makes retrieval computationally expensive.
Method: Proposes AdANNS, a new search design framework that exploits the flexibility of Matryoshka Representations, using adaptive representations of different capacities at different search stages to achieve better accuracy-compute trade-offs.
Results: Experiments on tasks such as ImageNet retrieval and Natural Questions show that AdANNS improves accuracy while greatly reducing compute, achieving higher efficiency.

Web-scale search systems learn an encoder to embed a given query which is then hooked into an approximate nearest neighbor search (ANNS) pipeline to retrieve similar data points. To accurately capture tail queries and data points, learned representations typically are _rigid, high-dimensional_ vectors that are generally used as-is in the entire ANNS pipeline and can lead to computationally expensive retrieval. In this paper, we argue that instead of rigid representations, different stages of ANNS can leverage _adaptive representations_ of varying capacities to achieve significantly better accuracy-compute trade-offs, i.e., stages of ANNS that can get away with more approximate computation should use a lower-capacity representation of the same data point. To this end, we introduce AdANNS, a novel ANNS design framework that explicitly leverages the flexibility of Matryoshka Representations. We demonstrate state-of-the-art accuracy-compute trade-offs using novel AdANNS-based key ANNS building blocks like search data structures (AdANNS-IVF) and quantization (AdANNS-OPQ). For example on ImageNet retrieval, AdANNS-IVF is up to $\mathbf{1.5}$% more accurate than the rigid representations-based IVF at the same compute budget; and matches accuracy while being up to $\mathbf{90}\times$ faster in _wall-clock time_. For Natural Questions, $32$-byte AdANNS-OPQ matches the accuracy of the $64$-byte OPQ baseline constructed using rigid representations -- _same accuracy at half the cost!_ We further show that the gains from AdANNS translate to modern-day composite ANNS indices that combine search structures and quantization. Finally, we demonstrate that AdANNS can enable inference-time adaptivity for compute-aware search on ANNS indices built non-adaptively on matryoshka representations. Code is open-sourced at https://github.com/RAIVNLab/AdANNS.
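
A toy numpy sketch of the adaptive idea behind AdANNS-IVF, assuming database rows are matryoshka embeddings (so a prefix of the coordinates is itself a valid lower-capacity embedding): cluster and probe with a cheap low-dimensional prefix, then re-rank the shortlist with the full vectors. The k-means loop, cluster count, and probe size are placeholders, not the paper's tuned pipeline.

```python
import numpy as np

def adaptive_search(db, query, d_coarse=32, n_clusters=64, n_probe=8, k=10):
    """Coarse stage: toy k-means and probing on the first d_coarse dims.
    Fine stage: exact re-ranking of shortlisted points with full vectors."""
    rng = np.random.default_rng(0)
    cent = db[rng.choice(len(db), n_clusters, replace=False), :d_coarse].copy()
    for _ in range(10):  # a few Lloyd iterations suffice for a demo
        assign = np.argmin(((db[:, None, :d_coarse] - cent[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            pts = db[assign == c, :d_coarse]
            if len(pts):
                cent[c] = pts.mean(0)
    probed = np.argsort(((query[:d_coarse] - cent) ** 2).sum(-1))[:n_probe]
    cand = np.where(np.isin(assign, probed))[0]    # shortlist via prefix
    dist = ((db[cand] - query) ** 2).sum(-1)       # full-dim re-ranking
    return cand[np.argsort(dist)[:k]]

rng = np.random.default_rng(1)
db = rng.normal(size=(5000, 256)).astype(np.float32)
top10 = adaptive_search(db, rng.normal(size=256).astype(np.float32))
```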

Clustering the Sketch: Dynamic Compression for Embedding Tables
Henry Tsang Thomas Dybdahl Ahle



Research question: How to handle the large embedding tables of categorical features in recommendation systems efficiently?
Motivation: As recommendation systems grow, categorical-feature embedding tables become ever larger, requiring new methods to fit them in memory, even during training.
Method: Proposes Clustered Compositional Embeddings (CCE), which combines clustering-based compression (such as quantization to codebooks) with dynamic methods (such as the hashing trick and compositional embeddings) [Shi et al., 2020].
Results: Experiments show that CCE achieves the best of both worlds: the high compression rate of codebook-based quantization and the dynamic nature of hashing-based methods, so it can be used during training. Theoretically, CCE is proven to converge to the optimal codebook, with a tight bound on the number of iterations required.

Embedding tables are used by machine learning systems to work with categorical features. In modern Recommendation Systems, these tables can be very large, necessitating the development of new methods for fitting them in memory, even during training. We suggest Clustered Compositional Embeddings (CCE) which combines clustering-based compression like quantization to codebooks with dynamic methods like The Hashing Trick and Compositional Embeddings [Shi et al., 2020]. Experimentally CCE achieves the best of both worlds: The high compression rate of codebook-based quantization, but \emph{dynamically} like hashing-based methods, so it can be used during training. Theoretically, we prove that CCE is guaranteed to converge to the optimal codebook and give a tight bound for the number of iterations required.
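
A sketch of the hashing-based compositional-embedding starting point that CCE builds on: each categorical id hashes into two small tables and the resulting rows are summed. CCE's distinguishing step, periodically re-clustering the learned table into a codebook during training, is omitted here.

```python
import numpy as np

class CompositionalEmbedding:
    """Two independent hash functions map each id to rows of two small
    tables; the embedding is the sum of the rows. A huge id space is
    served by tables whose size is independent of the vocabulary."""
    def __init__(self, n_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.t1 = rng.normal(scale=0.1, size=(n_buckets, dim))
        self.t2 = rng.normal(scale=0.1, size=(n_buckets, dim))
        self.a, self.b = rng.integers(1, 2**31 - 1, size=2)  # hash params

    def lookup(self, ids):
        ids = np.asarray(ids, dtype=np.int64)
        h1 = ids % self.t1.shape[0]                        # hash 1
        h2 = (self.a * ids + self.b) % self.t2.shape[0]    # hash 2
        return self.t1[h1] + self.t2[h2]

emb = CompositionalEmbedding(n_buckets=1000, dim=16)
vecs = emb.lookup([3, 141592, 2718281])   # huge id space, small tables
```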

Your representations are in the network: composable and parallel adaptation for large scale models
Yonatan Dukler Alessandro Achille Hao Yang Varsha Vivek Luca Zancato Benjamin Bowman Avinash Ravichandran Charless Fowlkes Ashwin Swaminathan Stefano Soatto



Research question: How to efficiently transfer large base models to new tasks?
Motivation: Current transfer learning methods often require substantial compute and time, and struggle to handle multiple downstream tasks simultaneously.
Method: Proposes InCA (Introspective-Cross-Attention), a framework that learns lightweight cross-attention modules on a base model's intermediate activations to adapt quickly to new tasks.
Results: Experiments show that InCA trains many adapters efficiently and in parallel, and that on 11 challenging downstream classification tasks a single adapter reaches full fine-tuning accuracy. Compared with other forms of parameter-efficient adaptation, the isolated nature of InCA gives it better computational performance on large-scale models.

We present a framework for transfer learning that efficiently adapts a large base-model by learning lightweight cross-attention modules attached to its intermediate activations. We name our approach InCA (Introspective-Cross-Attention) and show that it can efficiently survey a network’s representations and identify strong performing adapter models for a downstream task. During training, InCA enables training numerous adapters efficiently and in parallel, isolated from the frozen base model. On the ViT-L/16 architecture, our experiments show that a single adapter, 1.3% of the full model, is able to reach full fine-tuning accuracy on average across 11 challenging downstream classification tasks. Compared with other forms of parameter-efficient adaptation, the isolated nature of the InCA adaptation is computationally desirable for large-scale models. For instance, we adapt ViT-G/14 (1.8B+ parameters) quickly with 20+ adapters in parallel on a single V100 GPU (76% GPU memory reduction) and exhaustively identify its most useful representations. We further demonstrate how the adapters learned by InCA can be incrementally modified or combined for flexible learning scenarios and our approach achieves state of the art performance on the ImageNet-to-Sketch multi-task benchmark.

Mobilizing Personalized Federated Learning in Infrastructure-Less and Heterogeneous Environments via Random Walk Stochastic ADMM
Ziba Parsons Fei Dou Houyi Du Zheng Song Jin Lu



Research question: This paper explores how to realize federated learning (FL) in practical, infrastructure-less settings where isolated nodes hold heterogeneous data and can reach the server only through wireless links.
Motivation: To overcome these challenges, a new personalized mobilizing FL approach is proposed to facilitate mobility and resilience.
Method: Develops a novel optimization algorithm, Random Walk Stochastic Alternating Direction Method of Multipliers (RWSADMM). RWSADMM exploits the server's random movement toward clients and formulates local proximity among adjacent clients via hard inequality constraints, rather than requiring consensus updates or introducing bias through regularization.
Results: Theoretical and empirical results show that RWSADMM achieves provably fast convergence and substantial accuracy improvements over baseline methods, while reducing communication costs and improving scalability.

This paper explores the challenges of implementing Federated Learning (FL) in practical scenarios featuring isolated nodes with data heterogeneity, which can only be connected to the server through wireless links in an infrastructure-less environment. To overcome these challenges, we propose a novel mobilizing personalized FL approach, which aims to facilitate mobility and resilience. Specifically, we develop a novel optimization algorithm called Random Walk Stochastic Alternating Direction Method of Multipliers (RWSADMM). RWSADMM capitalizes on the server's random movement toward clients and formulates local proximity among their adjacent clients based on hard inequality constraints rather than requiring consensus updates or introducing bias via regularization methods. To mitigate the computational burden on the clients, an efficient stochastic solver of the approximated optimization problem is designed in RWSADMM, which provably converges to the stationary point almost surely in expectation. Our theoretical and empirical results demonstrate the provable fast convergence and substantial accuracy improvements achieved by RWSADMM compared to baseline methods, along with its benefits of reduced communication costs and enhanced scalability.

Sparsity-Preserving Differentially Private Training of Large Embedding Models
Badih Ghazi Yangsibo Huang Pritish Kamath Ravi Kumar Pasin Manurangsi Amer Sinha Chiyuan Zhang



Research question: How to improve the training efficiency of large embedding models while protecting user data privacy.
Motivation: As large embedding models are increasingly used in recommendation systems and language applications, concerns over user data privacy have grown.
Method: Proposes two new algorithms, DP-FEST and DP-AdaFEST, that preserve gradient sparsity during the private training of large embedding models.
Results: Both algorithms achieve substantial reductions in gradient size ($10^6\times$) on benchmark real-world datasets while maintaining comparable levels of accuracy.

As the use of large embedding models in recommendation systems and language applications increases, concerns over user data privacy have also risen. DP-SGD, a training algorithm that combines differential privacy with stochastic gradient descent, has been the workhorse in protecting user privacy without compromising model accuracy by much. However, applying DP-SGD naively to embedding models can destroy gradient sparsity, leading to reduced training efficiency. To address this issue, we present two new algorithms, DP-FEST and DP-AdaFEST, that preserve gradient sparsity during the private training of large embedding models. Our algorithms achieve substantial reductions ($10^6 \times$) in gradient size, while maintaining comparable levels of accuracy, on benchmark real-world datasets.

An Inverse Scaling Law for CLIP Training
Xianhang Li Zeyu Wang Cihang Xie



Research question: How to lower the compute cost of training CLIP models so as to broaden their use in computer vision.
Motivation: CLIP is costly to train, which limits further research and application.
Method: The study finds that the larger the image/text encoder, the shorter the image/text token length that can be applied in training; with a suitable token-length reduction strategy, CLIP can be trained successfully at low cost.
Results: With 8 A100 GPUs, CLIP models reach zero-shot ImageNet-1k accuracies of 63.2%, 67.8%, and 69.3% in roughly 2-4 days. With G/14, zero-shot ImageNet-1k accuracy reaches 83.0%, about 33x faster to train than the OpenCLIP counterpart.

CLIP, one of the pioneering foundation models that connect images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP even with limited computational resources. For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling up --- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot accuracy, and meanwhile accelerate the training by ~33x compared to its OpenCLIP counterpart. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
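
One simple instantiation of the token-length reduction the paper studies is random subsampling of image patch tokens; the sketch below keeps a fraction of the tokens per image. The keep ratio and the choice of random masking over the paper's other strategies are assumptions for illustration.

```python
import torch

def subsample_image_tokens(tokens, keep_ratio=0.5):
    """Randomly keep a fraction of image patch tokens during training.
    tokens: (batch, seq, dim) patch embeddings, CLS token excluded."""
    b, s, d = tokens.shape
    k = max(1, int(s * keep_ratio))
    idx = torch.argsort(torch.rand(b, s), dim=1)[:, :k]   # random subset
    return torch.gather(tokens, 1, idx[..., None].expand(b, k, d))

x = torch.randn(2, 196, 768)       # ViT-B/16-style patch tokens
short = subsample_image_tokens(x)  # (2, 98, 768): half the token length
```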

Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
Piotr Indyk Haike Xu



Research question: This paper studies the worst-case performance of graph-based approximate nearest neighbor search algorithms such as HNSW, NSG, and DiskANN.
Motivation: Although graph-based approximate nearest neighbor search is a popular and powerful tool for handling large datasets in practice, it has limited theoretical guarantees.
Method: Analyzes recent graph-based algorithms, including the "slow preprocessing" version of DiskANN, HNSW, and NSG. For DiskANN, the "slow preprocessing" version is proven to support approximate nearest neighbor queries with a constant approximation ratio and poly-logarithmic query time when the dataset has bounded "intrinsic" dimension.
Results: For the other variants, including the "fast preprocessing" version of DiskANN, HNSW, and NSG, the paper exhibits families of instances on which the query time needed for "reasonable" accuracy is linear in the instance size; for DiskANN, the query procedure can take at least 0.1n steps before encountering any of the query's 5 nearest neighbors.

Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search queries with a constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in the instance size. For example, for DiskANN, we show that the query procedure can take at least $0.1 n$ steps on instances of size $n$ before it encounters any of the $5$ nearest neighbors of the query.

FedGCN: Convergence-Communication Tradeoffs in Federated Training of Graph Convolutional Networks
Yuhang Yao Weizhao Jin Srivatsan Ravi Carlee Joe-Wong



Research question: How to train graph models across multiple clients with low communication overhead while preserving data privacy?
Motivation: Distributed graph model training is increasingly popular due to graph sizes and regulations on where data is generated; however, cross-client edges naturally exist among clients, causing significant communication overhead or a loss of training information.
Method: Proposes the Federated Graph Convolutional Network (FedGCN) algorithm, which uses federated learning to train GCN models for semi-supervised node classification with fast convergence and little communication.
Results: Unlike existing methods that require extra communication between clients at every training round, FedGCN clients communicate with the central server only in a single pre-training step, greatly reducing communication costs and allowing homomorphic encryption for stronger privacy. Experiments show that FedGCN achieves better model accuracy with 51.7% faster convergence on average and at least 100x less communication.

Methods for training models on graphs distributed across multiple clients have recently grown in popularity, due to the size of these graphs as well as regulations on keeping data where it is generated. However, cross-client edges naturally exist among clients; thus, distributed methods for training a model on a single graph incur either significant communication overhead between clients or a loss of available information to the training. We introduce the Federated Graph Convolutional Network (FedGCN) algorithm, which uses federated learning to train GCN models for semi-supervised node classification with fast convergence and little communication. Compared to prior methods that require extra communication among clients at each training round, FedGCN clients only communicate with the central server in one pre-training step, greatly reducing communication costs and allowing the use of homomorphic encryption to further enhance privacy. We theoretically analyze the tradeoff between FedGCN's convergence rate and communication cost under different data distributions. Experimental results show that our FedGCN algorithm achieves better model accuracy with 51.7\% faster convergence on average and at least 100$\times$ less communication compared to prior work.

Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models
Deepak Narayanan Keshav Santhanam Peter Henderson Rishi Bommasani Tony Lee Percy Liang



Research question: Large language models (LLMs) are highly capable but computationally expensive; quantifying the fundamental tradeoff between inference efficiency and model capability is challenging.
Motivation: Existing evaluation cannot fairly compare the inference efficiency of models from different providers, because providers can apply model-orthogonal software and hardware optimizations and shared infrastructure introduces performance contention.
Method: Proposes a new inference-efficiency metric, idealized runtime, which fairly compares models as though they were served on uniform hardware and software without performance contention, together with a cost model that efficiently estimates the metric for autoregressive Transformer models.
Results: Using these metrics, ten LLMs developed in 2022 are compared in the first analysis of inference efficiency-capability tradeoffs; the analysis finds that the superior inference runtime of certain APIs is often a byproduct of in-API optimizations rather than the underlying model.

Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the _fundamental tradeoff_ between inference efficiency and model capabilities requires a metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency called _idealized runtime_, that puts models on equal footing as though they were served on uniform hardware and software without performance contention, and a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our code is open sourced at https://github.com/stanford-crfm/helm-efficiency.

CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training
A. Feder Cooper Wentao Guo Khiem Pham Tiancheng Yuan Charlie F. Ruan Yucheng Lu Christopher De Sa



Research question: How to extend provably faster permutation-based example ordering (online Gradient Balancing, GraB) from centralized training to modern distributed ML workloads.
Motivation: GraB provably outperforms random reshuffling by using stale gradients from prior epochs to order examples, but by design it does not naturally extend to distributed settings.
Method: Proposes Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of permutation-based example ordering to distributed training.
Results: With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and empirically outperforms baselines, including distributed random reshuffling, on a variety of benchmark tasks.

Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: While it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms baselines empirically, including distributed RR, on a variety of benchmark tasks.

FLuID: Mitigating Stragglers in Federated Learning using Invariant Dropout
Irene Wang Prashant J. Nair Divya Mahajan



Research question: In federated learning, lower-performance "straggler" devices often dictate the overall training time, bottlenecking training efficiency.
Motivation: To alleviate the training inefficiency caused by straggler devices in federated learning.
Method: Proposes Invariant Dropout, which extracts a sub-model based on a weight-update threshold to minimize the potential impact on accuracy, and builds on it an adaptive training framework, Federated Learning using Invariant Dropout (FLuID).
Results: FLuID provides lightweight sub-model extraction to regulate computational intensity, reducing the load on stragglers without affecting model quality. Experiments show that Invariant Dropout maintains baseline model efficiency while mitigating straggler bottlenecks through a dynamic, runtime approach.

Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also generates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, “straggler” devices with lower performance often dictate the overall training time in FL. In this work, we aim to alleviate this performance bottleneck due to stragglers by dynamically balancing the training load across the system. We introduce Invariant Dropout, a method that extracts a sub-model based on the weight update threshold, thereby minimizing potential impacts on accuracy. Building on this dropout technique, we develop an adaptive training framework, Federated Learning using Invariant Dropout (FLuID). FLuID offers a lightweight sub-model extraction to regulate computational intensity, thereby reducing the load on straggler devices without affecting model quality. Our method leverages neuron updates from non-straggler devices to construct a tailored sub-model for each straggler based on client performance profiling. Furthermore, FLuID can dynamically adapt to changes in stragglers as runtime conditions shift. We evaluate FLuID using five real-world mobile clients. The evaluations show that Invariant Dropout maintains baseline model efficiency while alleviating the performance bottleneck of stragglers through a dynamic, runtime approach.
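
A rough sketch of the sub-model extraction idea: rank neurons by how much their weights moved between rounds and drop the "invariant" ones for straggler devices. The per-neuron change measure and the quantile threshold are illustrative stand-ins for the paper's update-threshold rule.

```python
import numpy as np

def invariant_dropout_mask(w_prev, w_curr, keep_ratio=0.5):
    """Neurons (rows) whose weight update magnitude falls below the
    threshold are treated as invariant and dropped for stragglers."""
    update = np.abs(w_curr - w_prev).sum(axis=1)    # per-neuron change
    thresh = np.quantile(update, 1.0 - keep_ratio)
    return update >= thresh                         # True = keep neuron

rng = np.random.default_rng(0)
w_prev = rng.normal(size=(256, 128))
w_curr = w_prev + rng.normal(scale=0.01, size=w_prev.shape)
mask = invariant_dropout_mask(w_prev, w_curr, keep_ratio=0.5)
sub_layer = w_curr[mask]    # (128, 128): half the neurons kept
```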

Don’t just prune by magnitude! Your mask topology is a secret weapon
Duc N.M Hoang Souvik Kundu Shiwei Liu Zhangyang Wang



Research question: This paper explores the relationship between the graph connectivity of deep network architectures and their performance, and analyzes the role parameters play in that connectivity.
Motivation: Although prior works have connected deep architectures to expander or Ramanujan graphs, none explicitly examines the role of parameters in graph connectivity.
Method: Analyzes the weighted spectral gap of Ramanujan structures in sparse neural networks and its correlation with final performance. Examining the evolution of sparse structures under a popular dynamic sparse-to-sparse training scheme reveals that the generated random topologies inherently maximize Ramanujan graphs.
Results: A strong correlation is found among masks, performance, and the weighted spectral gap. Building on this observation, a new "full-spectrum coordinate" is proposed to comprehensively characterize a sparse neural network's promise, and a new actionable pruning method is developed by sampling sparse masks that maximize the L2-coordinate distance.

Recent years have witnessed significant progress in understanding the relationship between the connectivity of a deep network's architecture as a graph and the network's performance. A few prior works connected deep architectures to expander graphs or Ramanujan graphs; in particular, [7] demonstrated the use of such graph connectivity measures for ranking the relative performance of various obtained sparse sub-networks (i.e., models with prune masks) without the need for training. However, no prior work explicitly explores the role of parameters in the graph's connectivity, leaving the graph-based understanding of prune masks and the magnitude/gradient-based pruning practice isolated from one another. This paper strives to fill this gap by analyzing the Weighted Spectral Gap of Ramanujan structures in sparse neural networks and investigating its correlation with final performance. We specifically examine the evolution of sparse structures under a popular dynamic sparse-to-sparse network training scheme, and intriguingly find that the generated random topologies inherently maximize Ramanujan graphs. We also identify a strong correlation between masks, performance, and the weighted spectral gap. Leveraging this observation, we propose to construct a new "full-spectrum coordinate" aiming to comprehensively characterize a sparse neural network's promise. Concretely, it consists of the classical Ramanujan's gap (structure), our proposed weighted spectral gap (parameters), and the constituent nested regular graphs within. In this new coordinate system, a sparse subnetwork's L2-distance from its original initialization is found to be nearly linearly correlated with its performance. Eventually, we apply this unified perspective to develop a new actionable pruning method, by sampling sparse masks to maximize the L2-coordinate distance. Our method can be augmented with the "pruning at initialization" (PaI) method, and significantly outperforms existing PaI methods. With only a few iterations of training (e.g., 500 iterations), we can obtain performance comparable to LTH as yielded via "pruning after training", significantly saving pre-training costs. Code can be found at: https://github.com/VITA-Group/FullSpectrum-PAI.

Federated Multi-Objective Learning
Haibo Yang Zhuqing Liu Jia Liu Chaosheng Dong Michinari Momma



Research question: Existing multi-objective optimization (MOO) algorithms are mainly designed for centralized learning settings and do not meet the distributed nature and data-privacy needs of multi-agent multi-task learning.
Motivation: To address this, a new federated multi-objective learning (FMOL) framework is proposed in which multiple clients collaboratively solve an MOO problem while keeping their training data private.
Method: The FMOL framework allows different clients to hold different objective functions to support a wide range of applications, generalizing the MOO formulation to the federated learning paradigm for the first time. Two new federated multi-objective optimization (FMOO) algorithms are proposed: federated multi-gradient descent averaging (FMGDA) and federated stochastic multi-gradient descent averaging (FSMGDA). Both allow local updates to significantly reduce communication costs while matching the convergence rates of their single-objective federated counterparts.
Results: Extensive experiments corroborate the efficacy of the proposed FMOO algorithms.

In recent years, multi-objective optimization (MOO) has emerged as a foundational problem underpinning many multi-agent multi-task learning applications. However, existing algorithms in the MOO literature remain limited to centralized learning settings, which do not satisfy the distributed nature and data privacy needs of such multi-agent multi-task learning applications. This motivates us to propose a new federated multi-objective learning (FMOL) framework with multiple clients distributively and collaboratively solving an MOO problem while keeping their training data private. Notably, our FMOL framework allows a different set of objective functions across different clients to support a wide range of applications, which advances and generalizes the MOO formulation to the federated learning paradigm for the first time. For this FMOL framework, we propose two new federated multi-objective optimization (FMOO) algorithms called federated multi-gradient descent averaging (FMGDA) and federated stochastic multi-gradient descent averaging (FSMGDA). Both algorithms allow local updates to significantly reduce communication costs, while achieving the _same_ convergence rates as those of their algorithmic counterparts in single-objective federated learning. Our extensive experiments also corroborate the efficacy of our proposed FMOO algorithms.
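
At the heart of multi-gradient descent averaging is the classic MGDA subroutine: combine per-objective gradients into a min-norm common descent direction. For two objectives this has a closed form, sketched below; in the federated algorithms the per-objective gradients would be averages of client updates, a detail omitted here.

```python
import numpy as np

def mgda_direction(g1, g2):
    """Min-norm element of the convex hull of two gradients:
    d = lam*g1 + (1-lam)*g2 with lam = <g2-g1, g2>/||g1-g2||^2,
    clipped to [0, 1]. d is a common descent direction when nonzero."""
    diff = g2 - g1
    denom = float(np.dot(diff, diff)) + 1e-12
    lam = float(np.clip(np.dot(g2, diff) / denom, 0.0, 1.0))
    return lam * g1 + (1.0 - lam) * g2

d = mgda_direction(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# d == [0.5, 0.5]: a direction that decreases both objectives
```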

VRA: Variational Rectified Activation for Out-of-distribution Detection
Mingyu Xu Zheng Lian Bin Liu Jianhua Tao



Research question: How to effectively detect out-of-distribution (OOD) data in the open world so as to build reliable machine learning systems.
Motivation: Existing strategies for reducing model overconfidence on OOD data, such as ReAct, achieve some success, but whether better choices exist remains to be verified.
Method: Uses the variational method to find the optimal operation, verifying the necessity of suppressing abnormally low and high activations and amplifying intermediate activations in OOD detection, rather than focusing only on high activations as ReAct does. This yields a new method, "Variational Rectified Activation (VRA)", which simulates these suppression and amplification operations with piecewise functions.
Results: Experiments on multiple benchmark datasets show the method outperforms existing post-hoc strategies, and VRA is compatible with different scoring functions and network architectures.

Out-of-distribution (OOD) detection is critical to building reliable machine learning systems in the open world. Researchers have proposed various strategies to reduce model overconfidence on OOD data. Among them, ReAct is a typical and effective technique to deal with model overconfidence, which truncates high activations to increase the gap between in-distribution and OOD. Despite its promising results, is this technique the best choice? To answer this question, we leverage the variational method to find the optimal operation and verify the necessity of suppressing abnormally low and high activations and amplifying intermediate activations in OOD detection, rather than focusing only on high activations like ReAct. This motivates us to propose a novel technique called ``Variational Rectified Activation (VRA)'', which simulates these suppression and amplification operations using piecewise functions. Experimental results on multiple benchmark datasets demonstrate that our method outperforms existing post-hoc strategies. Meanwhile, VRA is compatible with different scoring functions and network architectures. Our code is available at https://github.com/zeroQiaoba/VRA.
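
A plausible piecewise instantiation of the suppress-low / amplify-intermediate / truncate-high operation the abstract describes; the thresholds alpha, beta and the shift gamma are hypothetical hyperparameters for illustration, not the paper's fitted values.

```python
import numpy as np

def vra(z, alpha=0.3, beta=1.5, gamma=0.2):
    """Piecewise rectification in the spirit of VRA: zero out abnormally
    low activations, shift (amplify) intermediate ones, and truncate
    abnormally high ones (the ReAct-like part)."""
    out = np.where(z < alpha, 0.0, z + gamma)   # suppress low, amplify mid
    return np.minimum(out, beta)                # truncate high activations

feats = np.random.default_rng(0).gamma(1.0, 1.0, size=(4, 512))
rectified = vra(feats)   # applied to penultimate features before scoring
```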

Flow: Per-instance Personalized Federated Learning
Kunjal Panchal Sunav Choudhary Nisarg Parikh Lijun Zhang Hui Guan



Research question: The data heterogeneity problem in federated learning: diverse data distributions across clients make it challenging to train an effective global model.
Motivation: Existing personalization approaches address data heterogeneity by creating, for each client, a personalized model fitted to its local data distribution; however, these personalized models may be less accurate than the global model on some clients, limiting the gains over no personalization.
Method: Proposes Flow, a per-instance personalized federated learning method. Flow creates dynamic personalized models that adapt not only to each client's data distribution but also to each client's data instances: every instance can dynamically decide whether to use local or global parameters for its prediction, improving client accuracy.
Results: Provides a theoretical convergence analysis of Flow and empirically demonstrates its superiority over state-of-the-art personalization approaches in improving client accuracy on both vision and language tasks.

Federated learning (FL) suffers from data heterogeneity, where the diverse data distributions across clients make it challenging to train a single global model effectively. Existing personalization approaches aim to address the data heterogeneity issue by creating a personalized model for each client from the global model that fits their local data distribution. However, these personalized models may achieve lower accuracy than the global model in some clients, resulting in limited performance improvement compared to that without personalization. To overcome this limitation, we propose a per-instance personalization FL algorithm Flow. Flow creates dynamic personalized models that are adaptive not only to each client’s data distributions but also to each client’s data instances. The personalized model allows each instance to dynamically determine whether it prefers the local parameters or its global counterpart to make correct predictions, thereby improving clients’ accuracy. We provide theoretical analysis on the convergence of Flow and empirically demonstrate the superiority of Flow in improving clients’ accuracy compared to state-of-the-art personalization approaches on both vision and language-based tasks.

Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models
Leonardo Galli Holger Rauhut Mark Schmidt



Research question: Existing line search methods can accelerate stochastic gradient descent (SGD) and Adam in modern over-parameterized settings, but may take steps smaller than necessary.
Motivation: Nonmonotone line search methods are explored to relax the monotone decrease condition and possibly accept larger step sizes; despite the lack of monotone decrease, the same fast convergence rates as in the monotone case are proven.
Method: Proposes PoNoS, obtained by combining a nonmonotone line search with a Polyak initial step size, together with a new resetting technique that reduces the number of backtracks to zero in most iterations while keeping a large initial step size.
Results: Experiments show that nonmonotone methods improve the convergence speed and generalization of SGD/Adam even beyond previous monotone line searches. To our knowledge, a first runtime comparison shows that the epoch-wise advantage of line-search-based methods carries over to overall computational time.

Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function. We explore nonmonotone line search methods to relax this condition and possibly accept larger step sizes. Despite the lack of a monotonic decrease, we prove the same fast rates of convergence as in the monotone case. Our experiments show that nonmonotone methods improve the speed of convergence and generalization properties of SGD/Adam even beyond the previous monotone line searches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained by combining a nonmonotone line search with a Polyak initial step size. Furthermore, we develop a new resetting technique that in the majority of the iterations reduces the amount of backtracks to zero while still maintaining a large initial step size. To the best of our knowledge, a first runtime comparison shows that the epoch-wise advantage of line-search-based methods gets reflected in the overall computational time.
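
A minimal sketch combining the two PoNoS ingredients named in the abstract: a Polyak-style initial step size (assuming the optimal value f* is 0, as in interpolation settings) and a nonmonotone Armijo test that compares against the maximum of recent losses rather than the current one. The resetting technique is omitted, and constants are illustrative.

```python
import numpy as np

def ponos_step(f, grad, x, f_hist, eta0, c=0.5, rho=0.5, window=10):
    """One nonmonotone backtracking step with a Polyak initial step size."""
    g = grad(x)
    gnorm2 = float(np.dot(g, g))
    fx = f(x)
    eta = min(eta0, fx / (c * gnorm2 + 1e-12))       # Polyak initial step
    ref = max(f_hist[-window:] + [fx])               # nonmonotone reference
    while f(x - eta * g) > ref - c * eta * gnorm2:   # relaxed Armijo test
        eta *= rho                                   # backtrack
    return x - eta * g, eta

# Toy quadratic: f(x) = 0.5 ||x||^2, so f* = 0 and the Polyak step is exact.
f = lambda x: 0.5 * float(np.dot(x, x))
grad = lambda x: x
x, hist = np.ones(5), []
for _ in range(20):
    x, eta = ponos_step(f, grad, x, hist, eta0=10.0)
    hist.append(f(x))
```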

Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices
Kilian Pfeiffer Ramin Khalili Joerg Henkel



Research question: How to run federated learning on resource-constrained devices so that devices with insufficient memory are not excluded from training.
Motivation: Federated learning is typically performed on resource-constrained edge devices, e.g., with limited computation memory; if the memory required to train a model exceeds this limit, the device is excluded from training, which can lower accuracy and cause bias and unfairness.
Method: Proposes a new method that enables devices to successively freeze and train the parameters of the federated learning model end to end, reducing the devices' resource requirements while still allowing enough co-adaptation between parameters.
Results: Extensive experimental evaluation shows that the technique greatly improves the accuracy of the trained model (by up to 52.4 percentage points) compared with the state of the art, and efficiently aggregates the computation capacity of distributed devices.

Federated learning (FL) is usually performed on resource-constrained edge devices, e.g., with limited memory for the computation. If the required memory to train a model exceeds this limit, the device will be excluded from the training. This can lead to a lower accuracy as valuable data and computation resources are excluded from training, also causing bias and unfairness. The FL training process should be adjusted to such constraints. The state-of-the-art techniques propose training subsets of the FL model at constrained devices, reducing their resource requirements for training. However, these techniques largely limit the co-adaptation among parameters of the model and are highly inefficient, as we show: it is actually better for the system to train a smaller (less accurate) model that all the devices can train end-to-end than to apply such techniques. We propose a new method that enables successive freezing and training of the parameters of the FL model at devices, reducing the training's resource requirements at the devices while still allowing enough co-adaptation between parameters. We show through extensive experimental evaluation that our technique greatly improves the accuracy of the trained model (by 52.4 p.p.) compared with the state of the art, efficiently aggregating the computation capacity available on distributed devices.

Learning Large-Scale MTP$_2$ Gaussian Graphical Models via Bridge-Block Decomposition
Xiwen Wang Jiaxi Ying Daniel P. Palomar



Research question: This paper studies learning large-scale Gaussian graphical models that are multivariate totally positive of order two (MTP_2).
Motivation: By introducing the concept of a bridge, which commonly exists in large-scale sparse graphs, the authors show that the entire problem can be equivalently optimized through several smaller-scale sub-problems induced by a bridge-block decomposition of the thresholded sample covariance graph, plus a set of explicit solutions for the entries corresponding to bridges.
Method: From a practical perspective, this simple and provable discipline breaks a large problem into small tractable ones, leading to an enormous reduction in computational complexity and substantial improvements for all existing algorithms.
Results: Synthetic and real-world experiments show that the proposed method achieves a significant speedup over state-of-the-art benchmarks.

This paper studies the problem of learning the large-scale Gaussian graphical models that are multivariate totally positive of order two ($\text{MTP}_2$). By introducing the concept of bridge, which commonly exists in large-scale sparse graphs, we show that the entire problem can be equivalently optimized through (1) several smaller-scaled sub-problems induced by a \emph{bridge-block decomposition} on the thresholded sample covariance graph and (2) a set of explicit solutions on entries corresponding to \emph{bridges}. From practical aspect, this simple and provable discipline can be applied to break down a large problem into small tractable ones, leading to enormous reduction on the computational complexity and substantial improvements for all existing algorithms. The synthetic and real-world experiments demonstrate that our proposed method presents a significant speed-up compared to the state-of-the-art benchmarks.
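
The decomposition itself is easy to picture with networkx: find the bridges of the (thresholded) graph, remove them, and each remaining connected component becomes an independent sub-problem. The toy graph below stands in for a thresholded sample covariance graph; the MTP2 sub-solvers and the closed-form bridge entries are the paper's contribution and are only noted in comments.

```python
import networkx as nx

# Toy stand-in for a thresholded sample covariance graph.
G = nx.Graph([(0, 1), (1, 2), (2, 0),   # block 1 (a triangle)
              (2, 3),                    # a bridge
              (3, 4), (4, 5), (5, 3)])   # block 2 (a triangle)

bridges = list(nx.bridges(G))            # [(2, 3)]
H = G.copy()
H.remove_edges_from(bridges)
blocks = list(nx.connected_components(H))  # [{0, 1, 2}, {3, 4, 5}]
# Each block would be solved as a smaller MTP2 sub-problem; the precision
# matrix entries on the bridges admit explicit solutions per the paper.
```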

Federated Compositional Deep AUC Maximization
Xinwen Zhang Yihan Zhang Tianbao Yang Richard Souvenir Hongchang Gao



Research question: This paper addresses the poor predictive performance of federated learning on highly imbalanced data.
Motivation: Most existing federated learning methods focus on balanced data; for real-world applications where class sizes are extremely imbalanced, their predictive performance is far from satisfactory.
Method: Develops a new federated learning method for imbalanced data by directly optimizing the area under the curve (AUC) score. Specifically, the AUC maximization problem is formulated as a federated compositional minimax optimization problem, solved with a local stochastic compositional gradient descent ascent algorithm with momentum.
Results: Extensive experimental results confirm the efficacy of the method.

Federated learning has attracted increasing attention due to the promise of balancing privacy and large-scale learning; numerous approaches have been proposed. However, most existing approaches focus on problems with balanced data, and prediction performance is far from satisfactory for many real-world applications where the number of samples in different classes is highly imbalanced. To address this challenging problem, we developed a novel federated learning method for imbalanced data by directly optimizing the area under curve (AUC) score. In particular, we formulate the AUC maximization problem as a federated compositional minimax optimization problem, develop a local stochastic compositional gradient descent ascent with momentum algorithm, and provide bounds on the computational and communication complexities of our algorithm. To the best of our knowledge, this is the first work to achieve such favorable theoretical results. Finally, extensive experimental results confirm the efficacy of our method.

Generalised f-Mean Aggregation for Graph Neural Networks
Ryan Kortvelesy Steven Morad Amanda Prorok



Research question: How to choose a graph neural network (GNN) aggregator that minimizes information loss.
Motivation: Most methods currently pick a "standard aggregator" such as mean, sum, or max, often without justification, even though the choice has a significant impact on performance.
Method: Proposes GenAgg, a generalised aggregation operator that parametrises a function space including all standard aggregators.
Results: Experiments show that GenAgg represents the standard aggregators with much higher accuracy than baseline methods, and using it as a drop-in replacement for an existing aggregator in a GNN often yields a significant performance boost across various tasks.

Graph Neural Network (GNN) architectures are defined by their implementations of update and aggregation modules. While many works focus on new ways to parametrise the update modules, the aggregation modules receive comparatively little attention. Because it is difficult to parametrise aggregation functions, currently most methods select a ``standard aggregator'' such as mean, sum, or max. While this selection is often made without any reasoning, it has been shown that the choice in aggregator has a significant impact on performance, and the best choice in aggregator is problem-dependent. Since aggregation is a lossy operation, it is crucial to select the most appropriate aggregator in order to minimise information loss. In this paper, we present GenAgg, a generalised aggregation operator, which parametrises a function space that includes all standard aggregators. In our experiments, we show that GenAgg is able to represent the standard aggregators with much higher accuracy than baseline methods. We also show that using GenAgg as a drop-in replacement for an existing aggregator in a GNN often leads to a significant boost in performance across various tasks.
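
GenAgg parametrises a generalised f-mean; a simplified one-parameter slice of that space is the power mean sketched below, which already interpolates between mean, max-like, and min-like behaviour. The paper's full learnable parameterisation is richer than this sketch.

```python
import numpy as np

def power_mean(x, p, eps=1e-8):
    """Generalised f-mean with f(x) = x^p (the power mean).
    p = 1 gives the mean; p -> +inf approaches max; p -> -inf approaches
    min. Inputs are clamped positive, as the power mean requires."""
    x = np.maximum(x, eps)
    return (np.mean(x ** p)) ** (1.0 / p)

x = np.array([1.0, 2.0, 4.0])
print(power_mean(x, 1.0))    # ~2.33 (mean)
print(power_mean(x, 16.0))   # ~3.73 (approaches max)
print(power_mean(x, -16.0))  # ~1.05 (approaches min)
```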

DELTA: Diverse Client Sampling for Fasting Federated Learning
Lin Wang Yongxin Guo Tao Lin Xiaoying Tang



Research question: How to effectively reduce the communication burden in federated learning while avoiding the large model-update variance and slow convergence caused by inadequate client sampling schemes.
Motivation: Existing client sampling methods are either biased or can be further optimized for faster convergence.
Method: Proposes DELTA, an unbiased sampling scheme that characterizes the effects of client diversity and local variance, and samples representative clients carrying valuable information for global model updates.
Results: DELTA is proven to be the optimal unbiased sampling scheme that minimizes the variance caused by partial client participation, and it outperforms other unbiased sampling schemes in terms of convergence. To address full-client gradient dependence, a practical version of DELTA relying only on available client information is provided, together with its convergence analysis. Experiments on synthetic and real-world datasets validate these findings.

Partial client participation has been widely adopted in Federated Learning (FL) to reduce the communication burden efficiently. However, an inadequate client sampling scheme can lead to the selection of unrepresentative subsets, resulting in significant variance in model updates and slowed convergence. Existing sampling methods are either biased or can be further optimized for faster convergence. In this paper, we present DELTA, an unbiased sampling scheme designed to alleviate these issues. DELTA characterizes the effects of client diversity and local variance, and samples representative clients with valuable information for global model updates. In addition, DELTA is a proven optimal unbiased sampling scheme that minimizes variance caused by partial client participation and outperforms other unbiased sampling schemes in terms of convergence. Furthermore, to address full-client gradient dependence, we provide a practical version of DELTA depending on the available clients' information, and also analyze its convergence. Our results are validated through experiments on both synthetic and real-world datasets.

A fast heuristic to optimize time-space tradeoff for large models
Akifumi Imanishi Zijian Xu Masayuki Takagi Sixue Wang Emilio Castillo



Research question: Training large neural networks is constrained by GPU memory, calling for effective gradient recomputation methods.
Motivation: Existing recomputation methods such as Checkmate and Moccasin rely on mixed integer linear programming or constraint programming, and their scalability is limited by the enormous search space.
Method: Proposes FastSA, a new recomputation algorithm based on a simulated annealing heuristic.
Results: Experiments show that FastSA achieves significant memory reductions (73%) on large vision and text models, with an average extra computational overhead of 18%.

Training large-scale neural networks is heavily constrained by GPU memory. In order to circumvent this limitation, gradient checkpointing, or recomputation, is a powerful technique. There is active research in this area, with methods such as Checkmate and Moccasin. However, both Checkmate and Moccasin rely on mixed integer linear programming or constraint programming, resulting in limited scalability due to their exponentially large search space. This paper proposes a novel algorithm for recomputation (FastSA) based on a simulated annealing heuristic that achieves comparable or even better solutions than state-of-the-art alternatives. FastSA can optimize computational graphs with thousands of nodes within 3 to 30 seconds, several orders of magnitude faster than current solutions. We applied FastSA to PyTorch models and verified its effectiveness on popular large vision and text models, including recent language models with the transformer architecture. The results demonstrate significant memory reductions of 73% with an extra 18% computational overhead on average. Our experiments demonstrate the practicality and effectiveness of our recomputation algorithm, further highlighting its potential for wide application in various deep learning domains.
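
FastSA's moves and cost model for computational graphs are the paper's contribution; the generic simulated-annealing skeleton they plug into looks roughly like the following, shown here on a toy integer problem rather than a real recomputation schedule.

```python
import math
import random

def simulated_annealing(init, neighbor, cost, steps=10000, t0=1.0, t1=1e-3):
    """Generic SA loop: propose a local modification, always accept
    improvements, and accept worse solutions with probability
    exp(-delta/t) under a geometric cooling schedule."""
    x, cx = init, cost(init)
    best, cbest = x, cx
    for k in range(steps):
        t = t0 * (t1 / t0) ** (k / steps)          # geometric cooling
        y = neighbor(x)
        cy = cost(y)
        if cy < cx or random.random() < math.exp(-(cy - cx) / t):
            x, cx = y, cy
            if cx < cbest:
                best, cbest = x, cx
    return best

# Toy usage: minimize (x - 3)^2 over the integers.
best = simulated_annealing(
    init=0,
    neighbor=lambda x: x + random.choice([-1, 1]),
    cost=lambda x: (x - 3) ** 2,
)
```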

RL-based Stateful Neural Adaptive Sampling and Denoising for Real-Time Path Tracing
Antoine Scardigli Lukas Cavigelli Lorenz K Muller



Research question: Monte-Carlo path tracing produces high noise at low sample counts, limiting its use in real-time applications.
Motivation: Proposes a framework with end-to-end training of a sampling importance network, a latent-space encoder network, and a denoiser network to address this problem.
Method: Reinforcement learning optimizes the sampling importance network, avoiding explicit numerically approximated gradients; instead of averaging per-pixel sample values, all sampled values are fed into a latent-space encoder, which replaces handcrafted spatiotemporal heuristics with learned latent representations; finally, a neural denoiser is trained to refine the output image.
Results: The approach improves visual quality on several challenging datasets and reduces rendering time for equal quality by a factor of 1.6x compared with the previous state of the art, making it a promising solution for real-time applications.

Monte-Carlo path tracing is a powerful technique for realistic image synthesis but suffers from high levels of noise at low sample counts, limiting its use in real-time applications. To address this, we propose a framework with end-to-end training of a sampling importance network, a latent space encoder network, and a denoiser network. Our approach uses reinforcement learning to optimize the sampling importance network, thus avoiding explicit numerically approximated gradients. Our method does not aggregate the sampled values per pixel by averaging but keeps all sampled values which are then fed into the latent space encoder. The encoder replaces handcrafted spatiotemporal heuristics by learned representations in a latent space. Finally, a neural denoiser is trained to refine the output image. Our approach increases visual quality on several challenging datasets and reduces rendering times for equal quality by a factor of 1.6x compared to the previous state-of-the-art, making it a promising solution for real-time applications.

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
Jeonghoon Kim Jung Hyun Lee Sungdong Kim Joonsuk Park Kang Min Yoo Se Jung Kwon Dongsoo Lee



Research question: Large language models (LLMs) face high memory demands and computational costs during fine-tuning and deployment.
Motivation: While parameter-efficient fine-tuning (PEFT) methods aim to reduce the optimizer-state memory during fine-tuning, the inherent size of pre-trained LLM weights remains a pressing concern.
Method: Proposes Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs: by updating only the quantization scales, PEQA can be applied directly to quantized LLMs, ensuring seamless task transitions.
Results: PEQA-tuning is applied to task-specific adaptation of LLMs with up to 65 billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, low-bit quantized LLMs are fine-tuned on an instruction dataset; the results show that even when LLMs are quantized to below 4-bit precision, their language modeling, few-shot in-context learning, and comprehension capabilities can be restored to (or even improved over) their full-precision performance.

Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed to ease memory demands and accelerate LLM inference, most of these techniques are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA) – a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Parallel to existing PEFT methods, PEQA significantly reduces the memory overhead associated with the optimizer state. Furthermore, it leverages the advantages of quantization to substantially reduce model sizes. Even after fine-tuning, the quantization structure of a PEQA-tuned LLM remains intact, allowing for accelerated inference on the deployment stage. We employ PEQA-tuning for task-specific adaptation on LLMs with up to $65$ billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, we fine-tune low-bit quantized LLMs using an instruction dataset. Our results show that even when LLMs are quantized to below 4-bit precision, their capabilities in language modeling, few-shot in-context learning, and comprehension can be resiliently restored to (or even improved over) their full-precision original performances with PEQA.
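
A minimal PyTorch sketch of the core PEQA idea: freeze the low-bit integer weights and make the per-channel quantization scales the only trainable parameters. The quantization here is a simulated round-to-nearest int4; real deployments store packed integers and differ in detail from this illustration.

```python
import torch
import torch.nn as nn

class PEQALinear(nn.Module):
    """Linear layer whose integer weights are frozen buffers; only the
    per-output-channel quantization scales receive gradients."""
    def __init__(self, weight, bits=4):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        scale = weight.abs().amax(dim=1, keepdim=True) / qmax
        self.register_buffer("w_int", torch.round(weight / scale))  # frozen
        self.scale = nn.Parameter(scale)    # the only trainable tensor

    def forward(self, x):
        return x @ (self.scale * self.w_int).t()   # dequantize on the fly

layer = PEQALinear(torch.randn(8, 16))
out = layer(torch.randn(2, 16))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
# trainable == ['scale']; w_int is a buffer and receives no gradient
```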

Understanding How Consistency Works in Federated Learning via Stage-wise Relaxed Initialization
Yan Sun Li Shen Dacheng Tao



Research question: This paper addresses the "client drift" problem in federated learning caused by inconsistent local optima across clients, and explores its impact on federated learning.
Motivation: Federated learning is a distributed paradigm that coordinates many local clients training on heterogeneous datasets to jointly learn a global model; however, existing work lacks a solid theoretical analysis of the "client drift" problem.
Method: Designs an efficient federated learning algorithm, FedInit, which employs a personalized relaxed initialization at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state in the direction opposite to the latest local state; this relaxed initialization helps revise the local divergence and improve local consistency.
Results: An excess risk analysis shows that on non-convex objectives, the optimization error is insensitive to local inconsistency, which instead mainly affects the generalization error bound of FedInit. Experiments validate this conclusion and show that FedInit achieves state-of-the-art performance without extra costs; stage-wise relaxed initialization can also be incorporated into existing advanced algorithms to further improve federated learning performance.

Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on heterogeneous datasets. Previous works have implicitly shown that FL suffers from the "client-drift" problem, which is caused by the inconsistent optima across local clients. However, a solid theoretical analysis explaining the impact of this local inconsistency is still lacking. To alleviate the negative impact of the "client drift" and explore its substance in FL, in this paper we first design an efficient FL algorithm, FedInit, which allows employing a personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. This relaxed initialization helps to revise the local divergence and enhance the local consistency level. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce an excess risk analysis and study the divergence term to investigate the test error of the proposed FedInit method. Our studies show that on non-convex objectives, the optimization error is not sensitive to this local inconsistency, which instead mainly affects the generalization error bound in FedInit. Extensive experiments are conducted to validate this conclusion. Our proposed FedInit achieves state-of-the-art (SOTA) results compared to several advanced benchmarks without any additional costs. Meanwhile, stage-wise relaxed initialization can also be incorporated into current advanced algorithms to achieve higher performance in the FL paradigm.
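
The relaxed initialization has a one-line form implied by the abstract: start local training at the global model shifted away from the client's latest local state. A sketch, with beta as the (assumed) relaxation coefficient:

```python
import numpy as np

def fedinit_state(global_w, last_local_w, beta=0.1):
    """Relaxed initialization: move away from the global state in the
    direction opposite to the client's latest local state."""
    return global_w + beta * (global_w - last_local_w)

g = np.zeros(4)                            # current global parameters
last = np.array([0.2, -0.1, 0.0, 0.4])     # client's last local state
init = fedinit_state(g, last)              # counteracts the client's drift
```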

MathNAS: If Blocks Have a Role in Mathematical Architecture Design
Wang Qinsi JingHan Ke Zhi Liang Sihai Zhang



Research question: How to perform neural architecture search (NAS) effectively so as to achieve faster search and more accurate results for large models.
Motivation: The development of large models has intensified the demand for faster search speeds and more precise search results; however, designing large models via NAS is challenging because of the dramatic growth of the search space and the associated high cost of performance evaluation.
Method: Proposes a novel divide-and-conquer strategy that exploits the modular nature of the search space rather than treating architecture search as a whole problem. Introduces MathNAS, a general NAS framework based on mathematical programming: the performance of every possible building block in the search space is computed first, and a network's performance is then directly predicted from the performance of its building blocks.
Results: The approach is validated on multiple large-scale computer vision and natural language processing benchmarks. On ImageNet-1k, MathNAS achieves 82.5% top-1 accuracy, 1.2% and 0.96% higher than Swin-T and LeViT-256, respectively. Deployed on a mobile device, MathNAS enables real-time search and dynamic network switching within 1 s (0.4 s on a TX2 GPU), surpassing baseline dynamic networks in on-device performance.

Neural Architecture Search (NAS) has emerged as a favoured method for unearthing effective neural architectures. Recent development of large models has intensified the demand for faster search speeds and more accurate search results. However, designing large models by NAS is challenging due to the dramatic increase of the search space and the associated huge performance evaluation cost. Consider a typical modular search space widely used in NAS, in which a neural architecture consists of $m$ block nodes and a block node has $n$ alternative blocks. Facing the space containing $n^m$ candidate networks, existing NAS methods attempt to find the best one by searching and evaluating candidate networks directly. Different from the general strategy that takes architecture search as a whole problem, we propose a novel divide-and-conquer strategy by making use of the modular nature of the search space. Here, we introduce MathNAS, a general NAS framework based on mathematical programming. In MathNAS, the performances of all possible building blocks in the search space are calculated first, and then the performance of a network is directly predicted based on the performances of its building blocks. Although estimating block performances involves network training, just as happens for network performance evaluation in existing NAS methods, predicting network performance is completely training-free and thus extremely fast. In contrast to the $n^m$ candidate networks to evaluate in existing NAS methods, which requires training and a formidable computational burden, there are only $m \cdot n$ possible blocks to handle in MathNAS. Therefore, our approach effectively reduces the complexity of network performance evaluation. The superiority of MathNAS is validated on multiple large-scale CV and NLP benchmark datasets. Notably on ImageNet-1k, MathNAS achieves 82.5\% top-1 accuracy, 1.2\% and 0.96\% higher than Swin-T and LeViT-256, respectively. In addition, when deployed on a mobile device, MathNAS achieves real-time search and dynamic network switching within 1 s (0.4 s on a TX2 GPU), surpassing baseline dynamic networks in on-device performance.

LogSpecT: Feasible Graph Learning Model from Stationary Signals with Recovery Guarantees
Shangyuan LIU Linglingzhi Zhu Anthony Man-Cho So



Research question: Learning graph structure from signals is a core task in graph signal processing (GSP).
Motivation: An important subclass of graph signals, stationary graph signals, which extend the notion of stationarity from data on regular domains to signals on graphs, is gaining popularity in the GSP community. The most commonly used model for learning graphs from such stationary signals is SpecT, the foundation of nearly all subsequent, more advanced models; however, its practical formulation rSpecT has been identified as sensitive to the choice of hyperparameters and, more importantly, may be infeasible as an optimization problem.
Method: Introduces the first condition that guarantees the infeasibility of rSpecT, and designs a new model, LogSpecT, together with its practical formulation rLogSpecT, to overcome this issue. Unlike rSpecT, the new practical model rLogSpecT is always feasible. Recovery guarantees for rLogSpecT are also provided via modern optimization tools related to epi-convergence, which may be of independent interest and significant for various learning problems.
Results: To demonstrate the practical advantages of rLogSpecT, a highly efficient algorithm based on the linearized alternating direction method of multipliers (L-ADMM), with closed-form solutions for each subproblem and convergence guarantees, is proposed. Extensive numerical results on synthetic and real networks not only confirm the stability of the proposed methods but also highlight their comparable and even superior performance relative to existing models.

Graph learning from signals is a core task in graph signal processing (GSP). A significant subclass of graph signals called the stationary graph signals that broadens the concept of stationarity of data defined on regular domains to signals on graphs is gaining increasing popularity in the GSP community. The most commonly used model to learn graphs from these stationary signals is SpecT, which forms the foundation for nearly all the subsequent, more advanced models. Despite its strengths, the practical formulation of the model, known as rSpecT, has been identified to be susceptible to the choice of hyperparameters. More critically, it may suffer from infeasibility as an optimization problem. In this paper, we introduce the first condition that ensures the infeasibility of rSpecT and design a novel model called LogSpecT, along with its practical formulation rLogSpecT to overcome this issue. Contrary to rSpecT, our novel practical model rLogSpecT is always feasible. Furthermore, we provide recovery guarantees of rLogSpecT from modern optimization tools related to epi-convergence, which could be of independent interest and significant for various learning problems. To demonstrate the practical advantages of rLogSpecT, a highly efficient algorithm based on the linearized alternating direction method of multipliers (L-ADMM) that allows closed-form solutions for each subproblem is proposed with convergence guarantees. Extensive numerical results on both synthetic and real networks not only corroborate the stability of our proposed methods, but also highlight their comparable and even superior performance than existing models.

Kissing to Find a Match: Efficient Low-Rank Permutation Representation
Hannah Dröge Zorah Lähner Yuval Bahat Onofre Martorell Nadal Felix Heide Michael Moeller



Research question: This paper targets the key role of large permutation matrices in matching and assignment problems across fields, especially computer vision and robotics.
Motivation: Existing explicit representations of permutation matrices grow quadratically with problem size, leading to huge memory demands and prohibiting large problem instances.
Method: Proposes approximating large permutation matrices by low-rank matrix factorization followed by a nonlinearity, to tackle the curse of dimensionality. Kissing number theory is used to infer the minimal rank required to represent a permutation matrix of a given size, which is significantly smaller than the problem size and thus greatly reduces computation and memory costs.
Results: Experiments show that the method accurately represents large permutation matrices, enabling problems that would otherwise be infeasible, and demonstrate its applicability and merits on a range of tasks involving predicted permutation matrices, from linear and quadratic assignment to shape matching.

Permutation matrices play a key role in matching and assignment problems across the fields, especially in computer vision and robotics. However, memory for explicitly representing permutation matrices grows quadratically with the size of the problem, prohibiting large problem instances. In this work, we propose to tackle the curse of dimensionality of large permutation matrices by approximating them using low-rank matrix factorization, followed by a nonlinearity. To this end, we rely on the Kissing number theory to infer the minimal rank required for representing a permutation matrix of a given size, which is significantly smaller than the problem size. This leads to a drastic reduction in computation and memory costs, e.g., up to $3$ orders of magnitude less memory for a problem of size $n=20000$, represented using $8.4\times10^5$ elements in two small matrices instead of using a single huge matrix with $4\times 10^8$ elements. The proposed representation allows for accurate representations of large permutation matrices, which in turn enables handling large problems that would have been infeasible otherwise. We demonstrate the applicability and merits of the proposed approach through a series of experiments on a range of problems that involve predicting permutation matrices, from linear and quadratic assignment to shape matching problems.

From Distribution Learning in Training to Gradient Search in Testing for Combinatorial Optimization
Yang Li Jinpei Guo Runzhong Wang Junchi Yan



Research question: How to let a model predict combinatorial optimization (CO) solutions while also providing supporting knowledge that benefits search.
Motivation: Current neural networks, in minimizing the average objective score over historical problem instances, diverge from CO's core goal of finding the best solution for every test instance.
Method: Proposes the T2TCO (Training to Testing) framework, which first uses generative modeling to estimate a high-quality solution distribution for each instance during training, and then conducts a gradient-based search within the solution space during testing.
Results: Experiments show that T2TCO has significant advantages on the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), with average performance gains of 49.15% on TSP and 17.27% on MIS over the previous state of the art.

Extensive experiments have gradually revealed the potential performance bottleneck of modeling Combinatorial Optimization (CO) solving as neural solution prediction tasks. The neural networks, in their pursuit of minimizing the average objective score across the distribution of historical problem instances, diverge from the core target of CO of seeking optimal solutions for every test instance. This calls for an effective search on each problem instance, while the model should serve to provide supporting knowledge that benefits the search. To this end, we propose the T2TCO (Training to Testing) framework, which first leverages generative modeling to estimate the high-quality solution distribution for each instance during training, and then conducts a gradient-based search within the solution space during testing. The proposed neural search paradigm consistently leverages generative modeling, specifically diffusion, for gradual solution improvement. It disrupts the local structure of the given solution by introducing noise and reconstructs a lower-cost solution guided by the optimization objective. Experimental results on the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS) show the significant superiority of T2TCO, demonstrating an average performance gain of 49.15% for TSP solving and 17.27% for MIS solving compared to the previous state-of-the-art.

Communication-Efficient Federated Bilevel Optimization with Global and Local Lower Level Problems
Junyi Li Feihu Huang Heng Huang



Research question: Applying bilevel optimization in the federated learning setting and establishing its convergence.
Motivation: Despite notable recent progress in bilevel optimization, its use in federated learning and the effect of federated learning's inherent challenges on algorithmic convergence remain unclear.
Method: We propose FedBiOAcc, a communication-efficient algorithm that combines efficient hyper-gradient estimation in the distributed setting with momentum-based variance-reduction acceleration.
Results: FedBiOAcc achieves $O(\epsilon^{-1})$ communication complexity, $O(\epsilon^{-1.5})$ sample complexity, and linear speedup with respect to the number of clients. We also analyze the special case of federated bilevel optimization in which the lower-level problems are managed locally by clients, and prove that FedBiOAcc-Local, a modified version of FedBiOAcc, converges at the same rate for such problems. Finally, we validate the algorithms on two real-world tasks, federated data cleaning and federated hyper-representation learning, where they show superior performance.

Bilevel Optimization has witnessed notable progress recently with new emerging efficient algorithms. However, its application in the Federated Learning setting remains relatively underexplored, and the impact of Federated Learning's inherent challenges on the convergence of bilevel algorithms remains obscure. In this work, we investigate Federated Bilevel Optimization problems and propose a communication-efficient algorithm, named FedBiOAcc. The algorithm leverages an efficient estimation of the hyper-gradient in the distributed setting and utilizes the momentum-based variance-reduction acceleration. Remarkably, FedBiOAcc achieves a communication complexity of $O(\epsilon^{-1})$, a sample complexity of $O(\epsilon^{-1.5})$, and linear speedup with respect to the number of clients. We also analyze a special case of the Federated Bilevel Optimization problems, where lower level problems are locally managed by clients. We prove that FedBiOAcc-Local, a modified version of FedBiOAcc, converges at the same rate for this type of problem. Finally, we validate the proposed algorithms through two real-world tasks: Federated Data-cleaning and Federated Hyper-representation Learning. Empirical results show superior performance of our algorithms.

Resolving the Tug-of-War: A Separation of Communication and Learning in Federated Learning
Junyi Li Heng Huang



Research question: How to perform machine learning over distributed data while preserving privacy.
Motivation: In the existing federated learning (FL) paradigm, learning and communication place fundamentally divergent requirements on parameter selection.
Method: Propose FedSep, a novel two-layer federated learning framework that separates learning from communication and connects the two layers through decode/encode operations.
Results: Theory shows that FedSep's convergence matches that of standard FL algorithms, and empirical validation shows that FedSep outperforms various baselines across tasks.

Federated learning (FL) is a promising privacy-preserving machine learning paradigm over distributed data. In this paradigm, each client trains the parameters of a model locally and the server periodically aggregates the parameters from clients. Therefore, we perform the learning and communication over the same set of parameters. However, we find that learning and communication have fundamentally divergent requirements for parameter selection, akin to two opposite teams in a tug-of-war game. To mitigate this discrepancy, we introduce FedSep, a novel two-layer federated learning framework. FedSep consists of separated communication and learning layers for each client and the two layers are connected through decode/encode operations. In particular, the decoding operation is formulated as a minimization problem. We view FedSep as a federated bilevel optimization problem and propose an efficient algorithm to solve it. Theoretically, we demonstrate that its convergence matches that of the standard FL algorithms. The separation of communication and learning in FedSep offers innovative solutions to various challenging problems in FL, such as Communication-Efficient FL and Heterogeneous-Model FL. Empirical validation shows the superior performance of FedSep over various baselines in these tasks.

Convolutional State Space Models for Long-Range Spatiotemporal Modeling
Jimmy T.H. Smith Shalini De Mello Jan Kautz Scott Linderman Wonmin Byeon



Research question: How to model long spatiotemporal sequences effectively while handling complex spatial correlations and long-range temporal dependencies.
Motivation: Modeling long spatiotemporal sequences is challenging because complex spatial correlations and long-range temporal dependencies must be handled at the same time. ConvLSTMs and Transformers both attempt this, but the former are slow to train and the latter scale poorly to longer sequences.
Method: Propose convolutional state space models (ConvSSM), which combine the tensor-modeling ideas of ConvLSTM with state-space sequence-modeling approaches such as S4 and S5. Parallel scans give subquadratic parallelization and fast autoregressive generation, and an established dynamical equivalence between ConvSSMs and SSMs yields parameterization and initialization strategies for modeling long-range dependencies.
Results: This yields ConvS5, an efficient ConvSSM variant for long-range spatiotemporal modeling. On a long-horizon Moving-MNIST experiment, ConvS5 significantly outperforms Transformers and ConvLSTM while training 3x faster than ConvLSTM and generating samples 400x faster than Transformers. ConvS5 also matches or exceeds state-of-the-art methods on the DMLab, Minecraft, and Habitat prediction benchmarks, opening new directions for long spatiotemporal sequence modeling.

Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states with recurrent neural networks, but their sequential computation makes them slow to train. In contrast, Transformers can process an entire spatiotemporal sequence, compressed into tokens, in parallel. However, the cost of attention scales quadratically in length, limiting their scalability to longer sequences. Here, we address the challenges of prior methods and introduce convolutional state space models (ConvSSM) that combine the tensor modeling ideas of ConvLSTM with the long sequence modeling approaches of state space methods such as S4 and S5. First, we demonstrate how parallel scans can be applied to convolutional recurrences to achieve subquadratic parallelization and fast autoregressive generation. We then establish an equivalence between the dynamics of ConvSSMs and SSMs, which motivates parameterization and initialization strategies for modeling long-range dependencies. The result is ConvS5, an efficient ConvSSM variant for long-range spatiotemporal modeling. ConvS5 significantly outperforms Transformers and ConvLSTM on a long horizon Moving-MNIST experiment while training $3\times$ faster than ConvLSTM and generating samples $400\times$ faster than Transformers. In addition, ConvS5 matches or exceeds the performance of state-of-the-art methods on challenging DMLab, Minecraft and Habitat prediction benchmarks and enables new directions for modeling long spatiotemporal sequences.
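
A compact way to see the ConvSSM-SSM equivalence mentioned above: under a circular-convolution recurrence, every Fourier mode of the state evolves as an independent scalar SSM, which is exactly the diagonal structure that S4/S5-style parallel scans exploit. The sketch below (1-D state, constant kernels, a sequential loop standing in for an associative scan) is an illustrative assumption rather than the paper's implementation.

import numpy as np

def conv_ssm(a_kernel, b_kernel, inputs):
    # x_t = a * x_{t-1} + b * u_t with * = circular convolution.  In the
    # Fourier domain this becomes a diagonal (per-frequency) linear
    # recurrence, so the combine step is associative and a parallel scan
    # applies; the loop below is only a sequential reference.
    A, B = np.fft.fft(a_kernel), np.fft.fft(b_kernel)
    U = np.fft.fft(inputs, axis=-1)
    states, s = np.empty_like(U), np.zeros_like(U[0])
    for t in range(U.shape[0]):
        s = A * s + B * U[t]
        states[t] = s
    return np.fft.ifft(states, axis=-1).real

x = conv_ssm(np.array([0.5, 0.2, 0.0, 0.1]), np.array([1.0, 0.0, 0.0, 0.0]),
             np.random.default_rng(1).standard_normal((6, 4)))
print(x.shape)  # (6, 4): 6 time steps of a 4-dimensional spatial state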

AutoGO: Automated Computation Graph Optimization for Neural Network Evolution
Mohammad Salameh Keith G Mills Negar Hassanpour Fred X. Han Shuting Zhang Wei Lu SHANGLING JUI CHUNHUA ZHOU Fengyu Sun Di Niu



Research question: Optimizing deep neural networks to obtain high-quality models for efficient real-world deployment.
Motivation: Existing methods either search for neural architectures in heuristic design spaces or apply low-level adjustments to computation primitives to improve inference efficiency on hardware.
Method: Propose the Automated Graph Optimization (AutoGO) framework, which uses a tokenization scheme to evolve neural networks over the primitive operations of a low-level computation graph, improving both performance and hardware friendliness.
Results: Extensive experiments show that AutoGO automatically evolves several typical large convolutional networks on a range of computer vision tasks, delivering significant task-performance improvements and FLOPs reductions without introducing any new primitive operations.

Optimizing Deep Neural Networks (DNNs) to obtain high-quality models for efficient real-world deployment has posed multi-faceted challenges to machine learning engineers. Existing methods either search for neural architectures in heuristic design spaces or apply low-level adjustments to computation primitives to improve inference efficiency on hardware. We present Automated Graph Optimization (AutoGO), a framework to evolve neural networks in a low-level Computation Graph (CG) of primitive operations to improve both their performance and hardware friendliness. Through a tokenization scheme, AutoGO performs variable-sized segment mutations, making both primitive changes and larger-grained changes to CGs. We introduce our segmentation and mutation algorithms, an efficient frequent segment mining technique, as well as a pretrained context-aware predictor to estimate the impact of segment replacements. Extensive experimental results show that AutoGO can automatically evolve several typical large convolutional networks to achieve significant task performance improvement and FLOPs reduction on a range of CV tasks, ranging from Classification, Semantic Segmentation, Human Pose Estimation, to Super Resolution, without introducing any new primitive operations. We also demonstrate the lightweight deployment results of AutoGO-optimized super-resolution and denoising U-Nets on a cycle simulator for a Neural Processing Unit (NPU), achieving PSNR improvement and latency/power reduction simultaneously. Code available at https://github.com/Ascend-Research/AutoGO.

Learning to Configure Separators in Branch-and-Cut
Sirui Li Wenbin Ouyang Max B. Paulus Cathy Wu



Research question: How to select separators effectively to accelerate the solving of mixed integer linear programs (MILP).
Motivation: Modern MILP solvers rely on a variety of separators to generate diverse cutting planes, but choosing among the separators is challenging.
Method: We propose a data-driven strategy that restricts the selection space and a learning-guided algorithm over the restricted space. The method predicts instance-aware separator configurations that adapt dynamically during the solve, effectively accelerating the open-source MILP solver SCIP.
Results: The method improves relative solve time by up to 72% on synthetic and 37% on real-world MILP benchmarks.

Cutting planes are crucial in solving mixed integer linear programs (MILP) as they facilitate bound improvements on the optimal solution. Modern MILP solvers rely on a variety of separators to generate a diverse set of cutting planes by invoking the separators frequently during the solving process. This work identifies that MILP solvers can be drastically accelerated by appropriately selecting separators to activate. As the combinatorial separator selection space imposes challenges for machine learning, we *learn to separate* by proposing a novel data-driven strategy to restrict the selection space and a learning-guided algorithm on the restricted space. Our method predicts instance-aware separator configurations which can dynamically adapt during the solve, effectively accelerating the open source MILP solver SCIP by improving the relative solve time up to 72% and 37% on synthetic and real-world MILP benchmarks. Our work complements recent work on learning to select cutting planes and highlights the importance of separator management.

Embedding Space Interpolation Beyond Mini-Batch, Beyond Pairs and Beyond Examples
Shashanka Venkataramanan Ewa Kijak laurent amsaleg Yannis Avrithis



Research question: How to use interpolation-based data augmentation to go beyond empirical risk minimization (ERM).
Motivation: Most methods generate a limited number of examples in the input space, and the number of examples being interpolated is typically limited to two.
Method: Propose MultiMix and Dense MultiMix, which generate an arbitrarily large number of interpolated examples, beyond the mini-batch size, and interpolate the entire mini-batch in the embedding space.
Results: On four different benchmarks, the methods significantly improve over state-of-the-art mixup methods even though the interpolation is only linear. Analyzing the embedding space shows that classes become more tightly clustered and uniformly spread, which explains the improved behavior.

Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Its extensions mostly focus on the definition of interpolation and the space (input or feature) where it takes place, while the augmentation process itself is less studied. In most methods, the number of generated examples is limited to the mini-batch size and the number of examples being interpolated is limited to two (pairs), in the input space. We make progress in this direction by introducing MultiMix, which generates an arbitrarily large number of interpolated examples beyond the mini-batch size and interpolates the entire mini-batch in the embedding space. Effectively, we sample on the entire convex hull of the mini-batch rather than along linear segments between pairs of examples. On sequence data, we further extend to Dense MultiMix. We densely interpolate features and target labels at each spatial location and also apply the loss densely. To mitigate the lack of dense labels, we inherit labels from examples and weight interpolation factors by attention as a measure of confidence. Overall, we increase the number of loss terms per mini-batch by orders of magnitude at little additional cost. This is only possible because of interpolating in the embedding space. We empirically show that our solutions yield significant improvement over state-of-the-art mixup methods on four different benchmarks, despite interpolation being only linear. By analyzing the embedding space, we show that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.
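
The convex-hull sampling behind MultiMix fits in a few lines: draw Dirichlet weights over the whole mini-batch and form that many convex combinations of embeddings and one-hot labels. The Dirichlet concentration and the number m of generated examples below are illustrative assumptions.

import numpy as np

def multimix(embeddings, one_hot_labels, m=1024, alpha=1.0, seed=0):
    # Each of the m rows of w is a point on the simplex over the b batch
    # elements, i.e. a sample from the convex hull of the mini-batch.
    b = embeddings.shape[0]
    w = np.random.default_rng(seed).dirichlet(alpha * np.ones(b), size=m)  # (m, b)
    return w @ embeddings, w @ one_hot_labels   # interpolated features and soft labels

z = np.random.default_rng(2).standard_normal((128, 64))          # batch of embeddings
y = np.eye(10)[np.random.default_rng(3).integers(0, 10, 128)]    # one-hot labels
z_mix, y_mix = multimix(z, y)
print(z_mix.shape, y_mix.shape)  # (1024, 64) (1024, 10): far more than the batch size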

HotBEV: Hardware-oriented Transformer-based Multi-View 3D Detector for BEV Perception
Peiyan Dong Zhenglun Kong Xin Meng Pinrui Yu Yifan Gong Geng Yuan Hao Tang Yanzhi Wang



Research question: How to design a low-latency, efficient, and accurate bird's-eye-view (BEV) perception model for real-time decision-making in autonomous driving systems.
Motivation: Existing BEV detection methods improve detection accuracy, but their heavy computation and memory burdens increase the risk of system crashes, and they pay little attention to actual on-device latency.
Method: Propose a latency-aware design methodology that accounts for hardware properties such as memory access cost and degree of parallelism. Using a theoretical latency prediction model and efficient building operators, we develop a hardware-oriented backbone optimized for feature capturing and fusion.
Results: Experiments show that the proposed HotBEV is 1.1x to 6.3x faster than other methods across multiple GPU devices, while achieving 2% to 23% NDS gains and 2% to 7.8% mAP gains on V100.

The bird's-eye-view (BEV) perception plays a critical role in autonomous driving systems, involving the accurate and efficient detection and tracking of objects from a top-down perspective. To achieve real-time decision-making in self-driving scenarios, low-latency computation is essential. While recent approaches to BEV detection have focused on improving detection precision using Lift-Splat-Shoot (LSS)-based or transformer-based schemas, the substantial computational and memory burden of these approaches increases the risk of system crashes when multiple on-vehicle tasks run simultaneously. Unfortunately, there is a dearth of literature on efficient BEV detector paradigms, let alone achieving realistic speedups. Unlike existing works that focus on reducing computation costs, this paper focuses on developing an efficient model design that prioritizes actual on-device latency. To achieve this goal, we propose a latency-aware design methodology that considers key hardware properties, such as memory access cost and degree of parallelism. Given the prevalence of GPUs as the main computation platform for autonomous driving systems, we develop a theoretical latency prediction model and introduce efficient building operators. By leveraging these operators and following an effective local-to-global visual modeling process, we propose a hardware-oriented backbone that is also optimized for strong feature capturing and fusing. Using these insights, we present a new hardware-oriented framework for efficient yet accurate camera-view BEV detectors. Experiments show that HotBEV achieves a 2\%$\sim$23\% NDS gain and a 2\%$\sim$7.8\% mAP gain with 1.1$\times$$\sim$3.4$\times$ speedups compared to existing works on V100; on multiple GPU devices such as the GTX 2080 and the low-end GTX 1080, HotBEV runs 1.1$\times$$\sim$6.3$\times$ faster than others.

PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile
Peiyan Dong LEI LU Chao Wu Cheng Lyu Geng Yuan Hao Tang Yanzhi Wang



Research question: Transformer models for computer vision require substantial computation and memory; optimizing their hardware efficiency and reducing inference latency is difficult.
Motivation: Current commodity hardware such as CPUs and GPUs executes quantized networks below 8-bit precision inefficiently, and the literature on sub-8-bit quantization of transformer models is sparse.
Method: We propose PackQViT, an activation-aware, fully sub-8-bit quantization-aware training framework. It tailors the quantization strategy and precision to the data activations, applying log2 quantization or clipping to the long-tailed distributions, introducing outlier-aware training for residual-link quantization, and using Int-$2^{n}$-Softmax, Int-LayerNorm, and Integer GELU for an integer-only computation flow; finally, a SIMD-based 4-bit packed multiplier enables end-to-end ViT acceleration on mobile phones.
Results: Compared with prior studies that quantize ViTs at 8-bit precision, PackQViT improves accuracy by 0.4% to 17.9% across widely used ViTs on the ImageNet dataset; at 4-bit precision, PackQViT is 0.4% to 2.8% more accurate. On a Realme GT Android smartphone with a Snapdragon 870 SoC CPU, it achieves 2.6x to 3.7x speedups over the baseline multiplier in the 8-bit scenario and 3.8x to 5.9x in the 4-bit scenario, ensuring practical real-time performance.

While Vision Transformers (ViTs) have undoubtedly made impressive strides in computer vision (CV), their intricate network structures necessitate substantial computation and memory resources. A decision-making process for CV tasks typically entails performing computations with low latency, which is a tricky problem for ViT models. Model quantization is a widely-used technique to optimize the hardware efficiency of deep neural networks. Full quantization under sub-8-bit precision, in particular, is a promising solution to reduce inference latency significantly. Unfortunately, current commodity hardware, such as CPUs and GPUs, still struggles to efficiently execute these sub-8-bit quantized networks, as their SIMD instructions only support a granularity of 8 bits or wider. Also, there is a scarcity of literature that presents a full quantization paradigm for ViTs. In this paper, we propose an activation-aware fully sub-8-bit quantization-aware training (QAT) framework called PackQViT for efficient yet accurate ViT acceleration on mobile devices to facilitate real-time AI-powered decision-making. Specifically, in revisiting data activation within the ViT dataflow, two characteristics are relevant to quantization strategy and precision: the long-tailed distribution and systematic channel-wise outliers. In response, we employ either log2 quantization or clipping to address the long-tailed distribution and incorporate outlier-aware training for residual link quantization to regulate the various channel-wise outliers more consistently. Notably, due to the systematic fixed pattern, the outlier-aware training approach can predict the channel indices and regularized scales of outliers in advance, thus avoiding runtime data-adaptive selection during inference. Furthermore, we employ Int-$2^{n}$-Softmax, Int-LayerNorm, and Integer GELU to enable an integer-only computation flow. Finally, we develop a SIMD-based 4-bit packed multiplier to achieve end-to-end ViT acceleration on mobile phones. Compared to prior studies on ViT quantization using 8-bit precision, PackQViT surpasses other works with accuracy improvements ranging from 0.4\% to 17.9\% for various widely used ViTs on the ImageNet dataset; under 4-bit precision, PackQViT demonstrates 0.4\%$\sim$2.8\% higher accuracy. Compared to the baseline multiplier, our implementations on the Realme GT Android smartphone with Snapdragon 870 SoC CPU achieve 2.6x$\sim$3.7x speedup under the 8-bit scenario and 3.8x$\sim$5.9x speedup under the 4-bit scenario, which ensures practical real-time performance.
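
As one concrete piece of such a pipeline, the sketch below shows power-of-two (log2) quantization, one of the two options the abstract mentions for long-tailed activations: magnitudes are rounded to the nearest signed power of two, so multiplications reduce to bit shifts. The 4-bit level budget and the clipping window are illustrative assumptions.

import numpy as np

def log2_quantize(x, n_bits=4):
    # Round each magnitude to the nearest power of two, keeping only
    # 2^(n_bits-1) exponent levels below the largest observed exponent
    # (an assumed clipping rule for the sketch).
    sign = np.sign(x)
    e = np.round(np.log2(np.clip(np.abs(x), 1e-8, None)))
    e = np.clip(e, e.max() - 2 ** (n_bits - 1) + 1, e.max())
    return sign * 2.0 ** e

x = np.random.default_rng(4).standard_normal(8)
print(np.round(log2_quantize(x), 4))  # every value is a signed power of two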

Fast Trainable Projection for Robust Fine-tuning
Junjiao Tian Yen-Cheng Liu James Smith Zsolt Kira



Research question: How to retain strong in-distribution performance while preserving out-of-distribution robustness when transferring a pre-trained model to downstream tasks.
Motivation: Projected gradient descent has been used successfully for robust fine-tuning, but the algorithm has scalability and efficiency problems.
Method: Propose Fast Trainable Projection (FTP), a new projection-based fine-tuning algorithm that learns per-layer projection constraints with improved computational efficiency.
Results: Experiments show superior robustness on out-of-distribution datasets across four vision tasks and five pre-trained models. Thanks to its easy adaptability, FTP also benefits other learning scenarios, such as low-label and continual learning settings.

Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining the out-of-distribution (OOD) robustness of a pre-trained model when transferring it to a downstream task. Recently, projected gradient descent has been successfully used in robust fine-tuning by constraining the deviation from the initialization of the fine-tuned model explicitly through projection. However, algorithmically, two limitations prevent this method from being adopted more widely, scalability and efficiency. In this paper, we propose a new projection-based fine-tuning algorithm, Fast Trainable Projection (FTP) for computationally efficient learning of per-layer projection constraints, resulting in an average 35% speedup on our benchmarks compared to prior works. FTP can be combined with existing optimizers such as AdamW, and be used in a plug-and-play fashion. Finally, we show that FTP is a special instance of hyper-optimizers that tune the hyper-parameters of optimizers in a learnable manner through nested differentiation. Empirically, we show superior robustness on OOD datasets, including domain shifts and natural corruptions, across four different vision tasks with five different pre-trained models. Additionally, we demonstrate that FTP is broadly applicable and beneficial to other learning scenarios such as low-label and continual learning settings thanks to its easy adaptability. The code will be available at https://github.com/GT-RIPL/FTP.git.
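
A minimal sketch of the projection step that underlies projection-based robust fine-tuning: after each optimizer update, a layer's deviation from the pre-trained initialization is projected back onto an L2 ball. FTP learns the per-layer radius; fixing it here is a simplifying assumption.

import numpy as np

def project_to_ball(w, w_pretrained, radius):
    # Constrain the fine-tuned weights to stay within `radius` of the
    # pre-trained weights (L2 projection onto a ball).
    d = w - w_pretrained
    norm = np.linalg.norm(d)
    if norm <= radius:
        return w
    return w_pretrained + d * (radius / norm)

w0 = np.random.default_rng(5).standard_normal(100)             # pre-trained weights
w = w0 + 0.5 * np.random.default_rng(6).standard_normal(100)   # after an optimizer step
w_proj = project_to_ball(w, w0, radius=1.0)
print(round(float(np.linalg.norm(w_proj - w0)), 4))             # <= 1.0 by construction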

ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation
Shuyang Sun Weijun Wang Andrew G. Howard Qihang Yu Philip Torr Liang-Chieh Chen



Research question: How to improve the training efficiency and performance of end-to-end mask-transformer architectures for panoptic segmentation.
Motivation: The complexity of the panoptic segmentation training objective leads to heavy penalization of false positives, and this unbalanced loss makes training difficult, especially for efficient models.
Method: Propose ReMaX, which relaxes mask predictions and class predictions during the training phase to improve training.
Results: Experiments show that the method clearly improves model performance with no extra inference cost. Combined with efficient backbones such as MobileNetV3-Small, it achieves new state-of-the-art panoptic segmentation results on COCO, ADE20K, and Cityscapes.

This paper presents a new mechanism to facilitate the training of mask transformers for efficient panoptic segmentation, democratizing its deployment. We observe that the high complexity of the training objective of panoptic segmentation inevitably leads to much higher penalization of false positives. Such unbalanced loss makes the training process of the end-to-end mask-transformer based architectures difficult, especially for efficient models. In this paper, we present ReMaX that adds relaxation to mask predictions and class predictions during the training phase for panoptic segmentation. We demonstrate that via these simple relaxation techniques during training, our model can be consistently improved by a clear margin without any extra computational cost on inference. By combining our method with efficient backbones like MobileNetV3-Small, our method achieves new state-of-the-art results for efficient panoptic segmentation on COCO, ADE20K and Cityscapes. Code and pre-trained checkpoints will be available at https://github.com/google-research/deeplab2.

“Why Not Looking backward?” A Robust Two-Step Method to Automatically Terminate Bayesian Optimization
Shuang Li Ke Li Wei Li



Research question: How to terminate Bayesian optimization (BO) effectively for expensive black-box optimization problems.
Motivation: BO is a powerful method for expensive black-box optimization, but deciding when to terminate it has a major impact on both solution quality and computational efficiency.
Method: We propose a simple yet theoretically grounded two-step method for automatically terminating BO. The core idea is to proactively identify whether the search is within a convex region by examining previously observed samples; BO is halted once the local regret within that convex region falls below a predetermined threshold. To enhance numerical stability, we propose an approximation that computes the termination indicator by solving a bilevel optimization problem.
Results: Extensive empirical studies on diverse benchmark problems, including synthetic functions, reinforcement learning, and hyperparameter optimization, show that the method saves up to 80% of the computational budget while incurring an order of magnitude smaller performance degradation than peer methods. The termination method is also robust to the setting of its termination criterion.

Bayesian Optimization (BO) is a powerful method for tackling expensive black-box optimization problems. As a sequential model-based optimization strategy, BO iteratively explores promising solutions until a predetermined budget, either iterations or time, is exhausted. The decision on when to terminate BO significantly influences both the quality of solutions and its computational efficiency. In this paper, we propose a simple, yet theoretically grounded, two-step method for automatically terminating BO. Our core concept is to proactively identify if the search is within a convex region by examining previously observed samples. BO is halted once the local regret within this convex region falls below a predetermined threshold. To enhance numerical stability, we propose an approximation method for calculating the termination indicator by solving a bilevel optimization problem. We conduct extensive empirical studies on diverse benchmark problems, including synthetic functions, reinforcement learning, and hyperparameter optimization. Experimental results demonstrate that our proposed method saves up to $\approx 80\%$ of the computational budget while incurring an order of magnitude smaller performance degradation, compared against the other peer methods. In addition, our proposed termination method is robust in terms of the setting of its termination criterion.
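
A heavily simplified sketch of the two-step idea: (1) test whether recent samples lie in a locally convex region by fitting a convex surrogate, and (2) stop when the estimated local regret, the best observed value minus the surrogate's minimum, drops below a threshold. The 1-D quadratic fit and the regret estimate are illustrative assumptions; the paper computes the indicator by solving a bilevel optimization problem.

import numpy as np

def should_terminate(X, y, eps=1e-3):
    # Step 1: fit y ~ a x^2 + b x + c to the observed 1-D samples;
    # a > 0 indicates the search sits in a locally convex region.
    A = np.stack([X**2, X, np.ones_like(X)], axis=1)
    a, b, c = np.linalg.lstsq(A, y, rcond=None)[0]
    if a <= 0:
        return False
    # Step 2: local regret = best observed value minus the surrogate minimum.
    local_min = c - b**2 / (4 * a)
    return y.min() - local_min < eps

X = np.linspace(-1, 1, 20)
y = (X - 0.1) ** 2 + 0.01 * np.random.default_rng(7).standard_normal(20)
print(should_terminate(X, y, eps=0.05))  # True: regret near zero in a convex region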

Hypervolume Maximization: A Geometric View of Pareto Set Learning
Xiaoyuan Zhang Xi Lin Bo Xue Yifan Chen Qingfu Zhang



Research question: Propose a novel multiobjective algorithm that models the Pareto set with neural networks.
Motivation: Unlike previous methods that mainly focus on identifying a finite number of solutions, the approach models the entire Pareto set directly.
Method: Establish an equivalence between learning the complete Pareto set and maximizing the associated hypervolume, which enables convergence analysis of the hypervolume (as a new metric) for Pareto set learning. In particular, the new analysis framework reveals the connection between the learned Pareto solutions and their representations in a polar coordinate system.
Results: The method is evaluated on various benchmark and real-world problems, and the encouraging results make it a potentially viable alternative to existing multiobjective algorithms. Code is available at https://github.com/xzhang2523/hvpsl/tree/master.

This paper presents a novel approach to multiobjective algorithms aimed at modeling the Pareto set using neural networks. Whereas previous methods mainly focused on identifying a finite number of solutions, our approach allows for the direct modeling of the entire Pareto set. Furthermore, we establish an equivalence between learning the complete Pareto set and maximizing the associated hypervolume, which enables the convergence analysis of hypervolume (as a new metric) for Pareto set learning. Specifically, our new analysis framework reveals the connection between the learned Pareto solution and its representation in a polar coordinate system. We evaluate our proposed approach on various benchmark problems and real-world problems, and the encouraging results make it a potentially viable alternative to existing multiobjective algorithms. Code is available at \url{https://github.com/xzhang2523/hvpsl/tree/master}.

On the Pareto Front of Multilingual Neural Machine Translation
Liang Chen Shuming Ma Dongdong Zhang Furu Wei Baobao Chang



Research question: How the performance of a given direction changes with its sampling ratio in multilingual neural machine translation (MNMT).
Motivation: Training over 200 multilingual models reveals that the performance of certain translation directions does not always improve as their weight in the multi-task optimization objective increases, which makes improving the overall performance challenging when the training corpus is data-imbalanced.
Method: Based on these observations, propose the Double Power Law to predict the unique performance trade-off front in MNMT, which is robust across languages, data adequacy, and numbers of tasks, and cast the sampling-ratio selection problem in MNMT as an optimization problem based on the Double Power Law.
Results: Extensive experiments show that the method outperforms temperature searching and gradient manipulation methods while using only 1/5 to 1/2 of the total training budget.

In this work, we study how the performance of a given direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By training over 200 multilingual models with various model sizes, data sizes, and language directions, we find, interestingly, that the performance of certain translation directions does not always improve with the increase of their weight in the multi-task optimization objective. Accordingly, the scalarization method leads to a multitask trade-off front that deviates from the traditional Pareto front when there exists data imbalance in the training corpus, which poses a great challenge to improving the overall performance of all directions. Based on our observations, we propose the Double Power Law to predict the unique performance trade-off front in MNMT, which is robust across various languages, data adequacy, and the number of tasks. Finally, we formulate the sample ratio selection problem in MNMT as an optimization problem based on the Double Power Law. Extensive experiments show that it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget. We release the code at https://github.com/pkunlp-icler/ParetoMNMT for reproduction.

TexQ: Zero-shot Network Quantization with Texture Feature Distribution Calibration
Xinrui Chen Yizhi Wang Renao Yan Yiqing Liu Tian Guan Yonghong He



Research question: How to compress neural networks effectively and improve the processing efficiency of neural network models on edge devices.
Motivation: Most existing quantization methods use real datasets to optimize quantization parameters and fine-tune, which raises privacy and security concerns; a natural alternative is to introduce synthetic samples for zero-shot quantization (ZSQ).
Method: Propose TexQ, a novel ZSQ method. It first synthesizes a calibration image and extracts a calibration center for each class using a texture feature energy distribution calibration method, then uses the calibration centers to guide a generator in synthesizing samples, and finally introduces a mixup knowledge distillation module to diversify the synthetic samples for fine-tuning.
Results: Extensive experiments on CIFAR10/100 and ImageNet show that TexQ excels at ultra-low bit-width quantization; for example, when ResNet-18 is quantized to 3 bits, TexQ achieves a 12.18% top-1 accuracy gain on ImageNet over state-of-the-art methods.

Quantization is an effective way to compress neural networks. By reducing the bit width of the parameters, the processing efficiency of neural network models at edge devices can be notably improved. Most conventional quantization methods utilize real datasets to optimize quantization parameters and fine-tune. Due to the inevitable privacy and security issues of real samples, the existing real-data-driven methods are no longer applicable. Thus, a natural method is to introduce synthetic samples for zero-shot quantization (ZSQ). However, the conventional synthetic samples fail to retain the detailed texture feature distributions, which severely limits the knowledge transfer and performance of the quantized model. In this paper, a novel ZSQ method, TexQ, is proposed to address this issue. We first synthesize a calibration image and extract its calibration center for each class with a texture feature energy distribution calibration method. Then, the calibration centers are used to guide the generator to synthesize samples. Finally, we introduce the mixup knowledge distillation module to diversify synthetic samples for fine-tuning. Extensive experiments on CIFAR10/100 and ImageNet show that TexQ achieves state-of-the-art performance in ultra-low bit width quantization. For example, when ResNet-18 is quantized to 3-bit, TexQ achieves a 12.18% top-1 accuracy increase on ImageNet compared to state-of-the-art methods. Code at https://github.com/dangsingrue/TexQ.

MCUFormer: Deploying Vision Tranformers on Microcontrollers with Limited Memory
Yinan Liang Ziwei Wang Xiuwei Xu Yansong Tang Jie Zhou Jiwen Lu



Research question: How to deploy vision transformers on microcontrollers with limited memory.
Motivation: Because GPUs are expensive and power-hungry, deploying deep models on IoT devices such as microcontrollers contributes significantly to ecological AI. Existing methods enable convolutional neural network inference on high-resolution images on microcontrollers, but a framework for vision transformers, which achieve state-of-the-art performance in many vision applications, remains unexplored.
Method: Propose MCUFormer, a hardware-algorithm co-optimization method for deploying vision transformers on microcontrollers with extremely limited memory, jointly designing the transformer architecture and building an inference operator library to fit the memory constraint. Specifically, one-shot network architecture search (NAS) is generalized to discover the architecture with the best task performance under the microcontroller's memory budget, enlarging the existing search space of vision transformers with low-rank decomposition dimensions and patch resolutions for memory reduction. For the operator library, the memory buffer is scheduled during inference through operator integration, patch-embedding decomposition, and token overwriting, so that the buffer is fully utilized for the vision transformer's forward pass.
Results: Experiments show that MCUFormer achieves 73.62% top-1 accuracy on ImageNet image classification with 320KB of memory on an STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.

Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes significant contributions to ecological AI. Conventional methods successfully enable convolutional neural network inference of high resolution images on microcontrollers, while the framework for vision transformers that achieve the state-of-the-art performance in many vision applications still remains unexplored. In this paper, we propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, where we jointly design the transformer architecture and construct the inference operator library to fit the memory resource constraint. More specifically, we generalize the one-shot network architecture search (NAS) to discover the optimal architecture with the highest task performance given the memory budget of the microcontroller, where we enlarge the existing search space of vision transformers by considering the low-rank decomposition dimensions and patch resolution for memory reduction. For the construction of the inference operator library of vision transformers, we schedule the memory buffer during inference through operator integration, patch embedding decomposition, and token overwriting, allowing the memory buffer to be fully utilized to adapt to the forward pass of the vision transformer. Experimental results demonstrate that our MCUFormer achieves 73.62\% top-1 accuracy on ImageNet for image classification with 320KB memory on the STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.

Greedy Poisson Rejection Sampling
Gergely Flamich



Research question: One-shot channel simulation, i.e., encoding a single sample from a target distribution Q using a coding distribution P with as few bits as possible on average.
Motivation: Existing one-shot channel simulation solutions are too slow or too limited in applicability for widespread adoption.
Method: By constructing a rejection sampling procedure equivalent to greedily searching over the points of a Poisson process, we propose greedy Poisson rejection sampling (GPRS) and analyze the correctness and time complexity of several of its variants.
Results: Experiments verify our theorems and show that GPRS significantly outperforms the current state-of-the-art method, A* coding.

One-shot channel simulation is a fundamental data compression problem concerned with encoding a single sample from a target distribution $Q$ using a coding distribution $P$ using as few bits as possible on average. Algorithms that solve this problem find applications in neural data compression and differential privacy and can serve as a more efficient and natural alternative to quantization-based methods. Unfortunately, existing solutions are too slow or have limited applicability, preventing their widespread adaptation. In this paper, we conclusively solve one-shot channel simulation for one-dimensional problems where the target-proposal density ratio is unimodal by describing an algorithm with optimal runtime. We achieve this by constructing a rejection sampling procedure equivalent to greedily searching over the points of a Poisson process. Hence, we call our algorithm greedy Poisson rejection sampling (GPRS) and analyze the correctness and time complexity of several of its variants. Finally, we empirically verify our theorems, demonstrating that GPRS significantly outperforms the current state-of-the-art method, A* coding.

AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix
Yun Yue Zhiling Ye Jiadi Jiang Yongchao Liu Ke Zhang



Research question: How to design the preconditioning matrix of adaptive optimizers so that it provides better gradient information and regulates the step size of each gradient direction.
Motivation: The preconditioning matrix is a key component behind the success of adaptive optimizers such as Adam, and building it from better curvature information promises improved generalization.
Method: Use the gradient difference between two successive steps as the diagonal elements of the preconditioning matrix; these elements are closely related to the Hessian and approximate the inner product between Hessian row vectors and the difference of adjacent parameter vectors. An auto-switching function additionally lets the preconditioner switch dynamically between SGD and the adaptive optimizer. The resulting optimizer is named AGD.
Results: On public NLP, CV, and recommendation-system datasets, AGD outperforms state-of-the-art optimizers with highly competitive or significantly better predictive performance, and analysis shows how it switches automatically between SGD and adaptive behavior in various scenarios.

Adaptive optimizers, such as Adam, have achieved remarkable success in deep learning. A key component of these optimizers is the so-called preconditioning matrix, providing enhanced gradient information and regulating the step size of each gradient direction. In this paper, we propose a novel approach to designing the preconditioning matrix by utilizing the gradient difference between two successive steps as the diagonal elements. These diagonal elements are closely related to the Hessian and can be perceived as an approximation of the inner product between the Hessian row vectors and difference of the adjacent parameter vectors. Additionally, we introduce an auto-switching function that enables the preconditioning matrix to switch dynamically between Stochastic Gradient Descent (SGD) and the adaptive optimizer. Based on these two techniques, we develop a new optimizer named AGD that enhances the generalization performance. We evaluate AGD on public datasets of Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys). Our experimental results demonstrate that AGD outperforms the state-of-the-art (SOTA) optimizers, achieving highly competitive or significantly better predictive performance. Furthermore, we analyze how AGD is able to switch automatically between SGD and the adaptive optimizer and its actual effects on various scenarios. The code is available at https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers.
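
A minimal sketch of the idea: use the stepwise gradient difference $|g_t - g_{t-1}|$ as the diagonal of the preconditioning matrix (a cheap Hessian proxy) and fall back to plain SGD wherever the preconditioner entry is below a threshold. The exponential moving average, the threshold rule, and all hyperparameters are illustrative assumptions based on the abstract.

import numpy as np

def agd_step(w, g, g_prev, d_ema, lr=1e-3, beta=0.9, delta=1e-5, eps=1e-8):
    # Diagonal preconditioner from the gradient difference of two steps.
    d_ema = beta * d_ema + (1 - beta) * np.abs(g - g_prev)
    # Auto-switch: adaptive update where curvature info is significant,
    # plain SGD (divide by 1) otherwise.
    precond = np.where(d_ema > delta, d_ema, 1.0)
    return w - lr * g / (precond + eps), d_ema

w, g_prev, d = np.zeros(4), np.zeros(4), np.zeros(4)
for _ in range(3):
    g = 2 * w - 1.0          # gradient of a toy quadratic f(w) = w^2 - w
    w, d = agd_step(w, g, g_prev, d)
    g_prev = g
print(w)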

ATMAN: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
Björn Deiseroth Mayukh Deb Samuel Weinbach Manuel Brack Patrick Schramowski Kristian Kersting



Research question: Today's generative transformer models are highly complex, with large parameter counts and multimodal inputs, yet methods for explaining their predictions are resource-intensive and hard to use in production.
Motivation: To explain the predictions of generative transformer models at almost no extra cost.
Method: Propose AtMan, which manipulates the attention mechanisms of transformers to produce relevance maps of the input with respect to the output prediction. Instead of backpropagation, it applies a parallelizable token-based search method that relies on cosine-similarity neighborhoods in the embedding space.
Results: Extensive experiments on text and image-text benchmarks show that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient, making it suitable for large-model inference deployments.

Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of additional memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use explanations in production. We present AtMan that provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method relying on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
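
A toy sketch of attention-manipulation relevance: suppress one input token's pre-softmax attention scores, re-run the forward pass, and take the change in the output as that token's relevance, so no backpropagation is needed. The single-head attention "model" below is an illustrative assumption.

import numpy as np

def attention(q, K, V, suppress=None):
    # Single-head attention; setting one token's score to -inf effectively
    # removes it from the softmax (the perturbation).
    scores = q @ K.T / np.sqrt(K.shape[1])
    if suppress is not None:
        scores[suppress] = -1e9
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(8)
q = rng.standard_normal(16)
K, V = rng.standard_normal((5, 16)), rng.standard_normal((5, 16))
base = attention(q, K, V)
relevance = [float(np.linalg.norm(base - attention(q, K, V, suppress=i)))
             for i in range(5)]   # one forward pass per token, no gradients
print(relevance)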

SUBP: Soft Uniform Block Pruning for 1$\times$N Sparse CNNs Multithreading Acceleration
Jingyang Xiang Siqi Li Jun Chen Guang Dai Shipeng Bai Yukai Ma Yong Liu



Research question: How to compress and accelerate convolutional neural network (CNN) models effectively in resource-limited environments.
Motivation: By constraining N consecutive weights along the output channel to be group-wise non-zero, recent 1$\times$N sparse networks have attracted wide attention for three outstanding advantages: 1) large storage savings via a Block Sparse Row matrix; 2) excellent performance at high sparsity; 3) significant speedups on CPUs with Advanced Vector Extensions.
Method: This paper proposes a novel Soft Uniform Block Pruning (SUBP) method that trains a uniform 1$\times$N sparse structured network from scratch. Specifically, throughout training the method repeatedly allows pruned blocks to regrow into the network in a uniform manner, based on block angular redundancy and importance sampling.
Results: Comprehensive experiments on ImageNet show that SUBP consistently outperforms existing 1$\times$N and structured sparsity methods, whether based on pre-trained models or trained from scratch. Source code and models are available at \url{https://github.com/JingyangXiang/SUBP}.

The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1$\times$N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a \emph{Block Sparse Row} matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1$\times$N sparse weights based on dense pre-trained weights, leading to problems such as expensive training cost and memory access, sub-optimal model quality, and unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel \emph{\textbf{S}oft \textbf{U}niform \textbf{B}lock \textbf{P}runing} (SUBP) approach to train a uniform 1$\times$N sparse structured network from scratch. Specifically, our approach tends to repeatedly allow pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. This not only makes the model less dependent on pre-training and reduces both model redundancy and the risk of permanently pruning important blocks, but also achieves a balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1$\times$N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at \url{https://github.com/JingyangXiang/SUBP}.

Dynamic Personalized Federated Learning with Adaptive Differential Privacy
Xiyuan Yang Wenke Huang Mang Ye



Research question: Current personalized federated learning methods suffer from inflexible personalization and convergence difficulties.
Motivation: Personalized federated learning must cope with the non-IID distribution of data and the risk of privacy leakage.
Method: Propose a novel federated learning method with Dynamic Fisher Personalization and Adaptive Constraint (FedDPA). It uses layer-wise Fisher information to measure the information content of local parameters while shielding those parameters from noise perturbation, and introduces an adaptive approach that applies differential constraint strategies to the previously identified personalized and shared parameters for better convergence.
Results: Experimental results on the CIFAR-10, FEMNIST, and SVHN datasets demonstrate the method's effectiveness in achieving better performance and robustness to clipping operations.

Personalized federated learning with differential privacy has been considered a feasible solution to address the non-IID distribution of data and privacy leakage risks. However, current personalized federated learning methods suffer from inflexible personalization and convergence difficulties due to two main factors: 1) Firstly, we observe that the prevailing personalization methods mainly achieve this by personalizing a fixed portion of the model, which lacks flexibility. 2) Moreover, we further demonstrate that the default gradient calculation is sensitive to the widely-used clipping operations in differential privacy, resulting in difficulties in convergence. Considering that Fisher information values can serve as an effective measure for estimating the information content of parameters by reflecting the model sensitivity to parameters, we aim to leverage this property to address the aforementioned challenges. In this paper, we propose a novel federated learning method with Dynamic Fisher Personalization and Adaptive Constraint (FedDPA) to handle these challenges. Firstly, by using layer-wise Fisher information to measure the information content of local parameters, we retain local parameters with high Fisher values during the personalization process, which are considered informative, while simultaneously protecting these parameters from noise perturbation. Secondly, we introduce an adaptive approach by applying differential constraint strategies to the personalized parameters and shared parameters identified in the previous step for better convergence. Our method boosts performance through flexible personalization while mitigating the slow convergence caused by clipping operations. Experimental results on the CIFAR-10, FEMNIST and SVHN datasets demonstrate the effectiveness of our approach in achieving better performance and robustness against clipping, under personalized federated learning with differential privacy.

Balanced Training for Sparse GANs
Yite Wang Jing Wu Naira Hovakimyan Ruoyu Sun



Research question: How to reduce the training and inference costs of deep neural networks, particularly generative adversarial networks (GANs).
Motivation: Despite strong performance on many tasks, the high computational complexity of deep neural networks limits their applications.
Method: Propose a new dynamic sparse training (DST) method that controls the balance between the sparse generator and discriminator to trade off performance against computational cost.
Results: Experiments on multiple datasets show that the method effectively reduces training and inference costs while maintaining good performance.

Over the past few years, there has been growing interest in developing larger and deeper neural networks, including deep generative models like generative adversarial networks (GANs). However, GANs typically come with high computational complexity, leading researchers to explore methods for reducing the training and inference costs. One such approach gaining popularity in supervised learning is dynamic sparse training (DST), which maintains good performance while enjoying excellent training efficiency. Despite its potential benefits, applying DST to GANs presents challenges due to the adversarial nature of the training process. In this paper, we propose a novel metric called the balance ratio (BR) to study the balance between the sparse generator and discriminator. We also introduce a new method called balanced dynamic sparse training (ADAPT), which seeks to control the BR during GAN training to achieve a good trade-off between performance and computational cost. Our proposed method shows promising results on multiple datasets, demonstrating its effectiveness.

Fast Partitioned Learned Bloom Filter
Atsuki Sato Yusuke Matsui



Research question: How to reduce the construction time complexity of the partitioned learned Bloom filter (PLBF) while maintaining its memory efficiency.
Motivation: The existing PLBF is memory-efficient, but its construction takes $\mathcal{O}(N^3k)$ time, limiting its practical use.
Method: Two methods to reduce PLBF's construction time: fast PLBF, with $\mathcal{O}(N^2k)$ time complexity, and fast PLBF++, with an even lower $\mathcal{O}(Nk\log N + Nk^2)$.
Results: Experiments show that fast PLBF and fast PLBF++ construct the data structure up to 233 and 761 times faster than PLBF, respectively; fast PLBF matches PLBF's memory efficiency, and fast PLBF++ is almost as memory-efficient as PLBF.

A Bloom filter is a memory-efficient data structure for approximate membership queries used in numerous fields of computer science. Recently, learned Bloom filters that achieve better memory efficiency using machine learning models have attracted attention. One such filter, the partitioned learned Bloom filter (PLBF), achieves excellent memory efficiency. However, PLBF requires $\mathcal{O}(N^3k)$ time complexity to construct the data structure, where $N$ and $k$ are the hyperparameters of PLBF. One can improve memory efficiency by increasing $N$, but the construction time becomes extremely long. Thus, we propose two methods that can reduce the construction time while maintaining the memory efficiency of PLBF. First, we propose fast PLBF, which can construct the same data structure as PLBF with a smaller time complexity $\mathcal{O}(N^2k)$. Second, we propose fast PLBF++, which can construct the data structure with an even smaller time complexity $\mathcal{O}(Nk\log N + Nk^2)$. Fast PLBF++ does not necessarily construct the same data structure as PLBF. Still, it is almost as memory efficient as PLBF, and it is proved that fast PLBF++ has the same data structure as PLBF when the distribution satisfies a certain constraint. Our experimental results on real-world datasets show that (i) fast PLBF and fast PLBF++ can construct the data structure up to 233 and 761 times faster than PLBF, (ii) fast PLBF can achieve the same memory efficiency as PLBF, and (iii) fast PLBF++ can achieve almost the same memory efficiency as PLBF. The code is available at https://github.com/atsukisato/FastPLBF.

H3T: Efficient Integration of Memory Optimization and Parallelism for Large-scale Transformer Training
Yuzhong Wang Xu Han Weilin Zhao Guoyang Zeng Zhiyuan Liu Maosong Sun



Research question: How to improve the training efficiency of large Transformer-based AI models.
Motivation: Although Transformer-based models achieve state-of-the-art performance on many AI tasks, their huge parameter sizes pose serious storage and computation challenges during training.
Method: Propose H3T, a framework that automatically finds an efficient integration of memory optimization and parallelism, designing search algorithms that combine appropriate memory optimization strategies and parallelism schemes to balance memory overhead against training efficiency.
Results: Experiments show that H3T trains 1.2x to 4.3x faster than Megatron-DeepSpeed, the most popular deep learning toolkit, while reducing memory overhead by 34.6% to 80.5%. Moreover, H3T can train GPT-3-175B with only 64 NVIDIA A100 GPUs, which is very difficult for existing deep learning toolkits.

In recent years, big models based on Transformers have achieved state-of-the-art performance on many artificial intelligence (AI) tasks. Despite the success of these Transformer-based models, their huge parameter size poses a serious challenge to their training, both from the storage and computation perspectives. To this end, memory optimization (e.g., rematerialization and offloading) and parallelism (e.g., data parallelism and model parallelism) are widely explored to make training Transformers more efficient. In this paper, we propose a framework to automatically find an efficient integration of memory optimization and parallelism for High-Throughput Transformer Training (named H3T), which is rarely considered by existing efforts for training big Transformer-based models. Specifically, we design search algorithms to combine appropriate memory optimization strategies and parallelism schemes to achieve a balance between memory overhead and training efficiency. We implement H3T based on an open-source toolkit BMTrain and then use H3T to train the Transformers of different sizes to evaluate the efficiency of H3T. The experimental results show that H3T outperforms the most popular deep learning (DL) toolkit Megatron-DeepSpeed by $1.2\times \sim 4.3\times$ training speed while reducing $34.6\% \sim 80.5\%$ of memory overhead. Moreover, H3T can use only 64 NVIDIA A100 GPUs to train GPT-3-175B, which is very difficult for existing DL toolkits. The source code is available at https://github.com/OpenBMB/BMTrain/tree/h3t.

Layer-Neighbor Sampling --- Defusing Neighborhood Explosion in GNNs
Muhammed Fatih Balin Umit Catalyurek



Research question: The challenge of training graph neural networks at large scale.
Motivation: Existing approaches to large-scale GNN training either suffer from the neighborhood explosion phenomenon or deliver suboptimal performance.
Method: Propose a new sampling algorithm, Layer-neighbor Sampling (LABOR), designed as a direct replacement for Neighbor Sampling (NS) that samples up to 7 times fewer vertices without sacrificing quality.
Results: Experiments show that, under the same vertex sampling budget, LABOR converges faster than existing layer sampling approaches and can use batch sizes up to 112 times larger than NS.

Graph Neural Networks (GNNs) have received significant attention recently, but training them at a large scale remains a challenge. Mini-batch training coupled with sampling is used to alleviate this challenge. However, existing approaches either suffer from the neighborhood explosion phenomenon or have suboptimal performance. To address these issues, we propose a new sampling algorithm called LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for Neighbor Sampling (NS) with the same fanout hyperparameter while sampling up to 7 times fewer vertices, without sacrificing quality. By design, the variance of the estimator of each vertex matches NS from the point of view of a single vertex. Moreover, under the same vertex sampling budget constraints, LABOR converges faster than existing layer sampling approaches and can use up to 112 times larger batch sizes compared to NS.

Block-Coordinate Methods and Restarting for Solving Extensive-Form Games
Darshan Chakrabarti Jelena Diakonikolas Christian Kroer



Research question: How to obtain effective optimization methods for solving large-scale sequential games.
Motivation: Coordinate descent methods perform excellently in machine learning and optimization, but no analogous methods were known for large-scale sequential game solving.
Method: Propose a cyclic coordinate-descent-like method for the polytope of sequence-form strategies, which forms the strategy spaces of the players in extensive-form games (EFGs).
Results: The method enjoys an O(1/T) convergence rate and avoids the worst-case polynomial scaling with the number of blocks; empirically it usually outperforms other state-of-the-art first-order methods and can occasionally even beat CFR+, a state-of-the-art algorithm for numerical equilibrium computation in zero-sum EFGs. A restarting heuristic is also introduced, which can further speed up existing methods.

Coordinate descent methods are popular in machine learning and optimization for their simple sparse updates and excellent practical performance. In the context of large-scale sequential game solving, these same properties would be attractive, but until now no such methods were known, because the strategy spaces do not satisfy the typical separable block structure exploited by such methods. We present the first cyclic coordinate-descent-like method for the polytope of sequence-form strategies, which form the strategy spaces for the players in an extensive-form game (EFG). Our method exploits the recursive structure of the proximal update induced by what are known as dilated regularizers, in order to allow for a pseudo block-wise update. We show that our method enjoys a O(1/T) convergence rate to a two-player zero-sum Nash equilibrium, while avoiding the worst-case polynomial scaling with the number of blocks common to cyclic methods. We empirically show that our algorithm usually performs better than other state-of-the-art first-order methods (i.e., mirror prox), and occasionally can even beat CFR$^+$, a state-of-the-art algorithm for numerical equilibrium computation in zero-sum EFGs. We then introduce a restarting heuristic for EFG solving. We show empirically that restarting can lead to speedups, sometimes huge, both for our cyclic method, as well as for existing methods such as mirror prox and predictive CFR$^+$.

A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs
Zhaocheng Zhu Xinyu Yuan Mikhail Galkin Sophie Xhonneux Ming Zhang Maxime Gazeau Jian Tang



Research question: How to reason over large-scale knowledge graphs efficiently.
Motivation: Existing embedding methods have efficiency problems at large scale; path-based methods possess inductive capacity, but their scalability is limited by the exponential number of paths.
Method: Propose A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest-path problems, it learns a priority function to select important nodes and edges, reducing the time and memory footprint of both training and inference.
Results: Experiments show that A*Net matches existing state-of-the-art path-based methods on transductive and inductive knowledge graph reasoning benchmarks while visiting only 10% of the nodes and edges at each iteration. On the million-scale dataset ogbl-wikikg2, A*Net not only sets a new state-of-the-art result but also converges faster than embedding methods; it is the first path-based method to reason over knowledge graphs at this scale.

Reasoning on large-scale knowledge graphs has been long dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, their scalability is limited by the exponential number of paths. Here we present A\*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A\* algorithm for shortest path problems, our A\*Net learns a priority function to select important nodes and edges at each iteration, to reduce time and memory footprint for both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A\*Net achieves competitive performance with existing state-of-the-art path-based methods, while merely visiting 10% nodes and 10% edges at each iteration. On a million-scale dataset ogbl-wikikg2, A\*Net not only achieves a new state-of-the-art result, but also converges faster than embedding methods. A\*Net is the first path-based method for knowledge graph reasoning at such scale.

DropCompute: simple and more robust distributed synchronous training via compute variance reduction
Niv Giladi Shahar Gottlieb Moran Shkolnik Asaf Karnieli Ron Banner Elad Hoffer Kfir Yehuda Levy Daniel Soudry



Research question: How to reduce the worker straggling caused by compute-time variability in distributed training, so as to improve the robustness of synchronous training.
Motivation: The dominant methods for large-scale deep neural network training are all limited by having to wait for every worker.
Method: By analyzing the relation between compute-time properties and the scalability limitations caused by straggling workers, we propose a simple yet effective decentralized method that reduces the variation among workers and thus improves the robustness of synchronous training.
Results: The method is validated on large-scale training tasks using 200 Gaudi accelerators.

Background: Distributed training is essential for large scale training of deep neural networks (DNNs). The dominant methods for large scale DNN training are synchronous (e.g. All-Reduce), but these require waiting for all workers in each step. Thus, these methods are limited by the delays caused by straggling workers. Results: We study a typical scenario in which workers are straggling due to variability in compute time. We find an analytical relation between compute time properties and scalability limitations, caused by such straggling workers. With these findings, we propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi Accelerators.

Symbolic Discovery of Optimization Algorithms
Xiangning Chen Chen Liang Da Huang Esteban Real Kaiyuan Wang Hieu Pham Xuanyi Dong Thang Luong Cho-Jui Hsieh Yifeng Lu Quoc V Le



Research question: How to discover algorithms for optimizing deep neural network training through program search.
Motivation: Existing optimization algorithms can be inefficient and memory-hungry on large-scale tasks.
Method: Formulate algorithm discovery as program search and apply it to discovering optimization algorithms for deep network training, introducing program selection and simplification strategies to bridge the large generalization gap between proxy and target tasks.
Results: The method discovers Lion (Evolved Sign Momentum), a simple and effective optimization algorithm that outperforms Adam, performing comparably or better on image classification, vision-language contrastive learning, diffusion models, autoregressive and masked language modeling, and fine-tuning.

We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$ ($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2\% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3\% $\textit{zero-shot}$ and 91.1\% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best results by 2\% and 0.1\%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant.
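
The discovered update itself is short enough to state directly. As described in the paper, the update direction is the sign of an interpolation between the gradient and the momentum, so every coordinate moves by the same magnitude, and only the momentum vector is kept as state (hence the memory savings over Adam); the toy loop below is only a usage illustration.

import numpy as np

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    # Sign of the gradient/momentum interpolation: same magnitude per coordinate.
    update = np.sign(beta1 * m + (1 - beta1) * g)
    w = w - lr * (update + wd * w)        # decoupled weight decay
    m = beta2 * m + (1 - beta2) * g       # momentum is the only optimizer state
    return w, m

w, m = np.ones(4), np.zeros(4)
for _ in range(3):
    g = 2 * w                             # gradient of a toy quadratic
    w, m = lion_step(w, g, m)
print(w)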

MixFormerV2: Efficient Fully Transformer Tracking
Yutao Cui Tianhui Song Gangshan Wu Limin Wang



Research question: Existing transformer-based trackers achieve strong accuracy on standard benchmarks, but their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms.
Motivation: To overcome this, the paper proposes MixFormerV2, a fully transformer-based tracking framework without any dense convolutional operations or complex score prediction modules.
Method: Introduce four special prediction tokens and concatenate them with the tokens of the target template and search area, then apply a unified transformer backbone to the mixed token sequence. Through mixed attention, the prediction tokens capture the complex correlation between target template and search area; from them, simple MLP heads predict the tracking box and estimate its confidence score.
Results: To further improve MixFormerV2's efficiency, a new distillation-based model reduction paradigm is proposed, comprising dense-to-sparse distillation (transferring knowledge from the dense-head-based MixViT to the fully transformer tracker) and deep-to-shallow distillation (pruning some backbone layers). Two instantiations are given: MixFormerV2-B achieves 70.6% AUC on LaSOT and 56.7% AUC on TNL2k at a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT at real-time CPU speed.

Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined as \emph{MixFormerV2}, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas. Then, we apply a unified transformer backbone to this mixed token sequence. These prediction tokens are able to capture the complex correlation between target template and search area via mixed attentions. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, including dense-to-sparse distillation and deep-to-shallow distillation. The former aims to transfer knowledge from the dense-head based MixViT to our fully transformer tracker, while the latter is used to prune some layers of the backbone. We instantiate two types of MixFormerV2, where the MixFormerV2-B achieves an AUC of 70.6\% on LaSOT and an AUC of 56.7\% on TNL2k with a high GPU speed of 165 FPS, and the MixFormerV2-S surpasses FEAR-L by 2.7\% AUC on LaSOT with a real-time CPU speed.

Federated Learning with Manifold Regularization and Normalized Update Reaggregation
Xuming An Li Shen Han Hu Yong Luo



Research question: In federated learning, the heterogeneity of clients' local data causes model inconsistency, which in turn slows the convergence of the global update.
Motivation: Existing methods that eliminate the difference between local and global model parameters (or gradients) may fail to reflect the model inconsistency, owing to the complex structure of machine learning models and the limitations of Euclidean space for meaningful geometric representations.
Method: Propose FedMRUR, which adopts a manifold model fusion scheme and a new global optimizer to alleviate these negative impacts. Specifically, FedMRUR uses a hyperbolic graph manifold regularizer that forces the data representations of the local and global models to stay close to each other in a low-dimensional subspace.
Results: By exploiting the manifold structure of the representations, FedMRUR significantly reduces model inconsistency. It also aggregates the client update norms into the global update norm, appropriately enlarging each client's contribution to the global update and mitigating the norm reduction caused by the near-orthogonality of client updates. Experiments show that FedMRUR achieves new state-of-the-art accuracy with less communication.

Federated Learning (FL) is an emerging collaborative machine learning framework where multiple clients train the global model without sharing their own datasets. In FL, the model inconsistency caused by the local data heterogeneity across clients results in the near-orthogonality of client updates, which leads to the global update norm reduction and slows down the convergence. Most previous works focus on eliminating the difference of parameters (or gradients) between the local and global models, which may fail to reflect the model inconsistency due to the complex structure of the machine learning model and the Euclidean space's limitation in meaningful geometric representations. In this paper, we propose FedMRUR by adopting the manifold model fusion scheme and a new global optimizer to alleviate the negative impacts. Concretely, FedMRUR adopts a hyperbolic graph manifold regularizer enforcing the representations of the data in the local and global models to be close to each other in a low-dimensional subspace. Because the machine learning model has a graph structure, the distance in hyperbolic space can reflect the model bias better than the Euclidean distance. In this way, FedMRUR exploits the manifold structures of the representations to significantly reduce the model inconsistency. FedMRUR also aggregates the client update norms as the global update norm, which can appropriately enlarge each client's contribution to the global update, thereby mitigating the norm reduction introduced by the near-orthogonality of client updates. Furthermore, we theoretically prove that our algorithm can achieve a linear speedup property $\mathcal{O}(\frac{1}{\sqrt{SKT}})$ for the non-convex setting under partial client participation, where $S$ is the number of participating clients, $K$ is the local interval and $T$ is the total number of communication rounds. Experiments demonstrate that FedMRUR can achieve a new state-of-the-art (SOTA) accuracy with less communication.
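
A minimal sketch of the normalized update reaggregation: keep the direction of the summed client updates but rescale it to the aggregated client-update norm, so that near-orthogonal updates no longer shrink the global step. Using the mean client-update norm as the target is an assumption consistent with the abstract's description.

import numpy as np

def reaggregate(client_updates):
    # Direction of the naive sum, rescaled to the mean client-update norm.
    s = np.sum(client_updates, axis=0)
    target_norm = np.mean([np.linalg.norm(u) for u in client_updates])
    return s / (np.linalg.norm(s) + 1e-12) * target_norm

# In high dimension, random client updates are nearly orthogonal, so the
# naive average has a much smaller norm than any single update.
U = np.random.default_rng(10).standard_normal((8, 100))
print(round(float(np.linalg.norm(np.mean(U, axis=0))), 3),   # shrunken naive average
      round(float(np.linalg.norm(reaggregate(U))), 3))       # restored norm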

Spectral Co-Distillation for Personalized Federated Learning
Zihan Chen Howard Hao Yang Tony Quek Kai Fong Ernest Chong



Research question: Personalized federated learning (PFL) has been widely studied to address the challenge of data heterogeneity, especially when a single generic model cannot simultaneously satisfy the diverse performance requirements of local clients.
Motivation: Existing PFL methods are built on the idea that the relation between the generic global model and the personalized local models is captured by the similarity of model weights, realized either by partitioning the model architecture into generic versus personalized components or by modeling client relationships via model weights.
Method: To better capture similar (yet distinct) generic and personalized model representations, we propose spectral distillation, a novel distillation method based on model spectral information. Building on spectral distillation, we also introduce a co-distillation framework that establishes a two-way bridge between generic and personalized model training. Furthermore, to exploit the local idle time in conventional PFL, we propose a wait-free local training protocol.
Results: Extensive experiments on multiple datasets under diverse data-heterogeneity settings demonstrate the superiority and efficacy of the proposed spectral co-distillation method and the wait-free training protocol.

Personalized federated learning (PFL) has been widely investigated to address the challenge of data heterogeneity, especially when a single generic model is inadequate in satisfying the diverse performance requirements of local clients simultaneously. Existing PFL methods are inherently based on the idea that the relations between the generic global and personalized local models are captured by the similarity of model weights. Such a similarity is primarily based on either partitioning the model architecture into generic versus personalized components or modeling client relationships via model weights. To better capture similar (yet distinct) generic versus personalized model representations, we propose $\textit{spectral distillation}$, a novel distillation method based on model spectrum information. Building upon spectral distillation, we also introduce a co-distillation framework that establishes a two-way bridge between generic and personalized model training. Moreover, to utilize the local idle time in conventional PFL, we propose a wait-free local training protocol. Through extensive experiments on multiple datasets over diverse heterogeneous data settings, we demonstrate the outperformance and efficacy of our proposed spectral co-distillation method, as well as our wait-free training protocol.

PTQD: Accurate Post-Training Quantization for Diffusion Models
Yefei He Luping Liu Jing Liu Weijia Wu Hong Zhou Bohan Zhuang



Research question: Diffusion models excel at tasks such as image generation, but the iterative denoising process at inference is computationally expensive, making them unsuitable for low-latency, scalable real-world applications.
Motivation: Post-training quantization of diffusion models can significantly reduce model size and accelerate sampling without retraining; however, directly applying existing post-training quantization methods to low-bit diffusion models severely degrades the quality of generated samples.
Method: We propose a unified formulation of quantization noise and diffusion-perturbed noise in the quantized denoising process. Specifically, we disentangle the quantization noise into a part correlated with its full-precision counterpart and a residual uncorrelated part. The correlated part can be easily corrected by estimating the correlation coefficient; for the uncorrelated part, we subtract the bias from the quantized result to correct the mean deviation and calibrate the denoising variance schedule to absorb the excess variance introduced by quantization. We also introduce a mixed-precision scheme that selects the optimal bitwidth for each denoising step, preferring lower bitwidths to speed up early denoising steps while keeping higher bitwidths to maintain a high signal-to-noise ratio in later steps.
Results: Extensive experiments show that our method outperforms previous post-training quantized diffusion models in generating high-quality samples, with only a 0.06 increase in FID score over the full-precision LDM-4 on ImageNet 256x256 while saving 19.9x bit operations. Code is available at [https://github.com/ziplab/PTQD](https://github.com/ziplab/PTQD).

Diffusion models have recently dominated image synthesis and other related generative tasks. However, the iterative denoising process is expensive in computations at inference time, making diffusion models less practical for low-latency and scalable real-world applications. Post-training quantization of diffusion models can significantly reduce the model size and accelerate the sampling process without requiring any re-training. Nonetheless, applying existing post-training quantization methods directly to low-bit diffusion models can significantly impair the quality of generated samples. Specifically, for each denoising step, quantization noise leads to deviations in the estimated mean and mismatches with the predetermined variance schedule. Moreover, as the sampling process proceeds, the quantization noise may accumulate, resulting in a low signal-to-noise ratio (SNR) during the later denoising steps. To address these challenges, we propose a unified formulation for the quantization noise and diffusion perturbed noise in the quantized denoising process. Specifically, we first disentangle the quantization noise into its correlated and residual uncorrelated parts regarding its full-precision counterpart. The correlated part can be easily corrected by estimating the correlation coefficient. For the uncorrelated part, we subtract the bias from the quantized results to correct the mean deviation and calibrate the denoising variance schedule to absorb the excess variance resulting from quantization. Moreover, we introduce a mixed-precision scheme for selecting the optimal bitwidth for each denoising step, which prioritizes lower bitwidths to expedite early denoising steps, while ensuring that higher bitwidths maintain a high signal-to-noise ratio (SNR) in the later steps. Extensive experiments demonstrate that our method outperforms previous post-training quantized diffusion models in generating high-quality samples, with only a $0.06$ increase in FID score compared to full-precision LDM-4 on ImageNet $256\times256$, while saving $19.9\times$ bit operations. Code is available at [https://github.com/ziplab/PTQD](https://github.com/ziplab/PTQD).
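
The correlated/uncorrelated noise correction described above can be illustrated numerically. The sketch below uses synthetic stand-ins for the full-precision and quantized outputs and assumes access to a small calibration set of paired outputs; it is an illustration of the idea, not the PTQD implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: x_fp is a full-precision activation, x_q its quantized
# version with a correlated gain, a bias, and uncorrelated noise.
x_fp = rng.normal(size=100_000)
x_q = 1.05 * x_fp + 0.02 + rng.normal(scale=0.1, size=x_fp.shape)

# Disentangle: x_q = (1 + k) * x_fp + residual.
k = np.cov(x_q, x_fp)[0, 1] / np.var(x_fp) - 1.0   # gain of the correlated part
corrected = x_q / (1.0 + k)                        # undo the correlated part

residual = corrected - x_fp
bias = residual.mean()                             # mean deviation to subtract
extra_var = residual.var()                         # to fold into the variance schedule
corrected -= bias
print(f"k={k:.3f}, bias={bias:.4f}, extra variance={extra_var:.4f}")
```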

Construction of Hierarchical Neural Architecture Search Spaces based on Context-free Grammars
Simon Schrodi Danny Stoll Binxin Ru Rhea Sanjay Sukthanker Thomas Brox Frank Hutter



Research question: How to use neural architecture search (NAS) to discover neural architectures from simple building blocks.
Motivation: Although hierarchical search spaces have shown promise in NAS, they lack a unifying search-space design framework and typically search over only limited aspects of architectures.
Method: This paper proposes a unifying search-space design framework based on context-free grammars that can naturally and compactly generate expressive hierarchical search spaces that are 100s of orders of magnitude larger than common spaces in the literature. By enhancing and exploiting their properties, we effectively enable search over complete architectures and can foster regularity. We also propose an efficient hierarchical kernel design for a Bayesian optimization search strategy to search such huge spaces efficiently.
Results: We demonstrate the versatility of the search-space design framework and show that our search strategy can outperform existing NAS approaches.

The discovery of neural architectures from simple building blocks is a long-standing goal of Neural Architecture Search (NAS). Hierarchical search spaces are a promising step towards this goal but lack a unifying search space design framework and typically only search over some limited aspect of architectures. In this work, we introduce a unifying search space design framework based on context-free grammars that can naturally and compactly generate expressive hierarchical search spaces that are 100s of orders of magnitude larger than common spaces from the literature. By enhancing and using their properties, we effectively enable search over the complete architecture and can foster regularity. Further, we propose an efficient hierarchical kernel design for a Bayesian Optimization search strategy to efficiently search over such huge spaces. We demonstrate the versatility of our search space design framework and show that our search strategy can be superior to existing NAS approaches. Code is available at https://github.com/automl/hierarchical_nas_construction.

Structural Pruning for Diffusion Models
Gongfan Fang Xinyin Ma Xinchao Wang



Research question: How to effectively compress diffusion models to reduce the computational overhead of training and inference.
Motivation: Diffusion probabilistic models (DPMs) have achieved remarkable progress in generative modeling, but their heavy computational consumption during training and inference remains a challenge.
Method: We propose Diff-Pruning, an efficient compression method that learns lightweight diffusion models from pre-existing ones without extensive retraining. At its core is a Taylor expansion over pruned timesteps, which disregards non-contributory diffusion steps and ensembles informative gradients to identify important weights.
Results: Experiments show that the method achieves roughly a 50% FLOPs reduction at only 10% to 20% of the original training expenditure while preserving generative behavior consistent with the pre-trained models.

Generative modeling has recently undergone remarkable advancements, primarily propelled by the transformative implications of Diffusion Probabilistic Models (DPMs). The impressive capability of these models, however, often entails significant computational overhead during both training and inference. To tackle this challenge, we present Diff-Pruning, an efficient compression method tailored for learning lightweight diffusion models from pre-existing ones, without the need for extensive re-training. The essence of Diff-Pruning is encapsulated in a Taylor expansion over pruned timesteps, a process that disregards non-contributory diffusion steps and ensembles informative gradients to identify important weights. Our empirical assessment, undertaken across several datasets, highlights two primary benefits of our proposed method: 1) Efficiency: it enables approximately a 50\% reduction in FLOPs at a mere 10\% to 20\% of the original training expenditure; 2) Consistency: the pruned diffusion models inherently preserve generative behavior congruent with their pre-trained models.
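
A hedged sketch of the Taylor-style importance score the abstract describes: accumulate |w * dL/dw| only over timesteps kept as informative. The timestep-selection rule and the channel grouping needed for structural pruning are omitted, and all names here are ours:

```python
import torch

def taylor_importance(model, loss_fn, batches_by_timestep, kept_timesteps):
    """Accumulate a first-order Taylor importance score |w * dL/dw| over the
    diffusion timesteps deemed informative, ignoring non-contributory steps.
    `loss_fn(model, batch, t)` is assumed to return the denoising loss at t."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for t in kept_timesteps:
        model.zero_grad()
        loss = loss_fn(model, batches_by_timestep[t], t)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p * p.grad).abs()
    # Structural pruning would then drop the channels/filters whose
    # aggregated scores are smallest.
    return scores
```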

REx: Data-Free Residual Quantization Error Expansion
Edouard YVINEC Arnaud Dapogny Matthieu Cord Kevin Bailly



Research question: Deep neural networks are ubiquitous in computer vision and natural language processing but suffer from high inference cost.
Motivation: To address this while respecting privacy concerns, we focus on data-free methods; however, such techniques lack adaptability to target devices, since hardware typically supports only specific bit widths.
Method: We propose REx, a quantization method that leverages residual error expansion together with group sparsity, flexible enough to find good accuracy vs. speed trade-offs for every bit width and target device.
Results: Experiments show that REx achieves better trade-offs on convolutional networks and transformers for computer vision as well as on NLP models. In particular, on large language models, REx elegantly solves the outlier problem that hinders state-of-the-art methods. Moreover, REx comes with strong theoretical guarantees and can be combined with previous quantization work.

Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists in converting floating point operations into a lower bit-width format. With the growing concerns on privacy rights, we focus our efforts on data-free methods. However, such techniques suffer from their lack of adaptability to the target devices, as hardware typically only supports specific bit widths. Thus, to adapt to a variety of devices, a quantization method shall be flexible enough to find good accuracy v.s. speed trade-offs for every bit width and target device. To achieve this, we propose REx, a quantization method that leverages residual error expansion, along with group sparsity. We show experimentally that REx enables better trade-offs (in terms of accuracy given any target bit-width) on both convnets and transformers for computer vision, as well as NLP models. In particular, when applied to large language models, we show that REx elegantly solves the outlier problem that hinders state-of-the-art quantization methods. In addition, REx is backed by strong theoretical guarantees on the preservation of the predictive function of the original model. Lastly, we show that REx is agnostic to the quantization operator and can be used in combination with previous quantization work.

Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
Hyeonjeong Ha Minseon Kim Sung Ju Hwang



Research question: Existing neural architecture search (NAS) frameworks consider only performance on clean images when searching for optimal architectures, whereas robustness to various types of perturbations or corruptions is crucial in practice.
Motivation: Although several robust NAS frameworks integrate adversarial training to tackle this issue, they consider only robustness against adversarial attacks and require substantial computational resources to discover an optimal architecture for a single task, making them impractical in real-world scenarios.
Method: We propose a novel lightweight robust zero-cost proxy that considers the consistency of features, parameters, and gradients of both clean and perturbed images at the initialization state. This enables a fast and effective search for neural architectures that learn generalizable features robust to diverse perturbations.
Results: Experiments show that our proxy can rapidly and efficiently find neural architectures that are consistently robust to various perturbations across multiple benchmark datasets and search spaces, largely outperforming existing clean zero-shot NAS and robust NAS while reducing search cost.

Recent neural architecture search (NAS) frameworks have been successful in finding optimal architectures for given conditions (e.g., performance or latency). However, they search for optimal architectures in terms of their performance on clean images only, while robustness against various types of perturbations or corruptions is crucial in practice. Although there exist several robust NAS frameworks that tackle this issue by integrating adversarial training into one-shot NAS, they are limited in that they only consider robustness against adversarial attacks and require significant computational resources to discover optimal architectures for a single task, which makes them impractical in real-world scenarios. To address these challenges, we propose a novel lightweight robust zero-cost proxy that considers the consistency across features, parameters, and gradients of both clean and perturbed images at the initialization state. Our approach facilitates an efficient and rapid search for neural architectures capable of learning generalizable features that exhibit robustness across diverse perturbations. The experimental results demonstrate that our proxy can rapidly and efficiently search for neural architectures that are consistently robust against various perturbations on multiple benchmark datasets and diverse search spaces, largely outperforming existing clean zero-shot NAS and robust NAS with reduced search cost.

Towards Better Dynamic Graph Learning: New Architecture and Unified Library
Le Yu Leilei Sun Bowen Du Weifeng Lv



Research question: Propose DyGFormer, a new Transformer-based architecture for dynamic graph learning.
Motivation: Existing dynamic graph learning methods fall short in capturing correlations between nodes and long-term temporal dependencies.
Method: A neighbor co-occurrence encoding scheme and a patching technique allow the model to effectively learn and extract information from longer historical sequences.
Results: Experiments show that DyGFormer achieves state-of-the-art performance on most datasets, demonstrating its effectiveness in capturing node correlations and long-term temporal dependencies. The accompanying DyGLib library also promotes reproducible, scalable, and credible dynamic graph learning research.

We propose DyGFormer, a new Transformer-based architecture for dynamic graph learning. DyGFormer is conceptually simple and only needs to learn from nodes' historical first-hop interactions by: (1) a neighbor co-occurrence encoding scheme that explores the correlations of the source node and destination node based on their historical sequences; (2) a patching technique that divides each sequence into multiple patches and feeds them to Transformer, allowing the model to effectively and efficiently benefit from longer histories. We also introduce DyGLib, a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols to promote reproducible, scalable, and credible dynamic graph learning research. By performing exhaustive experiments on thirteen datasets for dynamic link prediction and dynamic node classification tasks, we find that DyGFormer achieves state-of-the-art performance on most of the datasets, demonstrating its effectiveness in capturing nodes' correlations and long-term temporal dependencies. Moreover, some results of baselines are inconsistent with previous reports, which may be caused by their diverse but less rigorous implementations, showing the importance of DyGLib. All the used resources are publicly available at https://github.com/yule-BUAA/DyGLib.
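
The neighbor co-occurrence encoding admits a tiny illustration. The sketch below counts, for every neighbor in a node's historical sequence, its occurrences in both the source's and the destination's histories; the learned encoding layers applied on top of these counts are omitted:

```python
from collections import Counter

def neighbor_cooccurrence_features(src_history, dst_history):
    """For each neighbor in a history, encode (count in source history,
    count in destination history). Shared neighbors score high in both
    slots, signalling correlation between source and destination."""
    src_counts, dst_counts = Counter(src_history), Counter(dst_history)
    src_feats = [(src_counts[n], dst_counts[n]) for n in src_history]
    dst_feats = [(src_counts[n], dst_counts[n]) for n in dst_history]
    return src_feats, dst_feats

# Node 1 appears in both histories, so it gets nonzero counts in both slots.
print(neighbor_cooccurrence_features([1, 2, 1, 3], [2, 1, 4]))
```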

Masked Image Residual Learning for Scaling Deeper Vision Transformers
Guoxi Huang Hongtao Fu Adrian G. Bors



Research question: This paper addresses a challenge in pre-training deep Vision Transformers (ViTs): a degradation problem in deeper layers when pre-training with masked image modeling (MIM).
Motivation: The authors find that deeper ViTs are harder to train, with a pronounced degradation problem in deeper layers under MIM pre-training.
Method: To alleviate this, they propose a self-supervised learning framework called Masked Image Residual Learning (MIRL), which reformulates the pre-training objective of deeper ViT layers as learning to recover the residual of the masked image. This significantly alleviates the degradation problem and makes scaling ViT depth a promising direction for performance gains.
Results: Empirical studies show that deeper ViTs can be effectively optimized with MIRL and easily gain accuracy from increased depth. With the same computational complexity as ViT-Base and ViT-Large, the authors instantiate 4.5x and 2x deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing 3x less than ViT-Large, matches ViT-Large's performance, while ViT-B-48 reaches 86.2% top-1 accuracy on ImageNet. Deeper ViTs pre-trained with MIRL also generalize well to downstream tasks such as object detection and semantic segmentation, and MIRL exhibits high pre-training efficiency.

Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training. To ease the training of deeper ViTs, we introduce a self-supervised learning framework called $\textbf{M}$asked $\textbf{I}$mage $\textbf{R}$esidual $\textbf{L}$earning ($\textbf{MIRL}$), which significantly alleviates the degradation problem, making scaling ViT along depth a promising direction for performance upgrade. We reformulate the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence showing that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, we instantiate $4.5{\times}$ and $2{\times}$ deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing $3{\times}$ less than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2\% top-1 accuracy on ImageNet. On one hand, deeper ViTs pre-trained with MIRL exhibit excellent generalization capabilities on downstream tasks, such as object detection and semantic segmentation. On the other hand, MIRL demonstrates high pre-training efficiency. With less pre-training time, MIRL yields competitive performance compared to other approaches.

Knowledge Distillation for High Dimensional Search Index
Zepu Lu Jin Chen Defu Lian ZAIXI ZHANG Yong Ge Enhong Chen



Research question: How to design a new learning algorithm that improves the retrieval performance of compressed search indexes in high-dimensional spaces.
Motivation: Lightweight compressed models are widely used in approximate nearest neighbor search (ANNS) and maximum inner product search (MIPS) because of their retrieval efficiency on large-scale datasets. However, their results are less accurate due to the curse of dimensionality and the limitations of their optimization objectives (e.g., the lack of interaction between queries and documents).
Method: This paper proposes KDindex, a knowledge-distillation framework for high-dimensional search indexes that efficiently learns lightweight indexes by distilling knowledge from high-precision ANNS and MIPS models such as graph-based indexes. Specifically, the student model is guided to preserve the ranking order of the top-k relevant results produced by the teacher, which acts as an additional supervision signal between queries and documents for learning document similarities. Furthermore, to avoid the trivial solution in which all candidates are assigned to the same centroid, a reconstruction loss that minimizes the compression error and a posting-list balance strategy that allocates candidates evenly are incorporated into the learning objective.
Results: Experiments show that KDindex outperforms existing learnable quantization-based indexes and is 40x lighter than state-of-the-art non-exhaustive methods while achieving comparable recall quality.

Lightweight compressed models are prevalent in Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) owing to their superiority of retrieval efficiency in large-scale datasets. However, results given by compressed methods are less accurate due to the curse of dimensionality and the limitations of optimization objectives (e.g., lacking interactions between queries and documents). Thus, we are encouraged to design a new learning algorithm for the compressed search index on high dimensions to improve retrieval performance. In this paper, we propose a novel Knowledge Distillation for high-dimensional search index framework (KDindex), with the aim of efficiently learning lightweight indexes by distilling knowledge from high-precision ANNS and MIPS models such as graph-based indexes. Specifically, the student is guided to keep the same ranking order of the top-k relevant results yielded by the teacher model, which acts as the additional supervision signals between queries and documents to learn the similarities between documents. Furthermore, to avoid the trivial solutions that all candidates are partitioned to the same centroid, the reconstruction loss that minimizes the compression error, and the posting list balance strategy that equally allocates the candidates, are integrated into the learning objective. Experiment results demonstrate that KDindex outperforms existing learnable quantization-based indexes and is 40× lighter than the state-of-the-art non-exhaustive methods while achieving comparable recall quality.

One Less Reason for Filter Pruning: Gaining Free Adversarial Robustness with Structured Grouped Kernel Pruning
Shaochen Zhong Zaichuan You Jiamu Zhang Sebastian Zhao Zachary LeClaire Zirui Liu Daochen Zha Vipin Chaudhary Shuai Xu Xia Hu



Research question: How do modern structured pruning methods perform under simple adversarial attacks?
Motivation: Although existing structured pruning methods deliver immediate compression and acceleration benefits, they are fragile under simple adversarial attacks.
Method: After fairly and comprehensively investigating the adversarial performance of 10+ popular structured pruning methods, the authors leverage Grouped Kernel Pruning (GKP) to push densely structured pruning freedom to a finer granularity, and mix kernel smoothness, a classic robustness-related kernel-level metric, into a modified GKP procedure, yielding a one-shot, post-train, weight-dependent GKP method.
Results: The method advances state-of-the-art performance on both the benign and adversarial scales without extra cost.

Densely structured pruning methods utilizing simple pruning heuristics can deliver immediate compression and acceleration benefits with acceptable benign performances. However, empirical findings indicate such naively pruned networks are extremely fragile under simple adversarial attacks. Naturally, we would be interested in knowing if such a phenomenon also holds for carefully designed modern structured pruning methods. If so, then to what extent is the severity? And what kind of remedies are available? Unfortunately, both the questions and the solution remain largely unaddressed: no prior art is able to provide a thorough investigation on the adversarial performance of modern structured pruning methods (spoiler: it is not good), yet the few works that attempt to provide mitigation often do so at various extra costs with only to-be-desired performance. In this work, we answer both questions by fairly and comprehensively investigating the adversarial performance of 10+ popular structured pruning methods. Solution-wise, we take advantage of *Grouped Kernel Pruning (GKP)*'s recent success in pushing densely structured pruning freedom to a more fine-grained level. By mixing up kernel smoothness — a classic robustness-related kernel-level metric — into a modified GKP procedure, we present a one-shot-post-train-weight-dependent GKP method capable of advancing SOTA performance on both the benign and adversarial scale, while requiring no extra (in fact, often less) cost than a standard pruning procedure. Please refer to our [GitHub repository](https://github.com/henryzhongsc/adv_robust_gkp) for code implementation, tool sharing, and model checkpoints.

White-Box Transformers via Sparse Rate Reduction
Yaodong Yu Sam Buchanan Druv Pai Tianzhe Chu Ziyang Wu Shengbang Tong Benjamin David Haeffele Yi Ma



Research question: How to achieve effective representation learning by compressing and transforming the data distribution.
Motivation: Popular deep networks such as transformers can be naturally viewed as realizing iterative schemes that incrementally optimize such an objective.
Method: Alternating optimization over complementary parts of the objective yields the standard transformer block: the multi-head self-attention operator can be viewed as a gradient descent step that compresses the token set, while the subsequent multi-layer perceptron attempts to sparsify the token representations.
Results: Experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify the representations of large-scale real-world vision datasets such as ImageNet, and their performance comes very close to thoroughly engineered transformers such as ViT.

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE.

DP-HyPO: An Adaptive Private Framework for Hyperparameter Optimization
Hua Wang Sheng Gao Huanyu Zhang Weijie J Su Milan Shen



Research question: This paper addresses the risk that hyperparameter optimization, when training private machine learning models, may expose sensitive information about the underlying dataset.
Motivation: The only existing privacy-preserving approach randomly selects hyperparameters for a number of runs and reports the best result; unlike in the non-private setting, it cannot choose the next candidate based on information from previous outputs.
Method: This paper proposes DP-HyPO, a pioneering adaptive private hyperparameter optimization framework, together with a comprehensive differential privacy analysis and validation on real datasets, aiming to bridge the gap between private and non-private hyperparameter optimization.
Results: Experiments show that DP-HyPO performs well on a diverse set of real-world datasets.

Hyperparameter optimization, also known as hyperparameter tuning, is a widely recognized technique for improving model performance. Regrettably, when training private ML models, many practitioners often overlook the privacy risks associated with hyperparameter optimization, which could potentially expose sensitive information about the underlying dataset. Currently, the sole existing approach to allow privacy-preserving hyperparameter optimization is to uniformly and randomly select hyperparameters for a number of runs, subsequently reporting the best-performing hyperparameter. In contrast, in non-private settings, practitioners commonly utilize "adaptive" hyperparameter optimization methods such as Gaussian Process-based optimization, which select the next candidate based on information gathered from previous outputs. This substantial contrast between private and non-private hyperparameter optimization underscores a critical concern. In our paper, we introduce DP-HyPO, a pioneering framework for "adaptive" private hyperparameter optimization, aiming to bridge the gap between private and non-private hyperparameter optimization. To accomplish this, we provide a comprehensive differential privacy analysis of our framework. Furthermore, we empirically demonstrate the effectiveness of DP-HyPO on a diverse set of real-world datasets.

PRIOR: Personalized Prior for Reactivating the Information Overlooked in Federated Learning.
Mingjia Shi Yuhao Zhou Kai Wang Huaizheng Zhang Shudong Huang Qing Ye Jiancheng Lv



Research question: Existing federated learning (FL) trains machine learning models while preserving privacy, but data heterogeneity degrades the performance of localized models.
Motivation: Personalized federated learning (PFL) synthesizes personalized models by training a global model on local data, but such a global model may overlook the specific information of the clients that were sampled.
Method: This paper proposes a novel scheme that injects personalized prior knowledge into the global model at each client to mitigate the incomplete-information problem introduced in PFL. At its heart is a framework, PFL with Bregman divergence (pFedBreD), which decouples the personalized prior from the local objective function regularized by a Bregman divergence, allowing greater adaptability in personalized scenarios.
Results: Experiments show that the method reaches state-of-the-art performance on 5 datasets and outperforms other methods by up to 3.5% across 8 benchmarks. Extensive analyses verify the robustness and necessity of the proposed designs. The code will be made public.

Classical federated learning (FL) enables training machine learning models without sharing data for privacy preservation, but heterogeneous data characteristics degrade the performance of the localized model. Personalized FL (PFL) addresses this by synthesizing personalized models from a global model via training on local data. Such a global model, however, may overlook the specific information of the clients that have been sampled. In this paper, we propose a novel scheme to inject personalized prior knowledge into the global model in each client, which attempts to mitigate the introduced incomplete information problem in PFL. At the heart of our proposed approach is a framework, the $\textit{PFL with Bregman Divergence}$ (pFedBreD), decoupling the personalized prior from the local objective function regularized by Bregman divergence for greater adaptability in personalized scenarios. We also relax the mirror descent (RMD) to extract the prior explicitly to provide optional strategies. Additionally, our pFedBreD is backed up by a convergence analysis. Comprehensive experiments demonstrate that our method reaches $\textit{state-of-the-art}$ performance on 5 datasets and outperforms other methods by up to 3.5% across 8 benchmarks. Extensive analyses verify the robustness and necessity of the proposed designs. The code will be made public.
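
The Bregman-regularized local objective can be sketched generically. Below is a minimal Python rendering in which `phi`, the coefficient `mu`, and the function names are our assumptions; with phi(x) = 0.5 ||x||^2 the divergence reduces to the familiar proximal term, and other choices of phi change the geometry:

```python
import torch

def bregman_divergence(phi, w, prior):
    """D_phi(w, prior) = phi(w) - phi(prior) - <grad phi(prior), w - prior>."""
    p = prior.detach().clone().requires_grad_(True)
    (g,) = torch.autograd.grad(phi(p), p)   # gradient of phi at the prior
    p = p.detach()
    return phi(w) - phi(p) - (g * (w - p)).sum()

phi = lambda x: 0.5 * (x ** 2).sum()        # Euclidean special case

def local_objective(task_loss, w, prior, mu=0.1):
    # Local loss regularized toward the personalized prior (mu assumed).
    return task_loss(w) + mu * bregman_divergence(phi, w, prior)

w = torch.randn(4, requires_grad=True)
prior = torch.randn(4)
loss = local_objective(lambda v: (v ** 2).mean(), w, prior)
loss.backward()                              # gradients flow into w
```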

VanillaNet: the Power of Minimalism in Deep Learning
Hanting Chen Yunhe Wang Jianyuan Guo Dacheng Tao



Research question: How to simplify complex pre-trained models so that they better suit resource-constrained environments?
Motivation: The complexity and optimization challenges of existing pre-trained models call for a shift toward simpler designs.
Method: Propose VanillaNet, a concise yet powerful neural network architecture that avoids high depth, shortcuts, and intricate operations such as self-attention. Each layer is carefully crafted to be compact and straightforward, and nonlinear activation functions are pruned after training to restore the original architecture.
Results: Experiments show that VanillaNet performs on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This pioneering work has the potential to redefine the landscape of foundation models and open a new path for elegant and effective model design.

At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success in computer vision and natural language processing. However, the challenges of optimization and inherent complexity of transformer models call for a paradigm shift towards simplicity. In this study, we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and intricate operations like self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Each layer is carefully crafted to be compact and straightforward, with nonlinear activation functions pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experimentation demonstrates that VanillaNet delivers performance on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine the landscape and challenge the status quo of foundation models, setting a new path for elegant and effective model design. Pre-trained models and codes are available at https://github.com/huawei-noah/VanillaNet and https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet

Epidemic Learning: Boosting Decentralized Learning with Randomized Communication
Martijn De Vos Sadegh Farhadkhani Rachid Guerraoui Anne-marie Kermarrec Rafael Pires Rishi Sharma



Research question: This paper presents Epidemic Learning (EL), a simple yet powerful decentralized learning method that leverages changing communication topologies to achieve faster model convergence than conventional decentralized learning approaches.
Motivation: The authors provide an extensive theoretical analysis of EL, showing that its changing topology yields superior convergence properties compared with state-of-the-art (static and dynamic) topologies.
Method: In each EL round, every node sends its model update to $s$ other randomly selected nodes (in a system of $n$ nodes); nodes then update their own models with the received updates, and the process repeats until the model converges.
Results: Experiments show that in a 96-node network, EL converges up to 1.7x faster than baseline decentralized learning algorithms and attains 2.2% higher accuracy for the same communication volume.

We present Epidemic Learning (EL), a simple yet powerful decentralized learning (DL) algorithm that leverages changing communication topologies to achieve faster model convergence compared to conventional DL approaches. At each round of EL, each node sends its model updates to a random sample of $s$ other nodes (in a system of $n$ nodes). We provide an extensive theoretical analysis of EL, demonstrating that its changing topology culminates in superior convergence properties compared to the state-of-the-art (static and dynamic) topologies. Considering smooth non-convex loss functions, the number of transient iterations for EL, i.e., the rounds required to achieve asymptotic linear speedup, is in $O(n^3/s^2)$ which outperforms the best-known bound $O(n^3)$ by a factor of $s^2$, indicating the benefit of randomized communication for DL. We empirically evaluate EL in a 96-node network and compare its performance with state-of-the-art DL approaches. Our results illustrate that EL converges up to $1.7\times$ quicker than baseline DL algorithms and attains $2.2\%$ higher accuracy for the same communication volume.
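
One EL round is easy to simulate. The sketch below uses a plausible push-and-average rule (each node averages its own stepped model with whatever arrives from peers), which may differ in detail from the paper's exact update:

```python
import random
import numpy as np

def el_round(models, grads, s, lr=0.1):
    """One Epidemic Learning round (sketch): every node takes a local step,
    pushes its model to s randomly chosen peers, then averages its inbox."""
    n = len(models)
    stepped = [m - lr * g for m, g in zip(models, grads)]
    inbox = [[m] for m in stepped]               # each node keeps its own model
    for i in range(n):
        for j in random.sample([k for k in range(n) if k != i], s):
            inbox[j].append(stepped[i])          # push to s random peers
    return [np.mean(msgs, axis=0) for msgs in inbox]

models = [np.random.randn(3) for _ in range(8)]
grads = [np.random.randn(3) for _ in range(8)]
models = el_round(models, grads, s=2)            # topology changes every round
```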

Scattering Vision Transformer: Spectral Mixing Matters
Badri Narayana Patro Vijay Srinivas Agneeswaran



Research question: Address the challenges vision transformers face in capturing fine image details and reducing computational complexity.
Motivation: Existing solutions such as down-sampling operations cause information loss that cannot be recovered.
Method: Propose the Scattering Vision Transformer (SVT), which introduces a spectral scattering network to capture intricate image details and resolves the invertibility problem associated with down-sampling by separating low-frequency and high-frequency components.
Results: SVT achieves state-of-the-art performance on ImageNet with significant reductions in parameters and FLOPs, performs well on various vision tasks including instance segmentation, and outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car.

Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a unique spectral gating network utilizing Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs. SVT shows 2\% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2\% top-1 accuracy, while SVT-H-B reaches 85.2\% (state-of-the-art for base versions) and SVT-H-L reaches 85.7\% (state-of-the-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation. SVT also outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car. The project page is available at \url{https://badripatro.github.io/svt/}.

Fed-FA: Theoretically Modeling Client Data Divergence for Federated Language Backdoor Defense
Zhiyuan Zhang Deli Chen Hao Zhou Fandong Meng Jie Zhou Xu Sun



Research question: How to detect and exclude backdoor attacks launched by malicious clients in federated learning.
Motivation: Existing federated learning algorithms defend poorly against backdoor attacks on NLP tasks, because injecting backdoors into text with a discrete feature space has little impact on the statistics of model parameters, leaving backdoor patterns hidden at the parameter level.
Method: Propose Fed-FA, an f-divergence-based federated aggregation method that derives an f-divergence indicator through theoretical analysis to estimate client data divergence, and designs a dataset synthesization method guided by diffusion theory to handle the inaccessibility of client datasets.
Results: Experiments show that Fed-FA outperforms all parameter-distance-based methods across various natural language backdoor attack scenarios and effectively defends against backdoor attacks.

Federated learning algorithms enable neural network models to be trained across multiple decentralized edge devices without sharing private data. However, they are susceptible to backdoor attacks launched by malicious clients. Existing robust federated aggregation algorithms heuristically detect and exclude suspicious clients based on their parameter distances, but they are ineffective on Natural Language Processing (NLP) tasks. The main reason is that, although text backdoor patterns are obvious at the underlying dataset level, they are usually hidden at the parameter level, since injecting backdoors into texts with discrete feature space has less impact on the statistics of the model parameters. To settle this issue, we propose to identify backdoor clients by explicitly modeling the data divergence among clients in federated NLP systems. Through theoretical analysis, we derive the f-divergence indicator to estimate the client data divergence with aggregation updates and Hessians. Furthermore, we devise a dataset synthesization method with a Hessian reassignment mechanism guided by the diffusion theory to address the key challenge of inaccessible datasets in calculating clients' data Hessians. We then present the novel Federated F-Divergence-Based Aggregation~(\textbf{Fed-FA}) algorithm, which leverages the f-divergence indicator to detect and discard suspicious clients. Extensive empirical results show that Fed-FA outperforms all the parameter distance-based methods in defending against backdoor attacks among various natural language backdoor attack scenarios.

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Zangwei Zheng Xiaozhe Ren Fuzhao Xue Yang Luo Xin Jiang Yang You



Research question: How to improve the inference efficiency of large language models.
Motivation: Although large language models perform well on a wide range of tasks, their inference process is computationally expensive.
Method: Propose an efficient language model inference pipeline that exploits the model's ability to accurately perceive and predict response lengths, and introduce an efficient sequence scheduling technique that groups queries with similar predicted response lengths into micro-batches.
Results: Evaluation on real-world instruction datasets with a LLaMA-based model shows an 86% improvement in inference throughput without compromising effectiveness.

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using the LLaMA-based model, and our results demonstrate an impressive 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits (e.g., FlashAttention, Quantization) for LLM inference.
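
The scheduling idea reduces to sorting queries by predicted response length and batching neighbors, so that a micro-batch wastes few decode steps on padding. A minimal sketch with a stand-in length predictor (`predict_len` is any callable mapping a query to an estimated token count):

```python
def schedule_micro_batches(queries, predict_len, batch_size):
    """Group queries with similar predicted response lengths into
    micro-batches by sorting on the length estimate."""
    ranked = sorted(queries, key=predict_len)
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]

# Stand-in predictor; in the paper the LLM itself estimates the length.
queries = ["q1", "q2", "q3", "q4"]
fake_len = {"q1": 120, "q2": 15, "q3": 900, "q4": 130}
print(schedule_micro_batches(queries, fake_len.get, batch_size=2))
# [['q2', 'q1'], ['q4', 'q3']] -- short and long responses batched separately
```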

ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks
Jongseok Park Kyungmin Bin Gibum Park Sangtae Ha Kyunghan Lee



Research question: Existing deep neural network frameworks impose significant synchronization barriers during parallelization, confining parallel computation to within each operator.
Motivation: To address this, we propose ASPEN, a new parallel computation solution that fully exploits parallelism opportunities through dynamic execution and scheduling.
Method: ASPEN expresses deep networks as dataflow graphs of fine-grained tiles, removing inter-operator synchronization barriers and enabling fine-grained dynamic execution. By locating and scheduling these opportunities at runtime, it achieves high resource utilization and memory reuse.
Results: Experiments show that our CPU implementation of ASPEN delivers exceptional performance, outperforming the state-of-the-art inference systems TorchScript and TVM by up to 3.2x and 4.3x, respectively.

Modern Deep Neural Network (DNN) frameworks use tensor operators as the main building blocks of DNNs. However, we observe that operator-based construction of DNNs incurs significant drawbacks in parallelism in the form of synchronization barriers. Synchronization barriers of operators confine the scope of parallel computation to each operator and obscure the rich parallel computation opportunities that exist across operators. To this end, we present ASPEN, a novel parallel computation solution for DNNs that achieves fine-grained dynamic execution of DNNs, which (1) removes the operator barriers and expresses DNNs in dataflow graphs of fine-grained tiles to expose the parallel computation opportunities across operators, and (2) exploits these opportunities by dynamically locating and scheduling them in runtime. This novel approach of ASPEN enables opportunistic parallelism, a new class of parallelism for DNNs that is unavailable in the existing operator-based approaches. ASPEN also achieves high resource utilization and memory reuse by letting each resource asynchronously traverse depthwise in the DNN graph to its full computing potential. We provide challenges and solutions to our approach and show that our proof-of-concept implementation of ASPEN on CPU shows exceptional performance, outperforming state-of-the-art inference systems of TorchScript and TVM by up to 3.2$\times$ and 4.3$\times$, respectively.

Large-Scale Distributed Learning via Private On-Device LSH
Tahseen Rabbani Marco Bornstein Furong Huang



Research question: How to perform locality-sensitive hashing (LSH) analysis effectively on devices with limited computation and memory?
Motivation: Existing LSH algorithms require repeated random projection of full layer weights, which is impractical on computation- and memory-constrained devices.
Method: We develop a new family of hash functions and build the first private, personalized, memory-efficient on-device LSH framework, which allows each device to generate hash tables without the help of a central host, using device-specific hashing hyperparameters.
Results: Our framework generates hash tables from a compressed set of the full weights, and tables can be serially generated and discarded if the process is memory-intensive, so devices need not keep (i) the full-sized model or (ii) large numbers of hash tables in local memory for LSH analysis. Experiments show the framework is competitive in training large-scale recommender networks compared with other LSH frameworks that assume unrestricted device capacity.

Locality-sensitive hashing (LSH) based frameworks have been used efficiently to select weight vectors in a dense hidden layer with high cosine similarity to an input, enabling dynamic pruning. While this type of scheme has been shown to improve computational training efficiency, existing algorithms require repeated randomized projection of the full layer weight, which is impractical for computational- and memory-constrained devices. In a distributed setting, deferring LSH analysis to a centralized host is (i) slow if the device cluster is large and (ii) requires access to input data which is forbidden in a federated context. Using a new family of hash functions, we develop the first private, personalized, and memory-efficient on-device LSH framework. Our framework enables privacy and personalization by allowing each device to generate hash tables, without the help of a central host, using device-specific hashing hyper-parameters (e.g., number of hash tables or hash length). Hash tables are generated with a compressed set of the full weights, and can be serially generated and discarded if the process is memory-intensive. This allows devices to avoid maintaining (i) the fully-sized model and (ii) large amounts of hash tables in local memory for LSH analysis. We prove several statistical and sensitivity properties of our hash functions, and experimentally demonstrate that our framework is competitive in training large scale recommender networks compared to other LSH frameworks which assume unrestricted on-device capacity.
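
For intuition, here is a toy on-device LSH table built from classic random-projection (SimHash) hashing. The paper's new hash family and its weight compression are not reproduced; per-device personalization is represented only by the hyperparameters and seed:

```python
import numpy as np

class DeviceLSH:
    """Toy on-device LSH with SimHash-style random hyperplanes. Each device
    picks its own (num_tables, hash_len) hyperparameters and seed."""

    def __init__(self, dim, num_tables=4, hash_len=8, seed=0):
        rng = np.random.default_rng(seed)              # device-specific seed
        self.planes = rng.normal(size=(num_tables, hash_len, dim))
        self.tables = [dict() for _ in range(num_tables)]

    def _keys(self, v):
        # One sign-pattern key per table; nearby vectors collide often.
        return [tuple((p @ v > 0).astype(int)) for p in self.planes]

    def insert(self, idx, weight_vec):
        for table, key in zip(self.tables, self._keys(weight_vec)):
            table.setdefault(key, []).append(idx)

    def query(self, x):
        hits = set()
        for table, key in zip(self.tables, self._keys(x)):
            hits.update(table.get(key, []))
        return hits   # candidate neurons with high cosine similarity to x

lsh = DeviceLSH(dim=16, num_tables=2, hash_len=6, seed=42)
W = np.random.randn(100, 16)                           # layer weight vectors
for i, w in enumerate(W):
    lsh.insert(i, w)
print(3 in lsh.query(W[3]))                            # True
```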

Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing
Josh Alman Jiehao Liang Zhao Song Ruizhe Zhang Danyang Zhuo



Research question: How to reduce the computation time of deep neural network training?
Motivation: As neural network models grow in size, training consumes ever more computing resources.
Method: Propose a new preprocessing method that stores weight-data correlations in a tree data structure to quickly and dynamically detect which neurons fire in each iteration.
Results: The method requires only $O(nmd)$ preprocessing time and still achieves $o(nmd)$ time per iteration. The paper also proves a complementary lower bound for this algorithm.

Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-the-art deep neural networks are becoming larger in size every year to deliver increasing model accuracy, and as a result, model training consumes substantial computing resources and will only consume more in the future. Using current training methods, in each iteration, to process a data point $x \in \mathbb{R}^d$ in a layer, we need to spend $\Theta(md)$ time to evaluate all the $m$ neurons in the layer. This means processing the entire layer takes $\Theta(nmd)$ time for $n$ data points. Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires exponential time to preprocess either the data or the neural network weights, making it unlikely to have practical usage. In this work, we present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly and dynamically detect which neurons fire at each iteration. Our method requires only $O(nmd)$ time in preprocessing and still achieves $o(nmd)$ time per iteration. We complement our new algorithm with a lower bound, proving that assuming a popular conjecture from complexity theory, one could not substantially speed up our algorithm for dynamic detection of firing neurons.

Improving Robustness with Adaptive Weight Decay
Amin Ghiasi Ali Shafahi Reza Ardekani



Research question: How to automatically tune the weight-decay hyperparameter during training so as to improve adversarial robustness.
Motivation: Adversarial training suffers from robust overfitting, and traditional weight decay requires careful, per-dataset hyperparameter tuning.
Method: Propose adaptive weight decay, which adjusts the weight-decay hyperparameter on the fly in each training iteration based on the relative strength of the updates from the classification loss (the gradient of the cross-entropy) and the regularization loss (the l2-norm of the weights).
Results: This simple modification yields large improvements in adversarial robustness without extra data, e.g., 20% relative robustness improvement on CIFAR-100 and 10% on CIFAR-10 over the best-tuned traditional weight decay, along with reduced sensitivity to the learning rate and smaller weight norms, which aid robustness to label noise and pruning.

We propose adaptive weight decay, which automatically tunes the hyper-parameter for weight decay during each training iteration. For classification problems, we propose changing the value of the weight decay hyper-parameter on the fly based on the strength of updates from the classification loss (i.e., gradient of cross-entropy), and the regularization loss (i.e., $\ell_2$-norm of the weights). We show that this simple modification can result in large improvements in adversarial robustness, an area which suffers from robust overfitting, without requiring extra data, across various datasets and architecture choices. For example, our reformulation results in 20\% relative robustness improvement for CIFAR-100, and 10\% relative robustness improvement on CIFAR-10 compared to the best-tuned hyper-parameters of traditional weight decay, resulting in models that have comparable performance to SOTA robustness methods. In addition, this method has other desirable properties, such as less sensitivity to learning rate and smaller weight norms; the latter contributes to robustness against overfitting to label noise and to pruning.
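
The on-the-fly rule described above, weight decay scaled by the ratio of the cross-entropy gradient norm to the weight norm, can be sketched in a few lines of PyTorch. The base coefficient `lambda_bar` and the exact normalization are our assumptions; the paper's precise update may differ:

```python
import torch

def adaptive_weight_decay_step(model, ce_loss, opt, lambda_bar=0.02):
    """One training step where the weight-decay coefficient is re-tuned
    from the ratio of the cross-entropy gradient norm to the weight norm."""
    opt.zero_grad()
    ce_loss.backward()
    grad_norm = torch.sqrt(sum((p.grad.detach() ** 2).sum()
                               for p in model.parameters() if p.grad is not None))
    weight_sq = sum((p ** 2).sum() for p in model.parameters())
    lam = lambda_bar * grad_norm / (weight_sq.detach().sqrt() + 1e-12)
    (0.5 * lam * weight_sq).backward()   # adds lam * w to each weight's gradient
    opt.step()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
adaptive_weight_decay_step(model, torch.nn.functional.cross_entropy(model(x), y), opt)
```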

Beyond Exponential Graph: Communication-Efficient Topologies for Decentralized Learning via Finite-time Convergence
Yuki Takezawa Ryoma Sato Han Bao Kenta Niwa Makoto Yamada



Research question: How to design a topology with both a fast consensus rate and a small maximum degree to improve the convergence speed and accuracy of decentralized learning.
Motivation: Existing topologies with fast consensus rates, such as the exponential graph, incur significant communication costs because of their large maximum degree; topologies combining a fast consensus rate with a small maximum degree are therefore important.
Method: Propose a new topology, the Base-(k+1) Graph, which combines the advantages of a fast consensus rate and a small maximum degree. Unlike existing topologies, the Base-(k+1) Graph enables all nodes to reach exact consensus after a finite number of iterations.
Results: Experiments show that the Base-(k+1) Graph enables various decentralized learning methods to achieve higher accuracy with better communication efficiency than existing topologies.

Decentralized learning has recently been attracting increasing attention for its applications in parallel computation and privacy preservation. Many recent studies stated that the underlying network topology with a faster consensus rate (a.k.a. spectral gap) leads to a better convergence rate and accuracy for decentralized learning. However, a topology with a fast consensus rate, e.g., the exponential graph, generally has a large maximum degree, which incurs significant communication costs. Thus, seeking topologies with both a fast consensus rate and small maximum degree is important. In this study, we propose a novel topology combining both a fast consensus rate and small maximum degree called the Base-$\left(k+1\right)$ Graph. Unlike the existing topologies, the Base-$\left(k+1\right)$ Graph enables all nodes to reach the exact consensus after a finite number of iterations for any number of nodes and maximum degree $k$. Thanks to this favorable property, the Base-$\left(k+1\right)$ Graph endows Decentralized SGD (DSGD) with both a faster convergence rate and more communication efficiency than the exponential graph. We conducted experiments with various topologies, demonstrating that the Base-$\left(k+1\right)$ Graph enables various decentralized learning methods to achieve higher accuracy with better communication efficiency than the existing topologies. Our code is available at https://github.com/yukiTakezawa/BaseGraph.

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Qiong Wu Wei Yu Yiyi Zhou Shubin Huang Xiaoshuai Sun Rongrong Ji



Research question: This paper addresses the excessive parameter and computation overhead of adapting vision-language pre-trained (VLP) models to downstream tasks.
Motivation: Current approaches focus on parameter-efficient transfer learning by updating only a small number of parameters, but computational redundancy still plagues the application of VLPs.
Method: This paper proposes a novel dynamic architecture skipping (DAS) approach for effective parameter- and computation-efficient transfer learning. The method first observes the significance of VLP modules to the downstream task via a reinforcement-learning-based process, and then skips the redundant modules according to the obtained rewards, replacing them with lightweight networks (i.e., adapters).
Results: Experiments show that DAS not only effectively reduces computational complexity but also remains competitive with existing parameter-efficient transfer learning methods in terms of parameter scale and performance.

With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only updating a small number of parameters. However, excessive computational overhead still plagues the application of VLPs. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significances of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can well maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a bunch of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g., -11.97% FLOPs of METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is given in our appendix.

Q-DM: An Efficient Low-bit Quantized Diffusion Model
Yanjing Li Sheng Xu Xianbin Cao Xiao Sun Baochang Zhang



Research question: Denoising diffusion generative models can generate high-quality data, but their computational cost is high because iterative noise estimation uses full-precision networks.
Motivation: To reduce computation and memory consumption, researchers have tried quantizing diffusion models, but low-bit noise-estimation networks in diffusion models perform far worse than their full-precision counterparts.
Method: This paper proposes a Timestep-aware Quantization (TaQ) method and a Noise-estimating Mimicking (NeM) scheme to respectively eliminate the activation-distribution oscillation and the accumulated quantization error in low-bit quantized diffusion models.
Results: Experiments show significant improvements on the popular DDPM and DDIM models. For example, a 4-bit Q-DM theoretically accelerates the 1000-step DDPM by 7.8x and achieves an FID score of 5.17 on the unconditional CIFAR-10 dataset.

Denoising diffusion generative models are capable of generating high-quality data, but suffer from a computation-costly generation process due to iterative noise estimation using full-precision networks. As an intuitive solution, quantization can significantly reduce computational and memory consumption via low-bit parameters and operations. However, low-bit noise estimation networks in diffusion models (DMs) remain largely unexplored and perform much worse than their full-precision counterparts, as observed in our experimental studies. In this paper, we first identify that the bottlenecks of low-bit quantized DMs come from a large distribution oscillation on activations and accumulated quantization error caused by the multi-step denoising process. To address these issues, we develop a Timestep-aware Quantization (TaQ) method and a Noise-estimating Mimicking (NeM) scheme for low-bit quantized DMs (Q-DM) to effectively eliminate such oscillation and accumulated error respectively, leading to well-performing low-bit DMs. In this way, we propose an efficient Q-DM to calculate low-bit DMs by considering both the training and inference process in the same framework. We evaluate our methods on the popular DDPM and DDIM models. Extensive experimental results show that our method achieves a much better performance than the prior arts. For example, the 4-bit Q-DM theoretically accelerates the 1000-step DDPM by 7.8x and achieves an FID score of 5.17 on the unconditional CIFAR-10 dataset.

Efficient Activation Function Optimization through Surrogate Modeling
Garrett Bingham Risto Miikkulainen



Research question: Well-designed activation functions are crucial for improving neural network performance on many machine learning tasks, but optimal activation functions are difficult for humans to construct and existing activation function search algorithms are prohibitively expensive.
Motivation: This paper improves the state of the art in three steps: first, it creates the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions; second, it develops a characterization of the benchmark space and proposes a new surrogate-based optimization method; third, it uses the surrogate to discover improved activation functions on several real-world tasks.
Method: By systematically generating a large number of activation functions and training models, benchmark datasets are created; analyzing the spectrum of the Fisher information matrix of the model's predictive distribution at initialization together with the activation function's output distribution yields a highly performance-predictive surrogate, which is then used to search for better activation functions on multiple real-world tasks.
Results: Experiments show that the proposed surrogate-based optimization finds new activation functions that outperform existing ones on several real-world tasks, challenging the status quo of always using rectifier nonlinearities in deep learning. Each of these steps is a contribution in its own right; together they provide a practical and theoretical foundation for further research on activation function optimization.

Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in several real-world tasks, with a surprising finding: a sigmoidal design that outperformed all other activation functions was discovered, challenging the status quo of always using rectifier nonlinearities in deep learning. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization.

Scale-Space Hypernetworks for Efficient Biomedical Image Analysis
Jose Javier Gonzalez Ortiz John Guttag Adrian V Dalca



Research question: How to balance the computational efficiency and accuracy of convolutional neural networks in medical image analysis tasks.
Motivation: Existing CNN models are computationally intensive on volumetric data. Adjusting the rescaling factors of the downsampling and upsampling layers trades accuracy for computational efficiency, but exploring this trade-off with existing models is prohibitively expensive.
Method: Propose Scale-Space HyperNetworks (SSHN), which learns a spectrum of CNNs with varying internal rescaling factors. Training a single SSHN matches, and occasionally surpasses, the results of training many separate networks with fixed rescaling factors.
Results: Demonstrated across several medical image analysis applications: compared with strategies using fixed and dynamic rescaling factors, SSHN consistently provides a better accuracy-efficiency trade-off at a fraction of the training cost.

Convolutional Neural Networks (CNNs) are the predominant model used for a variety of medical image analysis tasks. At inference time, these models are computationally intensive, especially with volumetric data. In principle, it is possible to trade accuracy for computational efficiency by manipulating the rescaling factor in the downsample and upsample layers of CNN architectures. However, properly exploring the accuracy-efficiency trade-off is prohibitively expensive with existing models. To address this, we introduce Scale-Space HyperNetworks (SSHN), a method that learns a spectrum of CNNs with varying internal rescaling factors. A single SSHN characterizes an entire Pareto accuracy-efficiency curve of models that match, and occasionally surpass, the outcomes of training many separate networks with fixed rescaling factors. We demonstrate the proposed approach in several medical image analysis applications, comparing SSHN against strategies with both fixed and dynamic rescaling factors. We find that SSHN consistently provides a better accuracy-efficiency trade-off at a fraction of the training cost. Trained SSHNs enable the user to quickly choose a rescaling factor that appropriately balances accuracy and computational efficiency for their particular needs at inference.

Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians
Rainer Engelken



Research question: Training recurrent neural networks (RNNs) is hampered by unstable gradients over long time horizons, leading to exploding and vanishing gradients.
Motivation: Recent research links these problems to Lyapunov exponents, which describe the growth or shrinkage of infinitesimal perturbations.
Method: Propose gradient flossing, which tackles gradient instability by pushing the Lyapunov exponents of the forward dynamics toward zero during training. This is achieved by regularizing the Lyapunov exponents through backpropagation using differentiable linear algebra.
Results: Experiments show that gradient flossing controls not only the gradient norm but also the condition number of the long-term Jacobian, facilitating multidimensional error feedback propagation. On tasks with long time horizons, applying gradient flossing markedly improves both success rate and convergence speed.

Training recurrent neural networks (RNNs) remains a challenge due to the instability of gradients across long time horizons, which can lead to exploding and vanishing gradients. Recent research has linked these problems to the values of Lyapunov exponents for the forward-dynamics, which describe the growth or shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a novel approach to tackling gradient instability by pushing Lyapunov exponents of the forward dynamics toward zero during learning. We achieve this by regularizing Lyapunov exponents through backpropagation using differentiable linear algebra. This enables us to "floss" the gradients, stabilizing them and thus improving network training. We show that gradient flossing controls not only the gradient norm but also the condition number of the long-term Jacobian, facilitating multidimensional error feedback propagation. We find that applying gradient flossing before training enhances both the success rate and convergence speed for tasks involving long time horizons. For challenging tasks, we show that gradient flossing during training can further increase the time horizon that can be bridged by backpropagation through time. Moreover, we demonstrate the effectiveness of our approach on various RNN architectures and tasks of variable temporal complexity. Additionally, we provide a simple implementation of our gradient flossing algorithm that can be used in practice. Our results indicate that gradient flossing via regularizing Lyapunov exponents can significantly enhance the effectiveness of RNN training and mitigate the exploding and vanishing gradients problem.
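
A toy version of the regularizer: estimate finite-time Lyapunov exponents of a recurrent map with differentiable Jacobians and QR re-orthonormalization, then penalize their squared magnitudes so training pushes them toward zero. This follows the abstract's description; the paper's estimator may differ in detail, and the explicit Jacobian here is chosen for clarity, not speed:

```python
import torch

def lyapunov_penalty(step, h0, inputs, k=3):
    """Estimate the k leading finite-time Lyapunov exponents of the forward
    dynamics h <- step(x, h) and return the squared-magnitude penalty."""
    h = h0
    Q = torch.eye(h0.numel())[:, :k]
    log_growth = torch.zeros(k)
    for x in inputs:
        J = torch.autograd.functional.jacobian(lambda s: step(x, s), h,
                                               create_graph=True)
        Q, R = torch.linalg.qr(J @ Q)                      # re-orthonormalize
        log_growth = log_growth + torch.log(torch.diagonal(R).abs() + 1e-12)
        h = step(x, h)
    lyap = log_growth / len(inputs)        # finite-time Lyapunov exponents
    return (lyap ** 2).sum()               # add this term to the task loss

d = 8
W = (0.5 * torch.randn(d, d)).requires_grad_()
step = lambda x, h: torch.tanh(W @ h + x)  # simple tanh RNN cell
penalty = lyapunov_penalty(step, torch.zeros(d), [torch.randn(d) for _ in range(20)])
penalty.backward()                         # gradients flow into W
```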

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Haixin Wang Xinlong Yang Jianlong Chang Dian Jin Jinan Sun Shikun Zhang Xiao Luo Qi Tian



Research question: How to further reduce complexity with a lightweight design and strengthen cross-modal alignment under extremely low parameter budgets.
Motivation: With the progress of large-scale pre-training, parameter-efficient transfer learning has gained popularity across subfields of artificial intelligence, but how to improve modality alignment with very few parameters remains unsolved.
Method: Propose AURORA, an elegant cross-modal transfer framework. It first uses mode approximation to generate 0.1M trainable parameters for multimodal parameter-efficient tuning, and then, for better modality alignment, introduces Informative Context Enhancement and Gated Query Transformation modules for extremely low-parameter regimes.
Results: Strong performance on six cross-modal benchmarks, not only outperforming the state of the art but even surpassing full fine-tuning.

Driven by the progress of large-scale pre-training, parameter-efficient transfer learning has gained immense popularity across different subfields of Artificial Intelligence. The core is to adapt the model to downstream tasks with only a small set of parameters. Recently, researchers have leveraged such proven techniques in multimodal tasks and achieved promising results. However, two critical issues remain unresolved: how to further reduce the complexity with lightweight design and how to boost alignment between modalities under extremely low parameters. In this paper, we propose A gracefUl pRompt framewOrk for cRoss-modal trAnsfer (AURORA) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal parameter-efficient tuning, which explores the low intrinsic dimension with only 0.04% parameters of the pre-trained model. Then, for better modality alignment, we propose the Informative Context Enhancement and Gated Query Transformation modules under extremely low-parameter scenarios. A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach. Our code is available at: https://github.com/WillDreamer/Aurora.

SPACE: Single-round Participant Amalgamation for Contribution Evaluation in Federated Learning
Yi-Chung Chen Hsi-Wen Chen Shun-Guei Wang Ming-Syan Chen



Research question: Evaluating participant contributions in federated learning.
Motivation: Existing evaluation methods rely mainly on the computationally expensive Shapley value and require multiple communication rounds.
Method: Propose SPACE, an efficient evaluation method with two novel components, Federated Knowledge Amalgamation and Prototype-based Model Evaluation, which removes the dependence on validation-set size and enables participant evaluation within a single communication round.
Results: Experiments show that SPACE outperforms existing methods in both running time and Pearson's correlation coefficient, and is effective in applications such as client reweighting and client selection.

The evaluation of participant contribution in federated learning (FL) has recently gained significant attention due to its applicability in various domains, such as incentive mechanisms, robustness enhancement, and client selection. Previous approaches have predominantly relied on the widely adopted Shapley value for participant evaluation. However, the computation of the Shapley value is expensive, despite using techniques like gradient-based model reconstruction and truncating unnecessary evaluations. Therefore, we present an efficient approach called Single-round Participants Amalgamation for Contribution Evaluation (SPACE). SPACE incorporates two novel components, namely Federated Knowledge Amalgamation and Prototype-based Model Evaluation to reduce the evaluation effort by eliminating the dependence on the size of the validation set and enabling participant evaluation within a single communication round. Experimental results demonstrate that SPACE outperforms state-of-the-art methods in terms of both running time and Pearson’s Correlation Coefficient (PCC). Furthermore, extensive experiments conducted on applications, client reweighting, and client selection highlight the effectiveness of SPACE. The code is available at https://github.com/culiver/SPACE.

Improved Communication Efficiency in Federated Natural Policy Gradient via ADMM-based Gradient Updates
Guangchen Lan Han Wang James Anderson Christopher Brinton Vaneet Aggarwal



Research question: How to let agents collaboratively train a global policy without sharing individual data while reducing communication overhead.
Motivation: Federated reinforcement learning (FedRL) enables agents to collaboratively train a global policy, but high communication overhead is a critical bottleneck, especially for second-order natural policy gradient methods.
Method: Propose the FedNPG-ADMM framework, which leverages the alternating direction method of multipliers (ADMM) to efficiently approximate global natural gradient directions. We theoretically show that ADMM-based gradient updates reduce the per-iteration communication complexity from O(d^2) to O(d), where d is the number of model parameters.
Results: Evaluations in MuJoCo environments show that FedNPG-ADMM maintains the reward performance of standard FedNPG, and its convergence rate improves as the number of federated agents increases.

Federated reinforcement learning (FedRL) enables agents to collaboratively train a global policy without sharing their individual data. However, high communication overhead remains a critical bottleneck, particularly for natural policy gradient (NPG) methods, which are second-order. To address this issue, we propose the FedNPG-ADMM framework, which leverages the alternating direction method of multipliers (ADMM) to approximate global NPG directions efficiently. We theoretically demonstrate that using ADMM-based gradient updates reduces communication complexity from $\mathcal{O}({d^{2}})$ to $\mathcal{O}({d})$ at each iteration, where $d$ is the number of model parameters. Furthermore, we show that achieving an $\epsilon$-error stationary convergence requires $\mathcal{O}(\frac{1}{(1-\gamma)^{2}{\epsilon}})$ iterations for discount factor $\gamma$, demonstrating that FedNPG-ADMM maintains the same convergence rate as standard FedNPG. Through evaluation of the proposed algorithms in MuJoCo environments, we demonstrate that FedNPG-ADMM maintains the reward performance of standard FedNPG, and that its convergence rate improves when the number of federated agents increases.

VCC: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
Zhanpeng Zeng Cole Hawkins Mingyi Hong Aston Zhang Nikolaos Pappas Vikas Singh Shuai Zheng



Research question: How can Transformer efficiency on ultra-long sequences be improved?
Motivation: Despite much work on reducing the quadratic cost of Transformers, handling ultra-long sequences of more than 16K tokens remains challenging.
Method: Propose the VIP-token centric compression (VCC) scheme, which compresses the sequence into a much smaller representation at each layer, substantially improving Transformer efficiency on ultra-long sequences.
Results: Compared with competitive baselines, the algorithm is not only efficient (more than 3x compute-efficiency gains over baselines at 4K and 16K lengths) but also offers competitive or better performance on a large number of tasks. It further scales to 128K tokens (or more) while consistently improving accuracy.

Transformers are central in modern natural language processing and computer vision applications. Despite recent works devoted to reducing the quadratic cost of such models with respect to sequence length, dealing with ultra long sequences (e.g., $>$16K tokens) remains challenging. Applications such as answering questions based on a book or summarizing a scientific article are inefficient or infeasible. Here, we propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer. Specifically, by exploiting the fact that in many tasks, only a small subset of special tokens, which we call VIP-tokens, are most relevant to the final prediction, we propose a VIP-token centric compression (VCC) scheme which selectively compresses the sequence based on their impact on approximating the representation of the VIP-tokens. Compared with competitive baselines, our algorithm is not only efficient (achieving more than $3\times$ compute efficiency gain compared to baselines on 4K and 16K lengths), but also offers competitive/better performance on a large number of tasks. Further, we show that our algorithm scales to 128K tokens (or more) while consistently offering accuracy improvement. Code is available at https://github.com/mlpen/VCC.

Probabilistic Weight Fixing: Large-scale training of neural network weight uncertainties for quantisation.
Chris Subia-Waud Srinandan Dasmahapatra



Research question: How to reduce the inference-time energy consumption of large neural networks by constraining their weights to a limited set of values.
Motivation: Existing methods usually treat weights based on value alone, ignoring the unique role of weight position.
Method: A probabilistic framework based on Bayesian neural networks (BNNs) and a variational relaxation that decides which weights can be moved to which cluster center, and by how much, according to their position-specific learned uncertainty distributions.
Results: Leveraging the flexibility of weights drawn from probability distributions improves noise resilience and compressibility. The iterative clustering procedure achieves better compression and higher accuracy than state-of-the-art methods on both ResNet models and more complex Transformer-based architectures. On ImageNet with DeiT-Tiny, the method outperforms the state-of-the-art quantization method by 1.6% top-1 accuracy, with its 5 million+ weights represented by only 296 unique values.

Weight-sharing quantization has emerged as a technique to reduce energy expenditure during inference in large neural networks by constraining their weights to a limited set of values. However, existing methods often assume weights are treated solely based on value, neglecting the unique role of weight position. This paper proposes a probabilistic framework based on Bayesian neural networks (BNNs) and a variational relaxation to identify which weights can be moved to which cluster center and to what degree based on their individual position-specific learned uncertainty distributions. We introduce a new initialization setting and a regularization term, enabling the training of BNNs with complex dataset-model combinations. Leveraging the flexibility of weight values from probability distributions, we enhance noise resilience and compressibility. Our iterative clustering procedure demonstrates superior compressibility and higher accuracy compared to state-of-the-art methods on both ResNet models and the more complex transformer-based architectures. In particular, our method outperforms the state-of-the-art quantization method by 1.6\% top-1 accuracy on ImageNet using DeiT-Tiny, with its 5 million+ weights now represented by only 296 unique values. Code available at https://github.com/subiawaud/PWFN.
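
A minimal sketch of the position-specific idea, under assumptions not in the abstract (the threshold rule and the name assign_weights_to_centers are hypothetical): each weight carries its own learned uncertainty, and only weights whose nearest codebook value lies within a few of their own standard deviations are snapped to it.

```python
import torch

def assign_weights_to_centers(mu, sigma, centers, k=2.0):
    """Hypothetical uncertainty-aware weight fixing: a weight whose posterior
    mean mu is within k standard deviations of its nearest cluster center is
    snapped to that center; confident outliers stay untouched.

    mu, sigma: per-weight posterior mean / std  (shape [n])
    centers:   shared codebook values           (shape [m])
    """
    dist = (mu[:, None] - centers[None, :]).abs()          # [n, m] distances
    nearest = dist.argmin(dim=1)                           # closest center index
    movable = dist.gather(1, nearest[:, None]).squeeze(1) <= k * sigma
    fixed = torch.where(movable, centers[nearest], mu)
    return fixed, movable

mu = torch.tensor([0.11, -0.52, 0.98, 0.30])
sigma = torch.tensor([0.05, 0.01, 0.20, 0.002])
centers = torch.tensor([-0.5, 0.0, 0.5, 1.0])
print(assign_weights_to_centers(mu, sigma, centers))
```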

Federated Learning via Meta-Variational Dropout
Insu Jeon Minui Hong Junhyeog Yun Gunhee Kim



Research question: Traditional federated learning in practice suffers from model overfitting and divergent local models caused by limited, non-IID client data.
Motivation: To address these issues, a novel Bayesian meta-learning approach called meta-variational dropout (MetaVD) is proposed.
Method: MetaVD predicts client-dependent dropout rates via a shared hypernetwork, enabling effective personalization of federated learning algorithms in limited non-IID data settings.
Results: Extensive experiments on various sparse and non-IID federated learning datasets show excellent classification accuracy and uncertainty calibration, especially for out-of-distribution (OOD) clients. MetaVD also compresses the local model parameters needed per client, mitigating overfitting and reducing communication costs.

Federated Learning (FL) aims to train a global inference model from remotely distributed clients, gaining popularity due to its benefit of improving data privacy. However, traditional FL often faces challenges in practical applications, including model overfitting and divergent local models due to limited and non-IID data among clients. To address these issues, we introduce a novel Bayesian meta-learning approach called meta-variational dropout (MetaVD). MetaVD learns to predict client-dependent dropout rates via a shared hypernetwork, enabling effective model personalization of FL algorithms in limited non-IID data settings. We also emphasize the posterior adaptation view of meta-learning and the posterior aggregation view of Bayesian FL via the conditional dropout posterior. We conducted extensive experiments on various sparse and non-IID FL datasets. MetaVD demonstrated excellent classification accuracy and uncertainty calibration performance, especially for out-of-distribution (OOD) clients. MetaVD compresses the local model parameters needed for each client, mitigating model overfitting and reducing communication costs. Code is available at https://github.com/insujeon/MetaVD.

Learning Large Graph Property Prediction via Graph Segment Training
Kaidi Cao Phitchaya Mangpo Phothilimthana Sami Abu-El-Haija Dustin Zelle Yanqi Zhou Charith Mendis Jure Leskovec Bryan Perozzi



Research question: How to effectively learn to predict properties of large graphs while keeping memory usage bounded during training.
Motivation: Predicting a property of a large graph requires knowledge of the entire graph, but the memory available during training is limited.
Method: Propose Graph Segment Training (GST), a general divide-and-conquer framework for large-graph property prediction with a constant memory footprint. GST first partitions a large graph into segments, then backpropagates through only a few segments sampled per training iteration.
Results: A historical embedding table efficiently provides embeddings for segments not sampled for backpropagation, and two novel techniques mitigate the staleness of these embeddings. Experiments show that the complete method, GST-EFD, is both memory-efficient and fast on two large-graph property prediction benchmarks, while offering a slight boost in test accuracy over a typical full-graph training regime.

Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
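
A minimal sketch of one such training step, assuming simplified placeholders (the names gst_step, encoder, head, the mean aggregation, and the MSE loss are all hypothetical; the table would be initialized by an earlier pass):

```python
import torch

def gst_step(segments, encoder, head, table, y, n_backprop=2):
    """Hypothetical GST step: backprop through a few sampled segments,
    fetch stale embeddings for the rest from a historical table."""
    idx = torch.randperm(len(segments))
    hot, cold = idx[:n_backprop].tolist(), idx[n_backprop:].tolist()
    embs = [None] * len(segments)
    for i in hot:                          # fresh embeddings, with gradients
        embs[i] = encoder(segments[i])
        table[i] = embs[i].detach()        # refresh the historical table
    for i in cold:                         # stale embeddings, no gradients
        embs[i] = table[i]
    graph_emb = torch.stack(embs).mean(0)  # aggregate segment embeddings
    loss = torch.nn.functional.mse_loss(head(graph_emb), y)
    loss.backward()
    return loss
```

Memory stays constant because activations are kept only for the n_backprop sampled segments, regardless of graph size.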

Deep Patch Visual Odometry
Zachary Teed Lahav Lipson Jia Deng



Research question: This paper proposes Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular visual odometry.
Motivation: Existing visual odometry methods have significantly improved state-of-the-art accuracy by using deep networks to predict dense flow between video frames, but the high computational cost of dense flow makes these methods impractical for many use cases.
Method: DPVO uses a novel recurrent network architecture for tracking image patches across time, and introduces a novel recurrent update operator for patch-based correspondence coupled with differentiable bundle adjustment.
Results: Experiments show that DPVO outperforms all prior work on all standard benchmarks, including the learning-based state-of-the-art visual odometry system (DROID), while using a third of the memory and running 3x faster on average.

We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular Visual Odometry (VO). DPVO uses a novel recurrent network architecture designed for tracking image patches across time. Recent approaches to VO have significantly improved the state-of-the-art accuracy by using deep networks to predict dense flow between video frames. However, using dense flow incurs a large computational cost, making these previous methods impractical for many use cases. Despite this, it has been assumed that dense flow is important as it provides additional redundancy against incorrect matches. DPVO disproves this assumption, showing that it is possible to get the best accuracy and efficiency by exploiting the advantages of sparse patch-based matching over dense flow. DPVO introduces a novel recurrent update operator for patch-based correspondence coupled with differentiable bundle adjustment. On standard benchmarks, DPVO outperforms all prior work, including the learning-based state-of-the-art VO system (DROID), using a third of the memory while running 3x faster on average. Code is available at https://github.com/princeton-vl/DPVO

You Only Condense Once: Two Rules for Pruning Condensed Datasets
Yang He Lingao Xiao Joey Tianyi Zhou



Research question: How to improve on-device training efficiency by reducing the size of the training dataset.
Motivation: On-device computational resources are limited, calling for a way to flexibly resize the dataset without running additional condensation processes.
Method: Propose You Only Condense Once (YOCO), which, on top of one condensed dataset, produces smaller condensed datasets with two simple dataset pruning rules: Low LBPE Score and Balanced Construction.
Results: Experiments on networks including ConvNet, ResNet, and DenseNet, and on datasets including CIFAR-10, CIFAR-100, and ImageNet, validate the approach; for example, on CIFAR-10, YOCO surpassed various dataset condensation and dataset pruning methods, achieving accuracy gains of 6.98%-8.89% and 6.31%-23.92%, respectively. The code is open-sourced on GitHub.

Dataset condensation is a crucial tool for enhancing training efficiency by reducing the size of the training dataset, particularly in on-device scenarios. However, these scenarios have two significant challenges: 1) the varying computational resources available on the devices require a dataset size different from the pre-defined condensed dataset, and 2) the limited computational resources often preclude the possibility of conducting additional condensation processes. We introduce You Only Condense Once (YOCO) to overcome these limitations. On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules: Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including CIFAR-10, CIFAR-100 and ImageNet. For example, our YOCO surpassed various dataset condensation and dataset pruning methods on CIFAR-10 with ten Images Per Class (IPC), achieving 6.98-8.89% and 6.31-23.92% accuracy gains, respectively. The code is available at: https://github.com/he-y/you-only-condense-once.
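
A minimal sketch of the two pruning rules combined, under stated assumptions (the name yoco_prune is hypothetical, and `scores` stands in for the LBPE score, which the sketch does not compute): keep the lowest-scoring examples per class, with the same budget for every class.

```python
import numpy as np

def yoco_prune(images, labels, scores, ipc_target):
    """Hypothetical sketch of the two YOCO rules: keep the lowest-scoring
    examples ("Low LBPE Score", with `scores` standing in for LBPE) and keep
    exactly ipc_target images per class ("Balanced Construction")."""
    keep = []
    for c in np.unique(labels):
        cls = np.where(labels == c)[0]
        order = cls[np.argsort(scores[cls])]   # ascending score within class
        keep.extend(order[:ipc_target])        # identical budget per class
    keep = np.array(keep)
    return images[keep], labels[keep]
```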

A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting
Alexander Tyurin Peter Richtárik



Research question: This paper proposes a new method for distributed optimization and federated learning that combines three key components: variance reduction of stochastic gradients, partial participation, and compressed communication.
Motivation: To address low node participation and high communication overhead in federated learning.
Method: By combining the three components, the method achieves optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting, without requiring the participation of all nodes or a bounded-gradient (dissimilarity) assumption.
Results: With or without the communication compression feature, the method successfully combines variance reduction and partial participation: it attains optimal oracle complexity, never needs the participation of all nodes, and does not require the bounded gradients (dissimilarity) assumption.

We present a new method that includes three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, partial participation, and compressed communication. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Regardless of the communication compression feature, our method successfully combines variance reduction and partial participation: we get the optimal oracle complexity, never need the participation of all nodes, and do not require the bounded gradients (dissimilarity) assumption.

Module-wise Training of Neural Networks via the Minimizing Movement Scheme
Skander Karkar Ibrahim Ayed Emmanuel de Bezenac patrick gallinari



Research question: Solving the stagnation problem of module-wise neural network training on memory-limited devices, whereby early layers overfit and deeper layers stop increasing test accuracy.
Motivation: Greedy layer-wise or module-wise training is attractive on memory-constrained devices, but suffers from stagnation.
Method: Introduce a simple module-wise regularization inspired by the minimizing movement scheme for gradient flows in distribution space.
Results: Experiments show improved accuracy of module-wise training for various architectures (such as ResNets, Transformers, and VGG) when the regularization is added, superior to other module-wise training methods and often to end-to-end training, with up to 60% less memory usage.

Greedy layer-wise or module-wise training of neural networks is compelling in constrained and on-device settings where memory is limited, as it circumvents a number of problems of end-to-end back-propagation. However, it suffers from a stagnation problem, whereby early layers overfit and deeper layers stop increasing the test accuracy after a certain depth. We propose to solve this issue by introducing a simple module-wise regularization inspired by the minimizing movement scheme for gradient flows in distribution space. We call the method TRGL for Transport Regularized Greedy Learning and study it theoretically, proving that it leads to greedy modules that are regular and that progressively solve the task. Experimentally, we show improved accuracy of module-wise training of various architectures such as ResNets, Transformers and VGG, when our regularization is added, superior to that of other module-wise training methods and often to end-to-end training, with as much as 60% less memory usage.

Rubik's Cube: High-Order Channel Interactions with a Hierarchical Receptive Field
Naishan Zheng Man Zhou Chong Zhou Chen Change Loy



Research question: This paper addresses the fact that image restoration methods such as convolutions and transformers mainly exploit basic first-order channel interactions and have not maximized the potential of higher-order modeling.
Motivation: To fully exploit relationships within the channel dimension and improve the efficiency and performance of image restoration, a simple yet effective high-order channel-wise operator is proposed.
Method: The operator follows a zero-FLOP, zero-parameter principle, using a spatial-shifting mechanism across channel-wise groups. By turning favorable channel interaction and aggregation into element-wise multiplications and convolution units with $1 \times 1$ kernels, the new formulation extends the first-order channel interactions of prior work to arbitrary high orders, generating a Rubik's-cube-like hierarchical receptive field.
Results: Experiments on various low-level vision tasks, including image denoising, low-light image enhancement, guided image super-resolution, and image deblurring, consistently show that the Rubik's cube convolution improves performance on all tasks.

Image restoration techniques, spanning from the convolution to the transformer paradigm, have demonstrated robust spatial representation capabilities to deliver high-quality performance. Yet, many of these methods, such as convolution and the Feed Forward Network (FFN) structure of transformers, primarily leverage the basic first-order channel interactions and have not maximized the potential benefits of higher-order modeling. To address this limitation, our research dives into understanding relationships within the channel dimension and introduces a simple yet efficient, high-order channel-wise operator tailored for image restoration. Instead of merely mimicking high-order spatial interaction, our approach offers several added benefits: Efficiency: It adheres to the zero-FLOP and zero-parameter principle, using a spatial-shifting mechanism across channel-wise groups. Simplicity: It turns the favorable channel interaction and aggregation capabilities into element-wise multiplications and convolution units with $1 \times 1$ kernel. Our new formulation expands the first-order channel-wise interactions seen in previous works to arbitrary high orders, generating a hierarchical receptive field akin to a Rubik's cube through the combined action of shifting and interactions. Furthermore, our proposed Rubik's cube convolution is a flexible operator that can be incorporated into existing image restoration networks, serving as a drop-in replacement for the standard convolution unit with less parameter overhead. We conducted experiments across various low-level vision tasks, including image denoising, low-light image enhancement, guided image super-resolution, and image de-blurring. The results consistently demonstrate that our Rubik's cube operator enhances performance across all tasks. Code is publicly available at https://github.com/zheng980629/RubikCube.
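
A minimal sketch of a single second-order step built from these ingredients (the module name, the choice of four shift directions, and the gating form are illustrative, not the paper's exact operator): channel groups are shifted spatially at zero FLOPs and zero parameters, mixed with a $1 \times 1$ convolution, and multiplied element-wise with the input.

```python
import torch
import torch.nn as nn

class ShiftInteract(nn.Module):
    """Illustrative second-order channel interaction: split channels into four
    groups, spatially shift each group in a different direction (zero FLOPs,
    zero parameters), then mix with a 1x1 conv and gate element-wise."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    @staticmethod
    def shift(x):
        g = x.shape[1] // 4
        out = torch.zeros_like(x)
        out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]   # shift left
        out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]  # shift right
        out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]   # shift up
        out[:, 3*g:,    1:, :]  = x[:, 3*g:,    :-1, :]  # shift down
        return out

    def forward(self, x):
        return x * self.mix(self.shift(x))  # element-wise product: 2nd order

print(ShiftInteract(16)(torch.randn(1, 16, 8, 8)).shape)
```

Stacking such blocks raises the interaction order further, which is the sense in which the receptive field grows "like a Rubik's cube".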

Scaling Laws for Hyperparameter Optimization
Arlind Kadra Maciej Janowski Martin Wistuba Josif Grabocka



Research question: Hyperparameter optimization for deep learning, in particular how to exploit the power-law nature of learning curves for Bayesian optimization.
Motivation: Although many hyperparameter optimization methods exist, most do not exploit the dominant power-law behavior of learning curves. This paper proposes Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern.
Method: The method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. It is compared against 7 state-of-the-art competitors on 3 benchmarks covering tabular, image, and NLP datasets with 59 diverse tasks.
Results: The method achieves the best results across all benchmarks, obtaining the best any-time results compared to all competitors.

Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization, however, most of the methods do not exploit the dominant power law nature of learning curves for Bayesian optimization. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
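
To make the power-law assumption concrete, here is a minimal single-curve sketch (the data points are made up, and DPL itself fits an ensemble of neural networks conditioned on hyperparameters, not a scipy curve fit): validation error is modeled as a saturating power law of training time, which is what lets early epochs predict the asymptote.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve points: validation error after each epoch.
epochs = np.array([1, 2, 4, 8, 16, 32], dtype=float)
errors = np.array([0.42, 0.31, 0.24, 0.205, 0.188, 0.180])

def power_law(t, y_inf, a, c):
    """Saturating power law: error(t) = y_inf + a * t^(-c)."""
    return y_inf + a * t ** (-c)

popt, _ = curve_fit(power_law, epochs, errors, p0=(0.15, 0.3, 0.5), maxfev=10000)
print(f"predicted error at epoch 100: {power_law(100.0, *popt):.3f} "
      f"(asymptote {popt[0]:.3f})")
```

Configurations whose extrapolated asymptote looks unpromising can be paused early, which is the gray-box scheduling idea in the abstract.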

Chanakya: Learning Runtime Decisions for Adaptive Real-Time Perception
Anurag Ghosh Vaibhav Balloli Akshay Nambi Aditya Singh Tanuja Ganu



Research question: Real-time perception requires planned resource utilization; balancing accuracy and latency is a challenge.
Motivation: Earlier runtime execution frameworks employed rule-based decision algorithms with a fixed algorithm latency budget to balance these considerations, which is sub-optimal and inflexible.
Method: Propose Chanakya, a learned approximate execution framework that derives naturally from the streaming perception paradigm and automatically learns the decisions induced by these trade-offs. Chanakya is trained via novel rewards that balance accuracy and latency implicitly, without approximating either objective.
Results: Chanakya simultaneously considers intrinsic and extrinsic context and predicts decisions in a flexible manner, outperforming state-of-the-art static and dynamic execution policies on public datasets on both server GPUs and edge devices.

Real-time perception requires planned resource utilization. Computational planning in real-time perception is governed by two considerations -- accuracy and latency. There exist run-time decisions (e.g. choice of input resolution) that induce tradeoffs affecting performance on a given hardware, arising from intrinsic (content, e.g. scene clutter) and extrinsic (system, e.g. resource contention) characteristics. Earlier runtime execution frameworks employed rule-based decision algorithms and operated with a fixed algorithm latency budget to balance these concerns, which is sub-optimal and inflexible. We propose Chanakya, a learned approximate execution framework that naturally derives from the streaming perception paradigm, to automatically learn decisions induced by these tradeoffs instead. Chanakya is trained via novel rewards balancing accuracy and latency implicitly, without approximating either objective. Chanakya simultaneously considers intrinsic and extrinsic context, and predicts decisions in a flexible manner. Chanakya, designed with low overhead in mind, outperforms state-of-the-art static and dynamic execution policies on public datasets on both server GPUs and edge devices.

Global Update Tracking: A Decentralized Learning Algorithm for Heterogeneous Data
Sai Aparna Aketi Abolfazl Hashemi Kaushik Roy



Research question: How to design a decentralized learning method that is less susceptible to differences in data distribution across devices.
Motivation: In practical scenarios, data distributions across devices can differ significantly, which can degrade model performance.
Method: Propose Global Update Tracking (GUT), a tracking-based method that mitigates the impact of heterogeneous data in decentralized learning without introducing any communication overhead.
Results: Extensive experiments on various computer vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, and ImageNette), model architectures, and network topologies demonstrate the method's effectiveness, achieving state-of-the-art performance with a 1-6% improvement in test accuracy over other existing techniques.

Decentralized learning enables the training of deep learning models over large distributed datasets generated at different locations, without the need for a central server. However, in practical scenarios, the data distribution across these devices can be significantly different, leading to a degradation in model performance. In this paper, we focus on designing a decentralized learning algorithm that is less susceptible to variations in data distribution across devices. We propose Global Update Tracking (GUT), a novel tracking-based method that aims to mitigate the impact of heterogeneous data in decentralized learning without introducing any communication overhead. We demonstrate the effectiveness of the proposed technique through an exhaustive set of experiments on various Computer Vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, and ImageNette), model architectures, and network topologies. Our experiments show that the proposed method achieves state-of-the-art performance for decentralized learning on heterogeneous data via a 1-6% improvement in test accuracy compared to other existing techniques.

Binarized Spectral Compressive Imaging
Yuanhao Cai Yuxin Zheng Jing Lin Xin Yuan Yulun Zhang Haoqian Wang



Research question: How to perform hyperspectral image (HSI) reconstruction efficiently on resource-limited mobile devices.
Motivation: Existing deep learning models perform well on HSI reconstruction but require powerful hardware with enormous memory and computational resources, making them hard to deploy on resource-limited mobile devices.
Method: Propose a novel, efficient, and practical method, the Binarized Spectral-Redistribution Network (BiSRNet), for recovering HSIs from compressed measurements in snapshot compressive imaging (SCI) systems. A compact, easy-to-deploy base model is first redesigned for binarization, and the basic unit, the Binarized Spectral-Redistribution Convolution (BiSR-Conv), is proposed. Building on BiSR-Conv, four binarized convolutional modules are customized to address the dimension mismatch problem and propagate full-precision information through the whole network. Finally, BiSRNet is obtained by binarizing the base model with the proposed techniques.
Results: Comprehensive quantitative and qualitative experiments show that the proposed BiSRNet outperforms state-of-the-art binarization algorithms. Code and models are publicly available at https://github.com/caiyuanhao1998/BiSCI.

Existing deep learning models for hyperspectral image (HSI) reconstruction achieve good performance but require powerful hardware with enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited mobile devices. In this paper, we propose a novel method, Binarized Spectral-Redistribution Network (BiSRNet), for efficient and practical HSI restoration from compressed measurement in snapshot compressive imaging (SCI) systems. Firstly, we redesign a compact and easy-to-deploy base model to be binarized. Then we present the basic unit, Binarized Spectral-Redistribution Convolution (BiSR-Conv). BiSR-Conv can adaptively redistribute the HSI representations before binarizing activation and uses a scalable hyperbolic tangent function to closer approximate the Sign function in backpropagation. Based on our BiSR-Conv, we customize four binarized convolutional modules to address the dimension mismatch and propagate full-precision information throughout the whole network. Finally, our BiSRNet is derived by using the proposed techniques to binarize the base model. Comprehensive quantitative and qualitative experiments manifest that our proposed BiSRNet outperforms state-of-the-art binarization algorithms. Code and models are publicly available at https://github.com/caiyuanhao1998/BiSCI
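
The scaled-tanh trick for backpropagating through Sign can be shown in isolation (a minimal sketch; BiSR-Conv itself also redistributes the representations, which is omitted here, and the class name is illustrative): forward with Sign, backward with the derivative of tanh(kx), where k can be grown during training so the surrogate approaches Sign.

```python
import torch

class SignWithTanhGrad(torch.autograd.Function):
    """Binarize with Sign in the forward pass; in the backward pass use the
    derivative of tanh(k*x) as a smooth surrogate for Sign's gradient,
    which is zero almost everywhere."""
    @staticmethod
    def forward(ctx, x, k):
        ctx.save_for_backward(x)
        ctx.k = k
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        k = ctx.k
        surrogate = k * (1.0 - torch.tanh(k * x) ** 2)  # d/dx tanh(kx)
        return grad_out * surrogate, None

x = torch.randn(4, requires_grad=True)
SignWithTanhGrad.apply(x, 3.0).sum().backward()
print(x.grad)  # nonzero gradients despite the hard Sign forward
```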

BiMatting: Efficient Video Matting via Binarization
Haotong Qin Lei Ke Xudong Ma Martin Danelljan Yu-Wing Tai Chi-Keung Tang Xianglong Liu Fisher Yu



Research question: Real-time video matting on edge devices faces significant computational resource constraints, limiting its widespread use in applications such as online conferencing and short-form video production.
Motivation: Binarization is a powerful compression approach that greatly reduces computation and memory consumption via 1-bit parameters and bitwise operations. However, binarizing a video matting model is not straightforward: empirical analysis reveals two primary bottlenecks, severe representation degradation in the encoder and massive redundant computation in the decoder.
Method: Propose BiMatting, an accurate and efficient video matting model using binarization. Specifically, shrinkable and dense topologies of the binarized encoder block are constructed to enhance the extracted representation, and binarized units are sparsified to reduce low-information decoding computation.
Results: Extensive experiments show that BiMatting outperforms other binarized video matting models, including state-of-the-art binarization methods, by a significant margin, and even performs comparably to the full-precision counterpart in visual quality. BiMatting also achieves remarkable savings of 12.4x in computation and 21.6x in storage, demonstrating its potential and advantages in real-world resource-constrained scenarios.

Real-time video matting on edge devices faces significant computational resource constraints, limiting the widespread use of video matting in applications such as online conferences and short-form video production. Binarization is a powerful compression approach that greatly reduces computation and memory consumption by using 1-bit parameters and bitwise operations. However, binarization of the video matting model is not a straightforward process, and our empirical analysis has revealed two primary bottlenecks: severe representation degradation of the encoder and massive redundant computations of the decoder. To address these issues, we propose BiMatting, an accurate and efficient video matting model using binarization. Specifically, we construct shrinkable and dense topologies of the binarized encoder block to enhance the extracted representation. We sparsify the binarized units to reduce the low-information decoding computation. Through extensive experiments, we demonstrate that BiMatting outperforms other binarized video matting models, including state-of-the-art (SOTA) binarization methods, by a significant margin. Our approach even performs comparably to the full-precision counterpart in visual quality. Furthermore, BiMatting achieves remarkable savings of 12.4$\times$ and 21.6$\times$ in computation and storage, respectively, showcasing its potential and advantages in real-world resource-constrained scenarios. Our code and models are released at https://github.com/htqin/BiMatting .

Unlocking Deterministic Robustness Certification on ImageNet
Kai Hu Andy Zou Zifan Wang Klas Leino Matt Fredrikson



Research question: Although Lipschitz-based methods offer deterministic robustness guarantees for deep learning, current state-of-the-art results are limited to feed-forward convolutional networks on low-dimensional data such as CIFAR-10.
Motivation: This paper investigates how to scale certifiably robust training to larger, deeper models.
Method: Design a new residual block, leading to the Linear ResNet (LiResNet) architecture, and introduce the Efficient Margin MAximization (EMMA) loss function, which stabilizes robust training by penalizing worst-case adversarial examples from multiple classes simultaneously.
Results: These contributions yield new state-of-the-art robust accuracy on CIFAR-10/100 and Tiny-ImageNet, and, for the first time, scale fast deterministic robustness guarantees to ImageNet, demonstrating that this approach to robust learning can be applied to real-world settings.

Despite the promise of Lipschitz-based methods for provably-robust deep learning with deterministic guarantees, current state-of-the-art results are limited to feed-forward Convolutional Networks (ConvNets) on low-dimensional data, such as CIFAR-10. This paper investigates strategies for expanding certifiably robust training to larger, deeper models. A key challenge in certifying deep networks is efficient calculation of the Lipschitz bound for residual blocks found in ResNet and ViT architectures. We show that fast ways of bounding the Lipschitz constant for conventional ResNets are loose, and show how to address this by designing a new residual block, leading to the *Linear ResNet* (LiResNet) architecture. We then introduce *Efficient Margin MAximization* (EMMA), a loss function that stabilizes robust training by penalizing worst-case adversarial examples from multiple classes simultaneously. Together, these contributions yield new *state-of-the-art* robust accuracy on CIFAR-10/100 and Tiny-ImageNet under $\ell_2$ perturbations. Moreover, for the first time, we are able to scale up fast deterministic robustness guarantees to ImageNet, demonstrating that this approach to robust learning can be applied to real-world applications.

Learning To Dive In Branch And Bound
Max B. Paulus Andreas Krause



Research question: How to use primal heuristics in mixed integer linear programming, in particular diving heuristics, to find feasible solutions that facilitate branch-and-bound search.
Motivation: Existing diving heuristics rely on generic decision rules that fail to exploit the structural commonality between similar problem instances that often arise in practice. L2Dive is therefore proposed to learn specific diving heuristics with graph neural networks.
Method: L2Dive trains generative models to predict variable assignments and leverages linear programming duality to make diving decisions based on the model's predictions. L2Dive is fully integrated into the open-source solver SCIP.
Results: Experiments show that L2Dive outperforms standard diving heuristics, finding better feasible solutions on a range of combinatorial optimization problems. For real-world applications from server load balancing and neural network verification, L2Dive improves the primal-dual integral by up to 7% (35%) on average over a tuned (default) solver baseline and reduces average solving time by 20% (29%).

Primal heuristics are important for solving mixed integer linear programs, because they find feasible solutions that facilitate branch and bound search. A prominent group of primal heuristics are diving heuristics. They iteratively modify and resolve linear programs to conduct a depth-first search from any node in the search tree. Existing divers rely on generic decision rules that fail to exploit structural commonality between similar problem instances that often arise in practice. Therefore, we propose L2Dive to learn specific diving heuristics with graph neural networks: We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions based on the model's predictions. L2Dive is fully integrated into the open-source solver SCIP. We find that L2Dive outperforms standard divers to find better feasible solutions on a range of combinatorial optimization problems. For real-world applications from server load balancing and neural network verification, L2Dive improves the primal-dual integral by up to 7% (35%) on average over a tuned (default) solver baseline and reduces average solving time by 20% (29%).

Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping
Feng Zhang Ming Tian Zhiqiang Li Bin Xu Qingbo Lu Changxin Gao Nong Sang



Research question: This paper addresses the poor performance of globally operating 3D lookup table methods in local regions, in order to improve tone mapping.
Motivation: Because current 3D lookup table methods operate globally on pixel values, they cannot exploit crucial local information, leading to unsatisfactory mapping results in local areas.
Method: A novel strategy that integrates global and local operators via closed-form Laplacian pyramid decomposition and reconstruction. Specifically, image-adaptive 3D LUTs manipulate the tone of the low-frequency image by leveraging the specific characteristics of the frequency information, while local Laplacian filters refine the edge details of the high-frequency components in an adaptive manner.
Results: Extensive experiments on two benchmark datasets show that the method outperforms state-of-the-art approaches in both global tone manipulation and local edge preservation.

Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver satisfactory results in local areas since the look-up table is a global operator for tone mapping, which works based on pixel values and fails to incorporate crucial local information. To this end, this paper aims to address this issue by exploring a novel strategy that integrates global and local operators by utilizing closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we employ image-adaptive 3D LUTs to manipulate the tone in the low-frequency image by leveraging the specific characteristics of the frequency information. Furthermore, we utilize local Laplacian filters to refine the edge details in the high-frequency components in an adaptive manner. Local Laplacian filters are widely used to preserve edge details in photographs, but their conventional usage involves manual tuning and fixed implementation within camera imaging pipelines or photo editing tools. We propose to learn parameter value maps progressively for local Laplacian filters from annotated data using a lightweight network. Our model achieves simultaneous global tone manipulation and local edge detail preservation in an end-to-end manner. Extensive experimental results on two benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
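
Since the closed-form Laplacian pyramid is the glue between the global and local operators, a minimal sketch may help (standard pyramid code, not the paper's implementation; the LUT and local-filter stages are only named in comments): the decomposition is exactly invertible, so tone can be edited on the low-frequency base while edges live in the residual bands.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """Closed-form Laplacian pyramid: a low-frequency base (where a 3D LUT
    would manipulate tone) plus high-frequency residuals (where local
    Laplacian filtering would refine edges). img: [B, C, H, W]."""
    pyr, cur = [], img
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyr.append(cur - up)   # high-frequency band
        cur = down
    pyr.append(cur)            # low-frequency residual
    return pyr

def reconstruct(pyr):
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = F.interpolate(cur, size=band.shape[-2:], mode="bilinear",
                            align_corners=False) + band
    return cur

x = torch.rand(1, 3, 64, 64)
print((reconstruct(laplacian_pyramid(x)) - x).abs().max())  # ~0 (exact)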

Towards Higher Ranks via Adversarial Weight Pruning
Yuchuan Tian Hanting Chen Tianyu Guo Chao Xu Yunhe Wang



Research question: How to effectively deploy convolutional neural networks (CNNs) on edge devices.
Motivation: CNNs are hard to deploy on edge devices due to their high computation and storage complexity, and unstructured pruning presents a structured pattern at high pruning rates, which limits its performance.
Method: Propose a Rank-based PruninG (RPG) method that minimizes the low-rank approximation error of the weight matrices while maximizing their distance from that approximation, guiding sparse weights toward a high-rank topology.
Results: Experiments on various datasets and tasks show the method is highly effective at high sparsity, outperforming the previous state of the art by 1.13% top-1 accuracy on ImageNet with ResNet-50 at 98% sparsity.

Convolutional Neural Networks (CNNs) are hard to deploy on edge devices due to its high computation and storage complexities. As a common practice for model compression, network pruning consists of two major categories: unstructured and structured pruning, where unstructured pruning constantly performs better. However, unstructured pruning presents a structured pattern at high pruning rates, which limits its performance. To this end, we propose a Rank-based PruninG (RPG) method to maintain the ranks of sparse weights in an adversarial manner. In each step, we minimize the low-rank approximation error for the weight matrices using singular value decomposition, and maximize their distance by pushing the weight matrices away from its low rank approximation. This rank-based optimization objective guides sparse weights towards a high-rank topology. The proposed method is conducted in a gradual pruning fashion to stabilize the change of rank during training. Experimental results on various datasets and different tasks demonstrate the effectiveness of our algorithm in high sparsity. The proposed RPG outperforms the state-of-the-art performance by 1.13\% top-1 accuracy on ImageNet in ResNet-50 with 98\% sparsity. The codes are available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Pruning/RPG and https://gitee.com/mindspore/models/tree/master/research/cv/RPG.
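
A minimal sketch of the rank-encouraging term in that adversarial objective, under assumptions not in the abstract (the function name and its use as a plain additive regularizer are illustrative; the paper alternates SVD-based minimization with pushing the weights away):

```python
import torch

def rank_regularizer(W, r=4):
    """Sketch: compute the best rank-r approximation of W via SVD and push W
    away from it (i.e., maximize the residual), encouraging sparse weights
    to keep a high-rank structure."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_low = (U[:, :r] * S[:r]) @ Vh[:r]       # best rank-r approximation
    return -(W - W_low).pow(2).sum()          # minimized => residual maximized
```

Added to the task loss during gradual pruning, such a term penalizes weight matrices that collapse toward low rank as sparsity increases.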

Aligning Optimization Trajectories with Diffusion Models for Constrained Design Generation
Giorgio Giannone Akash Srivastava Ole Winther Faez Ahmed



Research question: How to optimize the performance of generative models in constrained settings with limited data where high precision is required.
Motivation: Although generative models have had significant impact in the vision and language domains, in science and engineering, particularly in constrained settings with limited data where precision is crucial, traditional physics-based optimization methods often outperform them.
Method: Introduce Diffusion Optimization Models (DOM) and Trajectory Alignment (TA), a learning framework that aligns the sampling trajectory of diffusion models with the trajectory produced by physics-based iterative optimization methods, ensuring the sampling process remains grounded in the underlying physical principles.
Results: Experiments show that TA outperforms state-of-the-art deep generative models on in-distribution configurations and halves the inference computational cost. When coupled with a few steps of optimization, it also improves manufacturability for out-of-distribution conditions. DOM's efficiency and performance improvements significantly expedite the design process and steer it toward optimal and manufacturable outcomes, highlighting the potential of generative models in data-driven design.

Generative models have significantly influenced both vision and language domains, ushering in innovative multimodal applications. Although these achievements have motivated exploration in scientific and engineering fields, challenges emerge, particularly in constrained settings with limited data where precision is crucial. Traditional engineering optimization methods rooted in physics often surpass generative models in these contexts. To address these challenges, we introduce Diffusion Optimization Models (DOM) and Trajectory Alignment (TA), a learning framework that demonstrates the efficacy of aligning the sampling trajectory of diffusion models with the trajectory derived from physics-based iterative optimization methods. This alignment ensures that the sampling process remains grounded in the underlying physical principles. This alignment eliminates the need for costly preprocessing, external surrogate models, or extra labeled data, generating feasible and high-performance designs efficiently. We apply our framework to structural topology optimization, a fundamental problem in mechanical design, evaluating its performance on in- and out-of-distribution configurations. Our results demonstrate that TA outperforms state-of-the-art deep generative models on in-distribution configurations and halves the inference computational cost. When coupled with a few steps of optimization, it also improves manufacturability for out-of-distribution conditions. DOM's efficiency and performance improvements significantly expedite design processes and steer them toward optimal and manufacturable outcomes, highlighting the potential of generative models in data-driven design.

topic-4

Topic words :  neural,  networks,  graph,  network,  learning,  deep,  graphs,  structure

Spatial-frequency channels, shape bias, and adversarial robustness
Ajay Subramanian Elena Sizikova Najib J. Majaj Denis G. Pelli



Research question: What spatial frequency information do humans and neural networks use to recognize objects?
Motivation: Critical band masking, an established tool in neuroscience, can reveal the frequency-selective filters used for object recognition.
Method: Introduce critical band masking as a network-human comparison task, testing 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise.
Results: Humans recognize objects in natural images using the same one-octave-wide channel they use for letters and gratings, a canonical feature of human object recognition. The neural network channel is 2-4 times wider than the human channel, extending to frequencies above and below those humans are sensitive to; noise at those frequencies therefore impairs network performance while sparing human performance. Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (51% of variance explained) and with the robustness of adversarially trained networks (66% of variance explained). Adversarial training increases robustness but expands the channel bandwidth even further beyond the human bandwidth. Critical band masking thus reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only makes it worse; networks with narrower channels might be more robust.

What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel") that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. Unlike humans, the neural network channel is very broad, 2-4 times wider than the human channel. This means that the network channel extends to frequencies higher and lower than those that humans are sensitive to. Thus, noise at those frequencies will impair network performance and spare human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (51% variance explained) and robustness of adversarially-trained networks (66% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further beyond the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only makes it worse. Networks with narrower channels might be more robust.
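
To make the masking stimulus concrete, here is a minimal sketch of one-octave narrowband noise (standard FFT band-pass filtering; the function name, frequency units, and RMS convention are illustrative, not the paper's exact stimulus code): adding such noise at several center frequencies and measuring the accuracy drop traces out the observer's channel.

```python
import numpy as np

def octave_band_noise(shape, center_cpi, rms=0.1, rng=None):
    """White noise filtered to a one-octave band [c/sqrt(2), c*sqrt(2)]
    (in cycles per image) around center frequency c."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=shape)
    fy = np.fft.fftfreq(shape[0])[:, None] * shape[0]   # cycles per image
    fx = np.fft.fftfreq(shape[1])[None, :] * shape[1]
    radius = np.hypot(fy, fx)
    band = (radius >= center_cpi / np.sqrt(2)) & (radius <= center_cpi * np.sqrt(2))
    filtered = np.fft.ifft2(np.fft.fft2(noise) * band).real
    return filtered / (filtered.std() + 1e-12) * rms

# add the masker to a mid-gray image at a 16 cycles/image center frequency
masked = np.clip(0.5 + octave_band_noise((224, 224), center_cpi=16.0), 0, 1)
```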

Clifford Group Equivariant Neural Networks
David Ruhe Johannes Brandstetter Patrick Forré



Research question: This paper proposes Clifford Group Equivariant Neural Networks, a new approach for constructing O(n)- and E(n)-equivariant models.
Motivation: To address the limitations of conventional neural networks on geometric data, the proposed networks exploit the geometric product structure of multivectors.
Method: By studying the Clifford group and its action within the Clifford algebra, a new way of parameterizing equivariant neural network layers is proposed, which generalizes elegantly to inner-product spaces of any dimension.
Results: Experiments show state-of-the-art performance on several tasks, including a three-dimensional n-body problem, a four-dimensional Lorentz-equivariant high-energy physics experiment, and a five-dimensional convex hull experiment.

We introduce Clifford Group Equivariant Neural Networks: a novel approach for constructing $\mathrm{O}(n)$- and $\mathrm{E}(n)$-equivariant models. We identify and study the *Clifford group*: a subgroup inside the Clifford algebra tailored to achieve several favorable properties. Primarily, the group's action forms an orthogonal automorphism that extends beyond the typical vector space to the entire Clifford algebra while respecting the multivector grading. This leads to several non-equivalent subrepresentations corresponding to the multivector decomposition. Furthermore, we prove that the action respects not just the vector space structure of the Clifford algebra but also its multiplicative structure, i.e., the geometric product. These findings imply that every polynomial in multivectors, including their grade projections, constitutes an equivariant map with respect to the Clifford group, allowing us to parameterize equivariant neural network layers. An advantage worth mentioning is that we obtain expressive layers that can elegantly generalize to inner-product spaces of any dimension. We demonstrate, notably from a single core implementation, state-of-the-art performance on several distinct tasks, including a three-dimensional $n$-body experiment, a four-dimensional Lorentz-equivariant high-energy physics experiment, and a five-dimensional convex hull experiment.

Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity
Tianqin Li Ziqi Wen Yangfan Li Tai Sing Lee



Research question: Current deep learning models for object recognition are strongly biased toward texture, whereas the human visual system is biased toward shape and structure. How can deep learning models be designed to carry more shape bias?
Motivation: The gap between human vision and deep models motivates identifying design principles of the human visual system and introducing them into deep learning models.
Method: Sparse coding, a ubiquitous principle in the brain, can in itself introduce shape bias into a network. Enforcing the sparse coding constraint with a non-differentiable Top-K operation induces structural encoding in the neurons of convolutional neural networks, smoothly decomposing objects into parts and subparts and endowing the networks with shape bias.
Results: Experiments demonstrate the emergence of shape bias and its functional benefits across network structures and datasets. For object recognition CNNs, shape bias improves robustness against style and pattern-change distraction; for image-synthesis GANs, the emergent shape bias yields more coherent and decomposable structures in synthesized images. Ablations suggest that sparse codes tend to encode structure, whereas more distributed codes tend to favor texture.

Current deep-learning models for object recognition are known to be heavily biased toward texture. In contrast, human visual systems are known to be biased toward shape and structure. What could be the design principles in human visual systems that led to this difference? How could we introduce more shape bias into the deep learning models? In this paper, we report that sparse coding, a ubiquitous principle in the brain, can in itself introduce shape bias into the network. We found that enforcing the sparse coding constraint using a non-differential Top-K operation can lead to the emergence of structural encoding in neurons in convolutional neural networks, resulting in a smooth decomposition of objects into parts and subparts and endowing the networks with shape bias. We demonstrated this emergence of shape bias and its functional benefits for different network structures with various datasets. For object recognition convolutional neural networks, the shape bias leads to greater robustness against style and pattern change distraction. For the image synthesis generative adversary networks, the emerged shape bias leads to more coherent and decomposable structures in the synthesized images. Ablation studies suggest that sparse codes tend to encode structures, whereas the more distributed codes tend to favor texture. Our code is hosted at the GitHub repository: \url{https://github.com/Crazy-Jack/nips2023_shape_vs_texture}
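
A minimal sketch of such a Top-K constraint as a drop-in layer (the module name and the choice to apply it per spatial location across channels are assumptions; the paper's placement within the network may differ): only the k strongest channel activations survive at each location, and gradients flow only through the survivors.

```python
import torch
import torch.nn as nn

class TopKSparsity(nn.Module):
    """Keep only the k strongest channel activations at every spatial
    location and zero out the rest (a non-differentiable Top-K op)."""
    def __init__(self, k):
        super().__init__()
        self.k = k

    def forward(self, x):                                 # x: [B, C, H, W]
        thresh = x.topk(self.k, dim=1).values[:, -1:, :, :]  # k-th largest
        return x * (x >= thresh)

x = torch.randn(2, 64, 8, 8)
print((TopKSparsity(8)(x) != 0).sum(dim=1).float().mean())  # ~8 active channels
```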

Going beyond persistent homology using persistent homology
Johanna Emilia Immonen Amauri H Souza Vikas Garg



Research question: This paper addresses the representational limits of message-passing graph neural networks (MP-GNNs), in particular with respect to the Weisfeiler-Leman (WL) test for isomorphism.
Motivation: Augmenting these graph models with topological features via persistent homology (PH) has gained prominence, but identifying the class of attributed graphs that PH can recognize remains an open problem.
Method: A novel concept of color-separating sets provides a complete resolution to this important problem. Specifically, necessary and sufficient conditions are established for distinguishing graphs based on the persistence of their connected components, obtained from filter functions on vertex and edge colors.
Results: Leveraging these theoretical insights, RePHINE is proposed for learning topological features on graphs. RePHINE efficiently combines vertex- and edge-level PH into a scheme provably more powerful than both, and integrating it into MP-GNNs boosts their expressive power, yielding gains over standard PH on several graph classification benchmarks.

Representational limits of message-passing graph neural networks (MP-GNNs), e.g., in terms of the Weisfeiler-Leman (WL) test for isomorphism, are well understood. Augmenting these graph models with topological features via persistent homology (PH) has gained prominence, but identifying the class of attributed graphs that PH can recognize remains open. We introduce a novel concept of color-separating sets to provide a complete resolution to this important problem. Specifically, we establish the necessary and sufficient conditions for distinguishing graphs based on the persistence of their connected components, obtained from filter functions on vertex and edge colors. Our constructions expose the limits of vertex- and edge-level PH, proving that neither category subsumes the other. Leveraging these theoretical insights, we propose RePHINE for learning topological features on graphs. RePHINE efficiently combines vertex- and edge-level PH, achieving a scheme that is provably more powerful than both. Integrating RePHINE into MP-GNNs boosts their expressive power, resulting in gains over standard PH on several benchmarks for graph classification.

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization
Kaiyue Wen Zhiyuan Li Tengyu Ma



Research question: Despite extensive study, the underlying reason why overparameterized neural networks generalize remains elusive.
Motivation: Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, so a natural candidate explanation is that flatness implies generalization.
Method: Through theoretical and empirical investigation, three scenarios are identified for two-layer ReLU networks: (1) flatness provably implies generalization; (2) non-generalizing flattest models exist and sharpness minimization algorithms fail to generalize; and (3) most strikingly, non-generalizing flattest models exist, yet sharpness minimization algorithms still generalize.
Results: The relationship between sharpness and generalization subtly depends on the data distribution and the model architecture, and sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for other explanations of the generalization of overparameterized neural networks.

Despite extensive studies, the underlying reason as to why overparameterized neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus a natural potential explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models, and sharpness minimization algorithms fail to generalize; and (3) perhaps most strikingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization subtly depends on the data distributions and the model architectures, and that sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for the search for other explanations for the generalization of over-parameterized neural networks.

The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
Ziqian Zhong Ziming Liu Max Tegmark Jacob Andreas



Research question: Do neural networks trained on algorithmic tasks reliably rediscover known algorithms?
Motivation: Several recent studies suggest that neural networks can learn algorithms from a training set; for modular addition, however, the process of algorithm discovery turns out to be more complex.
Method: Using modular addition as a prototypical problem, model hyperparameters and initializations are varied to observe whether the networks learn different algorithms.
Results: Networks learn not only the known "Clock" algorithm but also a previously undescribed, less intuitive yet comprehensible "Pizza" algorithm, along with a variety of even more complex procedures. Even simple learning problems can admit a surprising diversity of solutions.

Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known algorithms? Several recent studies, on tasks ranging from group operations to in-context linear regression, have suggested that the answer is yes. Using modular addition as a prototypical problem, we show that algorithm discovery in neural networks is sometimes more complex: small changes to model hyperparameters and initializations can induce discovery of qualitatively different algorithms from a fixed training set, and even learning of multiple different solutions in parallel. In modular addition, we specifically show that models learn a known *Clock* algorithm, a previously undescribed, less intuitive, but comprehensible procedure we term the *Pizza* algorithm, and a variety of even more complex procedures. Our results show that even simple learning problems can admit a surprising diversity of solutions, motivating the development of new tools for mechanistically characterizing the behavior of neural networks across the algorithmic phase space.
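
The Clock algorithm is simple enough to write down directly; here is a stylized numpy version (an idealization of what trained networks implement in their embeddings, not extracted model code): residues are embedded as angles on a circle, angles add under multiplication of unit complex numbers, and the answer is read off by angle matching.

```python
import numpy as np

def clock_addition(a, b, p=59):
    """The "Clock" algorithm for a + b (mod p): embed each residue as an
    angle on the unit circle, add the angles, and read out the residue
    whose angle matches best."""
    angle = 2 * np.pi / p
    z = np.exp(1j * angle * a) * np.exp(1j * angle * b)        # angles add
    logits = np.real(np.exp(-1j * angle * np.arange(p)) * z)   # cos similarity
    return int(np.argmax(logits))                              # peak at (a+b) % p

assert clock_addition(25, 50) == (25 + 50) % 59
```

The Pizza algorithm and the more complex procedures found in the paper do not reduce to this rotation picture, which is precisely why distinguishing them mechanistically is interesting.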

Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
Guillermo Ortiz-Jimenez Alessandro Favero Pascal Frossard



Research question: The effectiveness of task arithmetic in vision-language models and its underlying principles.
Motivation: Task arithmetic is a cost-effective and scalable approach to editing pre-trained models directly in weight space, but its effectiveness and principles are not yet fully understood.
Method: By adding the fine-tuned weights of different tasks, the study examines the effectiveness of task arithmetic in vision-language models and finds that weight disentanglement is the crucial factor behind it.
Results: Fine-tuning models in their tangent space by linearizing them amplifies weight disentanglement, significantly improving performance on multiple task arithmetic benchmarks. A compelling link is also established between task arithmetic and the spatial localization of the eigenfunctions of the neural tangent kernel (NTK).

Task arithmetic has recently emerged as a cost-effective and scalable approach to edit pre-trained models directly in weight space: By adding the fine-tuned weights of different tasks, the model's performance can be improved on these tasks, while negating them leads to task forgetting. Yet, our understanding of the effectiveness of task arithmetic and its underlying principles remains limited. We present a comprehensive study of task arithmetic in vision-language models and show that weight disentanglement is the crucial factor that makes it effective. This property arises during pre-training and manifests when distinct directions in weight space govern separate, localized regions in function space associated with the tasks. Notably, we show that fine-tuning models in their tangent space by linearizing them amplifies weight disentanglement. This leads to substantial performance improvements across multiple task arithmetic benchmarks and diverse models. Building on these findings, we provide theoretical and empirical analyses of the neural tangent kernel (NTK) of these models and establish a compelling link between task arithmetic and the spatial localization of the NTK eigenfunctions. Overall, our work uncovers novel insights into the fundamental mechanisms of task arithmetic and offers a more reliable and effective approach to edit pre-trained models through the NTK linearization.
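
The weight-space operation itself is a one-liner over state dicts; a minimal sketch (plain task arithmetic only; the paper's improvement of fine-tuning the linearized model in its tangent space is not shown, and the function name is illustrative):

```python
import torch

def apply_task_vectors(pretrained, finetuned_list, alphas):
    """Task arithmetic in weight space: tau_t = theta_t - theta_pre, and the
    edited model is theta_pre + sum_t alpha_t * tau_t. Positive alpha adds a
    task; negative alpha "negates" (forgets) it. State dicts share keys."""
    edited = {k: v.clone() for k, v in pretrained.items()}
    for theta_t, alpha in zip(finetuned_list, alphas):
        for k in edited:
            edited[k] += alpha * (theta_t[k] - pretrained[k])
    return edited

# toy usage with two "tasks"
pre = {"w": torch.zeros(3)}
ft1 = {"w": torch.tensor([1.0, 0.0, 0.0])}
ft2 = {"w": torch.tensor([0.0, 1.0, 0.0])}
print(apply_task_vectors(pre, [ft1, ft2], alphas=[1.0, -0.5])["w"])
```

Weight disentanglement is what makes such sums behave additively in function space rather than interfering across tasks.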

Abide by the law and follow the flow: conservation laws for gradient flows
Sibylle Marcotte Rémi Gribonval Gabriel Peyré



Research question: Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models.
Motivation: Over-parameterized models retain some properties of the optimization initialization during training; this "implicit bias" is believed to be responsible for some favorable properties of the trained models and could explain their good generalization.
Method: "Conservation laws" are rigorously defined as quantities conserved during gradient flows of a given model (e.g., a ReLU network with a given architecture) for any training data and any loss, and the maximal number of independent conservation laws is found by performing finite-dimensional algebraic manipulations on the Lie algebra generated by the Jacobian of the model.
Results: Algorithms are provided to compute a family of polynomial laws and to compute the maximal number of (not necessarily polynomial) independent conservation laws. Applying both algorithms confirms, for several ReLU network architectures, that all known laws are recovered and that no other independent laws exist. Such computational tools pave the way to understanding desirable properties of optimization initialization in large machine learning models.

Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of the optimization initialization. This "implicit bias" is believed to be responsible for some favorable properties of the trained models and could explain their good generalization properties. The purpose of this article is threefold. First, we rigorously expose the definition and basic properties of "conservation laws", that define quantities conserved during gradient flows of a given model (e.g. of a ReLU network with a given architecture) with any training data and any loss. Then we explain how to find the maximal number of independent conservation laws by performing finite-dimensional algebraic manipulations on the Lie algebra generated by the Jacobian of the model. Finally, we provide algorithms to: a) compute a family of polynomial laws; b) compute the maximal number of (not necessarily polynomial) independent conservation laws. We provide showcase examples that we fully work out theoretically. Besides, applying the two algorithms confirms for a number of ReLU network architectures that all known laws are recovered by the algorithm, and that there are no other independent laws. Such computational tools pave the way to understanding desirable properties of optimization initialization in large machine learning models.
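
One such law can be checked numerically in a few lines. For a two-layer ReLU network $f(x) = \sum_j a_j\,\mathrm{relu}(w_j^\top x)$, the quantities $a_j^2 - \lVert w_j \rVert^2$ are conserved under gradient flow, since the ReLU's positive homogeneity gives $a_j\,\partial L/\partial a_j = w_j^\top\,\partial L/\partial w_j$. The sketch below (not the paper's algorithms, which operate symbolically on the Lie algebra) approximates the flow with tiny gradient steps:

```python
import torch

# f(x) = a^T relu(W x); under gradient flow, a_j^2 - ||w_j||^2 is conserved
# for every hidden unit j. We verify approximately with tiny-step descent.
torch.manual_seed(0)
W = torch.randn(4, 3, requires_grad=True)   # hidden weights w_j
a = torch.randn(4, requires_grad=True)      # output weights a_j
X, y = torch.randn(64, 3), torch.randn(64)

def conserved():
    return (a ** 2 - (W ** 2).sum(dim=1)).detach()

before = conserved()
for _ in range(2000):                        # Euler discretization of the flow
    loss = ((torch.relu(X @ W.T) @ a - y) ** 2).mean()
    gW, ga = torch.autograd.grad(loss, (W, a))
    with torch.no_grad():
        W -= 1e-4 * gW
        a -= 1e-4 * ga
print((conserved() - before).abs().max())    # ~0, up to discretization error
```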

Uncovering motifs of concurrent signaling across multiple neuronal populations
Evren Gokcen Anna Ivic Jasper Alison Xu Adam Kohn Christian K. Machens Byron M. Yu



Research question: How to characterize and understand the multi-dimensional, concurrent flow of signals among distinct neuronal populations in different brain networks.
Motivation: Modern recording techniques allow recording from distinct neuronal populations in different brain networks, but new conceptual and statistical frameworks are needed to characterize these multi-dimensional, concurrent signal flows.
Method: A dimensionality reduction framework is developed that determines (1) the subset of populations described by each latent dimension, (2) the direction of signal flow among those populations, and (3) how those signals evolve over time within and across experimental trials.
Results: The method is validated in simulation and by application to previously studied recordings from neuronal populations in macaque visual areas V1 and V2. Selective communication across V1, V2, and V3d is further studied and found to relate to their retinotopic alignment. This work advances the study of concurrent signaling across multiple neuronal populations.

Modern recording techniques now allow us to record from distinct neuronal populations in different brain networks. However, especially as we consider multiple (more than two) populations, new conceptual and statistical frameworks are needed to characterize the multi-dimensional, concurrent flow of signals among these populations. Here, we develop a dimensionality reduction framework that determines (1) the subset of populations described by each latent dimension, (2) the direction of signal flow among those populations, and (3) how those signals evolve over time within and across experimental trials. We illustrate these features in simulation, and further validate the method by applying it to previously studied recordings from neuronal populations in macaque visual areas V1 and V2. Then we study interactions across select laminar compartments of areas V1, V2, and V3d, recorded simultaneously with multiple Neuropixels probes. Our approach uncovered signatures of selective communication across these three areas that related to their retinotopic alignment. This work advances the study of concurrent signaling across multiple neuronal populations.

Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy Augustine N. Mavor-Parker Aengus Lynch Stefan Heimersheim Adrià Garriga-Alonso



Research question: How to systematize the process of understanding the complex behaviors of Transformer models.
Motivation: Prior work has reverse-engineered some nontrivial behaviors of Transformer models through considerable effort and intuition, but the process is manual and time-consuming.
Method: This paper automates one step of the process: finding the connections between the abstract neural network units that form a circuit. Several algorithms are proposed and validated by reproducing previous interpretability results.
Results: For example, the ACDC algorithm rediscovered 5/5 of the component types in the circuit in GPT-2 Small that computes the Greater-Than operation, and selected 68 of the 32,000 edges in GPT-2 Small, all of which had been found manually by previous work. Code: https://github.com/ArthurConmy/Automatic-Circuit-Discovery.

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery

Exploring Geometry of Blind Spots in Vision models
Sriram Balasubramanian Gaurang Sriramanan Vinu Sankar Sadasivan Soheil Feizi



Research question: Deep neural networks are highly sensitive to near-imperceptible perturbations (adversarial attacks); at the same time, deep networks have been observed to be insensitive to large-magnitude perturbations in input space.
Motivation: This work studies the under-sensitivity phenomenon in detail in vision models such as CNNs and Transformers, and presents techniques for studying the geometry and extent of the "equi-confidence" level sets of such networks.
Method: A Level Set Traversal algorithm is proposed that iteratively explores high-confidence regions of input space using orthogonal components of the local gradients. Given a source image, the algorithm identifies inputs that lie in the same equi-confidence level set as the source image despite being perceptually similar to arbitrary images from other classes.
Results: The source image is found to be linearly connected to these inputs by high-confidence paths, revealing a star-like structure for the level sets of deep networks. The extent of these connected higher-dimensional regions over which the model maintains high confidence is also identified and estimated.

Despite the remarkable success of deep neural networks in a myriad of settings, several works have demonstrated their overwhelming sensitivity to near-imperceptible perturbations, known as adversarial attacks. On the other hand, prior works have also observed that deep networks can be under-sensitive, wherein large-magnitude perturbations in input space do not induce appreciable changes to network activations. In this work, we study in detail the phenomenon of under-sensitivity in vision models such as CNNs and Transformers, and present techniques to study the geometry and extent of “equi-confidence” level sets of such networks. We propose a Level Set Traversal algorithm that iteratively explores regions of high confidence with respect to the input space using orthogonal components of the local gradients. Given a source image, we use this algorithm to identify inputs that lie in the same equi-confidence level set as the source image despite being perceptually similar to arbitrary images from other classes. We further observe that the source image is linearly connected by a high-confidence path to these inputs, uncovering a star-like structure for level sets of deep networks. Furthermore, we attempt to identify and estimate the extent of these connected higher-dimensional regions over which the model maintains a high degree of confidence.
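
A minimal sketch of a single traversal step, under stated assumptions (the function name, the step-size policy, and the [0, 1] image clamp are illustrative; the paper iterates this and monitors confidence along the way): move toward a target image along the component of the displacement orthogonal to the local confidence gradient, so confidence stays approximately fixed.

```python
import torch

def level_set_step(model, x, target, cls, step=0.5):
    """One step of level-set traversal: x, target are images [1, C, H, W];
    we move x toward target while staying on the class-`cls` confidence
    level set to first order."""
    x = x.clone().requires_grad_(True)
    conf = torch.softmax(model(x), dim=1)[0, cls]
    (g,) = torch.autograd.grad(conf, x)
    d = target - x.detach()                              # desired direction
    d_orth = d - (d * g).sum() / (g * g).sum().clamp_min(1e-12) * g
    return (x.detach()
            + step * d_orth / d_orth.norm().clamp_min(1e-12)).clamp(0, 1)
```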

Locality Sensitive Hashing in Fourier Frequency Domain For Soft Set Containment Search
Indradyumna Roy Rishi Agarwal Soumen Chakrabarti Anirban Dasgupta Abir De



Research question: In many search applications related to passage retrieval, text entailment, and subgraph search, the query and each "document" is a set of elements, and a document is relevant if it contains the query. These elements are represented not by atomic IDs but by embedded representations, extending set containment to soft set containment.
Motivation: Existing LSH methods are designed for mostly symmetric or a few simple asymmetric distance functions and are not suitable for the hinge distance; a new approach is therefore proposed.
Method: The hinge distance is transformed into a proposed dominance similarity measure, to which a Fourier transform is applied, expressing dominance similarity as an expectation of inner products of functions in the frequency domain. The expectation is then approximated with an importance-sampled estimate, after which a traditional LSH can be used, but in the frequency domain.
Results: Experiments show that the proposed asymmetric dominance similarity is critical to the target applications, and that the resulting LSH, called FourierHashNet, provides a better query-time vs. retrieval-quality trade-off than several baselines. Both the Fourier transform and the trainable hash codes contribute to the performance gains.

In many search applications related to passage retrieval, text entailment, and subgraph search, the query and each 'document' is a set of elements, with a document being relevant if it contains the query. These elements are not represented by atomic IDs, but by embedded representations, thereby extending set containment to *soft* set containment. Recent applications address soft set containment by encoding sets into fixed-size vectors and checking for elementwise *vector* *dominance*. This 0/1 property can be relaxed to an asymmetric *hinge* *distance* for scoring and ranking candidate documents. Here we focus on data-sensitive, trainable indices for fast retrieval of relevant documents. Existing LSH methods are designed for mostly symmetric or few simple asymmetric distance functions, which are not suitable for hinge distance. Instead, we transform hinge distance into a proposed *dominance* *similarity* measure, to which we then apply a Fourier transform, thereby expressing dominance similarity as an expectation of inner products of functions in the frequency domain. Next, we approximate the expectation with an importance-sampled estimate. The overall consequence is that now we can use a traditional LSH, but in the frequency domain. To ensure that the LSH uses hash bits efficiently, we learn hash functions that are sensitive to both corpus and query distributions, mapped to the frequency domain. Our experiments show that the proposed asymmetric dominance similarity is critical to the targeted applications, and that our LSH, which we call FourierHashNet, provides a better query time vs. retrieval quality trade-off, compared to several baselines. Both the Fourier transform and the trainable hash codes contribute to performance gains.
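
The asymmetric scoring function at the root of the pipeline is easy to state; a minimal sketch (the scoring function only; the dominance-similarity transform, Fourier step, and learned hashes that FourierHashNet builds on top of it are not shown):

```python
import torch

def hinge_distance(q, d):
    """Asymmetric hinge distance for soft set containment: zero iff the
    document vector dominates the query element-wise (d_k >= q_k for all k),
    growing with how far the query exceeds the document."""
    return torch.clamp(q - d, min=0).sum(dim=-1)

q = torch.tensor([0.9, 0.2, 0.5])
doc_contains = torch.tensor([1.0, 0.4, 0.6])   # dominates q -> distance 0
doc_misses = torch.tensor([0.3, 0.4, 0.6])     # fails on dimension 0
print(hinge_distance(q, doc_contains), hinge_distance(q, doc_misses))
```

Note the asymmetry: hinge_distance(q, d) and hinge_distance(d, q) generally differ, which is exactly what rules out standard symmetric LSH families.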

Learning Probabilistic Symmetrization for Architecture Agnostic Equivariance
Jinwoo Kim Dat Tien Nguyen Ayhan Suleymanzade Hyeokjun An Seunghoon Hong



Research question: This paper addresses the limitations of equivariant architectures in learning functions with group symmetries.
Motivation: Current equivariant architectures require specific backbone networks, whereas the new framework achieves equivariance to a given group with an arbitrary backbone.
Method: The framework uses an arbitrary backbone (such as an MLP or a transformer) together with a small equivariant network that parameterizes the probabilistic distribution underlying symmetrization. The distribution is trained end-to-end with the backbone to maximize performance while reducing the sample complexity of symmetrization.
Results: The approach guarantees not only equivariance to the given group but also universal approximation in expectation. Empirical tests across a wide range of symmetry groups, including permutation and Euclidean groups and their combinations, are competitive with tailored equivariant architectures, suggesting that equivariant functions for diverse groups can be learned with a non-equivariant universal backbone. Pretraining on non-symmetric modalities (like vision) is further shown to enhance learning in symmetric modalities (like graphs).

We present a novel framework to overcome the limitations of equivariant architectures in learning functions with group symmetries. In contrast to equivariant architectures, the framework uses an arbitrary backbone (such as an MLP or a transformer) and symmetrizes it to be equivariant to the given group by employing a small equivariant network that parameterizes the probabilistic distribution underlying the symmetrization. The distribution is end-to-end trained with the backbone which can maximize performance while reducing sample complexity of symmetrization. We show that this approach ensures not only equivariance to the given group but also universal approximation ability in expectation. We implement our method on a simple patch-based transformer backbone initialized from pretrained vision transformer, and test it for a wide range of symmetry groups including permutation and Euclidean groups and their combinations. Empirical tests show competitive results against tailored equivariant architectures, suggesting the potential for learning equivariant functions for diverse groups using a non-equivariant universal backbone. We further show evidence of enhanced learning in symmetric modalities, like graphs, when pretrained from non-symmetric modalities, like vision.
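
The group-averaging mechanics can be shown in a minimal sketch for the permutation group (sampling uniformly rather than from the paper's learned equivariant distribution, and with a deliberately toy backbone): averaging $g^{-1} \cdot f(g \cdot x)$ over sampled group elements yields equivariance in expectation for any backbone f.

```python
import torch

def symmetrize(backbone, x, n_samples=8):
    """Symmetrize an arbitrary backbone over the permutation group by
    averaging g^{-1} . f(g . x) over sampled permutations g. x is a set of
    n feature vectors; f maps sets to per-element outputs."""
    n = x.shape[0]
    outs = []
    for _ in range(n_samples):
        g = torch.randperm(n)
        g_inv = torch.argsort(g)
        outs.append(backbone(x[g])[g_inv])   # apply g^{-1} to f(g . x)
    return torch.stack(outs).mean(0)

mlp = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(5 * 4, 5 * 4))
backbone = lambda s: mlp(s).reshape(5, 4)    # deliberately non-equivariant
print(symmetrize(backbone, torch.randn(5, 4)).shape)
```

Learning the sampling distribution with a small equivariant network, as the paper does, keeps this unbiased while cutting the number of samples needed.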

Provable Training for Graph Contrastive Learning
Yue Yu Xiao Wang Mengmei Zhang Nian Liu Chuan Shi



Research question: Although graph contrastive learning (GCL) has become a popular label-free approach to learning node embeddings on augmented graphs, it faces basic questions on complex graph structures, e.g., can all nodes be trained to follow the principle of maximizing similarity between positive node pairs and minimizing it between negative pairs?
Motivation: Given the complexity of graphs, some nodes may be poorly trained across all graph augmentations and may even violate this principle; such nodes need to be identified to further guide GCL training.
Method: We first propose "node compactness", a metric that lower-bounds how well a node follows the GCL principle over the range of augmentations; we then derive its form theoretically via bound propagation and integrate it into the binary cross-entropy as a regularizer. On this basis we propose PrOvable Training (POT) for GCL, which regularizes training to better encode node embeddings that follow the GCL principle.
Results: Extensive experiments on various benchmarks show that POT significantly improves existing GCL methods, serving as an effective plugin.

Graph Contrastive Learning (GCL) has emerged as a popular training approach for learning node embeddings from augmented graphs without labels. While the key principle of maximizing the similarity between positive node pairs and minimizing it between negative node pairs is well established, some fundamental problems are still unclear. Considering the complex graph structure, are some nodes consistently well-trained and following this principle even with different graph augmentations? Or are there some nodes more likely to be untrained across graph augmentations and to violate the principle? How to distinguish these nodes and further guide the training of GCL? To answer these questions, we first present experimental evidence showing that the training of GCL is indeed imbalanced across all nodes. To address this problem, we propose the metric "node compactness", which is the lower bound of how a node follows the GCL principle related to the range of augmentations. We further derive the form of node compactness theoretically through bound propagation, which can be integrated into binary cross-entropy as a regularization. To this end, we propose the PrOvable Training (POT) for GCL, which regularizes the training of GCL to encode node embeddings that follow the GCL principle better. Through extensive experiments on various benchmarks, POT consistently improves the existing GCL approaches, serving as a friendly plugin.

Learning Layer-wise Equivariances Automatically using Gradients
Tycho F.A. van der Ouderaa Alexander Immer Mark van der Wilk



Research question: How can neural networks learn flexible symmetry constraints?
Motivation: Symmetries in current neural network models are fixed and cannot adapt to the data.
Method: Layer-wise equivariance structure is learned by optimizing a marginal-likelihood estimate, enabling layer-wise symmetry discovery in deep neural networks.
Results: On image classification tasks, the method automatically learns layer-wise equivariance structure and achieves performance on par with or better than hard-coded symmetries.

Convolutions encode equivariance symmetries into neural networks leading to better generalisation performance. However, symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and cannot be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from scratch is difficult for two reasons. First, it requires efficient and flexible parameterisations of layer-wise equivariances. Second, symmetries act as constraints and are therefore not encouraged by training losses measuring data fit. To overcome these challenges, we improve parameterisations of soft equivariance and learn the amount of equivariance in layers by optimising the marginal likelihood, estimated using differentiable Laplace approximations. The objective balances data fit and model complexity enabling layer-wise symmetry discovery in deep networks. We demonstrate the ability to automatically learn layer-wise equivariances on image classification tasks, achieving equivalent or improved performance over baselines with hard-coded symmetry.

Provably Bounding Neural Network Preimages
Suhas Kotha Christopher Brix J Zico Kolter Krishnamurthy Dj Dvijotham Huan Zhang



Research question: This paper addresses the inverse problem in neural network verification: finding the set of inputs that lead to a given output.
Motivation: Most verification work bounds the set of outputs corresponding to a given input set (e.g., bounded perturbations of a nominal input), yet many use cases require solving the inverse problem, i.e., over-approximating the set of inputs that lead to certain outputs.
Method: We propose the INVPROP algorithm for verifying properties over the preimage of a linearly constrained output set, which can be combined with branch-and-bound to increase precision; unlike other approaches, our efficient algorithm is GPU-accelerated and requires no linear programming solver.
Results: We demonstrate the method on multiple benchmarks, including a large model with 167k neurons in VNN-COMP 2023. In certain settings we find over-approximations 2500x tighter than prior work while being 2.5x faster, and by strengthening robustness verification with output constraints we consistently verify more properties than the previous state of the art.

Most work on the formal verification of neural networks has focused on bounding the set of outputs that correspond to a given set of inputs (for example, bounded perturbations of a nominal input). However, many use cases of neural network verification require solving the inverse problem, or over-approximating the set of inputs that lead to certain outputs. We present the INVPROP algorithm for verifying properties over the preimage of a linearly constrained output set, which can be combined with branch-and-bound to increase precision. Contrary to other approaches, our efficient algorithm is GPU-accelerated and does not require a linear programming solver. We demonstrate our algorithm for identifying safe control regions for a dynamical system via backward reachability analysis, verifying adversarial robustness, and detecting out-of-distribution inputs to a neural network. Our results show that in certain settings, we find over-approximations over $2500\times$ tighter than prior work while being $2.5\times$ faster. By strengthening robustness verification with output constraints, we consistently verify more properties than the previous state-of-the-art on multiple benchmarks, including a large model with 167k neurons in VNN-COMP 2023. Our algorithm has been incorporated into the $\alpha,\beta$-CROWN verifier, available at https://abcrown.org.

State Sequences Prediction via Fourier Transform for Representation Learning
Mingxuan Ye Yufei Kuang Jie Wang Rui Yang Wengang Zhou Houqiang Li Feng Wu



Research question: Deep reinforcement learning is effective for complex control tasks, but sample efficiency remains a key challenge.
Motivation: Existing work explores representation learning for data-efficient RL, but many methods underuse the structural information in state sequences, which could improve the quality of long-term decision-making yet is hard to discern in the time domain.
Method: We propose State Sequences Prediction via Fourier Transform (SPF), which exploits the frequency domain of state sequences to extract the underlying patterns in time-series data and learn expressive representations efficiently.
Results: Experiments show that the proposed method outperforms several state-of-the-art algorithms in both sample efficiency and performance.

While deep reinforcement learning (RL) has been demonstrated effective in solving complex control tasks, sample efficiency remains a key challenge due to the large amounts of data required for remarkable performance. Existing research explores the application of representation learning for data-efficient RL, e.g., learning predictive representations by predicting long-term future states. However, many existing methods do not fully exploit the structural information inherent in sequential state signals, which can potentially improve the quality of long-term decision-making but is difficult to discern in the time domain. To tackle this problem, we propose State Sequences Prediction via Fourier Transform (SPF), a novel method that exploits the frequency domain of state sequences to extract the underlying patterns in time series data for learning expressive representations efficiently. Specifically, we theoretically analyze the existence of structural information in state sequences, which is closely related to policy performance and signal regularity, and then propose to predict the Fourier transform of infinite-step future state sequences to extract such information. One of the appealing features of SPF is that it is simple to implement while not requiring storage of infinite-step future states as prediction targets. Experiments demonstrate that the proposed method outperforms several state-of-the-art algorithms in terms of both sample efficiency and performance.
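
One way to see why no infinite-step future states need to be stored is that a discounted Fourier feature of the future state sequence obeys a simple backward recursion. The sketch below is our hedged reading of that property, not the paper's implementation; the frequency grid and discount factor are arbitrary choices.

```python
import numpy as np

def discounted_fourier_features(states, gamma=0.99, freqs=(0.1, 0.5, 1.0)):
    """Discounted Fourier transform of the future state sequence:
    Phi_t(w) = sum_k gamma^k * exp(-i w k) * s_{t+k}.
    Computed backwards via the recursion
    Phi_t(w) = s_t + gamma * exp(-i w) * Phi_{t+1}(w),
    so infinite-step future states never need to be stored."""
    T, d = states.shape
    phi = np.zeros((T, len(freqs), d), dtype=complex)
    for j, w in enumerate(freqs):
        decay = gamma * np.exp(-1j * w)
        acc = np.zeros(d, dtype=complex)
        for t in reversed(range(T)):
            acc = states[t] + decay * acc
            phi[t, j] = acc
    return phi

states = np.random.default_rng(0).normal(size=(100, 4))  # one rollout
targets = discounted_fourier_features(states)             # prediction targets
print(targets.shape)  # (100, 3, 4)
```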

From Tempered to Benign Overfitting in ReLU Neural Networks
Guy Kornowski Gilad Yehudai Ohad Shamir



Research question: This study examines the generalization of overparameterized neural networks fit to noisy data and how input dimension, sample size, architecture, and training algorithm shape the type of overfitting.
Motivation: Overparameterized networks generalize well even when trained to perfectly fit noisy data, which has motivated work on "benign overfitting". It was recently conjectured and observed that network behavior is often better described as "tempered overfitting", with non-optimal yet non-trivial performance that degrades with the noise level, but a theoretical justification of this view for nonlinear networks has been lacking.
Method: Studying a simple classification setting with two-layer ReLU networks, we prove under various assumptions that the type of overfitting transitions from tempered for one-dimensional inputs to benign in high dimensions.
Results: The results reveal the intricate relationship between input dimension, sample size, architecture, and training algorithm on the one hand and the resulting type of overfitting on the other, and we validate the findings empirically for intermediate dimensions.

Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension has a crucial role on the overfitting profile in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.

Demystifying Oversmoothing in Attention-Based Graph Neural Networks
Xinyi Wu Amir Ajorlou Zihui Wu Ali Jadbabaie



Research question: This paper addresses oversmoothing in graph neural networks, i.e., node representations becoming homogeneous as network depth increases.
Motivation: Although prior work established that graph convolutional networks (GCNs) exponentially lose expressive power, it remained controversial whether the graph attention mechanism can mitigate oversmoothing.
Method: The paper gives a rigorous mathematical analysis by viewing attention-based GNNs as nonlinear time-varying dynamical systems and introducing tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius.
Results: Contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The framework extends existing oversmoothing results for symmetric GCNs to a much broader class of GNN models, including random walk GCNs, graph attention networks (GATs), and (graph) transformers.

Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations. While previous work has established that Graph Convolutional Networks (GCNs) exponentially lose expressive power, it remains controversial whether the graph attention mechanism can mitigate oversmoothing. In this work, we provide a definitive answer to this question through a rigorous mathematical analysis, by viewing attention-based GNNs as nonlinear time-varying dynamical systems and incorporating tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius. We establish that, contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The proposed framework extends the existing results on oversmoothing for symmetric GCNs to a significantly broader class of GNN models, including random walk GCNs, Graph Attention Networks (GATs) and (graph) transformers. In particular, our analysis accounts for asymmetric, state-dependent and time-varying aggregation operators and a wide range of common nonlinear activation functions, such as ReLU, LeakyReLU, GELU and SiLU.

What Planning Problems Can A Relational Neural Network Solve?
Jiayuan Mao Tomás Lozano-Pérez Joshua B. Tenenbaum Leslie Pack Kaelbling



Research question: This paper asks under what circumstances a goal-conditioned policy can be learned, and how efficient such a policy will be.
Motivation: Although goal-conditioned policies are generally understood as feed-forward circuits, i.e., neural networks mapping the current state and goal specification to the next action, their learnability and efficiency remain poorly understood.
Method: A circuit complexity analysis of relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, developed by drawing connections with serialized goal regression search (S-GRS).
Results: The paper exhibits three general classes of planning problems, in terms of how circuit width and depth grow with the number of objects and the planning horizon, with constructive proofs, and illustrates the utility of this analysis for designing neural networks for policy learning.

Goal-conditioned policies are generally understood to be "feed-forward" circuits, in the form of neural networks that map from the current state and the goal specification to the next action to take. However, under what circumstances such a policy can be learned and how efficient the policy will be are not well understood. In this paper, we present a circuit complexity analysis for relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, by drawing connections with serialized goal regression search (S-GRS). We show that there are three general classes of planning problems, in terms of the growth of circuit width and depth as a function of the number of objects and planning horizon, providing constructive proofs. We also illustrate the utility of this analysis for designing neural networks for policy learning.

Break It Down: Evidence for Structural Compositionality in Neural Networks
Michael A. Lepori Thomas Serre Ellie Pavlick



Research question: Modern neural networks achieve impressive performance on vision and language tasks, yet the functions they implement remain unclear.
Motivation: Networks may implement solutions to subtasks via modular subnetworks, or may simply learn to match new inputs to learned templates, eliding task decomposition entirely.
Method: Model pruning techniques are used to investigate this question in both vision and language, across a variety of architectures, tasks, and pretraining regimens.
Results: Models often implement solutions to subroutines via modular subnetworks that can be ablated while the functionality of other subnetworks is maintained, suggesting that neural networks may be able to learn compositionality without specialized symbolic mechanisms.

Though modern neural networks have achieved impressive performance in both vision and language tasks, we know little about the functions that they implement. One possibility is that neural networks implicitly break down complex tasks into subroutines, implement modular solutions to these subroutines, and compose them into an overall solution to a task --- a property we term structural compositionality. Another possibility is that they may simply learn to match new inputs to learned templates, eliding task decomposition entirely. Here, we leverage model pruning techniques to investigate this question in both vision and language across a variety of architectures, tasks, and pretraining regimens. Our results demonstrate that models oftentimes implement solutions to subroutines via modular subnetworks, which can be ablated while maintaining the functionality of other subnetworks. This suggests that neural networks may be able to learn compositionality, obviating the need for specialized symbolic mechanisms.

Distance-Restricted Folklore Weisfeiler-Leman GNNs with Provable Cycle Counting Power
Junru Zhou Jiarui Feng Xiyuan Wang Muhan Zhang



Research question: How to improve the ability of graph neural networks (GNNs) to count specific graph substructures, especially cycles, which underlies their success on a wide range of tasks.
Motivation: Many proposed GNN models with provable cycle counting power are based on subgraph GNNs, which require heavy preprocessing and incur high time and memory costs.
Method: A new class of GNNs, $d$-Distance-Restricted FWL(2) GNNs ($d$-DRFWL(2) GNNs), simplifies the FWL(2) algorithm by restricting message passing to node pairs whose mutual distance is at most $d$, reducing complexity while retaining expressive power.
Results: Experiments show that $d$-DRFWL(2) GNNs have strong cycle counting power even with $d=2$, counting all 3-, 4-, 5-, and 6-cycles. Experiments on synthetic and molecular datasets verify the theory, and the 2-DRFWL(2) GNN is the most efficient GNN model to date (both theoretically and empirically) that can count up to 6-cycles.

The ability of graph neural networks (GNNs) to count certain graph substructures, especially cycles, is important for the success of GNNs on a wide range of tasks. It has been recently used as a popular metric for evaluating the expressive power of GNNs. Many of the proposed GNN models with provable cycle counting power are based on subgraph GNNs, i.e., extracting a bag of subgraphs from the input graph, generating representations for each subgraph, and using them to augment the representation of the input graph. However, those methods require heavy preprocessing, and suffer from high time and memory costs. In this paper, we overcome the aforementioned limitations of subgraph GNNs by proposing a novel class of GNNs---$d$-Distance-Restricted FWL(2) GNNs, or $d$-DRFWL(2) GNNs, based on the well-known FWL(2) algorithm. As a heuristic method for graph isomorphism testing, FWL(2) colors all node pairs in a graph and performs message passing among those node pairs. In order to balance the expressive power and complexity, $d$-DRFWL(2) GNNs simplify FWL(2) by restricting the range of message passing to node pairs whose mutual distances are at most $d$. This way, $d$-DRFWL(2) GNNs exploit graph sparsity while avoiding the expensive subgraph extraction operations in subgraph GNNs, making both the time and space complexity lower. We theoretically investigate both the discriminative power and the cycle counting power of $d$-DRFWL(2) GNNs. Our most important finding is that $d$-DRFWL(2) GNNs have provably strong cycle counting power even with $d=2$: they can count all 3, 4, 5, 6-cycles. Since 6-cycles (e.g., benzene rings) are ubiquitous in organic molecules, being able to detect and count them is crucial for achieving robust and generalizable performance on molecular tasks. Experiments on both synthetic datasets and molecular datasets verify our theory. To the best of our knowledge, 2-DRFWL(2) GNN is the most efficient GNN model to date (both theoretically and empirically) that can count up to 6-cycles.
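
The distance restriction is easy to make concrete: a d-DRFWL(2) model maintains colors only for node pairs within mutual distance d, which can be enumerated by truncated BFS. A minimal sketch (the helper names are ours, not the paper's code):

```python
from collections import deque

def pairs_within_distance(adj, d):
    """Enumerate node pairs (u, v) with shortest-path distance <= d.
    d-DRFWL(2) colors exactly these pairs and passes messages among them,
    instead of coloring all O(n^2) pairs as FWL(2) does."""
    pairs = {}
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            if dist[u] == d:          # truncate the BFS at depth d
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for v, k in dist.items():
            pairs[(src, v)] = k
    return pairs

# 6-cycle: with d = 2, only 30 of the 36 ordered pairs are kept.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(len(pairs_within_distance(cycle6, 2)))
```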

A Spectral Theory of Neural Prediction and Alignment
Abdulkadir Canatar Jenelle Feather Albert Wakhloo SueYeon Chung



Research question: How to differentiate among deep neural networks that perform similarly, and understand how models capture neural activity.
Motivation: Many state-of-the-art deep networks yield similar neural predictions, yet it remains unclear how to distinguish equally performing models and interpret how they capture neural activity.
Method: A recent theoretical framework relating the generalization error of regression to the spectral properties of the model and the target is applied to regression between model activations and neural responses; the neural prediction error is decomposed and geometric measures are introduced to interpret it.
Results: Testing a large number of deep networks that predict visual cortical activity reveals multiple types of geometry that lead to low neural prediction error. The work shows that carefully decomposing representational metrics provides interpretability of how models capture neural activity and points toward improved models of neural activity.

The representations of neural networks are often compared to those of biological systems by performing regression between the neural network responses and those measured from biological systems. Many different state-of-the-art deep neural networks yield similar neural predictions, but it remains unclear how to differentiate among models that perform equally well at predicting neural responses. To gain insight into this, we use a recent theoretical framework that relates the generalization error from regression to the spectral properties of the model and the target. We apply this theory to the case of regression between model activations and neural responses and decompose the neural prediction error in terms of the model eigenspectra, alignment of model eigenvectors and neural responses, and the training set size. Using this decomposition, we introduce geometrical measures to interpret the neural prediction error. We test a large number of deep neural networks that predict visual cortical activity and show that there are multiple types of geometries that result in low neural prediction error as measured via regression. The work demonstrates that carefully decomposing representational metrics can provide interpretability of how models are capturing neural activity and points the way towards improved models of neural activity.
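
A hedged sketch of the spectral view on synthetic data: the eigenspectrum of the model activations and the alignment of the neural responses with the model eigenvectors are the quantities the decomposition is built from. The variable names and the simple alignment definition below are illustrative simplifications, not the paper's exact measures.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # model activations (stimuli x units)
Y = rng.normal(size=(500, 20))    # neural responses  (stimuli x neurons)

# Spectral view of regressing neural responses on model activations:
# generalization error depends on the model eigenspectrum and on how the
# responses align with each eigenvector.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # descending order
alignment = (eigvecs.T @ (Xc.T @ Yc / len(Xc))) ** 2   # per-mode alignment
# Modes with large eigenvalue *and* large alignment are the ones ridge
# regression can exploit with few training samples.
print(eigvals[:5], alignment.sum(1)[:5])
```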

Expressive Sign Equivariant Networks for Spectral Geometric Learning
Derek Lim Joshua Robinson Stefanie Jegelka Haggai Maron



Research question: This paper examines the uses and limits of sign invariance in machine learning models and develops novel sign-equivariant neural network architectures.
Motivation: Although existing work has shown the utility of models that respect the structure and symmetries of eigenvectors, the authors find that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs.
Method: By developing a new analytic characterization of sign-equivariant polynomials, the authors build novel sign-equivariant neural network architectures based on this characterization.
Results: Controlled synthetic experiments show that these networks achieve the theoretically predicted benefits of sign-equivariant models.

Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models.
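
For intuition, one simple sign-equivariant map multiplies the input by a network of its sign-invariant magnitudes; this toy layer satisfies f(-v) = -f(v) exactly, though the paper's architectures are built from a richer polynomial characterization. All names below are ours.

```python
import torch, torch.nn as nn

class SignEquivariantLayer(nn.Module):
    """Toy sign-equivariant map f(v) = v * h(|v|): h acts only on the
    sign-invariant magnitudes, so flipping signs of v flips f(v)."""
    def __init__(self, d: int):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))

    def forward(self, v):            # v: (..., d), e.g., an eigenvector
        return v * self.h(v.abs())

layer = SignEquivariantLayer(8)
v = torch.randn(4, 8)
assert torch.allclose(layer(-v), -layer(v), atol=1e-6)  # sign equivariance
```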

Deep Neural Collapse Is Provably Optimal for the Deep Unconstrained Features Model
Peter Súkeník Marco Mondelli Christoph H Lampert



Research question: This paper studies deep neural collapse (DNC), the multi-layer counterpart of neural collapse arising during deep network training, particularly in non-linear layers.
Motivation: While the theory of neural collapse in the last layer is relatively mature, the multi-layer phenomenon has rarely been studied; existing work either considers only linear layers or only the last two layers, and requires extra assumptions.
Method: The established analytical framework for neural collapse, the unconstrained features model, is generalized to multiple non-linear layers, and the global optimum of the deep unconstrained features model is proven to exhibit the properties typical of DNC.
Results: Experiments show that (i) optimizing deep unconstrained features models via gradient descent yields solutions that agree with the theory, and (ii) trained networks recover unconstrained features suitable for the occurrence of DNC, supporting the validity of this modeling principle.

Neural collapse (NC) refers to the surprising structure of the last layer of deep neural networks in the terminal phase of gradient descent training. Recently, an increasing amount of experimental evidence has pointed to the propagation of NC to earlier layers of neural networks. However, while the NC in the last layer is well studied theoretically, much less is known about its multi-layered counterpart - deep neural collapse (DNC). In particular, existing work focuses either on linear layers or only on the last two layers at the price of an extra assumption. Our work fills this gap by generalizing the established analytical framework for NC - the unconstrained features model - to multiple non-linear layers. Our key technical contribution is to show that, in a deep unconstrained features model, the unique global optimum for binary classification exhibits all the properties typical of DNC. This explains the existing experimental evidence of DNC. We also empirically show that (i) by optimizing deep unconstrained features models via gradient descent, the resulting solution agrees well with our theory, and (ii) trained networks recover the unconstrained features suitable for the occurrence of DNC, thus supporting the validity of this modeling principle.

Deep Fractional Fourier Transform
Hu Yu Jie Huang Lingzhi Li Man Zhou Feng Zhao



Research question: Existing deep-learning computer vision methods typically operate in the spatial and frequency domains, two orthogonal, separate perspectives for image processing.
Motivation: This paper introduces a new spatial-frequency analysis tool, the Fractional Fourier Transform (FRFT), to provide a comprehensive, unified spatial-frequency perspective.
Method: The FRFT is a unified continuous spatial-frequency transform that simultaneously reflects an image's spatial and frequency representations, making it well suited to non-stationary image signals. The paper explores the properties of the FRFT for image processing, presents a fast implementation of the 2D FRFT to facilitate its widespread use, and, building on these explorations, introduces a simple yet effective operator, the Multi-order FRactional Fourier Convolution (MFRFC), which offers the notable merit of processing images from more perspectives in the spatial-frequency plane.
Results: MFRFC is evaluated on various computer vision tasks, including object detection, image classification, guided super-resolution, denoising, dehazing, deraining, and low-light enhancement, and outperforms baseline methods by large margins on all tasks.

Existing deep learning-based computer vision methods usually operate in the spatial and frequency domains, which are two orthogonal \textbf{individual} perspectives for image processing. In this paper, we introduce a new spatial-frequency analysis tool, Fractional Fourier Transform (FRFT), to provide comprehensive \textbf{unified} spatial-frequency perspectives. The FRFT is a unified continuous spatial-frequency transform that simultaneously reflects an image's spatial and frequency representations, making it optimal for processing non-stationary image signals. We explore the properties of the FRFT for image processing and present a fast implementation of the 2D FRFT, which facilitates its widespread use. Based on these explorations, we introduce a simple yet effective operator, Multi-order FRactional Fourier Convolution (MFRFC), which exhibits the remarkable merits of processing images from more perspectives in the spatial-frequency plane. Our proposed MFRFC is a general and basic operator that can be easily integrated into various tasks for performance improvement. We experimentally evaluate the MFRFC on various computer vision tasks, including object detection, image classification, guided super-resolution, denoising, dehazing, deraining, and low-light enhancement. Our proposed MFRFC consistently outperforms baseline methods by significant margins across all tasks.

The Exact Sample Complexity Gain from Invariances for Kernel Regression
Behrooz Tahmasebi Stefanie Jegelka



Research question: This work studies, from a theoretical perspective, the phenomenon that encoding invariances into models improves sample complexity.
Motivation: To improve the generalization ability and efficiency of models, it is necessary to understand how encoding invariances reduces sample complexity.
Method: The paper derives minimax optimal rates for kernel ridge regression on compact manifolds when the target function is invariant to a group action on the manifold.
Results: The results show that for finite groups the gain effectively multiplies the number of samples by the group size; for groups of positive dimension, the gain appears as a reduction of the manifold's dimension, in addition to a factor proportional to the volume of the quotient space. This new geometric viewpoint on learning with invariances may be of independent interest.

In practice, encoding invariances into models improves sample complexity. In this work, we study this phenomenon from a theoretical perspective. In particular, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a group action on the manifold. Our results hold for any smooth compact Lie group action, even groups of positive dimension. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain is observed by a reduction in the manifold's dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials. This new geometric viewpoint on learning with invariances may be of independent interest.
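
For a finite group, the invariant kernel can be written as a group average of a base kernel, which makes the "samples times group size" gain plausible: regression with the averaged kernel behaves like regression with every training point replicated under all group elements. A minimal sketch with cyclic shifts as the group (our example, not the paper's code):

```python
import numpy as np

def rbf(x, y, bw=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))

def invariant_kernel(x, y, group_actions, base=rbf):
    """Group-averaged kernel K_G(x, y) = (1/|G|) sum_g K(x, g.y), which is
    invariant to the group action in either argument."""
    return np.mean([base(x, g(y)) for g in group_actions])

# Finite group example: the 6 cyclic shifts of a length-6 signal.
shifts = [lambda v, k=k: np.roll(v, k) for k in range(6)]
x, y = np.random.default_rng(0).normal(size=(2, 6))
print(invariant_kernel(x, y, shifts), rbf(x, y))
```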

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks
Mingze Wang Chao Ma



Research question: ReLU neural networks often exhibit complicated nonlinear phenomena during training, posing major challenges for theoretical analysis.
Motivation: Most prior theoretical work focuses either on local analysis (e.g., the end of training) or on approximately linear models (e.g., the Neural Tangent Kernel).
Method: This work gives a complete theoretical characterization of the entire training process of a two-layer ReLU network trained with gradient flow on linearly separable data.
Results: Despite the relative simplicity of the model and data, four distinct phases are revealed across the whole optimization process from random initialization to final convergence, showing a simplifying-to-complicating learning trend. Specific nonlinear behaviors, such as initial condensation, saddle-to-plateau dynamics, plateau escape, changes of activation patterns, and learning with increasing complexity, are precisely identified and captured theoretically.

The training process of ReLU neural networks often exhibits complicated nonlinear phenomena. The nonlinearity of models and non-convexity of loss pose significant challenges for theoretical analysis. Therefore, most previous theoretical works on the optimization dynamics of neural networks focus either on local analysis (like the end of training) or approximate linear models (like Neural Tangent Kernel). In this work, we conduct a complete theoretical characterization of the training process of a two-layer ReLU network trained by Gradient Flow on linearly separable data. In this specific setting, our analysis captures the whole optimization process starting from random initialization to final convergence. Despite the relatively simple model and data that we studied, we reveal four different phases from the whole training process showing a general simplifying-to-complicating learning trend. Specific nonlinear behaviors can also be precisely identified and captured theoretically, such as initial condensation, saddle-to-plateau dynamics, plateau escape, changes of activation patterns, learning with increasing complexity, etc.

Equivariant Neural Operator Learning with Graphon Convolution
Chaoran Cheng Jian Peng



Research question: This paper proposes a general architecture that combines a coefficient learning scheme with a residual operator layer for learning mappings between continuous functions in 3D Euclidean space.
Motivation: Current models cannot effectively capture geometric information while preserving equivariance, so a new approach is needed.
Method: By combining the continuous graphon structure with the discrete graph structure of the input data, a new model called InfGCN is proposed that effectively captures geometric information while preserving equivariance.
Results: Extensive experiments on large-scale electron density datasets show that the model significantly outperforms the current state-of-the-art architectures.

We propose a general architecture that combines the coefficient learning scheme with a residual operator layer for learning mappings between continuous functions in the 3D Euclidean space. Our proposed model is guaranteed to achieve SE(3)-equivariance by design. From the graph spectrum view, our method can be interpreted as convolution on graphons (dense graphs with infinitely many nodes), which we term InfGCN. By leveraging both the continuous graphon structure and the discrete graph structure of the input data, our model can effectively capture the geometric information while preserving equivariance. Through extensive experiments on large-scale electron density datasets, we observed that our model significantly outperformed the current state-of-the-art architectures. Multiple ablation studies were also carried out to demonstrate the effectiveness of the proposed architecture.

Critical Initialization of Wide and Deep Neural Networks using Partial Jacobians: General Theory and Applications
Darshil Doshi Tianyu He Andrey Gromov



Research question: How to treat deep neural networks theoretically, particularly as the number of parameters per layer tends to infinity.
Motivation: Although deep neural networks resist theoretical treatment, when the number of parameters in each layer tends to infinity the network function is a Gaussian process (GP), making a quantitatively predictive description possible.
Method: A new practical way to diagnose criticality in deep networks is proposed: computing partial Jacobians and analyzing recurrence relations for their norms. A simple and cheap numerical test is also developed for selecting optimal initializations for a broad class of deep networks.
Results: With these tools, the paper shows quantitatively that proper stacking of LayerNorm (applied to preactivations) and residual connections yields an architecture that is critical for any initialization. Finally, the methods are applied to analyze ResNet and MLP-Mixer architectures, demonstrating the everywhere-critical regime.

Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce *partial Jacobians* of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select optimal initialization for a broad class of deep neural networks, including fully connected, convolutional and normalization layers. Using these tools we show quantitatively that proper stacking of the LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze ResNet and MLP-Mixer architectures; demonstrating the everywhere-critical regime.
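
A hedged sketch of measuring partial Jacobian norms with autograd on a toy tanh MLP (default initialization; not the paper's code): at a critical initialization, the normalized squared Frobenius norm of the partial Jacobian should stay O(1) with depth, growing or decaying exponentially otherwise.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
width, depth = 64, 6
layers = [nn.Linear(width, width) for _ in range(depth)]  # default init
phi = torch.tanh

def preact(h0, stop):
    """Map the preactivation at layer 0 to the preactivation at layer `stop`."""
    h = h0
    for layer in layers[:stop]:
        h = layer(phi(h))
    return h

# Partial Jacobian J^{0,l} = d(preactivations at layer l) / d(preactivations
# at layer 0); its normalized squared norm diagnoses (non-)criticality.
h0 = torch.randn(width)
for l in range(1, depth + 1):
    J = torch.autograd.functional.jacobian(lambda h: preact(h, l), h0)
    print(l, (J ** 2).sum().item() / width)
```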

Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks
Eshaan Nichani Alex Damian Jason D. Lee



Research question: A central question in deep learning theory is to understand how neural networks learn hierarchical features.
Motivation: The ability of deep networks to extract salient features is crucial to their outstanding generalization and to the modern pretraining and fine-tuning paradigm, yet this feature learning process remains poorly understood theoretically, with existing analyses largely restricted to two-layer networks.
Method: This work shows that three-layer networks have provably richer feature learning capabilities than two-layer networks. The features learned by a three-layer network trained with layer-wise gradient descent are analyzed, and a general-purpose theorem is given that upper-bounds the sample complexity and width needed to achieve low test error when the target has a specific hierarchical structure.
Results: The framework is instantiated in specific statistical learning settings, single-index models and functions of quadratic features; in the latter, three-layer networks obtain a sample-complexity improvement over all existing two-layer guarantees, crucially relying on their ability to efficiently learn nonlinear features. A concrete optimization-based depth separation is then established by constructing a function that is efficiently learnable via gradient descent on a three-layer network but cannot be learned efficiently by a two-layer network, advancing the understanding of the provable advantage of three-layer networks in the feature learning regime.

One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and fine-tuning. However, this feature learning process remains poorly understood from a theoretical perspective, with existing analyses largely restricted to two-layer networks. In this work we show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks. We analyze the features learned by a three-layer network trained with layer-wise gradient descent, and present a general purpose theorem which upper bounds the sample complexity and width needed to achieve low test error when the target has specific hierarchical structure. We instantiate our framework in specific statistical learning settings -- single-index models and functions of quadratic features -- and show that in the latter setting three-layer networks obtain a sample complexity improvement over all existing guarantees for two-layer networks. Crucially, this sample complexity improvement relies on the ability of three-layer networks to efficiently learn *nonlinear* features. We then establish a concrete optimization-based depth separation by constructing a function which is efficiently learnable via gradient descent on a three-layer network, yet cannot be learned efficiently by a two-layer network. Our work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.

Balancing memorization and generalization in RNNs for high performance brain-machine Interfaces
Joseph T Costello Hisham Temmar Luis H Cubillos Matthew J Mender Dylan M Wallace Matthew S Willsey Parag G Patil Cynthia Chestek



Research question: Brain-machine interfaces (BMIs) can restore motor function to people with paralysis but are currently limited by the accuracy of real-time decoding algorithms.
Motivation: Recurrent neural networks (RNNs) with modern training techniques have shown promise for accurately predicting movements from neural signals, but had not been rigorously evaluated against other decoding algorithms in a closed-loop setting.
Method: Using intracortical signals from nonhuman primates, RNNs were compared with other neural network architectures for real-time, continuous movement decoding.
Results: Across one- and two-finger online tasks, LSTMs (a type of RNN) outperformed convolutional and transformer-based networks, averaging 18% higher throughput than the convolutional network. On simplified tasks, RNN decoders were allowed to memorize movement patterns and matched able-bodied control; performance dropped gradually as the number of distinct movements increased, but never fell below fully continuous decoder performance. Finally, in a two-finger task in which one degree of freedom had poor input signals, functional control was recovered using RNNs trained to act as both a movement classifier and a continuous decoder. The results suggest that RNNs can enable functional real-time BMI control by learning and generating accurate movement patterns.

Brain-machine interfaces (BMIs) can restore motor function to people with paralysis but are currently limited by the accuracy of real-time decoding algorithms. Recurrent neural networks (RNNs) using modern training techniques have shown promise in accurately predicting movements from neural signals but have yet to be rigorously evaluated against other decoding algorithms in a closed-loop setting. Here we compared RNNs to other neural network architectures in real-time, continuous decoding of finger movements using intracortical signals from nonhuman primates. Across one and two finger online tasks, LSTMs (a type of RNN) outperformed convolutional and transformer-based neural networks, averaging 18% higher throughput than the convolution network. On simplified tasks with a reduced movement set, RNN decoders were allowed to memorize movement patterns and matched able-bodied control. Performance gradually dropped as the number of distinct movements increased but did not go below fully continuous decoder performance. Finally, in a two-finger task where one degree-of-freedom had poor input signals, we recovered functional control using RNNs trained to act both like a movement classifier and continuous decoder. Our results suggest that RNNs can enable functional real-time BMI control by learning and generating accurate movement patterns.

Structure-free Graph Condensation: From Large-scale Graphs to Condensed Graph-free Data
Xin Zheng Miao Zhang Chunyang Chen Quoc Viet Hung Nguyen Xingquan Zhu Shirui Pan



Research question: Existing graph condensation methods suffer from critical issues in effectiveness and generalization ability.
Motivation: Graph condensation reduces the size of a large-scale graph, with immediate benefits for various graph learning tasks.
Method: This paper proposes a new structure-free graph condensation paradigm, SFGC, which distills a large-scale graph into a small-scale, structure-free node set, i.e., graph-free data.
Results: Via training-trajectory meta-matching and a graph neural feature score metric, SFGC shows superior performance across different condensation ratios.

Graph condensation, which reduces the size of a large-scale graph by synthesizing a small-scale condensed graph as its substitution, has immediate benefits for various graph learning tasks. However, existing graph condensation methods rely on the joint optimization of nodes and structures in the condensed graph, and overlook critical issues in effectiveness and generalization ability. In this paper, we advocate a new Structure-Free Graph Condensation paradigm, named SFGC, to distill a large-scale graph into a small-scale graph node set without explicit graph structures, i.e., graph-free data. Our idea is to implicitly encode topology structure information into the node attributes in the synthesized graph-free data, whose topology is reduced to an identity matrix. Specifically, SFGC contains two collaborative components: (1) a training trajectory meta-matching scheme for effectively synthesizing small-scale graph-free data; (2) a graph neural feature score metric for dynamically evaluating the quality of the condensed data. Through training trajectory meta-matching, SFGC aligns the long-term GNN learning behaviors between the large-scale graph and the condensed small-scale graph-free data, ensuring comprehensive and compact transfer of informative knowledge to the graph-free data. Afterward, the underlying condensed graph-free data would be dynamically evaluated with the graph neural feature score, which is a closed-form metric for ensuring the excellent expressiveness of the condensed graph-free data. Extensive experiments verify the superiority of SFGC across different condensation ratios.

Quasi-Monte Carlo Graph Random Features
Isaac Reid Adrian Weller Krzysztof Marcin Choromanski



Research question: How to improve the accuracy of graph random features (GRFs).
Motivation: The recently introduced class of graph random features (GRFs) needs improved accuracy.
Method: We propose a new mechanism that induces negative correlations between the lengths of the algorithm's random walks by imposing antithetic termination, thereby sampling more diverse random walks.
Results: Our experiments show accuracy improvements on a variety of tasks, including a new practical application: time-efficient approximation of the graph diffusion process.

We present a novel mechanism to improve the accuracy of the recently-introduced class of graph random features (GRFs). Our method induces negative correlations between the lengths of the algorithm's random walks by imposing antithetic termination: a procedure to sample more diverse random walks which may be of independent interest. It has a trivial drop-in implementation. We derive strong theoretical guarantees on the properties of these quasi-Monte Carlo GRFs (q-GRFs), proving that they yield lower-variance estimators of the $2$-regularised Laplacian kernel under mild conditions. Remarkably, our results hold for any graph topology. We demonstrate empirical accuracy improvements on a variety of tasks including a new practical application: time-efficient approximation of the graph diffusion process. To our knowledge, q-GRFs constitute the first rigorously studied quasi-Monte Carlo scheme for kernels defined on combinatorial objects, inviting new research on correlations between graph random walks.
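
The negative length correlation from antithetic termination can be illustrated in isolation: if one walk terminates using a uniform draw u and its partner reuses 1 - u, long walks pair with short ones. A toy sketch (the paper couples whole ensembles of walkers; this shows a single antithetic pair per draw):

```python
import numpy as np

def walk_lengths_antithetic(p_halt=0.2, n_pairs=100_000, seed=0):
    """Sample random-walk termination times geometrically (halt w.p. p_halt
    at each step) in antithetic pairs via the inverse CDF, so the two
    lengths in a pair are negatively correlated."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n_pairs).clip(1e-12, 1 - 1e-12)
    inv_cdf = lambda q: np.floor(np.log1p(-q) / np.log1p(-p_halt)).astype(int)
    return inv_cdf(u), inv_cdf(1 - u)

l1, l2 = walk_lengths_antithetic()
print(np.corrcoef(l1, l2)[0, 1])  # distinctly negative
```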

Deep Reinforcement Learning with Plasticity Injection
Evgenii Nikishin Junhyuk Oh Georg Ostrovski Clare Lyle Razvan Pascanu Will Dabney Andre Barreto



Research question: Neural networks in deep reinforcement learning gradually lose plasticity, yet analyzing and mitigating this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance.
Motivation: Propose a minimal intervention, plasticity injection, that increases network plasticity without changing the number of trainable parameters or biasing predictions.
Method: Fresh capacity is injected into the network during training; this serves as a diagnostic tool for identifying environments where performance plateaus stem from lost plasticity, and as a way to improve the computational efficiency of RL training.
Results: Experiments on Atari show that plasticity injection attains stronger performance than alternative methods while being computationally efficient.

A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance in RL. This paper introduces plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool — if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent has to re-learn from scratch due to exhausted plasticity or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient.
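
A minimal sketch of our reading of the injection mechanism: freeze the current head, add a fresh trainable copy h' and a frozen clone h'_0 of it, and predict h(x) + h'(x) - h'_0(x), so predictions are unchanged at injection time and gradients flow only through the fresh parameters. The module structure and names are ours.

```python
import copy
import torch, torch.nn as nn

def inject_plasticity(head: nn.Module) -> nn.Module:
    """Plasticity injection sketch: output = frozen(x) + fresh(x) - clone(x).
    `fresh` and `clone` start as identical copies of `head`, so the output
    is unchanged at injection; only `fresh` receives gradients afterwards."""
    fresh = copy.deepcopy(head)          # trainable continuation
    clone = copy.deepcopy(head)          # cancels fresh's initial output
    for p in list(head.parameters()) + list(clone.parameters()):
        p.requires_grad_(False)

    class Injected(nn.Module):
        def __init__(self):
            super().__init__()
            self.frozen, self.fresh, self.clone = head, fresh, clone
        def forward(self, x):
            return self.frozen(x) + self.fresh(x) - self.clone(x)
    return Injected()

head = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(2, 16)
before = head(x)
injected = inject_plasticity(head)
assert torch.allclose(injected(x), before, atol=1e-6)  # prediction unchanged
```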

HyTrel: Hypergraph-enhanced Tabular Data Representation Learning
Pei Chen Soumajyoti Sarkar Leonard Lausen Balasubramaniam Srinivasan Sheng Zha Ruihong Huang George Karypis



Research question: Existing pretrained language models for large-scale tabular data do not account for properties such as row/column permutation invariance and hierarchical structure.
Motivation: To address these issues, HyTrel is proposed, a tabular language model that uses hypergraphs to capture the permutation invariances and three other structural properties of tabular data.
Method: Table cells serve as nodes, and the cells that co-occur in each row, each column, and the entire table form three different types of hyperedges, capturing the structural properties of tabular data.
Results: Experiments show that HyTrel outperforms competitive baselines on downstream tasks and maintains a consistent advantage even with minimal pretraining, demonstrating the benefit of incorporating inductive biases associated with tabular data into the representations.

Language models pretrained on large collections of tabular data have demonstrated their effectiveness in several downstream tasks. However, many of these models do not take into account the row/column permutation invariances, hierarchical structure, etc. that exist in tabular data. To alleviate these limitations, we propose HyTrel, a tabular language model, that captures the permutation invariances and three more structural properties of tabular data by using hypergraphs--where the table cells make up the nodes and the cells occurring jointly together in each row, column, and the entire table are used to form three different types of hyperedges. We show that HyTrel is maximally invariant under certain conditions for tabular data, i.e., two tables obtain the same representations via HyTrel iff the two tables are identical up to permutation. Our empirical results demonstrate that HyTrel consistently outperforms other competitive baselines on four downstream tasks with minimal pretraining, illustrating the advantages of incorporating inductive biases associated with tabular data into the representations. Finally, our qualitative analyses showcase that HyTrel can assimilate the table structure to generate robust representations for the cells, rows, columns, and the entire table.
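
The hypergraph construction is simple to sketch: each cell is a node, and three hyperedge families group cells by row, by column, and by the whole table, so row/column permutations merely relabel hyperedges without changing the hypergraph up to isomorphism. A toy version (our helper, not the released code):

```python
def table_to_hyperedges(table):
    """Cells are nodes; three hyperedge types group the cells of each row,
    of each column, and of the whole table, following the HyTrel setup."""
    n_rows, n_cols = len(table), len(table[0])
    cell = lambda r, c: r * n_cols + c  # node id for cell (r, c)
    hyperedges = {}
    for r in range(n_rows):
        hyperedges[f"row:{r}"] = [cell(r, c) for c in range(n_cols)]
    for c in range(n_cols):
        hyperedges[f"col:{c}"] = [cell(r, c) for r in range(n_rows)]
    hyperedges["table"] = [cell(r, c) for r in range(n_rows)
                           for c in range(n_cols)]
    return hyperedges

toy = [["city", "pop"], ["Paris", "2.1M"], ["Rome", "2.8M"]]
for name, nodes in table_to_hyperedges(toy).items():
    print(name, nodes)
```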

Adaptive whitening with fast gain modulation and slow synaptic plasticity
Lyndon Duong Eero P Simoncelli Dmitri Chklovskii David Lipshutz



Research question: This study addresses the limitations of existing models in explaining how neurons rapidly adapt to changing sensory statistics.
Motivation: Existing models rely exclusively on either synaptic plasticity or gain modulation to explain adaptive whitening, and each on its own has significant limitations.
Method: A normative multi-timescale mechanistic model is proposed that adaptively whitens responses with complementary roles for synaptic plasticity and gain modulation: gains adapt on a fast timescale to the current statistical context, while synapses adapt on a slow timescale to match structural properties of the input statistics that are invariant across contexts.
Results: On synthetic and natural datasets, the synapses learn optimal configurations over long timescales that enable adaptive whitening on short timescales via gain modulation.

Neurons in early sensory areas rapidly adapt to changing sensory statistics, both by normalizing the variance of their individual responses and by reducing correlations between their responses. Together, these transformations may be viewed as an adaptive form of statistical whitening. Existing mechanistic models of adaptive whitening exclusively use either synaptic plasticity or gain modulation as the biological substrate for adaptation; however, on their own, each of these models has significant limitations. In this work, we unify these approaches in a normative multi-timescale mechanistic model that adaptively whitens its responses with complementary computational roles for synaptic plasticity and gain modulation. Gains are modified on a fast timescale to adapt to the current statistical context, whereas synapses are modified on a slow timescale to match structural properties of the input statistics that are invariant across contexts. Our model is derived from a novel multi-timescale whitening objective that factorizes the inverse whitening matrix into basis vectors, which correspond to synaptic weights, and a diagonal matrix, which corresponds to neuronal gains. We test our model on synthetic and natural datasets and find that the synapses learn optimal configurations over long timescales that enable adaptive whitening on short timescales using gain modulation.

Pareto Frontiers in Deep Feature Learning: Data, Compute, Width, and Luck
Benjamin L. Edelman Surbhi Goel Sham M. Kakade eran malach Cyril Zhang



Research question: This paper investigates how algorithmic choices in modern deep learning (such as width, depth, and learning rate) modulate nuanced resource tradeoffs in feature learning.
Motivation: Understanding how these complexities necessarily arise for feature learning requires studying settings with computational-statistical gaps.
Method: The paper considers offline sparse parity learning, a supervised classification problem with a statistical query lower bound for gradient-based training of multilayer perceptrons, and shows theoretically and experimentally that sparse initialization and increased network width substantially improve sample efficiency.
Results: Width plays the role of parallel search: it amplifies the probability of finding "lottery ticket" neurons that learn sparse features more sample-efficiently. Finally, real-world problems requiring axis-aligned feature learning can use the synthetic sparse parity task as a proxy: wide, sparsely-initialized MLPs improve sample efficiency on tabular classification benchmarks and sometimes outperform tuned random forests.

In modern deep learning, algorithmic choices (such as width, depth, and learning rate) are known to modulate nuanced resource tradeoffs. This work investigates how these complexities necessarily arise for feature learning in the presence of computational-statistical gaps. We begin by considering offline sparse parity learning, a supervised classification problem which admits a statistical query lower bound for gradient-based training of a multilayer perceptron. This lower bound can be interpreted as a *multi-resource tradeoff frontier*: successful learning can only occur if one is sufficiently rich (large model), knowledgeable (large dataset), patient (many training iterations), or lucky (many random guesses). We show, theoretically and experimentally, that sparse initialization and increasing network width yield significant improvements in sample efficiency in this setting. Here, width plays the role of parallel search: it amplifies the probability of finding "lottery ticket" neurons, which learn sparse features more sample-efficiently. Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning. We demonstrate improved sample efficiency on tabular classification benchmarks by using wide, sparsely-initialized MLP models; these networks sometimes outperform tuned random forests.
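
The offline sparse parity task itself is easy to reproduce, which is part of its appeal as a proxy benchmark; a minimal generator (the hyperparameters below are arbitrary choices):

```python
import numpy as np

def sparse_parity_dataset(n_samples, n_bits=50, k=3, seed=0):
    """Offline sparse parity: the label is the XOR of a hidden subset of
    k out of n_bits coordinates; all other coordinates are distractors."""
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)  # hidden subset
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    y = X[:, support].sum(axis=1) % 2
    return X.astype(np.float32), y, support

X, y, support = sparse_parity_dataset(1000)
print(support, X.shape, y[:10])
# Width as parallel search: a wide, sparsely-initialized MLP gives many
# independent chances for some neuron's few nonzero input weights to land
# on the hidden support -- the "lottery ticket" the abstract describes.
```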

Extraction and Recovery of Spatio-Temporal Structure in Latent Dynamics Alignment with Diffusion Model
Yule Wang Zijing Wu Chengrui Li Anqi Wu



Research question: In behavior-related brain computation, raw neural signals must be aligned to resolve the drastic domain shift among them.
Motivation: A foundational framework in neuroscience posits that trial-based neural population activity relies on low-dimensional latent dynamics, yet existing methods ignore the intrinsic spatio-temporal structure during the alignment phase, yielding poor latent dynamics structures and overall performance.
Method: We propose an alignment method, ERDiff, that leverages the expressivity of diffusion models to preserve the spatio-temporal structure of latent dynamics: the latent dynamics structure of the source domain is first extracted by a diffusion model, and then, under the guidance of this diffusion model, the structure is recovered in the target domain via a maximum-likelihood alignment procedure.
Results: Effectiveness is first demonstrated on a synthetic dataset; applied to neural recordings from the non-human primate motor cortex, under both cross-day and inter-subject settings, the method consistently preserves the spatio-temporal structure of latent dynamics and outperforms existing approaches in alignment goodness-of-fit and neural decoding performance.

In the field of behavior-related brain computation, it is necessary to align raw neural signals against the drastic domain shift among them. A foundational framework within neuroscience research posits that trial-based neural population activities rely on low-dimensional latent dynamics, thus focusing on the latter greatly facilitates the alignment procedure. Despite this field's progress, existing methods ignore the intrinsic spatio-temporal structure during the alignment phase. Hence, their solutions usually lead to poor quality in latent dynamics structures and overall performance. To tackle this problem, we propose an alignment method ERDiff, which leverages the expressivity of the diffusion model to preserve the spatio-temporal structure of latent dynamics. Specifically, the latent dynamics structures of the source domain are first extracted by a diffusion model. Then, under the guidance of this diffusion model, such structures are well-recovered through a maximum likelihood alignment procedure in the target domain. We first demonstrate the effectiveness of our proposed method on a synthetic dataset. Then, when applied to neural recordings from the non-human primate motor cortex, under both cross-day and inter-subject settings, our method consistently manifests its capability of preserving the spatio-temporal structure of latent dynamics and outperforms existing approaches in alignment goodness-of-fit and neural decoding performance.

Slow and Weak Attractor Computation Embedded in Fast and Strong E-I Balanced Neural Dynamics
Xiaohan Lin Liyuan Li Boxin Shi Tiejun Huang Yuanyuan Mi Si Wu



Research question: How attractor networks and excitation-inhibition balanced networks (E-INNs) coexist in the brain, and what structural requirements this imposes, remain unclear.
Motivation: Attractor networks and E-INNs are two canonical models of neural circuits, but they are usually studied in isolation, and their mode of coexistence and structural requirements have not been clarified.
Method: Through simulations and theoretical analysis, the study finds that a neural circuit can exhibit the traits of both attractor networks and E-INNs if the neuronal synapses consist of two sets: one strong and fast for irregular firing, and one weak and slow for attractor dynamics.
Results: Compared with using only one set of synapses, the network shows enhanced performance, converging faster to attractor states while retaining the E-I balanced condition for localized input; the network model is also successfully applied to a real-world tracking problem and tracks fast-moving objects well.

Attractor networks require neuronal connections to be highly structured in order to maintain attractor states that represent information, while excitation and inhibition balanced networks (E-INNs) require neuronal connections to be random and sparse to generate irregular neuronal firings. Despite being regarded as canonical models of neural circuits, both types of networks are usually studied in isolation, and it remains unclear how they coexist in the brain, given their very different structural demands. In this study, we investigate the compatibility of continuous attractor neural networks (CANNs) and E-INNs. In line with recent experimental data, we find that a neural circuit can exhibit both the traits of CANNs and E-INNs if the neuronal synapses consist of two sets: one set is strong and fast for irregular firing, and the other set is weak and slow for attractor dynamics. Our results from simulations and theoretical analysis reveal that the network also exhibits enhanced performance compared to the case of using only one set of synapses, with accelerated convergence of attractor states and retained E-I balanced condition for localized input. We also apply the network model to solve a real-world tracking problem and demonstrate that it can track fast-moving objects well. We hope that this study provides insight into how structured neural computations are realized by irregular firings of neurons.

Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training
Yefan Zhou Tianyu Pang Keqin Liu charles h martin Michael W. Mahoney Yaoqing Yang



Research question: How to effectively tune the per-layer learning rates in neural network training.
Motivation: The learning rate plays a key role in neural network training, yet current training strategies mostly just define how the learning rate decays over time.
Method: This paper proposes TempBalance, a simple yet effective layer-wise learning rate method based on Heavy-Tailed Self-Regularization (HT-SR) theory, which uses HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during training.
Results: On CIFAR10, CIFAR100, SVHN, and TinyImageNet with ResNets, VGGs, and WideResNets of various depths and widths, TempBalance significantly outperforms ordinary SGD and carefully tuned spectral norm regularization, as well as several state-of-the-art optimizers and learning rate schedulers.

Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
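
One plausible instantiation of the idea, hedged heavily since the paper's exact metric and schedule may differ: estimate a power-law exponent alpha of each layer's empirical spectral density and scale per-layer learning rates so that layer "temperatures" are balanced. The Hill estimator and the scaling rule below are our illustrative choices.

```python
import numpy as np

def hill_alpha(weight, k_frac=0.25, eps=1e-12):
    """Power-law exponent of a layer's empirical spectral density (ESD),
    estimated from the top eigenvalues of W^T W with a Hill estimator.
    Under HT-SR theory, heavier-tailed (smaller alpha) layers are more
    implicitly self-regularized."""
    eig = np.linalg.eigvalsh(weight.T @ weight)
    eig = np.sort(eig[eig > eps])[::-1]
    k = max(2, int(len(eig) * k_frac))
    tail = eig[:k]
    return 1.0 + k / (np.sum(np.log(tail / tail[-1])) + eps)

def temp_balanced_lrs(weights, base_lr=0.1):
    """Scale each layer's LR by its alpha relative to the mean, so that
    heavier-tailed layers get a lower 'temperature' (a sketch of the idea,
    not the paper's exact schedule)."""
    alphas = np.array([hill_alpha(w) for w in weights])
    return base_lr * alphas / alphas.mean()

rng = np.random.default_rng(0)
layers = [rng.normal(size=(128, 128)) for _ in range(4)]
print(temp_balanced_lrs(layers))
```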

Attentive Transfer Entropy to Exploit Transient Emergence of Coupling Effect
Xiaolei Ru Xin-Ya Zhang Zijia Liu Jack Murdoch Moore Gang Yan



Research question: Reconstructing coupled networks that connect large numbers of variables (e.g., nerve cells), whose state evolution is governed by dissipative dynamics consisting of a strong self-drive and a weak coupling-drive.
Motivation: The core difficulty is the sparseness of the coupling effect: the coupling force appears only momentarily in the time series and otherwise remains quiescent.
Method: Borrowing the idea of the attention mechanism, a classifier is guided to focus its inference on the critical regions of the time series where coupling effects may manifest; attention coefficients are assigned autonomously by artificial neural networks trained to maximize the Attentive Transfer Entropy (ATEn).
Results: Without any prior knowledge of the dynamics, ATEn explicitly identifies regions where the strength of the coupling-drive is distinctly greater than zero. This substantially improves reconstruction performance for both synthetic and real directed coupling networks, using data generated by neuron models widely used in neuroscience.

We consider the problem of reconstructing coupled networks (e.g., biological neural networks) connecting large numbers of variables (e.g., nerve cells), whose state evolution is governed by dissipative dynamics consisting of a strong self-drive (which dominates the evolution) and a weak coupling-drive. The core difficulty is the sparseness of the coupling effect, which emerges (i.e., the coupling force is significant) only momentarily and otherwise remains quiescent in the time series (e.g., a neuronal activity sequence). Here we borrow the idea of the attention mechanism to guide the classifier to make inferences focusing on the critical regions of time series data where the coupling effect may manifest. Specifically, attention coefficients are assigned autonomously by artificial neural networks trained to maximise the Attentive Transfer Entropy (ATEn), which is a novel generalization of the iconic transfer entropy metric. Our results show that, without any prior knowledge of dynamics, ATEn explicitly identifies areas where the strength of the coupling-drive is distinctly greater than zero. This innovation substantially improves reconstruction performance for both synthetic and real directed coupling networks using data generated by neuronal models widely used in neuroscience.

Prefix-Tree Decoding for Predicting Mass Spectra from Molecules
Samuel Goldman John Bradshaw Jiayi Xin Connor W. Coley



Research question: Current tools for predicting mass spectra from molecules are limited: they either fragment molecules combinatorially with overly rigid constraints, or decode lossy, discretized spectrum vectors.
Motivation: To address these problems, this paper proposes a new intermediate strategy that treats a mass spectrum as a set of molecular formulae, predicting the spectrum by encoding the input molecular graph and decoding a set of molecular subformulae.
Method: The input molecular graph is first encoded, then a set of molecular subformulae is decoded, each specifying a predicted peak in the mass spectrum whose intensity is predicted by a second model. The key innovation is to decode the formula set atom type by atom type using a prefix tree structure, overcoming the combinatorial space of molecular subformulae.
Results: Experiments show promising results on mass spectra prediction tasks.

Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites. However, such predictive tools are still limited as they occupy one of two extremes, either operating (a) by fragmenting molecules combinatorially with overly rigid constraints on potential rearrangements and poor time complexity or (b) by decoding lossy and nonphysical discretized spectra vectors. In this work, we use a new intermediate strategy for predicting mass spectra from molecules by treating mass spectra as sets of molecular formulae, which are themselves multisets of atoms. After first encoding an input molecular graph, we decode a set of molecular subformulae, each of which specify a predicted peak in the mass spectrum, the intensities of which are predicted by a second model. Our key insight is to overcome the combinatorial possibilities for molecular subformulae by decoding the formula set using a prefix tree structure, atom-type by atom-type, representing a general method for ordered multiset decoding. We show promising empirical results on mass spectra prediction tasks.

What Makes Data Suitable for a Locally Connected Neural Network? A Necessary and Sufficient Condition Based on Quantum Entanglement.
Yotam Alexander Nimrod De La Vega Noam Razin Nadav Cohen



Research question: How to determine whether a data distribution is suitable for deep learning?
Motivation: Focusing on locally connected neural networks (including convolutional and recurrent networks and local self-attention models), we address this question with theoretical tools from quantum physics.
Method: Our main theoretical result states that a certain locally connected neural network can make accurate predictions over a data distribution if and only if the distribution admits low quantum entanglement under certain canonical partitions of features. As a practical application of this result, we derive a preprocessing method for enhancing the suitability of a data distribution to locally connected networks.
Results: Experiments with widespread models over various datasets demonstrate our findings. We hope that the use of quantum entanglement will encourage further adoption of tools from physics for formally reasoning about the relation between deep learning and real-world data.

The question of what makes a data distribution suitable for deep learning is a fundamental open problem. Focusing on locally connected neural networks (a prevalent family of architectures that includes convolutional and recurrent neural networks as well as local self-attention models), we address this problem by adopting theoretical tools from quantum physics. Our main theoretical result states that a certain locally connected neural network is capable of accurate prediction over a data distribution if and only if the data distribution admits low quantum entanglement under certain canonical partitions of features. As a practical application of this result, we derive a preprocessing method for enhancing the suitability of a data distribution to locally connected neural networks. Experiments with widespread models over various datasets demonstrate our findings. We hope that our use of quantum entanglement will encourage further adoption of tools from physics for formally reasoning about the relation between deep learning and real-world data.

Dynamic Tensor Decomposition via Neural Diffusion-Reaction Processes
Zheng Wang Shikai Fang Shibo Li Shandian Zhe



Research question: How to better exploit sparse yet temporally rich tensor data for multiway data analysis.
Motivation: Existing methods often underuse the time information and ignore the structural knowledge in the sparsely observed tensor entries; to overcome these limitations and better capture the underlying temporal structure, we propose Dynamic EMbedIngs fOr dynamic Tensor dEcomposition (DEMOTE).
Method: We develop a neural diffusion-reaction process to estimate dynamic embeddings for the entities in each tensor mode: based on the observed tensor entries, a multi-partite graph encodes the correlations between entities, and a neural network constructs a reaction process for each individual entity.
Results: We show the advantage of the approach in both simulation studies and real-world applications. The code is available at https://github.com/wzhut/Dynamic-Tensor-Decomposition-via-Neural-Diffusion-Reaction-Processes.

Tensor decomposition is an important tool for multiway data analysis. In practice, the data is often sparse yet associated with rich temporal information. Existing methods, however, often under-use the time information and ignore the structural knowledge within the sparsely observed tensor entries. To overcome these limitations and to better capture the underlying temporal structure, we propose Dynamic EMbedIngs fOr dynamic Tensor dEcomposition (DEMOTE). We develop a neural diffusion-reaction process to estimate dynamic embeddings for the entities in each tensor mode. Specifically, based on the observed tensor entries, we build a multi-partite graph to encode the correlation between the entities. We construct a graph diffusion process to co-evolve the embedding trajectories of the correlated entities and use a neural network to construct a reaction process for each individual entity. In this way, our model can capture both the commonalities and personalities during the evolution of the embeddings for different entities. We then use a neural network to model the entry value as a nonlinear function of the embedding trajectories. For model estimation, we combine ODE solvers to develop a stochastic mini-batch learning algorithm. We propose a stratified sampling method to balance the cost of processing each mini-batch so as to improve the overall efficiency. We show the advantage of our approach in both simulation studies and real-world applications. The code is available at https://github.com/wzhut/Dynamic-Tensor-Decomposition-via-Neural-Diffusion-Reaction-Processes.

Adversarial Training from Mean Field Perspective
Soichiro Kumano Hiroshi Kera Toshihiko Yamasaki



Research question: Although adversarial training is known to be effective against adversarial examples, its training dynamics are not fully understood.
Motivation: This study presents the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions.
Method: We propose a new theoretical framework based on mean field theory that addresses the limitations of existing mean-field-based approaches; based on this framework, we derive (empirically tight) upper bounds of the $\ell_q$-norm-based adversarial loss with $\ell_p$-norm-based adversarial examples for various values of $p$ and $q$.
Results: We prove that networks without shortcuts are generally not adversarially trainable and that adversarial training reduces network capacity; we also find that network width alleviates these issues. In addition, we present the various impacts of the input and output dimensions on the upper bounds and on the time evolution of the weight variance.

Although adversarial training is known to be effective against adversarial examples, training dynamics are not well understood. In this study, we present the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions. We introduce a new theoretical framework based on mean field theory, which addresses the limitations of existing mean field-based approaches. Based on the framework, we derive the (empirically tight) upper bounds of $\ell_q$ norm-based adversarial loss with $\ell_p$ norm-based adversarial examples for various values of $p$ and $q$. Moreover, we prove that networks without shortcuts are generally not adversarially trainable and that adversarial training reduces network capacity. We also show that the network width alleviates these issues. Furthermore, the various impacts of input and output dimensions on the upper bounds and time evolution of weight variance are presented.

Epistemic Neural Networks
Ian Osband Zheng Wen Seyed Mohammad Asghari Vikranth Dwaracherla Morteza Ibrahimi Xiuyuan Lu Benjamin Van Roy



Research question: How to assess and improve an agent's knowledge of what it does not know, via the quality of joint predictions of labels across multiple inputs.
Motivation: Ensemble-based approaches can in principle produce effective joint predictions, but the computational costs of large ensembles become prohibitive.
Method: The epinet, an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty; the epistemic neural network (ENN) is introduced as a general interface for models that produce joint predictions.
Results: With an epinet, conventional neural networks outperform very large ensembles consisting of hundreds or more particles, at orders of magnitude less computation.

Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches can produce effective joint predictions, but the computational costs of large ensembles become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as a general interface for models that produce joint predictions.
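
A minimal sketch of the epinet idea under stated assumptions (the module name `Epinet`, the sizes, and the detaching of base features are our illustrative choices, not the paper's code): a small index-conditioned MLP rides on top of a conventional network's features, and varying the random index z yields the joint predictions whose spread expresses epistemic uncertainty.

    import torch
    import torch.nn as nn

    class Epinet(nn.Module):
        """Index-conditioned head that supplements a conventional network."""
        def __init__(self, feat_dim, index_dim, n_classes):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(feat_dim + index_dim, 32),
                                     nn.ReLU(), nn.Linear(32, n_classes))
        def forward(self, features, z):            # z is the sampled epistemic index
            return self.mlp(torch.cat([features, z], dim=-1))

    base = nn.Sequential(nn.Linear(10, 16), nn.ReLU())   # stands in for a pretrained net
    head = nn.Linear(16, 3)
    epi = Epinet(16, index_dim=4, n_classes=3)

    x = torch.randn(5, 10)
    feats = base(x).detach()                   # epinet trains on (detached) base features
    zs = torch.randn(8, 4)                     # 8 index samples -> 8 coherent predictions
    joint = torch.stack([head(feats) + epi(feats, z.expand(5, -1)) for z in zs])
    print(joint.shape)                         # (8, 5, 3); spread over z conveys uncertainty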

Banana: Banach Fixed-Point Network for Pointcloud Segmentation with Inter-Part Equivariance
Congyue Deng Jiahui Lei Bokui Shen Kostas Daniilidis Leonidas Guibas



Research question: How to effectively capture inter-part transformations in complex systems such as articulated objects or multi-object scenes.
Motivation: Capturing inter-part transformations in complex systems is challenging because they are entangled with the overall structure and local transformations.
Method: We propose Banana, a novel network for equivariant segmentation with inter-part equivariance by construction. Our key insight is to iteratively solve a fixed-point problem in which point-part assignment labels and per-part SE(3) equivariance co-evolve simultaneously.
Results: Experiments show that our method achieves strong generalization under inter-part transformations, even in the face of substantial changes in pointcloud geometry and topology.

Equivariance has gained strong interest as a desirable network property that inherently ensures robust generalization. However, when dealing with complex systems such as articulated objects or multi-object scenes, effectively capturing inter-part transformations poses a challenge, as it becomes entangled with the overall structure and local transformations. The interdependence of part assignment and per-part group action necessitates a novel equivariance formulation that allows for their co-evolution. In this paper, we present Banana, a Banach fixed-point network for equivariant segmentation with inter-part equivariance by construction. Our key insight is to iteratively solve a fixed-point problem, where point-part assignment labels and per-part SE(3)-equivariance co-evolve simultaneously. We provide theoretical derivations of both per-step equivariance and global convergence, which induces an equivariant final convergent state. Our formulation naturally provides a strict definition of inter-part equivariance that generalizes to unseen inter-part configurations. Through experiments conducted on both articulated objects and multi-object scans, we demonstrate the efficacy of our approach in achieving strong generalization under inter-part transformations, even when confronted with substantial changes in pointcloud geometry and topology.

Exploring Loss Functions for Time-based Training Strategy in Spiking Neural Networks
Yaoyu Zhu Wei Fang Xiaodong Xie Tiejun Huang Zhaofei Yu



Research question: How to better exploit temporal information and train spiking neural networks (SNNs) in an asynchronous fashion to improve their performance.
Motivation: SNNs are brain-inspired, energy-efficient models whose event-driven computing paradigm holds promise. The spatiotemporal spike patterns that convey information in SNNs comprise both rate coding and temporal coding, where temporal coding is crucial to biologically plausible learning rules such as spike-timing-dependent plasticity.
Method: Adopt a time-based training strategy to better utilize temporal information in SNNs and learn asynchronously; map rate-based loss functions to time-based counterparts and explain why they are also applicable to the time-based training scheme; and propose the enhanced counting loss to replace the commonly used mean square counting loss.
Results: Experiments show that the approach outperforms previous time-based training methods on most datasets. This work provides insights for training SNNs with time-based schemes and offers a fresh perspective on the correlation between rate coding and temporal coding.

Spiking Neural Networks (SNNs) are considered promising brain-inspired energy-efficient models due to their event-driven computing paradigm. The spatiotemporal spike patterns used to convey information in SNNs consist of both rate coding and temporal coding, where the temporal coding is crucial to biological-plausible learning rules such as spike-timing-dependent-plasticity. The time-based training strategy is proposed to better utilize the temporal information in SNNs and learn in an asynchronous fashion. However, some recent works train SNNs by the time-based scheme with rate-coding-dominated loss functions. In this paper, we first map rate-based loss functions to time-based counterparts and explain why they are also applicable to the time-based training scheme. After that, we infer that loss functions providing adequate positive overall gradients help training by theoretical analysis. Based on this, we propose the enhanced counting loss to replace the commonly used mean square counting loss. In addition, we transfer the training of scale factor in weight standardization into thresholds. Experiments show that our approach outperforms previous time-based training methods in most datasets. Our work provides insights for training SNNs with time-based schemes and offers a fresh perspective on the correlation between rate coding and temporal coding. Our code is available at https://github.com/zhuyaoyu/SNN-temporal-training-losses.

Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean Field Neural Networks
Blake Bordelon Cengiz Pehlevan



Research question: This paper analyzes the dynamics of finite-width effects in wide but finite feature-learning neural networks.
Motivation: Starting from a dynamical mean field theory (DMFT) description of the infinite-width kernel and prediction dynamics of deep neural networks, characterize the $\mathcal{O}(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights.
Method: The analysis, perturbative in width but non-perturbative in the strength of feature learning, covers the lazy limit of network training, two-layer networks, deeper networks, and convolutional neural networks, and shows how feature learning can dynamically reduce the variance of the final tangent kernel and the final network predictions.
Results: For CNNs trained on CIFAR-10, both the bias and the variance of the network dynamics receive significant corrections due to finite width.

We analyze the dynamics of finite width effects in wide but finite feature learning neural networks. Starting from a dynamical mean field theory description of infinite width deep neural network kernel and prediction dynamics, we provide a characterization of the $\mathcal{O}(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. Our results, while perturbative in width, unlike prior analyses, are non-perturbative in the strength of feature learning. In the lazy limit of network training, all kernels are random but static in time and the prediction variance has a universal form. However, in the rich, feature learning regime, the fluctuations of the kernels and predictions are dynamically coupled with a variance that can be computed self-consistently. In two layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can dramatically accumulate through subsequent layers at large feature learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large learning rate phenomena such as edge of stability effects can be well captured by infinite width dynamics and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.

AMAG: Additive, Multiplicative and Adaptive Graph Neural Network For Forecasting Neuron Activity
Jingyuan Li Leo Scholl Trung Le Pavithra Rajeswaran Amy L Orsborn Eli Shlizerman



Research question: This paper models neural population dynamics via a forecasting task, using a prior consisting of pairwise neural unit interactions to improve on latent variable models (LVMs).
Motivation: Existing LVMs are mostly deep learning methods that build latent representations by reconstructing the input neural activity, a task that cannot capture temporal causality; this paper therefore improves LVMs through forecasting.
Method: We propose a graph neural network (GNN) based model, the Additive, Multiplicative, and Adaptive Graph Neural Network (AMAG), which leverages additive and multiplicative message-passing operations analogous to interactions in neuronal systems and adaptively learns the interactions among neural units to forecast their future activity.
Results: AMAG outperforms non-GNN methods in recovering ground-truth spatial interactions and predicting the future dynamics of neural populations, and performs well on multiple modalities of neural recordings (field potentials from penetrating electrodes or surface-level micro-electrocorticography) from four rhesus macaques.

Latent Variable Models (LVMs) propose to model the dynamics of neural populations by capturing low-dimensional structures that represent features involved in neural activity. Recent LVMs are based on deep learning methodology where a deep neural network is trained to reconstruct the same neural activity given as input and as a result to build the latent representation. Without taking past or future activity into account such a task is non-causal. In contrast, the task of forecasting neural activity based on given input extends the reconstruction task. LVMs that are trained on such a task could potentially capture temporal causality constraints within its latent representation. Forecasting has received less attention than reconstruction due to recording challenges such as limited neural measurements and trials. In this work, we address modeling neural population dynamics via the forecasting task and improve forecasting performance by including a prior, which consists of pairwise neural unit interaction as a multivariate dynamic system. Our proposed model---Additive, Multiplicative, and Adaptive Graph Neural Network (AMAG)---leverages additive and multiplicative message-passing operations analogous to the interactions in neuronal systems and adaptively learns the interaction among neural units to forecast their future activity. We demonstrate the advantage of AMAG compared to non-GNN based methods on synthetic data and multiple modalities of neural recordings (field potentials from penetrating electrodes or surface-level micro-electrocorticography) from four rhesus macaques. Our results show the ability of AMAG to recover ground truth spatial interactions and yield estimation for future dynamics of the neural population.
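
The additive and multiplicative message-passing operations can be sketched roughly as follows (a toy reading of the abstract, not the released AMAG code; the weights `Wa`, `Wm` and the softmax adjacency are illustrative):

    import torch

    def amag_layer(H, A, Wa, Wm):
        """One sketch layer: H (n, d) unit states, A (n, n) learned adjacency."""
        additive = A @ (H @ Wa)                # neighbors contribute linearly
        multiplicative = (A @ (H @ Wm)) * H    # neighbors gate the unit's own state
        return H + additive + multiplicative

    n, d = 8, 16
    H = torch.randn(n, d)
    A = torch.softmax(torch.randn(n, n), dim=-1)   # adaptively learned interactions
    Wa, Wm = 0.1 * torch.randn(d, d), 0.1 * torch.randn(d, d)
    print(amag_layer(H, A, Wa, Wm).shape)          # torch.Size([8, 16])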

Representational Strengths and Limitations of Transformers
Clayton Sanford Daniel Hsu Matus Telgarsky



Research question: This paper investigates the strengths and weaknesses of attention layers compared with other architectures, establishing both positive and negative results on their representation power.
Motivation: Although attention layers are widely used in modern deep learning, there is little mathematical description of their benefits and deficiencies.
Method: Study the representation power of attention layers from both positive and negative sides, focusing on intrinsic complexity parameters such as width, depth, and embedding dimension.
Results: Attention layers enjoy an advantage on a sparse averaging task but scale poorly on a triple detection task. The paper also highlights the value of communication complexity in analyzing transformers and related models, and the role of sparse averaging as a prototypical attention task.

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.

Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width
Dayal Singh Kalra Maissam Barkeshli



Research question: This paper systematically analyzes the optimization dynamics of deep neural networks (DNNs) trained with stochastic gradient descent (SGD), studying the effects of the learning rate, network depth, and width.
Motivation: By analyzing the maximum eigenvalue of the loss Hessian, a measure of the sharpness of the loss landscape, the authors find that the optimization dynamics can exhibit four distinct regimes.
Method: Track the sharpness $\lambda^H_t$ throughout early training as a function of $\eta \equiv c/\lambda^H_0$, depth $d$, and width $w$, mapping out a phase diagram of the early and intermediate regimes.
Results: Several critical values of $c$ separate qualitatively distinct phenomena in the early-time dynamics of training loss and sharpness; notably, a "sharpness reduction" phase opens up as $d$ and $1/w$ are increased.

We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate $\eta$, depth $d$, and width $w$ of the neural network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c / \lambda_0^H $, $d$, and $w$. We identify several critical values of $c$, which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness. Notably, we discover the opening up of a "sharpness reduction" phase, where sharpness decreases at early times, as $d$ and $ 1/w$ are increased.
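
The central measured quantity, the maximum Hessian eigenvalue $\lambda^H_t$, can be estimated without ever forming the Hessian by power iteration on Hessian-vector products. The sketch below is a generic recipe under that assumption, not the paper's exact measurement protocol:

    import torch

    def sharpness(model, loss_fn, x, y, iters=20):
        """Power iteration on Hessian-vector products; returns ~ top |eigenvalue|."""
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randn_like(p) for p in params]
        for _ in range(iters):
            gv = sum((g * u).sum() for g, u in zip(grads, v))
            Hv = torch.autograd.grad(gv, params, retain_graph=True)
            norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
            v = [h / (norm + 1e-12) for h in Hv]
        return norm.item()

    model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Tanh(),
                                torch.nn.Linear(10, 1))
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    print(sharpness(model, torch.nn.functional.mse_loss, x, y))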

Explaining V1 Properties with a Biologically Constrained Deep Learning Architecture
Galen Pogoncheff Jacob Granley Michael Beyeler



Research question: How to more comprehensively explain primary visual cortex (V1) activity by systematically incorporating neuroscience-derived architectural components into convolutional neural networks (CNNs).
Motivation: Although the current top V1 models have emerged from training with adversarial examples and extensively augmented data, these models still fail to explain key neural properties observed in V1 that arise from biological circuitry.
Method: We systematically incorporate neuroscience-derived architectural components into CNNs to identify a set of mechanisms and architectures that more comprehensively explain V1 activity.
Results: Upon enhancing task-driven CNNs with components that simulate center-surround antagonism, local receptive fields, tuned normalization, and cortical magnification, we uncover models whose latent representations yield state-of-the-art explanation of V1 neural activity and tuning properties. Moreover, analyses of the learned parameters of these components, and of the stimuli that maximally activate neurons of the evaluated networks, support their role in explaining neural properties of V1.

Convolutional neural networks (CNNs) have recently emerged as promising models of the ventral visual stream, despite their lack of biological specificity. While current state-of-the-art models of the primary visual cortex (V1) have surfaced from training with adversarial examples and extensively augmented data, these models are still unable to explain key neural properties observed in V1 that arise from biological circuitry. To address this gap, we systematically incorporated neuroscience-derived architectural components into CNNs to identify a set of mechanisms and architectures that more comprehensively explain V1 activity. Upon enhancing task-driven CNNs with architectural components that simulate center-surround antagonism, local receptive fields, tuned normalization, and cortical magnification, we uncover models with latent representations that yield state-of-the-art explanation of V1 neural activity and tuning properties. Moreover, analyses of the learned parameters of these components and stimuli that maximally activate neurons of the evaluated networks provide support for their role in explaining neural properties of V1. Our results highlight an important advancement in the field of NeuroAI, as we systematically establish a set of architectural components that contribute to unprecedented explanation of V1. The neuroscience insights that could be gleaned from increasingly accurate in-silico models of the brain have the potential to greatly advance the fields of both neuroscience and artificial intelligence.

A Unified, Scalable Framework for Neural Population Decoding
Mehdi Azabou Vinam Arora Venkataramana Ganesh Ximeng Mao Santosh B Nachimuthu Michael Jacob Mendelson Blake Aaron Richards Matthew G Perich Guillaume Lajoie Eva L Dyer



Research question: How to integrate large-scale neural recordings into one unified model for deciphering neural activity.
Motivation: Although deep learning approaches show promise for deciphering neural activity, integrating many neural recordings into a single model is challenging because each recording contains the activity of different neurons from different animals.
Method: This paper introduces a training framework and architecture for modeling population dynamics across diverse, large-scale neural recordings. The method first tokenizes individual spikes to build an efficient representation of neural events that captures the fine temporal structure of neural activity, then employs cross-attention and a PerceiverIO backbone to construct a latent tokenization of neural population activity. With this architecture and training framework, we build a large multi-session model trained on large datasets from seven nonhuman primates, spanning 158 recording sessions, more than 27,373 neural units, and over 100 hours of recordings.
Results: Across a number of tasks, the pretrained model rapidly adapts to new, unseen sessions without specified neuron correspondence, enabling few-shot performance with minimal labels. This work offers a powerful new approach for building deep learning tools to analyze neural data and charts a clear path toward training neural decoding models at scale.

Our ability to use deep learning approaches to decipher neural activity would likely benefit from greater scale, in terms of both the model size and the datasets. However, the integration of many neural recordings into one unified model is challenging, as each recording contains the activity of different neurons from different individual animals. In this paper, we introduce a training framework and architecture designed to model the population dynamics of neural activity across diverse, large-scale neural recordings. Our method first tokenizes individual spikes within the dataset to build an efficient representation of neural events that captures the fine temporal structure of neural activity. We then employ cross-attention and a PerceiverIO backbone to further construct a latent tokenization of neural population activities. Utilizing this architecture and training framework, we construct a large-scale multi-session model trained on large datasets from seven nonhuman primates, spanning over 158 different sessions of recording from over 27,373 neural units and over 100 hours of recordings. In a number of different tasks, we demonstrate that our pretrained model can be rapidly adapted to new, unseen sessions with unspecified neuron correspondence, enabling few-shot performance with minimal labels. This work presents a powerful new approach for building deep learning tools to analyze neural data and stakes out a clear path to training at scale for neural decoding models.
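
The first step, tokenizing individual spikes, can be pictured as turning each spike into a (unit id, time) pair so that sessions with different neuron counts share a single sequence format. The sketch below is our illustration of that idea, with invented names and toy spike trains, not the released pipeline:

    import numpy as np

    rng = np.random.default_rng(0)
    spike_times = {unit: np.sort(rng.uniform(0.0, 1.0, rng.integers(3, 8)))
                   for unit in range(4)}                 # 4 units, irregular spiking

    tokens = sorted((t, unit) for unit, ts in spike_times.items() for t in ts)
    unit_ids = np.array([u for _, u in tokens])          # feeds a unit-embedding lookup
    times = np.array([t for t, _ in tokens])             # feeds a continuous time encoding
    print(unit_ids[:6], np.round(times[:6], 3))          # one token per spike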

Polyhedron Attention Module: Learning Adaptive-order Interactions
Tan Zhu Fei Dou Xinyu Wang Jin Lu Jinbo Bi



Research question: How to effectively learn feature interactions in multivariate predictive models.
Motivation: Existing deep learning methods are limited in handling feature interactions: ReLU-activated networks can only create piecewise linear prediction models, while other nonlinear activation functions lead to models with only high-order feature interactions.
Method: Propose a Polyhedron Attention Module (PAM) that splits the input space into polyhedrons defining the different pieces; on each piece, the hyperplanes forming the polyhedron boundary multiply to form the interactive terms, yielding interactions whose order adapts to each piece.
Results: Theoretical analysis shows that PAM has stronger expressive capability than ReLU-activated networks. Extensive experiments demonstrate superior classification performance on massive click-through rate prediction datasets, and PAM can learn meaningful interaction effects in a medical problem.

Learning feature interactions can be the key for multivariate predictive modeling. ReLU-activated neural networks create piecewise linear prediction models, and other nonlinear activation functions lead to models with only high-order feature interactions. Recent methods incorporate candidate polynomial terms of fixed orders into deep learning, which is subject to the issue of combinatorial explosion, or learn the orders that are difficult to adapt to different regions of the feature space. We propose a Polyhedron Attention Module (PAM) to create piecewise polynomial models where the input space is split into polyhedrons which define the different pieces and on each piece the hyperplanes that define the polyhedron boundary multiply to form the interactive terms, resulting in interactions of adaptive order to each piece. PAM is interpretable to identify important interactions in predicting a target. Theoretic analysis shows that PAM has stronger expression capability than ReLU-activated networks. Extensive experimental results demonstrate the superior classification performance of PAM on massive datasets of the click-through rate prediction and PAM can learn meaningful interaction effects in a medical problem.

Equivariant Adaptation of Large Pretrained Models
Arnab Kumar Mondal Siba Smarak Panigrahi Sékou-Oumar Kaba Sai Rajeswar Siamak Ravanbakhsh



Research question: How to endow large pretrained neural network models with equivariance, improving the accuracy and robustness of their predictions.
Motivation: Achieving a chosen equivariance currently requires redesigning every component of a pretrained network, which is difficult and computationally expensive.
Method: Propose an alternative: use a simple canonicalization network that transforms the input into a canonical form before feeding it to an unconstrained prediction network.
Results: This approach effectively makes large pretrained networks equivariant while maintaining their performance, and significantly improves their robustness to deterministic transformations of the data such as rotations; such equivariant adaptation can help domain-specific applications with known symmetry priors.

Equivariant networks are specifically designed to ensure consistent behavior with respect to a set of input transformations, leading to higher sample efficiency and more accurate and robust predictions. However, redesigning each component of prevalent deep neural network architectures to achieve chosen equivariance is a difficult problem and can result in a computationally expensive network during both training and inference. A recently proposed alternative towards equivariance that removes the architectural constraints is to use a simple canonicalization network that transforms the input to a canonical form before feeding it to an unconstrained prediction network. We show here that this approach can effectively be used to make a large pretrained network equivariant. However, we observe that the produced canonical orientations can be misaligned with those of the training distribution, hindering performance. Using dataset-dependent priors to inform the canonicalization function, we are able to make large pretrained models equivariant while maintaining their performance. This significantly improves the robustness of these models to deterministic transformations of the data, such as rotations. We believe this equivariant adaptation of large pretrained models can help their domain-specific applications with known symmetry priors.
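
The canonicalization idea can be illustrated with a non-learned stand-in: rotate each input into a data-derived canonical frame, then let an unconstrained predictor see only the canonical form. Below, PCA plays the role of the canonicalization network purely for illustration (the paper learns this function and additionally aligns its outputs with the training distribution's orientations):

    import numpy as np

    def canonicalize(points):
        """Rotate a 2-D point cloud into its principal-axis frame (PCA stand-in)."""
        centered = points - points.mean(0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        R = Vt
        if np.linalg.det(R) < 0:            # keep a proper rotation
            R[1] *= -1
        return centered @ R.T

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(100, 2)) @ np.diag([3.0, 1.0])   # anisotropic cloud
    theta = 0.7
    Rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    # The unconstrained predictor only ever sees the canonical form, so the
    # pipeline is invariant to input rotations (up to per-axis sign flips here):
    print(np.allclose(np.abs(canonicalize(pts)),
                      np.abs(canonicalize(pts @ Rot.T)), atol=1e-6))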

Normalization Layers Are All That Sharpness-Aware Minimization Needs
Maximilian Mueller Tiffany Joyce Vlaar David Rolnick Matthias Hein



Research question: This study investigates whether perturbing only the affine normalization parameters in SAM can improve generalization performance.
Motivation: Although SAM has been shown to improve generalization in various settings by reducing the sharpness of minima, it normally perturbs all parameters. This study asks whether perturbing only a subset of the parameters achieves a similar effect.
Method: Experimentally, perturbing only the affine normalization parameters (typically about 0.1% of all parameters) in the adversarial step of SAM can outperform perturbing all parameters. This finding holds for different SAM variants and for both ResNet (Batch Normalization) and Vision Transformer (Layer Normalization) architectures.
Results: Although these findings reaffirm the effectiveness of SAM in improving generalization, they cast doubt on whether the improvement is solely caused by reduced sharpness.

Sharpness-aware minimization (SAM) was proposed to reduce sharpness of minima and has been shown to enhance generalization performance in various settings. In this work we show that perturbing only the affine normalization parameters (typically comprising 0.1% of the total parameters) in the adversarial step of SAM can outperform perturbing all of the parameters. This finding generalizes to different SAM variants and both ResNet (Batch Normalization) and Vision Transformer (Layer Normalization) architectures. We consider alternative sparse perturbation approaches and find that these do not achieve similar performance enhancement at such extreme sparsity levels, showing that this behaviour is unique to the normalization layers. Although our findings reaffirm the effectiveness of SAM in improving generalization performance, they cast doubt on whether this is solely caused by reduced sharpness.
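
A sketch of one SAM update whose adversarial step is restricted to normalization parameters, assuming a torch-style training loop (the module filter, `rho`, and the toy model are our simplifications, not the authors' code):

    import torch
    import torch.nn as nn

    def sam_norm_step(model, loss_fn, x, y, opt, rho=0.05):
        """One SAM update whose adversarial step touches only norm-layer affines."""
        norm_params = [p for m in model.modules()
                       if isinstance(m, (nn.LayerNorm, nn.BatchNorm1d, nn.BatchNorm2d))
                       for p in m.parameters()]
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, norm_params)
        scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)
        with torch.no_grad():
            for p, g in zip(norm_params, grads):
                p.add_(g * scale)              # ascend only in norm-parameter space
        opt.zero_grad()
        loss_fn(model(x), y).backward()        # gradient at the perturbed point
        with torch.no_grad():
            for p, g in zip(norm_params, grads):
                p.sub_(g * scale)              # undo the perturbation
        opt.step()                             # update all parameters as usual

    model = nn.Sequential(nn.Linear(10, 16), nn.LayerNorm(16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    sam_norm_step(model, nn.functional.mse_loss, torch.randn(8, 10), torch.randn(8, 1), opt)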

The Simplicity Bias in Multi-Task RNNs: Shared Attractors, Reuse of Dynamics, and Geometric Representation
Elia Turner Omri Barak



Research question: This paper studies how, in recurrent neural networks (RNNs), a single interconnected neural population performs multiple tasks, each with its own dynamical requirements.
Motivation: The relation between task requirements and neural dynamics in RNNs has been investigated for single tasks, but the forces shaping the joint dynamics of multiple tasks remain largely unexplored.
Method: The paper first constructs a systematic framework for studying multiple tasks in RNNs, minimizing interference from input and output correlations with the hidden representation; this reveals the tendency of RNNs to share attractors and reuse dynamics, defined as the "simplicity bias".
Results: RNNs develop attractors sequentially during training, preferentially reusing existing dynamics and opting for simple solutions when possible; this sequenced emergence and preferential reuse encapsulate the simplicity bias. Concrete examples show that new attractors emerge primarily due to task demands or architectural constraints, illustrating a balance between the simplicity bias and external factors. Examining the geometry of joint representations within a single attractor shows that points with similar input spacings undergo comparable transformations to reach the shared attractor, again highlighting the simplicity bias. These findings offer a basis for inferring the nature of unknown tasks and for the conditions required for network specialization.

How does a single interconnected neural population perform multiple tasks, each with its own dynamical requirements? The relation between task requirements and neural dynamics in Recurrent Neural Networks (RNNs) has been investigated for single tasks. The forces shaping joint dynamics of multiple tasks, however, are largely unexplored. In this work, we first construct a systematic framework to study multiple tasks in RNNs, minimizing interference from input and output correlations with the hidden representation. This allows us to reveal how RNNs tend to share attractors and reuse dynamics, a tendency we define as the "simplicity bias". We find that RNNs develop attractors sequentially during training, preferentially reusing existing dynamics and opting for simple solutions when possible. This sequenced emergence and preferential reuse encapsulate the simplicity bias. Through concrete examples, we demonstrate that new attractors primarily emerge due to task demands or architectural constraints, illustrating a balance between simplicity bias and external factors. We examine the geometry of joint representations within a single attractor, by constructing a family of tasks from a set of functions. We show that the steepness of the associated functions controls their alignment within the attractor. This arrangement again highlights the simplicity bias, as points with similar input spacings undergo comparable transformations to reach the shared attractor. Our findings propose compelling applications. The geometry of shared attractors might allow us to infer the nature of unknown tasks. Furthermore, the simplicity bias implies that without specific incentives, modularity in RNNs may not spontaneously emerge, providing insights into the conditions required for network specialization.

Affinity-Aware Graph Networks
Ameya Velingker Ali Kemal Sinop Ira Ktena Petar Veličković Sreenivas Gollapudi



Research question: How to improve the expressivity of graph neural networks (GNNs) on relational data.
Motivation: Because GNNs perform a limited number of message-passing steps, their ability to express the underlying graph structure is limited, motivating the incorporation of structural information.
Method: This paper explores the use of affinity measures as features in graph neural networks, in particular measures arising from random walks such as effective resistance, hitting times, and commute times. We propose message-passing networks based on these features and evaluate them on a variety of node and graph property prediction tasks.
Results: The architecture has low computational complexity, and the features are invariant to permutations of the underlying graph. The computed measures allow the network to exploit the connectivity properties of the graph, outperforming relevant benchmarks on a wide variety of tasks, often with significantly fewer message-passing steps. On OGB-LSC-PCQM4Mv1, one of the largest publicly available graph regression datasets, we obtained the best known single-model validation MAE at the time.

Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform—and hence a smaller receptive field—there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has low computational complexity, while our features are invariant to the permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks for a wide variety of tasks, often with significantly fewer message passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.
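
Effective resistance, one of the affinity measures used, has a closed form via the pseudoinverse of the graph Laplacian, and commute time is proportional to it. A small sketch of computing these features (dense linear algebra for clarity; at scale one would approximate rather than invert):

    import numpy as np

    def effective_resistance(A):
        """R[u, v] = Lp[u, u] + Lp[v, v] - 2 Lp[u, v], with Lp = pinv(Laplacian)."""
        L = np.diag(A.sum(1)) - A
        Lp = np.linalg.pinv(L)
        d = np.diag(Lp)
        return d[:, None] + d[None, :] - 2 * Lp

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    R = effective_resistance(A)
    commute = A.sum() * R                    # C(u, v) = 2|E| R(u, v), |E| = A.sum()/2
    print(np.round(R[0], 3))                 # per-pair features for message passing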

Adaptive recurrent vision performs zero-shot computation scaling to unseen difficulty levels
Vijay Veerabadran Srinivas Ravishankar Yuan Tang Ritik Raina Virginia R. de Sa



Research question: Humans solving algorithmic or reasoning problems typically exhibit solution times that grow with problem difficulty.
Motivation: Adaptive recurrent neural networks have been shown to exhibit this property for various language-processing tasks; however, little work has assessed whether such adaptive computation also enables vision models to extrapolate beyond the difficulty level of their training distribution.
Method: This study uses recurrent neural networks to investigate a key functional role of such adaptive processing: dynamically scaling computational resources conditional on input requirements, enabling zero-shot generalization to difficulty levels not seen during training. We combine convolutional recurrent neural networks (ConvRNNs) with a learnable halting mechanism based on Graves (2016).
Results: We find that 1) AdRNNs learn to dynamically halt processing early (or late) to solve easier (or harder) problems, and 2) these RNNs zero-shot generalize to more difficult problem settings not shown during training by dynamically increasing the number of recurrent iterations at test time. This study provides modeling evidence for the hypothesis that recurrent processing lets networks adaptively allocate computational resources conditional on input requirements, thereby generalizing to harder difficulty levels of visual reasoning problems.

Humans solving algorithmic (or) reasoning problems typically exhibit solution times that grow as a function of problem difficulty. Adaptive recurrent neural networks have been shown to exhibit this property for various language-processing tasks. However, little work has been performed to assess whether such adaptive computation can also enable vision models to extrapolate solutions beyond their training distribution's difficulty level, with prior work focusing on very simple tasks. In this study, we investigate a critical functional role of such adaptive processing using recurrent neural networks: to dynamically scale computational resources conditional on input requirements that allow for zero-shot generalization to novel difficulty levels not seen during training using two challenging visual reasoning tasks: PathFinder and Mazes. We combine convolutional recurrent neural networks (ConvRNNs) with a learnable halting mechanism based on Graves (2016). We explore various implementations of such adaptive ConvRNNs (AdRNNs) ranging from tying weights across layers to more sophisticated biologically inspired recurrent networks that possess lateral connections and gating. We show that 1) AdRNNs learn to dynamically halt processing early (or late) to solve easier (or harder) problems, 2) these RNNs zero-shot generalize to more difficult problem settings not shown during training by dynamically increasing the number of recurrent iterations at test time. Our study provides modeling evidence supporting the hypothesis that recurrent processing enables the functional advantage of adaptively allocating compute resources conditional on input requirements and hence allowing generalization to harder difficulty levels of a visual reasoning problem without training.
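
The learnable halting mechanism can be sketched in the style of Graves (2016): a halting unit emits a probability at each recurrent step, and computation stops once the cumulative halting mass is exhausted, so harder inputs run more iterations. The vector-RNN sketch below simplifies the ConvRNN setting, with illustrative sizes and a simplified halting-weight rule:

    import torch
    import torch.nn as nn

    cell = nn.GRUCell(8, 16)                   # vector stand-in for a ConvRNN cell
    halt = nn.Linear(16, 1)
    x, h = torch.randn(1, 8), torch.zeros(1, 16)

    remainder, outputs, eps = 1.0, [], 0.01
    for step in range(50):                     # iteration budget at test time
        h = cell(x, h)
        p = torch.sigmoid(halt(h)).item()      # halting probability for this step
        outputs.append((min(p, remainder), h))
        remainder -= p
        if remainder < eps:                    # halting mass exhausted: stop early
            break
    h_final = sum(w * hh for w, hh in outputs) # halting-weighted state, as in ACT
    print(step + 1, h_final.shape)             # steps used grows with input difficulty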

Time Series Kernels based on Nonlinear Vector AutoRegressive Delay Embeddings
Giovanni De Felice John Y Goulermas Vladimir Gusev



Research question: How to design an effective kernel method for time series analysis, especially on small datasets.
Motivation: Current time series methods struggle on small datasets. Reservoir Computing (RC) is a powerful tool, but its performance depends heavily on hyperparameter settings that are hard to interpret and costly to optimize.
Method: Propose a new time series kernel based on the recently established equivalence between reservoir dynamics and Nonlinear Vector AutoRegressive (NVAR) processes. The kernel is non-recurrent and depends on a small set of meaningful hyperparameters, for which an effective heuristic is suggested.
Results: Excellent performance on a wide range of real-world classification tasks, in terms of both accuracy and speed; this further advances the understanding of RC representation-learning models and extends the typical use of the NVAR framework to kernel design and representation of real-world time series data.

Kernel design is a pivotal but challenging aspect of time series analysis, especially in the context of small datasets. In recent years, Reservoir Computing (RC) has emerged as a powerful tool to compare time series based on the underlying dynamics of the generating process rather than the observed data. However, the performance of RC highly depends on the hyperparameter setting, which is hard to interpret and costly to optimize because of the recurrent nature of RC. Here, we present a new kernel for time series based on the recently established equivalence between reservoir dynamics and Nonlinear Vector AutoRegressive (NVAR) processes. The kernel is non-recurrent and depends on a small set of meaningful hyperparameters, for which we suggest an effective heuristic. We demonstrate excellent performance on a wide range of real-world classification tasks, both in terms of accuracy and speed. This further advances the understanding of RC representation learning models and extends the typical use of the NVAR framework to kernel design and representation of real-world time series data.
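
Since NVAR representations are explicit (delay coordinates plus polynomial monomials of them), a kernel between two series can be sketched as an inner product of those feature maps. The construction below is our toy reading of that equivalence, with illustrative `delay` and `order` rather than the paper's heuristic settings:

    import numpy as np
    from itertools import combinations_with_replacement

    def nvar_features(x, delay=2, order=2):
        """Delay embedding plus polynomial monomials of the lagged values."""
        lags = np.stack([x[i:len(x) - delay + i] for i in range(delay + 1)], axis=1)
        monomials = [lags[:, list(c)].prod(axis=1)
                     for k in range(2, order + 1)
                     for c in combinations_with_replacement(range(delay + 1), k)]
        return np.hstack([lags] + [m[:, None] for m in monomials])

    x1, x2 = np.sin(np.linspace(0, 6, 50)), np.cos(np.linspace(0, 6, 50))
    f1, f2 = nvar_features(x1), nvar_features(x2)
    kernel = (f1 * f2).sum() / len(f1)         # inner product of NVAR representations
    print(f1.shape, round(float(kernel), 3))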

Optimality of Message-Passing Architectures for Sparse Graphs
Aseem Baranwal Kimon Fountoulakis Aukosh Jagannath



Research question: This paper studies the node classification problem on feature-decorated graphs in the sparse setting.
Motivation: When the expected degree of a node is $O(1)$ in the number of nodes, with the feature dimension fixed while the number of nodes grows, a suitable notion of optimality is needed; the paper introduces asymptotic local Bayes optimality for node classification tasks.
Method: Compute the optimal classifier under this criterion for a fairly general statistical data model with arbitrary distributions of node features and edge connectivity; the optimal classifier is implementable using a message-passing graph neural network architecture.
Results: The optimal message-passing architecture interpolates between a standard MLP in the low graph signal regime and a typical convolution in the high graph signal regime; a corresponding non-asymptotic result is also proved.

We study the node classification problem on feature-decorated graphs in the sparse setting, i.e., when the expected degree of a node is $O(1)$ in the number of nodes, in the fixed-dimensional asymptotic regime, i.e., the dimension of the feature data is fixed while the number of nodes is large. Such graphs are typically known to be locally tree-like. We introduce a notion of Bayes optimality for node classification tasks, called asymptotic local Bayes optimality, and compute the optimal classifier according to this criterion for a fairly general statistical data model with arbitrary distributions of the node features and edge connectivity. The optimal classifier is implementable using a message-passing graph neural network architecture. We then compute the generalization error of this classifier and compare its performance against existing learning methods theoretically on a well-studied statistical model with naturally identifiable signal-to-noise ratios (SNRs) in the data. We find that the optimal message-passing architecture interpolates between a standard MLP in the regime of low graph signal and a typical convolution in the regime of high graph signal. Furthermore, we prove a corresponding non-asymptotic result.

Transformers learn through gradual rank increase
Emmanuel Abbe Samy Bengio Enric Boix-Adserà Etai Littwin Joshua M. Susskind



Research question: Explore incremental learning dynamics in transformers, in which the difference between trained and initial weights progressively increases in rank.
Motivation: Revealing the internal learning dynamics of transformers helps explain their training process and optimization behavior.
Method: Through theoretical analysis and experimental validation, show that under the simplifying assumptions of diagonal weight matrices and small initialization, the rank of the difference between trained and initial weights increases progressively.
Results: Experiments support the theory and show that the phenomenon can also occur in practice without the simplifying assumptions.

We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions.
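
The tracked quantity, the rank of the weight update $W_t - W_0$, can be probed numerically. A sketch, with an artificial low-rank update standing in for training and a thresholded SVD as the rank proxy:

    import numpy as np

    def effective_rank(delta, tol=1e-3):
        """Count singular values of the weight update above tol * s_max."""
        s = np.linalg.svd(delta, compute_uv=False)
        return int((s > tol * s[0]).sum())

    rng = np.random.default_rng(0)
    W0 = rng.normal(size=(64, 64))             # "initial" weights
    U, V = rng.normal(size=(64, 3)), rng.normal(size=(3, 64))
    W_t = W0 + U @ V                           # toy low-rank update standing in for training
    print(effective_rank(W_t - W0))            # -> 3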

Reversible and irreversible bracket-based dynamics for deep graph neural networks
Anthony Gruber Kookjin Lee Nathaniel Trask



Research question: How to train deep graph neural networks (GNNs) without oversmoothing, and what role physics plays in this.
Motivation: Successful physics-inspired examples include both reversible and irreversible phenomena, yet these mechanisms are diametrically opposed, and matters are further complicated by empirical departures from mathematical theory.
Method: Propose a series of novel GNN architectures based on structure-preserving bracket-based dynamical systems, which are provably guaranteed to either conserve energy or generate positive dissipation with increasing depth.
Results: The theoretically principled framework yields inherently explainable constructions and better elucidates the roles of reversibility and irreversibility in network performance.

Recent works have shown that physics-inspired architectures allow the training of deep graph neural networks (GNNs) without oversmoothing. The role of these physics is unclear, however, with successful examples of both reversible (e.g., Hamiltonian) and irreversible (e.g., diffusion) phenomena producing comparable results despite diametrically opposed mechanisms, and further complications arising due to empirical departures from mathematical theory. This work presents a series of novel GNN architectures based upon structure-preserving bracket-based dynamical systems, which are provably guaranteed to either conserve energy or generate positive dissipation with increasing depth. It is shown that the theoretically principled framework employed here allows for inherently explainable constructions, which contextualize departures from theory in current architectures and better elucidate the roles of reversibility and irreversibility in network performance. Code is available at the Github repository \url{https://github.com/natrask/BracketGraphs}.
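
The reversible case can be illustrated in a few lines: dynamics generated by a skew-symmetric operator applied to an energy gradient conserve that energy, since $\nabla E^\top S \nabla E = 0$. The sketch below is a generic toy of this conservation property, not one of the paper's architectures:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    B = rng.normal(size=(n, n))
    S = B - B.T                                # skew-symmetric "reversible" operator
    E = lambda x: 0.5 * (x ** 2).sum()         # toy energy with grad E(x) = x

    x, dt = rng.normal(size=n), 1e-3
    e0 = E(x)
    for _ in range(1000):                      # explicit Euler; drift vanishes as dt -> 0
        x = x + dt * (S @ x)
    print(round(float(e0), 4), round(float(E(x)), 4))   # nearly equal: energy conserved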

Stable Vectorization of Multiparameter Persistent Homology using Signed Barcodes as Measures
David Loiseaux Luis Scoccola Mathieu Carrière Magnus Bakke Botnan Steve Oudot



Research question: This paper addresses the stable vectorization of multiparameter persistent homology (MPH) descriptors in data science.
Motivation: While one-parameter persistent homology provides good topological descriptions of data, the multiparameter version has seen limited use due to the scarcity of stability results.
Method: By interpreting signed barcodes as signed Radon measures, vectorization strategies extend naturally from one parameter to multiple parameters.
Results: The resulting feature vectors are easy to define and compute, and provably stable; compared with state-of-the-art topology-based methods, notable performance improvements are observed on various types of data.

Persistent homology (PH) provides topological descriptors for geometric data, such as weighted graphs, which are interpretable, stable to perturbations, and invariant under, e.g., relabeling. Most applications of PH focus on the one-parameter case---where the descriptors summarize the changes in topology of data as it is filtered by a single quantity of interest---and there is now a wide array of methods enabling the use of one-parameter PH descriptors in data science, which rely on the stable vectorization of these descriptors as elements of a Hilbert space. Although the multiparameter PH (MPH) of data that is filtered by several quantities of interest encodes much richer information than its one-parameter counterpart, the scarceness of stability results for MPH descriptors has so far limited the available options for the stable vectorization of MPH. In this paper, we aim to bring together the best of both worlds by showing how the interpretation of signed barcodes---a recent family of MPH descriptors---as signed Radon measures leads to natural extensions of vectorization strategies from one parameter to multiple parameters. The resulting feature vectors are easy to define and to compute, and provably stable. While, as a proof of concept, we focus on simple choices of signed barcodes and vectorizations, we already see notable performance improvements when comparing our feature vectors to state-of-the-art topology-based methods on various types of data.

A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge Graphs
Xingyue Huang Miguel Romero Orth Ismail Ilkan Ceylan Pablo Barcelo



Research question: This paper aims at a systematic understanding of graph neural networks over knowledge graphs, particularly for the prominent task of link prediction.
Motivation: While the capabilities and limitations of GNNs are well understood for simple graphs, our understanding remains incomplete in the context of knowledge graphs.
Method: A unifying perspective on seemingly unrelated models unlocks a series of other models, and the expressive power of the various models is characterized via a corresponding relational Weisfeiler-Leman algorithm.
Results: The theoretical findings explain the benefits of some widely adopted practical design choices, which are validated empirically.

Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains incomplete in the context of knowledge graphs. Our goal is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining to the prominent task of link prediction. Our analysis entails a unifying perspective on seemingly unrelated models and unlocks a series of other models. The expressive power of various models is characterized via a corresponding relational Weisfeiler-Leman algorithm. This analysis is extended to provide a precise logical characterization of the class of functions captured by a class of graph neural networks. The theoretical findings presented in this paper explain the benefits of some widely employed practical design choices, which are validated empirically.

Attention as Implicit Structural Inference
Ryan Singh Christopher Buckley



Research question: This paper examines the role of attention mechanisms in cognitive systems and seeks to understand attention in Transformers from a structural inference perspective.
Motivation: Although Transformers have become a dominant architecture in machine learning, the core innovation of attention is grounded in the notions of keys and queries from database management systems; this paper instead pursues a structural inference view.
Method: Building on and bringing together previous theoretical descriptions such as Gaussian Mixture Models, alignment mechanisms, and Hopfield Networks, attention is viewed as inference over an implicitly defined set of possible adjacency structures in a graphical model, revealing the generality of the mechanism.
Results: Two new variants of attention are proposed and demonstrated on explanatory toy problems; the results suggest improvements and generalizations of existing attention mechanisms and connect attention in machine learning with Bayesian conceptions of attention in neuroscience.

Attention mechanisms play a crucial role in cognitive systems by allowing them to flexibly allocate cognitive resources. Transformers, in particular, have become a dominant architecture in machine learning, with attention as their central innovation. However, the underlying intuition and formalism of attention in Transformers are based on ideas of keys and queries in database management systems. In this work, we pursue a structural inference perspective, building upon, and bringing together, previous theoretical descriptions of attention such as Gaussian Mixture Models, alignment mechanisms, and Hopfield Networks. Specifically, we demonstrate that attention can be viewed as inference over an implicitly defined set of possible adjacency structures in a graphical model, revealing the generality of such a mechanism. This perspective unifies different attentional architectures in machine learning and suggests potential modifications and generalizations of attention. Here we investigate two and demonstrate their behaviour on explanatory toy problems: (a) extending the value function to incorporate more nodes of a graphical model yielding a mechanism with a bias toward attending multiple tokens; (b) introducing a geometric prior (with conjugate hyper-prior) over the adjacency structures producing a mechanism which dynamically scales the context window depending on input. Moreover, by describing a link between structural inference and precision-regulation in Predictive Coding Networks, we discuss how this framework can bridge the gap between attentional mechanisms in machine learning and Bayesian conceptions of attention in Neuroscience. We hope by providing a new lens on attention architectures our work can guide the development of new and improved attentional mechanisms.

Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs
Emmanuel Abbe Elisabetta Cornacchia Aryo Lotfi



Research question: How to improve the learning efficiency of neural networks by changing the distribution of training samples and the order of learning steps.
Motivation: Existing work shows that presenting simpler examples before more complex ones (curriculum learning) can improve learning efficiency, and that changing the sampling distribution can help neural networks learn parities.
Method: This work shows a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, a 2-layer ReLU network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first can learn parities of sufficiently large degree, while any fully connected network of possibly larger width or depth trained by noisy-GD on the unordered samples cannot learn without additional steps.
Results: Experiments support a qualitative separation beyond the specific regime of the theoretical results, corroborating that changing the distribution and ordering of training samples can improve learning efficiency.

Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results also showed that changing the sampling distribution can help neural networks learn parities, with formal results only for large learning rates and one-step arguments. Here we show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, there exists a regime in which a 2-layer ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first, can learn parities of sufficiently large degree, while any fully connected neural network of possibly larger width or depth trained by noisy-GD on the unordered samples cannot learn without additional steps. We also provide experimental results supporting the qualitative separation beyond the specific regime of the theoretical results.
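
The data setup can be sketched as follows (our illustration of the sparse/dense mixture and the curriculum ordering; the dimensions, mixture rate, and notion of "sparse" inputs as mostly +1 vectors are assumptions for the toy):

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, n = 20, 5, 1000
    S = rng.choice(d, size=k, replace=False)            # hidden parity support

    dense = rng.choice([-1, 1], size=(n, d))            # uniform inputs
    sparse = np.where(rng.random((n, d)) < 0.1, -1, 1)  # few -1 coordinates
    X = np.vstack([sparse, dense])
    y = X[:, S].prod(axis=1)                            # parity label on subset S
    order = np.argsort(-(X == 1).sum(axis=1))           # curriculum: sparsest first
    print(X.shape, y[:5], order[:5])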

Should Under-parameterized Student Networks Copy or Average Teacher Weights?
Berfin Simsek Amire Bendjeddou Wulfram Gerstner Johanni Brea



Research question: How to approximate a "teacher" network with one hidden layer and $k$ neurons by a "student" network with fewer neurons.
Motivation: Since the student has fewer neurons than the teacher, it is unclear whether each student neuron should copy one teacher neuron or average a group of teacher neurons.
Method: For shallow networks with erf activation and standard Gaussian inputs, prove that "copy-average" configurations are critical points when the teacher's incoming vectors are orthonormal and its outgoing weights are unitary.
Results: Empirically, for the erf activation, gradient flow converges either to the optimal copy-average critical point or to a point where each student neuron approximately copies a distinct teacher neuron; similar results hold for ReLU, suggesting that the optimal solution of under-parameterized networks has a universal structure.

Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n < k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.

A generative model of the hippocampal formation trained with theta driven local learning rules
Tom George Kim Stachenfeld Caswell Barry Claudia Clopath Tomoki Fukai



Research question: This study explores the generative models underpinning animal intelligence by modeling the hippocampal formation.
Motivation: Understanding the biological mechanisms that support these processes promises to shed light on the relationship between biological and artificial intelligence.
Method: We introduce a model of the hippocampal formation tantamount to a Helmholtz machine, applied to a temporal stream of inputs. A novel component of our model is that fast theta-band oscillations (5-10 Hz) gate the direction of information flow throughout the network, training it akin to a high-frequency wake-sleep algorithm.
Results: The model accurately infers the latent state of high-dimensional sensory environments and generates realistic sensory predictions. Furthermore, it can learn to path integrate by developing a ring attractor connectivity structure matching previous theoretical proposals, and can flexibly transfer this structure between environments.

Advances in generative models have recently revolutionised machine learning. Meanwhile, in neuroscience, generative models have long been thought fundamental to animal intelligence. Understanding the biological mechanisms that support these processes promises to shed light on the relationship between biological and artificial intelligence. In animals, the hippocampal formation is thought to learn and use a generative model to support its role in spatial and non-spatial memory. Here we introduce a biologically plausible model of the hippocampal formation tantamount to a Helmholtz machine that we apply to a temporal stream of inputs. A novel component of our model is that fast theta-band oscillations (5-10 Hz) gate the direction of information flow throughout the network, training it akin to a high-frequency wake-sleep algorithm. Our model accurately infers the latent state of high-dimensional sensory environments and generates realistic sensory predictions. Furthermore, it can learn to path integrate by developing a ring attractor connectivity structure matching previous theoretical proposals and flexibly transfer this structure between environments. Whereas many models trade-off biological plausibility with generality, our model captures a variety of hippocampal cognitive functions under one biologically plausible local learning rule.

Ignorance is Bliss: Robust Control via Information Gating
Manan Tomar Riashat Islam Matthew E. Taylor Sergey Levine Philip Bachman



Research question: How to achieve better generalization via information gating, while reducing the influence of noise and spurious correlations.
Motivation: Propose information gating as a way to learn parsimonious representations that identify the minimal information required for a task, thereby improving generalization.
Method: Gate information using a differentiable parameterization of the signal-to-noise ratio, which can be applied to arbitrary values in a network, e.g., erasing pixels at the input layer or hiding activations in some intermediate layer.
Results: Experiments show that learning to identify and use minimal information improves generalization in downstream tasks; policies based on information gating are considerably more robust to irrelevant visual features, improving the pretraining and finetuning of RL models.

Informational parsimony provides a useful inductive bias for learning representations that achieve better generalization by being robust to noise and spurious correlations. We propose *information gating* as a way to learn parsimonious representations that identify the minimal information required for a task. When gating information, we can learn to reveal as little information as possible so that a task remains solvable, or hide as little information as possible so that a task becomes unsolvable. We gate information using a differentiable parameterization of the signal-to-noise ratio, which can be applied to arbitrary values in a network, e.g., erasing pixels at the input layer or activations in some intermediate layer. When gating at the input layer, our models learn which visual cues matter for a given task. When gating intermediate layers, our models learn which activations are needed for subsequent stages of computation. We call our approach *InfoGating*. We apply InfoGating to various objectives such as multi-step forward and inverse dynamics models, Q-learning, and behavior cloning, highlighting how InfoGating can naturally help in discarding information not relevant for control. Results show that learning to identify and use minimal information can improve generalization in downstream tasks. Policies based on InfoGating are considerably more robust to irrelevant visual features, leading to improved pretraining and finetuning of RL models.
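
A minimal sketch of input-layer gating, assuming a torch-style setup (the gate network, the Gaussian noise model, and the 0.1 penalty weight are illustrative stand-ins, not the paper's exact objective):

    import torch
    import torch.nn as nn

    gate_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(8, 1, 3, padding=1))
    x = torch.rand(4, 1, 28, 28)
    g = torch.sigmoid(gate_net(x))                   # per-pixel signal-to-noise gate
    x_gated = g * x + (1 - g) * torch.randn_like(x)  # ungated pixels become noise

    task_head = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    labels = torch.randint(0, 10, (4,))
    task_loss = nn.functional.cross_entropy(task_head(x_gated), labels)
    loss = task_loss + 0.1 * g.mean()                # solve the task, reveal little
    loss.backward()
    print(float(loss))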

MeGraph: Capturing Long-Range Interactions by Alternating Local and Hierarchical Aggregation on Multi-Scaled Graph Hierarchy
Honghua Dong Jiawei Xu Yu Yang Rui Zhao Shiwen Wu Chun Yuan Xiu Li Chris J. Maddison Lei Han



Research question: This paper addresses the difficulty graph neural networks have in capturing long-range interactions (LRIs) within graphs.
Motivation: Existing GNNs mainly exchange information between local neighbors and often struggle to capture LRIs.
Method: Propose MeGraph, a model that integrates the local and hierarchical structures of a multi-scale graph hierarchy into a single mega graph. MeGraph consists of multiple layers alternating between local and hierarchical information aggregation: each layer first performs local-aware message passing on graphs of varied scales via intra-graph edges, then fuses information across the entire hierarchy along the bidirectional pathways formed by inter-graph edges.
Results: Experiments show that MeGraph excels at capturing long-range interactions and achieves superior or comparable performance on commonly adopted real-world datasets.

Graph neural networks, which typically exchange information between local neighbors, often struggle to capture long-range interactions (LRIs) within the graph. Building a graph hierarchy via graph pooling methods is a promising approach to address this challenge; however, hierarchical information propagation cannot entirely take over the role of local information aggregation. To balance locality and hierarchy, we integrate the local and hierarchical structures, represented by intra- and inter-graphs respectively, of a multi-scale graph hierarchy into a single mega graph. Our proposed MeGraph model consists of multiple layers alternating between local and hierarchical information aggregation on the mega graph. Each layer first performs local-aware message-passing on graphs of varied scales via the intra-graph edges, then fuses information across the entire hierarchy along the bidirectional pathways formed by inter-graph edges. By repeating this fusion process, local and hierarchical information could intertwine and complement each other. To evaluate our model, we establish a new Graph Theory Benchmark designed to assess LRI capture ability, in which MeGraph demonstrates dominant performance. Furthermore, MeGraph exhibits superior or equivalent performance to state-of-the-art models on the Long Range Graph Benchmark. The experimental results on commonly adopted real-world datasets further demonstrate the broad applicability of MeGraph.

Domain Agnostic Fourier Neural Operators
Ning Liu Siavash Jafarzadeh Yue Yu



Research question: Current Fourier neural operators (FNOs) rely on the fast Fourier transform (FFT), which restricts them to rectangular domains and hampers problems with irregular geometries and topology changes.
Motivation: To lift this restriction, we propose the Domain Agnostic Fourier Neural Operator (DAFNO), a novel neural operator architecture for learning surrogates with irregular geometries and evolving domains.
Method: We incorporate a smoothed characteristic function into the integral layer architecture of FNOs and leverage the FFT for rapid computation, so that the geometric information is explicitly encoded in the architecture.
Results: DAFNO achieves state-of-the-art accuracy compared with baseline neural operator models on two benchmark datasets for material modeling and airfoil simulation. Moreover, on a brittle fracture evolution problem with topology changes, DAFNO trained on a single crack simulation sample generalizes to unseen loading scenarios and to crack patterns substantially different from the training scenario.

Fourier neural operators (FNOs) can learn highly nonlinear mappings between function spaces, and have recently become a popular tool for learning responses of complex physical systems. However, to achieve good accuracy and efficiency, FNOs rely on the Fast Fourier transform (FFT), which is restricted to modeling problems on rectangular domains. To lift such a restriction and permit FFT on irregular geometries as well as topology changes, we introduce domain agnostic Fourier neural operator (DAFNO), a novel neural operator architecture for learning surrogates with irregular geometries and evolving domains. The key idea is to incorporate a smoothed characteristic function in the integral layer architecture of FNOs, and leverage FFT to achieve rapid computations, in such a way that the geometric information is explicitly encoded in the architecture. In our empirical evaluation, DAFNO has achieved state-of-the-art accuracy as compared to baseline neural operator models on two benchmark datasets of material modeling and airfoil simulation. To further demonstrate the capability and generalizability of DAFNO in handling complex domains with topology changes, we consider a brittle material fracture evolution problem. With only one training crack simulation sample, DAFNO has achieved generalizability to unseen loading scenarios and substantially different crack patterns from the trained scenario. Our code and data accompanying this paper are available at https://github.com/ningliu-iga/DAFNO.
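
The key idea can be sketched in one dimension: multiply by a smoothed characteristic function of the domain around a standard FFT-based spectral multiplication, so the geometry enters the architecture explicitly. This is our toy reading of the integral-layer construction, not the released DAFNO code; the sigmoid indicator and random Fourier multipliers are illustrative:

    import torch

    def dafno_layer(v, chi, weights):
        """Mask by the smoothed domain indicator, convolve spectrally, mask again."""
        v_hat = torch.fft.rfft(chi * v)            # FFT stays usable: chi encodes geometry
        return chi * torch.fft.irfft(v_hat * weights, n=v.shape[-1])

    n = 64
    grid = torch.linspace(0, 1, n)
    chi = torch.sigmoid(50 * (grid - 0.2)) * torch.sigmoid(50 * (0.7 - grid))
    v = torch.sin(4 * torch.pi * grid)             # input field on the full box
    weights = 0.1 * torch.randn(n // 2 + 1, dtype=torch.cfloat)  # Fourier multipliers
    print(dafno_layer(v, chi, weights).shape)      # torch.Size([64])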

Exact Verification of ReLU Neural Control Barrier Functions
Hongchao Zhang Junlin Wu Yevgeniy Vorobeychik Andrew Clark



Research question: How to verify the safety of learned control barrier functions (CBFs).
Motivation: Control barrier functions are a popular approach for safe control of nonlinear systems, but verifying that a learned CBF guarantees safety remains a challenge.
Method: This paper presents novel exact conditions and algorithms for verifying the safety of feedforward neural control barrier functions (NCBFs) with ReLU activations, leveraging a generalization of Nagumo's theorem for proving invariance of sets with nonsmooth boundaries.
Results: Numerical studies show that the approach is more effective than state-of-the-art SMT-based methods.

Control Barrier Functions (CBFs) are a popular approach for safe control of nonlinear systems. In CBF-based control, the desired safety properties of the system are mapped to nonnegativity of a CBF, and the control input is chosen to ensure that the CBF remains nonnegative for all time. Recently, machine learning methods that represent CBFs as neural networks (neural control barrier functions, or NCBFs) have shown great promise due to the universal representability of neural networks. However, verifying that a learned CBF guarantees safety remains a challenging research problem. This paper presents novel exact conditions and algorithms for verifying safety of feedforward NCBFs with ReLU activation functions. The key challenge in doing so is that, due to the piecewise linearity of the ReLU function, the NCBF will be nondifferentiable at certain points, thus invalidating traditional safety verification methods that assume a smooth barrier function. We resolve this issue by leveraging a generalization of Nagumo's theorem for proving invariance of sets with nonsmooth boundaries to derive necessary and sufficient conditions for safety. Based on this condition, we propose an algorithm for safety verification of NCBFs that first decomposes the NCBF into piecewise linear segments and then solves a nonlinear program to verify safety of each segment as well as the intersections of the linear segments. We mitigate the complexity by only considering the boundary of the safe region and by pruning the segments with Interval Bound Propagation (IBP) and linear relaxation. We evaluate our approach through numerical studies with comparison to state-of-the-art SMT-based methods. Our code is available at https://github.com/HongchaoZhang-HZ/exactverif-reluncbf-nips23.

Dis-inhibitory neuronal circuits can control the sign of synaptic plasticity
Julian Rossbroich Friedemann Zenke



Research question: How neuronal circuits achieve credit assignment remains a central unsolved question in systems neuroscience.
Motivation: Various studies have proposed plausible solutions for back-propagating error signals through multi-layer networks, but these purely functionally motivated models assume distinct neuronal compartments representing the local error signals that determine the sign of synaptic plasticity, which is inconsistent with phenomenological plasticity models in which the sign depends primarily on postsynaptic activity.
Method: We show how a plausible microcircuit model and a Hebbian learning rule derived within an adaptive control theory framework can resolve this discrepancy. Assuming errors are encoded in top-down dis-inhibitory synaptic afferents, we find that error-modulated learning emerges naturally at the circuit level when recurrent inhibition explicitly influences Hebbian plasticity.
Results: The same learning rule accounts for experimentally observed plasticity in the absence of inhibition and performs comparably to back-propagation of error (BP) on several non-linearly separable benchmarks. Our findings bridge the gap between functional and experimentally observed plasticity rules and make concrete predictions on the inhibitory modulation of excitatory plasticity.

How neuronal circuits achieve credit assignment remains a central unsolved question in systems neuroscience. Various studies have suggested plausible solutions for back-propagating error signals through multi-layer networks. These purely functionally motivated models assume distinct neuronal compartments to represent local error signals that determine the sign of synaptic plasticity. However, this explicit error modulation is inconsistent with phenomenological plasticity models in which the sign depends primarily on postsynaptic activity. Here we show how a plausible microcircuit model and Hebbian learning rule derived within an adaptive control theory framework can resolve this discrepancy. Assuming errors are encoded in top-down dis-inhibitory synaptic afferents, we show that error-modulated learning emerges naturally at the circuit level when recurrent inhibition explicitly influences Hebbian plasticity. The same learning rule accounts for experimentally observed plasticity in the absence of inhibition and performs comparably to back-propagation of error (BP) on several non-linearly separable benchmarks. Our findings bridge the gap between functional and experimentally observed plasticity rules and make concrete predictions on inhibitory modulation of excitatory plasticity.

Calibrate and Boost Logical Expressiveness of GNN Over Multi-Relational and Temporal Graphs
Yeyuan Chen Dingmin Wang



Research question: This paper analyzes the logical expressiveness of graph neural networks (GNNs) as Boolean node classifiers over multi-relational graphs.
Motivation: Although GNNs are a powerful framework for graph representation learning, there has been no formal analysis of their logical expressiveness as Boolean node classifiers over multi-relational graphs.
Method: The paper studies $\mathcal{FOC}_2$, a fragment of first-order logic with two variables and counting quantifiers. It considers the R$^2$-GNN architecture, which extends local message-passing GNNs with a global readout, and proves that R$^2$-GNN models are equivalent to $\mathcal{FOC}_2$ classifiers under certain restricted yet reasonable scenarios.
Results: To address the expressiveness limitations of R$^2$-GNNs, the paper proposes a simple graph transformation technique, akin to a preprocessing step, executable in linear time; with it, R$^2$-GNNs can effectively capture any $\mathcal{FOC}_2$ classifier. The analysis and transformation are extended to temporal graphs, with several temporal GNN architectures explored and an expressiveness hierarchy provided for them. Experiments show that R$^2$-GNNs with the graph transformation outperform various well-known GNN architectures supporting multi-relational or temporal graphs on node classification tasks.

As a powerful framework for graph representation learning, Graph Neural Networks (GNNs) have garnered significant attention in recent years. However, to the best of our knowledge, there has been no formal analysis of the logical expressiveness of GNNs as Boolean node classifiers over multi-relational graphs, where each edge carries a specific relation type. In this paper, we investigate $\mathcal{FOC}_2$, a fragment of first-order logic with two variables and counting quantifiers. On the negative side, we demonstrate that the R$^2$-GNN architecture, which extends the local message passing GNN by incorporating global readout, fails to capture $\mathcal{FOC}_2$ classifiers in the general case. Nevertheless, on the positive side, we establish that R$^2$-GNN models are equivalent to $\mathcal{FOC}_2$ classifiers under certain restricted yet reasonable scenarios. To address the limitations of R$^2$-GNNs regarding expressiveness, we propose a simple graph transformation technique, akin to a preprocessing step, which can be executed in linear time. This transformation enables R$^2$-GNNs to effectively capture any $\mathcal{FOC}_2$ classifier when applied to the "transformed" input graph. Moreover, we extend our analysis of expressiveness and graph transformation to temporal graphs, exploring several temporal GNN architectures and providing an expressiveness hierarchy for them. To validate our findings, we implement R$^2$-GNNs and the graph transformation technique and conduct empirical tests in node classification tasks against various well-known GNN architectures that support multi-relational or temporal graphs. Our experimental results consistently demonstrate that R$^2$-GNN with the graph transformation outperforms the baseline methods on both synthetic and real-world datasets.

Machine learning detects terminal singularities
Tom Coates Alexander M. Kasprzyk Sara Veneziale



Research question: The classification of Q-Fano varieties.
Motivation: Q-Fano varieties are 'atomic pieces' of more complex geometric shapes, and their classification, though fundamental, remains unknown.
Method: Apply machine learning, specifically a neural network classifier, to eight-dimensional positively curved algebraic varieties with toric symmetry and Picard rank two.
Results: The neural network classifier predicts with 95% accuracy whether such an algebraic variety is Q-Fano, giving a first sketch of the landscape of Q-Fano varieties in dimension eight. Inspired by the ML analysis, a new global combinatorial criterion is formulated and proved for a positively curved toric variety of Picard rank two to have terminal singularities. These findings suggest that machine learning can be an essential tool for developing mathematical conjectures and accelerating theoretical discovery.

Algebraic varieties are the geometric shapes defined by systems of polynomial equations; they are ubiquitous across mathematics and science. Amongst these algebraic varieties are Q-Fano varieties: positively curved shapes which have Q-factorial terminal singularities. Q-Fano varieties are of fundamental importance in geometry as they are 'atomic pieces' of more complex shapes – the process of breaking a shape into simpler pieces in this sense is called the Minimal Model Programme. Despite their importance, the classification of Q-Fano varieties remains unknown. In this paper we demonstrate that machine learning can be used to understand this classification. We focus on eight-dimensional positively-curved algebraic varieties that have toric symmetry and Picard rank two, and develop a neural network classifier that predicts with 95% accuracy whether or not such an algebraic variety is Q-Fano. We use this to give a first sketch of the landscape of Q-Fano varieties in dimension eight. How the neural network is able to detect Q-Fano varieties with such accuracy remains mysterious, and hints at some deep mathematical theory waiting to be uncovered. Furthermore, when visualised using the quantum period, an invariant that has played an important role in recent theoretical developments, we observe that the classification as revealed by ML appears to fall within a bounded region, and is stratified by the Fano index. This suggests that it may be possible to state and prove conjectures on completeness in the future. Inspired by the ML analysis, we formulate and prove a new global combinatorial criterion for a positively curved toric variety of Picard rank two to have terminal singularities. Together with the first sketch of the landscape of Q-Fano varieties in higher dimensions, this gives strong new evidence that machine learning can be an essential tool in developing mathematical conjectures and accelerating theoretical discovery.

Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff
Arthur Jacot



Research question: How the depth and regularization of deep neural networks affect the way they learn input representations.
Motivation: Previous work showed that deep neural networks are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank of the learned function, conjectured to be the Bottleneck rank.
Method: Compute finite-depth corrections to this result, revealing a measure of regularity that bounds the pseudo-determinant of the Jacobian and is subadditive under composition and addition.
Results: Proves that in the infinite-depth limit almost all hidden representations are approximately of the bottleneck dimension, and almost all weight matrices have bottleneck-rank-many singular values close to 1 while the others are $O(L^{-1/2})$. Interestingly, large learning rates are required to guarantee an order $O(L)$ NTK, which in turn guarantees the convergence of the representations of almost all layers at infinite depth.

Previous work has shown that DNNs with large depth $L$ and $L_{2}$-regularization are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the Bottleneck rank. We compute finite depth corrections to this result, revealing a measure $R^{(1)}$ of regularity which bounds the pseudo-determinant of the Jacobian $\left\|Jf(x)\right\|_{+}$ and is subadditive under composition and addition. This formalizes a balance between learning low-dimensional representations and minimizing complexity/irregularity in the feature maps, allowing the network to learn the `right' inner dimension. Finally, we prove the conjectured bottleneck structure in the learned features as $L\to\infty$: for large depths, almost all hidden representations are approximately $R^{(0)}(f)$-dimensional, and almost all weight matrices $W_{\ell}$ have $R^{(0)}(f)$ singular values close to 1 while the others are $O(L^{-\frac{1}{2}})$. Interestingly, the use of large learning rates is required to guarantee an order $O(L)$ NTK which in turn guarantees infinite depth convergence of the representations of almost all layers.

Training biologically plausible recurrent neural networks on cognitive tasks with long-term dependencies
Wayne WM Soo Vishwa Goudar Xiao-Jing Wang



Research question: Training recurrent neural networks (RNNs) to generate and evaluate neural hypotheses about cognitive mechanisms, where RNNs struggle to learn tasks with long-term dependencies.
Motivation: Overcome the difficulty RNNs face in learning long-term dependencies during training, improving their efficiency and performance on tasks that model cognitive processes.
Method: Introduce specialized skip connections through time to support the emergence of task-relevant dynamics, and subsequently restore the original architecture to preserve biological plausibility.
Results: The method enables RNNs to successfully learn cognitive tasks that require long-term dependencies or memory of past events, reduces training steps and wall-clock time, and expands the range of experimental tasks that biologically plausible RNN models can learn.

Training recurrent neural networks (RNNs) has become a go-to approach for generating and evaluating mechanistic neural hypotheses for cognition. The ease and efficiency of training RNNs with backpropagation through time and the availability of robustly supported deep learning libraries has made RNN modeling more approachable and accessible to neuroscience. Yet, a major technical hindrance remains. Cognitive processes such as working memory and decision making involve neural population dynamics over a long period of time within a behavioral trial and across trials. It is difficult to train RNNs to accomplish tasks where neural representations and dynamics have long temporal dependencies without gating mechanisms such as LSTMs or GRUs, which currently lack experimental support and prohibit direct comparison between RNNs and biological neural circuits. We tackled this problem based on the idea of specialized skip-connections through time to support the emergence of task-relevant dynamics, and subsequently reinstitute biological plausibility by reverting to the original architecture. We show that this approach enables RNNs to successfully learn cognitive tasks that prove impractical if not impossible to learn using conventional methods. Over numerous tasks considered here, we achieve fewer training steps and shorter wall-clock times, particularly in tasks that require learning long-term dependencies via temporal integration over long timescales or maintaining a memory of past events in hidden-states. Our methods expand the range of experimental tasks that biologically plausible RNN models can learn, thereby supporting the development of theory for the emergent neural mechanisms of computations involving long-term dependencies.
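
A minimal sketch of the skip-connections-through-time idea, under my own rendering: a vanilla RNN augmented with an extra recurrent term reaching back a fixed number of steps, which shortens the credit-assignment path during training. The class and parameter names are hypothetical, not the authors' architecture; in their approach the skip pathway is later removed to recover the original, biologically plausible circuit.

```python
import torch
import torch.nn as nn

class SkipThroughTimeRNN(nn.Module):
    """Vanilla RNN plus a skip connection from `skip` steps back (illustrative)."""
    def __init__(self, n_in, n_hidden, skip=10):
        super().__init__()
        self.skip = skip
        self.w_in = nn.Linear(n_in, n_hidden)
        self.w_rec = nn.Linear(n_hidden, n_hidden)
        self.w_skip = nn.Linear(n_hidden, n_hidden)  # removable after training

    def forward(self, x):  # x: (time, batch, n_in)
        hs = []
        h = torch.zeros(x.shape[1], self.w_rec.in_features)
        for t in range(x.shape[0]):
            h_far = hs[t - self.skip] if t >= self.skip else torch.zeros_like(h)
            # Gradients flow both through h (one step back) and through
            # h_far (`skip` steps back), easing long-term dependencies.
            h = torch.tanh(self.w_in(x[t]) + self.w_rec(h) + self.w_skip(h_far))
            hs.append(h)
        return torch.stack(hs)

out = SkipThroughTimeRNN(3, 32, skip=10)(torch.randn(50, 8, 3))
print(out.shape)  # torch.Size([50, 8, 32])
```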

Efficient Learning of Linear Graph Neural Networks via Node Subsampling
Seiyun Shin Ilan Shomorony Han Zhao



Research question: Whether the full computation of the product of the graph adjacency matrix and the data matrix can be avoided, so that GNN operations run in (quasi-)linear time.
Motivation: GNNs often take large-scale graphs as inputs, which imposes significant computational/storage costs in the training and testing phases.
Method: Develop an efficient training algorithm based on (1) node subsampling, (2) estimating the leverage scores of the adjacency-data product from the subsampled graph, and (3) performing leverage score sampling on that product.
Results: The algorithm learns the regression model while observing only $O(nd\epsilon^{-2}\log n)$ entries of the adjacency matrix in time $O(nd^2 \epsilon^{-2}\log n)$, with the learned weights deviating by at most $\epsilon$ in $\ell_2$ norm from the model learned using the entire adjacency matrix.

Graph Neural Networks (GNNs) are a powerful class of machine learning models with applications in recommender systems, drug discovery, social network analysis, and computer vision. One challenge with their implementation is that GNNs often take large-scale graphs as inputs, which imposes significant computational/storage costs in the training and testing phases. In particular, the message passing operations of a GNN require multiplication of the graph adjacency matrix $A \in \mathbb{R}^{n \times n}$ and the data matrix $X \in \mathbb{R}^{n \times d}$, and the $O(n^2 d)$ time complexity can be prohibitive for large $n$. Thus, a natural question is whether it is possible to perform the GNN operations in (quasi-)linear time by avoiding the full computation of $A X$. To study this question, we consider the setting of a regression task on a two-layer Linear Graph Convolutional Network (GCN). We develop an efficient training algorithm based on (1) performing node subsampling, (2) estimating the leverage scores of $A X$ based on the subsampled graph, and (3) performing leverage score sampling on $A X$. We show that our proposed scheme learns the regression model observing only $O(nd\epsilon^{-2}\log n)$ entries of $A$ in time $O(nd^2 \epsilon^{-2}\log n)$, with the guarantee that the learned weights deviate by at most $\epsilon$ under the $\ell_2$ norm from the model learned using the entire adjacency matrix $A$. We present empirical results for regression problems on real-world graphs and show that our algorithm significantly outperforms other baseline sampling strategies that exploit the same number of observations.
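
A NumPy sketch of the three-step pipeline described above: uniform node subsampling to estimate $AX$, leverage scores of the estimate, then leverage-score row sampling for the sketched regression. Function names, the uniform column-sampling estimator, and all constants are illustrative, not taken from the paper's code.

```python
import numpy as np

def leverage_scores(M):
    # Row leverage scores via a thin QR decomposition.
    Q, _ = np.linalg.qr(M)
    return np.sum(Q ** 2, axis=1)

def subsampled_regression(A, X, y, n_nodes, n_rows, rng):
    n = A.shape[0]
    # (1) Subsample nodes; read only the corresponding columns of A.
    cols = rng.choice(n, size=n_nodes, replace=False)
    AX_hat = A[:, cols] @ X[cols] * (n / n_nodes)  # unbiased estimate of A X
    # (2) Leverage scores of the estimated product.
    p = leverage_scores(AX_hat)
    p = p / p.sum()
    # (3) Leverage-score row sampling with importance reweighting.
    rows = rng.choice(n, size=n_rows, replace=True, p=p)
    w = 1.0 / np.sqrt(n_rows * p[rows])
    theta, *_ = np.linalg.lstsq(AX_hat[rows] * w[:, None], y[rows] * w, rcond=None)
    return theta

rng = np.random.default_rng(0)
n, d = 500, 8
A = (rng.random((n, n)) < 0.05).astype(float)
X = rng.standard_normal((n, d))
y = A @ X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)
print(subsampled_regression(A, X, y, n_nodes=100, n_rows=200, rng=rng))
```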

An information-theoretic quantification of the content of communication between brain regions
Marco Celotto Jan Bím Alejandro Tlaie Vito De Feo Alessandro Toso Stefan M Lemke Daniel Chicharro Hamed Nili Malte Bieler Ileana Livia Hanganu-Opatz Tobias H. Donner Andrea Brovelli Stefano Panzeri



Research question: How to quantify the amount, content, and direction of communication between brain regions in order to understand brain function.
Motivation: Traditional analyses of brain activity based on the Wiener-Granger causality principle only quantify the overall information propagated between simultaneously recorded brain regions and cannot reveal the flow of information about specific features of interest (such as sensory stimuli).
Method: Develop a new information-theoretic measure, Feature-specific Information Transfer (FIT), quantifying how much information about a specific feature flows between two regions. FIT merges the Wiener-Granger causality principle with information-content specificity.
Results: Simulations of neural activity demonstrate that FIT identifies, within the total information propagated between regions, the information transmitted about specific features. Analyses of three neural datasets obtained with different recording methods (magneto- and electro-encephalography and spiking activity) show that FIT uncovers the content and direction of information flow between brain regions beyond what traditional methods can discern. FIT can thus improve our understanding of how brain regions communicate by revealing previously unaddressed feature-specific information flow.

Quantifying the amount, content and direction of communication between brain regions is key to understanding brain function. Traditional methods to analyze brain activity based on the Wiener-Granger causality principle quantify the overall information propagated by neural activity between simultaneously recorded brain regions, but do not reveal the information flow about specific features of interest (such as sensory stimuli). Here, we develop a new information theoretic measure termed Feature-specific Information Transfer (FIT), quantifying how much information about a specific feature flows between two regions. FIT merges the Wiener-Granger causality principle with information-content specificity. We first derive FIT and prove analytically its key properties. We then illustrate and test them with simulations of neural activity, demonstrating that FIT identifies, within the total information propagated between regions, the information that is transmitted about specific features. We then analyze three neural datasets obtained with different recording methods, magneto- and electro-encephalography, and spiking activity, to demonstrate the ability of FIT to uncover the content and direction of information flow between brain regions beyond what can be discerned with traditional analytical methods. FIT can improve our understanding of how brain regions communicate by uncovering previously unaddressed feature-specific information flow.

Expressivity-Preserving GNN Simulation
Fabian Jogl Maximilian Thiessen Thomas Gärtner



Research question: Enabling standard message passing to simulate state-of-the-art graph neural networks (GNNs) via graph transformations, without loss of expressivity.
Motivation: Implementations of non-standard GNNs suffer from many sources of implementation issues and are hard to optimize for; a direct way to translate their common operations into graph transformations would let them run as standard message passing from standard libraries.
Method: Systematically investigate such graph transformations, distinguishing weak simulation (the same expressivity is reached only after several message passing steps) from strong simulation (it is reached after every step), and derive a direct translation of common non-standard GNN operations into transformations that allow strong or weak simulation.
Results: Message passing on transformed graphs shows competitive predictive performance on various molecular benchmark datasets, in several cases surpassing the original GNNs.

We systematically investigate graph transformations that enable standard message passing to simulate state-of-the-art graph neural networks (GNNs) without loss of expressivity. Using these, many state-of-the-art GNNs can be implemented with message passing operations from standard libraries, eliminating many sources of implementation issues and allowing for better code optimization. We distinguish between weak and strong simulation: weak simulation achieves the same expressivity only after several message passing steps while strong simulation achieves this after every message passing step. Our contribution leads to a direct way to translate common operations of non-standard GNNs to graph transformations that allow for strong or weak simulation. Our empirical evaluation shows competitive predictive performance of message passing on transformed graphs for various molecular benchmark datasets, in several cases surpassing the original GNNs.
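
A toy example of the kind of graph transformation involved: turning each typed edge into an auxiliary relay node, so that a standard message-passing GNN, which only aggregates node features, can still see edge information. This is a generic illustration of the idea under my own assumptions, not one of the paper's specific recipes; like the transformations discussed above, it runs in time linear in the number of edges.

```python
def edge_to_node_transform(num_nodes, edges):
    """edges: list of (u, v, edge_feature). Each original edge becomes a
    relay node carrying the edge feature, wired as u -> relay -> v."""
    node_feats = {v: ("node", None) for v in range(num_nodes)}
    new_edges = []
    for i, (u, v, feat) in enumerate(edges):
        relay = num_nodes + i                  # fresh node id per edge
        node_feats[relay] = ("edge", feat)     # edge feature lives on a node
        new_edges += [(u, relay), (relay, v)]
    return node_feats, new_edges

feats, new_edges = edge_to_node_transform(3, [(0, 1, "r1"), (1, 2, "r2")])
print(new_edges)  # [(0, 3), (3, 1), (1, 4), (4, 2)]
```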

Empowering Convolutional Neural Nets with MetaSin Activation
Farnood Salehi Tunc Ozan Aydin André Gaillard Guglielmo Camporese Yuxuan Wang



Research question: ReLU networks remain the default choice in image prediction despite their bias towards learning low-frequency information and their difficulty reproducing high-frequency visual detail.
Motivation: Sin networks have shown promising results in learning implicit representations of visual data, but training them in practically relevant settings is difficult.
Method: Replace a baseline network's existing activations with a novel ensemble function with trainable parameters, the MetaSin activation.
Results: The proposed MetaSin activation can be trained reliably without intricate initialization schemes and yields consistently lower test loss than alternatives. New state-of-the-art results are set in Monte Carlo denoising and image resampling through a knowledge-distillation-based training procedure.

ReLU networks have remained the default choice for models in the area of image prediction despite their well-established spectral bias towards learning low frequencies faster, and consequently their difficulty in reproducing high-frequency visual details. As an alternative, sin networks showed promising results in learning implicit representations of visual data. However, training these networks in practically relevant settings has proved difficult, requiring careful initialization and contending with inconsistent gradients and degenerate local minima. In this work, we instead propose replacing a baseline network’s existing activations with a novel ensemble function with trainable parameters. The proposed MetaSin activation can be trained reliably without requiring intricate initialization schemes, and results in consistently lower test loss compared to alternatives. We demonstrate our method in the areas of Monte-Carlo denoising and image resampling where we set new state-of-the-art through a knowledge distillation based training procedure. We present ablations on hyper-parameter settings, comparisons with alternative activation function formulations, and discuss the use of our method in other domains, such as image classification.
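
A rough guess at the flavor of such an activation: a trainable ensemble that mixes several sinusoidal components into the unit's response, letting the network represent high-frequency detail while keeping a standard nonlinearity as the base. The exact MetaSin parameterization is defined in the paper; everything below (the class name, the ReLU base term, initial scales) is my assumption.

```python
import torch
import torch.nn as nn

class MetaSinLikeActivation(nn.Module):
    """Trainable ensemble of sinusoids added to a ReLU base (a sketch)."""
    def __init__(self, n_terms=4):
        super().__init__()
        self.amp = nn.Parameter(0.1 * torch.randn(n_terms))   # amplitudes
        self.freq = nn.Parameter(torch.randn(n_terms))        # frequencies
        self.phase = nn.Parameter(torch.zeros(n_terms))       # phases

    def forward(self, x):
        sins = self.amp * torch.sin(self.freq * x.unsqueeze(-1) + self.phase)
        return torch.relu(x) + sins.sum(dim=-1)

act = MetaSinLikeActivation()
print(act(torch.randn(2, 5)).shape)  # torch.Size([2, 5])
```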

EICIL: Joint Excitatory Inhibitory Cycle Iteration Learning for Deep Spiking Neural Networks
Zihang Shao Xuanye Fang Yaxin Li Chaoran Feng Jiangrong Shen Qi Xu



Research question: Addressing the limitations of traditional training methods for deep spiking neural networks, which rely on strategies such as pre-training and fine-tuning, indirect coding and reconstruction, and approximate gradients.
Motivation: Traditional deep SNN training strategies lack a complete training model and require gradient approximation, motivating a new learning approach: joint excitatory inhibitory cycle iteration learning (EICIL).
Method: By organically embedding the excitatory and inhibitory behavior patterns into one framework, the proposed EICIL significantly improves the bio-mimicry and adaptability of spiking neuron models and expands their representation space.
Results: Extensive experiments comparing EICIL with traditional learning methods show that EICIL outperforms them on various datasets such as CIFAR10 and CIFAR100, revealing the crucial role of integrating both behaviors during training.

Spiking neural networks (SNNs) have undergone continuous development and extensive study for decades, leading to increased biological plausibility and optimal energy efficiency. However, traditional training methods for deep SNNs have some limitations, as they rely on strategies such as pre-training and fine-tuning, indirect coding and reconstruction, and approximate gradients. These strategies lack a complete training model and require gradient approximation. To overcome these limitations, we propose a novel learning method named Joint Excitatory Inhibitory Cycle Iteration learning for Deep Spiking Neural Networks (EICIL) that integrates both excitatory and inhibitory behaviors inspired by the signal transmission of biological neurons. By organically embedding these two behavior patterns into one framework, the proposed EICIL significantly improves the bio-mimicry and adaptability of spiking neuron models, as well as expands the representation space of spiking neurons. Extensive experiments based on EICIL and traditional learning methods demonstrate that EICIL outperforms traditional methods on various datasets, such as CIFAR10 and CIFAR100, revealing the crucial role of the learning approach that integrates both behaviors during training.

NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning
Yun Yi Haokui Zhang Rong Xiao Nannan Wang Xiaoyu Wang



Research question: How to effectively model and learn representations of neural networks themselves, so as to predict a network's target attributes without actual training and deployment.
Motivation: As deep learning models are increasingly applied in real-world settings, the need for representation learning over neural networks grows; effective representations can predict a network's target attributes and thereby streamline network design and deployment.
Method: Revisit the Transformer and compare it with graph neural networks (GNNs), analyzing their different architectural characteristics. Propose NAR-Former V2, a modified Transformer-based universal neural network representation learning model that learns effective representations from both cell-structured networks and entire networks. Specifically, the network is treated as a graph and encoded into a sequence by a straightforward tokenizer; the inductive representation learning capability of GNNs is incorporated into the Transformer so it generalizes better to unseen architectures; and a series of simple yet effective modifications further enhance the Transformer's ability to learn representations from graph structures.
Results: In encoding entire networks and predicting latency, the method surpasses the GNN-based method NNLP by a significant margin on the NNLQP dataset; for accuracy prediction on the cell-structured NASBench101 and NASBench201 datasets, it achieves performance comparable to other state-of-the-art methods.

As more deep learning models are being applied in real-world applications, there is a growing need for modeling and learning the representations of neural networks themselves. An effective representation can be used to predict target attributes of networks without the need for actual training and deployment procedures, facilitating efficient network design and deployment. Recently, inspired by the success of Transformer, some Transformer-based representation learning frameworks have been proposed and achieved promising performance in handling cell-structured models. However, graph neural network (GNN) based approaches still dominate the field of learning representation for the entire network. In this paper, we revisit the Transformer and compare it with GNN to analyze their different architectural characteristics. We then propose a modified Transformer-based universal neural network representation learning model NAR-Former V2. It can learn efficient representations from both cell-structured networks and entire networks. Specifically, we first take the network as a graph and design a straightforward tokenizer to encode the network into a sequence. Then, we incorporate the inductive representation learning capability of GNN into Transformer, enabling Transformer to generalize better when encountering unseen architectures. Additionally, we introduce a series of simple yet effective modifications to enhance the ability of the Transformer to learn representations from graph structures. In encoding entire networks and then predicting the latency, our proposed method surpasses the GNN-based method NNLP by a significant margin on the NNLQP dataset. Furthermore, regarding accuracy prediction on the cell-structured NASBench101 and NASBench201 datasets, our method achieves highly comparable performance to other state-of-the-art methods. The code is available at https://github.com/yuny220/NAR-Former-V2.

Graph Contrastive Learning with Stable and Scalable Spectral Encoding
Deyu Bo Yuan Fang Yang Liu Chuan Shi



Research question: Addressing the limitations of spatial view generation in traditional graph contrastive learning, and the shortcomings of existing spectral-based graph views, which either ignore positional encoding information or incur high complexity when handling the instability of spectral features.
Motivation: Traditional graph contrastive learning generates views in the spatial domain, but the spectral domain has recently been found to play a key role in complementing spatial views. Existing spectral-based views, however, either ignore the eigenvectors that encode valuable positional information or face high complexity when addressing the instability of spectral features.
Method: First design an informative, stable, and scalable spectral encoder, EigenMLP, to learn effective representations from spectral features. Then propose a spatial-spectral contrastive framework (Sp$^{2}$GCL) to capture the consistency between the spatial information encoded by graph neural networks and the spectral information learned by EigenMLP, effectively fusing the two graph views.
Results: Experiments show that the method not only learns effective graph representations but is also 2 to 10 times faster than other spectral-based methods on node- and graph-level datasets.

Graph contrastive learning (GCL) aims to learn representations by capturing the agreements between different graph views. Traditional GCL methods generate views in the spatial domain, but it has been recently discovered that the spectral domain also plays a vital role in complementing spatial views. However, existing spectral-based graph views either ignore the eigenvectors that encode valuable positional information or suffer from high complexity when trying to address the instability of spectral features. To tackle these challenges, we first design an informative, stable, and scalable spectral encoder, termed EigenMLP, to learn effective representations from the spectral features. Theoretically, EigenMLP is invariant to the rotation and reflection transformations on eigenvectors and robust against perturbations. Then, we propose a spatial-spectral contrastive framework (Sp$^{2}$GCL) to capture the consistency between the spatial information encoded by graph neural networks and the spectral information learned by EigenMLP, thus effectively fusing these two graph views. Experiments on the node- and graph-level datasets show that our method not only learns effective graph representations but also achieves a 2--10x speedup over other spectral-based methods.
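
EigenMLP's exact parameterization is given in the paper; the sketch below only illustrates one standard way to obtain a piece of the invariance stated above, namely invariance to eigenvector sign flips: apply the same MLP to $+v$ and $-v$ and sum. Class and variable names are my own.

```python
import torch
import torch.nn as nn

class SignInvariantSpectralEncoder(nn.Module):
    """Encodes eigenvectors so the output is unchanged under v -> -v."""
    def __init__(self, hidden=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, eigvecs):  # eigvecs: (n_nodes, k)
        v = eigvecs.unsqueeze(-1)        # process each entry independently
        h = self.phi(v) + self.phi(-v)   # symmetric in the sign of v
        return h.sum(dim=1)              # pool over the k eigenvectors

enc = SignInvariantSpectralEncoder()
v = torch.randn(10, 4)
assert torch.allclose(enc(v), enc(-v), atol=1e-6)  # sign invariance holds
```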

Circuit as Set of Points
Jialv Zou Xinggang Wang JiaHao Guo Wenyu Liu Qian Zhang Chang Huang



Research question: As circuit design sizes grow rapidly, how to quickly evaluate placement, the most time-consuming part of the physical design process.
Motivation: Existing approaches either transform circuit designs into images with hand-crafted methods and extract features with convolutional neural networks (CNNs), which is limited by the quality of the hand-crafted methods and precludes end-to-end training, or treat the design as a graph and extract features with graph neural networks (GNNs), which requires time-consuming preprocessing.
Method: Propose a new perspective on circuit design that treats circuit components as a point cloud and uses Transformer-based point cloud perception methods to extract features from the circuit. This extracts features directly from raw data without any preprocessing, allows end-to-end training, and yields high performance.
Results: Experiments show state-of-the-art performance on congestion prediction on the CircuitNet and ISPD2015 datasets and on design rule check (DRC) violation prediction on CircuitNet. The method builds a bridge between relatively mature point cloud perception methods and fast-developing EDA algorithms, allowing more collective intelligence to be brought to bear on the problem.

As the size of circuit designs continues to grow rapidly, artificial intelligence technologies are being extensively used in Electronic Design Automation (EDA) to assist with circuit design. Placement and routing are the most time-consuming parts of the physical design process, and how to quickly evaluate the placement has become a hot research topic. Prior works either transformed circuit designs into images using hand-crafted methods and then used Convolutional Neural Networks (CNN) to extract features, which are limited by the quality of the hand-crafted methods and could not achieve end-to-end training, or treated the circuit design as a graph structure and used Graph Neural Networks (GNN) to extract features, which require time-consuming preprocessing. In our work, we propose a novel perspective for circuit design by treating circuit components as point clouds and using Transformer-based point cloud perception methods to extract features from the circuit. This approach enables direct feature extraction from raw data without any preprocessing, allows for end-to-end training, and results in high performance. Experimental results show that our method achieves state-of-the-art performance in congestion prediction tasks on both the CircuitNet and ISPD2015 datasets, as well as in design rule check (DRC) violation prediction tasks on the CircuitNet dataset. Our method establishes a bridge between the relatively mature point cloud perception methods and the fast-developing EDA algorithms, enabling us to leverage more collective intelligence to solve this task. To facilitate the research of open EDA design, source codes and pre-trained models are released at https://github.com/hustvl/circuitformer.

SAME: Uncovering GNN Black Box with Structure-aware Shapley-based Multipiece Explanations
Ziyuan Ye Rihan Huang Qilin Wu Quanying Liu



Research question: Addressing the poor explainability of graph neural networks (GNNs) by providing an economical way to uncover the models' inner workings.
Motivation: Although many GNN explanation variants achieve state-of-the-art explaining results on a diverse set of benchmarks, they rarely provide theoretical analysis of their inherent properties and explanatory capability.
Method: Propose the Structure-Aware Shapley-based Multipiece Explanation (SAME) method, which uses an expansion-based Monte Carlo tree search to explore multi-grained structure-aware connected substructures, then optimizes combinations of distinct single substructures so that the explanation is informative of the graph properties.
Results: Extensive experiments on real-world and synthetic benchmarks show that SAME improves the previous state-of-the-art fidelity performance by 12.9% on BBBP, 7.01% on MUTAG, 42.3% on Graph-SST2, 38.9% on Graph-SST5, 11.3% on BA-2Motifs, and 18.2% on BA-Shapes.

Post-hoc explanation techniques on graph neural networks (GNNs) provide economical solutions for opening the black-box graph models without model retraining. Many GNN explanation variants have achieved state-of-the-art explaining results on a diverse set of benchmarks, while they rarely provide theoretical analysis for their inherent properties and explanatory capability. In this work, we propose $\underline{\text{S}}$tructure-$\underline{\text{A}}$ware Shapley-based $\underline{\text{M}}$ultipiece $\underline{\text{E}}$xplanation (SAME) method to address the structure-aware feature interactions challenges for GNNs explanation. Specifically, SAME leverages an expansion-based Monte Carlo tree search to explore the multi-grained structure-aware connected substructure. Afterward, the explanation results are encouraged to be informative of the graph properties by optimizing the combination of distinct single substructures. With the consideration of fair feature interactions in the process of investigating multiple connected important substructures, the explanation provided by SAME has the potential to be as explainable as the theoretically optimal explanation obtained by the Shapley value within polynomial time. Extensive experiments on real-world and synthetic benchmarks show that SAME improves the previous state-of-the-art fidelity performance by 12.9\% on BBBP, 7.01\% on MUTAG, 42.3\% on Graph-SST2, 38.9\% on Graph-SST5, 11.3\% on BA-2Motifs and 18.2\% on BA-Shapes under the same testing condition. Code is available at https://github.com/same2023neurips/same.

Meta-learning families of plasticity rules in recurrent spiking networks using simulation-based inference
Basile Confavreux Poornima Ramesh Pedro J. Goncalves Jakob H. Macke Tim P. Vogels



Research question: Finding and understanding multiple, co-active plasticity rules in biological networks.
Motivation: The search for plasticity rules has so far been driven mainly by human intuition, with limited success for multiple, co-active rules in biological networks.
Method: Develop a simulation-based inference (SBI) method that sequentially filters plasticity rules through an increasingly fine mesh of constraints that can be modified on the fly.
Results: The method infers entire families of complex, co-active plasticity rules in spiking networks. Flexibly parameterized pairwise (Hebbian) rules are considered first, and the inferred rule set contains solutions that extend and refine, and also reject, predictions from mean-field theory. The search space is then expanded by modelling rules as multi-layer perceptrons combining several plasticity-relevant factors, such as weight, voltage, triplets, and co-dependency. Out of millions of possible rules, thousands of unique rule combinations are identified that satisfy biological constraints such as plausible activity and weight dynamics. The resulting rules can serve as a starting point for investigating specific network computations and already suggest refinements and predictions for classical experimental approaches to plasticity. This flexible approach to exploring complex plasticity rules in large recurrent spiking networks is the most advanced search tool to date for understanding and predicting the plasticity mechanisms underlying brain function.

There is substantial experimental evidence that learning and memory-related behaviours rely on local synaptic changes, but the search for distinct plasticity rules has been driven by human intuition, with limited success for multiple, co-active plasticity rules in biological networks. More recently, automated meta-learning approaches have been used in simplified settings, such as rate networks and small feed-forward spiking networks. Here, we develop a simulation-based inference (SBI) method for sequentially filtering plasticity rules through an increasingly fine mesh of constraints that can be modified on-the-fly. This method, _filter SBI_, allows us to infer entire families of complex and co-active plasticity rules in spiking networks. We first consider flexibly parameterized doublet (Hebbian) rules, and find that the set of inferred rules contains solutions that extend and refine, and also reject, predictions from mean-field theory. Next, we expand the search space of plasticity rules by modelling them as multi-layer perceptrons that combine several plasticity-relevant factors, such as weight, voltage, triplets and co-dependency. Out of the millions of possible rules, we identify thousands of unique rule combinations that satisfy biological constraints like plausible activity and weight dynamics. The resulting rules can be used as a starting point for further investigations into specific network computations, and already suggest refinements and predictions for classical experimental approaches on plasticity. This flexible approach for principled exploration of complex plasticity rules in large recurrent spiking networks presents the most advanced search tool to date for enabling robust predictions and deep insights into the plasticity mechanisms underlying brain function.

Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models
Yubin Shi Yixuan Chen Mingzhi Dong Xiaochen Yang Dongsheng Li Yujiang Wang Robert P. Dick Qin Lv Yingying Zhao Fan Yang Tun Lu Ning Gu Li Shang



Research question: Studying the learning dynamics of over-parameterized deep learning models to obtain a more efficient and fruitful training strategy.
Motivation: Despite their prevalence in the deep learning community, over-parameterized models convey high computational costs for proper training.
Method: Scale down to network modules (such as heads in self-attention models) to observe the implicit association between each module's trainability and its learning pattern, and introduce a new concept, the modular neural tangent kernel (mNTK); based on this, propose Modular Adaptive Training (MAT), which selectively updates modules whose principal mNTK eigenvalue exceeds a dynamic threshold.
Results: Experiments show that MAT significantly reduces the computational cost of model training and further improves performance through its partial-update strategy.

Despite their prevalence in deep-learning communities, over-parameterized models convey high demands of computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and fruitful training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while those miniature ones may impact generalization negatively. Inspired by the discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT) to update those modules with their $\lambda_{\max}$ exceeding a dynamic threshold selectively, concentrating the model on learning common features and ignoring those inconsistent ones. Unlike most existing training schemes with a complete BP cycle across all network modules, MAT can significantly save computations by its partially-updating strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the accuracy of baselines.
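
A minimal sketch of estimating a module-wise NTK's principal eigenvalue $\lambda_{\max}$ from per-example gradients, in the spirit of the mNTK described above: the $B \times B$ Gram matrix of each example's output gradient with respect to one module's parameters. Illustrative code, not the paper's; the toy network and the function name are my own.

```python
import torch
import torch.nn as nn

def mntk_lambda_max(per_example_outputs, module_params):
    """Largest eigenvalue of the empirical module-wise NTK Gram matrix."""
    rows = []
    for out in per_example_outputs:  # one scalar output per example
        g = torch.autograd.grad(out, module_params, retain_graph=True)
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    jac = torch.stack(rows)          # B x P module-wise Jacobian
    gram = jac @ jac.T               # B x B empirical mNTK
    return torch.linalg.eigvalsh(gram)[-1].item()

net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x = torch.randn(16, 4)
outs = net(x).squeeze(-1)
for name, module in [("layer0", net[0]), ("layer2", net[2])]:
    lam = mntk_lambda_max(outs, list(module.parameters()))
    print(name, lam)  # MAT would update only modules above a dynamic threshold
```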

Neural Sculpting: Uncovering hierarchically modular task structure in neural networks through pruning and network analysis
Shreyas Malakarjun Patil Loizos Michael Constantine Dovrolis



Research question: If a task is learned by a sufficiently deep neural network, how can the underlying hierarchy of sub-functions in that task be uncovered?
Motivation: Natural target functions and tasks typically exhibit hierarchical modularity, and hierarchically modular, inherently sparse networks offer benefits such as learning efficiency, generalization, multi-task learning, and transfer; however, identifying a given task's sub-functions and their hierarchical structure is challenging.
Method: Starting from the domain of Boolean functions, where hierarchical modularity is easier to determine, propose an approach based on iterative unit and edge pruning during training, combined with network analysis for module detection and hierarchy inference.
Results: The method uncovers the hierarchical modularity of a wide range of Boolean functions and of two vision tasks based on the MNIST digits dataset.

Natural target functions and tasks typically exhibit hierarchical modularity -- they can be broken down into simpler sub-functions that are organized in a hierarchy. Such sub-functions have two important features: they have a distinct set of inputs (input-separability) and they are reused as inputs higher in the hierarchy (reusability). Previous studies have established that hierarchically modular neural networks, which are inherently sparse, offer benefits such as learning efficiency, generalization, multi-task learning, and transfer. However, identifying the underlying sub-functions and their hierarchical structure for a given task can be challenging. The high-level question in this work is: if we learn a task using a sufficiently deep neural network, how can we uncover the underlying hierarchy of sub-functions in that task? As a starting point, we examine the domain of Boolean functions, where it is easier to determine whether a task is hierarchically modular. We propose an approach based on iterative unit and edge pruning (during training), combined with network analysis for module detection and hierarchy inference. Finally, we demonstrate that this method can uncover the hierarchical modularity of a wide range of Boolean functions and two vision tasks based on the MNIST digits dataset.

The Contextual Lasso: Sparse Linear Models via Deep Neural Networks
Ryan Thompson Amir Dezfouli Robert Kohn



Research question: How to increase the flexibility of sparse linear models so they can compete with black-box models such as deep neural networks in interpretable machine learning.
Motivation: As predictive models permeate decision-making in many domains, interpretable machine learning is of growing importance; yet sparse linear models are far less flexible as functions of their input features than black-box models such as deep neural networks.
Method: Propose the contextual lasso, a new statistical estimator that fits a sparse linear model to the explanatory features such that the sparsity pattern and coefficients vary as a function of the contextual features; this function is learned nonparametrically by a deep neural network. To attain sparse coefficients, the network is trained with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models.
Results: Extensive experiments show that the learned models remain highly transparent and can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.

Sparse linear models are one of several core tools for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which are candidates for inclusion as variables in an interpretable model, and contextual features, which select from the candidate variables and determine their effects. This dichotomy leads us to the contextual lasso, a new statistical estimator that fits a sparse linear model to the explanatory features such that the sparsity pattern and coefficients vary as a function of the contextual features. The fitting process learns this function nonparametrically via a deep neural network. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. An extensive suite of experiments on real and synthetic data suggests that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.
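
A sketch of the kind of operation such a projection layer needs. The exact layer is specified in the paper; below is the standard Euclidean projection onto an $\ell_1$ ball (Duchi et al., 2008), which maps a dense coefficient vector produced by the network onto a sparse, constrained one.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : ||w||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - (css - radius) / ks > 0)[0][-1]
    tau = (css[rho] - radius) / (rho + 1)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

coef = np.array([0.9, -0.1, 0.05, 1.4, -0.02])   # a dense network output
print(project_l1_ball(coef, radius=1.0))         # -> [0.25, 0, 0, 0.75, 0]
```

The soft-thresholding in the last line is what produces exact zeros, which is why a projection of this kind yields genuinely sparse per-context coefficients rather than merely small ones.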

Optimal Block-wise Asymmetric Graph Construction for Graph-based Semi-supervised Learning
Zixing Song Yifei Zhang Irwin King



Research question: How to construct an optimal affinity graph for graph-based semi-supervised learning (GSSL), the phase that strongly shapes the subsequent label inference.
Motivation: While numerous algorithms have been developed for label inference, the crucial graph construction phase has received comparatively little attention despite its significant influence on the subsequent phase.
Method: Present a theoretically motivated optimal asymmetric graph structure for the label inference phase; unlike existing graph construction methods, it differentiates the distinct roles of labeled and unlabeled nodes, and an efficient block-wise graph learning algorithm with a global convergence guarantee is designed accordingly.
Results: Extensive experiments on synthetic and real-world datasets demonstrate superiority over state-of-the-art graph construction methods in GSSL, along with additional benefits such as enhanced robustness to noisy node features.

Graph-based semi-supervised learning (GSSL) serves as a powerful tool to model the underlying manifold structures of samples in high-dimensional spaces. It involves two phases: constructing an affinity graph from available data and inferring labels for unlabeled nodes on this graph. While numerous algorithms have been developed for label inference, the crucial graph construction phase has received comparatively less attention, despite its significant influence on the subsequent phase. In this paper, we present an optimal asymmetric graph structure for the label inference phase with theoretical motivations. Unlike existing graph construction methods, we differentiate the distinct roles that labeled nodes and unlabeled nodes could play. Accordingly, we design an efficient block-wise graph learning algorithm with a global convergence guarantee. Other benefits induced by our method, such as enhanced robustness to noisy node features, are explored as well. Finally, we perform extensive experiments on synthetic and real-world datasets to demonstrate its superiority to the state-of-the-art graph construction methods in GSSL.

Correlative Information Maximization: A Biologically Plausible Approach to Supervised Deep Neural Networks without Weight Symmetry
Bariscan Bozkurt Cengiz Pehlevan Alper Tunga Erdogan



Research question: Does the brain employ supervised learning mechanisms akin to backpropagation?
Motivation: The backpropagation algorithm has been remarkably successful in training large-scale artificial neural networks, but its biological plausibility has been strongly criticized.
Method: Propose correlative information maximization between layer activations as an alternative normative approach to describe signal propagation in biological neural networks.
Results: The approach addresses the biological plausibility concerns about conventional artificial neural networks and the backpropagation algorithm, and resolves the weight symmetry problem between forward and backward signal propagation paths, offering a route to emulating more biologically realistic neural networks.

The backpropagation algorithm has experienced remarkable success in training large-scale artificial neural networks; however, its biological plausibility has been strongly criticized, and it remains an open question whether the brain employs supervised learning mechanisms akin to it. Here, we propose correlative information maximization between layer activations as an alternative normative approach to describe the signal propagation in biological neural networks in both forward and backward directions. This new framework addresses many concerns about the biological plausibility of conventional artificial neural networks and the backpropagation algorithm. The coordinate descent-based optimization of the corresponding objective, combined with the mean square error loss function for fitting labeled supervision data, gives rise to a neural network structure that emulates a more biologically realistic network of multi-compartment pyramidal neurons with dendritic processing and lateral inhibitory neurons. Furthermore, our approach provides a natural resolution to the weight symmetry problem between forward and backward signal propagation paths, a significant critique against the plausibility of the conventional backpropagation algorithm. This is achieved by leveraging two alternative, yet equivalent forms of the correlative mutual information objective. These alternatives intrinsically lead to forward and backward prediction networks without weight symmetry issues, providing a compelling solution to this long-standing challenge.

AI for Interpretable Chemistry: Predicting Radical Mechanistic Pathways via Contrastive Learning
Mohammadamin Tavakoli Pierre Baldi Ann Marie Carlton Yinting Chiu Alexander Shmakov David Van Vranken



Research question: Current deep learning reaction predictors rely mainly on reactions from the US Patent Office, resulting in predictions that lack interpretability and generalize poorly to other chemistry domains, such as radical and atmospheric chemistry.
Motivation: To address these problems, we propose a new reaction prediction system, RMechRP, which combines contrastive learning with mechanistic pathways, the most interpretable representation of chemical reactions.
Method: Develop and train multiple deep learning models using RMechDB, a public database of radical reactions, to establish the first benchmark for predicting radical reactions.
Results: Experiments demonstrate the effectiveness of RMechRP in providing accurate and interpretable predictions of radical reactions, and its potential for various applications in atmospheric chemistry.

Deep learning-based reaction predictors have undergone significant architectural evolution. However, their reliance on reactions from the US Patent Office results in a lack of interpretable predictions and limited generalizability to other chemistry domains, such as radical and atmospheric chemistry. To address these challenges, we introduce a new reaction predictor system, RMechRP, that leverages contrastive learning in conjunction with mechanistic pathways, the most interpretable representation of chemical reactions. Specifically designed for radical reactions, RMechRP provides different levels of interpretation of chemical reactions. We develop and train multiple deep-learning models using RMechDB, a public database of radical reactions, to establish the first benchmark for predicting radical reactions. Our results demonstrate the effectiveness of RMechRP in providing accurate and interpretable predictions of radical reactions, and its potential for various applications in atmospheric chemistry.

A Unified Framework for U-Net Design and Analysis
Christopher Williams Fabian Falck George Deligiannidis Christopher C. Holmes Arnaud Doucet Saifuddin Syed



Research question: Designing and analysing the U-Net architecture to understand its use on continuous signals such as images and partial differential equations (PDEs).
Motivation: Although U-Nets are the go-to architecture for many tasks, their design and architecture are understudied.
Method: Provide a framework for designing and analysing general U-Net architectures, including theoretical results that characterise the roles of the encoder and decoder, their high-resolution scaling limits, and their conjugacy to ResNets via preconditioning. Also propose Multi-ResNets, U-Nets with a simplified, wavelet-based encoder without learnable parameters.
Results: Experiments show that Multi-ResNets achieve competitive and often superior performance compared with classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. The framework also paves the way for studying the theoretical properties of U-Nets and designing natural, scalable neural architectures for a multitude of problems.

U-Nets are a go-to neural architecture across numerous tasks for continuous signals on a square such as images and Partial Differential Equations (PDE); however, their design and architecture are understudied. In this paper, we provide a framework for designing and analysing general U-Net architectures. We present theoretical results which characterise the role of the encoder and decoder in a U-Net, their high-resolution scaling limits and their conjugacy to ResNets via preconditioning. We propose Multi-ResNets, U-Nets with a simplified, wavelet-based encoder without learnable parameters. Further, we show how to design novel U-Net architectures which encode function constraints, natural bases, or the geometry of the data. In diffusion models, our framework enables us to identify that high-frequency information is dominated by noise exponentially faster, and show how U-Nets with average pooling exploit this. In our experiments, we demonstrate how Multi-ResNets achieve competitive and often superior performance compared to classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. Our U-Net framework paves the way to study the theoretical properties of U-Nets and design natural, scalable neural architectures for a multitude of problems beyond the square.

TempME: Towards the Explainability of Temporal Graph Neural Networks via Motif Discovery
Jialin Chen Zhitao Ying



Research question: The predictions of temporal graph neural networks (TGNNs) are typically governed by a set of recurring substructures in the graph, known as temporal motifs, yet it remains uncertain which motifs the model treats as the significant indications triggering a given prediction.
Motivation: Resolving this is a key challenge for the explainability and trustworthiness of current TGNNs, and for making their behavior understandable.
Method: Propose the Temporal Motifs Explainer (TempME), which, derived from the information bottleneck principle, extracts the most interaction-related motifs while minimizing the amount of contained information, preserving the sparsity and succinctness of the explanation.
Results: Events in the explanations generated by TempME are verified to be more spatiotemporally correlated than those of existing approaches, providing more understandable insights. Extensive experiments on six real-world datasets validate the superiority of TempME, with up to 8.21% higher explanation accuracy and up to 22.96% improvement in the prediction Average Precision of current TGNNs.

Temporal graphs are widely used to model dynamic systems with time-varying interactions. In real-world scenarios, the underlying mechanisms of generating future interactions in dynamic systems are typically governed by a set of recurring substructures within the graph, known as temporal motifs. Despite the success and prevalence of current temporal graph neural networks (TGNN), it remains uncertain which temporal motifs are recognized as the significant indications that trigger a certain prediction from the model, which is a critical challenge for advancing the explainability and trustworthiness of current TGNNs. To address this challenge, we propose a novel approach, called **Temp**oral **M**otifs **E**xplainer (**TempME**), which uncovers the most pivotal temporal motifs guiding the prediction of TGNNs. Derived from the information bottleneck principle, TempME extracts the most interaction-related motifs while minimizing the amount of contained information to preserve the sparsity and succinctness of the explanation. Events in the explanations generated by TempME are verified to be more spatiotemporally correlated than those of existing approaches, providing more understandable insights. Extensive experiments validate the superiority of TempME, with up to 8.21% increase in terms of explanation accuracy across six real-world datasets and up to 22.96% increase in boosting the prediction Average Precision of current TGNNs.

Predicting Global Label Relationship Matrix for Graph Neural Networks under Heterophily
Langzhang Liang Xiangjing Hu Zenglin Xu Zixing Song Irwin King



Research question: Existing graph neural networks (GNNs) may struggle with heterophilous graphs, where nodes with different labels are more likely to be connected.
Motivation: To address this issue, we propose a generic GNN applicable to both homophilous and heterophilous graphs, the Low-Rank Graph Neural Network (LRGNN).
Method: Predict the label relationship matrix by solving a robust low-rank matrix approximation problem, since prior research has shown that low-rank approximation achieves perfect recovery under certain conditions.
Results: Experiments show that the solution bears a strong resemblance to the label relationship matrix, offering two advantages for graph modeling: a block-diagonal structure and varying distributions of within-class and between-class entries.

Graph Neural Networks (GNNs) have been shown to achieve remarkable performance on node classification tasks by exploiting both graph structures and node features. The majority of existing GNNs rely on the implicit homophily assumption. Recent studies have demonstrated that GNNs may struggle to model heterophilous graphs where nodes with different labels are more likely connected. To address this issue, we propose a generic GNN applicable to both homophilous and heterophilous graphs, namely Low-Rank Graph Neural Network (LRGNN). Our analysis demonstrates that a signed graph's global label relationship matrix has a low rank. This insight inspires us to predict the label relationship matrix by solving a robust low-rank matrix approximation problem, as prior research has proven that low-rank approximation could achieve perfect recovery under certain conditions. The experimental results reveal that the solution bears a strong resemblance to the label relationship matrix, presenting two advantages for graph modeling: a block diagonal structure and varying distributions of within-class and between-class entries.
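
A toy illustration of the low-rank observation above: for a signed same/different label relationship matrix $R$ with $R_{ij}=+1$ if nodes $i$ and $j$ share a label and $-1$ otherwise, we have $R = ss^{\top}$ with $s \in \{-1,+1\}^n$, so $R$ has rank one, and a truncated SVD recovers it from noisy entries. The paper solves a robust low-rank approximation problem; the plain truncated SVD here is only a stand-in for that step.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=40)
s = np.where(labels == 1, 1.0, -1.0)
R = np.outer(s, s)                               # rank-1 relationship matrix
noisy = R + 0.5 * rng.standard_normal(R.shape)   # corrupted observation

U, sv, Vt = np.linalg.svd(noisy)
R_hat = sv[0] * np.outer(U[:, 0], Vt[0])         # best rank-1 approximation
print("sign agreement:", (np.sign(R_hat) == R).mean())  # close to 1.0
```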

Efficiently incorporating quintuple interactions into geometric deep learning force fields
Zun Wang Guoqing Liu Yichi Zhou Tong Wang Bin Shao



Research question: How to efficiently incorporate five-body interactions into machine learning force fields (MLFFs) to improve model expressiveness and accuracy.
Motivation: Existing models can explicitly include up to four-body interactions, but five-body interactions, which are relevant in various fields, remain challenging to incorporate efficiently into MLFFs.
Method: Propose the quintuple network (QuinNet), an end-to-end graph neural network that efficiently expresses many-body quantum interactions up to five-body with ab initio accuracy; by analyzing the topology of diverse many-body interactions, the model architecture is designed to represent these interactions efficiently and explicitly.
Results: QuinNet is evaluated on public small-molecule datasets such as MD17 and its revised version, where it is competitive with other state-of-the-art models, and it surpasses many leading models on larger and more complex molecular systems such as MD22 and Chignolin without increasing computational complexity. QuinNet is also used as a force field for molecular dynamics (MD) simulations to demonstrate its accuracy and stability, and an ablation study elucidates the significance of five-body interactions.

Machine learning force fields (MLFFs) have instigated a groundbreaking shift in molecular dynamics (MD) simulations across a wide range of fields, such as physics, chemistry, biology, and materials science. Incorporating higher order many-body interactions can enhance the expressiveness and accuracy of models. Recent models have achieved this by explicitly including up to four-body interactions. However, five-body interactions, which have relevance in various fields, are still challenging to incorporate efficiently into MLFFs. In this work, we propose the quintuple network (QuinNet), an end-to-end graph neural network that efficiently expresses many-body interactions up to five-body interactions with \emph{ab initio} accuracy. By analyzing the topology of diverse many-body interactions, we design the model architecture to efficiently and explicitly represent these interactions. We evaluate QuinNet on public datasets of small molecules, such as MD17 and its revised version, and show that it is compatible with other state-of-the-art models on these benchmarks. Moreover, QuinNet surpasses many leading models on larger and more complex molecular systems, such as MD22 and Chignolin, without increasing the computational complexity. We also use QuinNet as a force field for molecular dynamics (MD) simulations to demonstrate its accuracy and stability, and conduct an ablation study to elucidate the significance of five-body interactions. We open source our implementation at https://github.com/Zun-Wang/QuinNet.

Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory
Minhak Song Chulhee Yun



Research question: Empirically studying the evolution of the largest eigenvalue of the loss Hessian, known as sharpness, along the gradient descent trajectory, and the Edge of Stability (EoS) phenomenon.
Motivation: Sharpness increases in the early phase of training (progressive sharpening) and eventually saturates close to the threshold of 2/(step size); empirical studies show that when the EoS phenomenon occurs, different gradient descent trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization.
Method: Rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and for a single-neuron nonlinear network trained with a single data point.
Results: The trajectory alignment analysis establishes both the progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.

Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.
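
A sketch of how sharpness is typically measured in studies like this: power iteration on Hessian-vector products estimates the largest Hessian eigenvalue without ever forming the Hessian. This is a standard estimator rather than the authors' code; to observe EoS one would track this value along the GD trajectory and compare it with $2/\text{(step size)}$.

```python
import torch

def sharpness(loss, params, n_iter=50):
    """Largest loss-Hessian eigenvalue via power iteration on HVPs."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat)
    v = v / v.norm()
    lam = 0.0
    for _ in range(n_iter):
        hv = torch.autograd.grad(flat @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        lam = float(v @ hv)      # Rayleigh quotient (v has unit norm)
        v = hv / hv.norm()
    return lam

# Two-layer linear net on a single data point, echoing the setting above.
w1 = torch.randn(4, 3, requires_grad=True)
w2 = torch.randn(1, 4, requires_grad=True)
x, y = torch.randn(3), torch.tensor([1.0])
loss = ((w2 @ (w1 @ x) - y) ** 2).sum()
print(sharpness(loss, [w1, w2]))
```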

Learning and processing the ordinal information of temporal sequences in recurrent neural circuits
Xiaolong Zou Zhikun Chu Qinghai Guo Jie Cheng Bo Ho Si Wu Yuanyuan Mi



Research question: How recurrent neural circuits learn to represent the abstract order structure of temporal sequences, and how this representation of positional structure, disentangled from content, facilitates temporal sequence processing.
Motivation: Experimental data indicate that the brain's representations of ordinal information and of sequence content are disentangled, but the neural mechanism underlying this disentanglement remains largely unclear.
Method: With an appropriate learning protocol, a recurrent neural circuit learns a set of tree-structured attractor states to encode the corresponding tree-structured orders of given temporal sequences. This abstract temporal order template can then be bound with different contents, enabling flexible and robust temporal sequence processing.
Results: In a transfer learning task, reusing the temporal order template facilitates the acquisition of new temporal sequences with the same or similar ordinal structure. In a key-word spotting task, the attractor representation of order structure improves the robustness of temporal sequence discrimination when the ordinal information is the key to differentiating sequences.

Temporal sequence processing is fundamental in brain cognitive functions. Experimental data has indicated that the representations of ordinal information and contents of temporal sequences are disentangled in the brain, but the neural mechanism underlying this disentanglement remains largely unclear. Here, we investigate how recurrent neural circuits learn to represent the abstract order structure of temporal sequences, and how this disentangled representation of order structure from that of contents facilitates the processing of temporal sequences. We show that with an appropriate learning protocol, a recurrent neural circuit can learn a set of tree-structured attractor states to encode the corresponding tree-structured orders of given temporal sequences. This abstract temporal order template can then be bound with different contents, allowing for flexible and robust temporal sequence processing. Using a transfer learning task, we demonstrate that the reuse of a temporal order template facilitates the acquisition of new temporal sequences of the same or similar ordinal structure. Using a key-word spotting task, we demonstrate that the attractor representation of order structure improves the robustness of temporal sequence discrimination, if the ordinal information is the key to differentiate different sequences. We hope this study gives us insights into the neural mechanism of representing the ordinal information of temporal sequences in the brain, and helps us to develop brain-inspired temporal sequence processing algorithms.

Train Once and Explain Everywhere: Pre-training Interpretable Graph Neural Networks
Jun Yin Chaozhuo Li Hao Yan Jianxun Lian Senzhang Wang



Research question: How to train a graph neural network (GNN) model whose explanations generalize across different graphs.
Motivation: Existing interpretable GNNs are mostly dataset-specific and hard to generalize to different graphs. Motivated by the great success of recent pre-training techniques, we propose, for the first time, the Pre-training Interpretable Graph Neural Network ($\pi$-GNN), distilling the universal interpretability of GNNs by pre-training over synthetic graphs with ground-truth explanations.
Method: Introduce a structural pattern learning module to extract diverse universal structure patterns and integrate them to comprehensively represent graphs of different types; propose a hypergraph refining module that identifies the explanatory subgraph by combining the universal structure patterns with local edge interactions; finally, cascade a task-specific predictor with the pre-trained $\pi$-GNN model and fine-tune on downstream tasks.
Results: Extensive experiments show that $\pi$-GNN significantly surpasses the leading interpretable GNN baselines, with up to 9.98% interpretation improvement and 16.06% classification accuracy improvement. Moreover, $\pi$-GNN pre-trained on graph classification also achieves top-tier interpretation performance on node classification, further verifying its promising generalization across downstream tasks.

Intrinsic interpretable graph neural networks aim to provide transparent predictions by identifying the influential fraction of the input graph that guides the model prediction, i.e., the explanatory subgraph. However, most current interpretable GNNs are dataset-specific and hard to generalize to different graphs. A more generalizable GNN interpretation model which can effectively distill the universal structural patterns of different graphs has until now been unexplored. Motivated by the great success of recent pre-training techniques, we for the first time propose the Pre-training Interpretable Graph Neural Network ($\pi$-GNN) to distill the universal interpretability of GNNs by pre-training over synthetic graphs with ground-truth explanations. Specifically, we introduce a structural pattern learning module to extract diverse universal structure patterns and integrate them together to comprehensively represent the graphs of different types. Next, a hypergraph refining module is proposed to identify the explanatory subgraph by incorporating the universal structure patterns with local edge interactions. Finally, the task-specific predictor is cascaded with the pre-trained $\pi$-GNN model and fine-tuned over downstream tasks. Extensive experiments demonstrate that $\pi$-GNN significantly surpasses the leading interpretable GNN baselines with up to 9.98\% interpretation improvement and 16.06\% classification accuracy improvement. Meanwhile, $\pi$-GNN pre-trained on graph classification task also achieves the top-tier interpretation performance on node classification task, which further verifies its promising generalization performance among different downstream tasks. Our code and datasets are available at https://anonymous.4open.science/r/PI-GNN-F86C

Sheaf Hypergraph Networks
Iulia Duta Giulia Cassarà Fabrizio Silvestri Pietro Lio



Research question: How to better represent and process the complex interactions found in hypergraphs.
Motivation: Current approaches typically represent these interactions using hypergraphs alone, with limited effect.
Method: Introduce cellular sheaves for hypergraphs, a construction that adds extra structure to the conventional hypergraph while maintaining its local, higher-order connectivity, and on this basis develop two unique formulations of sheaf hypergraph Laplacians: linear and non-linear.
Results: Experiments show that this new representation achieves top results on multiple benchmark datasets for hypergraph node classification.

Higher-order relations are widespread in nature, with numerous phenomena involving complex interactions that extend beyond simple pairwise connections. As a result, advancements in higher-order processing can accelerate the growth of various fields requiring structured data. Current approaches typically represent these interactions using hypergraphs. We enhance this representation by introducing cellular sheaves for hypergraphs, a mathematical construction that adds extra structure to the conventional hypergraph while maintaining their local, higher-order connectivity. Drawing inspiration from existing Laplacians in the literature, we develop two unique formulations of sheaf hypergraph Laplacians: linear and non-linear. Our theoretical analysis demonstrates that incorporating sheaves into the hypergraph Laplacian provides a more expressive inductive bias than standard hypergraph diffusion, creating a powerful instrument for effectively modelling complex data structures. We employ these sheaf hypergraph Laplacians to design two categories of models: Sheaf Hypergraph Neural Networks and Sheaf Hypergraph Convolutional Networks. These models generalize classical Hypergraph Networks often found in the literature. Through extensive experimentation, we show that this generalization significantly improves performance, achieving top results on multiple benchmark datasets for hypergraph node classification.

Simplifying and Empowering Transformers for Large-Graph Representations
Qitian Wu Wentao Zhao Chenxiao Yang Hengrui Zhang Fan Nie Haitian Jiang Yatao Bian Junchi Yan



Research question: How to learn representations effectively on large graphs.
Motivation: Owing to the inter-dependence of data points, existing methods for large graphs tend to require complicated models and heavy computation.
Method: Propose Simplified Graph Transformers (SGFormer), which uses a single attention layer to efficiently propagate information among arbitrary nodes, requiring no positional encodings, feature/graph pre-processing, or augmented loss.
Results: Experiments show that SGFormer successfully scales to the web-scale ogbn-papers100M graph and achieves up to 141x inference acceleration over state-of-the-art Transformers on medium-sized graphs.

Learning representations on large-sized graphs is a long-standing challenge due to the inter-dependence nature involved in massive data points. Transformers, as an emerging class of foundation encoders for graph-structured data, have shown promising performance on small graphs due to its global attention capable of capturing all-pair influence beyond neighboring nodes. Even so, existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated models by stacking deep multi-head attentions. In this paper, we critically demonstrate that even using a one-layer attention can bring up surprisingly competitive performance across node property prediction benchmarks where node numbers range from thousand-level to billion-level. This encourages us to rethink the design philosophy for Transformers on large graphs, where the global attention is a computation overhead hindering the scalability. We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model that can efficiently propagate information among arbitrary nodes in one layer. SGFormer requires none of positional encodings, feature/graph pre-processing or augmented loss. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M and yields up to 141x inference acceleration over SOTA Transformers on medium-sized graphs. Beyond current results, we believe the proposed methodology alone enlightens a new technical path of independent interest for building Transformers on large graphs.
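
A sketch of one-layer all-pair propagation with linear complexity, in the spirit of SGFormer's simple attention; this is my simplification of the idea, not the paper's exact attention function. Associativity lets us compute $Q(K^{\top}V)$ without ever forming the $n \times n$ attention matrix; the elu-plus-one feature map, a standard linear-attention choice, keeps the normalizer positive.

```python
import torch
import torch.nn.functional as F

def one_layer_linear_attention(x, Wq, Wk, Wv):
    """All-pair propagation in O(n d^2) instead of O(n^2 d)."""
    q = F.elu(x @ Wq) + 1            # positive feature maps, no softmax
    k = F.elu(x @ Wk) + 1
    v = x @ Wv
    num = q @ (k.T @ v)              # associativity: d x d intermediate
    den = q @ k.sum(dim=0)           # per-node normalizer, shape (n,)
    return num / den.unsqueeze(-1)

n, d = 1000, 16
x = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
print(one_layer_linear_attention(x, Wq, Wk, Wv).shape)  # (1000, 16)
```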

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Lorenzo Noci Chuning Li Mufan Bill Li Bobby He Thomas Hofmann Chris J. Maddison Daniel M. Roy



Research question: Studying the covariance matrix of the Transformer's attention mechanism in the proportional limit of infinite depth and width, to understand the network's trainability.
Motivation: Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections.
Method: Modify the Transformer's attention mechanism by centering the Softmax output at the identity and scaling the Softmax logits by a width-dependent temperature parameter. Examine the network's stability through the corresponding stochastic differential equation (SDE), showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections.
Results: Simulations show that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name "shaped Transformer" for these architectural modifications.

In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network’s trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer’s attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
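
A sketch of the two modifications described above, under my reading of the abstract: logits scaled by a width-dependent temperature, and the Softmax output centered so the layer is near the identity at initialization. The specific constants, the $1/\text{width}$ temperature, and the exact centering are assumptions, not the paper's specification.

```python
import torch

def shaped_attention(x, Wq, Wk, gamma=0.5):
    """Identity-centered, temperature-scaled softmax attention (a sketch)."""
    n, width = x.shape
    tau = 1.0 / width                      # width-dependent temperature
    logits = (x @ Wq) @ (x @ Wk).T * tau
    A = torch.softmax(logits, dim=-1)
    # With tau small, A is near uniform (1/n), so the shaped matrix is a
    # small perturbation of the identity: a well-behaved residual branch.
    return torch.eye(n) + gamma * (A - torch.full((n, n), 1.0 / n))

x = torch.randn(6, 32)
Wq, Wk = torch.randn(32, 32), torch.randn(32, 32)
print(torch.dist(shaped_attention(x, Wq, Wk), torch.eye(6)))  # small
```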

Tailoring Self-Attention for Graph via Rooted Subtrees
Siyuan Huang Yunchong Song Jiayue Zhou Zhouhan Lin



Research question: Existing attention mechanisms in graph learning have limitations: local attention struggles to capture long-range information, while global attention cannot reflect the hierarchical neighborhood structure.
Motivation: To address these issues, this paper proposes a novel multi-hop graph attention mechanism, Subtree Attention (STA).
Method: STA seamlessly bridges the fully-attentional structure and rooted subtrees, with a theoretical proof that STA approximates global attention under extreme settings. By allowing direct computation of attention weights among multi-hop neighbors, STA mitigates the inherent problems of existing graph attention mechanisms.
Results: An efficient form of STA is devised using kernelized softmax, yielding linear time complexity. The resulting STA-based graph neural network, STAGNN, performs strongly on ten node classification datasets, outperforming existing graph transformers and mainstream GNNs.

Attention mechanisms have made significant strides in graph learning, yet they still exhibit notable limitations: local attention faces challenges in capturing long-range information due to the inherent problems of the message-passing scheme, while global attention cannot reflect the hierarchical neighborhood structure and fails to capture fine-grained local information. In this paper, we propose a novel multi-hop graph attention mechanism, named Subtree Attention (STA), to address the aforementioned issues. STA seamlessly bridges the fully-attentional structure and the rooted subtree, with theoretical proof that STA approximates the global attention under extreme settings. By allowing direct computation of attention weights among multi-hop neighbors, STA mitigates the inherent problems in existing graph attention mechanisms. Further we devise an efficient form for STA by employing kernelized softmax, which yields a linear time complexity. Our resulting GNN architecture, the STAGNN, presents a simple yet performant STA-based graph neural network leveraging a hop-aware attention strategy. Comprehensive evaluations on ten node classification datasets demonstrate that STA-based models outperform existing graph transformers and mainstream GNNs. The code is available at https://github.com/LUMIA-Group/SubTree-Attention.

Boosting Verification of Deep Reinforcement Learning via Piece-Wise Linear Decision Neural Networks
Jiaxu Tian Dapeng Zhi Si Liu Peixin Wang Cheng Chen Min Zhang



Research question: The accuracy and scalability of formally verifying deep reinforcement learning (DRL) systems.
Motivation: The major obstacles are the large overestimation introduced inherently during training and the transformation of inexplicable decision-making models (deep neural networks) into easy-to-verify models.
Method: Propose an inverse transform-then-train approach that first encodes a DNN into an equivalent set of efficiently and tightly verifiable linear control policies and then optimizes them via reinforcement learning. Also propose a novel neural network model, piece-wise linear decision neural networks (PLDNNs), compatible with most existing DRL training algorithms and with performance comparable to conventional DNNs.
Results: Compared with DNN-based DRL systems, PLDNN-based systems can be verified more efficiently and tightly, with up to 438x speedup and a significant reduction in overestimation. In particular, even a complex 12-dimensional DRL system can be verified efficiently at deeper computation steps.

Formally verifying deep reinforcement learning (DRL) systems suffers from both inaccurate verification results and limited scalability. The major obstacle lies in the large overestimation introduced inherently during training and then transforming the inexplicable decision-making models, i.e., deep neural networks (DNNs), into easy-to-verify models. In this paper, we propose an inverse transform-then-train approach, which first encodes a DNN into an equivalent set of efficiently and tightly verifiable linear control policies and then optimizes them via reinforcement learning. We accompany our inverse approach with a novel neural network model called piece-wise linear decision neural networks (PLDNNs), which are compatible with most existing DRL training algorithms with comparable performance against conventional DNNs. Our extensive experiments show that, compared to DNN-based DRL systems, PLDNN-based systems can be more efficiently and tightly verified with up to $438$ times speedup and a significant reduction in overestimation. In particular, even a complex $12$-dimensional DRL system is efficiently verified with up to 7 times deeper computation steps.

What functions can Graph Neural Networks compute on random graphs? The role of Positional Encoding
Nicolas Keriven Samuel Vaiter



Research question: Deepening the theoretical understanding of graph neural networks (GNNs) on large graphs, with a focus on their expressive power.
Motivation: Existing analyses relate expressivity to the graph isomorphism problem, which is mostly relevant for small graphs, or study graph classification and regression tasks, whereas node prediction tasks are far more relevant on large graphs. Recently, several works showed that, on very general random graph models, GNNs converge to certain functions as the number of nodes grows.
Method: Provide a more complete and intuitive description of the function space generated by equivariant GNNs for node tasks, through general notions of convergence that encompass several previous examples. Emphasize the role of input node features and study the impact of node Positional Encodings (PEs), a recent line of work shown to yield state-of-the-art results in practice; by studying several examples of PEs on large random graphs, extend previously known universality results to significantly more general models.
Results: The theoretical results hint at normalization tricks that are shown numerically to have a positive impact on GNN generalization on synthetic and real data; the proofs contain new concentration inequalities of independent interest.

We aim to deepen the theoretical understanding of Graph Neural Networks (GNNs) on large graphs, with a focus on their expressive power. Existing analyses relate this notion to the graph isomorphism problem, which is mostly relevant for graphs of small sizes, or study graph classification or regression tasks, while prediction tasks on \emph{nodes} are far more relevant on large graphs. Recently, several works showed that, on very general random graph models, GNNs converge to certain functions as the number of nodes grows. In this paper, we provide a more complete and intuitive description of the function space generated by equivariant GNNs for node-tasks, through general notions of convergence that encompass several previous examples. We emphasize the role of input node features, and study the impact of \emph{node Positional Encodings} (PEs), a recent line of work that has been shown to yield state-of-the-art results in practice. Through the study of several examples of PEs on large random graphs, we extend previously known universality results to significantly more general models. Our theoretical results hint at some normalization tricks, which is shown numerically to have a positive impact on GNN generalization on synthetic and real data. Our proofs contain new concentration inequalities of independent interest.

Computational Complexity of Learning Neural Networks: Smoothness and Degeneracy
Amit Daniely Nathan Srebro Gal Vardi



Research question: understanding when neural networks can be learned efficiently is a fundamental question in learning theory.
Motivation: existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms.
Method: study whether such assumptions suffice for learning deeper networks and prove negative results. Show that learning depth-3 ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, and even when the weight matrices are non-degenerate.
Results: the hardness results show that learning depth-3 ReLU networks under the Gaussian distribution is hard even with non-degenerate weight matrices. Depth-2 networks are also considered, with hardness of learning shown in the smoothed-analysis framework where both the network parameters and the input distribution are smoothed. The hardness results rest on a well-studied assumption on the existence of local pseudorandom generators.

Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution is Gaussian, and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-$3$ ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where a random noise is added to the network's parameters. It implies that learning depth-$3$ ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-$2$ networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators.

Limits, approximation and size transferability for GNNs on sparse graphs via graphops
Thien Le Stefanie Jegelka



Research question: can graph neural networks generalize to graphs that differ from their training graphs, e.g., in size?
Motivation: although recent work established such transferability and approximation results via graph limits (e.g., via graphons), these apply only to dense graphs. To include commonly encountered sparse graphs such as bounded-degree or power-law graphs, take the perspective of operators derived from graphs, such as the aggregation operation that makes up GNNs.
Method: via the limit notion of graphops, show how the operator perspective yields quantitative bounds on the distance between a finite GNN and its limit on an infinite graph, as well as between GNNs on graphs of different sizes that share structural properties.
Results: the results hold for dense and sparse graphs, and for various notions of graph limits.

Can graph neural networks generalize to graphs that are different from the graphs they were trained on, e.g., in size? In this work, we study this question from a theoretical perspective. While recent work established such transferability and approximation results via graph limits, e.g., via graphons, these only apply nontrivially to dense graphs. To include frequently encountered sparse graphs such as bounded-degree or power law graphs, we take the perspective of limits of operators derived from graphs, such as the aggregation operation that makes up GNNs. This leads to the recently introduced limit notion of graphops (Backhausz and Szegedy, 2022). We demonstrate how the operator perspective allows us to develop quantitative bounds on the distance between a finite GNN and its limit on an infinite graph, as well as the distance between the GNN on graphs of different sizes that share structural properties, under a regularity assumption verified for various graph sequences. Our results hold for dense and sparse graphs, and various notions of graph limits.

Graph Convolutional Kernel Machine versus Graph Convolutional Networks
Zhihao Wu Zhao Zhang Jicong Fan



Research question: how to use graph convolutional kernels for graph-based machine learning.
Motivation: for existing graph convolutional networks (GCNs), the gain from going deeper is often tiny or even negative. This implies that the complexity of graph data is limited and shallow models often suffice to extract expressive features for tasks such as node classification.
Method: propose the graph convolutional kernel machine (GCKM), a framework built upon kernel functions integrated with graph convolution. Taking the graph convolutional kernel support vector machine (GCKSVM) as an example, analyze its generalization error bound and discuss the impact of the graph structure.
Results: compared to GCNs, GCKMs require much less effort in architecture design, hyperparameter tuning, and optimization. More importantly, GCKMs are guaranteed to obtain globally optimal solutions, with strong generalization ability and high interpretability. Experiments show that, besides these advantages, GCKMs achieve at least competitive accuracy compared to GCNs.

Graph convolutional networks (GCN) with one or two hidden layers have been widely used in handling graph data that are prevalent in various disciplines. Many studies showed that the gain of making GCNs deeper is tiny or even negative. This implies that the complexity of graph data is often limited and shallow models are often sufficient to extract expressive features for various tasks such as node classification. Therefore, in this work, we present a framework called graph convolutional kernel machine (GCKM) for graph-based machine learning. GCKMs are built upon kernel functions integrated with graph convolution. An example is the graph convolutional kernel support vector machine (GCKSVM) for node classification, for which we analyze the generalization error bound and discuss the impact of the graph structure. Compared to GCNs, GCKMs require much less effort in architecture design, hyperparameter tuning, and optimization. More importantly, GCKMs are guaranteed to obtain globally optimal solutions and have strong generalization ability and high interpretability. GCKMs are composable, can be extended to large-scale data, and are applicable to various tasks (e.g., node or graph classification, clustering, feature extraction, dimensionality reduction). The numerical results on benchmark datasets show that, besides the aforementioned advantages, GCKMs have at least competitive accuracy compared to GCNs.
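A minimal GCKSVM-flavored pipeline, assuming a parameter-free GCN-style propagation step followed by an RBF kernel and a precomputed-kernel SVM; the paper's actual kernel construction may differ, so treat this as a sketch of the framework's shape:

```python
import numpy as np
from sklearn.svm import SVC

def graph_convolved_features(X, A, num_layers=2):
    """Smooth node features with the symmetrically normalized adjacency,
    mirroring the parameter-free propagation step of a GCN."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    S = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    H = X
    for _ in range(num_layers):
        H = S @ H
    return H

def gcksvm_fit_predict(X, A, y, train_idx, test_idx, gamma=0.5, C=1.0):
    """Kernel machine on graph-convolved features: an RBF kernel here stands
    in for the paper's graph convolutional kernel."""
    H = graph_convolved_features(X, A)
    sq = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                        # full kernel matrix
    clf = SVC(kernel="precomputed", C=C)           # convex training problem
    clf.fit(K[np.ix_(train_idx, train_idx)], y[train_idx])
    return clf.predict(K[np.ix_(test_idx, train_idx)])
```

Because the SVM objective is convex, the kernel-machine view gives the globally optimal solution that the abstract contrasts with GCN training.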

Multi-resolution Spectral Coherence for Graph Generation with Score-based Diffusion
Hyuna Cho Minjae Jeong Sooyeon Jeon Sungsoo Ahn Won Hwa Kim



Research question: how to accurately estimate the joint distribution of graph components (such as nodes and edges) in the training data, which successful graph generation depends on.
Motivation: existing deep generative networks sample realistic graphs but suffer from the oversmoothing inherited from conventional graph convolution, making high-frequency characteristics of nodes and edges intractable.
Method: propose a new approach that captures multi-resolution dependencies between nodes and edges in the spectral space, modeling the joint distribution of node and edge signals in a shared graph wavelet space together with a score-based diffusion model, so as to sample synthetic graphs with realistic frequency characteristics of nodes and edges.
Results: experiments on four representative benchmark datasets validate the superiority of Wave-GD over existing approaches, highlighting its potential for a wide range of applications involving graph data.

Successful graph generation depends on the accurate estimation of the joint distribution of graph components such as nodes and edges from training data. While recent deep neural networks, together with diffusion models, have demonstrated sampling of realistic graphs, they still suffer from the oversmoothing problem inherited from conventional graph convolution, so high-frequency characteristics of nodes and edges become intractable. To overcome such issues and generate graphs with high fidelity, this paper introduces a novel approach that captures the dependency between nodes and edges at multiple resolutions in the spectral space. By modeling the joint distribution of node and edge signals in a shared graph wavelet space, together with a score-based diffusion model, we propose a Wavelet Graph Diffusion Model (Wave-GD) which lets us sample synthetic graphs with real-like frequency characteristics of nodes and edges. Experimental results on four representative benchmark datasets validate the superiority of Wave-GD over existing approaches, highlighting its potential for a wide range of applications that involve graph data.

May the Force be with You: Unified Force-Centric Pre-Training for 3D Molecular Conformations
Rui Feng Qi Zhu Huan Tran Binghong Chen Aubrey Toland Rampi Ramprasad Chao Zhang



Research question: existing pre-trained models focus mainly on equilibrium data and largely overlook off-equilibrium conformations; extending such methods to off-equilibrium data is a challenge.
Motivation: the training objectives of existing pre-trained models rely on the assumption that conformations are local energy minima, so learning off-equilibrium data directly from atomic forces is challenging.
Method: propose a force-centric pre-training model for 3D molecular conformations that covers both equilibrium and off-equilibrium data. For off-equilibrium data, the model learns directly from atomic forces; for equilibrium data, zero-force regularization and force-based denoising are introduced to approximate near-equilibrium forces.
Results: with the pre-training objective, force accuracy improves by about 3x compared to an un-pre-trained Equivariant Transformer. By regularizing on equilibrium data, the unstable MD simulations of vanilla Equivariant Transformers are resolved, achieving state-of-the-art simulation performance with 2.45x faster inference than NequIP. As a powerful molecular encoder, the pre-trained model achieves on-par performance on state-of-the-art property prediction tasks.

Recent works have shown the promise of learning pre-trained models for 3D molecular representation. However, existing pre-training models focus predominantly on equilibrium data and largely overlook off-equilibrium conformations. It is challenging to extend these methods to off-equilibrium data because their training objective relies on assumptions of conformations being the local energy minima. We address this gap by proposing a force-centric pretraining model for 3D molecular conformations covering both equilibrium and off-equilibrium data. For off-equilibrium data, our model learns directly from their atomic forces. For equilibrium data, we introduce zero-force regularization and forced-based denoising techniques to approximate near-equilibrium forces. We obtain a unified pre-trained model for 3D molecular representation with over 15 million diverse conformations. Experiments show that, with our pre-training objective, we increase forces accuracy by around 3 times compared to the un-pre-trained Equivariant Transformer model. By incorporating regularizations on equilibrium data, we solved the problem of unstable MD simulations in vanilla Equivariant Transformers, achieving state-of-the-art simulation performance with 2.45 times faster inference time than NequIP. As a powerful molecular encoder, our pre-trained model achieves on-par performance with state-of-the-art property prediction tasks.
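The three training signals described above can be sketched as one combined loss. The tensor names and the restoring-force target in the denoising term (a standard denoising-score-matching choice) are assumptions of this sketch, not the paper's exact formulation:

```python
import torch

def force_pretraining_loss(model, batch, lambda_zero=0.1,
                           lambda_denoise=1.0, sigma=0.05):
    """Sketch of the abstract's three terms; `model` maps atomic positions
    to predicted per-atom forces (all names hypothetical)."""
    pos, eq_mask, forces = batch["pos"], batch["eq_mask"], batch["forces"]

    pred = model(pos)
    # (1) off-equilibrium: regress directly onto reference atomic forces
    loss_force = ((pred[~eq_mask] - forces[~eq_mask]) ** 2).mean()
    # (2) equilibrium: zero-force regularization -- minima should have ~0 force
    loss_zero = (pred[eq_mask] ** 2).mean()
    # (3) equilibrium: force-based denoising -- perturb positions and predict
    # a restoring force; the -noise/sigma^2 target is the usual DSM choice
    # and is an assumption here
    noise = sigma * torch.randn_like(pos)
    pred_noisy = model(pos + noise)
    loss_denoise = ((pred_noisy[eq_mask] + noise[eq_mask] / sigma**2) ** 2).mean()

    return loss_force + lambda_zero * loss_zero + lambda_denoise * loss_denoise
```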

Geometric Transformer with Interatomic Positional Encoding
Yusong Wang Shaoning Li Tong Wang Bin Shao Nanning Zheng Tie-Yan Liu



Research question: the widespread adoption of Transformer architectures across data modalities has opened new avenues for molecular modeling, but it remains unclear whether Transformer-based architectures can model molecules as well as equivariant graph neural networks.
Motivation: design an Interatomic Positional Encoding (IPE) that parameterizes atomic environments as Transformer positional encodings, so that molecular structures can be modeled effectively for molecular property prediction.
Method: by introducing IPE, atomic environments are parameterized as Transformer positional encodings, yielding a novel geometric Transformer, Geoformer.
Results: on several benchmarks, including the QM9 dataset and the recently proposed Molecule3D dataset, and compared with both Transformers and equivariant GNN models, Geoformer outperforms state-of-the-art algorithms on QM9 and achieves the best performance on both random and scaffold splits of Molecule3D. With IPE, Geoformer paves the way for Transformer-based molecular geometric modeling.

The widespread adoption of Transformer architectures in various data modalities has opened new avenues for applications in molecular modeling. Nevertheless, it remains elusive whether Transformer-based architectures can model molecules as well as equivariant GNNs. In this paper, by designing Interatomic Positional Encoding (IPE) that parameterizes atomic environments as Transformer's positional encodings, we propose Geoformer, a novel geometric Transformer to effectively model molecular structures for various molecular property prediction tasks. We evaluate Geoformer on several benchmarks, including the QM9 dataset and the recently proposed Molecule3D dataset. Compared with both Transformers and equivariant GNN models, Geoformer outperforms the state-of-the-art (SoTA) algorithms on QM9, and achieves the best performance on Molecule3D for both random and scaffold splits. By introducing IPE, Geoformer paves the way for molecular geometric modeling based on Transformer architecture.

Scaling MLPs: A Tale of Inductive Bias
Gregor Bachmann Sotiris Anagnostidis Thomas Hofmann



Research question: revisit the most fundamental building block of deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks.
Motivation: with the recent narrative that "less inductive bias is better", popularized as transformers eclipse convolutional models, it is natural to explore the limits of this hypothesis, and MLPs offer an ideal test bed since they lack any vision-specific inductive bias. Moreover, thanks to their mathematical simplicity, MLPs have almost exclusively been the protagonist of the deep learning theory literature, serving as a proxy for explaining empirical phenomena observed in more complex architectures.
Method: run extensive pre-training experiments and evaluate MLPs on CIFAR10, CIFAR100, and ImageNet ReaL.
Results: MLP performance improves drastically with scale (95% on CIFAR10, 82% on CIFAR100, 58% on ImageNet ReaL), showing that the lack of inductive bias can indeed be compensated. MLPs faithfully mimic the behaviour of their modern counterparts, although some components of the learning setting exhibit stronger or unexpected behaviours.

In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative "less inductive bias is better", popularized due to transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, as they lack any vision-specific inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: \textit{Do MLPs reflect the empirical advances exhibited by practical models?} Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects. We show that the performance of MLPs drastically improves with scale (95% on CIFAR10, 82% on CIFAR100, 58% on ImageNet ReaL), highlighting that lack of inductive bias can indeed be compensated. We observe that MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however exhibiting stronger or unexpected behaviours. Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.

The emergence of clusters in self-attention dynamics
Borjan Geshkovski Cyril Letrouit Yury Polyanskiy Philippe Rigollet



Research question: viewing Transformers as interacting particle systems, describe the geometry of the learned representations when the weights are not time-dependent.
Motivation: probe the inner workings of Transformers and understand the geometric properties of the representations they learn.
Method: using techniques from dynamical systems and partial differential equations, treat the tokens in a Transformer as particles and study their clustering behaviour as time tends to infinity.
Results: prove that, in the one-dimensional case, the self-attention matrix converges to a low-rank Boolean matrix, mathematically confirming the observation of Vaswani et al. [VSP'17] that "leaders" emerge when a sequence of tokens is processed by Transformers.

Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time-dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.

An Inductive Bias for Tabular Deep Learning
Ege Beyazit Jonathan Kozaczuk Bo Li Vanessa Wallace Bilal H Fadlallah



Research question: deep learning excels at tasks involving images, text, and audio, yet typically underperforms tree-based methods on tabular data.
Motivation: the authors argue that a significant contributor to this performance gap is the interaction between irregular target functions and the well-known tendency of neural networks to learn smooth functions.
Method: using tools from spectral analysis, the authors show that functions described by tabular datasets are often highly irregular and can be smoothed by transformations such as scaling and ranking to improve performance. Since such transformations can lose information or harm the loss landscape during optimization, frequency reduction is instead introduced as an inductive bias.
Results: the method introduces less computational complexity than a fully connected layer while significantly improving neural network performance and speeding up convergence on 14 tabular datasets.

Deep learning methods have achieved state-of-the-art performance in most modeling tasks involving images, text and audio, however, they typically underperform tree-based methods on tabular data. In this paper, we hypothesize that a significant contributor to this performance gap is the interaction between irregular target functions resulting from the heterogeneous nature of tabular feature spaces, and the well-known tendency of neural networks to learn smooth functions. Utilizing tools from spectral analysis, we show that functions described by tabular datasets often have high irregularity, and that they can be smoothed by transformations such as scaling and ranking in order to improve performance. However, because these transformations tend to lose information or negatively impact the loss landscape during optimization, they need to be rigorously fine-tuned for each feature to achieve performance gains. To address these problems, we propose introducing frequency reduction as an inductive bias. We realize this bias as a neural network layer that promotes learning low-frequency representations of the input features, allowing the network to operate in a space where the target function is more regular. Our proposed method introduces less computational complexity than a fully connected layer, while significantly improving neural network performance, and speeding up its convergence on 14 tabular datasets.
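As a concrete instance of the smoothing transformations mentioned above (which the paper ultimately replaces with a learned frequency-reduction layer), a per-feature rank transform can be sketched as follows; the normalization choice and handling of unseen test values are assumptions of this sketch:

```python
import numpy as np
from scipy.stats import rankdata

def rank_transform(X_train, X_test):
    """Per-feature rank transform: map each raw value to its normalized rank
    in the training set, one of the irregularity-smoothing transformations
    the paper discusses alongside scaling."""
    X_train_r = np.empty_like(X_train, dtype=float)
    X_test_r = np.empty_like(X_test, dtype=float)
    n = X_train.shape[0]
    for j in range(X_train.shape[1]):
        order = np.sort(X_train[:, j])
        X_train_r[:, j] = rankdata(X_train[:, j]) / n
        # rank unseen values by their insertion position in the sorted column
        X_test_r[:, j] = np.searchsorted(order, X_test[:, j]) / n
    return X_train_r, X_test_r
```

Such a transform flattens heavy-tailed features onto a regular grid, which is exactly the kind of target-function smoothing the spectral analysis in the paper motivates.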

Facilitating Graph Neural Networks with Random Walk on Simplicial Complexes
Cai Zhou Xiyuan Wang Muhan Zhang



Research question: systematically analyze how random walks on simplicial complexes of different orders improve the theoretical expressive power of graph neural networks.
Motivation: node-level random walks have been widely used to improve graph neural networks, but random walks on edges and, more generally, on higher-order $k$-simplices have received comparatively little attention.
Method: design positional encodings based on random walks on simplicial complexes of different orders. On 0-simplices (node level), existing positional encoding (PE) and structure encoding (SE) methods are connected through the bridge of random walks; on 1-simplices (edge level), edge-level random walks are connected to Hodge 1-Laplacians and corresponding edge PEs are designed.
Results: experiments show that the random walk-based methods achieve strong results across a variety of tasks.

Node-level random walk has been widely used to improve Graph Neural Networks. However, there is limited attention to random walk on edges and, more generally, on $k$-simplices. This paper systematically analyzes how random walk on different orders of simplicial complexes (SC) facilitates GNNs in their theoretical expressivity. First, on $0$-simplices or node level, we establish a connection between existing positional encoding (PE) and structure encoding (SE) methods through the bridge of random walk. Second, on $1$-simplices or edge level, we bridge edge-level random walk and Hodge $1$-Laplacians and design corresponding edge PEs. In the spatial domain, we directly make use of edge-level random walk to construct EdgeRWSE. Based on spectral analysis of Hodge $1$-Laplacians, we propose Hodge1Lap, a permutation equivariant and expressive edge-level positional encoding. Third, we generalize our theory to random walk on higher-order simplices and propose the general principle to design PE on simplices based on random walk and Hodge Laplacians. Inter-level random walk is also introduced to unify a wide range of simplicial networks. Extensive experiments verify the effectiveness of our random walk-based methods.
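The node-level random-walk encoding that this line of work builds on can be computed directly from powers of the transition matrix; the edge-level variants described above would swap in walks derived from Hodge 1-Laplacians. A small NumPy sketch:

```python
import numpy as np

def random_walk_se(A, num_steps=8):
    """Node-level random-walk structural encoding: for each node, the
    probability of returning to itself after k = 1..num_steps steps of a
    simple random walk (the diagonal of T^k, with T = D^{-1} A)."""
    deg = np.maximum(A.sum(axis=1), 1e-12)
    T = A / deg[:, None]                 # row-stochastic transition matrix
    enc, Tk = [], np.eye(A.shape[0])
    for _ in range(num_steps):
        Tk = Tk @ T
        enc.append(np.diag(Tk))
    return np.stack(enc, axis=1)         # (num_nodes, num_steps)

# toy usage on a 4-cycle: return probabilities alternate between 0 and >0
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
print(random_walk_se(A, num_steps=4))
```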

Residual Alignment: Uncovering the Mechanisms of Residual Networks
Jianing Li Vardan Papyan



Research question: conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its residual blocks and measuring their singular value decompositions.
Motivation: although the ResNet architecture greatly boosts performance through simple skip connections, the mechanisms behind its success remain largely unknown.
Method: linearize the constituent residual blocks using Residual Jacobians and measure their singular value decompositions.
Results: the measurements reveal a process called Residual Alignment (RA) with four properties: (RA1) intermediate representations of a given input are equispaced on a line embedded in high-dimensional space; (RA2) top left and right singular vectors of Residual Jacobians align with each other and across different depths; (RA3) Residual Jacobians have rank at most C for fully-connected ResNets, where C is the number of classes; (RA4) top singular values of Residual Jacobians scale inversely with depth. RA consistently occurs in well-performing models across various depths and widths, different numbers of classes, and both fully-connected and convolutional architectures on all tested datasets, but ceases once the skip connections are removed.

The ResNet architecture has been widely adopted in deep learning due to its significant boost to performance through the use of simple skip connections, yet the underlying mechanisms leading to its success remain largely unknown. In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions. Our measurements ([code](https://colab.research.google.com/drive/1yKjEg2yF616tnZFAfuN0aQ-E9v3JmyjN?usp=sharing)) reveal a process called Residual Alignment (RA) characterized by four properties:

- **(RA1):** intermediate representations of a given input are *equispaced* on a *line*, embedded in high dimensional space, as observed by Gai and Zhang [2021];
- **(RA2):** top left and right singular vectors of Residual Jacobians align with each other and across different depths;
- **(RA3):** Residual Jacobians are at most rank $C$ for fully-connected ResNets, where $C$ is the number of classes; and
- **(RA4):** top singular values of Residual Jacobians scale inversely with depth.

RA consistently occurs in models that generalize well, in both fully-connected and convolutional architectures, across various depths and widths, for varying numbers of classes, on all tested benchmark datasets, but ceases to occur once the skip connections are removed. It also provably occurs in a novel mathematical model we propose. This phenomenon reveals a strong alignment between residual branches of a ResNet (RA2+4), imparting a highly rigid geometric structure to the intermediate representations as they progress *linearly* through the network (RA1) up to the final layer, where they undergo Neural Collapse.
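The core measurement, linearizing a residual branch and inspecting the singular values of its Jacobian, can be reproduced on a toy block with PyTorch's autograd; the block architecture below is illustrative, not the paper's:

```python
import torch
from torch.autograd.functional import jacobian

class ResBlock(torch.nn.Module):
    """A toy fully-connected residual block: x -> x + F(x)."""
    def __init__(self, dim):
        super().__init__()
        self.f = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(),
            torch.nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)

def residual_jacobian_svd(block, x):
    """Linearize the residual branch F at input x and return the singular
    values of its Jacobian -- the quantity inspected in the RA measurements."""
    J = jacobian(block.f, x)            # (dim, dim) for a single input vector
    return torch.linalg.svdvals(J)      # descending singular values

block = ResBlock(16)
x = torch.randn(16)
print(residual_jacobian_svd(block, x)[:5])   # top singular values
```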

Structured Neural-PI Control with End-to-End Stability and Output Tracking Guarantees
Wenqi Cui Yan Jiang Baosen Zhang Yuanyuan Shi



Research question: study the optimal control of multiple-input multiple-output dynamical systems by designing neural network controllers with stability and output-tracking guarantees.
Motivation: neural network-based nonlinear controllers show superior performance across applications, but their lack of provable guarantees restricts their adoption in high-stake real-world applications.
Method: using equilibrium-independent passivity, a property present in a wide range of physical systems, propose neural proportional-integral (PI) controllers with provable guarantees of stability and zero steady-state output-tracking error. The key structure is strict monotonicity of the proportional and integral terms, parameterized as gradients of strictly convex neural networks (SCNNs).
Results: experiments show that the proposed approach improves both transient and steady-state performance, whereas unstructured neural networks lead to unstable behaviour.

We study the optimal control of multiple-input and multiple-output dynamical systems via the design of neural network-based controllers with stability and output tracking guarantees. While neural network-based nonlinear controllers have shown superior performance in various applications, their lack of provable guarantees has restricted their adoption in high-stake real-world applications. This paper bridges the gap between neural network-based controllers and the need for stabilization guarantees. Using equilibrium-independent passivity, a property present in a wide range of physical systems, we propose neural Proportional-Integral (PI) controllers that have provable guarantees of stability and zero steady-state output tracking error. The key structure is the strict monotonicity on proportional and integral terms, which is parameterized as gradients of strictly convex neural networks (SCNN). We construct SCNN with tunable softplus-$\beta$ activations, which yields universal approximation capability and is also useful in incorporating communication constraints. In addition, the SCNNs serve as Lyapunov functions, giving us end-to-end performance guarantees. Experiments on traffic and power networks demonstrate that the proposed approach improves both transient and steady-state performances, while unstructured neural networks lead to unstable behaviors.
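A minimal sketch of the key structural idea: parameterize a strictly convex scalar function with softplus-$\beta$ activations and use its gradient as a strictly monotone controller term. The quadratic term, layer sizes, and weight constraints below are assumptions of this sketch; the paper's exact SCNN parameterization may differ:

```python
import torch

class SCNN(torch.nn.Module):
    """Small input-convex network with softplus-beta activations and
    non-negative mixing weights; the added quadratic makes it strictly
    convex, so its gradient is strictly monotone."""
    def __init__(self, dim, hidden=32, beta=5.0, eps=1e-3):
        super().__init__()
        self.W1 = torch.nn.Linear(dim, hidden)
        self.w2 = torch.nn.Parameter(torch.rand(hidden))  # clamped >= 0 below
        self.act = torch.nn.Softplus(beta=beta)
        self.eps = eps

    def forward(self, y):
        # convex: softplus of affine, combined with non-negative weights
        z = self.act(self.W1(y)) @ torch.relu(self.w2)
        return z + 0.5 * self.eps * (y ** 2).sum(-1)      # strictly convex term

def monotone_term(scnn, y):
    """Controller nonlinearity: the gradient of a strictly convex function,
    hence strictly monotone in y (the structure used for the P and I terms)."""
    y = y.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(scnn(y).sum(), y, create_graph=True)
    return grad

scnn = SCNN(dim=3)
print(monotone_term(scnn, torch.randn(4, 3)).shape)  # torch.Size([4, 3])
```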

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Yuandong Tian Yiping Wang Beidi Chen Simon Shaolei Du



Research question: analyze, in a mathematically rigorous way, a 1-layer Transformer consisting of one self-attention layer plus one decoder layer.
Motivation: although the Transformer architecture performs impressively across research domains, how it works remains unclear; in particular, how representations emerge from the gradient training dynamics under a simple predictive loss is still a mystery.
Method: rigorously analyze the SGD training dynamics of a 1-layer Transformer (one self-attention layer plus one decoder layer) for next-token prediction, revealing how the self-attention layer combines input tokens and the nature of the underlying inductive bias.
Results: self-attention acts as a "discriminative scanning algorithm" that gradually attends more to the distinct key tokens for a specific next token and less to common key tokens that occur across different next tokens. This process does not lead to winner-takes-all; it stops due to a "phase transition" controlled by the decoder's learning rate, leaving an almost fixed token combination.

Transformer architecture has shown impressive performance in multiple research domains and has become the backbone of many neural network models. However, there is limited understanding on how it works. In particular, with a simple predictive loss, how the representation emerges from the gradient \emph{training dynamics} remains a mystery. In this paper, for 1-layer transformer with one self-attention layer plus one decoder layer, we analyze its SGD training dynamics for the task of next token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of underlying inductive bias. More specifically, with the assumption (a) no positional encoding, (b) long input sequence, and (c) the decoder layer learns faster than the self-attention layer, we prove that self-attention acts as a \emph{discriminative scanning algorithm}: starting from uniform attention, it gradually attends more to distinct key tokens for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all, but stops due to a \emph{phase transition} that is controllable by the learning rate of the decoder layer, leaving (almost) fixed token combination. We verify this \textbf{\emph{scan and snap}} dynamics on synthetic and real-world data (WikiText-103).

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data
Yiwen Kou Zixiang Chen Quanquan Gu



Research question: resolve the implicit bias of non-smooth neural networks trained by gradient descent.
Motivation: the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), but the implicit bias of gradient descent is currently understood only for smooth neural networks, leaving the non-smooth case open.
Method: study the implicit bias of gradient descent for training two-layer fully connected (leaky) ReLU neural networks.
Results: when the training data are nearly orthogonal, for the leaky ReLU activation gradient descent finds a network whose stable rank converges to 1, whereas for the ReLU activation it finds a network whose stable rank is upper bounded by a constant. Moreover, gradient descent finds a network in which all training data points asymptotically have the same normalized margin. Experiments on synthetic and real data confirm the theoretical findings.

The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks. Therefore, implicit bias in non-smooth neural networks trained by gradient descent remains an open question. In this paper, we aim to answer this question by studying the implicit bias of gradient descent for training two-layer fully connected (leaky) ReLU neural networks. We show that when the training data are nearly-orthogonal, for leaky ReLU activation function, gradient descent will find a network with a stable rank that converges to $1$, whereas for ReLU activation function, gradient descent will find a neural network with a stable rank that is upper bounded by a constant. Additionally, we show that gradient descent will find a neural network such that all the training data points have the same normalized margin asymptotically. Experiments on both synthetic and real data back up our theoretical findings.
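The stable rank tracked here has a one-line definition, $\|W\|_F^2 / \|W\|_2^2$, a smooth surrogate for rank; a quick PyTorch check:

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 of a weight matrix."""
    fro2 = (W ** 2).sum()
    spec2 = torch.linalg.matrix_norm(W, ord=2) ** 2  # spectral norm squared
    return (fro2 / spec2).item()

W = torch.randn(128, 64)
print(stable_rank(W))          # large for a random Gaussian matrix
u, v = torch.randn(128, 1), torch.randn(1, 64)
print(stable_rank(u @ v))      # ~1.0 for a rank-one matrix
```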

TopoSRL: Topology preserving self-supervised Simplicial Representation Learning
Hiren Madhu Sundeep Prabhakar Chepuri



Research question: propose a new self-supervised learning method for simplicial complexes that effectively captures higher-order interactions and preserves topology in the learned representations.
Motivation: existing graph-based self-supervised learning methods typically concentrate on pairwise relationships, neglecting the long-range dependencies that are crucial for capturing topological information.
Method: propose $\texttt{TopoSRL}$, with a new simplicial augmentation technique that generates two views of the simplicial complex, enriching the representations while remaining efficient, and a new simplicial contrastive loss function that preserves the local and global information present in the simplicial complexes.
Results: extensive experiments show the superior performance of $\texttt{TopoSRL}$ over state-of-the-art graph self-supervised techniques and supervised simplicial neural models across various datasets, corroborating its efficacy for processing simplicial complex data in a self-supervised setting.

In this paper, we introduce $\texttt{TopoSRL}$, a novel self-supervised learning (SSL) method for simplicial complexes to effectively capture higher-order interactions and preserve topology in the learned representations. $\texttt{TopoSRL}$ addresses the limitations of existing graph-based SSL methods that typically concentrate on pairwise relationships, neglecting long-range dependencies crucial to capture topological information. We propose a new simplicial augmentation technique that generates two views of the simplicial complex that enriches the representations while being efficient. Next, we propose a new simplicial contrastive loss function that contrasts the generated simplices to preserve local and global information present in the simplicial complexes. Extensive experimental results demonstrate the superior performance of $\texttt{TopoSRL}$ compared to state-of-the-art graph SSL techniques and supervised simplicial neural models across various datasets corroborating the efficacy of $\texttt{TopoSRL}$ in processing simplicial complex data in a self-supervised setting.

Simplicity Bias in 1-Hidden Layer Neural Networks
Depen Morwani jatin batra Prateek Jain Praneeth Netrapalli



Research question: rigorously define and thoroughly establish the extreme simplicity bias (SB) of one-hidden-layer neural networks in the infinite-width regime.
Motivation: recent works have shown that neural networks exhibit extreme simplicity bias: they learn only the simplest features to solve the task at hand, even in the presence of other, more robust but more complex features.
Method: (i) define SB as the network essentially being a function of a low-dimensional projection of the inputs; (ii) theoretically, show that when the data is linearly separable, the network primarily relies on the linearly separable one-dimensional subspace, even in the presence of arbitrarily many other, more complex features; (iii) empirically, show that models trained on real datasets such as Imagenet and Waterbirds-Landbirds indeed depend on low-dimensional projections of the inputs, demonstrating SB on these datasets; (iv) finally, propose a natural ensemble approach that encourages model diversity by training successive models on features unused by earlier models.
Results: the ensemble approach yields models that are significantly more robust to Gaussian noise.

Recent works have demonstrated that neural networks exhibit extreme *simplicity bias* (SB). That is, they learn *only the simplest* features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of *features*, these works showcase SB on *semi-synthetic* datasets such as Color-MNIST, MNIST-CIFAR where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for *one hidden layer* neural networks in the infinite width regime. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs, (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on *real* datasets such as Imagenet and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.

Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity
Zhanpeng Zhou Yongyi Yang Xiaojiang Yang Junchi Yan Wei Hu



Research question: investigate Linear Mode Connectivity (LMC) in neural network training and introduce a stronger notion, Layerwise Linear Feature Connectivity (LLFC).
Motivation: despite the poorly understood and highly complex loss landscapes and training dynamics of neural networks, recent work has revealed many intriguing empirical phenomena. Among them, LMC has drawn considerable attention through the observation that different solutions can be connected by a linear path in parameter space while maintaining near-constant training and test losses.
Method: introduce LLFC, the stronger notion that the feature maps of every layer in different trained networks are also linearly connected, and provide comprehensive empirical evidence across a wide range of settings that whenever two trained networks satisfy LMC (via either spawning or permutation methods), they also satisfy LLFC in nearly all layers.
Results: the study of LLFC transcends and advances the understanding of LMC by adopting a feature-learning perspective.

Recent work has revealed many intriguing empirical phenomena in neural network training, despite the poorly understood and highly complex loss landscapes and training dynamics. One of these phenomena, Linear Mode Connectivity (LMC), has gained considerable attention due to the intriguing observation that different solutions can be connected by a linear path in the parameter space while maintaining near-constant training and test losses. In this work, we introduce a stronger notion of linear connectivity, Layerwise Linear Feature Connectivity (LLFC), which says that the feature maps of every layer in different trained networks are also linearly connected. We provide comprehensive empirical evidence for LLFC across a wide range of settings, demonstrating that whenever two trained networks satisfy LMC (via either spawning or permutation methods), they also satisfy LLFC in nearly all the layers. Furthermore, we delve deeper into the underlying factors contributing to LLFC, which reveal new insights into the permutation approaches. The study of LLFC transcends and advances our understanding of LMC by adopting a feature-learning perspective.
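LLFC can be probed directly: compare the features of the weight-interpolated network against the interpolation of the two networks' features. The sketch below checks only the final-layer features of a toy MLP (the paper's claim is layerwise), and independently initialized networks are not expected to satisfy it:

```python
import torch

def llfc_gap(model_a, model_b, make_model, x, alpha=0.5):
    """Cosine similarity between features of the alpha-interpolated model
    and the alpha-interpolation of the two models' features; ~1 when LLFC
    holds. `make_model` builds a fresh module of the same shape."""
    sa, sb = model_a.state_dict(), model_b.state_dict()
    interp = make_model()
    interp.load_state_dict({k: (1 - alpha) * sa[k] + alpha * sb[k] for k in sa})
    f_interp = interp(x)
    f_mix = (1 - alpha) * model_a(x) + alpha * model_b(x)
    cos = torch.nn.functional.cosine_similarity(
        f_interp.flatten(1), f_mix.flatten(1), dim=1)
    return cos.mean().item()

make = lambda: torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                                   torch.nn.Linear(32, 32))
torch.manual_seed(0); m1 = make()
torch.manual_seed(1); m2 = make()
# low for unrelated networks; near 1 for spawned/permuted LMC pairs
print(llfc_gap(m1, m2, make, torch.randn(8, 10)))
```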

High dimensional, tabular deep learning with an auxiliary knowledge graph
Camilo Ruiz Hongyu Ren Kexin Huang Jure Leskovec



Research question: machine learning models tend to perform poorly on tabular datasets with high-dimensional features but limited samples.
Motivation: abundant auxiliary domain information describing the input features can be structured as a heterogeneous knowledge graph, which may help improve model performance.
Method: propose PLATO, which uses an auxiliary knowledge graph describing the input features to regularize a multilayer perceptron (MLP), enabling strong performance on tabular data with far more features than samples. In PLATO, each input feature corresponds to a node in the auxiliary knowledge graph and to a weight vector in the MLP's first layer.
Results: across 6 datasets with far more features than samples, PLATO outperforms 13 state-of-the-art baselines by up to 10.19%.

Machine learning models exhibit strong performance on datasets with abundant labeled samples. However, for tabular datasets with extremely high $d$-dimensional features but limited $n$ samples (i.e. $d \gg n$), machine learning models struggle to achieve strong performance due to the risk of overfitting. Here, our key insight is that there is often abundant, auxiliary domain information describing input features which can be structured as a heterogeneous knowledge graph (KG). We propose PLATO, a method that achieves strong performance on tabular data with $d \gg n$ by using an auxiliary KG describing input features to regularize a multilayer perceptron (MLP). In PLATO, each input feature corresponds to a node in the auxiliary KG. In the MLP’s first layer, each input feature also corresponds to a weight vector. PLATO is based on the inductive bias that two input features corresponding to similar nodes in the auxiliary KG should have similar weight vectors in the MLP's first layer. PLATO captures this inductive bias by inferring the weight vector for each input feature from its corresponding node in the KG via a trainable message-passing function. Across 6 $d \gg n$ datasets, PLATO outperforms 13 state-of-the-art baselines by up to 10.19%.
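PLATO's inductive bias can be sketched as a first layer whose weight vectors are produced from KG node embeddings by a shared trainable function. The plain MLP below stands in for the paper's message-passing function, and all names are hypothetical:

```python
import torch

class PLATOFirstLayer(torch.nn.Module):
    """Sketch: the first-layer weight vector of each input feature is
    inferred from that feature's KG node embedding by a shared function,
    so features with similar KG nodes get similar weight vectors."""
    def __init__(self, kg_node_emb, hidden_dim):
        super().__init__()
        # one KG node embedding per input feature; fixed here, but could be
        # produced upstream by message passing as in the paper
        self.kg_node_emb = kg_node_emb            # (d_features, kg_dim)
        self.infer_weight = torch.nn.Sequential(
            torch.nn.Linear(kg_node_emb.shape[1], 64), torch.nn.ReLU(),
            torch.nn.Linear(64, hidden_dim))

    def forward(self, x):                          # x: (batch, d_features)
        W = self.infer_weight(self.kg_node_emb)    # (d_features, hidden_dim)
        return x @ W

kg_emb = torch.randn(500, 16)
layer = PLATOFirstLayer(kg_emb, hidden_dim=128)
print(layer(torch.randn(4, 500)).shape)            # torch.Size([4, 128])
```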

Inner Product-based Neural Network Similarity
Wei Chen Zichen Miao Qiang Qiu



Research question: how to efficiently assess and compare representational similarity across large numbers of neural network models.
Motivation: many applications require assessing and comparing the similarities of different neural network models, but existing methods are computationally inefficient.
Method: propose a new approach that decomposes convolutional filters into a linear combination of filter subspace elements (called filter atoms) and shares the decomposed atom coefficients across networks, reducing the computation of representational similarity to cosine distances between the respective filter atoms.
Results: both theory and experiments show the method's effectiveness: it preserves a strong linear correlation with popular probing-based metrics while being far more efficient to obtain and robust to probing data. Its effectiveness is further validated in applications with numerous models, such as federated and continual learning.

Analyzing representational similarity among neural networks (NNs) is essential for interpreting or transferring deep models. In application scenarios where numerous NN models are learned, it becomes crucial to assess model similarities in computationally efficient ways. In this paper, we propose a new paradigm for reducing NN representational similarity to filter subspace distance. Specifically, when convolutional filters are decomposed as a linear combination of a set of filter subspace elements, denoted as filter atoms, and have those decomposed atom coefficients shared across networks, NN representational similarity can be significantly simplified as calculating the cosine distance among respective filter atoms, to achieve millions of times computation reduction over popular probing-based methods. We provide both theoretical and empirical evidence that such simplified filter subspace-based similarity preserves a strong linear correlation with other popular probing-based metrics, while being significantly more efficient to obtain and robust to probing data. We further validate the effectiveness of the proposed method in various application scenarios where numerous models exist, such as federated and continual learning as well as analyzing training dynamics. We hope our findings can help further explorations of real-time large-scale representational similarity analysis in neural networks.
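Once two networks share atom coefficients, comparing them reduces to cosine similarity between their filter atoms, a few lines of NumPy (the atom shapes below are illustrative):

```python
import numpy as np

def filter_atom_similarity(atoms_a, atoms_b):
    """Mean cosine similarity between corresponding filter atoms of two
    networks (shape: num_atoms x k*k), assuming the atom coefficients are
    shared across the networks as described in the abstract."""
    def normalize(A):
        return A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
    return (normalize(atoms_a) * normalize(atoms_b)).sum(axis=1).mean()

# toy usage: 6 atoms of 3x3 filters per network
rng = np.random.default_rng(0)
atoms1 = rng.normal(size=(6, 9))
atoms2 = atoms1 + 0.1 * rng.normal(size=(6, 9))   # a slightly perturbed network
print(filter_atom_similarity(atoms1, atoms2))     # close to 1.0
```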

A new perspective on building efficient and expressive 3D equivariant graph neural networks
weitao Du Yuanqi Du Limei Wang Dieqiao Feng Guifeng Wang Shuiwang Ji Carla P Gomes Zhi-Ming Ma



Research question: evaluate, through a local-to-global analysis, the expressive power of equivariant graph neural networks in encoding 3D symmetries.
Motivation: despite rapid progress in encoding 3D symmetries into graph neural networks (GNNs), a comprehensive evaluation of the expressiveness of these architectures is still missing.
Method: propose a local hierarchy of 3D isomorphism to evaluate the expressive power of equivariant GNNs and study how global geometric information is represented from local patches. This leads to two key modules for designing expressive and efficient geometric GNNs: local substructure encoding (LSE) and frame transition encoding (FTE).
Results: to demonstrate the applicability of the theory, propose LEFTNet, which effectively implements these modules and achieves state-of-the-art performance on both scalar-valued and vector-valued molecular property prediction tasks; the future design space for 3D equivariant GNNs is also outlined.

Geometric deep learning enables the encoding of physical symmetries in modeling 3D objects. Despite rapid progress in encoding 3D symmetries into Graph Neural Networks (GNNs), a comprehensive evaluation of the expressiveness of these network architectures through a local-to-global analysis is still lacking. In this paper, we propose a local hierarchy of 3D isomorphism to evaluate the expressive power of equivariant GNNs and investigate the process of representing global geometric information from local patches. Our work leads to two crucial modules for designing expressive and efficient geometric GNNs; namely local substructure encoding (\textbf{LSE}) and frame transition encoding (\textbf{FTE}). To demonstrate the applicability of our theory, we propose LEFTNet which effectively implements these modules and achieves state-of-the-art performance on both scalar-valued and vector-valued molecular property prediction tasks. We further point out future design space for 3D equivariant graph neural networks. Our codes are available at \url{https://github.com/yuanqidu/LeftNet}.

CosNet: A Generalized Spectral Kernel Network
Yanfang Xue Pengfei Fang Jinyue Tian Shipeng Zhu hui xue



Research question: how to fully exploit complex-valued feature mappings to improve the representation of time-sequential data.
Motivation: existing spectral kernel-based methods eliminate the imaginary part, which limits their representation power.
Method: propose a generalized spectral kernel network, CosNet, comprising a spectral kernel mapping generalization (SKMG) module and a complex-valued spectral kernel embedding (CSKE) module.
Results: experiments demonstrate that CosNet outperforms mainstream kernel methods and complex-valued neural networks.

Complex-valued representation exists inherently in the time-sequential data that can be derived from the integration of harmonic waves. The non-stationary spectral kernel, realizing a complex-valued feature mapping, has shown its potential to analyze the time-varying statistical characteristics of the time-sequential data, as a result of the modeling frequency parameters. However, most existing spectral kernel-based methods eliminate the imaginary part, thereby limiting the representation power of the spectral kernel. To tackle this issue, we propose a generalized spectral kernel network, namely, \underline{Co}mplex-valued \underline{s}pectral kernel \underline{Net}work (CosNet), which includes spectral kernel mapping generalization (SKMG) module and complex-valued spectral kernel embedding (CSKE) module. Concretely, the SKMG module is devised to generalize the spectral kernel mapping in the real number domain to the complex number domain, recovering the inherent complex-valued representation for the real-valued data. Then a following CSKE module is further developed to combine the complex-valued spectral kernels and neural networks to effectively capture long-range or periodic relations of the data. Along with the CosNet, we study the effect of the complex-valued spectral kernel mapping via theoretically analyzing the bound of covering number and generalization error. Extensive experiments demonstrate that CosNet performs better than the mainstream kernel methods and complex-valued neural networks.
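The core object, a complex-valued spectral feature map whose imaginary part is retained, can be illustrated with a random-Fourier-style construction; CosNet's SKMG/CSKE modules are more elaborate, so this only conveys the underlying idea:

```python
import numpy as np

def complex_spectral_features(X, omega):
    """Complex-valued feature map phi(x) = exp(i * x @ omega), keeping the
    imaginary part that real-valued spectral kernel methods drop."""
    return np.exp(1j * (X @ omega))                # (n, num_freqs), complex

def spectral_kernel(X, omega):
    """Induced kernel k(x, y) = E[phi(x) * conj(phi(y))], estimated by
    averaging over the sampled frequencies."""
    phi = complex_spectral_features(X, omega)
    return (phi @ phi.conj().T).real / omega.shape[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
omega = rng.normal(size=(3, 256))                  # frequency samples
K = spectral_kernel(X, omega)
print(np.allclose(np.diag(K), 1.0))                # phi has unit modulus
```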

Concept Algebra for (Score-Based) Text-Controlled Generative Models
Zihao Wang Lin Gui Jeffrey Negrea Victor Veitch



Research question: the structure of learned representations in text-controlled generative models, focusing on score-based models.
Motivation: a key property of such models is their ability to compose disparate concepts in a "disentangled" manner, suggesting that their internal representations encode concepts in a disentangled way.
Method: focus on the idea that concepts are encoded as subspaces of some representation space, and develop a simple method for identifying the part of the representation corresponding to a given concept.
Results: using examples with Stable Diffusion, demonstrate that the concepts expressed by the model can be manipulated through algebraic operations on the representation.

This paper concerns the structure of learned representations in text-guided generative models, focusing on score-based models. A key property of such models is that they can compose disparate concepts in a 'disentangled' manner. This suggests these models have internal representations that encode concepts in a 'disentangled' manner. Here, we focus on the idea that concepts are encoded as subspaces of some representation space. We formalize what this means, show there's a natural choice for the representation, and develop a simple method for identifying the part of the representation corresponding to a given concept. In particular, this allows us to manipulate the concepts expressed by the model through algebraic manipulation of the representation. We demonstrate the idea with examples using Stable Diffusion.

Recurrent Temporal Revision Graph Networks
YIZHOU CHEN Anxiang Zeng Qingtao Yu Kerui Zhang Cao Yuanpeng Kangle Wu Guangda Huzhang Han Yu Zhiming Zhou



Research question: how to model many real-world scenarios more accurately, in particular neighbor aggregation in temporal graph networks.
Motivation: temporal graphs model the real world more precisely than static graphs, yet neighbor aggregation in current temporal graph networks is straightforwardly extended from static graphs and can be computationally expensive when all historical neighbors are involved.
Method: propose a new framework for temporal neighbor aggregation that uses a recurrent neural network with node-wise hidden states to integrate information from all historical neighbors of each node, acquiring complete neighbor information.
Results: the framework is theoretically more expressive and achieves state-of-the-art performance in real-world applications, including a +9.4% average precision improvement over existing methods on a real e-commerce dataset.

Temporal graphs offer more accurate modeling of many real-world scenarios than static graphs. However, neighbor aggregation for temporal graphs, a critical building block of graph networks, is currently a straightforward extension of its static-graph counterpart. It can be computationally expensive when involving all historical neighbors during such aggregation. In practice, typically only a subset of the most recent neighbors are involved. However, such subsampling leads to incomplete and biased neighbor information. To address this limitation, we propose a novel framework for temporal neighbor aggregation that uses the recurrent neural network with node-wise hidden states to integrate information from all historical neighbors for each node to acquire the complete neighbor information. We demonstrate the superior theoretical expressiveness of the proposed framework as well as its state-of-the-art performance in real-world applications. Notably, it achieves a significant +9.4% improvement on averaged precision in a real-world e-commerce dataset over existing methods on 2-layer models.
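The proposed aggregation can be sketched as per-node hidden states updated by a recurrent cell on every interaction event, so no historical neighbor is dropped; the module below is a hedged toy version, not the paper's architecture:

```python
import torch

class RecurrentNeighborState(torch.nn.Module):
    """Each node keeps a hidden state updated by a GRU cell whenever a new
    neighbor event arrives, summarizing all historical neighbors without
    subsampling."""
    def __init__(self, num_nodes, msg_dim, hidden_dim):
        super().__init__()
        self.cell = torch.nn.GRUCell(msg_dim, hidden_dim)
        self.register_buffer("h", torch.zeros(num_nodes, hidden_dim))

    def observe(self, dst_nodes, messages):
        """dst_nodes: (batch,) long tensor; messages: (batch, msg_dim)."""
        h_new = self.cell(messages, self.h[dst_nodes])
        h = self.h.clone()
        h[dst_nodes] = h_new
        self.h = h

    def embed(self, nodes):
        return self.h[nodes]      # complete-history neighbor summary

mod = RecurrentNeighborState(num_nodes=100, msg_dim=8, hidden_dim=16)
mod.observe(torch.tensor([3, 7]), torch.randn(2, 8))
print(mod.embed(torch.tensor([3])).shape)   # torch.Size([1, 16])
```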

Learning Rule-Induced Subgraph Representations for Inductive Relation Prediction
Tianyu Liu Qitan Lv Jie Wang Shuling Yang Hanzhu Chen



Research question: how to effectively learn rule-induced subgraph representations for completing evolving knowledge graphs.
Motivation: existing methods cannot differentiate the target link from other links during message passing, so the final subgraph representation contains rule information irrelevant to the target link, which reduces reasoning performance and severely hinders real-world applications.
Method: propose a novel single-source edge-wise GNN model to learn rule-induced subgraph representations (REST), encoding relevant rules and eliminating irrelevant ones within the subgraph. Specifically, a single-source initialization approach initializes edge features only for the target link, guaranteeing the relevance of the mined rules to the target link, and several RNN-based edge-wise message-passing functions model the sequential property of the mined rules.
Results: experiments on inductive relation prediction benchmarks demonstrate the effectiveness of REST. Moreover, REST needs no node labeling, which accelerates subgraph preprocessing by up to 11.66x.

Inductive relation prediction (IRP)---where entities can be different during training and inference---has shown great power for completing evolving knowledge graphs. Existing works mainly focus on using graph neural networks (GNNs) to learn the representation of the subgraph induced from the target link, which can be seen as an implicit rule-mining process to measure the plausibility of the target link. However, these methods are not able to differentiate the target link and other links during message passing, hence the final subgraph representation will contain irrelevant rule information to the target link, which reduces the reasoning performance and severely hinders the applications for real-world scenarios. To tackle this problem, we propose a novel $\textit{single-source edge-wise}$ GNN model to learn the $\textbf{R}$ule-induc$\textbf{E}$d $\textbf{S}$ubgraph represen$\textbf{T}$ations $(\textbf{REST}$), which encodes relevant rules and eliminates irrelevant rules within the subgraph. Specifically, we propose a $\textit{single-source}$ initialization approach to initialize edge features only for the target link, which guarantees the relevance of mined rules and target link. Then we propose several RNN-based functions for $\textit{edge-wise}$ message passing to model the sequential property of mined rules. REST is a simple and effective approach with theoretical support to learn the $\textit{rule-induced subgraph representation}$. Moreover, REST does not need node labeling, which significantly accelerates the subgraph preprocessing time by up to $\textbf{11.66}\times$. Experiments on inductive relation prediction benchmarks demonstrate the effectiveness of our REST.

Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion
weitao Du Jiujiu Chen Xuecang Zhang Zhi-Ming Ma Shengchao Liu



Research question: how to better leverage machine learning techniques for drug discovery, in particular representations of molecular geometry.
Motivation: molecular geometry is the fundamental building block of drug discovery, so the geometrical representation of molecules is the main bottleneck to better utilizing machine learning techniques for drug discovery.
Method: propose a pre-training method for molecule joint auto-encoding (MoleculeJAE) that learns both 2D bond (topology) and 3D conformation (geometry) information; a diffusion process model mimics the augmented trajectories of the two modalities, from which MoleculeJAE learns the inherent chemical structure in a self-supervised manner.
Results: empirically, MoleculeJAE reaches state-of-the-art performance on 15 of 20 tasks against 12 competitive baselines, demonstrating its effectiveness.

Recently, artificial intelligence for drug discovery has raised increasing interest in both machine learning and chemistry domains. The fundamental building block for drug discovery is molecule geometry and thus, the molecule's geometrical representation is the main bottleneck to better utilize machine learning techniques for drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE can learn both the 2D bond (topology) and 3D conformation (geometry) information, and a diffusion process model is applied to mimic the augmented trajectories of such two modalities, based on which, MoleculeJAE will learn the inherent chemical structure in a self-supervised manner. Thus, the pretrained geometrical representation in MoleculeJAE is expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE proves its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks by comparing it with 12 competitive baselines.

Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells
Rylan Schaeffer Mikail Khona Tzuhsuan Ma Cristobal Eyzaguirre Sanmi Koyejo Ila R Fiete



Research question: how the mammalian lineage solves the spatial problems of mapping, localization, and navigation through striking spatial representations.
Motivation: the mammalian nervous system has developed the peculiar grid cells, neurons that represent self-location, a local and aperiodic quantity, with seemingly bizarre non-local and spatially periodic activity patterns.
Method: drawing key insights from four families of approaches (dynamical systems, coding theory, function optimization, and supervised deep learning), propose a new self-supervised learning framework that, with no access to supervised position information, produces multiple grid cell modules and generalizes beyond the training distribution.
Results: the results speak both to neuroscientists interested in the origins of grid cells and to machine learning researchers interested in novel self-supervised learning frameworks.

To solve the spatial problems of mapping, localization and navigation, the mammalian lineage has developed striking spatial representations. One important spatial representation is the Nobel-prize winning grid cells: neurons that represent self-location, a local and aperiodic quantity, with seemingly bizarre non-local and spatially periodic activity patterns of a few discrete periods. Why has the mammalian lineage learnt this peculiar grid representation? Mathematical analysis suggests that this multi-periodic representation has excellent properties as an algebraic code with high capacity and intrinsic error-correction, but to date, synthesis of multi-modular grid cells in deep recurrent neural networks remains absent. In this work, we begin by identifying key insights from four families of approaches to answering the grid cell question: dynamical systems, coding theory, function optimization and supervised deep learning. We then leverage our insights to propose a new approach that elegantly combines the strengths of all four approaches. Our approach is a self-supervised learning (SSL) framework - including data, data augmentations, loss functions and a network architecture - motivated from a normative perspective, with no access to supervised position information. Without making assumptions about internal or readout representations, we show that multiple grid cell modules can emerge in networks trained on our SSL framework and that the networks generalize significantly beyond their training distribution. This work contains insights for neuroscientists interested in the origins of grid cells as well as machine learning researchers interested in novel SSL frameworks.

Modeling Dynamics over Meshes with Gauge Equivariant Nonlinear Message Passing
Jung Yeon Park Lawson L.S. Wong Robin Walters



Research question: how to handle data over non-Euclidean manifolds in computer graphics and in biological and physical systems, especially solving partial differential equations (PDEs) over surface meshes.
Motivation: graph neural networks have been successfully applied to PDEs but do not incorporate surface geometry or the local gauge symmetries of the manifold, while existing mesh architectures that leverage the underlying geometry underperform when modeling surface PDEs with complex nonlinear dynamics.
Method: introduce a new gauge equivariant architecture based on nonlinear message passing, which outperforms convolutional or attentional networks on domains with highly complex and nonlinear dynamics.
Results: as in the non-mesh case, however, design trade-offs favor convolutional, attentional, or message-passing networks for different tasks; the circumstances in which the message-passing method provides the most benefit are investigated.

Data over non-Euclidean manifolds, often discretized as surface meshes, naturally arise in computer graphics and biological and physical systems. In particular, solutions to partial differential equations (PDEs) over manifolds depend critically on the underlying geometry. While graph neural networks have been successfully applied to PDEs, they do not incorporate surface geometry and do not consider local gauge symmetries of the manifold. Alternatively, recent works on gauge equivariant convolutional and attentional architectures on meshes leverage the underlying geometry but underperform in modeling surface PDEs with complex nonlinear dynamics. To address these issues, we introduce a new gauge equivariant architecture using nonlinear message passing. Our novel architecture achieves higher performance than either convolutional or attentional networks on domains with highly complex and nonlinear dynamics. However, similar to the non-mesh case, design trade-offs favor convolutional, attentional, or message passing networks for different tasks; we investigate in which circumstances our message passing method provides the most benefit.

Energy Transformer
Benjamin Hoover Yuchen Liang Bao Pham Rameswar Panda Hendrik Strobelt Duen Horng Chau Mohammed J Zaki Dmitry Krotov



Research question: combine attention mechanisms, energy-based models, and associative memory into a new architecture, the Energy Transformer (ET).
Motivation: attention is the powerhouse driving modern deep learning but lacks clear theoretical foundations; energy-based models allow a principled approach to discriminative and generative tasks, but designing the energy functional is not straightforward; Dense Associative Memory models (Modern Hopfield Networks) have a well-established theoretical foundation and an intuitive energy design.
Method: propose the Energy Transformer, which uses a sequence of attention layers purposely designed to minimize a specifically engineered energy function responsible for representing the relationships between tokens.
Results: introduce the theoretical foundations of ET, explore its empirical capabilities on the image completion task, and obtain strong quantitative results on graph anomaly detection and graph classification.

Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory. Attention is the power-house driving modern deep learning successes, but it lacks clear theoretical foundations. Energy-based models allow a principled approach to discriminative and generative tasks, but the design of the energy functional is not straightforward. At the same time, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, and allow an intuitive design of the energy function. We propose a novel architecture, called the Energy Transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection and graph classification tasks.

Expressive probabilistic sampling in recurrent neural networks
Shirui Chen Linxing Preston Jiang Rajesh P. N. Rao Eric Todd SheaBrown



Research question: explore how recurrent neural network circuits can sample from complex probability distributions.
Motivation: sampling-based models of brain function assume that neural activities are samples from probability distributions the brain uses for probabilistic computation, but a comprehensive understanding of how mechanistic models of neural dynamics can sample from arbitrary distributions is still lacking.
Method: use tools from functional analysis and stochastic differential equations to explore the minimum architectural requirements for recurrent neural circuits to sample from complex distributions. First consider the traditional sampler-only network, a set of neurons whose outputs directly represent the samples; then show that the firing-rate dynamics of a recurrent neural circuit with a separate set of output units can sample from an arbitrary probability distribution.
Results: empirically, the model samples from several complex data distributions, demonstrating its applicability to developing the next generation of sampling-based Bayesian brain models.

In sampling-based Bayesian models of brain function, neural activities are assumed to be samples from probability distributions that the brain uses for probabilistic computation. However, a comprehensive understanding of how mechanistic models of neural dynamics can sample from arbitrary distributions is still lacking. We use tools from functional analysis and stochastic differential equations to explore the minimum architectural requirements for $\textit{recurrent}$ neural circuits to sample from complex distributions. We first consider the traditional sampling model consisting of a network of neurons whose outputs directly represent the samples ($\textit{sampler-only}$ network). We argue that synaptic current and firing-rate dynamics in the traditional model have limited capacity to sample from a complex probability distribution. We show that the firing rate dynamics of a recurrent neural circuit with a separate set of output units can sample from an arbitrary probability distribution. We call such circuits $\textit{reservoir-sampler networks}$ (RSNs). We propose an efficient training procedure based on denoising score matching that finds recurrent and output weights such that the RSN implements Langevin sampling. We empirically demonstrate our model's ability to sample from several complex data distributions using the proposed neural dynamics and discuss its applicability to developing the next generation of sampling-based Bayesian brain models.
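The sampling dynamics the output weights are trained to implement are plain Langevin dynamics driven by a score function; on a toy Gaussian target (score(x) = -x) this is easy to check:

```python
import numpy as np

def langevin_sample(score_fn, dim, num_steps=2000, step=1e-2, rng=None):
    """Langevin dynamics x <- x + step*score(x) + sqrt(2*step)*noise.
    In a reservoir-sampler network, the recurrent and output weights are
    trained (via denoising score matching) so the rate dynamics realize
    this update; here the score is supplied directly."""
    rng = rng or np.random.default_rng(0)
    x = rng.normal(size=dim)
    for _ in range(num_steps):
        x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.normal(size=dim)
    return x

# standard 2D Gaussian target, whose score is -x
samples = np.stack([
    langevin_sample(lambda x: -x, dim=2, rng=np.random.default_rng(s))
    for s in range(500)])
print(samples.mean(0), samples.std(0))   # ~ [0, 0] and ~ [1, 1]
```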

Cross-links Matter for Link Prediction: Rethinking the Debiased GNN from a Data Perspective
Zihan Luo Hong Huang Jianxun Lian Xiran Song Xing Xie Hai Jin



Research question: address the bias issues of graph neural networks (GNNs) in link prediction.
Motivation: existing GNN models exhibit severe data bias between internal-links and cross-links, which matters both for easing information cocoons and for preserving graph connectivity.
Method: design a simple yet effective twin-structure framework that generates debiased node embeddings and fuses them into the embeddings of the original GNN, mitigating the bias while boosting utility in an end-to-end manner.
Results: experiments show that the framework not only alleviates the bias between internal-links and cross-links but also boosts overall accuracy, and comparisons with state-of-the-art methods verify its superiority.

Recently, the bias-related issues in GNN-based link prediction have raised widely spread concerns. In this paper, we emphasize the bias on links across different node clusters, which we call cross-links, after considering its significance in both easing information cocoons and preserving graph connectivity. Instead of following the objective-oriented mechanism in prior works with compromised utility, we empirically find that existing GNN models face severe data bias between internal-links (links within the same cluster) and cross-links, and this inspires us to rethink the bias issue on cross-links from a data perspective. Specifically, we design a simple yet effective twin-structure framework, which can be easily applied to most of GNNs to mitigate the bias as well as boost their utility in an end-to-end manner. The basic idea is to generate debiased node embeddings as demonstrations, and fuse them into the embeddings of original GNNs. In particular, we learn debiased node embeddings with the help of augmented supervision signals, and a novel dynamic training strategy is designed to effectively fuse debiased node embeddings with the original node embeddings. Experiments on three datasets with six common GNNs show that our framework can not only alleviate the bias between internal-links and cross-links, but also boost the overall accuracy. Comparisons with other state-of-the-art methods also verify the superiority of our method.

ANTN: Bridging Autoregressive Neural Networks and Tensor Networks for Quantum Many-Body Simulation
Zhuo Chen Laker Newhouse Eddie Chen Di Luo Marin Soljacic



Research question: quantum many-body physics simulation has important impacts on understanding fundamental science and applications to quantum materials design and quantum technology, but direct simulation is intractable because the Hilbert space grows exponentially with the particle number.
Motivation: tensor networks and neural networks are the two state-of-the-art methods for approximate simulation, but each has its own limitations in expressivity and inductive bias.
Method: develop a novel architecture, the Autoregressive Neural TensorNet (ANTN), which bridges tensor networks and autoregressive neural networks.
Results: ANTN parameterizes normalized wavefunctions, allows exact sampling, generalizes the expressivity of tensor networks and autoregressive neural networks, and inherits various symmetries from autoregressive neural networks. It outperforms both on quantum state learning and on finding the ground state of the challenging 2D J1-J2 Heisenberg model across system sizes and coupling parameters, opening new opportunities for quantum many-body physics simulation, quantum technology design, and generative modeling in artificial intelligence.

Quantum many-body physics simulation has important impacts on understanding fundamental science and has applications to quantum materials design and quantum technology. However, due to the exponentially growing size of the Hilbert space with respect to the particle number, a direct simulation is intractable. While representing quantum states with tensor networks and neural networks are the two state-of-the-art methods for approximate simulations, each has its own limitations in terms of expressivity and inductive bias. To address these challenges, we develop a novel architecture, Autoregressive Neural TensorNet (ANTN), which bridges tensor networks and autoregressive neural networks. We show that Autoregressive Neural TensorNet parameterizes normalized wavefunctions, allows for exact sampling, generalizes the expressivity of tensor networks and autoregressive neural networks, and inherits a variety of symmetries from autoregressive neural networks. We demonstrate our approach on quantum state learning as well as finding the ground state of the challenging 2D $J_1$-$J_2$ Heisenberg model with different system sizes and coupling parameters, outperforming both tensor networks and autoregressive neural networks. Our work opens up new opportunities for quantum many-body physics simulation, quantum technology design, and generative modeling in artificial intelligence.

Probabilistic Invariant Learning with Randomized Linear Classifiers
Leonardo Cotta Gal Yehuda Assaf Schuster Chris J. Maddison



Research question: designing models that are both expressive and preserve the known invariances of a task is an increasingly hard problem.
Motivation: existing solutions trade invariance against computational or memory resources. This work leverages randomness to design models that are both expressive and invariant while using fewer resources.
Method: inspired by randomized algorithms, propose a class of binary classification models called Randomized Linear Classifiers (RLCs), and give parameter and sample-size conditions under which RLCs can, with high probability, approximate any (smooth) function while preserving invariance to compact group transformations.
Results: leveraging this result, design three RLCs that are provably probabilistically invariant for classification tasks over sets, graphs, and spherical data; show that these models achieve probabilistic invariance and universality using fewer resources than deterministic neural networks and their invariant counterparts; and empirically demonstrate the benefits of this new class of models on invariant tasks where deterministic invariant neural networks are known to struggle.

Designing models that are both expressive and preserve known invariances of tasks is an increasingly hard problem. Existing solutions tradeoff invariance for computational or memory resources. In this work, we show how to leverage randomness and design models that are both expressive and invariant but use less resources. Inspired by randomized algorithms, our key insight is that accepting probabilistic notions of universal approximation and invariance can reduce our resource requirements. More specifically, we propose a class of binary classification models called Randomized Linear Classifiers (RLCs). We give parameter and sample size conditions in which RLCs can, with high probability, approximate any (smooth) function while preserving invariance to compact group transformations. Leveraging this result, we design three RLCs that are provably probabilistic invariant for classification tasks over sets, graphs, and spherical data. We show how these models can achieve probabilistic invariance and universality using less resources than (deterministic) neural networks and their invariant counterparts. Finally, we empirically demonstrate the benefits of this new class of models on invariant tasks where deterministic invariant neural networks are known to struggle.

Exploiting Connections between Lipschitz Structures for Certifiably Robust Deep Equilibrium Models
Aaron J Havens Alexandre Araujo Siddharth Garg Farshad Khorrami Bin Hu



Research question: the certified robustness of deep equilibrium models (DEQs) is far less understood than that of their explicit network counterparts.
Motivation: advance the understanding of the certified robustness of DEQs by exploiting connections among the Lipschitz network parameterizations of various explicit and implicit models.
Method: reparameterize popular Lipschitz network structures, including convex potential layers (CPL), SDP-based Lipschitz layers (SLL), almost orthogonal layers (AOL), Sandwich layers, and monotone DEQs (MonDEQ), as special cases of Lipschitz-bounded equilibrium networks (LBEN), without changing the prescribed Lipschitz constant of the original parameterization.
Results: empirical results show that the proposed method improves the certified robust accuracy of DEQs on classification tasks.

Recently, deep equilibrium models (DEQs) have drawn increasing attention from the machine learning community. However, DEQs are much less understood in terms of certified robustness than their explicit network counterparts. In this paper, we advance the understanding of certified robustness of DEQs via exploiting the connections between various Lipschitz network parameterizations for both explicit and implicit models. Importantly, we show that various popular Lipschitz network structures, including convex potential layers (CPL), SDP-based Lipschitz layers (SLL), almost orthogonal layers (AOL), Sandwich layers, and monotone DEQs (MonDEQ) can all be reparameterized as special cases of the Lipschitz-bounded equilibrium networks (LBEN) without changing the prescribed Lipschitz constant in the original network parameterization. A key feature of our reparameterization technique is that it preserves the Lipschitz prescription used in different structures. This opens the possibility of achieving improved certified robustness of DEQs via a combination of network reparameterization, structure-preserving regularization, and LBEN-based fine-tuning. We also support our theoretical understanding with new empirical results, which show that our proposed method improves the certified robust accuracy of DEQs on classification tasks. All codes and experiments are made available at \url{https://github.com/AaronHavens/ExploitingLipschitzDEQ}.

Uncovering Meanings of Embeddings via Partial Orthogonality
Yibo Jiang Bryon Aragam Victor Veitch



Research question: This paper studies how the semantic structure of language is encoded in the algebraic structure of embedding vectors.
Motivation: Although it is intuitive that "eggplant" and "tomato" are independent given "vegetable", formalizing this notion of semantic independence is difficult; an algebraic structure that obeys the independence axioms is needed to capture it.
Method: We use partial orthogonality as the relevant algebraic structure and develop theory and methods showing that partial orthogonality does capture semantic independence. We also introduce the concept of independence-preserving embeddings, which preserve the conditional independence structures of a distribution, and prove the existence of such embeddings and of approximations to them.
Results: Together, these tools encode the semantic structure of language in the algebraic structure of embeddings, offering a new perspective and toolkit for understanding and exploiting natural language.

Machine learning tools often rely on embedding text as vectors of real numbers. In this paper, we study how the semantic structure of language is encoded in the algebraic structure of such embeddings. Specifically, we look at a notion of "semantic independence" capturing the idea that, e.g., "eggplant" and "tomato" are independent given "vegetable". Although such examples are intuitive, it is difficult to formalize such a notion of semantic independence. The key observation here is that any sensible formalization should obey a set of so-called independence axioms, and thus any algebraic encoding of this structure should also obey these axioms. This leads us naturally to use partial orthogonality as the relevant algebraic structure. We develop theory and methods that allow us to demonstrate that partial orthogonality does indeed capture semantic independence. Complementary to this, we also introduce the concept of independence preserving embeddings where embeddings preserve the conditional independence structures of a distribution, and we prove the existence of such embeddings and approximations to them.
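One natural formalization of this idea, mirroring partial correlation, is that two embedding vectors are partially orthogonal given a set of vectors if their residuals, after projecting out the span of that set, are orthogonal. The sketch below (my reading of the concept, not the paper's exact definitions) checks this numerically; the toy embeddings and the function names are illustrative assumptions.

```python
import numpy as np

def residual(v, Z):
    """Project v onto the orthogonal complement of span(Z) (columns of Z)."""
    Q, _ = np.linalg.qr(Z)                # orthonormal basis for span(Z)
    return v - Q @ (Q.T @ v)

def partially_orthogonal(x, y, Z, tol=1e-8):
    """x and y are partially orthogonal given Z if their residuals,
    after removing the span of Z, are orthogonal."""
    rx, ry = residual(x, Z), residual(y, Z)
    return abs(rx @ ry) <= tol * (np.linalg.norm(rx) * np.linalg.norm(ry) + 1e-12)

# Toy check: embeddings whose only shared component is a "vegetable"
# direction are partially orthogonal given that direction.
e_veg, e1, e2 = np.eye(64)[:3]
eggplant = 2.0 * e_veg + e1
tomato = 1.5 * e_veg + e2
print(partially_orthogonal(eggplant, tomato, e_veg[:, None]))      # True
print(partially_orthogonal(eggplant, tomato, np.zeros((64, 0))))   # False: raw vectors overlap
```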

Learning threshold neurons via edge of stability
Kwangjun Ahn Sebastien Bubeck Sinho Chewi Yin Tat Lee Felipe Suarez Yi Zhang



Research question: Existing analyses of neural network training typically assume a very small learning rate, in clear tension with practical experience and empirical studies; this work seeks to understand genuinely non-convex training dynamics at large learning rates through a detailed analysis.
Motivation: Despite a flurry of recent work on this topic, the effects of training with large learning rates, and their potential benefits for generalization, remain poorly understood.
Method: By analyzing gradient descent on simplified two-layer neural network models, we provably establish the edge-of-stability phenomenon and discover a sharp phase transition in the step size, below which the network fails to learn "threshold-like" neurons.
Results: This finding elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, since threshold neurons are basic building blocks with useful inductive bias for many tasks.

Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.

Isometric Quotient Variational Auto-Encoders for Structure-Preserving Representation Learning
In Huh changwook jeong Jae Myung Choe Young-Gu Kim Dae Sin Kim



Research question: How to obtain structure-preserving low-dimensional representations, via variational auto-encoders (VAEs), of a data manifold embedded in a high-dimensional observation space.
Motivation: A faithful representation should preserve the geometry of the data manifold while factoring out nuisance symmetry transformations, which plain VAEs do not guarantee.
Method: By decomposing the data manifold into a group of symmetry transformations and a quotient space, the structure-preserving representation of such a manifold is defined as a latent space that is isometrically isomorphic (i.e., distance-preserving) to the quotient space rather than to the manifold itself. To this end, a new auto-encoding framework, isometric quotient VAEs (IQVAEs), is proposed, which can extract the quotient space from observations and learn the Riemannian geometry of the extracted quotient in an unsupervised manner.
Results: Proof-of-concept experiments show that the method finds meaningful representations of the learned data and outperforms competitors on downstream tasks.

We study structure-preserving low-dimensional representation of a data manifold embedded in a high-dimensional observation space based on variational auto-encoders (VAEs). We approach this by decomposing the data manifold $\mathcal{M}$ as $\mathcal{M} = \mathcal{M} / G \times G$, where $G$ and $\mathcal{M} / G$ are a group of symmetry transformations and a quotient space of $\mathcal{M}$ up to $G$, respectively. From this perspective, we define the structure-preserving representation of such a manifold as a latent space $\mathcal{Z}$ which is isometrically isomorphic (i.e., distance-preserving) to the quotient space $\mathcal{M} / G$ rather than $\mathcal{M}$ (i.e., symmetry-preserving). To this end, we propose a novel auto-encoding framework, named isometric quotient VAEs (IQVAEs), that can extract the quotient space from observations and learn the Riemannian isometry of the extracted quotient in an unsupervised manner. Empirical proof-of-concept experiments reveal that the proposed method can find a meaningful representation of the learned data and outperform other competitors for downstream tasks.

Are GATs Out of Balance?
Nimrah Mustafa Aleksandar Bojchevski Rebekka Burkholz



Research question: This work investigates the optimization and learning dynamics of graph neural networks (GNNs), focusing on the popular Graph Attention Network (GAT) architecture.
Motivation: Although the expressive power and computational capabilities of GNNs have been studied theoretically, their optimization and learning dynamics remain largely unexplored. In GATs in particular, a large fraction of parameters struggle to change during training under standard initialization, a problem that is amplified in deeper networks.
Method: We derive a conservation law of GAT gradient-flow dynamics that explains this effect, and devise an initialization scheme that balances the GAT network; this not only makes deeper networks trainable but also considerably speeds up training and convergence relative to standard initialization.
Results: Our main theorem serves as a stepping stone for studying the learning dynamics of positive homogeneous models with attention mechanisms.

While the expressive power and computational capabilities of graph neural networks (GNNs) have been theoretically studied, their optimization and learning dynamics, in general, remain largely unexplored. Our study undertakes the Graph Attention Network (GAT), a popular GNN architecture in which a node's neighborhood aggregation is weighted by parameterized attention coefficients. We derive a conservation law of GAT gradient flow dynamics, which explains why a high portion of parameters in GATs with standard initialization struggle to change during training. This effect is amplified in deeper GATs, which perform significantly worse than their shallow counterparts. To alleviate this problem, we devise an initialization scheme that balances the GAT network. Our approach i) allows more effective propagation of gradients and in turn enables trainability of deeper networks, and ii) attains a considerable speedup in training and convergence time in comparison to the standard initialization. Our main theorem serves as a stepping stone to studying the learning dynamics of positive homogeneous models with attention mechanisms.

Approximation-Generalization Trade-offs under (Approximate) Group Equivariance
Mircea Petrache Shubhendu Trivedi



Research question: This paper concerns the explicit incorporation of task-specific inductive biases through symmetry for building high-performance machine learning models.
Motivation: Group equivariant neural networks, for example, have shown impressive performance across domains and applications such as protein and drug design. A prevalent intuition about such models is that integrating the relevant symmetry enhances generalization; moreover, when the data and/or the model exhibit only approximate or partial symmetry, the best-performing model is posited to be one whose symmetry aligns with that of the data.
Method: We first present quantitative bounds demonstrating how models that capture task-specific symmetries improve generalization, and then use this quantification to examine the more general question of handling approximate/partial symmetries.
Results: For a given symmetry group, we establish a quantitative comparison between the approximate equivariance of the model and that of the data distribution, precisely connecting model equivariance error and data equivariance error; our result delineates the conditions under which the model equivariance error is optimal, yielding the best-performing model for the given task and data.

The explicit incorporation of task-specific inductive biases through symmetry has emerged as a general design precept in the development of high-performance machine learning models. For example, group equivariant neural networks have demonstrated impressive performance across various domains and applications such as protein and drug design. A prevalent intuition about such models is that the integration of relevant symmetry results in enhanced generalization. Moreover, it is posited that when the data and/or the model exhibits only approximate or partial symmetry, the optimal or best-performing model is one where the model symmetry aligns with the data symmetry. In this paper, we conduct a formal unified investigation of these intuitions. To begin, we present quantitative bounds that demonstrate how models capturing task-specific symmetries lead to improved generalization. Utilizing this quantification, we examine the more general question of dealing with approximate/partial symmetries. We establish, for a given symmetry group, a quantitative comparison between the approximate equivariance of the model and that of the data distribution, precisely connecting model equivariance error and data equivariance error. Our result delineates the conditions under which the model equivariance error is optimal, thereby yielding the best-performing model for the given task and data.

Fragment-based Pretraining and Finetuning on Molecular Graphs
Kha-Dinh Luong Ambuj Singh



Research question: How can unlabeled molecular data be used effectively to pretrain graph neural networks (GNNs)?
Motivation: Unlabeled molecular data are abundant, which facilitates self-supervised learning for GNNs in the chemical domain.
Method: We propose pretraining GNNs at the fragment level: borrowing from recent work on principal subgraph mining, we extract a compact vocabulary of prevalent fragments from a large pretraining dataset and design several fragment-based contrastive and predictive pretraining tasks on top of it.
Results: Experiments show that the method improves performance on 5 of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%.

Property prediction on molecular graphs is an important application of Graph Neural Networks (GNNs). Recently, unlabeled molecular data has become abundant, which facilitates the rapid development of self-supervised learning for GNNs in the chemical domain. In this work, we propose pretraining GNNs at the fragment level, a promising middle ground to overcome the limitations of node-level and graph-level pretraining. Borrowing techniques from recent work on principal subgraph mining, we obtain a compact vocabulary of prevalent fragments from a large pretraining dataset. From the extracted vocabulary, we introduce several fragment-based contrastive and predictive pretraining tasks. The contrastive learning task jointly pretrains two different GNNs: one on molecular graphs and the other on fragment graphs, which represents higher-order connectivity within molecules. By enforcing consistency between the fragment embedding and the aggregated embedding of the corresponding atoms from the molecular graphs, we ensure that the embeddings capture structural information at multiple resolutions. The structural information of fragment graphs is further exploited to extract auxiliary labels for graph-level predictive pretraining. We employ both the pretrained molecular-based and fragment-based GNNs for downstream prediction, thus utilizing the fragment information during finetuning. Our graph fragment-based pretraining (GraphFP) advances the performances on 5 out of 8 common molecular benchmarks and improves the performances on long-range biological benchmarks by at least 11.5%. Code is available at: https://github.com/lvkd84/GraphFP.

Exact Representation of Sparse Networks with Symmetric Nonnegative Embeddings
Sudhanshu Chanpuriya Ryan A. Rossi Anup Rao Tung Mai Nedim Lipka Zhao Song Cameron N Musco



Research question: Graph models based on factorizing the adjacency matrix often fail to capture network structure involving links between dissimilar nodes (heterophily).
Motivation: We propose a novel graph factorization model that uses two nonnegative vectors per node to interpretably account for links between both similar and dissimilar nodes.
Method: The model can exactly represent any graph with low arboricity, a property that many real-world networks satisfy; moreover, thanks to its symmetric structure and nonnegativity, fitting the model inherently finds node communities, and its link predictions can be interpreted in terms of these communities.
Results: In experiments on real-world networks, we demonstrate the factorization's effectiveness on a variety of tasks, including community detection and link prediction.

Graph models based on factorization of the adjacency matrix often fail to capture network structures related to links between dissimilar nodes (heterophily). We introduce a novel graph factorization model that leverages two nonnegative vectors per node to interpretably account for links between both similar and dissimilar nodes. We prove that our model can exactly represent any graph with low *arboricity*, a property that many real-world networks satisfy; our proof also applies to related models but has much greater scope than the closest prior bound, which is based on low *max degree*. Our factorization also has compelling properties besides expressiveness: due to its symmetric structure and nonnegativity, fitting the model inherently finds node communities, and the model's link predictions can be interpreted in terms of these communities. In experiments on real-world networks, we demonstrate our factorization's effectiveness on a variety of tasks, including community detection and link prediction.

Large language models transition from integrating across position-yoked, exponential windows to structure-yoked, power-law windows
David Skrill Samuel Victor Norman-Haignere



Research question: This paper explores temporal integration patterns in large language models (LLMs) and their similarity to biological neural systems.
Motivation: Human brain responses to language exhibit hierarchically organized "integration windows" that substantially constrain the overall influence of an input token (e.g., a word) on the neural response, yet little prior work has used integration windows to characterize computations in LLMs.
Method: We develop a simple word-swap procedure for estimating integration windows from black-box language models that requires no access to gradients or knowledge of the model architecture (e.g., attention weights).
Results: Trained LLMs exhibit stereotyped integration windows that are well fit by a convex combination of an exponential and a power-law function, with a partial transition from exponential to power-law dynamics across network layers; integration windows become increasingly yoked to structure at later layers. None of these findings is observed in untrained models.

Modern language models excel at integrating across long temporal scales needed to encode linguistic meaning and show non-trivial similarities to biological neural systems. Prior work suggests that human brain responses to language exhibit hierarchically organized "integration windows" that substantially constrain the overall influence of an input token (e.g., a word) on the neural response. However, little prior work has attempted to use integration windows to characterize computations in large language models (LLMs). We developed a simple word-swap procedure for estimating integration windows from black-box language models that does not depend on access to gradients or knowledge of the model architecture (e.g., attention weights). Using this method, we show that trained LLMs exhibit stereotyped integration windows that are well-fit by a convex combination of an exponential and a power-law function, with a partial transition from exponential to power-law dynamics across network layers. We then introduce a metric for quantifying the extent to which these integration windows vary with structural boundaries (e.g., sentence boundaries), and using this metric, we show that integration windows become increasingly yoked to structure at later network layers. None of these findings were observed in an untrained model, which as expected integrated uniformly across its input. These results suggest that LLMs learn to integrate information in natural language using a stereotyped pattern: integrating across position-yoked, exponential windows at early layers, followed by structure-yoked, power-law windows at later layers. The methods we describe in this paper provide a general-purpose toolkit for understanding temporal integration in language models, facilitating cross-disciplinary research at the intersection of biological and artificial intelligence.
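A minimal sketch of a word-swap style probe is given below: replace all tokens more than a distance d before a target position with tokens from an unrelated context, and measure how much the model's prediction at the target moves; the distance at which the change vanishes estimates the integration window. The `logprobs` interface and the toy model are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def integration_profile(logprobs, context, swap_source, target_pos):
    """Black-box integration-window probe (a sketch, not the paper's exact
    method). `logprobs(tokens) -> np.ndarray` is an assumed interface
    returning per-position prediction scores; for each distance d, tokens
    more than d positions before `target_pos` are replaced with tokens
    from an unrelated `swap_source`."""
    base = logprobs(context)[target_pos]
    profile = []
    for d in range(1, target_pos + 1):
        swapped = list(context)
        swapped[: target_pos - d] = swap_source[: target_pos - d]
        profile.append(np.abs(logprobs(swapped)[target_pos] - base).mean())
    return np.array(profile)  # ~0 once d exceeds the integration window

# Toy "model" that only integrates over the previous 5 tokens:
def toy_logprobs(tokens):
    out = np.zeros((len(tokens), 1))
    for t in range(len(tokens)):
        out[t, 0] = hash(tuple(tokens[max(0, t - 4): t + 1])) % 97
    return out

rng = np.random.default_rng(0)
ctx = rng.integers(0, 50, size=40).tolist()
src = rng.integers(0, 50, size=40).tolist()
print(integration_profile(toy_logprobs, ctx, src, target_pos=30))  # zero for d >= 5
```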

What Can We Learn from Unlearnable Datasets?
Pedro Sandoval-Segura Vasu Singla Jonas Geiping Micah Goldblum Tom Goldstein



Research question: In an era of widespread data scraping, unlearnable dataset methods could protect data privacy by preventing deep neural networks from generalizing.
Motivation: Beyond a number of practical limitations that make their use unlikely, we find problems with these methods' ability to actually safeguard data.
Method: Training neural networks on unlearnable datasets, we find that the networks can in fact learn useful features that can be reweighted for high test performance, suggesting that image protection is not assured; we also propose an orthogonal projection attack that enables learning from unlearnable datasets.
Results: Our findings challenge the belief that unlearnable dataset methods protect data privacy, and our orthogonal projection attack is far simpler than recently proposed techniques.

In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But beyond the practical limitations that make their use unlikely, we make a number of findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets only learn shortcuts, simpler rules that are not useful for generalization. In contrast, we find that networks actually can learn useful features that can be reweighed for high test performance, suggesting that image protection is not assured. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition. To emphasize why linearly separable perturbations should not be relied upon, we propose an orthogonal projection attack which allows learning from unlearnable datasets published in ICML 2021 and ICLR 2023. Our proposed attack is significantly less complex than recently proposed techniques.
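The general idea behind an orthogonal-projection style attack can be sketched in a few lines: if the protective perturbations are (nearly) linearly predictable from the labels, a linear classifier fit on the poisoned data recovers those directions, and projecting the data onto their orthogonal complement removes the shortcut before normal training. This is an illustrative sketch under that assumption, not necessarily the authors' exact attack.

```python
import numpy as np

def orthogonal_projection_cleanse(X, y, num_classes):
    """Sketch of an orthogonal-projection attack on unlearnable data.
    1) Fit a linear one-vs-all classifier on the poisoned data; its weight
       directions tend to align with linearly predictable perturbations.
    2) Project the data onto the orthogonal complement of those directions."""
    Y = np.eye(num_classes)[y]                    # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)     # (d, C) weight directions
    Q, _ = np.linalg.qr(W)                        # orthonormal basis of span(W)
    return X - X @ Q @ Q.T                        # remove those directions

# Toy demo: class-wise additive "perturbations" along one axis per class.
rng = np.random.default_rng(0)
d, n, C = 100, 500, 5
signal = rng.normal(size=(n, d))
y = rng.integers(0, C, size=n)
poison = np.eye(d)[:C] * 5.0                      # one strong direction per class
X = signal + poison[y]
X_clean = orthogonal_projection_cleanse(X, y, C)
print("max |poisoned coordinate| before:", np.abs(X[:, :C]).max())
print("after projection:", np.abs(X_clean[:, :C]).max())  # dominant poison removed
```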

CORNN: Convex optimization of recurrent neural networks for rapid inference of neural dynamics
Fatih Dinc Adam Shai Mark Schnitzer Hidenori Tanaka



Research question: How to efficiently train large-scale data-constrained recurrent neural networks (dRNNs) to interpret and control large neural populations in behaving animals.
Motivation: Advances in optical and electrophysiological recording make real-time training of large neural network models possible, opening new opportunities for research and medical applications.
Method: We propose a training method, Convex Optimization of Recurrent Neural Networks (CORNN), which in simulated recordings attains training speeds roughly 100-fold faster than traditional optimization approaches while maintaining or improving modeling accuracy.
Results: By training dRNNs with millions of parameters in sub-minute processing times on a standard computer, CORNN is a first step toward real-time network reproduction constrained by large-scale neural recordings and a powerful computational tool for advancing the understanding of neural computation.

Advances in optical and electrophysiological recording technologies have made it possible to record the dynamics of thousands of neurons, opening up new possibilities for interpreting and controlling large neural populations in behaving animals. A promising way to extract computational principles from these large datasets is to train data-constrained recurrent neural networks (dRNNs). Performing this training in real-time could open doors for research techniques and medical applications to model and control interventions at single-cell resolution and drive desired forms of animal behavior. However, existing training algorithms for dRNNs are inefficient and have limited scalability, making it a challenge to analyze large neural recordings even in offline scenarios. To address these issues, we introduce a training method termed Convex Optimization of Recurrent Neural Networks (CORNN). In studies of simulated recordings, CORNN attained training speeds $\sim$100-fold faster than traditional optimization approaches while maintaining or enhancing modeling accuracy. We further validated CORNN on simulations with thousands of cells that performed simple computations such as those of a 3-bit flip-flop or the execution of a timed response. Finally, we showed that CORNN can robustly reproduce network dynamics and underlying attractor structures despite mismatches between generator and inference models, severe subsampling of observed neurons, or mismatches in neural time-scales. Overall, by training dRNNs with millions of parameters in subminute processing times on a standard computer, CORNN constitutes a first step towards real-time network reproduction constrained on large-scale neural recordings and a powerful computational tool for advancing the understanding of neural computation.
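The key structural fact that makes convex training of data-constrained RNNs possible is that, when the rates of all units are observed, fitting the recurrent weights decouples into independent convex problems, one per neuron. The sketch below illustrates this under an assumed rate model r[t+1] = tanh(W r[t] + noise); it is a minimal ridge-regression version of the idea, not the CORNN solver itself.

```python
import numpy as np

def fit_drnn_convex(R, lam=1e-3):
    """Minimal convex sketch of dRNN fitting: with fully observed rates R[t]
    and assumed dynamics r[t+1] = tanh(W r[t] + noise), inverting the
    nonlinearity turns the fit into independent ridge regressions, one per
    neuron (each convex and solvable in closed form)."""
    X = R[:-1]                                       # (T-1, N) inputs
    Z = np.arctanh(np.clip(R[1:], -0.999, 0.999))    # linearized targets
    N = R.shape[1]
    A = X.T @ X + lam * np.eye(N)
    return np.linalg.solve(A, X.T @ Z).T             # row i solves neuron i's problem

# Toy check: recover a ground-truth network from its own noisy trajectory.
rng = np.random.default_rng(1)
N, T = 50, 2000
W_true = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))
R = np.zeros((T, N)); R[0] = rng.normal(size=N)
for t in range(T - 1):
    R[t + 1] = np.tanh(W_true @ R[t] + 0.3 * rng.normal(size=N))
W_hat = fit_drnn_convex(R)
print(np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true))  # small relative error
```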

Feature-Learning Networks Are Consistent Across Widths At Realistic Scales
Nikhil Vyas Alexander Atanasov Blake Bordelon Depen Morwani Sabarish Sainathan Cengiz Pehlevan



Research question: This paper studies the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Motivation: Early in training on online data, wide neural networks have not only identical loss curves but also agree in their pointwise test predictions; for simple tasks such as CIFAR-5m, this holds throughout training at realistic widths.
Method: Networks of different widths are trained and compared across tasks and training stages, examining structural properties of the models including internal representations, preactivation distributions, edge-of-stability phenomena, and large-learning-rate effects.
Results: Wide networks are highly consistent early in training, but on harder tasks (such as ImageNet and language modeling) and at later training times, finite-width deviations grow systematically. These deviations stem from two factors: an initialization-dependent variance of the network output that scales inversely with width (removable by ensembling), and a bias of narrower width (ensembles of narrower networks perform worse than a single wide network). The paper concludes with a spectral perspective on the origin of this finite-width bias.

We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets. Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training. For simple tasks such as CIFAR-5m this holds throughout training for networks of realistic widths. We also show that structural properties of the models, including internal representations, preactivation distributions, edge of stability phenomena, and large learning rate effects are consistent across large widths. This motivates the hypothesis that phenomena seen in realistic models can be captured by infinite-width, feature-learning limits. For harder tasks (such as ImageNet and language modeling), and later training times, finite-width deviations grow systematically. Two distinct effects cause these deviations across widths. First, the network output has an initialization-dependent variance scaling inversely with width, which can be removed by ensembling networks. We observe, however, that ensembles of narrower networks perform worse than a single wide network. We call this the bias of narrower width. We conclude with a spectral perspective on the origin of this finite-width bias.

The Crucial Role of Normalization in Sharpness-Aware Minimization
Yan Dai Kwangjun Ahn Suvrit Sra



Research question: This work seeks to understand the role of normalization in the Sharpness-Aware Minimization (SAM) optimizer.
Motivation: SAM is a gradient-based optimizer that substantially improves the prediction performance of deep neural networks, and the reasons for its success have attracted considerable interest.
Method: The effect of normalization in SAM is studied theoretically and empirically, for both convex and non-convex functions.
Results: Normalization plays two key roles: it helps stabilize the algorithm, and it enables the algorithm to drift along a continuum (manifold) of minima, which is key to SAM's better performance. These two properties make SAM robust to the choice of hyperparameters, supporting its practicality; the conclusions are backed by a variety of experiments.

Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding ***the role played by normalization***, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima -- a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.
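For concreteness, a minimal SAM step is sketched below; the normalization in question is the division of the ascent perturbation by the gradient norm, and dropping it is the ablation the paper studies. The toy quadratic and step sizes are arbitrary choices for illustration.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.005, rho=0.05, normalize=True):
    """One Sharpness-Aware Minimization step (minimal sketch).
    The ascent perturbation is rho * g / ||g|| when `normalize` is True;
    without normalization, the perturbation scales with the raw gradient."""
    g = grad_fn(w)
    eps = rho * ((g / (np.linalg.norm(g) + 1e-12)) if normalize else g)
    return w - lr * grad_fn(w + eps)   # descend using the perturbed gradient

# Toy quadratic f(w) = 0.5 * w^T diag(h) w with one sharp direction.
h = np.array([100.0, 1.0])
grad = lambda w: h * w
w = np.array([1.0, 1.0])
for _ in range(50):
    w = sam_step(w, grad)
print(w)   # iterates remain stable with the normalized perturbation
```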

On the impact of activation and normalization in obtaining isometric embeddings at initialization
Amir Joudaki Hadi Daneshmand Francis Bach



Research question: Explore the structure of the penultimate Gram matrix in deep neural networks and address its degeneracy at initialization, which dramatically slows training.
Motivation: In several architectures this Gram matrix has been observed to degenerate with depth at initialization. Although normalization layers such as batch or layer normalization play a pivotal role in preventing rank collapse, existing theoretical results neither extend to the layer normalization widely used in transformers nor quantitatively characterize the role of nonlinear activations.
Method: We prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron toward the identity matrix at an exponential rate with depth at initialization, and quantify this rate using the Hermite expansion of the activation function.
Results: The analysis explains how normalization and nonlinear activations together prevent Gram-matrix degeneration at initialization, offering a new theoretical perspective on trainability in deep models.

In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and can not quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function.
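The phenomenon is easy to observe numerically: passing a strongly correlated batch through a random MLP with layer normalization and a nonlinearity, pairwise correlations (the off-diagonal of the Gram matrix) shrink with depth. The simulation below is an illustration of the theorem's setting; the width, depth, and tanh activation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(H):
    """Row-wise layer normalization: zero mean, unit variance per sample."""
    H = H - H.mean(axis=1, keepdims=True)
    return H / (H.std(axis=1, keepdims=True) + 1e-12)

width, depth, batch = 512, 30, 6
base = rng.normal(size=width)
H = base + 0.5 * rng.normal(size=(batch, width))   # highly correlated inputs
for layer in range(1, depth + 1):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    H = np.tanh(layer_norm(H @ W))
    G = H @ H.T
    C = G / np.sqrt(np.outer(np.diag(G), np.diag(G)))   # Gram, rescaled to correlations
    off = (np.abs(C).sum() - batch) / (batch * (batch - 1))
    if layer % 5 == 0:
        print(f"layer {layer:2d}: mean |off-diagonal correlation| = {off:.3f}")
```

The off-diagonal entries decay toward zero with depth, i.e., the Gram matrix drifts toward the identity, consistent with the exponential rate the paper derives from the activation's Hermite coefficients.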

Zero-One Laws of Graph Neural Networks
Sam Adam-Day Theodor-Mihai Iliant Ismail Ilkan Ceylan



Research question: This paper examines how graph neural networks (GNNs) behave as the number of nodes becomes very large, and the theoretical limits this places on their representation and extrapolation capacity.
Motivation: GNNs are the standard deep learning architectures for machine learning on graphs, yet the theoretical bounds on their representation and extrapolation capacity remain unclear.
Method: Drawing graphs of increasing size from the Erdős–Rényi model, we analyze the probability that such graphs are mapped to a particular output by a class of GNN classifiers.
Results: As the number of nodes grows, the probability that a GNN maps graphs to a particular output tends to either zero or one, establishing "zero-one laws" for these GNNs and revealing theoretical limits on their capacity.

Graph neural networks (GNNs) are the de facto standard deep learning architectures for machine learning on graphs. This has led to a large body of work analyzing the capabilities and limitations of these models, particularly pertaining to their representation and extrapolation capacity. We offer a novel theoretical perspective on the representation and extrapolation capacity of GNNs, by answering the question: how do GNNs behave as the number of graph nodes become very large? Under mild assumptions, we show that when we draw graphs of increasing size from the Erdős–Rényi model, the probability that such graphs are mapped to a particular output by a class of GNN classifiers tends to either zero or one. This class includes the popular graph convolutional network architecture. The result establishes `zero-one laws' for these GNNs, and analogously to other convergence laws, entails theoretical limitations on their capacity. We empirically verify our results, observing that the theoretical asymptotic limits are evident already on relatively small graphs.
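The asymptotic behavior is visible in a small simulation: feed Erdős–Rényi graphs of increasing size, with i.i.d. node features, to a fixed random GCN-style classifier and track the fraction mapped to class 1. The classifier below is an illustrative stand-in for the family covered by the theorem, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, untrained classifier: two rounds of mean aggregation with random
# weights, mean pooling, then a threshold readout.
d = 16
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_out, mu = rng.normal(size=d), rng.normal(size=d)

def gnn_predicts_one(A, X):
    deg = A.sum(axis=1, keepdims=True) + 1e-9
    H = np.tanh((A @ X) / deg @ W1)     # mean aggregation, layer 1
    H = np.tanh((A @ H) / deg @ W2)     # mean aggregation, layer 2
    return float(H.mean(axis=0) @ w_out > 0.0)

for n in [20, 50, 100, 200, 500]:
    draws = []
    for _ in range(200):
        A = (rng.random((n, n)) < 0.3).astype(float)
        A = np.triu(A, 1); A = A + A.T               # Erdos-Renyi G(n, 0.3)
        X = mu + rng.normal(size=(n, d))             # i.i.d. node features
        draws.append(gnn_predicts_one(A, X))
    print(n, np.mean(draws))    # acceptance probability drifts toward 0 or 1
```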

What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Hengyu Fu Tianyu Guo Yu Bai Song Mei



Research question: This paper presents a theoretical study of the attention layer, the core building block of the Transformer architecture, and its learning and generalization abilities.
Motivation: The Transformer architecture has achieved major breakthroughs in modern artificial intelligence, and understanding its core attention component is important for explaining how it works and for improving model performance.
Method: A single multi-head attention layer is analyzed, taking a sequence of key vectors and a separate query vector as input, in a random-feature setting: the layer has many heads, the query and key matrices are randomly sampled and frozen, and the value matrices are trainable.
Results: Such a random-feature attention layer can express a broad class of target functions that are permutation-invariant to the key vectors; excess-risk bounds are provided for learning these targets from finite samples with finitely many heads. Experiments corroborate the theory and illustrate the interplay between sample size and target-function complexity.

Attention layers---which map a sequence of inputs to a sequence of outputs---are core building blocks of the Transformer architecture which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a rigorous theoretical study on the learning and generalization of a single multi-head attention layer, with a sequence of key vectors and a separate query vector as input. We consider the random feature setting where the attention layer has a large number of heads, with randomly sampled frozen query and key matrices, and trainable value matrices. We show that such a random-feature attention layer can express a broad class of target functions that are permutation invariant to the key vectors. We further provide quantitative excess risk bounds for learning these target functions from finite samples, using random feature attention with finitely many heads. Our results feature several implications unique to the attention structure compared with existing random features theory for neural networks, such as (1) Advantages in the sample complexity over standard two-layer random-feature networks; (2) Concrete and natural classes of functions that can be learned efficiently by a random-feature attention layer; and (3) The effect of the sampling distribution of the query-key weight matrix (the product of the query and key matrix), where Gaussian random weights with a non-zero mean result in better sample complexities over the zero-mean counterpart for learning certain natural target functions. Experiments on simulated data corroborate our theoretical findings and further illustrate the interplay between the sample size and the complexity of the target function.
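A minimal sketch of the random-feature setting is given below: many heads with frozen random query-key matrices produce attention features, and only the value weights on top of them are trained, here by ridge regression on a permutation-invariant target. Dimensions, the target function, and the ridge fit are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, M = 8, 10, 256                  # key dim, keys per input, number of heads
n_train, n_test = 3000, 500

# Frozen random query-key matrices, one per head (the "random features").
Wqk = rng.normal(scale=1.0 / np.sqrt(d), size=(M, d, d))

def features(Q, K):
    """Concatenated attention outputs of all heads; only the value weights
    on top of these fixed features are trained."""
    F = np.zeros((len(Q), M * d))
    for m in range(M):
        scores = np.einsum('bd,de,ble->bl', Q, Wqk[m], K)
        alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)
        F[:, m * d:(m + 1) * d] = np.einsum('bl,bld->bd', alpha, K)
    return F

# A permutation-invariant target in the keys: mean squared alignment.
def target(Q, K):
    return (np.einsum('bd,bld->bl', Q, K) ** 2).mean(axis=1)

Q = rng.normal(size=(n_train + n_test, d))
K = rng.normal(size=(n_train + n_test, L, d))
F, y = features(Q, K), target(Q, K)
Ftr, Fte, ytr, yte = F[:n_train], F[n_train:], y[:n_train], y[n_train:]
v = np.linalg.solve(Ftr.T @ Ftr + 1e-3 * np.eye(M * d), Ftr.T @ ytr)  # ridge fit
print("test R^2:", 1 - ((Fte @ v - yte) ** 2).mean() / yte.var())
```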

Formalizing locality for normative synaptic plasticity models
Colin Bredenberg Ezekiel Williams Cristina Savin Blake Aaron Richards Guillaume Lajoie



Research question: How should locality be defined and operationalized so that it is clear which learning algorithms can be considered biologically plausible?
Motivation: Many new models of synaptic plasticity in the brain are grounded in machine learning principles, but the notion of a "biologically plausible" learning algorithm is only vaguely defined.
Method: We propose formal and operational definitions of locality that make explicit which quantities cannot appear in a learning rule if an algorithm is to satisfy a given (biological) constraint.
Results: The framework distills testable predictions from various classes of biologically plausible synaptic plasticity models that are robust to arbitrary choices of neural network architecture; it can thus be used to guide claims of biological plausibility and to identify potential ways to experimentally falsify proposed learning algorithms for the brain.

In recent years, many researchers have proposed new models for synaptic plasticity in the brain based on principles of machine learning. The central motivation has been the development of learning algorithms that are able to learn difficult tasks while qualifying as "biologically plausible". However, the concept of a biologically plausible learning algorithm is only heuristically defined as an algorithm that is potentially implementable by biological neural networks. Further, claims that neural circuits could implement any given algorithm typically rest on an amorphous concept of "locality" (both in space and time). As a result, it is unclear what many proposed local learning algorithms actually predict biologically, and which of these are consequently good candidates for experimental investigation. Here, we address this lack of clarity by proposing formal and operational definitions of locality. Specifically, we define different classes of locality, each of which makes clear what quantities cannot be included in a learning rule if an algorithm is to qualify as local with respect to a given (biological) constraint. We subsequently use this framework to distill testable predictions from various classes of biologically plausible synaptic plasticity models that are robust to arbitrary choices about neural network architecture. Therefore, our framework can be used to guide claims of biological plausibility and to identify potential means of experimentally falsifying a proposed learning algorithm for the brain.

From Trainable Negative Depth to Edge Heterophily in Graphs
Yuchen Yan Yuzhong Chen Huiyuan Chen Minghua Xu Mahashweta Das Hao Yang Hanghang Tong



Research question: Finding the proper depth $d$ of a graph convolutional network (GCN) that provides strong representation ability remains a major challenge for the graph learning community.
Motivation: Despite notable progress in graph learning, the depth of a GCN is realized by a series of graph convolution operations, which naturally makes $d$ a positive integer ($d \in \mathbb{N}+$); an interesting question is whether redefining $d$ as a continuously adjustable real number ($d \in \mathbb{R}$) can bring new insight into graph learning mechanisms.
Method: This paper redefines the depth $d$ of a GCN as a trainable parameter continuously adjustable within $(-\infty,+\infty)$, opening a new door to controlling its signal-processing capability to model graph homophily/heterophily (nodes with similar/dissimilar labels/attributes tend to be inter-connected). A simple yet powerful GCN model, TEDGCN, is proposed; it retains the simplicity of GCN while automatically searching for the optimal $d$ without prior knowledge of whether the input graph is homophilic or heterophilic. Negative-valued $d$ enables high-pass frequency filtering via an augmented topology for graph heterophily.
Results: Extensive experiments demonstrate the superiority of TEDGCN on node classification tasks for a variety of homophilic and heterophilic graphs.

Finding the proper depth $d$ of a graph convolutional network (GCN) that provides strong representation ability has drawn significant attention, yet nonetheless largely remains an open problem for the graph learning community. Although noteworthy progress has been made, the depth or the number of layers of a corresponding GCN is realized by a series of graph convolution operations, which naturally makes $d$ a positive integer ($d \in \mathbb{N}+$). An interesting question is whether breaking the constraint of $\mathbb{N}+$ by making $d$ a real number ($d \in \mathbb{R}$) can bring new insights into graph learning mechanisms. In this work, by redefining GCN's depth $d$ as a trainable parameter continuously adjustable within $(-\infty,+\infty)$, we open a new door of controlling its signal processing capability to model graph homophily/heterophily (nodes with similar/dissimilar labels/attributes tend to be inter-connected). A simple and powerful GCN model, TEDGCN, is proposed to retain the simplicity of GCN and meanwhile automatically search for the optimal $d$ without the prior knowledge regarding whether the input graph is homophilic or heterophilic. Negative-valued $d$ intrinsically enables high-pass frequency filtering functionality via augmented topology for graph heterophily. Extensive experiments demonstrate the superiority of TEDGCN on node classification tasks for a variety of homophilic and heterophilic graphs.
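One way to realize a real-valued (even negative) propagation depth is through the spectrum of the propagation matrix, since matrix powers with real exponents are well defined once the eigenvalues are non-negative. The sketch below uses $P = (I + \hat{A})/2$ for that purpose; it is an illustration of the idea, and TEDGCN's exact parameterization may differ. In a trainable setting, `depth` would be a learnable scalar.

```python
import numpy as np

def real_depth_propagate(A, X, depth):
    """Sketch of a real-valued propagation depth for graph convolution.
    P = (I + D^{-1/2}(A+I)D^{-1/2}) / 2 has eigenvalues in [0, 1], so
    P**depth = U diag(lam**depth) U^T is well defined for any real depth:
    depth > 0 acts as a low-pass graph filter, depth < 0 amplifies
    high-frequency components (useful under heterophily)."""
    A = A + np.eye(len(A))                      # GCN-style self-loops
    deg = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(deg))
    P = 0.5 * (np.eye(len(A)) + Dinv @ A @ Dinv)
    lam, U = np.linalg.eigh(P)
    lam = np.clip(lam, 1e-6, None)              # guard tiny eigenvalues for depth < 0
    return U @ np.diag(lam ** depth) @ U.T @ X

# Toy comparison: smoothing (depth=2) vs. sharpening (depth=-1).
rng = np.random.default_rng(0)
A = (rng.random((20, 20)) < 0.2).astype(float); A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(20, 4))
print(real_depth_propagate(A, X, 2.0).std(), real_depth_propagate(A, X, -1.0).std())
```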

Neural Data Transformer 2: Multi-context Pretraining for Neural Spiking Activity
Joel Ye Jennifer L Collinger Leila Wehbe Robert Gaunt



Research question: How to effectively leverage large-scale unsupervised pretraining to learn representations of neural spiking activity.
Motivation: Current models of neural spiking activity are largely prepared for individual experimental contexts, restricting data volume and limiting the effectiveness of deep neural networks.
Method: We develop Neural Data Transformer 2 (NDT2), a spatiotemporal Transformer for neural spiking activity, and demonstrate that pretraining can leverage motor BCI datasets spanning sessions, subjects, and experimental tasks.
Results: NDT2 enables rapid adaptation to novel contexts in downstream decoding tasks, opening the path to deploying pretrained DNNs for iBCI control.

The neural population spiking activity recorded by intracortical brain-computer interfaces (iBCIs) contain rich structure. Current models of such spiking activity are largely prepared for individual experimental contexts, restricting data volume to that collectable within a single session and limiting the effectiveness of deep neural networks (DNNs). The purported challenge in aggregating neural spiking data is the pervasiveness of context-dependent shifts in the neural data distributions. However, large scale unsupervised pretraining by nature spans heterogeneous data, and has proven to be a fundamental recipe for successful representation learning across deep learning. We thus develop Neural Data Transformer 2 (NDT2), a spatiotemporal Transformer for neural spiking activity, and demonstrate that pretraining can leverage motor BCI datasets that span sessions, subjects, and experimental tasks. NDT2 enables rapid adaptation to novel contexts in downstream decoding tasks and opens the path to deployment of pretrained DNNs for iBCI control. Code: https://github.com/joel99/context_general_bci

Neural approximation of Wasserstein distance via a universal architecture for symmetric and factorwise group invariant functions
Samantha Chen Yusu Wang



Research question: How to design an effective neural network that approximates continuous, symmetric product functions (such as distance functions) between complex objects (e.g., point sets and graphs) while being factor-wise group invariant.
Motivation: Learning distance functions between complex objects, such as the Wasserstein distance between point sets, is a common goal in machine learning applications, yet such functions must be invariant to a wide variety of group actions (e.g., permutations or rigid transformations), calling for a new neural network architecture.
Method: The paper first presents a general neural network architecture for approximating SFGI functions, then combines this general network with a sketching idea to develop a specific and efficient neural network that can approximate the $p$-th Wasserstein distance between point sets.
Results: Theoretically, this is the first result showing that a neural network with bounded model complexity can approximate the Wasserstein distance; empirically, the proposed architecture performs comparably to or better than other models (including a state-of-the-art Siamese Autoencoder approach), generalizing significantly better and training much faster than the SOTA Siamese AE.

Learning distance functions between complex objects, such as the Wasserstein distance to compare point sets, is a common goal in machine learning applications. However, functions on such complex objects (e.g., point sets and graphs) are often required to be invariant to a wide variety of group actions e.g. permutation or rigid transformation. Therefore, continuous and symmetric *product* functions (such as distance functions) on such complex objects must also be invariant to the *product* of such group actions. We call these functions symmetric and factor-wise group invariant functions (or SFGI functions in short). In this paper, we first present a general neural network architecture for approximating SFGI functions. The main contribution of this paper combines this general NN with a sketching idea in order to develop a specific and efficient neural network which can approximate the $p$-th Wasserstein distance between point sets. Very importantly, the required model complexity is *independent* of the sizes of input point sets. On the theoretical front, to the best of our knowledge, this is the first result showing that there exists a neural network with the capacity to approximate Wasserstein distance with bounded model complexity. Our work provides an interesting integration of sketching ideas for geometric problems with universal approximation of symmetric functions. On the empirical front, we present a range of results showing that our newly proposed neural network architecture performs comparatively or better than other models (including a SOTA Siamese Autoencoder based approach). In particular, our NN generalizes significantly better and trains much faster than the SOTA Siamese AE. Finally, this line of investigation could be useful in exploring effective neural network design for solving a broad range of geometric optimization problems (e.g., $k$-means in a metric space).

Learning better with Dale’s Law: A Spectral Perspective
Pingsheng Li Jonathan Cornford Arna Ghosh Blake Aaron Richards



Research question: Most recurrent neural networks (RNNs) omit a fundamental constraint of real neural circuits, Dale's Law, which implies that neurons must be either excitatory (E) or inhibitory (I); the law is generally absent from RNNs because simply partitioning a standard network's units into E and I populations impairs learning.
Motivation: Despite this poor track record, the authors extend a recent bio-inspired feedforward EI architecture, Dale's ANNs, to recurrent networks and demonstrate that good performance is possible while respecting Dale's Law.
Method: The authors compare the singular value distributions, spectral properties, and performance of different networks to investigate why some forms of EI network learn poorly while others learn well, and propose normalized SVD entropy as a measure of spectral pathology that correlates with performance.
Results: Simple EI partitioning yields a multimodal, dispersed singular value distribution, whereas standard RNNs and recurrent Dale's ANNs have unimodal, more clustered distributions; spectral properties and performance are worse for small networks with fewer I units. Overall, this work sheds light on a long-standing mystery in neuroscience-inspired AI and computational neuroscience, paving the way for greater alignment between neural networks and biology.

Most recurrent neural networks (RNNs) do not include a fundamental constraint of real neural circuits: Dale's Law, which implies that neurons must be excitatory (E) or inhibitory (I). Dale's Law is generally absent from RNNs because simply partitioning a standard network's units into E and I populations impairs learning. However, here we extend a recent feedforward bio-inspired EI network architecture, named Dale's ANNs, to recurrent networks, and demonstrate that good performance is possible while respecting Dale's Law. This begs the question: What makes some forms of EI network learn poorly and others learn well? And, why does the simple approach of incorporating Dale's Law impair learning? Historically the answer was thought to be the sign constraints on EI network parameters, and this was a motivation behind Dale's ANNs. However, here we show the spectral properties of the recurrent weight matrix at initialisation are more impactful on network performance than sign constraints. We find that simple EI partitioning results in a singular value distribution that is multimodal and dispersed, whereas standard RNNs have an unimodal, more clustered singular value distribution, as do recurrent Dale's ANNs. We also show that the spectral properties and performance of partitioned EI networks are worse for small networks with fewer I units, and we present normalised SVD entropy as a measure of spectrum pathology that correlates with performance. Overall, this work sheds light on a long-standing mystery in neuroscience-inspired AI and computational neuroscience, paving the way for greater alignment between neural networks and biology.
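Normalized SVD entropy is straightforward to compute; the sketch below shows one plausible reading of the measure (Shannon entropy of the normalized singular value distribution, divided by its maximum) and contrasts a standard Gaussian initialization with a naive EI partition, whose rank-one sign structure disperses the spectrum.

```python
import numpy as np

def normalized_svd_entropy(W):
    """Entropy of the normalized singular values, divided by log(rank):
    1 means a flat (isotropic) spectrum, values near 0 mean a few dominant
    directions. This is one plausible formalization of the paper's measure."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(len(s))

rng = np.random.default_rng(0)
n = 200
W_std = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # standard RNN init
signs = np.where(rng.random(n) < 0.8, 1.0, -1.0)          # 80% E / 20% I columns
W_ei = np.abs(rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))) * signs
print("standard init:", normalized_svd_entropy(W_std))
print("naive EI init:", normalized_svd_entropy(W_ei))     # lower: dispersed spectrum
```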

Long Sequence Hopfield Memory
Hamza Tahir Chaudhry Jacob A Zavatone-Veth Dmitry Krotov Cengiz Pehlevan



Research question: How to increase the sequence capacity of sequence memory models and enable recall of sequences of highly correlated patterns.
Motivation: Existing sequence memory models have limited sequence capacity due to interference between memories.
Method: A nonlinear interaction term is introduced to enhance separation between patterns, expanding sequence capacity, and a generalized pseudoinverse rule is proposed to recall sequences of highly correlated patterns.
Results: The new model significantly outperforms models based on traditional Hopfield networks in sequence capacity and successfully recalls highly correlated pattern sequences; it can also store sequences with variable timing between state transitions and admits a biologically plausible implementation.

Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
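A minimal sketch of the mechanism: a temporally asymmetric rule pulls the state toward the successor of the best-matching stored pattern, and a polynomial nonlinearity applied to the overlaps sharpens that match, suppressing crosstalk between memories. The odd exponent and pattern sizes below are illustrative assumptions in the spirit of the paper's nonlinear interaction term, not its exact model.

```python
import numpy as np

def next_pattern(x, patterns, n=3):
    """One recall step of a nonlinear sequence memory: overlaps with each
    stored pattern are sharpened by an odd polynomial f(u) = u**n before
    the state is pulled toward the *next* pattern (asymmetric rule)."""
    P, N = patterns.shape
    overlaps = (patterns[:-1] @ x) / N        # overlaps with patterns 0..P-2
    drive = (overlaps ** n) @ patterns[1:]    # each term points to its successor
    return np.sign(drive + 1e-12)

rng = np.random.default_rng(0)
N, P = 400, 30
patterns = rng.choice([-1.0, 1.0], size=(P, N))
x = patterns[0].copy()
correct = 0
for t in range(1, P):
    x = next_pattern(x, patterns, n=3)
    correct += int((x == patterns[t]).mean() > 0.95)
print(f"recalled {correct}/{P - 1} transitions")  # sharper f tolerates longer sequences
```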

Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs
Rajat Vadiraj Dwaraknath Tolga Ergen Mert Pilanci



Research question: This paper connects two main directions in the theoretical analysis of deep neural networks: 1) understanding SGD training in the limit of infinite hidden-layer width and infinitesimally small learning rate via the Neural Tangent Kernel (NTK), and 2) globally optimizing the regularized training objective via cone-constrained convex reformulations of ReLU networks.
Motivation: Current theory pursues these two directions largely separately, and the relationship between them has been unclear.
Method: The paper interprets the convex program of the gated ReLU network as a Multiple Kernel Learning (MKL) model with a weighted data-masking feature map and establishes a connection to the NTK: for a particular choice of mask weights that do not depend on the learning targets, this kernel equals the NTK of the gated ReLU network on the training data.
Results: Using iterative reweighting, the weights induced by the NTK are improved to obtain the optimal MKL kernel, which is equivalent to the solution of the exact convex reformulation of the gated ReLU network; several numerical simulations corroborate the theory, and the prediction error of the resulting optimal kernel is analyzed via consistency results for the group lasso.

Recently, theoretical analyses of deep neural networks have broadly focused on two directions: 1) Providing insight into neural network training by SGD in the limit of infinite hidden-layer width and infinitesimally small learning rate (also known as gradient flow) via the Neural Tangent Kernel (NTK), and 2) Globally optimizing the regularized training objective via cone-constrained convex reformulations of ReLU networks. The latter research direction also yielded an alternative formulation of the ReLU network, called a gated ReLU network, that is globally optimizable via efficient unconstrained convex programs. In this work, we interpret the convex program for this gated ReLU network as a Multiple Kernel Learning (MKL) model with a weighted data masking feature map and establish a connection to the NTK. Specifically, we show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data. A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set. By using iterative reweighting, we improve the weights induced by the NTK to obtain the optimal MKL kernel which is equivalent to the solution of the exact convex reformulation of the gated ReLU network. We also provide several numerical simulations corroborating our theory. Additionally, we provide an analysis of the prediction error of the resulting optimal kernel via consistency results for the group lasso.

PlanE: Representation Learning over Planar Graphs
Radoslav Dimitrov Zeyang Zhao Ralph Abboud Ismail Ilkan Ceylan



Research question: Design architectures that efficiently learn complete invariants of planar graphs.
Motivation: Although graph neural networks excel at graph representation learning, the graph invariants they learn are incomplete, even on special graph classes, such as planar graphs, for which efficient graph isomorphism testing algorithms are known.
Method: Inspired by the classical planar graph isomorphism algorithm of Hopcroft and Tarjan, PlanE is proposed as a framework for planar representation learning, including architectures that can learn complete invariants over planar graphs while remaining practically scalable.
Results: Experiments validate the strong performance of the resulting model architectures on well-known planar graph benchmarks, achieving multiple state-of-the-art results.

Graph neural networks are prominent models for representation learning over graphs, where the idea is to iteratively compute representations of nodes of an input graph through a series of transformations in such a way that the learned graph function is isomorphism-invariant on graphs, which makes the learned representations graph invariants. On the other hand, it is well-known that graph invariants learned by these class of models are incomplete: there are pairs of non-isomorphic graphs which cannot be distinguished by standard graph neural networks. This is unsurprising given the computational difficulty of graph isomorphism testing on general graphs, but the situation begs to differ for special graph classes, for which efficient graph isomorphism testing algorithms are known, such as planar graphs. The goal of this work is to design architectures for efficiently learning complete invariants of planar graphs. Inspired by the classical planar graph isomorphism algorithm of Hopcroft and Tarjan, we propose PlanE as a framework for planar representation learning. PlanE includes architectures which can learn complete invariants over planar graphs while remaining practically scalable. We empirically validate the strong performance of the resulting model architectures on well-known planar graph benchmarks, achieving multiple state-of-the-art results.

Loss Dynamics of Temporal Difference Reinforcement Learning
Blake Bordelon Paul Masset Henry Kuo Cengiz Pehlevan



Research question: Despite the empirical success of reinforcement learning, a theoretical understanding of how model parameters and state-representation features interact to control learning dynamics is lacking.
Motivation: To fill this gap, this paper uses concepts from statistical physics to study the learning curves of temporal difference learning.
Method: Under a Gaussian equivalence hypothesis, averages over random trajectories are replaced with temporally correlated Gaussian feature averages, and the assumptions are validated on small-scale Markov decision processes.
Results: Stochastic semi-gradient noise from subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. The paper studies how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function, analyzes how strategies such as learning-rate annealing and reward shaping can favorably alter them, and thereby provides new tools for developing a theory of learning dynamics in reinforcement learning.

Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics to study the typical case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools that open a new direction towards developing a theory of learning dynamics in reinforcement learning.

When Do Graph Neural Networks Help with Node Classification? Investigating the Homophily Principle on Node Distinguishability
Sitao Luan Chenqing Hua Minkai Xu Qincheng Lu Jiaqi Zhu Xiao-Wen Chang Jie Fu Jure Leskovec Doina Precup



Research question: This paper asks whether the performance advantage of graph neural networks (GNNs) on node classification tasks is due solely to the homophily principle (nodes with the same label are more likely to be connected), and how to quantify and understand this advantage.
Motivation: Although existing work attributes GNNs' superiority over plain neural networks on node classification to homophily, this view considers only intra-class node distinguishability and neglects inter-class distinguishability, giving an incomplete understanding of homophily.
Method: The paper proposes the Contextual Stochastic Block Model for Homophily (CSBM-H) and defines two metrics, Probabilistic Bayes Error (PBE) and negative generalized Jeffreys divergence, to quantify node distinguishability. With these metrics, it visualizes and analyzes how graph filters, node degree distributions, and class variances influence distinguishability, studies the joint effect of intra- and inter-class distinguishability, and identifies the mid-homophily pitfall, which occurs widely in graph datasets.
Results: Experiments show that the superiority of GNNs in real tasks is indeed closely related to both intra- and inter-class distinguishability regardless of homophily level. Based on this observation, a new hypothesis-testing-based performance metric is proposed that is non-linear, feature-based, and provides a statistical threshold for GNN superiority; it proves significantly more effective than existing homophily metrics at revealing the advantages and disadvantages of graph-aware models on synthetic and benchmark real-world datasets.

Homophily principle, i.e., nodes with the same labels are more likely to be connected, has been believed to be the main reason for the performance superiority of Graph Neural Networks (GNNs) over Neural Networks on node classification tasks. Recent research suggests that, even in the absence of homophily, the advantage of GNNs still exists as long as nodes from the same class share similar neighborhood patterns. However, this argument only considers intra-class Node Distinguishability (ND) but neglects inter-class ND, which provides an incomplete understanding of homophily on GNNs. In this paper, we first demonstrate such deficiency with examples and argue that an ideal situation for ND is to have smaller intra-class ND than inter-class ND. To formulate this idea and study ND deeply, we propose Contextual Stochastic Block Model for Homophily (CSBM-H) and define two metrics, Probabilistic Bayes Error (PBE) and negative generalized Jeffreys divergence, to quantify ND. With the metrics, we visualize and analyze how graph filters, node degree distributions and class variances influence ND, and investigate the combined effect of intra- and inter-class ND. Besides, we discovered the mid-homophily pitfall, which occurs widely in graph datasets. Furthermore, we verified that, in real-world tasks, the superiority of GNNs is indeed closely related to both intra- and inter-class ND regardless of homophily levels. Grounded in this observation, we propose a new hypothesis-testing based performance metric beyond homophily, which is non-linear, feature-based and can provide statistical threshold value for GNNs' superiority. Experiments indicate that it is significantly more effective than the existing homophily metrics on revealing the advantage and disadvantage of graph-aware models on both synthetic and benchmark real-world datasets.

Sampling weights of deep neural networks
Erik Lien Bolager Iryna Burak Chinmay Datar Qing Sun Felix Dietrich



Research question: This paper proposes a probability distribution, combined with an efficient sampling algorithm, for the weights and biases of fully-connected neural networks.
Motivation: In supervised learning, a trained network can then be obtained without iterative optimization or gradient computation of internal network parameters.
Method: Sampling is based on the idea of random feature models, but uses both the input and the output training data to sample shallow and deep networks; the sampled networks are proved to be universal approximators.
Results: Experiments show that sampled networks achieve accuracy comparable to iteratively trained ones while being constructed orders of magnitude faster; test cases include classification benchmarks from OpenML, sampling neural operators to represent maps in function spaces, and transfer learning with well-known architectures.

We introduce a probability distribution, combined with an efficient sampling algorithm, for weights and biases of fully-connected neural networks. In a supervised learning context, no iterative optimization or gradient computations of internal network parameters are needed to obtain a trained network. The sampling is based on the idea of random feature models. However, instead of a data-agnostic distribution, e.g., a normal distribution, we use both the input and the output training data to sample shallow and deep networks. We prove that sampled networks are universal approximators. For Barron functions, we show that the $L^2$-approximation error of sampled shallow networks decreases with the square root of the number of neurons. Our sampling scheme is invariant to rigid body transformations and scaling of the input data, which implies many popular pre-processing techniques are not required. In numerical experiments, we demonstrate that sampled networks achieve accuracy comparable to iteratively trained ones, but can be constructed orders of magnitude faster. Our test cases involve a classification benchmark from OpenML, sampling of neural operators to represent maps in function spaces, and transfer learning using well-known architectures.
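A minimal sketch of the data-aware sampling idea is given below: hidden weights are sampled from random pairs of training inputs (direction along their difference, bias placing the activation's transition between them), and only the linear readout is solved by least squares. The constants and the exact placement rule here are illustrative assumptions, not the paper's precise recipe.

```python
import numpy as np

def sample_shallow_net(X, y, m, rng):
    """Sketch of data-driven weight sampling for a one-hidden-layer tanh
    network: each hidden unit is built from a random pair of training
    points, and only the readout is fit (no gradients on hidden weights)."""
    n, d = X.shape
    i, j = rng.integers(0, n, size=(2, m))
    diff = X[j] - X[i]
    norms2 = (diff ** 2).sum(axis=1, keepdims=True) + 1e-12
    W = diff / norms2                         # steep across the pair's gap
    b = -(W * (X[i] + X[j]) / 2).sum(axis=1)  # transition midway between the pair
    H = np.tanh(X @ W.T + b)                  # fixed, data-aware random features
    c, *_ = np.linalg.lstsq(H, y, rcond=None)
    return lambda Xq: np.tanh(Xq @ W.T + b) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1]
model = sample_shallow_net(X, y, m=300, rng=rng)
Xte = rng.uniform(-1, 1, size=(500, 2))
print(np.abs(model(Xte) - np.sin(3 * Xte[:, 0]) * Xte[:, 1]).mean())  # test error
```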

Towards Label Position Bias in Graph Neural Networks
Haoyu Han Xiaorui Liu Feng Shi MohamadAli Torkamani Charu C. Aggarwal Jiliang Tang



Research question: This paper addresses biases of graph neural networks (GNNs) in semi-supervised node classification, in particular label position bias.
Motivation: Recent studies have revealed various biases in GNNs stemming from node features and graph topology; among them is a newly identified bias, label position bias, whereby nodes closer to labeled nodes tend to perform better.
Method: We propose a novel optimization framework for learning a label-position-unbiased graph structure that can be applied to existing GNNs, and introduce a new metric, the Label Proximity Score, to quantify this bias, finding it closely related to performance disparities.
Results: Experiments show that the proposed method not only outperforms baseline methods but also significantly mitigates label position bias in GNNs.

Graph Neural Networks (GNNs) have emerged as a powerful tool for semi-supervised node classification tasks. However, recent studies have revealed various biases in GNNs stemming from both node features and graph topology. In this work, we uncover a new bias - label position bias, which indicates that the node closer to the labeled nodes tends to perform better. We introduce a new metric, the Label Proximity Score, to quantify this bias, and find that it is closely related to performance disparities. To address the label position bias, we propose a novel optimization framework for learning a label position unbiased graph structure, which can be applied to existing GNNs. Extensive experiments demonstrate that our proposed method not only outperforms backbone methods but also significantly mitigates the issue of label position bias in GNNs.

Norm-based Generalization Bounds for Sparse Neural Networks
Tomer Galanti Mengjia Xu Liane Galanti Tomaso Poggio



Research question: This paper derives norm-based generalization bounds for sparse ReLU neural networks, including convolutional neural networks.
Motivation: Existing bounds typically consider the norms of the (Toeplitz) matrices associated with convolutional layers, ignoring the sparse structure of the network architecture and the norms of the convolutional filters.
Method: New norm-based generalization bounds are derived for sparse ReLU networks (including CNNs) by accounting for the sparse architecture and the norms of the convolutional filters.
Results: Theoretically, the new bounds are significantly tighter than standard norm-based bounds; empirically, they offer relatively tight estimates of generalization on various simple classification problems. This suggests that the sparsity of the target function and of the model architecture plays a key role in the success of deep learning.

In this paper, we derive norm-based generalization bounds for sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones because they consider the sparse structure of the neural network architecture and the norms of the convolutional filters, rather than the norms of the (Toeplitz) matrices associated with the convolutional layers. Theoretically, we demonstrate that these bounds are significantly tighter than standard norm-based generalization bounds. Empirically, they offer relatively tight estimations of generalization for various simple classification problems. Collectively, these findings suggest that the sparsity of the underlying target function and the model's architecture play a crucial role in the success of deep learning.

A General Framework for Robust G-Invariance in G-Equivariant Networks
Sophia Sanborn Nina Miolane



Research question: How to achieve robust group invariance in group-equivariant convolutional neural networks ($G$-CNNs).
Motivation: Commonly used invariant maps such as the max are incomplete: they remove both group and signal structure, whereas a complete invariant map removes only the variation due to group actions while preserving all information about the signal's structure.
Method: A general method, the $G$-triple-correlation ($G$-TC) layer, is proposed, leveraging the theory of the triple correlation on groups, the unique, lowest-degree polynomial invariant map that is also complete.
Results: Experiments show that the method endows $G$-CNNs with strong robustness, resisting invariance-based adversarial attacks and yielding measurable improvements in classification accuracy over standard Max $G$-Pooling in $G$-CNN architectures.

We introduce a general method for achieving robust group-invariance in group-equivariant convolutional neural networks ($G$-CNNs), which we call the $G$-triple-correlation ($G$-TC) layer. The approach leverages the theory of the triple-correlation on groups, which is the unique, lowest-degree polynomial invariant map that is also \textit{complete}. Many commonly used invariant maps\textemdash such as the \texttt{max}\textemdash are incomplete: they remove both group and signal structure. A complete invariant, by contrast, removes only the variation due to the actions of the group, while preserving all information about the structure of the signal. The completeness of the triple correlation endows the $G$-TC layer with strong robustness, which can be observed in its resistance to invariance-based adversarial attacks. In addition, we observe that it yields measurable improvements in classification accuracy over standard Max $G$-Pooling in $G$-CNN architectures. We provide a general and efficient implementation of the method for any discretized group, which requires only a table defining the group's product structure. We demonstrate the benefits of this method for $G$-CNNs defined on both commutative and non-commutative groups\textemdash $SO(2)$, $O(2)$, $SO(3)$, and $O(3)$ (discretized as the cyclic $C8$, dihedral $D16$, chiral octahedral $O$ and full octahedral $O_h$ groups)\textemdash acting on $\mathbb{R}^2$ and $\mathbb{R}^3$ on both $G$-MNIST and $G$-ModelNet10 datasets.
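For a discretized group, the triple correlation needs only the group's product structure; on the cyclic group $C_n$ it reduces to $T[s_1, s_2] = \sum_g f[g]\, f[g+s_1]\, f[g+s_2]$ with indices mod $n$. The sketch below verifies its shift invariance numerically; it illustrates the invariant itself, not the paper's $G$-TC layer implementation.

```python
import numpy as np

def triple_correlation_c_n(f):
    """Triple correlation of a signal on the cyclic group C_n:
    T[s1, s2] = sum_g f[g] * f[g + s1] * f[g + s2]  (indices mod n).
    It is invariant to cyclic shifts of f and, unlike max pooling,
    complete up to the group action."""
    n = len(f)
    g = np.arange(n)
    M = f[(g[:, None] + g[None, :]) % n]   # M[g, s] = f[g + s]
    return np.einsum('g,gi,gj->ij', f, M, M)

rng = np.random.default_rng(0)
f = rng.normal(size=8)
T1 = triple_correlation_c_n(f)
T2 = triple_correlation_c_n(np.roll(f, 3))   # group-shifted copy
print(np.allclose(T1, T2))                   # True: invariant to the shift
```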

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
Kaiyue Wen Yuchen Li Bingbin Liu Andrej Risteski



Research question: This paper takes a critical look, through theoretical results and carefully controlled experiments on synthetic data, at interpretability methods that focus exclusively on individual parts of a model.
Motivation: Transformer interpretability aims to understand the learned algorithm by examining parts of the model, such as weight matrices or attention patterns.
Method: Theoretically, the set of models solving a (bounded) Dyck language task is shown to satisfy a structural characterization derived from ideas in formal languages (the pumping lemma); this characterization shows the set of optima is qualitatively rich: in particular, the attention pattern of a single layer can be "nearly randomized" while preserving the network's functionality. Extensive experiments show these constructions are not merely theoretical artifacts: even under severe architectural constraints, vastly different solutions are reached via standard training.
Results: Consequently, interpretability claims based on inspecting individual heads or weight matrices in a Transformer can be misleading.

Transformer interpretability aims to understand the algorithm implemented by a learned Transformer by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts of the model, rather than consider the network as a whole. We consider a simple synthetic setup of learning a (bounded) Dyck language. Theoretically, we show that the set of models that (exactly or approximately) solve this task satisfy a structural characterization derived from ideas in formal languages (the pumping lemma). We use this characterization to show that the set of optima is qualitatively rich; in particular, the attention pattern of a single layer can be "nearly randomized", while preserving the functionality of the network. We also show via extensive experiments that these constructions are not merely a theoretical artifact: even with severe constraints to the architecture of the model, vastly different solutions can be reached via standard training. Thus, interpretability claims based on inspecting individual heads or weight matrices in the Transformer can be misleading.

A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks
Vignesh Kothapalli Tom Tirer Joan Bruna



Research question: This paper explores the interplay between graph topology and feature evolution in graph neural networks (GNNs), focusing on node-wise classification.
Motivation: Although GNNs are increasingly popular for classification on graph-structured data, the interplay between graph topology and feature evolution in GNNs is not well understood.
Method: Through empirical study and theoretical analysis, feature evolution in node classification, illustrated with community detection on stochastic block model graphs, is examined through the lens of the "Neural Collapse" (NC) phenomenon.
Results: A decrease in within-class feature variability is also prevalent in the node-wise setting, though not to the extent observed in the instance-wise case. Theoretically, even an "optimistic" mathematical model requires the graphs to obey a strict structural condition to possess a minimizer with exact collapse, and studying this model's gradient dynamics provides reasoning for the partial collapse observed empirically. Finally, the evolution of within- and between-class feature variability across layers of a well-trained GNN is contrasted with spectral methods.

Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data. Yet, the interplay between graph topology and feature evolution in GNNs is not well understood. In this paper, we focus on node-wise classification, illustrated with community detection on stochastic block model graphs, and explore the feature evolution through the lens of the "Neural Collapse" (NC) phenomenon. When training instance-wise deep classifiers (e.g. for image classification) beyond the zero training error point, NC demonstrates a reduction in the deepest features' within-class variability and an increased alignment of their class means to certain symmetric structures. We start with an empirical study that shows that a decrease in within-class variability is also prevalent in the node-wise classification setting, however, not to the extent observed in the instance-wise case. Then, we theoretically study this distinction. Specifically, we show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse. Furthermore, by studying the gradient dynamics of this model, we provide reasoning for the partial collapse observed empirically. Finally, we present a study on the evolution of within- and between-class feature variability across layers of a well-trained GNN and contrast the behavior with spectral methods.

A graphon-signal analysis of graph neural networks
Ron Levie



Research question: How to analyze message-passing graph neural networks (MPNNs), whose input space is non-Euclidean.
Motivation: Since the inputs to MPNNs are graphs of arbitrary size and topology, properties such as generalization are less well understood than for Euclidean neural networks; the author argues that a key missing ingredient in past work is a meaningful similarity measure on graph-signals that endows the input space with a regular structure.
Method: The author proposes such a similarity measure, the graphon-signal cut distance, under which all graph-signals form a dense subset of a compact metric space, the graphon-signal space; informally, two deterministic graph-signals are close in cut distance if they look like samples from the same random graph-signal model.
Results: MPNNs are shown to be Lipschitz continuous functions over the graphon-signal metric space, with two applications: a generalization bound for MPNNs, and the stability of MPNNs to subsampling of graph-signals. The results apply to any sufficiently regular MPNN and any distribution of graph-signals, making the analysis rather universal.

We present an approach for analyzing message passing graph neural networks (MPNNs) based on an extension of graphon analysis to a so-called graphon-signal analysis. A MPNN is a function that takes a graph and a signal on the graph (a graph-signal) and returns some value. Since the input space of MPNNs is non-Euclidean, i.e., graphs can be of any size and topology, properties such as generalization are less well understood for MPNNs than for Euclidean neural networks. We claim that one important missing ingredient in past work is a meaningful notion of graph-signal similarity measure, that endows the space of inputs to MPNNs with a regular structure. We present such a similarity measure, called the graphon-signal cut distance, which makes the space of all graph-signals a dense subset of a compact metric space -- the graphon-signal space. Informally, two deterministic graph-signals are close in cut-distance if they ``look like'' they were sampled from the same random graph-signal model. Hence, our cut distance is a natural notion of graph-signal similarity, which allows comparing any pair of graph-signals of any size and topology. We prove that MPNNs are Lipschitz continuous functions over the graphon-signal metric space. We then give two applications of this result: 1) a generalization bound for MPNNs, and, 2) the stability of MPNNs to subsampling of graph-signals. Our results apply to any regular enough MPNN on any distribution of graph-signals, making the analysis rather universal.

GEQ: Gaussian Kernel Inspired Equilibrium Models
Mingjie Li Yisen Wang Zhouchen Lin



Research question: Although optimization-induced deep equilibrium models (OptEqs) establish a connection between their outputs and underlying hidden optimization problems, their performance, and that of related works, still lags behind deep networks.
Motivation: A key factor behind this performance limitation is that these models use linear kernels to extract features.
Method: We propose replacing the linear kernel with a new function that directly captures nonlinear feature dependencies in the input data. Drawing inspiration from classical machine learning algorithms, we introduce the Gaussian kernel as the replacement and propose a new equilibrium model, GEQ.
Results: By leveraging Gaussian kernels, GEQ effectively extracts the nonlinear information embedded in the input features and surpasses the original OptEqs. GEQ can also be viewed as a weight-tied neural network with infinite width and depth, enjoys favorable theoretical properties and improved overall performance, and exhibits stronger stability across various samples, which we confirm with comprehensive experiments.

Despite the connection established by optimization-induced deep equilibrium models (OptEqs) between their output and the underlying hidden optimization problems, their performance, along with that of related works, is still not good enough, especially when compared to deep networks. One key factor responsible for this performance limitation is the use of linear kernels to extract features in these models. To address this issue, we propose a novel approach by replacing the linear kernel with a new function that can readily capture nonlinear feature dependencies in the input data. Drawing inspiration from classical machine learning algorithms, we introduce Gaussian kernels as the alternative function and then propose our new equilibrium model, which we refer to as GEQ. By leveraging Gaussian kernels, GEQ can effectively extract the nonlinear information embedded within the input features, surpassing the performance of the original OptEqs. Moreover, GEQ can be perceived as a weight-tied neural network with infinite width and depth. GEQ also enjoys better theoretical properties and improved overall performance. Additionally, our GEQ exhibits enhanced stability when confronted with various samples. We further substantiate the effectiveness and stability of GEQ through a series of comprehensive experiments.
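
To make the kernel swap concrete, here is a minimal sketch of an equilibrium layer whose input enters through a Gaussian (RBF) kernel map against a set of anchor points instead of a linear map; the anchors, the tanh update, and all shapes are illustrative assumptions, not the paper's exact formulation of GEQ:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(x, anchors, gamma=1.0):
    # Gaussian-kernel feature map: phi_i(x) = exp(-gamma * ||x - a_i||^2)
    return np.exp(-gamma * ((anchors - x) ** 2).sum(axis=1))

def equilibrium(x, anchors, W, U, b, n_iter=100, tol=1e-8):
    # Solve z* = tanh(W z* + U phi(x) + b) by naive fixed-point iteration
    phi = rbf_features(x, anchors)
    z = np.zeros(W.shape[0])
    for _ in range(n_iter):
        z_new = np.tanh(W @ z + U @ phi + b)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

d, h, m = 8, 16, 32                               # input dim, hidden dim, anchors
anchors = rng.normal(size=(m, d))
W = 0.2 * rng.normal(size=(h, h)) / np.sqrt(h)    # scaled so the update is contractive
U = rng.normal(size=(h, m)) / np.sqrt(m)
b = np.zeros(h)
z_star = equilibrium(rng.normal(size=d), anchors, W, U, b)
print(z_star.shape)                               # (16,)
```

The contractive scaling of W is what makes the naive iteration converge; practical deep equilibrium models typically use root-finding solvers and implicit differentiation instead.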

Implicit Convolutional Kernels for Steerable CNNs
Maksim Zhdanov Nico Hoffmann Gabriele Cesa



Research question: How to build neural networks equivariant to translations and to transformations of an origin-preserving group $G$, such as reflections and rotations.
Motivation: Current equivariant convolutional neural networks rely on standard convolutions with $G$-steerable kernels, but each kernel basis is tailored to a specific group $G$ and does not generalize to other symmetry transformations.
Method: We propose parameterizing $G$-steerable kernels with multi-layer perceptrons (MLPs), implementing Steerable CNNs via implicit neural representations; the construction generalizes to any group $G$ for which a $G$-equivariant MLP can be built.
Results: We demonstrate the effectiveness of the method on multiple tasks, including N-body simulations, point cloud classification, and molecular property prediction.

Steerable convolutional neural networks (CNNs) provide a general framework for building neural networks equivariant to translations and transformations of an origin-preserving group $G$, such as reflections and rotations. They rely on standard convolutions with $G$-steerable kernels obtained by analytically solving the group-specific equivariance constraint imposed onto the kernel space. As the solution is tailored to a particular group $G$, implementing a kernel basis does not generalize to other symmetry transformations, complicating the development of general group equivariant models. We propose using implicit neural representation via multi-layer perceptrons (MLPs) to parameterize $G$-steerable kernels. The resulting framework offers a simple and flexible way to implement Steerable CNNs and generalizes to any group $G$ for which a $G$-equivariant MLP can be built. We prove the effectiveness of our method on multiple tasks, including N-body simulations, point cloud classification and molecular property prediction.
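
A minimal sketch of the core mechanism, an MLP that maps relative offsets to kernel values, is below; for brevity it omits the $G$-equivariance constraint on the MLP, which is the paper's actual contribution, so the class name, sizes, and architecture are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitKernelConv(nn.Module):
    # Convolution whose kernel is generated by an MLP over relative offsets
    # (the equivariance constraint on the MLP is omitted; illustrative only).
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, k
        self.mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                                 nn.Linear(64, c_in * c_out))
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, k),
                                torch.linspace(-1, 1, k), indexing="ij")
        # one (x, y) offset per kernel tap
        self.register_buffer("coords", torch.stack([xs, ys], -1).reshape(-1, 2))

    def forward(self, x):
        w = self.mlp(self.coords)                          # (k*k, c_in*c_out)
        w = w.reshape(self.k, self.k, self.c_in, self.c_out).permute(3, 2, 0, 1)
        return F.conv2d(x, w, padding=self.k // 2)

conv = ImplicitKernelConv(3, 8)
print(conv(torch.randn(1, 3, 32, 32)).shape)               # torch.Size([1, 8, 32, 32])
```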

Sequential Memory with Temporal Predictive Coding
Mufeng Tang Helen Barron Rafal Bogacz



Research question: This paper addresses the computational mechanism underlying sequential memory in the brain.
Motivation: Inspired by neuroscience theories and the success of applying predictive coding to static memory tasks, a new predictive-coding-based model of sequential memory is proposed.
Method: A new model called temporal predictive coding (tPC) is proposed; analytical study shows that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which yields more stable performance on sequential memory tasks with structured inputs.
Results: Experiments show that the tPC model memorizes and retrieves sequential inputs accurately, and its behavior is consistent with behavioral observations and theories in neuroscience, strengthening its biological relevance.

Forming accurate memory of sequential stimuli is a fundamental function of biological agents. However, the computational mechanism underlying sequential memory in the brain remains unclear. Inspired by neuroscience theories and recent successes in applying predictive coding (PC) to \emph{static} memory tasks, in this work we propose a novel PC-based model for \emph{sequential} memory, called \emph{temporal predictive coding} (tPC). We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation. Importantly, our analytical study reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which leads to more stable performance in sequential memory tasks of structured inputs. Moreover, we find that tPC exhibits properties consistent with behavioral observations and theories in neuroscience, thereby strengthening its biological relevance. Our work establishes a possible computational mechanism underlying sequential memory in the brain that can also be theoretically interpreted using existing memory model frameworks.
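
A toy version of the prediction-error principle behind such models, assuming a one-layer predictor trained with a local, error-driven rule; the paper's tPC has explicit error neurons and a biologically plausible implementation, so this sketch is illustrative only:

```python
import numpy as np

def train_tpc(seq, lr=0.05, epochs=200):
    # Learn W so that W tanh(x_{t-1}) predicts x_t, driven by the
    # prediction error eps_t = x_t - W tanh(x_{t-1}) (a local update).
    d = seq.shape[1]
    W = np.zeros((d, d))
    for _ in range(epochs):
        for t in range(1, len(seq)):
            pre = np.tanh(seq[t - 1])
            eps = seq[t] - W @ pre
            W += lr * np.outer(eps, pre)
    return W

rng = np.random.default_rng(0)
seq = np.sign(rng.normal(size=(5, 32)))   # a short sequence of +/-1 patterns
W = train_tpc(seq)
recall = np.sign(W @ np.tanh(seq[0]))     # cue with the first pattern
print((recall == seq[1]).mean())          # fraction of bits of x_1 recovered
```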

FourierGNN: Rethinking Multivariate Time Series Forecasting from a Pure Graph Perspective
Kun Yi Qi Zhang Wei Fan Hui He Liang Hu Pengyang Wang Ning An Longbing Cao Zhendong Niu



Research question: Multivariate time series (MTS) forecasting matters in many industries, but current state-of-the-art GNN-based forecasting methods require both a graph network (e.g., GCN) and a temporal network (e.g., LSTM) to capture inter-series (spatial) dynamics and intra-series (temporal) dependencies, which puts an extra burden on handcrafted model design.
Motivation: Separating spatial and temporal modeling naturally violates the unified spatiotemporal interdependencies of the real world and thus largely hinders forecasting performance. To overcome this, we explore the direction of directly applying graph networks and rethink MTS forecasting from a pure graph perspective.
Method: We first define a novel data structure, the hypervariate graph, which regards each series value (regardless of variate or timestamp) as a graph node and represents sliding windows as space-time fully-connected graphs. This perspective unifies spatiotemporal dynamics and reformulates classic MTS forecasting as prediction on hypervariate graphs. We then propose a novel architecture, the Fourier Graph Neural Network (FourierGNN), which stacks our proposed Fourier Graph Operator (FGO) to perform matrix multiplications in Fourier space.
Results: FourierGNN has adequate expressiveness at much lower complexity and can carry out forecasting effectively. Our theoretical analysis also shows that the FGO is equivalent to graph convolutions in the time domain, further validating FourierGNN. Extensive experiments on seven datasets demonstrate higher efficiency and fewer parameters than state-of-the-art methods.

Multivariate time series (MTS) forecasting has shown great importance in numerous industries. Current state-of-the-art graph neural network (GNN)-based forecasting methods usually require both graph networks (e.g., GCN) and temporal networks (e.g., LSTM) to capture inter-series (spatial) dynamics and intra-series (temporal) dependencies, respectively. However, the uncertain compatibility of the two networks puts an extra burden on handcrafted model designs. Moreover, the separate spatial and temporal modeling naturally violates the unified spatiotemporal inter-dependencies in the real world, which largely hinders the forecasting performance. To overcome these problems, we explore an interesting direction of directly applying graph networks and rethink MTS forecasting from a pure graph perspective. We first define a novel data structure, hypervariate graph, which regards each series value (regardless of variates or timestamps) as a graph node, and represents sliding windows as space-time fully-connected graphs. This perspective considers spatiotemporal dynamics unitedly and reformulates classic MTS forecasting into the predictions on hypervariate graphs. Then, we propose a novel architecture Fourier Graph Neural Network (FourierGNN) by stacking our proposed Fourier Graph Operator (FGO) to perform matrix multiplications in Fourier space. FourierGNN accommodates adequate expressiveness and achieves much lower complexity, which can effectively and efficiently accomplish the forecasting. Besides, our theoretical analysis reveals FGO's equivalence to graph convolutions in the time domain, which further verifies the validity of FourierGNN. Extensive experiments on seven datasets have demonstrated our superior performance with higher efficiency and fewer parameters compared with state-of-the-art methods. Code is available at this repository: https://github.com/aikunyi/FourierGNN.
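
The following sketch shows the general shape of a Fourier-space graph operator: transform along the node axis of the hypervariate representation, multiply by learnable complex weights, transform back. Shapes, initialization, and the tanh nonlinearity are illustrative assumptions rather than the exact FGO:

```python
import torch

def fourier_graph_operator(x, W_real, W_imag):
    # Multiply in Fourier space over the node axis; by the convolution
    # theorem this corresponds to a graph convolution in the original domain.
    Xf = torch.fft.fft(x, dim=0)
    Yf = Xf @ torch.complex(W_real, W_imag)   # learnable complex d x d weights
    return torch.fft.ifft(Yf, dim=0).real

n, d = 24, 8                                  # nodes (variates x timestamps), features
out = torch.randn(n, d)
for _ in range(3):                            # a stack of FGO layers
    out = torch.tanh(fourier_graph_operator(out, 0.1 * torch.randn(d, d),
                                            0.1 * torch.randn(d, d)))
print(out.shape)                              # torch.Size([24, 8])
```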

Low Tensor Rank Learning of Neural Dynamics
Arthur Pellegrino N Alex Cayco Gajic Angus Chadwick



Research question: How to understand the collective evolution of synaptic connectivity over learning, particularly in recurrent neural networks.
Motivation: Recent work has shown that the weight matrices of task-trained recurrent neural networks (RNNs) are typically low rank, but how this low-rank structure unfolds over learning is unknown.
Method: We investigate the rank of the 3-tensor formed by the weight matrices throughout learning, fitting RNNs of varying rank to large-scale neural recordings from a motor learning task.
Results: We find that the inferred weights are low-tensor-rank and therefore evolve within a fixed low-dimensional subspace throughout the entire course of learning. We also validate the observation of low-tensor-rank learning on RNNs trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics, showing that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Overall, our findings provide insight into the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.

Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide insight on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.

Three Iterations of (d − 1)-WL Test Distinguish Non Isometric Clouds of d-dimensional Points
Valentino delle Rose Alexander Kozachinskiy Cristobal Rojas Mircea Petrache Pablo Barcelo



Research question: The Weisfeiler-Lehman (WL) test is a fundamental iterative algorithm for checking graph isomorphism; this work investigates when the test is complete for point clouds in Euclidean space.
Motivation: Motivated by recent developments in machine learning on datasets of three-dimensional objects, we study when the WL test is complete for Euclidean point clouds represented by complete distance graphs, i.e., when it can distinguish, up to isometry, any such clouds.
Method: The WL test is run on Euclidean point clouds represented as complete distance graphs, iterating the test to determine whether it can distinguish arbitrary clouds.
Results: For any dimension $d \ge 2$, the $(d-1)$-dimensional WL test is complete for point clouds in $d$-dimensional Euclidean space, and only three iterations of the test suffice.

The Weisfeiler-Lehman (WL) test is a fundamental iterative algorithm for checking the isomorphism of graphs. It has also been observed that it underlies the design of several graph neural network architectures, whose capabilities and performance can be understood in terms of the expressive power of this test. Motivated by recent developments in machine learning applications to datasets involving three-dimensional objects, we study when the WL test is {\em complete} for clouds of Euclidean points represented by complete distance graphs, i.e., when it can distinguish, up to isometry, any arbitrary such cloud. Our main result states that the $(d-1)$-dimensional WL test is complete for point clouds in $d$-dimensional Euclidean space, for any $d\ge 2$, and only three iterations of the test suffice. Our result is tight for $d = 2, 3$. We also observe that the $d$-dimensional WL test only requires one iteration to achieve completeness.
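
For intuition, here is a sketch of 1-WL color refinement on the complete distance graph of a point cloud; the theorem concerns the higher-order $(d-1)$-dimensional test, and 1-WL is shown only because it fits in a few lines:

```python
import numpy as np

def wl_colors(points, n_iter=3, decimals=8):
    # 1-WL refinement on the complete distance graph of a point cloud.
    n = len(points)
    D = np.round(np.linalg.norm(points[:, None] - points[None, :], axis=-1), decimals)
    colors = [0] * n
    for _ in range(n_iter):
        # new color = own color + multiset of (neighbor color, edge length)
        sigs = [(colors[i], tuple(sorted((colors[j], D[i, j])
                                         for j in range(n) if j != i)))
                for i in range(n)]
        table = {}
        colors = [table.setdefault(s, len(table)) for s in sigs]
    return sorted(colors)

# two clouds are distinguished if their refined color multisets differ
A = np.array([[0., 0.], [1., 0.], [0., 1.]])
B = np.array([[0., 0.], [2., 0.], [0., 1.]])
print(wl_colors(A) != wl_colors(B))  # True
```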

Frequency-domain MLPs are More Effective Learners in Time Series Forecasting
Kun Yi Qi Zhang Wei Fan Shoujin Wang Pengyang Wang Hui He Ning An Defu Lian Longbing Cao Zhendong Niu



Research question: This paper addresses problems in time series forecasting, such as the complexity of RNN-, GNN-, or Transformer-based architectures and the information bottleneck of MLP-based methods.
Motivation: While the existing literature has designed many sophisticated architectures based on RNNs, GNNs, or Transformers, approaches based on multi-layer perceptrons (MLPs) have been proposed for their simple structure, low complexity, and strong performance. Most MLP-based forecasting methods, however, suffer from point-wise mappings and an information bottleneck, which largely hinders forecasting performance.
Method: To overcome this, we explore a new direction: applying MLPs in the frequency domain for time series forecasting. Studying the learned patterns of frequency-domain MLPs reveals two inherent properties that benefit forecasting: (i) a global view, since the frequency spectrum gives MLPs a complete view of the signal and makes global dependencies easier to learn; and (ii) energy compaction, since frequency-domain MLPs concentrate on the key frequency components carrying compact signal energy. We then propose FreTS, a simple yet effective architecture built on frequency-domain MLPs for time series forecasting.
Results: Extensive experiments on 13 real-world benchmarks (including 7 short-term and 6 long-term forecasting benchmarks) show that our model consistently outperforms state-of-the-art methods. Code is available at this repository: https://github.com/aikunyi/FreTS.

Time series forecasting has played a key role in many industries, including finance, traffic, energy, and healthcare. While the existing literature has designed many sophisticated architectures based on RNNs, GNNs, or Transformers, another kind of approach based on multi-layer perceptrons (MLPs) has been proposed with simple structure, low complexity, and superior performance. However, most MLP-based forecasting methods suffer from the point-wise mappings and information bottleneck, which largely hinders the forecasting performance. To overcome this problem, we explore a novel direction of applying MLPs in the frequency domain for time series forecasting. We investigate the learned patterns of frequency-domain MLPs and discover two inherent characteristics benefiting forecasting: (i) global view: the frequency spectrum gives MLPs a complete view of the signal, making global dependencies easier to learn; and (ii) energy compaction: frequency-domain MLPs concentrate on the smaller, key part of frequency components with compact signal energy. We then propose FreTS, a simple yet effective architecture built upon frequency-domain MLPs for time series forecasting. FreTS mainly involves two stages: (i) domain conversion, which transforms time-domain signals into complex numbers in the frequency domain; and (ii) frequency learning, which performs our redesigned MLPs to learn the real and imaginary parts of frequency components. The above stages, operated on both inter-series and intra-series scales, further contribute to channel-wise and time-wise dependency learning. Extensive experiments on 13 real-world benchmarks (including 7 benchmarks for short-term forecasting and 6 benchmarks for long-term forecasting) demonstrate our consistent superiority over state-of-the-art methods. Code is available at this repository: https://github.com/aikunyi/FreTS.
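
A minimal frequency-domain MLP in the spirit of the two-stage recipe (domain conversion, then learning on the real and imaginary parts); the shared two-layer network and all sizes are illustrative assumptions, not the FreTS architecture:

```python
import torch
import torch.nn as nn

class FreqMLP(nn.Module):
    # rFFT -> a shared MLP on the real and imaginary parts -> irFFT back.
    def __init__(self, n_freq, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_freq))

    def forward(self, x):                                  # x: (batch, length)
        Xf = torch.fft.rfft(x, dim=-1)                     # domain conversion
        Yf = torch.complex(self.net(Xf.real), self.net(Xf.imag))  # frequency learning
        return torch.fft.irfft(Yf, n=x.shape[-1], dim=-1)

L = 96
model = FreqMLP(n_freq=L // 2 + 1)
print(model(torch.randn(4, L)).shape)                      # torch.Size([4, 96])
```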

Stochastic Optimal Control for Collective Variable Free Sampling of Molecular Transition Paths
Lars Holdijk Yuanqi Du Ferry Hooft Priyank Jaini Bernd Ensing Max Welling



Research question: This paper addresses sampling transition paths between two given metastable states of a molecular system, e.g., a folded and unfolded protein or the products and reactants of a chemical reaction.
Motivation: Because high energy barriers separate these states, transition paths are unlikely to be sampled by standard molecular dynamics simulation. Traditional methods increase the transition probability with a bias potential built on collective variables (CVs), but choosing appropriate CVs requires chemical intuition, so traditional methods are not always applicable to larger systems.
Method: We propose PIPS, a machine learning method that does not rely on CVs. We show a formal relation between this sampling problem, the Schrödinger bridge problem, and stochastic optimal control with neural network policies.
Results: Unlike previous non-machine-learning approaches, our method successfully generates low-energy transitions for alanine dipeptide as well as the larger polyproline and chignolin proteins.

We consider the problem of sampling transition paths between two given metastable states of a molecular system, e.g., a folded and unfolded protein or products and reactants of a chemical reaction. Due to the existence of high energy barriers separating the states, these transition paths are unlikely to be sampled with standard Molecular Dynamics (MD) simulation. Traditional methods to augment MD with a bias potential to increase the probability of the transition rely on a dimensionality reduction step based on Collective Variables (CVs). Unfortunately, selecting appropriate CVs requires chemical intuition and traditional methods are therefore not always applicable to larger systems. Additionally, when incorrect CVs are used, the bias potential might not be minimal and bias the system along dimensions irrelevant to the transition. Showing a formal relation between the problem of sampling molecular transition paths, the Schrodinger bridge problem and stochastic optimal control with neural network policies, we propose a machine learning method for sampling said transitions. Unlike previous non-machine learning approaches our method, named PIPS, does not depend on CVs. We show that our method successfully generates low energy transitions for Alanine Dipeptide as well as the larger Polyproline and Chignolin proteins.

A Fractional Graph Laplacian Approach to Oversmoothing
Sohir Maskey Raffaele Paolino Aras Bacho Gitta Kutyniok



Research question: Graph neural networks perform poorly at capturing long-range dependencies in graphs because of oversmoothing.
Motivation: To address the oversmoothing of existing graph neural networks, extending the analysis from undirected to directed graphs.
Method: The notion of Dirichlet energy is extended via a directed symmetrically normalized Laplacian, and fractional graph Laplacian neural ODEs are introduced to describe non-local dynamics and propagate information.
Results: Experiments show the method effectively propagates information between distant nodes while keeping the probability of long-distance jumps low, and it is flexible across various directed and undirected real-world graphs.

Graph neural networks (GNNs) have shown state-of-the-art performances in various applications. However, GNNs often struggle to capture long-range dependencies in graphs due to oversmoothing. In this paper, we generalize the concept of oversmoothing from undirected to directed graphs. To this aim, we extend the notion of Dirichlet energy by considering a directed symmetrically normalized Laplacian. As vanilla graph convolutional networks are prone to oversmooth, we adopt a neural graph ODE framework. Specifically, we propose fractional graph Laplacian neural ODEs, which describe non-local dynamics. We prove that our approach allows propagating information between distant nodes while maintaining a low probability of long-distance jumps. Moreover, we show that our method is more flexible with respect to the convergence of the graph’s Dirichlet energy, thereby mitigating oversmoothing. We conduct extensive experiments on synthetic and real-world graphs, both directed and undirected, demonstrating our method’s versatility across diverse graph homophily levels. Our code is available at https://github.com/RPaolino/fLode
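
A small sketch of the mechanism on an undirected graph: a fractional power of the normalized Laplacian is dense, so one explicit-Euler step of the corresponding ODE already mixes information between non-adjacent nodes. The paper's directed construction and learned dynamics are not reproduced; alpha and the step size below are arbitrary:

```python
import numpy as np

def fractional_laplacian_step(X, A, alpha=0.5, dt=0.1):
    # One explicit-Euler step of dX/dt = -L^alpha X; L^alpha is computed
    # spectrally and is dense, so the update couples non-adjacent nodes.
    deg = A.sum(1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(A)) - Dinv @ A @ Dinv       # symmetric normalized Laplacian
    w, V = np.linalg.eigh(L)
    L_alpha = (V * np.clip(w, 0.0, None) ** alpha) @ V.T
    return X - dt * L_alpha @ X

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)      # a 4-node path graph
X = np.eye(4)[:, :2]                           # two feature channels
print(fractional_laplacian_step(X, A).round(3))
```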

Structure of universal formulas
Dmitry Yarotsky



Research question: This paper analyzes the essential structural elements of highly expressive models (universal formulas) and the relation between global approximability and the weaker property of infinite VC dimension.
Motivation: Universal formulas are parameterized analytic expressions of fixed complexity that can nevertheless approximate any continuous function on a compact set; understanding which structural elements make this possible clarifies the limits of such models, including ones in the form of neural networks.
Method: A hierarchy of expressiveness classes connecting global approximability to infinite VC dimension is introduced, and a series of classification results is proven for several increasingly complex functional families, including a general family of polynomially-exponentially-algebraic functions shown to be subject to polynomial constraints.
Results: Fixed-size neural networks with at most one layer of neurons having transcendental activations (e.g., sine or the standard sigmoid) cannot in general approximate functions on arbitrary finite sets; conversely, there are functional families, including two-hidden-layer networks, that approximate functions on arbitrary finite sets but fail to do so on the whole domain of definition.

By universal formulas we understand parameterized analytic expressions that have a fixed complexity, but nevertheless can approximate any continuous function on a compact set. There exist various examples of such formulas, including some in the form of neural networks. In this paper we analyze the essential structural elements of these highly expressive models. We introduce a hierarchy of expressiveness classes connecting the global approximability property to the weaker property of infinite VC dimension, and prove a series of classification results for several increasingly complex functional families. In particular, we introduce a general family of polynomially-exponentially-algebraic functions that, as we prove, is subject to polynomial constraints. As a consequence, we show that fixed-size neural networks with not more than one layer of neurons having transcendental activations (e.g., sine or standard sigmoid) cannot in general approximate functions on arbitrary finite sets. On the other hand, we give examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets, but fail to do that on the whole domain of definition.

Noether Embedding: Efficient Learning of Temporal Regularities
Chi Gao Zidong Zhou Luping Shi



Research question: How to effectively detect and encode temporal regularities (TRs) in events.
Motivation: Existing event embedding methods cannot effectively decode TR validity and do not meet the efficiency requirements.
Method: We develop Noether Embedding (NE), the first efficient TR learner based on event embeddings. NE possesses intrinsic time-translation symmetries, which reduce computing each TR's validity to embedding each event sample, enabling data-efficient TR formation and time-efficient TR retrieval in constant time complexity.
Results: Evaluated on the ICEWS14, ICEWS18, and GDELT datasets, NE achieves roughly double the F1 scores of classic embeddings for detecting valid TRs and provides over ten times higher confidence scores when querying TR intervals. NE also shows potential in social event prediction, personal decision-making, and memory-constrained scenarios.

Learning to detect and encode temporal regularities (TRs) in events is a prerequisite for human-like intelligence. These regularities should be formed from limited event samples and stored as easily retrievable representations. Existing event embeddings, however, cannot effectively decode TR validity with well-trained vectors, let alone satisfy the efficiency requirements. We develop Noether Embedding (NE) as the first efficient TR learner with event embeddings. Specifically, NE possesses the intrinsic time-translation symmetries of TRs indicated as conserved local energies in the embedding space. This structural bias reduces the calculation of each TR validity to embedding each event sample, enabling NE to achieve data-efficient TR formation insensitive to sample size and time-efficient TR retrieval in constant time complexity. To comprehensively evaluate the TR learning capability of embedding models, we define complementary tasks of TR detection and TR query, formulate their evaluation metrics, and assess embeddings on classic ICEWS14, ICEWS18, and GDELT datasets. Our experiments demonstrate that NE consistently achieves about double the F1 scores for detecting valid TRs compared to classic embeddings, and it provides over ten times higher confidence scores for querying TR intervals. Additionally, we showcase NE's potential applications in social event prediction, personal decision-making, and memory-constrained scenarios.

An Optimization-based Approach To Node Role Discovery in Networks: Approximating Equitable Partitions
Michael Scholkemper Michael T Schaub



Research question: How to partition the nodes of a complex network according to their structural roles, so as to identify the network's fundamental building blocks.
Motivation: Analogous to community detection, partitioning nodes by role can yield simplified descriptions of the network connectivity and provide a basis for various network analysis and graph mining tasks.
Method: Grounded in ideas from graph-isomorphism testing, the Weisfeiler-Leman algorithm, and equitable partitions, a definition of node roles and two associated optimization problems (cost functions) are proposed.
Results: The approach is validated via a novel "role-infused partition benchmark", a network model that generates networks by stochastically endowing nodes with different roles.

Similar to community detection, partitioning the nodes of a complex network according to their structural roles aims to identify fundamental building blocks of a network, which can be used, e.g., to find simplified descriptions of the network connectivity, to derive reduced order models for dynamical processes unfolding on networks, or as ingredients for various network analysis and graph mining tasks. In this work, we offer a fresh look on the problem of role extraction and its differences to community detection and present a definition of node roles and two associated optimization problems (cost functions) grounded in ideas related to graph-isomorphism tests, the Weisfeiler-Leman algorithm and equitable partitions. We present theoretical guarantees and validate our approach via a novel “role-infused partition benchmark”, a network model from which we can sample networks in which nodes are endowed with different roles in a stochastic way.

The Tunnel Effect: Building Data Representations in Deep Neural Networks
Wojciech Masarczyk Mateusz Ostaszewski Ehsan Imani Razvan Pascanu Piotr Miłoś Tomasz Trzcinski



Research question: This paper explores how deep neural networks build data representations in supervised image classification.
Motivation: Although deep neural networks excel at various tasks, the mechanism behind their internal data representations remains unclear.
Method: Through empirical study, the authors find that sufficiently deep networks split into two distinct parts: the initial layers create linearly separable representations, while the subsequent layers, referred to as "the tunnel", compress these representations and have minimal impact on overall performance.
Results: The tunnel emerges early in training and its depth depends on the relation between network capacity and task complexity; its presence degrades out-of-distribution generalization, and its implications for continual learning are discussed.

Deep neural networks are widely known for their remarkable effectiveness across various tasks, with the consensus that deeper networks implicitly learn more complex data representations. This paper shows that sufficiently deep networks trained for supervised image classification split into two distinct parts that contribute to the resulting data representations differently. The initial layers create linearly-separable representations, while the subsequent layers, which we refer to as \textit{the tunnel}, compress these representations and have a minimal impact on the overall performance. We explore the tunnel's behavior through comprehensive empirical studies, highlighting that it emerges early in the training process. Its depth depends on the relation between the network's capacity and task complexity. Furthermore, we show that the tunnel degrades out-of-distribution generalization and discuss its implications for continual learning.

Transformer as a hippocampal memory consolidation model based on NMDAR-inspired nonlinearity
Dong-Kyum Kim Jea Kwon Meeyoung Cha C. Justin Lee



Research question: This paper explores how a new nonlinear activation function that mimics NMDAR dynamics can enhance long-term memory in Transformer models.
Motivation: Recent findings compare deep learning models to the hippocampus, particularly in how learning and memory are processed; inspired by this, we propose a novel nonlinear activation function mimicking NMDAR dynamics.
Method: We design a navigation task that assesses the two memory functions (short-term working memory and long-term reference memory) and manipulate the activation function (i.e., mimic the Mg$^{2+}$-gating of NMDAR) to interfere with the long-term memory process.
Results: Experiments suggest that place-cell-like functions and reference memory reside in the feed-forward network layers of the transformer and that nonlinearity drives these processes. Our findings reveal the role of NMDAR-like nonlinearity in the striking resemblance between transformer architecture and hippocampal spatial representation.

The hippocampus plays a critical role in learning, memory, and spatial representation, processes that depend on the NMDA receptor (NMDAR). Inspired by recent findings that compare deep learning models to the hippocampus, we propose a new nonlinear activation function that mimics NMDAR dynamics. NMDAR-like nonlinearity has a beneficial role in shifting short-term working memory into long-term reference memory in transformers, thus enhancing a process that is similar to memory consolidation in the mammalian brain. We design a navigation task assessing these two memory functions and show that manipulating the activation function (i.e., mimicking the Mg$^{2+}$-gating of NMDAR) disrupts long-term memory processes. Our experiments suggest that place cell-like functions and reference memory reside in the feed-forward network layer of transformers and that nonlinearity drives these processes. We discuss the role of NMDAR-like nonlinearity in establishing this striking resemblance between transformer architecture and hippocampal spatial representation.
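
As a rough illustration of what an NMDAR-style nonlinearity can look like, the sketch below uses a sigmoid "Mg$^{2+}$ gate" whose steepness is controlled by a parameter alpha; this specific functional form is an assumption made for illustration, not the activation defined in the paper:

```python
import torch

def nmdar_like(x, alpha=1.0):
    # Sigmoid "Mg2+ gate" that opens with depolarization; alpha sets steepness.
    # Assumed illustrative form, not the paper's exact activation.
    return x * torch.sigmoid(alpha * x)

x = torch.linspace(-4.0, 4.0, 9)
for a in (0.5, 1.0, 4.0):          # larger alpha -> sharper gating, closer to ReLU
    print(a, nmdar_like(x, a))
```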

Taming Local Effects in Graph-based Spatiotemporal Forecasting
Andrea Cini Ivan Marisca Daniele Zambon Cesare Alippi



Research question: This paper aims to understand the interplay between globality and locality in graph-based spatiotemporal forecasting, while proposing a methodological framework to rationalize the practice of including trainable node embeddings in such architectures.
Motivation: Although graph-based architectures beat fitting a set of local models in computational and data efficiency, relying on a single global model becomes a limitation whenever some of the time series are generated by different spatiotemporal stochastic processes.
Method: We ascribe to trainable node embeddings the role of amortizing the learning of specialized components. Embeddings further allow 1) effectively combining the advantages of shared message-passing layers with node-specific parameters, and 2) efficiently transferring the learned model to new node sets.
Results: Supported by empirical evidence, we provide insights and guidelines for specializing graph-based models to the dynamics of each time series, and show how this aspect plays a crucial role in obtaining accurate predictions.

Spatiotemporal graph neural networks have been shown to be effective in time series forecasting applications, achieving better performance than standard univariate predictors in several settings. These architectures take advantage of a graph structure and relational inductive biases to learn a single (global) inductive model to predict any number of the input time series, each associated with a graph node. Despite the gain achieved in computational and data efficiency w.r.t. fitting a set of local models, relying on a single global model can be a limitation whenever some of the time series are generated by a different spatiotemporal stochastic process. The main objective of this paper is to understand the interplay between globality and locality in graph-based spatiotemporal forecasting, while contextually proposing a methodological framework to rationalize the practice of including trainable node embeddings in such architectures. We ascribe to trainable node embeddings the role of amortizing the learning of specialized components. Moreover, embeddings allow for 1) effectively combining the advantages of shared message-passing layers with node-specific parameters and 2) efficiently transferring the learned model to new node sets. Supported by strong empirical evidence, we provide insights and guidelines for specializing graph-based models to the dynamics of each time series and show how this aspect plays a crucial role in obtaining accurate predictions.

On permutation symmetries in Bayesian neural network posteriors: a variational perspective
Simone Rossi Ankit Singh Thomas Hannagan



Research question: The elusive nature of gradient-based optimization in neural networks is tied to their loss landscape geometry, which remains poorly understood.
Motivation: Recent work shows that, once accounting for weight permutations that leave the network's computation unchanged, there is essentially no loss barrier between the local solutions of gradient descent. This raises questions for approximate inference in Bayesian neural networks (BNNs), which is our interest here.
Method: We first extend the formalism of marginalized loss barriers and solution interpolation to BNNs, then propose a matching algorithm to search for linearly connected solutions, achieved by aligning the distributions of two independent approximate Bayesian solutions with respect to permutation matrices.
Results: Experiments across a variety of architectures and datasets find nearly zero marginalized loss barriers for linearly connected solutions.

The elusive nature of gradient-based optimization in neural networks is tied to their loss landscape geometry, which is poorly understood. However recent work has brought solid evidence that there is essentially no loss barrier between the local solutions of gradient descent, once accounting for weight-permutations that leave the network's computation unchanged. This raises questions for approximate inference in Bayesian neural networks (BNNs), where we are interested in marginalizing over multiple points in the loss landscape. In this work, we first extend the formalism of marginalized loss barrier and solution interpolation to BNNs, before proposing a matching algorithm to search for linearly connected solutions. This is achieved by aligning the distributions of two independent approximate Bayesian solutions with respect to permutation matrices. Building on the work of Ainsworth et al. (2023), we frame the problem as a combinatorial optimization one, using an approximation to the sum of bilinear assignment problem. We then experiment on a variety of architectures and datasets, finding nearly zero marginalized loss barriers for linearly connected solutions.

Theoretical Analysis of the Inductive Biases in Deep Convolutional Networks
Zihao Wang Lei Wu



Research question: This paper provides a theoretical analysis of the inductive biases of convolutional neural networks (CNNs).
Motivation: To examine the universality of CNNs, i.e., their ability to approximate any continuous function, and to understand the roles of depth, weight sharing, and locality in learning.
Method: Via a novel combination of multichanneling and downsampling when increasing network depth, the paper proves that depth $\mathcal{O}(\log d)$ suffices for deep CNNs to achieve universality, where $d$ is the input dimension, and that learning sparse functions requires only $\widetilde{\mathcal{O}}(\log^2 d)$ samples. The distinct roles of weight sharing and locality are analyzed by comparing CNNs, locally connected networks (LCNs), and fully connected networks (FCNs) on a simple regression task.
Results: Deep CNNs can efficiently capture long-range sparse correlations with only $\widetilde{\mathcal{O}}(\log^2 d)$ samples, whereas LCNs require $\Omega(d)$ samples and FCNs $\Omega(d^2)$ samples; weight sharing and locality break different symmetries in the learning process, and these provable separations quantify the difference between the two biases.

In this paper, we provide a theoretical analysis of the inductive biases in convolutional neural networks (CNNs). We start by examining the universality of CNNs, i.e., the ability to approximate any continuous functions. We prove that a depth of $\mathcal{O}(\log d)$ suffices for deep CNNs to achieve this universality, where $d$ is the input dimension. Additionally, we establish that learning sparse functions with CNNs requires only $\widetilde{\mathcal{O}}(\log^2d)$ samples, indicating that deep CNNs can efficiently capture {\em long-range} sparse correlations. These results are made possible through a novel combination of the multichanneling and downsampling when increasing the network depth. We also delve into the distinct roles of weight sharing and locality in CNNs. To this end, we compare the performance of CNNs, locally-connected networks (LCNs), and fully-connected networks (FCNs) on a simple regression task, where LCNs can be viewed as CNNs without weight sharing. On the one hand, we prove that LCNs require ${\Omega}(d)$ samples while CNNs need only $\widetilde{\mathcal{O}}(\log^2d)$ samples, highlighting the critical role of weight sharing. On the other hand, we prove that FCNs require $\Omega(d^2)$ samples, whereas LCNs need only $\widetilde{\mathcal{O}}(d)$ samples, underscoring the importance of locality. These provable separations quantify the difference between the two biases, and the major observation behind our proof is that weight sharing and locality break different symmetries in the learning process.

Polynomially Over-Parameterized Convolutional Neural Networks Contain Structured Strong Winning Lottery Tickets
Arthur da Cunha Francesco D'Amore Emanuele Natale



Research question: This paper studies the Strong Lottery Ticket Hypothesis (SLTH), i.e., that randomly initialized neural networks likely contain subnetworks that perform well without any training, and explores structured pruning within the SLTH.
Motivation: While unstructured pruning has been studied extensively in this context, its structured counterpart, which can deliver significant computational and memory efficiency gains, has remained largely unexplored, mainly due to limitations of the mathematical tools used in formal analyses of the SLTH.
Method: Leveraging recent advances in the multidimensional generalization of the Random Subset-Sum Problem, the paper obtains a variant that admits the stochastic dependencies arising in structured pruning, and applies this result to prove, for a wide class of random convolutional neural networks, the existence of structured subnetworks that can approximate any sufficiently smaller network.
Results: This provides the first sub-exponential bound around the SLTH for structured pruning, opening new avenues for research on the hypothesis and deepening the understanding of the role of over-parameterization in deep learning.

The Strong Lottery Ticket Hypothesis (SLTH) states that randomly-initialised neural networks likely contain subnetworks that perform well without any training. Although unstructured pruning has been extensively studied in this context, its structured counterpart, which can deliver significant computational and memory efficiency gains, has been largely unexplored. One of the main reasons for this gap is the limitations of the underlying mathematical tools used in formal analyses of the SLTH. In this paper, we overcome these limitations: we leverage recent advances in the multidimensional generalisation of the Random Subset-Sum Problem and obtain a variant that admits the stochastic dependencies that arise when addressing structured pruning in the SLTH. We apply this result to prove, for a wide class of random Convolutional Neural Networks, the existence of structured subnetworks that can approximate any sufficiently smaller network. This result provides the first sub-exponential bound around the SLTH for structured pruning, opening up new avenues for further research on the hypothesis and contributing to the understanding of the role of over-parameterization in deep learning.

The geometry of hidden representations of large transformer models
Lucrezia Valeriani Diego Doimo Francesca Cuturello Alessandro Laio Alessio ansuini Alberto Cazzaniga



Research question: Large transformers are powerful architectures for self-supervised analysis of various data types, such as protein sequences, images, and text; how do their internal representations evolve?
Motivation: By analyzing the geometric and statistical properties of large transformers' representations and how they change across layers, one can trace the evolution of the dataset's semantic structure.
Method: By analyzing the intrinsic dimension (ID) and neighbor composition, the study finds that transformers trained on a protein language task and on image reconstruction develop representations that evolve similarly.
Results: The dataset's semantic information is best expressed at the end of the first ID peak, a phenomenon observed across many models trained on diverse datasets. This yields an explicit, unsupervised strategy for identifying the layers that maximize semantic content: representations at intermediate layers corresponding to a relative minimum of the ID profile are more suitable for downstream learning tasks.

Large transformers are powerful architectures used for self-supervised data analysis across various data types, including protein sequences, images, and text. In these models, the semantic structure of the dataset emerges from a sequence of transformations between one representation and the next. We characterize the geometric and statistical properties of these representations and how they change as we move through the layers. By analyzing the intrinsic dimension (ID) and neighbor composition, we find that the representations evolve similarly in transformers trained on a protein language task and an image reconstruction task. In the first layers, the data manifold expands, becoming high-dimensional, and then contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic information of the dataset is better expressed at the end of the first peak, and this phenomenon can be observed across many models trained on diverse datasets. Based on our findings, we point out an explicit strategy to identify, without supervision, the layers that maximize semantic content: representations at intermediate layers corresponding to a relative minimum of the ID profile are more suitable for downstream learning tasks.
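
Intrinsic-dimension profiles of this kind are usually estimated with nearest-neighbor methods; a self-contained TwoNN estimator (Facco et al., 2017) is sketched below as one standard choice, without claiming it matches the paper's exact estimator:

```python
import numpy as np

def two_nn_id(X):
    # TwoNN: MLE of the intrinsic dimension from the ratio mu of each
    # point's second to first nearest-neighbor distance.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    r = np.sort(D, axis=1)[:, :2]
    mu = r[:, 1] / r[:, 0]
    return len(mu) / np.log(mu).sum()

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))                 # a 3-d manifold...
X = np.concatenate([Z, Z ** 2], axis=1)       # ...embedded in 6 ambient dimensions
print(round(two_nn_id(X), 1))                 # roughly 3
```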

Diffusion Representation for Asymmetric Kernels via Magnetic Transform
Mingzhen He FAN He Ruikai Yang Xiaolin Huang



Research question: How to effectively handle data endowed with asymmetric proximity.
Motivation: Existing nonlinear dimension reduction techniques such as the diffusion map (DM) can only use symmetric kernels, limiting their application to directed graphs, trophic networks, and other real-world scenarios.
Method: A diffusion representation framework named MagDM is proposed, which uses the magnetic transform to convert an asymmetric matrix into a Hermitian one while preserving diffusion distance and avoiding divergence during the diffusion process.
Results: The effectiveness and robustness of MagDM for data with asymmetric proximity are demonstrated on three synthetic datasets and two trophic networks.

As a nonlinear dimension reduction technique, the diffusion map (DM) has been widely used. In DM, kernels play an important role for capturing the nonlinear relationship of data. However, only symmetric kernels can be used now, which prevents the use of DM in directed graphs, trophic networks, and other real-world scenarios where the intrinsic and extrinsic geometries in data are asymmetric. A promising technique is the magnetic transform which converts an asymmetric matrix to a Hermitian one. However, we are facing essential problems, including how diffusion distance could be preserved and how divergence could be avoided during diffusion process. Via theoretical proof, we successfully establish a diffusion representation framework with the magnetic transform, named MagDM. The effectiveness and robustness for dealing with data endowed with asymmetric proximity are demonstrated on three synthetic datasets and two trophic networks.
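
The magnetic transform itself is compact enough to sketch: symmetrize the magnitudes of an asymmetric adjacency matrix and push the direction information into a complex phase, yielding a Hermitian matrix with a real spectrum. The charge parameter q below is arbitrary, and MagDM's diffusion construction on top of this is not reproduced:

```python
import numpy as np

def magnetic_transform(A, q=0.25):
    # Symmetrize magnitudes, encode edge direction as a complex phase.
    H = (A + A.T) / 2.0 * np.exp(2j * np.pi * q * (A - A.T))
    assert np.allclose(H, H.conj().T)          # Hermitian: real eigenvalues
    return H

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)         # a directed 3-cycle
w, V = np.linalg.eigh(magnetic_transform(A))
print(w.round(3))                              # real spectrum despite asymmetry
```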

Mode Connectivity in Auction Design
Christoph Hertrich Yixin Tao László A. Végh



Research question: This paper addresses auction design, a fundamental problem in algorithmic game theory, and the use of neural networks for economic optimization problems.
Motivation: Although optimal auction design is notoriously difficult even in very simple settings, recent work in differentiable economics shows that neural networks can efficiently learn known optimal auction mechanisms and discover interesting new ones.
Method: Focusing on RochetNet, one of the first such networks, and a generalized version for affine maximizer auctions, the paper proves that they satisfy mode connectivity: locally optimal solutions are connected by a simple, piecewise linear path along which every solution is almost as good as one of the two local optima.
Results: This is the first such analysis in the context of differentiable economics, providing theoretical support for using neural networks directly to solve non-convex optimization problems.

Optimal auction design is a fundamental problem in algorithmic game theory. This problem is notoriously difficult already in very simple settings. Recent work in differentiable economics showed that neural networks can efficiently learn known optimal auction mechanisms and discover interesting new ones. In an attempt to theoretically justify their empirical success, we focus on one of the first such networks, RochetNet, and a generalized version for affine maximizer auctions. We prove that they satisfy mode connectivity, i.e., locally optimal solutions are connected by a simple, piecewise linear path such that every solution on the path is almost as good as one of the two local optima. Mode connectivity has been recently investigated as an intriguing empirical and theoretically justifiable property of neural networks used for prediction problems. Our results give the first such analysis in the context of differentiable economics, where neural networks are used directly for solving non-convex optimization problems.

A General Framework for Equivariant Neural Networks on Reductive Lie Groups
Ilyes Batatia Mario Geiger Jose M Munoz Tess Smidt Lior Silberman Christoph Ortner



Research question: This paper proposes a general equivariant neural network architecture that respects the symmetries of the finite-dimensional representations of any reductive Lie group.
Motivation: Reductive Lie groups play essential roles across scientific fields including high energy physics, quantum mechanics, quantum chromodynamics, molecular dynamics, computer vision, and imaging, yet existing architectures often cannot fully exploit these symmetries.
Method: The paper presents an equivariant architecture that generalizes to any dataset equivariant to a reductive Lie group action, and introduces the lie-nn software library, which provides all the tools needed to develop and implement such general G-equivariant neural networks.
Results: Applying the method to top quark decay tagging (Lorentz group) and shape recognition (orthogonal group) demonstrates its generality and performance.

Reductive Lie Groups, such as the orthogonal groups, the Lorentz group, or the unitary groups, play essential roles across scientific fields as diverse as high energy physics, quantum mechanics, quantum chromodynamics, molecular dynamics, computer vision, and imaging. In this paper, we present a general Equivariant Neural Network architecture capable of respecting the symmetries of the finite-dimensional representations of any reductive Lie Group. Our approach generalizes the successful ACE and MACE architectures for atomistic point clouds to any data equivariant to a reductive Lie group action. We also introduce the lie-nn software library, which provides all the necessary tools to develop and implement such general G-equivariant neural networks. It implements routines for the reduction of generic tensor products of representations into irreducible representations, making it easy to apply our architecture to a wide range of problems and groups. The generality and performance of our approach are demonstrated by applying it to the tasks of top quark decay tagging (Lorentz group) and shape recognition (orthogonal group).

Curvature Filtrations for Graph Generative Model Evaluation
Joshua Southern Jeremy Wayland Michael M. Bronstein Bastian Rieck



Research question: How to effectively exploit structural properties of graphs for evaluating graph generative models.
Motivation: Existing evaluation methods cannot adequately capture how graphs differ at the distributional level, calling for more effective graph properties on which to base evaluation.
Method: Graph curvature descriptors are combined with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.
Results: Experiments show that the method effectively improves the evaluation of graph generative models.

Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property of graphs, and has recently started to prove useful in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.
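
As one concrete example of a curvature descriptor that could seed a filtration, the unaugmented Forman-Ricci curvature of an unweighted edge is simply $4 - \deg(u) - \deg(v)$; the sketch below computes it, with this particular curvature chosen for brevity rather than being the paper's prescribed descriptor:

```python
import numpy as np

def forman_curvature(A):
    # Unaugmented Forman-Ricci curvature of each undirected, unweighted edge.
    deg = A.sum(1)
    return {(u, v): int(4 - deg[u] - deg[v])
            for u in range(len(A)) for v in range(u + 1, len(A)) if A[u, v]}

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(forman_curvature(A))   # sorting edges by curvature yields a filtration
```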

How do Minimum-Norm Shallow Denoisers Look in Function Space?
Chen Zeno Greg Ongie Yaniv Blumenfeld Nir Weinberger Daniel Soudry



Research question: This paper aims to theoretically understand the success of neural network denoisers.
Motivation: Although neural network denoisers are a key building block in many common tasks, the reasons for their success remain unclear.
Method: The functions realized by shallow ReLU neural network denoisers are characterized theoretically, in the common setting of interpolation (i.e., zero training loss) with minimal representation cost (i.e., minimal $\ell^2$-norm weights).
Results: For univariate data, a closed form for the NN denoiser function is derived; it is contractive toward the clean data points and provably generalizes better than the empirical MMSE estimator at low noise levels. For multivariate data, closed-form NN denoiser functions are found under various geometric assumptions, and the resulting alignment phenomenon is verified experimentally.

Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this paper, we aim to characterize the functions realized by shallow ReLU NN denoisers --- in the common theoretical setting of interpolation (i.e., zero training loss) with a minimal representation cost (i.e., minimal $\ell^2$ norm weights). First, for univariate data, we derive a closed form for the NN denoiser function, find it is contractive toward the clean data points, and prove it generalizes better than the empirical MMSE estimator at a low noise level. Next, for multivariate data, we find the NN denoiser functions in a closed form under various geometric assumptions on the training data: data contained in a low-dimensional subspace, data contained in a union of one-sided rays, or several types of simplexes. These functions decompose into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting training samples. We empirically verify this alignment phenomenon on synthetic data and real images.

GeoTMI: Predicting Quantum Chemical Property with Easy-to-Obtain Geometry via Positional Denoising
Hyeonsu Kim Jeheon Woo SEONGHWAN KIM Seokhyun Moon Jun Hyeong Kim Woo Youn Kim



Research question: Existing graph neural networks require 3D geometries obtained from high-level quantum mechanical calculations, which is practically infeasible and limits their applicability to real-world problems.
Motivation: To address this, we propose a new training framework, GeoTMI, that uses a denoising process to accurately predict properties from easy-to-obtain geometries (corrupted versions of the correct geometries, such as those obtained from low-level calculations).
Method: GeoTMI aims to maximize the mutual information between three variables: the correct geometry, the corrupted geometry, and the property. It also explicitly updates the corrupted input to approach the correct geometry as it passes through the GNN layers, contributing to more effective denoising.
Results: Experiments with 3D GNNs on three prediction tasks, namely molecular properties, a chemical reaction property, and relaxed energy in a heterogeneous catalytic system, show consistent accuracy improvements across tasks, demonstrating the effectiveness and robustness of GeoTMI.

As quantum chemical properties have a dependence on their geometries, graph neural networks (GNNs) using 3D geometric information have achieved high prediction accuracy in many tasks. However, they often require 3D geometries obtained from high-level quantum mechanical calculations, which are practically infeasible, limiting their applicability to real-world problems. To tackle this, we propose a new training framework, GeoTMI, that employs a denoising process to predict properties accurately using easy-to-obtain geometries (corrupted versions of correct geometries, such as those obtained from low-level calculations). Our starting point was the idea that the correct geometry is the best description of the target property. Hence, to incorporate information about the correct geometry, GeoTMI aims to maximize the mutual information between three variables: the correct and the corrupted geometries and the property. GeoTMI also explicitly updates the corrupted input to approach the correct geometry as it passes through the GNN layers, contributing to more effective denoising. We investigated the performance of the proposed method using 3D GNNs for three prediction tasks: molecular properties, a chemical reaction property, and relaxed energy in a heterogeneous catalytic system. Our results showed consistent improvements in accuracy across various tasks, demonstrating the effectiveness and robustness of GeoTMI.

GUST: Combinatorial Generalization by Unsupervised Grouping with Neuronal Coherence
Hao Zheng Hui Lin Rong Zhao



Research question: How to dynamically group sensory information into structured entities in order to understand a combinatorial world.
Motivation: Successful grouping is indicated by neuronal coherence in the human brain, but grouping ability, and therefore combinatorial generalization, still challenges artificial neural networks.
Method: We introduce GUST (Grouping Unsupervisedly by Spike Timing network), an iterative network architecture with biological constraints that biases the network toward a dynamical state of neuronal coherence which softly reflects grouping information in the temporal structure of its spiking activity.
Results: We evaluate and analyze the model on synthetic datasets. Interestingly, the segregation ability is learned directly from superimposed stimuli using a succinct unsupervised objective. Two learning stages appear, from coarsely perceiving global features to additionally capturing local features. Moreover, the learned symbol-like building blocks can be systematically composed to represent novel scenes in a biologically plausible manner.

Dynamically grouping sensory information into structured entities is essential for understanding the world of combinatorial nature. However, grouping ability, and therefore combinatorial generalization, still challenges artificial neural networks. Inspired by the evidence that successful grouping is indicated by neuronal coherence in the human brain, we introduce GUST (Grouping Unsupervisely by Spike Timing network), an iterative network architecture with biological constraints to bias the network towards a dynamical state of neuronal coherence that softly reflects the grouping information in the temporal structure of its spiking activity. We evaluate and analyze the model on synthetic datasets. Interestingly, the segregation ability is directly learned from superimposed stimuli with a succinct unsupervised objective. Two learning stages are present, from coarsely perceiving global features to additionally capturing local features. Further, the learned symbol-like building blocks can be systematically composed to represent novel scenes in a bio-plausible manner.

Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies
Michael Beukman Devon Jarvis Richard Klein Steven James Benjamin Rosman



Research question: Reinforcement learning's real-world application is limited because many methods fail to generalise to unfamiliar conditions.
Motivation: When the environment's response to the agent's actions changes, the agent's behaviour must be conditioned on extrinsic state information and on pertinent context information reflecting how the environment responds.
Method: A neural network architecture called the Decision Adapter is proposed, which generates the weights of an adapter module and conditions the agent's behaviour on the context information.
Results: Experiments show that the Decision Adapter achieves superior generalisation performance in several environments and is more robust to irrelevant distractor variables than several alternative methods.

While reinforcement learning has achieved remarkable successes in several domains, its real-world application is limited due to many methods failing to generalise to unfamiliar conditions. In this work, we consider the problem of generalising to new transition dynamics, corresponding to cases in which the environment's response to the agent's actions differs. For example, the gravitational force exerted on a robot depends on its mass and changes the robot's mobility. Consequently, in such cases, it is necessary to condition an agent's actions on extrinsic state information and pertinent contextual information reflecting how the environment responds. While the need for context-sensitive policies has been established, the manner in which context is incorporated architecturally has received less attention. Thus, in this work, we present an investigation into how context information should be incorporated into behaviour learning to improve generalisation. To this end, we introduce a neural network architecture, the Decision Adapter, which generates the weights of an adapter module and conditions the behaviour of an agent on the context information. We show that the Decision Adapter is a useful generalisation of a previously proposed architecture and empirically demonstrate that it results in superior generalisation performance compared to previous approaches in several environments. Beyond this, the Decision Adapter is more robust to irrelevant distractor variables than several alternative methods.
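
A minimal hypernetwork-style sketch of the mechanism, where the context vector generates the weights of a small adapter applied to state features before the action head; the single-layer adapter and all sizes are illustrative simplifications of the Decision Adapter:

```python
import torch
import torch.nn as nn

class DecisionAdapter(nn.Module):
    # A context vector generates the weights of an adapter layer that
    # transforms state features before the action head.
    def __init__(self, ctx_dim, feat_dim, n_actions):
        super().__init__()
        self.feat_dim = feat_dim
        self.hyper = nn.Linear(ctx_dim, feat_dim * feat_dim + feat_dim)
        self.head = nn.Linear(feat_dim, n_actions)

    def forward(self, state_feat, context):
        params = self.hyper(context)
        W = params[: self.feat_dim ** 2].reshape(self.feat_dim, self.feat_dim)
        b = params[self.feat_dim ** 2:]
        h = torch.tanh(state_feat @ W.T + b)   # context-generated adapter layer
        return self.head(h)

policy = DecisionAdapter(ctx_dim=4, feat_dim=16, n_actions=3)
print(policy(torch.randn(16), torch.randn(4)).shape)   # torch.Size([3])
```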

Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension
Moritz Haas David Holzmüller Ulrike von Luxburg Ingo Steinwart



Research question: This paper studies benign overfitting in over-parameterized neural networks trained to near-zero training error, i.e., estimators that are statistically consistent even though they interpolate noisy training data.
Motivation: Benign overfitting in fixed dimension has been established for some learning methods, but the current literature suggests that for regression with typical kernel methods and wide neural networks it requires a high-dimensional setting where the dimension grows with the sample size.
Method: The paper shows that the smoothness of the estimator, not the dimension, is key: benign overfitting is possible only if the estimator's derivatives are large enough. Existing inconsistency results are generalized to non-interpolating models and more kernels, showing that benign overfitting with moderate derivatives is impossible in fixed dimension.
Results: The results are translated to wide neural networks via neural tangent kernels; experiments verify that such networks, while overfitting, can generalize well even on low-dimensional datasets.

The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting, where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.

Convolutional Neural Operators for robust and accurate learning of PDEs
Bogdan Raonic Roberto Molinaro Tim De Ryck Tobias Rohner Francesca Bartolucci Rima Alaifari Siddhartha Mishra Emmanuel de Bezenac



Research question: This paper examines convolutional neural networks (CNNs) for learning solution operators of partial differential equations (PDEs).
Motivation: Although very successful in conventional machine learning, convolution-based architectures have been believed to be inconsistent in function space and have therefore been largely ignored for learning PDE solution operators.
Method: The paper presents novel adaptations of CNNs demonstrating that they can indeed process functions as inputs and outputs. The resulting architecture, termed the convolutional neural operator (CNO), is designed to preserve its underlying continuous nature even when implemented in discretized form on a computer.
Results: On a novel suite of benchmarks encompassing a diverse set of PDEs with multi-scale solutions, CNOs significantly outperform baselines, opening a path toward robust and accurate operator learning.

Although very successfully used in conventional machine learning, convolution based neural network architectures -- believed to be inconsistent in function space -- have been largely ignored in the context of learning solution operators of PDEs. Here, we present novel adaptations for convolutional neural networks to demonstrate that they are indeed able to process functions as inputs and outputs. The resulting architecture, termed as convolutional neural operators (CNOs), is designed specifically to preserve its underlying continuous nature, even when implemented in a discretized form on a computer. We prove a universality theorem to show that CNOs can approximate operators arising in PDEs to desired accuracy. CNOs are tested on a novel suite of benchmarks, encompassing a diverse set of PDEs with multi-scale solutions and are observed to significantly outperform baselines, paving the way for an alternative framework for robust and accurate operator learning.

Investigating how ReLU-networks encode symmetries
Georg Bökman Fredrik Kahl



Research question: This study investigates whether equivariance of a network implies that all of its layers are equivariant.
Motivation: Many data symmetries can be described in terms of group equivariance, and the most common way to encode group equivariance in neural networks is to build group-equivariant linear layers.
Method: Through theoretical analysis and experimental validation, the study examines whether network equivariance implies layerwise equivariance.
Results: In some cases equivariance does imply layerwise equivariance, but this is not the case in general. The authors nevertheless conjecture that CNNs trained to be equivariant will exhibit layerwise equivariance, and explain how this conjecture is a weaker version of the recent permutation conjecture of Entezari et al. Quantitative experiments with VGG-nets on CIFAR10 and qualitative experiments with ResNets on ImageNet support and illustrate the theoretical findings. Beyond clarifying how group equivariance is encoded in ReLU networks, the experiments give a new perspective on the permutation conjecture, since merging a network with a group-transformed version of itself is typically much easier than merging two different networks.

Many data symmetries can be described in terms of group equivariance and the most common way of encoding group equivariances in neural networks is by building linear layers that are group equivariant. In this work we investigate whether equivariance of a network implies that all layers are equivariant. On the theoretical side we find cases where equivariance implies layerwise equivariance, but also demonstrate that this is not the case generally. Nevertheless, we conjecture that CNNs that are trained to be equivariant will exhibit layerwise equivariance and explain how this conjecture is a weaker version of the recent permutation conjecture by Entezari et al.\ [2022]. We perform quantitative experiments with VGG-nets on CIFAR10 and qualitative experiments with ResNets on ImageNet to illustrate and support our theoretical findings. These experiments are not only of interest for understanding how group equivariance is encoded in ReLU-networks, but they also give a new perspective on Entezari et al.'s permutation conjecture as we find that it is typically easier to merge a network with a group-transformed version of itself than merging two different networks.

WalkLM: A Uniform Language Model Fine-tuning Framework for Attributed Graph Embedding
Yanchao Tan Zihao Zhou Hang Lv Weiming Liu Carl Yang



Research question: How to jointly model the complex attributes and flexible structures of real-world graphs while obtaining unsupervised generic graph representations that are not limited to specific downstream predictions.
Motivation: Existing graph neural networks (GNNs) require sufficient training toward specific downstream predictions to achieve strong performance, while real-world graphs are often associated with complex attributes on multiple types of nodes, and even links, that are hard to model uniformly.
Method: Taking a fundamentally different approach from GNNs, this work naturally integrates language models (LMs) and random walks (RWs): it composes roughly meaningful textual sequences directly from attributed RWs, then fine-tunes an LM on the RW-based sequences and extracts embedding vectors that encapsulate both attribute semantics and graph structure.
Results: Evaluating the learned node embeddings on different downstream prediction tasks across multiple real-world attributed graph datasets shows significant improvements over a comprehensive set of state-of-the-art unsupervised node embedding methods.

Graphs are widely used to model interconnected entities and improve downstream predictions in various real-world applications. However, real-world graphs nowadays are often associated with complex attributes on multiple types of nodes and even links that are hard to model uniformly, while the widely used graph neural networks (GNNs) often require sufficient training toward specific downstream predictions to achieve strong performance. In this work, we take a fundamentally different approach than GNNs, to simultaneously achieve deep joint modeling of complex attributes and flexible structures of real-world graphs and obtain unsupervised generic graph representations that are not limited to specific downstream predictions. Our framework, built on a natural integration of language models (LMs) and random walks (RWs), is straightforward, powerful and data-efficient. Specifically, we first perform attributed RWs on the graph and design an automated program to compose roughly meaningful textual sequences directly from the attributed RWs; then we fine-tune an LM using the RW-based textual sequences and extract embedding vectors from the LM, which encapsulates both attribute semantics and graph structures. In our experiments, we evaluate the learned node embeddings towards different downstream prediction tasks on multiple real-world attributed graph datasets and observe significant improvements over a comprehensive set of state-of-the-art unsupervised node embedding methods. We believe this work opens a door for more sophisticated technical designs and empirical evaluations toward the leverage of LMs for the modeling of real-world graphs.
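
A toy version of the RW-to-text step is sketched below; the verbalization template and graph schema here are invented for illustration, and the paper's automated program and the subsequent LM fine-tuning are not reproduced:

```python
import random

def attributed_walk_text(graph, attrs, start, length=4, seed=0):
    # Walk the graph and verbalize each visited node's type and name.
    rng = random.Random(seed)
    node, parts = start, []
    for _ in range(length):
        parts.append(f"{attrs[node]['type']} {attrs[node]['name']}")
        node = rng.choice(graph[node])
    return ", connects to ".join(parts)

graph = {"p1": ["a1", "a2"], "a1": ["p1"], "a2": ["p1"]}
attrs = {"p1": {"type": "paper", "name": "GNN survey"},
         "a1": {"type": "author", "name": "Alice"},
         "a2": {"type": "author", "name": "Bob"}}
print(attributed_walk_text(graph, attrs, "p1"))
```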

ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
Zhongzhan Huang Pan Zhou Shuicheng YAN Liang Lin



Research question: UNet training is unstable in diffusion models; scaling the coefficients of its long skip connections (LSCs) can alleviate this, but a theoretical understanding of the instability and of the performance improvement from LSC scaling is still lacking.
Motivation: To resolve the training instability of UNet in diffusion models and improve its training stability and performance.
Method: It is shown theoretically that the LSC coefficients have a large effect on the stability of UNet's forward and backward propagation and on its robustness, and an effective LSC coefficient scaling framework, ScaleLong, is proposed.
Results: Experiments show that the method effectively stabilizes UNet training and yields about 1.5x training acceleration on different diffusion models with UNet or UViT backbones.

In diffusion models, UNet is the most popular network backbone, since its long skip connections (LSCs) between distant network blocks can aggregate long-range information and alleviate vanishing gradients. Unfortunately, UNet often suffers from unstable training in diffusion models, which can be alleviated by scaling down its LSC coefficients. However, theoretical understandings of the instability of UNet in diffusion models and of the performance improvement from LSC scaling remain absent. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have a large effect on the stability of the forward and backward propagation and on the robustness of UNet. Specifically, the hidden feature and gradient of UNet at any layer can oscillate, and their oscillation ranges are actually large, which explains the instability of UNet training. Moreover, UNet is also provably sensitive to perturbed input, and predicts an output distant from the desired output, yielding oscillatory loss and thus oscillatory gradient. Besides, we also observe the theoretical benefits of the LSC coefficient scaling of UNet for the stability of hidden features and gradients and for robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework, ScaleLong, that scales the coefficients of LSCs in UNet and improves the training stability of UNet. Experimental results on CIFAR10, CelebA, ImageNet and COCO show that our method is superior in stabilizing training, and yields about 1.5x training acceleration on different diffusion models with UNet or UViT backbones.

Laplacian Canonization: A Minimalist Approach to Sign and Basis Invariant Spectral Embedding
George Ma Yifei Wang Yisen Wang



Research question: How to increase the expressive power of spectral graph embeddings while preserving the sign and basis invariances of graphs.
Motivation: Existing graph embedding techniques gain expressive power at the cost of losing sign and basis invariance, which limits their effectiveness on graph data.
Method: Laplacian Canonization (LC) is proposed, which resolves the ambiguity by directly finding canonical directions for the eigenvectors.
Results: Experiments show the method successfully canonizes more than 90% of all eigenvectors and outperforms existing methods on real-world benchmark datasets while bringing minimal computational overhead.

Spectral embedding is a powerful graph embedding technique that has received a lot of attention recently due to its effectiveness on Graph Transformers. However, from a theoretical perspective, the universal expressive power of spectral embedding comes at the price of losing two important invariance properties of graphs, sign and basis invariance, which also limits its effectiveness on graph data. To remedy this issue, many previous methods developed costly approaches to learn new invariants and suffer from high computation complexity. In this work, we explore a minimal approach that resolves the ambiguity issues by directly finding canonical directions for the eigenvectors, named Laplacian Canonization (LC). As a pure pre-processing method, LC is light-weighted and can be applied to any existing GNNs. We provide a thorough investigation, from theory to algorithm, on this approach, and discover an efficient algorithm named Maximal Axis Projection (MAP) that works for both sign and basis invariance and successfully canonizes more than 90\% of all eigenvectors. Experiments on real-world benchmark datasets like ZINC, MOLTOX21, and MOLPCBA show that MAP consistently outperforms existing methods while bringing minimal computation overhead. Code is available at \url{https://github.com/PKU-ML/LaplacianCanonization}.
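
Sign ambiguity, the simpler of the two invariance issues, can be canonized in a few lines: deterministically flip each eigenvector according to a fixed rule. The rule below (make the first non-negligible coordinate positive) is a simplified stand-in for MAP, which projects onto coordinate axes and also handles the basis ambiguity of repeated eigenvalues:

```python
import numpy as np

def canonize_signs(V, tol=1e-9):
    # Flip each eigenvector so its first non-negligible coordinate is positive.
    V = V.copy()
    for j in range(V.shape[1]):
        for k in range(V.shape[0]):
            if abs(V[k, j]) > tol:
                if V[k, j] < 0:
                    V[:, j] = -V[:, j]
                break
    return V

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
w, V = np.linalg.eigh(L)
print(np.allclose(canonize_signs(V), canonize_signs(-V)))   # sign-invariant: True
```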

Neural Oscillators are Universal
Samuel Lanthaler T. Konstantin Rusch Siddhartha Mishra



Research question: This paper introduces an abstract class of neural oscillators and proves their universality, i.e., that they can approximate any continuous and causal operator mapping.
Motivation: Coupled oscillators are increasingly used as the basis of machine learning (ML) architectures, for instance in sequence modeling, graph representation learning, and the physical neural networks used in analog ML devices.
Method: An abstract class of neural oscillators encompassing these architectures is introduced, and their universality is proven: they can approximate any continuous and causal operator mapping between time-varying functions to desired accuracy.
Results: This universality result provides theoretical justification for oscillator-based ML systems.

Coupled oscillators are being increasingly used as the basis of machine learning (ML) architectures, for instance in sequence modeling, graph representation learning and in physical neural networks that are used in analog ML devices. We introduce an abstract class of *neural oscillators* that encompasses these architectures and prove that neural oscillators are universal, i.e., they can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. This universality result provides theoretical justification for the use of oscillator based ML systems. The proof builds on a fundamental result of independent interest, which shows that a combination of forced harmonic oscillators with a nonlinear read-out suffices to approximate the underlying operators.
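
The building block named in the proof, forced harmonic oscillators followed by a nonlinear read-out, can be simulated directly; the frequencies, the semi-implicit Euler integrator, and the random read-out below are illustrative assumptions:

```python
import numpy as np

def oscillator_readout(u, omega, W_out, dt=0.01):
    # Semi-implicit Euler for a bank of forced oscillators y'' = -omega^2 y + u(t),
    # followed by a nonlinear read-out at every step.
    y = np.zeros_like(omega)
    v = np.zeros_like(omega)
    outs = []
    for ut in u:                                # scalar forcing at each time step
        v += dt * (-(omega ** 2) * y + ut)
        y += dt * v
        outs.append(np.tanh(W_out @ y))
    return np.array(outs)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2 * np.pi, 400)
out = oscillator_readout(np.sin(3 * t),
                         omega=rng.uniform(0.5, 5.0, size=16),
                         W_out=rng.normal(size=(1, 16)))
print(out.shape)                                # (400, 1)
```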

Spiking PointNet: Spiking Neural Networks for Point Clouds
Dayong Ren Zhe Ma Yuanpei Chen Weihang Peng Xiaode Liu Yuhan Zhang Yufei Guo



Research question: Whether spiking neural networks (SNNs) can be generalized to 3D recognition, addressing the challenge of applying deep learning to 3D point clouds.
Motivation: Although SNNs enjoy extreme energy efficiency and have attracted much research attention in 2D visual recognition, their application to 3D recognition remains underexplored.
Method: Propose Spiking PointNet, the first deep spiking model for point clouds. Two main obstacles limit SNNs on point clouds: the intrinsic optimization difficulty of SNNs, which impedes training large spiking models with many time steps, and the high memory and computation cost of PointNet, which makes training a large spiking point model impractical. To address both at once, we propose a trained-less-but-learning-more paradigm for Spiking PointNet, supported by theoretical justification and in-depth experimental analysis.
Results: Experiments on ModelNet10 and ModelNet40 demonstrate the effectiveness of Spiking PointNet. Notably, it can even outperform its ANN counterpart, which is rare in the SNN field and suggests directions for follow-up work; it also shows significant training-time speedups and storage savings.

Recently, Spiking Neural Networks (SNNs), enjoying extreme energy efficiency, have drawn much research attention on 2D visual recognition and shown gradually increasing application potential. However, it remains underexplored whether SNNs can be generalized to 3D recognition. To this end, we present Spiking PointNet, the first spiking neural model for efficient deep learning on point clouds. We discover that the two huge obstacles limiting the application of SNNs to point clouds are: the intrinsic optimization obstacle of SNNs, which impedes the training of a big spiking model with large time steps, and the expensive memory and computation cost of PointNet, which makes training a big spiking point model unrealistic. To solve both problems simultaneously, we present a trained-less but learning-more paradigm for Spiking PointNet with theoretical justifications and in-depth experimental analysis. Specifically, our Spiking PointNet is trained with only a single time step but can obtain better performance with multiple-time-step inference, compared to one trained directly with multiple time steps. We conduct various experiments on ModelNet10 and ModelNet40 to demonstrate the effectiveness of Spiking PointNet. Notably, our Spiking PointNet can even outperform its ANN counterpart, which is rare in the SNN field, thus providing a potential research direction for follow-up work. Moreover, Spiking PointNet shows impressive speedup and storage savings in the training phase. Our code is open-sourced at https://github.com/DayongRen/Spiking-PointNet.

Neural Graph Generation from Graph Statistics
Kiarash Zahirnia Yaochen Hu Mark Coates Oliver Schulte



Research question: How to learn a deep graph generative model from aggregate graph statistics while protecting local privacy.
Motivation: Traditional graph generative models are usually learned from the graph adjacency matrix, whereas privacy researchers have proposed learning from graph statistics to protect privacy.
Method: Develop an architecture for training a deep graph generative model that matches the statistics while preserving local differential privacy guarantees.
Results: Experiments show that, when learning from graph statistics alone, our deep model generates more realistic graphs than traditional generative models, and it remains competitive while protecting local privacy.

We describe a new setting for learning a deep graph generative model (GGM) from aggregate graph statistics, rather than from the graph adjacency matrix. Matching the statistics of observed training graphs is the main approach for learning traditional GGMs (e.g., BTER, Chung-Lu, and Erdos-Renyi models). Privacy researchers have proposed learning from graph statistics as a way to protect privacy. We develop an architecture for training a deep GGM to match statistics while preserving local differential privacy guarantees. Empirical evaluation on 8 datasets indicates that our deep GGM generates more realistic graphs than the traditional GGMs when both are learned from graph statistics only. We also benchmark our deep GGM, trained on statistics only, against state-of-the-art deep GGMs that are trained on the entire adjacency matrix. The results show that graph statistics are often sufficient to build a competitive deep GGM that generates realistic graphs while protecting local privacy.

Re-Think and Re-Design Graph Neural Networks in Spaces of Continuous Graph Diffusion Functionals
Tingting Dan Jiaqi Ding Ziquan Wei Shahar Z Kovalsky Minjeong Kim Won Hwa Kim Guorong Wu



Research question: How to design new inductive biases that capture long-range dependencies and global patterns in graphs, overcoming the limitations of the locality assumption in graph neural networks (GNNs).
Motivation: Constrained by the locality assumption, current GNN models cannot effectively capture long-range dependencies and global patterns in graphs. Inspired by the classic Brachistochrone problem, we seek a new inductive bias through a general variational-analysis framework.
Method: Propose a two-way mapping framework that links discrete GNN models with continuous diffusion functionals, allowing application-specific objective functions to be designed in the continuous domain and discrete deep models to be engineered with mathematical guarantees. We also introduce total variation (TV) to align graph diffusion patterns with the global information in community topologies, and design a new selective mechanism that addresses the trade-off between model depth and over-smoothing.
Results: Experiments show that our new GNN models achieve state-of-the-art performance on graph learning benchmarks such as Cora, Citeseer, and Pubmed.

Graphs are ubiquitous in various domains, such as social networks and biological systems. Despite the great successes of graph neural networks (GNNs) in modeling and analyzing complex graph data, the inductive bias of the locality assumption, which involves exchanging information only within neighboring connected nodes, restricts GNNs from capturing long-range dependencies and global patterns in graphs. Inspired by the classic Brachistochrone problem, we seek to devise a new inductive bias for cutting-edge graph applications and present a general framework through the lens of variational analysis. The backbone of our framework is a two-way mapping between the discrete GNN model and continuous diffusion functional, which allows us to design application-specific objective functions in the continuous domain and engineer discrete deep models with mathematical guarantees. First, we address over-smoothing in current GNNs. Specifically, our inference reveals that the existing layer-by-layer models of graph embedding learning are equivalent to an $\ell_2$-norm integral functional of graph gradients, which is the underlying cause of the over-smoothing problem. Similar to edge-preserving filters in image denoising, we introduce the total variation (TV) to promote alignment of the graph diffusion pattern with the global information present in community topologies. On top of this, we devise a new selective mechanism for inductive bias that can be easily integrated into existing GNNs and effectively address the trade-off between model depth and over-smoothing. Second, we devise a novel generative adversarial network (GAN) to predict the spreading flows in the graph through a neural transport equation. To avoid the potential issue of vanishing flows, we tailor the objective function to minimize the transportation within each community while maximizing the inter-community flows. Our new GNN models achieve state-of-the-art (SOTA) performance on graph learning benchmarks such as Cora, Citeseer, and Pubmed.
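
For concreteness, the two functionals contrasted in the abstract can be written in their standard discrete forms (the paper's exact definitions may differ): for node features $f$ on a weighted graph with edge set $E$,

$$E_{\ell_2}(f) \;=\; \frac{1}{2}\sum_{(i,j)\in E} w_{ij}\,\lVert f_i - f_j\rVert_2^2, \qquad E_{\mathrm{TV}}(f) \;=\; \sum_{(i,j)\in E} w_{ij}\,\lVert f_i - f_j\rVert_2 .$$

Minimizing the $\ell_2$ (Dirichlet) form drives all neighboring embeddings toward equality, which is the over-smoothing mechanism identified above, whereas the TV form penalizes large jumps only linearly and therefore preserves sharp transitions across community boundaries, analogous to edge-preserving filters in image denoising.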

Wide Neural Networks as Gaussian Processes: Lessons from Deep Equilibrium Models
Tianxiang Gao Xiaokai Huo Hailiang Liu Hongyang Gao



Research question: An in-depth study of the deep equilibrium model (DEQ), an infinite-depth neural network with weight matrices shared across layers.
Motivation: Existing results mainly concern shallow or finite-depth networks, so a comprehensive analysis of infinite-depth networks such as neural ordinary differential equations (ODEs) and deep equilibrium models (DEQs) is needed.
Method: Analyze the DEQ and show that, as the width of its layers tends to infinity, it converges to a Gaussian process, establishing the Neural Network and Gaussian Process (NNGP) correspondence.
Results: These findings lay the foundation for studying the training and generalization of DEQs, paving the way for future research in this area.

Neural networks with wide layers have attracted significant attention due to their equivalence to Gaussian processes, enabling perfect fitting of training data while maintaining generalization performance, known as benign overfitting. However, existing results mainly focus on shallow or finite-depth networks, necessitating a comprehensive analysis of wide neural networks with infinite-depth layers, such as neural ordinary differential equations (ODEs) and deep equilibrium models (DEQs). In this paper, we specifically investigate the deep equilibrium model (DEQ), an infinite-depth neural network with shared weight matrices across layers. Our analysis reveals that as the width of DEQ layers approaches infinity, it converges to a Gaussian process, establishing what is known as the Neural Network and Gaussian Process (NNGP) correspondence. Remarkably, this convergence holds even when the limits of depth and width are interchanged, which is not observed in typical infinite-depth Multilayer Perceptron (MLP) networks. Furthermore, we demonstrate that the associated Gaussian vector remains non-degenerate for any pairwise distinct input data, ensuring a strictly positive smallest eigenvalue of the corresponding kernel matrix using the NNGP kernel. These findings serve as fundamental elements for studying the training and generalization of DEQs, laying the groundwork for future research in this area.
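
As a reminder of the object under study (a minimal sketch; the scaling of $W$ and the fixed-point solver are illustrative assumptions), a DEQ layer defines its output implicitly as the fixed point of a single weight-tied transformation rather than by stacking layers:

    import numpy as np

    rng = np.random.default_rng(0)
    width, d_in = 256, 32
    W = rng.standard_normal((width, width)) / (4 * np.sqrt(width))  # contractive
    U = rng.standard_normal((width, d_in)) / np.sqrt(d_in)
    x = rng.standard_normal(d_in)

    z = np.zeros(width)
    for _ in range(100):                    # plain fixed-point iteration
        z_new = np.tanh(W @ z + U @ x)
        if np.linalg.norm(z_new - z) < 1e-8:
            break
        z = z_new                           # z approximates z* = tanh(W z* + U x)

The NNGP correspondence in the paper concerns the distribution of such fixed points as the width tends to infinity with random W and U.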

Approximately Equivariant Graph Networks
Ningyuan Teresa Huang Ron Levie Soledad Villar



Research question: The symmetries of graph neural networks (GNNs), and how to formalize approximate symmetries via graph coarsening.
Motivation: Although both GNNs and CNNs have symmetries, they are fundamentally different: the translation equivariance of CNNs corresponds to symmetries of a fixed domain acting on the image signals (also called active symmetries), whereas in GNNs any permutation acts on both the graph signals and the graph domain (sometimes called passive symmetries). The authors therefore focus on the active symmetries of GNNs and consider a learning setting where signals are supported on a fixed graph.
Method: Relax the notion of symmetry by formalizing approximate symmetries via graph coarsening, and propose a bias-variance formula that quantifies the tradeoff between the loss in expressivity and the gain in regularity of the learned estimator as a function of the chosen symmetry group.
Results: Extensive experiments on image inpainting, traffic flow prediction, and human pose estimation show that the best generalization is obtained by choosing a group suitably larger than the graph automorphism group but smaller than the permutation group.

Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolution neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to symmetries of the fixed domain acting on the image signals (sometimes known as active symmetries), whereas in GNNs any permutation acts on both the graph signals and the graph domain (sometimes described as passive symmetries). In this work, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We present a bias-variance formula that quantifies the tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group. To illustrate our approach, we conduct extensive experiments on image inpainting, traffic flow prediction, and human pose estimation with different choices of symmetries. We show theoretically and empirically that the best generalization performance can be achieved by choosing a suitably larger group than the graph automorphism, but smaller than the permutation group.

Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?
Haitao Mao Zhikai Chen Wei Jin Haoyu Han Yao Ma Tong Zhao Neil Shah Jiliang Tang



Research question: Existing graph neural networks (GNNs) capture structural patterns on homophilic and certain heterophilic graphs, but their performance differs markedly across nodes with different structural patterns, e.g., homophilic nodes in heterophilic graphs.
Motivation: Most real-world homophilic and heterophilic graphs consist of a mixture of nodes with homophilic and heterophilic structural patterns, exhibiting structural disparity, yet research on GNN performance across such nodes remains very limited.
Method: Through theoretical analysis and empirical study, we examine GNN performance on nodes with distinct structural patterns and propose a new non-i.i.d. PAC-Bayesian generalization bound that reveals the causes of the performance disparity.
Results: Experiments show that GNNs perform well on homophilic nodes in homophilic graphs and heterophilic nodes in heterophilic graphs, but poorly on the opposite node sets. We further elucidate the effectiveness of deeper GNNs, reveal an overlooked distribution-shift factor in graph out-of-distribution problems, and propose a new scenario accordingly.

Recent studies on Graph Neural Networks (GNNs) provide both empirical and theoretical evidence supporting their effectiveness in capturing structural patterns on both homophilic and certain heterophilic graphs. Notably, most real-world homophilic and heterophilic graphs are comprised of a mixture of nodes in both homophilic and heterophilic structural patterns, exhibiting a structural disparity. However, the analysis of GNN performance with respect to nodes exhibiting different structural patterns, e.g., homophilic nodes in heterophilic graphs, remains rather limited. In the present study, we provide evidence that GNNs on node classification typically perform admirably on homophilic nodes within homophilic graphs and heterophilic nodes within heterophilic graphs while struggling on the opposite node set, exhibiting a performance disparity. We theoretically and empirically identify the effects of GNNs on testing nodes exhibiting distinct structural patterns. We then propose a rigorous, non-i.i.d. PAC-Bayesian generalization bound for GNNs, revealing the reasons for the performance disparity, namely the aggregated feature distance and the homophily ratio difference between training and testing nodes. Furthermore, we demonstrate the practical implications of our new findings via (1) elucidating the effectiveness of deeper GNNs; and (2) revealing an overlooked distribution shift factor in the graph out-of-distribution problem and proposing a new scenario accordingly.
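
One of the two quantities the bound depends on, the homophily ratio, is simple to compute; the sketch below (illustrative only, not the paper's code) measures, for each node, the fraction of its neighbors sharing its label:

    import numpy as np

    def node_homophily(edges, labels):
        """Per-node homophily ratio on an undirected edge list."""
        n = len(labels)
        same, deg = np.zeros(n), np.zeros(n)
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                deg[a] += 1
                same[a] += labels[a] == labels[b]
        return np.divide(same, deg, out=np.zeros(n), where=deg > 0)

    print(node_homophily([(0, 1), (1, 2), (0, 2)], [0, 0, 1]))  # [0.5 0.5 0.]

Under the paper's analysis, the gap between this ratio on training and testing nodes, together with the aggregated-feature distance, drives the performance disparity.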

Improving neural network representations using human similarity judgments
Lukas Muttenthaler Lorenz Linhardt Jonas Dippel Robert A. Vandermeulen Katherine Hermann Andrew Kyle Lampinen Simon Kornblith



Research question: Explore the impact of supervising the global structure of neural network representations with human similarity judgments, and propose a new method that aligns the global structure of representations while preserving their local structure.
Motivation: Deep neural networks have reached human-level performance on many computer vision tasks, but the objectives used to train them only enforce that similar images are embedded at nearby locations in the representation space, without directly constraining the global structure of the resulting space.
Method: Linearly align the global structure of representations with human similarity judgments and study the effect of supervising it. Since a naive approach causes large changes in local representational structure that harm downstream performance, we propose a novel method that aligns the global structure while preserving the local structure.
Results: Experiments indicate that human visual representations are globally organized in a way that facilitates learning from few examples, and incorporating this global structure into neural network representations significantly improves accuracy on a variety of few-shot learning and anomaly detection tasks.

Deep neural networks have reached human-level performance on many computer vision tasks. However, the objectives used to train these networks enforce only that similar images are embedded at similar locations in the representation space, and do not directly constrain the global structure of the resulting space. Here, we explore the impact of supervising this global structure by linearly aligning it with human similarity judgments. We find that a naive approach leads to large changes in local representational structure that harm downstream performance. Thus, we propose a novel method that aligns the global structure of representations while preserving their local structure. This global-local transform considerably improves accuracy across a variety of few-shot learning and anomaly detection tasks. Our results indicate that human visual representations are globally organized in a way that facilitates learning from few examples, and incorporating this global structure into neural network representations improves performance on downstream tasks.

CAT-Walk: Inductive Hypergraph Learning via Set Walks
Ali Behrouz Farnoosh Hashemi Sadaf Sadeghian Margo Seltzer



Research question: How to learn effective representations of hypergraphs to extract the higher-order interaction patterns that are critically important in real-world problems in social network analysis, neuroscience, finance, etc.
Motivation: Existing methods are typically designed only for specific tasks or static hypergraphs, and fail to learn dynamic laws or extract higher-order causal patterns.
Method: Propose CAT-Walk, which learns the dynamic laws of the temporal and structural processes in a hypergraph by introducing SetMixer, an adaptive, permutation-invariant, set-based pooling strategy, together with a set-based anonymization process that hides hyperedge identities.
Results: Evaluation on 10 hypergraph benchmark datasets shows that CAT-Walk attains outstanding temporal hyperedge prediction performance in both inductive and transductive settings, and is competitive with state-of-the-art methods on node classification.

Temporal hypergraphs provide a powerful paradigm for modeling time-dependent, higher-order interactions in complex systems. Representation learning for hypergraphs is essential for extracting patterns of the higher-order interactions that are critically important in real-world problems in social network analysis, neuroscience, finance, etc. However, existing methods are typically designed only for specific tasks or static hypergraphs. We present CAT-Walk, an inductive method that learns the dynamic laws governing the temporal and structural processes underlying a temporal hypergraph. CAT-Walk introduces a temporal, higher-order walk on hypergraphs, SetWalk, that extracts higher-order causal patterns. CAT-Walk uses a novel adaptive and permutation-invariant pooling strategy, SetMixer, along with a set-based anonymization process that hides the identity of hyperedges. Finally, we present a simple yet effective neural network model to encode hyperedges. Our evaluation on 10 hypergraph benchmark datasets shows that CAT-Walk attains outstanding performance on temporal hyperedge prediction benchmarks in both inductive and transductive settings. It also shows competitive performance with state-of-the-art methods for node classification. (https://github.com/ubc-systopia/CATWalk)

Equivariant Flow Matching with Hybrid Probability Transport for 3D Molecule Generation
Yuxuan Song Jingjing Gong Minkai Xu Ziyao Cao Yanyan Lan Stefano Ermon Hao Zhou Wei-Ying Ma



Research question: How to simultaneously decide the categorical features (atom types) and continuous features (atom coordinates) of 3D molecules.
Motivation: Existing deep generative models are effective at generating feature-rich geometries but typically suffer from unstable probability dynamics and inefficient sampling.
Method: Introduce geometric flow matching, which combines the advantages of equivariant modeling and stabilized probability dynamics. Specifically, we propose a hybrid probability path in which the coordinate probability path is regularized by equivariant optimal transport and information is aligned across modalities.
Results: Experiments show the method consistently achieves better performance on multiple molecule generation benchmarks, with a 4.75x average sampling speedup.

The generation of 3D molecules requires simultaneously deciding the categorical features (atom types) and continuous features (atom coordinates). Deep generative models, especially Diffusion Models (DMs), have demonstrated effectiveness in generating feature-rich geometries. However, existing DMs typically suffer from unstable probability dynamics with inefficient sampling speed. In this paper, we introduce geometric flow matching, which enjoys the advantages of both equivariant modeling and stabilized probability dynamics. More specifically, we propose a hybrid probability path where the coordinate probability path is regularized by an equivariant optimal transport, and the information between different modalities is aligned. Experimentally, the proposed method consistently achieves better performance on multiple molecule generation benchmarks, with a 4.75$\times$ average sampling speedup.
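
For orientation, a minimal (non-equivariant) conditional flow matching loop on point coordinates looks as follows; the straight-line path, the stand-in data distribution, and the plain MLP are assumptions for illustration, while the paper additionally couples the endpoints via equivariant optimal transport and models the categorical atom types:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 128), nn.SiLU(), nn.Linear(128, 3))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(1000):
        x1 = torch.randn(256, 3) * 0.5 + 1.0   # stand-in "data" coordinates
        x0 = torch.randn(256, 3)               # prior sample
        t = torch.rand(256, 1)
        xt = (1 - t) * x0 + t * x1             # linear interpolant path
        v = model(torch.cat([xt, t], dim=1))   # predicted velocity field
        loss = ((v - (x1 - x0)) ** 2).mean()   # regress the path velocity
        opt.zero_grad(); loss.backward(); opt.step()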

The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks
Spencer Frei Gal Vardi Peter Bartlett Nathan Srebro



Research question: The implications of the implicit bias of gradient flow in ReLU networks for generalization and adversarial robustness.
Motivation: When the data consist of clusters with small correlations between cluster means, gradient flow in two-layer ReLU networks is biased towards solutions that generalize well but are vulnerable to adversarial examples, and this holds even for highly overparameterized networks.
Method: Analyze the implicit bias of gradient flow in two-layer ReLU networks on data consisting of clusters whose means have small pairwise correlations.
Results: Although such settings admit harmful overfitting, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial $\ell_2$-perturbations), even though robust networks that fit the data exist.

In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are vulnerable to adversarial examples. Our results hold even in cases where the network is highly overparameterized. Despite the potential for harmful overfitting in such settings, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial $\ell_2$-perturbations), even though robust networks that fit the data exist.

Does a sparse ReLU network training problem always admit an optimum?
Tung Quoc Le Rémi Gribonval Elisa Riccietti



Research question: The existence of the optimal network parameters sought by optimization algorithms is not always guaranteed, particularly in the context of sparse ReLU neural networks.
Motivation: Optimization problems for deep networks with certain sparsity patterns do not always admit optimal parameters, and optimization algorithms may therefore diverge.
Method: Via a new topological relation between sparse ReLU neural networks and their linear counterparts, and using existing tools from real algebraic geometry, derive an algorithm to verify whether a given sparsity pattern suffers from this issue; then prove that every concrete optimization problem involving a shallow sparse ReLU network with output dimension one admits a global optimum.
Results: The analysis rests on two topological properties of the space of functions implementable as sparse ReLU networks: a best-approximation property and closedness in the uniform norm, studied both on finite domains corresponding to practical training on finite training sets and on more general domains such as the unit cube. This yields conditions guaranteeing the existence of an optimum for a given sparsity pattern. The results apply not only to several sparsity patterns from recent work on network pruning/sparsification, but also to classical dense networks, including architectures not covered by existing results.

Given a training set, a loss function, and a neural network architecture, it is often taken for granted that optimal network parameters exist, and a common practice is to apply available optimization algorithms to search for them. In this work, we show that the existence of an optimal solution is not always guaranteed, especially in the context of sparse ReLU neural networks. In particular, we first show that optimization problems involving deep networks with certain sparsity patterns do not always have optimal parameters, and that optimization algorithms may then diverge. Via a new topological relation between sparse ReLU neural networks and their linear counterparts, we derive --using existing tools from real algebraic geometry-- an algorithm to verify that a given sparsity pattern suffers from this issue. Then, the existence of a global optimum is proved for every concrete optimization problem involving a shallow sparse ReLU neural network of output dimension one. Overall, the analysis is based on the investigation of two topological properties of the space of functions implementable as sparse ReLU neural networks: a best approximation property, and a closedness property, both in the uniform norm. This is studied both for (finite) domains corresponding to practical training on finite training sets, and for more general domains such as the unit cube. This allows us to provide conditions for the guaranteed existence of an optimum given a sparsity pattern. The results apply not only to several sparsity patterns proposed in recent works on network pruning/sparsification, but also to classical dense neural networks, including architectures not covered by existing results.

Learning a Neuron by a Shallow ReLU Network: Dynamics and Implicit Bias for Correlated Inputs
Dmitry Chistikov Matthias Englert Ranko Lazic



Research question: Prove that, for the fundamental regression task of learning a single neuron, training a one-hidden-layer ReLU network of any width by gradient flow from a small initialization converges to zero loss and is implicitly biased to minimize the rank of the network parameters.
Motivation: Previous work mainly considered orthogonal datasets; we complement it by assuming that the training points are correlated with the teacher neuron.
Method: The results follow from a detailed non-asymptotic analysis of the dynamics of each hidden neuron throughout training.
Results: We exhibit a surprising distinction between interpolator networks of minimal rank and those of minimal Euclidean norm, and corroborate the theoretical findings with a range of numerical experiments.

We prove that, for the fundamental regression task of learning a single neuron, training a one-hidden layer ReLU network of any width by gradient flow from a small initialisation converges to zero loss and is implicitly biased to minimise the rank of network parameters. By assuming that the training points are correlated with the teacher neuron, we complement previous work that considered orthogonal datasets. Our results are based on a detailed non-asymptotic analysis of the dynamics of each hidden neuron throughout the training. We also show and characterise a surprising distinction in this setting between interpolator networks of minimal rank and those of minimal Euclidean norm. Finally we perform a range of numerical experiments, which corroborate our theoretical findings.

Joint Feature and Differentiable $k$-NN Graph Learning using Dirichlet Energy
Lei Xu Lei Chen Rong Wang Feiping Nie Xuelong Li



Research question: Propose a deep feature selection method based on the Dirichlet energy that performs feature selection and differentiable k-NN graph learning simultaneously.
Motivation: Feature selection plays an important role in machine learning, extracting important features and accelerating the learning process; existing methods often ignore the relationships among features, which Dirichlet-energy-based selection can capture.
Method: Identify important features by measuring their smoothness on the graph structure, and facilitate the learning of a new graph reflecting the inherent structure of the new feature subspace; employ Optimal Transport theory to address the non-differentiability of k-NN graph learning in neural networks.
Results: Extensive experiments on synthetic and real-world datasets validate the effectiveness of the model.

Feature selection (FS) plays an important role in machine learning, extracting important features and accelerating the learning process. In this paper, we propose a deep FS method that simultaneously conducts feature selection and differentiable $k$-NN graph learning based on the Dirichlet energy. The Dirichlet energy identifies important features by measuring their smoothness on the graph structure, and facilitates the learning of a new graph that reflects the inherent structure of the new feature subspace. We employ Optimal Transport theory to address the non-differentiability of learning $k$-NN graphs in neural networks, which theoretically makes our method applicable to other graph neural networks for dynamic graph learning. Furthermore, the proposed framework is interpretable, since all of its modules are designed algorithmically. We validate the effectiveness of our model with extensive experiments on both synthetic and real-world datasets.
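
A hedged sketch of the scoring idea (not the paper's joint, differentiable model: the graph here is fixed rather than learned, and the OT-based relaxation is omitted): rank each feature by its Dirichlet energy $x^\top L x$ on a $k$-NN graph, where smoother features score lower.

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    def dirichlet_energy_scores(X, k=10):
        """One Dirichlet energy per feature column of X (lower = smoother)."""
        A = kneighbors_graph(X, k, mode="connectivity", include_self=False)
        A = 0.5 * (A + A.T)                        # symmetrize the k-NN graph
        deg = np.asarray(A.sum(axis=1)).ravel()
        L = np.diag(deg) - A.toarray()             # combinatorial Laplacian
        return np.sum(X * (L @ X), axis=0)         # x_j^T L x_j for each j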

Optimizing over trained GNNs via symmetry breaking
Shiqiang Zhang Juan S Campos Christian Wolfgang Feldmann David Walz Frederik Sandfort Miriam Mathea Calvin Tsay Ruth Misener



Research question: How to optimize over trained graph neural network (GNN) models and solve the resulting constrained problems.
Motivation: GNNs excel on graph-structured data, but optimization over a trained GNN is hampered by symmetry issues caused by graph isomorphism.
Method: Propose two types of symmetry-breaking constraints, and construct a graph indexing algorithm that guarantees adding these constraints does not remove all symmetric solutions. For the case where the input graph is not fixed, i.e., every edge is a decision variable, develop two mixed-integer optimization formulations.
Results: An application to molecular design demonstrates the effectiveness of the proposed symmetry-breaking strategies and optimization formulations.

Optimization over trained machine learning models has applications including: verification, minimizing neural acquisition functions, and integrating a trained surrogate into a larger decision-making problem. This paper formulates and solves optimization problems constrained by trained graph neural networks (GNNs). To circumvent the symmetry issue caused by graph isomorphism, we propose two types of symmetry-breaking constraints: one indexing a node 0 and one indexing the remaining nodes by lexicographically ordering their neighbor sets. To guarantee that adding these constraints will not remove all symmetric solutions, we construct a graph indexing algorithm and prove that the resulting graph indexing satisfies the proposed symmetry-breaking constraints. For the classical GNN architectures considered in this paper, optimizing over a GNN with a fixed graph is equivalent to optimizing over a dense neural network. Thus, we study the case where the input graph is not fixed, implying that each edge is a decision variable, and develop two mixed-integer optimization formulations. To test our symmetry-breaking strategies and optimization formulations, we consider an application in molecular design.

Normalization-Equivariant Neural Networks with Application to Image Denoising
Sébastien Herbreteau Emmanuel Moebel Charles Kervrann



Research question: In many information processing systems, shifting or scaling the input should produce a corresponding change in the system response; deep neural networks do not guarantee this normalization-equivariance (scale + shift) property, which can be harmful in many applications.
Motivation: To address this issue, we propose a methodology for adapting existing neural networks so that normalization-equivariance holds by design.
Method: Our main claim is that not only ordinary convolutional layers but also all activation functions, including the element-wise ReLU (rectified linear unit) applied to pre-activated neurons, should be removed entirely from neural networks and replaced by better-conditioned alternatives. We introduce affine-constrained convolutions and channel-wise sort pooling layers as surrogates, and show that these two architectural modifications preserve normalization-equivariance without loss of performance.
Results: Experiments show that, besides being better conditioned, normalization-equivariant networks also generalize much better across noise levels.

In many information processing systems, it may be desirable to ensure that any change of the input, whether by shifting or scaling, results in a corresponding change in the system response. While deep neural networks are gradually replacing all traditional automatic processing methods, they surprisingly do not guarantee such normalization-equivariance (scale + shift) property, which can be detrimental in many applications. To address this issue, we propose a methodology for adapting existing neural networks so that normalization-equivariance holds by design. Our main claim is that not only ordinary convolutional layers, but also all activation functions, including the ReLU (rectified linear unit), which are applied element-wise to the pre-activated neurons, should be completely removed from neural networks and replaced by better conditioned alternatives. To this end, we introduce affine-constrained convolutions and channel-wise sort pooling layers as surrogates and show that these two architectural modifications do preserve normalization-equivariance without loss of performance. Experimental results in image denoising show that normalization-equivariant neural networks, in addition to their better conditioning, also provide much better generalization across noise levels.
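
The scale-and-shift equivariance of an affine-constrained convolution is easy to verify numerically (a minimal sketch under the stated constraint; the paper's layers and its sort-pooling activation are not reproduced here). If every output filter's weights sum to one and there is no bias, then conv(a*x + b) = a*conv(x) + b exactly, with valid padding:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AffineConv2d(nn.Module):
        def __init__(self, cin, cout, k):
            super().__init__()
            self.weight = nn.Parameter(0.01 * torch.randn(cout, cin, k, k))

        def forward(self, x):
            w = self.weight
            # Reproject so each output filter sums to exactly 1 (no bias).
            w = w - w.mean(dim=(1, 2, 3), keepdim=True) + 1.0 / w[0].numel()
            return F.conv2d(x, w, bias=None)       # valid padding

    conv = AffineConv2d(3, 3, 3)
    x = torch.randn(1, 3, 8, 8)
    a, b = 2.5, 0.7
    err = (conv(a * x + b) - (a * conv(x) + b)).abs().max()
    print(float(err))   # ~1e-6: equivariant to scaling and shifting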

Globally injective and bijective neural operators
Takashi Furuya Michael Anthony Puthawala Matti Lassas Maarten V. de Hoop



Research question: From an infinite-dimensional perspective, what follows when the operators between function spaces learned by networks are injective and surjective?
Motivation: Operator learning, in which networks learn operators between function spaces from an essentially infinite-dimensional perspective, has attracted great interest; this study examines the consequences when the learned operators are injective and surjective.
Method: First, give sharp conditions under which ReLU layers combined with linear neural operators are injective; then consider pointwise bijective activations and obtain sufficient conditions for layer injectivity; further, prove that the supplied injective neural operators are universal approximators and that their finite-rank neural network implementations remain injective.
Results: Finally, at a higher level of abstraction, give general conditions under which subnetworks, possibly with many layers, are injective and surjective, and provide an exact inversion from a 'linearization'. These results apply, under natural conditions, to subnetworks formed from the layers considered in this work. The authors see applications in Bayesian uncertainty quantification, where injectivity enables likelihood estimation, and in inverse problems, where surjectivity and injectivity correspond to the existence and uniqueness of solutions, respectively.

Recently there has been great interest in operator learning, where networks learn operators between function spaces from an essentially infinite-dimensional perspective. In this work we present results for when the operators learned by these networks are injective and surjective. As a warmup, we combine prior work in both the finite-dimensional ReLU and operator learning settings by giving sharp conditions under which ReLU layers with linear neural operators are injective. We then consider the case when the activation function is pointwise bijective and obtain sufficient conditions for the layer to be injective. We remark that this question, while trivial in the finite-rank setting, is subtler in the infinite-rank setting and is proven using tools from Fredholm theory. Next, we prove that our supplied injective neural operators are universal approximators and that their implementations, with finite-rank neural networks, are still injective. This ensures that injectivity is not 'lost' in the transcription from analytical operators to their finite-rank implementation with networks. Finally, we conclude with an increase in abstraction and consider general conditions under which subnetworks, which may have many layers, are injective and surjective, and provide an exact inversion from a 'linearization'. This section uses general arguments from Fredholm theory and Leray-Schauder degree theory for non-linear integral equations to analyze the mapping properties of neural operators in function spaces. These results apply to subnetworks formed from the layers considered in this work, under natural conditions. We believe that our work has applications in Bayesian uncertainty quantification, where injectivity enables likelihood estimation, and in inverse problems, where surjectivity and injectivity correspond to the existence and uniqueness of the solutions, respectively.

Interpretable Graph Networks Formulate Universal Algebra Conjectures
Francesco Giannini Stefano Fioravanti Oguzhan Keskin Alisia Maria Lupidi Lucie Charlotte Magister Pietro Lio Pietro Barbiero



Research question: Explore the use of AI in universal algebra (UA) to tackle mathematical problems that elude traditional approaches.
Motivation: Although AI is widely applied in many fields, its use in universal algebra, one of the foundations of modern mathematics, remains completely unexplored.
Method: Propose the first use of AI to investigate UA conjectures with equivalent equational and topological characterizations, building fully interpretable graph neural networks to analyze these properties along with a general algorithm that generates AI-ready datasets from UA's conjectures.
Results: Experiments show that the interpretable graph networks generalize strongly when predicting universal algebra's properties, generate simple explanations that empirically validate existing conjectures, and identify subgraphs suggesting the formulation of novel conjectures.

The rise of Artificial Intelligence (AI) recently empowered researchers to investigate hard mathematical problems which eluded traditional approaches for decades. Yet, the use of AI in Universal Algebra (UA)---one of the fields laying the foundations of modern mathematics---is still completely unexplored. This work proposes the first use of AI to investigate UA's conjectures with an equivalent equational and topological characterization. While topological representations would enable the analysis of such properties using graph neural networks, the limited transparency and brittle explainability of these models hinder their straightforward use to empirically validate existing conjectures or to formulate new ones. To bridge these gaps, we propose a general algorithm generating AI-ready datasets based on UA's conjectures, and introduce a novel neural layer to build fully interpretable graph networks. The results of our experiments demonstrate that interpretable graph networks: (i) enhance interpretability without sacrificing task accuracy, (ii) strongly generalize when predicting universal algebra's properties, (iii) generate simple explanations that empirically validate existing conjectures, and (iv) identify subgraphs suggesting the formulation of novel conjectures.

Geometric Algebra Transformer
Johann Brehmer Pim De Haan Sönke Behrends Taco Cohen



Research question: To date there is no general-purpose architecture that can be applied to a wide variety of geometric types while respecting their symmetries.
Motivation: Geometric data arise in physics, chemistry, robotics, computer vision, and many other fields, yet existing architectures cannot handle such a broad range of geometric types.
Method: Introduce the Geometric Algebra Transformer (GATr), a general-purpose architecture for geometric data. GATr represents inputs, outputs, and hidden states in the projective geometric (or Clifford) algebra, which offers an efficient 16-dimensional vector-space representation of common geometric objects and the operators acting on them.
Results: GATr performs strongly on problems ranging from n-body modeling to wall-shear-stress estimation on large arterial meshes to robotic motion planning, outperforming both non-geometric and equivariant baselines in error, data efficiency, and scalability.

Problems involving geometric data arise in physics, chemistry, robotics, computer vision, and many other fields. Such data can take numerous forms, for instance points, direction vectors, translations, or rotations, but to date there is no single architecture that can be applied to such a wide variety of geometric types while respecting their symmetries. In this paper we introduce the Geometric Algebra Transformer (GATr), a general-purpose architecture for geometric data. GATr represents inputs, outputs, and hidden states in the projective geometric (or Clifford) algebra, which offers an efficient 16-dimensional vector-space representation of common geometric objects as well as operators acting on them. GATr is equivariant with respect to E(3), the symmetry group of 3D Euclidean space. As a Transformer, GATr is versatile, efficient, and scalable. We demonstrate GATr in problems from n-body modeling to wall-shear-stress estimation on large arterial meshes to robotic motion planning. GATr consistently outperforms both non-geometric and equivariant baselines in terms of error, data efficiency, and scalability.

Is Distance Matrix Enough for Geometric Deep Learning?
Zian Li Xiyuan Wang Yinan Huang Muhan Zhang



Research question: Existing message-passing graph neural networks (GNNs) are limited on tasks involving 3D geometric graphs and cannot fully capture symmetric geometric structures.
Motivation: To address this limitation, the authors propose $k$-DisGNNs, which can effectively exploit the rich geometric information contained in the distance matrix.
Method: First, construct families of novel, symmetric geometric graphs that Vanilla DisGNN cannot distinguish even when considering all-pair distances, greatly expanding the existing counterexample families; then propose $k$-DisGNNs, which learn high-order geometric information from geometric graphs and unify several existing well-designed geometric models.
Results: Experiments show that $k$-DisGNNs achieve many new state-of-the-art results on MD17. The work also connects geometric deep learning (GDL) with traditional graph representation learning (GRL), showing that highly expressive GNN models originally designed for GRL can be applied to GDL with impressive performance.

Graph Neural Networks (GNNs) are often used for tasks involving the 3D geometry of a given graph, such as molecular dynamics simulation. While incorporating Euclidean distance into Message Passing Neural Networks (referred to as Vanilla DisGNN) is a straightforward way to learn the geometry, it has been demonstrated that Vanilla DisGNN is geometrically incomplete. In this work, we first construct families of novel and symmetric geometric graphs that Vanilla DisGNN cannot distinguish even when considering all-pair distances, which greatly expands the existing counterexample families. Our counterexamples show the inherent limitation of Vanilla DisGNN to capture symmetric geometric structures. We then propose $k$-DisGNNs, which can effectively exploit the rich geometry contained in the distance matrix. We demonstrate the high expressive power of $k$-DisGNNs from three perspectives: 1. They can learn high-order geometric information that cannot be captured by Vanilla DisGNN. 2. They can unify some existing well-designed geometric models. 3. They are universal function approximators from geometric graphs to scalars (when $k\geq 2$) and vectors (when $k\geq 3$). Most importantly, we establish a connection between geometric deep learning (GDL) and traditional graph representation learning (GRL), showing that those highly expressive GNN models originally designed for GRL can also be applied to GDL with impressive performance, and that existing complicated, equivariant models are not the only solution. Experiments verify our theory. Our $k$-DisGNNs achieve many new state-of-the-art results on MD17.

Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules
Zhiyuan Liu Yaorui Shi An Zhang Enzhi Zhang Kenji Kawaguchi Xiang Wang Tat-Seng Chua



Research question: Fill the gap in understanding the key components of masked graph modeling (MGM) for molecular self-supervised learning: the graph tokenizer, graph masking, and graph autoencoder.
Motivation: Existing MGM studies focus mainly on graph masking and the encoder, with limited understanding of the tokenizer and decoder. To bridge this gap, the authors first summarize popular molecule tokenizers and then examine their roles as MGM reconstruction targets.
Method: Propose SimSGT, a new MGM method featuring a Simple GNN-based Tokenizer (SGT) and an effective decoding strategy, and validate experimentally that it outperforms existing molecular self-supervised learning methods.
Results: Experiments show that a subgraph-level tokenizer and a sufficiently expressive decoder with remask decoding have a large impact on the encoder's representation learning.

Masked graph modeling excels in the self-supervised representation learning of molecular graphs. Scrutinizing previous studies, we can reveal a common scheme consisting of three key components: (1) graph tokenizer, which breaks a molecular graph into smaller fragments (i.e., subgraphs) and converts them into tokens; (2) graph masking, which corrupts the graph with masks; (3) graph autoencoder, which first applies an encoder on the masked graph to generate the representations, and then employs a decoder on the representations to recover the tokens of the original graph. However, the previous MGM studies focus extensively on graph masking and the encoder, while there is limited understanding of the tokenizer and decoder. To bridge the gap, we first summarize popular molecule tokenizers at the granularity of node, edge, motif, and Graph Neural Networks (GNNs), and then examine their roles as the MGM's reconstruction targets. Further, we explore the potential of adopting an expressive decoder in MGM. Our results show that a subgraph-level tokenizer and a sufficiently expressive decoder with remask decoding have a large impact on the encoder's representation learning. Finally, we propose a novel MGM method, SimSGT, featuring a Simple GNN-based Tokenizer (SGT) and an effective decoding strategy. We empirically validate that our method outperforms the existing molecule self-supervised learning methods. Our codes and checkpoints are available at https://github.com/syr-cn/SimSGT.

Lovász Principle for Unsupervised Graph Representation Learning
Ziheng Sun Chris Ding Jicong Fan



Research question: Graph-level representation learning: representing graphs as vectors that can be used directly in downstream tasks such as graph classification.
Motivation: The Lovász number of a graph upper-bounds its Shannon capacity and is strongly connected with various global characteristics of the graph; the handle vector used to compute it is therefore a promising graph representation, but applying it directly is difficult and problematic.
Method: Propose the Lovász principle, which uses neural networks to address these problems, and an enhanced Lovász principle that exploits subgraph Lovász numbers directly and efficiently.
Results: Experiments show the Lovász principles achieve competitive performance against the baselines on unsupervised and semi-supervised graph-level representation learning tasks.

This paper focuses on graph-level representation learning that aims to represent graphs as vectors that can be directly utilized in downstream tasks such as graph classification. We propose a novel graph-level representation learning principle called Lovász principle, which is motivated by the Lovász number in graph theory. The Lovász number of a graph is a real number that is an upper bound for graph Shannon capacity and is strongly connected with various global characteristics of the graph. Specifically, we show that the handle vector for computing the Lovász number is potentially a suitable choice for graph representation, as it captures a graph's global properties, though a direct application of the handle vector is difficult and problematic. We propose to use neural networks to address the problems and hence provide the Lovász principle. Moreover, we propose an enhanced Lovász principle that is able to exploit the subgraph Lovász numbers directly and efficiently. The experiments demonstrate that our Lovász principles achieve competitive performance compared to the baselines in unsupervised and semi-supervised graph-level representation learning tasks. The code of our Lovász principles is publicly available on GitHub.
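
For context, the Lovász number itself is computable by a standard semidefinite program (a sketch assuming cvxpy with an SDP-capable solver such as SCS; the paper does not solve this SDP inside a network, but motivates using the associated handle vectors, approximated neurally, as representations):

    import cvxpy as cp

    def lovasz_theta(n, edges):
        # theta(G) = max <J, X>  s.t.  X PSD, tr(X) = 1, X_ij = 0 on edges
        X = cp.Variable((n, n), PSD=True)
        constraints = [cp.trace(X) == 1] + [X[i, j] == 0 for i, j in edges]
        problem = cp.Problem(cp.Maximize(cp.sum(X)), constraints)
        problem.solve()
        return problem.value

    # 5-cycle: theta(C5) = sqrt(5) ≈ 2.236
    print(lovasz_theta(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))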

Understanding the Limitations of Deep Models for Molecular Property Prediction: Insights and Solutions
Jun Xia Lecheng Zhang Xiao Zhu Yue Liu Zhangyang Gao Bozhen Hu Cheng Tan Jiangbin Zheng Siyuan Li Stan Z. Li



Research question: Deep learning underperforms traditional models on Molecular Property Prediction (MPP), even though the task is crucial in the AI-driven Drug Discovery (AIDD) pipeline.
Motivation: Reveal where deep models fall short on MPP and identify the causes so that their performance can be improved.
Method: Benchmark 12 representative models (3 non-deep and 9 deep) on 15 molecular datasets and conduct in-depth empirical studies.
Results: A feature mapping method enables deep models to beat non-deep ones on most MoleculeNet datasets; its effectiveness is further verified on a cutting-edge COVID-19-related dataset and on activity cliff datasets.

Molecular Property Prediction (MPP) is a crucial task in the AI-driven Drug Discovery (AIDD) pipeline, which has recently gained considerable attention thanks to advancements in deep learning. However, recent research has revealed that deep models struggle to beat traditional non-deep ones on MPP. In this study, we benchmark 12 representative models (3 non-deep models and 9 deep models) on 15 molecule datasets. Through the most comprehensive study to date, we make the following key observations: (i) deep models are generally unable to outperform non-deep ones; (ii) the failure of deep models on MPP cannot be solely attributed to the small size of molecular datasets; (iii) in particular, some traditional models, including XGB and RF, that use molecular fingerprints as inputs tend to perform better than other competitors. Furthermore, we conduct extensive empirical investigations into the unique patterns of molecule data and the inductive biases of various models underlying these phenomena. These findings stimulate us to develop a simple-yet-effective feature mapping method for molecule data prior to feeding it into deep models. Empirically, deep models equipped with this mapping method can beat non-deep ones on most MoleculeNet datasets. Notably, the effectiveness is further corroborated by extensive experiments on a cutting-edge dataset related to COVID-19 and on activity cliff datasets.

Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
Haotao Wang Ziyu Jiang Yuning You Yan Han Gaowen Liu Jayanth Srinivasa Ramana Rao Kompella Zhangyang Wang



Research question: How to improve the generalization of graph neural networks (GNNs) to diverse training graph structures while avoiding exploding computational cost and trainability issues.
Motivation: Real-world graphs have diverse structures and nodes and edges of varying types. Enhancing GNN generalization typically requires augmenting training graph structures via graph augmentations and large-scale pre-training on a wider array of graphs; the key is to retain this diversity without prohibitive computation or training difficulty.
Method: Introduce the Mixture-of-Experts (MoE) concept into GNNs, proposing the Graph Mixture of Experts (GMoE) model, which lets individual nodes dynamically and adaptively select more general information-aggregation experts. The experts are trained to capture distinct subgroups of graph structures and to incorporate information at different hop sizes, with larger-hop experts specializing in longer-range information.
Results: Experiments on the OGB benchmark, covering graph, node, and link prediction, validate GMoE's effectiveness: it improves ROC-AUC by 1.81% on ogbg-molhiv and 1.40% on ogbg-molbbbp over non-MoE baselines. Code is available at https://github.com/VITA-Group/Graph-Mixture-of-Experts.

Graph neural networks (GNNs) have found extensive applications in learning from graph data. However, real-world graphs often possess diverse structures and comprise nodes and edges of varying types. To bolster the generalization capacity of GNNs, it has become customary to augment training graph structures through techniques like graph augmentations and large-scale pre-training on a wider array of graphs. Balancing this diversity while avoiding increased computational costs and the notorious trainability issues of GNNs is crucial. This study introduces the concept of Mixture-of-Experts (MoE) to GNNs, with the aim of augmenting their capacity to adapt to a diverse range of training graph structures, without incurring explosive computational overhead. The proposed Graph Mixture of Experts (GMoE) model empowers individual nodes in the graph to dynamically and adaptively select more general information aggregation experts. These experts are trained to capture distinct subgroups of graph structures and to incorporate information with varying hop sizes, where those with larger hop sizes specialize in gathering information over longer distances. The effectiveness of GMoE is validated through a series of experiments on a diverse set of tasks, including graph, node, and link prediction, using the OGB benchmark. Notably, it enhances ROC-AUC by $1.81\%$ in ogbg-molhiv and by $1.40\%$ in ogbg-molbbbp, when compared to the non-MoE baselines. Our code is publicly available at https://github.com/VITA-Group/Graph-Mixture-of-Experts.
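
A minimal sketch of the gating idea (illustrative, not the released implementation: dense soft gating over a dense normalized adjacency stands in for the paper's routing on real message-passing layers):

    import torch
    import torch.nn as nn

    class GraphMoELayer(nn.Module):
        def __init__(self, dim, hops=(1, 2, 3)):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in hops])
            self.gate = nn.Linear(dim, len(hops))

        def forward(self, x, adj):
            # adj: row-normalized dense adjacency (n, n); adj^h reaches h hops.
            mix = torch.softmax(self.gate(x), dim=-1)      # per-node gates
            power, out = adj, 0.0
            for e, expert in enumerate(self.experts):
                agg = power @ x                            # h-hop aggregation
                out = out + mix[:, e:e + 1] * torch.relu(expert(agg))
                power = power @ adj                        # next hop size
            return out

Each expert sees a different receptive field, and the gate lets every node weight the experts according to its own features.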

Analyzing Generalization of Neural Networks through Loss Path Kernels
Yilan Chen Wei Huang Hao Wang Charlotte Loh Akash Srivastava Lam M. Nguyen Tsui-Wei Weng



Research question: Study the generalization ability of neural networks trained with (stochastic) gradient flow.
Motivation: As deep neural networks see increasing use in real-world applications, ensuring their ability to adapt to new, unseen data is critical.
Method: Propose a new kernel, the loss path kernel, which establishes a new connection between the loss dynamics of gradient flow and general kernel machines. The kernel measures the similarity between two data points by evaluating the agreement between loss gradients along the path determined by the gradient flow; from this connection we derive a new generalization upper bound that applies to general neural network architectures.
Results: The new bound is tight and strongly correlated with the true generalization error. Applied to guide neural architecture search (NAS), it shows favorable performance compared with state-of-the-art NAS algorithms in numerical experiments.

Deep neural networks have been increasingly used in real-world applications, making it critical to ensure their ability to adapt to new, unseen data. In this paper, we study the generalization capability of neural networks trained with (stochastic) gradient flow. We establish a new connection between the loss dynamics of gradient flow and general kernel machines by proposing a new kernel, called loss path kernel. This kernel measures the similarity between two data points by evaluating the agreement between loss gradients along the path determined by the gradient flow. Based on this connection, we derive a new generalization upper bound that applies to general neural network architectures. This new bound is tight and strongly correlated with the true generalization error. We apply our results to guide the design of neural architecture search (NAS) and demonstrate favorable performance compared with state-of-the-art NAS algorithms through numerical experiments.
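
One natural way to write the kernel described above (the paper's precise definition and normalization may differ) is

$$K_T(x, x') \;=\; \int_0^T \big\langle \nabla_{w}\,\ell(w_t, x),\; \nabla_{w}\,\ell(w_t, x') \big\rangle \, dt,$$

where $w_t$ is the parameter trajectory of gradient flow up to time $T$: two inputs are similar when their loss gradients agree along the entire training path, not just at the final weights.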

Transformed Low-Rank Parameterization Can Help Robust Generalization for Tensor Neural Networks
Andong Wang Chao Li Mingyuan Bai Zhong Jin Guoxu Zhou Qibin Zhao



Research question: A theoretical analysis of the generalization of t-NNs (neural networks with t-product layers) in multi-channel learning.
Motivation: Despite the practical success of t-NNs, their generalization has not been studied theoretically.
Method: Fill this gap by deriving upper bounds on the generalization error of t-NNs in both standard and adversarial settings.
Results: t-NNs compressed with exact transformed low-rank parameterization achieve tighter adversarial generalization bounds than non-compressed models; moreover, the analysis shows that, under certain conditions, adversarial training with gradient flow implicitly regularizes highly over-parameterized ReLU-activated t-NNs towards a transformed low-rank parameterization.

Multi-channel learning has gained significant attention in recent applications, where neural networks with t-product layers (t-NNs) have shown promising performance through novel feature mapping in the transformed domain. However, despite the practical success of t-NNs, the theoretical analysis of their generalization remains unexplored. We address this gap by deriving upper bounds on the generalization error of t-NNs in both standard and adversarial settings. Notably, it reveals that t-NNs compressed with exact transformed low-rank parameterization can achieve tighter adversarial generalization bounds compared to non-compressed models. While exact transformed low-rank weights are rare in practice, the analysis demonstrates that through adversarial training with gradient flow, highly over-parameterized t-NNs with the ReLU activation can be implicitly regularized towards a transformed low-rank parameterization under certain conditions. Moreover, this paper establishes sharp adversarial generalization bounds for t-NNs with approximately transformed low-rank weights. Our analysis highlights the potential of transformed low-rank parameterization in enhancing the robust generalization of t-NNs, offering valuable insights for further research and development.
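
For readers unfamiliar with t-product layers, the underlying tensor-tensor product has a compact Fourier-domain form (the standard definition; this sketch is illustrative and not tied to the paper's code): frontal slices are multiplied in the FFT domain along the third, "tube" dimension.

    import numpy as np

    def t_product(A, B):
        """t-product of A (n1, n2, n3) and B (n2, n4, n3) -> (n1, n4, n3)."""
        Af = np.fft.fft(A, axis=2)
        Bf = np.fft.fft(B, axis=2)
        Cf = np.einsum("ijk,jlk->ilk", Af, Bf)   # slice-wise matrix products
        return np.fft.ifft(Cf, axis=2).real

A "transformed low-rank" weight is then one whose slices in the transformed domain all have low matrix rank, which is the structure the bounds above reward.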

Deep learning with kernels through RKHM and the Perron-Frobenius operator
Yuka Hashimoto Masahiro Ikeda Hachem Kadri



Research question: Propose deep RKHM, a deep learning framework for kernel methods.
Motivation: Deep RKHM combines reproducing kernel Hilbert spaces (RKHSs) generalized via $C^*$-algebras with the Perron-Frobenius operator, a linear operator associated with the composition of functions.
Method: Using $C^*$-algebras, derive a new Rademacher generalization bound and give a theoretical interpretation of benign overfitting in terms of Perron-Frobenius operators.
Results: The theory offers a new lens for designing and analyzing deep kernel methods, showing that $C^*$-algebras are a suitable tool for deep learning with kernels: they allow exploiting the product structure of operators and establish a clear connection with convolutional neural networks.

Reproducing kernel Hilbert $C^*$-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of $C^*$-algebra, and the Perron-Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron-Frobenius operators. By virtue of $C^*$-algebra, the dependency of the bound on output dimension is milder than existing bounds. We show that $C^*$-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.

A General Theory of Correct, Incorrect, and Extrinsic Equivariance
Dian Wang Xupeng Zhu Jung Yeon Park Mingxi Jia Guanang Su Robert Platt Robin Walters



Research question: Although equivariant machine learning has proven effective on many tasks, its success depends heavily on the assumption that the ground-truth function is symmetric over the entire domain, matching the symmetry built into the equivariant network.
Motivation: A missing piece in the equivariant learning literature is the analysis of equivariant networks when symmetry exists only partially in the domain.
Method: We present a general theory for this situation, proposing pointwise definitions of correct, incorrect, and extrinsic equivariance that let us continuously quantify the degree of each type of equivariance a function displays.
Results: We prove error lower bounds for invariant or equivariant networks in classification or regression settings with partially incorrect symmetry, and analyze the potentially harmful effects of extrinsic equivariance. Experiments validate these results in three different environments.

Although equivariant machine learning has proven effective at many tasks, success depends heavily on the assumption that the ground truth function is symmetric over the entire domain matching the symmetry in an equivariant neural network. A missing piece in the equivariant learning literature is the analysis of equivariant networks when symmetry exists only partially in the domain. In this work, we present a general theory for such a situation. We propose pointwise definitions of correct, incorrect, and extrinsic equivariance, which allow us to quantify continuously the degree of each type of equivariance a function displays. We then study the impact of various degrees of incorrect or extrinsic symmetry on model error. We prove error lower bounds for invariant or equivariant networks in classification or regression settings with partially incorrect symmetry. We also analyze the potentially harmful effects of extrinsic equivariance. Experiments validate these results in three different environments.

Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis
Mitchell Ostrow Adam Joseph Eisen Leo Kozachkov Ila R Fiete



Research question: How can we tell whether two neural networks use the same internal processes for a particular computation?
Motivation: The question matters for multiple subfields of neuroscience and machine learning, including neuroAI, mechanistic interpretability, and brain-machine interfaces.
Method: Introduce Dynamical Similarity Analysis (DSA), a new similarity metric that compares systems at the level of their dynamics. It has two components: first, using recent advances in data-driven dynamical systems theory, learn a high-dimensional linear system that accurately captures the core features of the original nonlinear dynamics; second, compare the embedded systems via a novel extension of Procrustes analysis that accounts for how vector fields change under orthogonal transformations.
Results: In four case studies, the method disentangles conjugate and non-conjugate recurrent neural networks (RNNs) where geometric methods fall short, and it can also distinguish learning rules in an unsupervised manner.

How can we tell whether two neural networks utilize the same internal processes for a particular computation? This question is pertinent for multiple subfields of neuroscience and machine learning, including neuroAI, mechanistic interpretability, and brain-machine interfaces. Standard approaches for comparing neural networks focus on the spatial geometry of latent states. Yet in recurrent networks, computations are implemented at the level of dynamics, and two networks performing the same computation with equivalent dynamics need not exhibit the same geometry. To bridge this gap, we introduce a novel similarity metric that compares two systems at the level of their dynamics, called Dynamical Similarity Analysis (DSA). Our method incorporates two components: Using recent advances in data-driven dynamical systems theory, we learn a high-dimensional linear system that accurately captures core features of the original nonlinear dynamics. Next, we compare different systems passed through this embedding using a novel extension of Procrustes Analysis that accounts for how vector fields change under orthogonal transformation. In four case studies, we demonstrate that our method disentangles conjugate and non-conjugate recurrent neural networks (RNNs), while geometric methods fall short. We additionally show that our method can distinguish learning rules in an unsupervised manner. Our method opens the door to comparative analyses of the essential temporal structure of computation in neural circuits.
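
A hedged two-step sketch of the pipeline (illustrative only: the linear fit is a plain DMD-style least squares, and the orthogonal alignment is a crude random search rather than the paper's Procrustes extension):

    import numpy as np

    def fit_linear_dynamics(X):
        """X: (T, d) trajectory; returns A with x_{t+1} ≈ A x_t."""
        A, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
        return A.T

    def dsa_distance(A1, A2, n_restarts=500, seed=0):
        """min over sampled orthogonal Q of ||A1 - Q A2 Q^T||_F."""
        rng = np.random.default_rng(seed)
        best = np.inf
        for _ in range(n_restarts):
            Q, _ = np.linalg.qr(rng.standard_normal(A1.shape))
            best = min(best, np.linalg.norm(A1 - Q @ A2 @ Q.T))
        return best

Comparing the fitted operators rather than the states themselves is what makes the metric sensitive to dynamics while remaining invariant to orthogonal changes of basis.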

Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation
Yingyi Chen Qinghua Tao Francesco Tonin Johan Suykens



Research question: How to understand and improve the self-attention mechanism in Transformers.
Motivation: Existing work applies methods for symmetric kernels to the asymmetric self-attention, leaving a nontrivial gap between the theoretical analysis and the numerical implementation.
Method: Represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), exploiting the low-rank property of self-attention commonly observed in deep layers.
Results: Experiments show state-of-the-art performance and improved efficiency when optimizing self-attention, further verifying the method's great potential.

Recently, a new line of works has emerged to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply the methods for symmetric kernels to the asymmetric self-attention, resulting in a nontrivial gap between the analytical understanding and numerical implementation. In this paper, we provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention normally observed in deep layers. Through asymmetric KSVD, i) a primal-dual representation of self-attention is formulated, where the optimization objective is cast to maximize the projection variances in the attention outputs; ii) a novel attention mechanism, i.e., Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; iii) with KKT conditions, we prove that the stationary solution to the KSVD optimization in Primal-Attention yields a zero-value objective. In this manner, KSVD optimization can be implemented by simply minimizing a regularization loss, so that low-rank property is promoted without extra decomposition. Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modelling and optimization.

On the spectral bias of two-layer linear networks
Aditya Vardhan Varre Maria-Luiza Vladarean Loucas Pillaud-Vivien Nicolas Flammarion



Research question: The behaviour of two-layer fully connected networks with linear activations trained with gradient flow on the square loss.
Motivation: The optimization process carries an implicit bias on the parameters that depends on the scale of the initialization.
Method: Give a variational characterization of the loss minimizers retrieved by gradient flow for a specific initialization shape, and exhibit a hidden mirror flow that tracks the dynamics of the singular values of the weight matrices and describes their time evolution.
Results: The findings reveal that, in the small-scale initialization regime, the hidden layer of the linear network is biased toward a low-rank structure; numerical experiments support these findings.

This paper studies the behaviour of two-layer fully connected networks with linear activations trained with gradient flow on the square loss. We show how the optimization process carries an implicit bias on the parameters that depends on the scale of its initialization. The main result of the paper is a variational characterization of the loss minimizers retrieved by the gradient flow for a specific initialization shape. This characterization reveals that, in the small scale initialization regime, the linear neural network's hidden layer is biased toward having a low-rank structure. To complement our results, we showcase a hidden mirror flow that tracks the dynamics of the singular values of the weights matrices and describe their time evolution. We support our findings with numerical experiments illustrating the phenomena.
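
The low-rank bias is easy to reproduce in a toy experiment (a sketch under simple assumptions: Gaussian inputs, a rank-one teacher with unit spectral norm, and plain gradient descent standing in for gradient flow):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, scale, lr, steps = 20, 200, 1e-3, 0.1, 4000
    X = rng.standard_normal((n, d))
    u = rng.standard_normal(d); u /= np.linalg.norm(u)
    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    Y = X @ np.outer(u, v).T                 # rank-1 teacher targets

    W1 = scale * rng.standard_normal((d, d))
    W2 = scale * rng.standard_normal((d, d))
    for _ in range(steps):
        R = X @ (W2 @ W1).T - Y              # residuals, shape (n, d)
        G = R.T @ X / n                      # gradient w.r.t. product W2 @ W1
        W2, W1 = W2 - lr * G @ W1.T, W1 - lr * W2.T @ G
    print(np.round(np.linalg.svd(W2 @ W1, compute_uv=False)[:4], 3))
    # Small init: one dominant singular value, the rest near zero.

Increasing `scale` toward 1 spreads the spectrum over many singular values, illustrating the initialization-scale dependence of the implicit bias.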

Extending the Design Space of Graph Neural Networks by Rethinking Folklore Weisfeiler-Lehman
Jiarui Feng Lecheng Kong Hao Liu Dacheng Tao Fuhai Li Muhan Zhang Yixin Chen



Research question: How to increase the expressive power of graph neural networks while addressing the high space complexity and rigid design space of existing approaches.
Motivation: Existing graph neural networks have limited expressive power, and current high-expressivity designs suffer from high space complexity and a rigid design space.
Method: Propose an extension of Folklore WL ($k$-WL/FWL) that takes any equivariant set as neighbors, greatly enlarging the design space while provably matching the expressiveness of existing models; further propose a practical, theoretically sound instance, Neighborhood$^2$-FWL (N$^2$-FWL), which requires only $O(n^2)$ space yet can encode many substructures.
Results: Experiments show that N$^2$-GNN achieves record-breaking results on ZINC-Subset and ZINC-Full, improving on the previous best results by 10.6% and 40.9%, respectively, and sets a new state of the art on the BREC dataset among all existing high-expressivity GNN methods.

Message passing neural networks (MPNNs) have emerged as the most popular framework of graph neural networks (GNNs) in recent years. However, their expressive power is limited by the 1-dimensional Weisfeiler-Lehman (1-WL) test. Some works are inspired by $k$-WL/FWL (Folklore WL) and design the corresponding neural versions. Despite the high expressive power, there are serious limitations in this line of research. In particular, (1) $k$-WL/FWL requires at least $O(n^k)$ space complexity, which is impractical for large graphs even when $k=3$; (2) The design space of $k$-WL/FWL is rigid, with the only adjustable hyper-parameter being $k$. To tackle the first limitation, we propose an extension, $(k, t)$-FWL. We theoretically prove that even if we fix the space complexity to $O(n^k)$ (for any $k \geq 2$) in $(k, t)$-FWL, we can construct an expressiveness hierarchy up to solving the graph isomorphism problem. To tackle the second problem, we propose $k$-FWL+, which considers any equivariant set as neighbors instead of all nodes, thereby greatly expanding the design space of $k$-FWL. Combining these two modifications results in a flexible and powerful framework $(k, t)$-FWL+. We demonstrate $(k, t)$-FWL+ can implement most existing models with matching expressiveness. We then introduce an instance of $(k,t)$-FWL+ called Neighborhood$^2$-FWL (N$^2$-FWL), which is practically and theoretically sound. We prove that N$^2$-FWL is no less powerful than 3-WL, and can encode many substructures while only requiring $O(n^2)$ space. Finally, we design its neural version named **N$^2$-GNN** and evaluate its performance on various tasks. N$^2$-GNN achieves record-breaking results on ZINC-Subset (**0.059**) and ZINC-Full (**0.013**), outperforming previous SOTA results by 10.6\% and 40.9\%, respectively. Moreover, N$^2$-GNN achieves new SOTA results on the BREC dataset (**71.8\%**) among all existing high-expressive GNN methods.

MAG-GNN: Reinforcement Learning Boosted Graph Neural Network
Lecheng Kong Jiarui Feng Hao Liu Dacheng Tao Yixin Chen Muhan Zhang



Research question: How to improve the structural encoding ability of graph neural networks (GNNs) while preserving their efficiency.
Motivation: Subgraph GNNs successfully improve expressivity using subgraph information, but they sacrifice efficiency by enumerating all possible subgraphs.
Method: Analyze the necessity of complete subgraph enumeration and show that a model can achieve comparable expressivity by considering only a small subset of the subgraphs; formulate finding the optimal subset as a combinatorial optimization problem and propose Magnetic Graph Neural Network (MAG-GNN), a reinforcement learning (RL) boosted GNN, to solve it.
Results: Extensive experiments on many datasets show that MAG-GNN is competitive with state-of-the-art methods and even outperforms many subgraph GNNs, while effectively reducing the running time of subgraph GNNs.

While Graph Neural Networks (GNNs) recently became powerful tools in graph learning tasks, considerable efforts have been spent on improving GNNs' structural encoding ability. A particular line of work proposed subgraph GNNs that use subgraph information to improve GNNs' expressivity and achieved great success. However, this expressivity comes at the cost of efficiency, as subgraph GNNs enumerate all possible subgraphs. In this paper, we analyze the necessity of complete subgraph enumeration and show that a model can achieve a comparable level of expressivity by considering a small subset of the subgraphs. We then formulate the identification of the optimal subset as a combinatorial optimization problem and propose Magnetic Graph Neural Network (MAG-GNN), a reinforcement learning (RL) boosted GNN, to solve the problem. Starting with a candidate subgraph set, MAG-GNN employs an RL agent to iteratively update the subgraphs to locate the most expressive set for prediction. This reduces the exponential complexity of subgraph enumeration to the constant complexity of a subgraph search algorithm while keeping good expressivity. We conduct extensive experiments on many datasets, showing that MAG-GNN achieves performance competitive with state-of-the-art methods and even outperforms many subgraph GNNs. We also demonstrate that MAG-GNN effectively reduces the running time of subgraph GNNs.

Simultaneous embedding of multiple attractor manifolds in a recurrent neural network using constrained gradient optimization
Haggai Agmon Yoram Burak



Research question: This study investigates how adjusting synaptic weights can reduce the detrimental interference that degrades the storage of continuous variables in working memory.
Motivation: Current research shows that embedding multiple continuous attractors in a single recurrent neural network induces detrimental interference that degrades memory quality.
Method: The study proposes adjusting the synaptic weights to improve the stability of states; the adjustments are derived from a loss function that quantifies the roughness of the energy landscape along each embedded attractor manifold.
Results: Minimizing this loss function significantly improves the stability of states without compromising capacity, effectively attenuating the interference affecting working-memory storage of continuous variables.

The storage of continuous variables in working memory is hypothesized to be sustained in the brain by the dynamics of recurrent neural networks (RNNs) whose steady states form continuous manifolds. In some cases, it is thought that the synaptic connectivity supports multiple attractor manifolds, each mapped to a different context or task. For example, in hippocampal area CA3, positions in distinct environments are represented by distinct sets of population activity patterns, each forming a continuum. It has been argued that the embedding of multiple continuous attractors in a single RNN inevitably causes detrimental interference: quenched noise in the synaptic connectivity disrupts the continuity of each attractor, replacing it by a discrete set of steady states that can be conceptualized as lying on local minima of an abstract energy landscape. Consequently, population activity patterns exhibit systematic drifts towards one of these discrete minima, thereby degrading the stored memory over time. Here we show that it is possible to dramatically attenuate these detrimental interference effects by adjusting the synaptic weights. Synaptic weight adjustments are derived from a loss function that quantifies the roughness of the energy landscape along each of the embedded attractor manifolds. By minimizing this loss function, the stability of states can be dramatically improved, without compromising the capacity.

Fine-grained Expressivity of Graph Neural Networks
Jan Böker Ron Levie Ningyuan Teresa Huang Soledad Villar Christopher Morris



Research question: This paper addresses a shortcoming in existing expressivity analyses of message-passing graph neural networks (MPNNs): the binary verdict of the $1$-WL test cannot quantify how similar two given graphs are.
Motivation: Existing analyses rely on combinatorial techniques such as the $1$-WL test, whose binary nature provides no information about the degree of similarity between two graphs.
Method: By extending $1$-WL and MPNNs to graphons, continuous limit objects of graphs, the paper derives a continuous $1$-WL test that precisely characterizes the expressive power of MPNNs on graphons.
Results: Experiments show that randomly initialized MPNNs, without training, perform competitively with trained MPNNs; evaluating architectures by how well they preserve graph distances further demonstrates the importance of the continuous $1$-WL test for understanding MPNN expressivity.

Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the $1$-dimensional Weisfeiler--Leman test ($1$-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both $1$-WL and MPNNs to graphons. Concretely, we show that the continuous variant of $1$-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the $1$-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the $1$-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous $1$-WL test in understanding MPNNs' expressivity.
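
For readers unfamiliar with the discrete baseline the paper generalizes, here is a minimal sketch of classic $1$-WL colour refinement, assuming simple adjacency-list inputs; the paper's continuous graphon variant is not reproduced here.

```python
from collections import Counter

def wl_colors(adj, rounds=None):
    """Run 1-WL colour refinement; adj maps node -> list of neighbours."""
    n = len(adj)
    colors = {v: 0 for v in adj}              # uniform initial colouring
    for _ in range(rounds or n):
        # New colour = own colour plus the multiset of neighbour colours.
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in adj}
        if new_colors == colors:              # stable colouring reached
            break
        colors = new_colors
    return Counter(colors.values())

def wl_distinguishable(adj1, adj2):
    """1-WL distinguishes two graphs iff their colour histograms differ."""
    return wl_colors(adj1) != wl_colors(adj2)

# Two 1-WL-indistinguishable graphs: a 6-cycle vs. two disjoint triangles.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_distinguishable(cycle6, triangles))  # False: 1-WL cannot separate them
```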

Adversarial Training for Graph Neural Networks: Pitfalls, Solutions, and New Directions
Lukas Gosch Simon Geisler Daniel Sturm Bertrand Charpentier Daniel Zügner Stephan Günnemann



Research question: Despite its success in the image domain, adversarial training has not yet become an effective defense for graph neural networks (GNNs) against graph-structure perturbations.
Motivation: To fix adversarial training, the work exposes and overcomes fundamental theoretical and practical limitations of the graph learning setting adopted in prior work, and finds that flexible GNNs based on learnable graph diffusion can adapt to adversarial perturbations while yielding naturally interpretable message-passing schemes.
Method: The paper introduces the first structure-perturbation attack that targets multiple nodes at once and can handle both global (graph-level) and local (node-level) constraints.
Results: With these contributions, adversarial training is shown to be a state-of-the-art defense against adversarial structure perturbations.

Despite its success in the image domain, adversarial training did not (yet) stand out as an effective defense for Graph Neural Networks (GNNs) against graph structure perturbations. In the pursuit of fixing adversarial training (1) we show and overcome fundamental theoretical as well as practical limitations of the adopted graph learning setting in prior work; (2) we reveal that flexible GNNs based on learnable graph diffusion are able to adjust to adversarial perturbations, while the learned message passing scheme is naturally interpretable; (3) we introduce the first attack for structure perturbations that, while targeting multiple nodes at once, is capable of handling global (graph-level) as well as local (node-level) constraints. Including these contributions, we demonstrate that adversarial training is a state-of-the-art defense against adversarial structure perturbations.

New Complexity-Theoretic Frontiers of Tractability for Neural Network Training
Cornelius Brand Robert Ganian Mathis Rocton



Research question: Despite the fundamental role of neural networks in modern machine learning research, our understanding of the computational complexity of optimally training them remains limited, even for the simplest activation functions.
Motivation: Although recent results establish ever-tighter lower bounds for the problem under linear and ReLU activation functions, little progress has been made in identifying new polynomial-time tractable network architectures.
Method: The paper obtains new algorithmic upper bounds for training linear- and ReLU-activated neural networks to optimality.
Results: These upper bounds push the tractability of these problems beyond the previous state of the art.

In spite of the fundamental role of neural networks in contemporary machine learning research, our understanding of the computational complexity of optimally training neural networks remains limited even when dealing with the simplest kinds of activation functions. Indeed, while there have been a number of very recent results that establish ever-tighter lower bounds for the problem under linear and ReLU activation functions, little progress has been made towards the identification of novel polynomial-time tractable network architectures. In this article we obtain novel algorithmic upper bounds for training linear- and ReLU-activated neural networks to optimality which push the boundaries of tractability for these problems beyond the previous state of the art.

Metis: Understanding and Enhancing In-Network Regular Expressions
Zhengxin Zhang Yucheng Huang Guanglin Duan Qing Li Dan Zhao Yong Jiang Lianbo Ma Xi Xiao Hengyang Xu



Research question: How can regular expressions (REs) and neural networks (NNs) be combined to improve the accuracy and efficiency of network intrusion detection?
Motivation: REs offer one-shot solutions for many networking tasks but rely entirely on expert knowledge and cannot exploit labeled data for better accuracy, while NNs learn from rich labeled data but perform poorly in cold-start scenarios and are too complex to deploy on network devices.
Method: The paper proposes the Metis framework, which converts REs into byte-level recurrent neural networks (BRNNs) that achieve high accuracy and throughput without training; when rich labeled data is available, BRNN performance can be further improved by training. A semi-supervised knowledge distillation method is also designed to transform the BRNNs into pooling soft random forests (PSRFs) deployable on network devices.
Results: Experiments show that Metis is more accurate than the original REs and other baselines, and achieves higher throughput when deployed on network devices.

Regular expressions (REs) offer one-shot solutions for many networking tasks, e.g., network intrusion detection. However, REs purely rely on expert knowledge and cannot utilize labeled data for better accuracy. Today, neural networks (NNs) have shown superior accuracy and flexibility, thanks to their ability to learn from rich labeled data. Nevertheless, NNs are often incompetent in cold-start scenarios and too complex for deployment on network devices. In this paper, we propose Metis, a general framework that converts REs to network-device-affordable models for superior accuracy and throughput by taking advantage of REs' expert knowledge and NNs' learning ability. In Metis, we convert REs to byte-level recurrent neural networks (BRNNs) without training. The BRNNs preserve expert knowledge from REs and offer adequate accuracy in cold-start scenarios. When rich labeled data is available, the performance of BRNNs can be improved by training. Furthermore, we design a semi-supervised knowledge distillation to transform the BRNNs into pooling soft random forests (PSRFs) that can be deployed on network devices. To the best of our knowledge, this is the first method to employ model inference as an alternative to RE matching in network scenarios. We collect network traffic data on our campus for three weeks and evaluate Metis on them. Experimental results show that Metis is more accurate than original REs and other baselines, achieving superior throughput when deployed on network devices.

On skip connections and normalisation layers in deep optimisation
Lachlan Ewen MacDonald Jack Valmadre Hemanth Saratchandran Simon Lucey



Research question: This paper develops a general theoretical framework for gradient optimisation of deep neural networks that covers widely used architectural choices, including batch normalisation, weight normalisation, and skip connections.
Motivation: Existing theoretical frameworks cannot fully explain the roles of normalisation layers and skip connections in training deep networks.
Method: The proposed framework determines the curvature and regularity of multilayer loss landscapes in terms of the properties of their constituent layers, elucidating how normalisation layers and skip connections globalise these properties.
Results: The framework both proves that a class of deep neural networks can be trained to global optima with gradient descent and identifies a novel mechanism by which skip connections accelerate training, verified with ResNets on MNIST, CIFAR10, CIFAR100, and ImageNet.

We introduce a general theoretical framework, designed for the study of gradient optimisation of deep neural networks, that encompasses ubiquitous architecture choices including batch normalisation, weight normalisation and skip connections. Our framework determines the curvature and regularity properties of multilayer loss landscapes in terms of their constituent layers, thereby elucidating the roles played by normalisation layers and skip connections in globalising these properties. We then demonstrate the utility of this framework in two respects. First, we give the only proof of which we are aware that a class of deep neural networks can be trained using gradient descent to global optima even when such optima only exist at infinity, as is the case for the cross-entropy cost. Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.

Self-supervised Graph Neural Networks via Low-Rank Decomposition
Liang Yang Runjie Shi Qiuliang Zhang Bingxin Niu Zhen Wang Xiaochun Cao Chuan Wang



Research question: This paper addresses two issues that arise when training GNNs self-supervisedly with propagation-based architectures designed for semi-supervised tasks: global parameters that lose local properties, and difficulty handling networks beyond homophily without label information.
Motivation: Propagation-based GNNs aggregate representations of nodes belonging to different classes and tend to lose discriminative information; if propagation within each ego-network stayed between nodes of the same class, the resulting representation matrix would be low-rank.
Method: The paper proposes Low-Rank Decomposition-based GNNs (LRD-GNN-Matrix), which apply low-rank decomposition to the attribute matrix so that the representations preserve local node properties, and LRD-GNN-Tensor, which constructs a node-attribute tensor from selected similar ego-networks and performs low-rank tensor decomposition to capture long-distance relationships between the original and selected ego-networks.
Results: Experiments demonstrate the superior performance and robustness of LRD-GNNs on various tasks.

Self-supervised learning is introduced to train graph neural networks (GNNs) by employing propagation-based GNNs designed for semi-supervised learning tasks. Unfortunately, this common choice tends to cause two serious issues. Firstly, global parameters cause the model to lack the ability to capture the local property. Secondly, it is difficult to handle networks beyond homophily without label information. This paper seeks to break through the common choice of employing propagation-based GNNs, which aggregate representations of nodes belonging to different classes and tend to lose discriminative information. If the propagation in each ego-network is just between the nodes from the same class, the obtained representation matrix should follow the low-rank characteristic. To meet this requirement, this paper proposes the Low-Rank Decomposition-based GNNs (LRD-GNN-Matrix) by employing Low-Rank Decomposition to the attribute matrix. Furthermore, to incorporate long-distance information, Low-Rank Tensor Decomposition-based GNN (LRD-GNN-Tensor) is proposed by constructing the node attribute tensor from selected similar ego-networks and performing Low-Rank Tensor Decomposition. The employed tensor nuclear norm facilitates the capture of the long-distance relationship between original and selected similar ego-networks. Extensive experiments demonstrate the superior performance and the robustness of LRD-GNNs.

A Recurrent Neural Circuit Mechanism of Temporal-scaling Equivariant Representation
Junfeng Zuo Xiao Liu Ying Nian Wu Si Wu Wenhao Zhang



Research question: This paper investigates the mathematical principle underlying time perception in recurrent circuits in the brain.
Motivation: Time perception is critical in daily life, and one of its key features is temporal scaling (TS): the ability to generate temporal sequences at different speeds. The mathematical principle behind temporal scaling in recurrent brain circuits remains unclear, so this study examines it from the Lie group perspective.
Method: The authors propose a canonical nonlinear recurrent circuit dynamics, modeled as a continuous attractor network, whose neuronal population responses embed a TS-equivariant temporal sequence. They further find that the TS group operators can be explicitly represented by a control input fed into the recurrent circuit: the input gain determines the temporal scaling factor (the group parameter), and the spatial offset between the control input and the network state yields the generator. The neuronal responses in the recurrent circuit are also consistent with experimental findings.
Results: The recurrent circuit can drive a feedforward circuit to generate complex temporal sequences at different time scales, even for negative time scaling ("time reversal"). The work analytically links the abstract temporal scaling group to concrete neural circuit dynamics for the first time.

Time perception is critical in our daily life. An important feature of time perception is temporal scaling (TS): the ability to generate temporal sequences (e.g., motor actions) at different speeds. However, the mathematical principle underlying temporal scaling in recurrent circuits in the brain remains largely unknown. To shed light on this, the present study investigates temporal scaling from the Lie group point of view. We propose a canonical nonlinear recurrent circuit dynamics, modeled as a continuous attractor network, whose neuronal population responses embed a temporal sequence that is TS equivariant. Furthermore, we found the TS group operators can be explicitly represented by a control input fed into the recurrent circuit, where the input gain determines the temporal scaling factor (group parameter), and the spatial offset between the control input and network state yields the generator. The neuronal responses in the recurrent circuit are also consistent with experimental findings. We illustrated that the recurrent circuit can drive a feedforward circuit to generate complex temporal sequences with different time scales, even in the case of negative time scaling (''time reversal''). Our work for the first time analytically links the abstract temporal scaling group and concrete neural circuit dynamics.

Efficient Uncertainty Quantification and Reduction for Over-Parameterized Neural Networks
Ziyi Huang Henry Lam Haofeng Zhang



Research question: How can the uncertainty of deep learning models be accurately quantified, and how can the influence of noise from both the data and the training procedure be reduced?
Motivation: Uncertainty in deep learning arises not only from the data but also from the training procedure, which challenges reliability assessment and model enhancement.
Method: Building on neural tangent kernel theory, the authors create statistically guaranteed schemes that characterize and remove the procedural uncertainty of over-parameterized neural networks using an auxiliary network.
Results: The approach effectively removes procedural uncertainty using only one trained auxiliary network, without repeatedly retraining the network; combined with suitable light-computation resampling methods, it constructs confidence intervals with asymptotically exact coverage.

Uncertainty quantification (UQ) is important for reliability assessment and enhancement of machine learning models. In deep learning, uncertainties arise not only from data, but also from the training procedure that often injects substantial noises and biases. These hinder the attainment of statistical guarantees and, moreover, impose computational challenges on UQ due to the need for repeated network retraining. Building upon the recent neural tangent kernel theory, we create statistically guaranteed schemes to principally *characterize*, and *remove*, the uncertainty of over-parameterized neural networks with very low computation effort. In particular, our approach, based on what we call a procedural-noise-correcting (PNC) predictor, removes the procedural uncertainty by using only *one* auxiliary network that is trained on a suitably labeled dataset, instead of many retrained networks employed in deep ensembles. Moreover, by combining our PNC predictor with suitable light-computation resampling methods, we build several approaches to construct asymptotically exact-coverage confidence intervals using as few as four trained networks without additional overheads.

Spectral Evolution and Invariance in Linear-width Neural Networks
Zhichao Wang Andrew William Engel Anand Sarwate Ioana Dumitriu Tony Chiang



Research question: This study investigates the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to the network width.
Motivation: In this high-dimensional regime, the weight spectra are found to be invariant under gradient descent with small constant learning rates; the authors provide theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels.
Method: These properties are established through empirical study and theoretical proof, and similar behavior is shown for stochastic gradient descent with small learning rates.
Results: With a large learning rate, an outlier aligned with the training-data structure emerges. After adaptive gradient training, when lower test error and feature learning appear, both weight and kernel matrices exhibit heavy-tailed behavior. Two-layer networks trained with different strategies display distinct spectral properties (invariant bulk, spikes, heavy-tailed distributions), which correlate with feature learning; similar phenomena appear when conventional networks are trained on real-world data. Monitoring the evolution of the spectra during training is therefore a key step toward understanding training dynamics and feature learning.

We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can lead to better generalization. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.

Tanimoto Random Features for Scalable Molecular Machine Learning
Austin Tripp Sergio Bacallado Sukriti Singh José Miguel Hernández-Lobato



Research question: How can random feature approximations be used to accelerate computation with the Tanimoto coefficient, and how can the kernel be extended to real-valued vectors?
Motivation: There is currently no random feature approximation for the Tanimoto kernel, which prevents it from scaling to large datasets.
Method: Two kinds of novel random features are proposed to accelerate the Tanimoto kernel, extending it to real-valued vectors in the process.
Results: Experiments show that these random features effectively approximate the Tanimoto coefficient on real-world datasets and are useful for molecular property prediction and optimization tasks.

The Tanimoto coefficient is commonly used to measure the similarity between molecules represented as discrete fingerprints, either as a distance metric or a positive definite kernel. While many kernel methods can be accelerated using random feature approximations, at present there is a lack of such approximations for the Tanimoto kernel. In this paper we propose two kinds of novel random features to allow this kernel to scale to large datasets, and in the process discover a novel extension of the kernel to real-valued vectors. We theoretically characterize these random features, and provide error bounds on the spectral norm of the Gram matrix. Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks. Future updates to this work will be available at http://arxiv.org/abs/2306.14809.
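
For reference, the exact kernel being approximated is simple to compute directly. Below is a minimal sketch using the standard form $T(x,y) = \langle x,y\rangle / (\lVert x\rVert^2 + \lVert y\rVert^2 - \langle x,y\rangle)$, which reduces to the familiar fingerprint similarity on binary vectors; the paper's random-feature constructions are not reproduced here.

```python
import numpy as np

def tanimoto_kernel(X, Y):
    """Exact Tanimoto (Jaccard) kernel between rows of X and rows of Y:
    T(x, y) = <x, y> / (||x||^2 + ||y||^2 - <x, y>)."""
    inner = X @ Y.T
    x_sq = (X ** 2).sum(axis=1)[:, None]
    y_sq = (Y ** 2).sum(axis=1)[None, :]
    return inner / (x_sq + y_sq - inner)

# For binary fingerprints this equals |x AND y| / |x OR y|.
rng = np.random.default_rng(0)
fps = (rng.random((5, 64)) < 0.2).astype(float)   # toy 64-bit fingerprints
K = tanimoto_kernel(fps, fps)
print(np.round(K, 3))   # diagonal is 1 for any non-zero fingerprint
```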

Complex-valued Neurons Can Learn More but Slower than Real-valued Neurons via Gradient Descent
Jin-Hui Wu Shao-Qun Zhang Yuan Jiang Zhi-Hua Zhou



Research question: This paper studies, theoretically, whether and to what extent complex-valued neural networks offer better representations and performance than real-valued ones on complicated tasks.
Motivation: Despite empirical evidence that complex-valued networks can outperform real-valued ones on some complicated tasks, it is theoretically unknown when and to what degree this occurs.
Method: The authors take a step in this direction by comparing the learnability of real-valued and complex-valued neurons under gradient descent. They show that a complex-valued neuron can efficiently learn functions expressed by any single real-valued or complex-valued neuron, with convergence rates $O(t^{-3})$ and $O(t^{-1})$ respectively, whereas a finite-width two-layer real-valued network cannot learn a single non-degenerate complex-valued neuron.
Results: They prove that a complex-valued neuron learns a real-valued neuron at rate $\Omega(t^{-3})$, exponentially slower than the $O(\mathrm{e}^{-ct})$ rate (for a constant $c$) at which a real-valued neuron learns a real-valued neuron. Simulation experiments verify and extend these results in more general settings.

Complex-valued neural networks potentially possess better representations and performance than real-valued counterparts when dealing with some complicated tasks such as acoustic analysis, radar image classification, etc. Despite empirical successes, it remains unknown theoretically when and to what extent complex-valued neural networks outperform real-valued ones. We take one step in this direction by comparing the learnability of real-valued neurons and complex-valued neurons via gradient descent. We show that a complex-valued neuron can efficiently learn functions expressed by any one real-valued neuron and any one complex-valued neuron with convergence rate $O(t^{-3})$ and $O(t^{-1})$ where $t$ is the iteration index of gradient descent, respectively, whereas a two-layer real-valued neural network with finite width cannot learn a single non-degenerate complex-valued neuron. We prove that a complex-valued neuron learns a real-valued neuron with rate $\Omega (t^{-3})$, exponentially slower than the $O(\mathrm{e}^{- c t})$ rate of learning one real-valued neuron using a real-valued neuron with a constant $c$. We further verify and extend these results via simulation experiments in more general settings.
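
As a toy illustration of the setting, not of the paper's constructions, the sketch below fits a single linear complex-valued neuron by gradient descent using the Wirtinger gradient of the squared error; the teacher-student setup and learning rate are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
w_true = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = X @ w_true                        # targets from a teacher complex neuron

w = np.zeros(d, dtype=complex)
lr = 1e-2
for step in range(2000):
    err = X @ w - y
    # Wirtinger gradient of L = (1/n)||Xw - y||^2 with respect to conj(w):
    grad = X.conj().T @ err / n
    w -= lr * grad
    if step % 500 == 0:
        print(step, np.abs(err).mean())

print("recovered teacher weights:", np.allclose(w, w_true, atol=1e-3))
```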

Deconstructing Data Reconstruction: Multiclass, Weight Decay and General Losses
Gon Buzaglo Niv Haim Gilad Yehudai Gal Vardi Yakir Oz Yaniv Nikankin michal Irani



Research question: Our understanding of the inner workings of neural networks is still in its infancy, and memorization of training data is an active research area.
Motivation: Haim et al. (2022) proposed a scheme to reconstruct training samples from multilayer perceptron binary classifiers, effectively demonstrating that a large portion of the training samples is encoded in the parameters of such networks.
Method: The authors extend those findings in several directions, including reconstruction from multiclass and convolutional neural networks; they derive a more general reconstruction scheme applicable to a wider range of loss functions, such as regression losses, and study the factors that make networks susceptible to such reconstruction schemes.
Results: Intriguingly, using weight decay during training increases reconstructability in both quantity and quality; the influence of the number of neurons relative to the number of training samples on reconstructability is also examined.

Memorization of training data is an active research area, yet our understanding of the inner workings of neural networks is still in its infancy. Recently, Haim et al. 2022 proposed a scheme to reconstruct training samples from multilayer perceptron binary classifiers, effectively demonstrating that a large portion of training samples are encoded in the parameters of such networks. In this work, we extend their findings in several directions, including reconstruction from multiclass and convolutional neural networks. We derive a more general reconstruction scheme which is applicable to a wider range of loss functions such as regression losses. Moreover, we study the various factors that contribute to networks' susceptibility to such reconstruction schemes. Intriguingly, we observe that using weight decay during training increases reconstructability both in terms of quantity and quality. Additionally, we examine the influence of the number of neurons relative to the number of training samples on the reconstructability. Code: https://github.com/gonbuzaglo/decoreco

Equivariant Spatio-Temporal Attentive Graph Networks to Simulate Physical Dynamics
Liming Wu Zhichao Hou Jirui Yuan Yu Rong Wenbing Huang



Research question: How can the dynamic behavior of physical systems be better represented and simulated?
Motivation: Existing equivariant graph neural network (GNN) methods capture the symmetries of physics, but by overlooking the non-Markovian nature induced by unobserved dynamics in the environment, they leave room for improvement in simulating system dynamics.
Method: The paper reformulates dynamics simulation as a spatio-temporal prediction task, using the trajectory over a past period to recover non-Markovian interactions, and proposes the Equivariant Spatio-Temporal Attentive Graph Network (ESTAG), an equivariant version of spatio-temporal GNNs. At its core, a novel Equivariant Discrete Fourier Transform (EDFT) extracts periodic patterns from history frames, an Equivariant Spatial Module (ESM) performs spatial message passing, and an Equivariant Temporal Module (ETM) with forward attention and equivariant pooling aggregates temporal information.
Results: Evaluations on three real datasets at the molecular, protein, and macroscopic levels verify the effectiveness of ESTAG over typical spatio-temporal GNNs and equivariant GNNs.

Learning to represent and simulate the dynamics of physical systems is a crucial yet challenging task. Existing equivariant Graph Neural Network (GNN) based methods have encapsulated the symmetry of physics, e.g., translations, rotations, etc., leading to better generalization ability. Nevertheless, their frame-to-frame formulation of the task overlooks the non-Markov property mainly incurred by unobserved dynamics in the environment. In this paper, we reformulate dynamics simulation as a spatio-temporal prediction task, by employing the trajectory in the past period to recover the Non-Markovian interactions. We propose Equivariant Spatio-Temporal Attentive Graph Networks (ESTAG), an equivariant version of spatio-temporal GNNs, to fulfil our purpose. At its core, we design a novel Equivariant Discrete Fourier Transform (EDFT) to extract periodic patterns from the history frames, and then construct an Equivariant Spatial Module (ESM) to accomplish spatial message passing, and an Equivariant Temporal Module (ETM) with the forward attention and equivariant pooling mechanisms to aggregate temporal messages. We evaluate our model on three real datasets corresponding to the molecular-, protein- and macro-level. Experimental results verify the effectiveness of ESTAG compared to typical spatio-temporal GNNs and equivariant GNNs.

TensorNet: Cartesian Tensor Representations for Efficient Learning of Molecular Potentials
Guillem Simeon Gianni De Fabritiis



Research question: How can efficient machine learning models be developed to represent molecular systems in support of scientific research?
Motivation: Efficient representation and processing of molecular systems is becoming increasingly important in scientific research.
Method: The paper introduces TensorNet, an innovative O(3)-equivariant message-passing neural network architecture based on Cartesian tensor representations, in which matrix product operations simplify feature mixing, and decomposing the tensors into rotation-group irreducible representations allows scalars, vectors, and tensors to be processed separately.
Results: Experiments show that, compared with higher-rank spherical tensor models, TensorNet achieves state-of-the-art performance with significantly fewer parameters; for small-molecule potential energies this holds even with a single interaction layer. The model also accurately predicts vector and tensor molecular quantities on top of potential energies and forces, greatly reducing computational cost. Overall, TensorNet's framework opens a new design space for state-of-the-art equivariant models.

The development of efficient machine learning models for molecular systems representation is becoming crucial in scientific research. We introduce TensorNet, an innovative O(3)-equivariant message-passing neural network architecture that leverages Cartesian tensor representations. By using Cartesian tensor atomic embeddings, feature mixing is simplified through matrix product operations. Furthermore, the cost-effective decomposition of these tensors into rotation group irreducible representations allows for the separate processing of scalars, vectors, and tensors when necessary. Compared to higher-rank spherical tensor models, TensorNet demonstrates state-of-the-art performance with significantly fewer parameters. For small molecule potential energies, this can be achieved even with a single interaction layer. As a result of all these properties, the model's computational cost is substantially decreased. Moreover, the accurate prediction of vector and tensor molecular quantities on top of potential energies and forces is possible. In summary, TensorNet's framework opens up a new space for the design of state-of-the-art equivariant models.

Truly Scale-Equivariant Deep Nets with Fourier Layers
Md Ashiqur Rahman Raymond A. Yeh



Research question: In computer vision, models must adapt to changes in image resolution to effectively perform tasks such as image segmentation; this property is known as scale-equivariance.
Motivation: Although recent work has made progress on scale-equivariant convolutional networks, e.g., via weight sharing and kernel resizing, these networks are not truly scale-equivariant in practice: they formulate the down-scaling operation in the continuous domain without accounting for anti-aliasing.
Method: To address this, down-scaling is formulated directly in the discrete domain with anti-aliasing taken into account, and a novel architecture based on Fourier layers achieves truly scale-equivariant deep networks, i.e., absolute zero equivariance error.
Results: Following prior work, the model is tested on the MNIST-scale and STL-10 datasets, achieving competitive classification performance while maintaining zero equivariance error.

In computer vision, models must be able to adapt to changes in image resolution to effectively carry out tasks such as image segmentation; this is known as scale-equivariance. Recent works have made progress in developing scale-equivariant convolutional neural networks, e.g., through weight-sharing and kernel resizing. However, these networks are not truly scale-equivariant in practice. Specifically, they do not consider anti-aliasing as they formulate the down-scaling operation in the continuous domain. To address this shortcoming, we directly formulate down-scaling in the discrete domain with consideration of anti-aliasing. We then propose a novel architecture based on Fourier layers to achieve truly scale-equivariant deep nets, i.e., absolute zero equivariance-error. Following prior works, we test this model on MNIST-scale and STL-10 datasets. Our proposed model achieves competitive classification performance while maintaining zero equivariance-error.
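
A minimal sketch of the central idea, down-scaling performed exactly in the discrete Fourier domain so that anti-aliasing holds by construction; the paper's full Fourier-layer architecture is not reproduced here.

```python
import numpy as np

def fourier_downscale(x, m):
    """Band-limited downscaling of a length-n signal to length m (m <= n, both even).
    Keeping only the m central (lowest) frequencies is an ideal anti-aliasing filter."""
    n = len(x)
    spectrum = np.fft.fftshift(np.fft.fft(x))
    lo = (n - m) // 2
    kept = spectrum[lo:lo + m]                 # crop to the m lowest frequencies
    return np.fft.ifft(np.fft.ifftshift(kept)) * (m / n)

# For a band-limited signal, the result equals direct sampling at the lower rate.
n, m = 64, 32
t = np.arange(n) / n
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 5 * t)
x_small = fourier_downscale(x, m).real
t_small = np.arange(m) / m
x_direct = np.sin(2 * np.pi * 3 * t_small) + 0.5 * np.cos(2 * np.pi * 5 * t_small)
print(np.max(np.abs(x_small - x_direct)))      # ~1e-15: exact, no aliasing
```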

Neural Functional Transformers
Allan Zhou Kaien Yang Yiding Jiang Kaylee Burns Winnie Xu Samuel Sokota J Zico Kolter Chelsea Finn



Research question: How can expressive and efficient neural functional architectures be built that handle high-dimensional weight-space objects?
Motivation: Weight spaces carry permutation symmetries that attention is well suited to exploit, motivating a new set of permutation-equivariant weight-space layers.
Method: The attention mechanism is used to define novel permutation-equivariant weight-space layers, which are composed into deep equivariant models called neural functional Transformers (NFTs).
Results: When processing the weights of feedforward MLPs and CNNs, NFTs match or exceed prior weight-space methods. NFTs are also used to develop Inr2Array, a new method for computing permutation-invariant latent representations from the weights of implicit neural representations (INRs), improving INR classification accuracy by up to +17%.

The recent success of neural networks as implicit representations of data has driven growing interest in neural functionals: models that can process other neural networks as input by operating directly over their weight spaces. Nevertheless, constructing expressive and efficient neural functional architectures that can handle high-dimensional weight-space objects remains challenging. This paper uses the attention mechanism to define a novel set of permutation equivariant weight-space layers and composes them into deep equivariant models called neural functional Transformers (NFTs). NFTs respect weight-space permutation symmetries while incorporating the advantages of attention, which have exhibited remarkable success across multiple domains. In experiments processing the weights of feedforward MLPs and CNNs, we find that NFTs match or exceed the performance of prior weight-space methods. We also leverage NFTs to develop Inr2Array, a novel method for computing permutation invariant latent representations from the weights of implicit neural representations (INRs). Our proposed method improves INR classification accuracy by up to $+17\%$ over existing methods. We provide an implementation of our layers at https://github.com/AllanYangZhou/nfn.

Permutation Equivariant Neural Functionals
Allan Zhou Kaien Yang Kaylee Burns Adriano Cardace Yiding Jiang Samuel Sokota J Zico Kolter Chelsea Finn



Research question: Designing neural networks that can process the weights or gradients of other neural networks, termed neural functional networks (NFNs).
Motivation: Despite many potential applications, including learned optimization, processing implicit neural representations, network editing, and policy evaluation, there are few unifying principles for designing NFNs.
Method: The design is approached through the lens of symmetry, focusing in particular on the permutation symmetries that arise in the weights of deep feedforward networks because hidden-layer neurons have no inherent order; a framework is proposed for building permutation-equivariant neural functionals whose architectures encode these symmetries as an inductive bias.
Results: Experiments show that permutation-equivariant neural functionals are effective on a diverse set of tasks requiring the processing of MLP and CNN weights, such as predicting classifier generalization, producing "winning ticket" sparsity masks for initializations, and classifying or editing implicit neural representations (INRs).

This work studies the design of neural networks that can process the weights or gradients of other neural networks, which we refer to as *neural functional networks* (NFNs). Despite a wide range of potential applications, including learned optimization, processing implicit neural representations, network editing, and policy evaluation, there are few unifying principles for designing effective architectures that process the weights of other networks. We approach the design of neural functionals through the lens of symmetry, in particular by focusing on the permutation symmetries that arise in the weights of deep feedforward networks because hidden layer neurons have no inherent order. We introduce a framework for building *permutation equivariant* neural functionals, whose architectures encode these symmetries as an inductive bias. The key building blocks of this framework are *NF-Layers* (neural functional layers) that we constrain to be permutation equivariant through an appropriate parameter sharing scheme. In our experiments, we find that permutation equivariant neural functionals are effective on a diverse set of tasks that require processing the weights of MLPs and CNNs, such as predicting classifier generalization, producing "winning ticket" sparsity masks for initializations, and classifying or editing implicit neural representations (INRs). In addition, we provide code for our models and experiments at https://github.com/AllanYangZhou/nfn.

Coneheads: Hierarchy Aware Attention
Albert Tseng Tao Yu Toni J.B. Liu Christopher De Sa



Research question: Existing attention networks such as transformers rely on the dot-product attention operator to compute the similarity of two points, but this approach cannot explicitly model complex structural properties of real-world datasets, such as hierarchies between data points.
Motivation: To remedy this, the paper proposes cone attention, a replacement for dot-product attention based on hyperbolic entailment cones.
Method: Cone attention associates two points by the depth of their lowest common ancestor in the hierarchy defined by hyperbolic cones, which intuitively measures how much the two points diverge and yields a "hierarchy-aware" similarity score.
Results: Tested across a wide variety of models and tasks, cone attention improves task-level performance over dot-product attention and other baselines, and matches dot-product attention with significantly fewer parameters, indicating that it is an effective way to capture hierarchical relationships in attention.

Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a $\textit{hierarchy aware}$ similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.
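
To make the drop-in claim concrete, here is a sketch of single-head attention with a swappable similarity operator; `dot_similarity` recovers standard attention, while the paper's hyperbolic entailment-cone score, which is not reproduced here, would slot in the same way.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_similarity(Q, K):
    """Standard scaled dot-product similarity."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def attention(Q, K, V, similarity=dot_similarity):
    """Single-head attention with a pluggable similarity operator.
    Cone attention would replace `similarity` with a hierarchy-aware
    score derived from hyperbolic entailment cones (not implemented here)."""
    return softmax(similarity(Q, K)) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 16))
print(attention(Q, K, V).shape)   # (8, 16)
```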

Diffusion Probabilistic Models for Structured Node Classification
Hyosoon Jang Seonghyun Park Sangwoo Mo Sungsoo Ahn



Research question: This paper studies structured node classification on graphs, in particular how to account for dependencies between labels when predicting the unknown labels of a partially labeled graph.
Motivation: Existing methods do not fully exploit the information in the known labels when predicting the unknown ones; to address this, the authors propose a novel framework leveraging diffusion probabilistic models for structured node classification (DPM-SNC).
Method: At the heart of the framework is DPM-SNC's capability to learn a joint distribution over the labels with an expressive reverse diffusion process and to make predictions conditioned on the known labels via manifold-constrained sampling. Since DPMs lack training algorithms for partially labeled data, the authors design a novel training algorithm that maximizes a new variational lower bound.
Results: Theoretical analysis shows that DPMs improve node classification by enhancing the expressive power of GNN-based models, and extensive experiments verify the superiority of DPM-SNC in diverse scenarios, including the transductive setting on partially labeled graphs, the inductive setting, and unlabeled graphs.

This paper studies structured node classification on graphs, where the predictions should consider dependencies between the node labels. In particular, we focus on solving the problem for partially labeled graphs where it is essential to incorporate the information in the known labels for predicting the unknown labels. To address this issue, we propose a novel framework leveraging the diffusion probabilistic model for structured node classification (DPM-SNC). At the heart of our framework is the extraordinary capability of DPM-SNC to (a) learn a joint distribution over the labels with an expressive reverse diffusion process and (b) make predictions conditioned on the known labels utilizing manifold-constrained sampling. Since the DPMs lack training algorithms for partially labeled data, we design a novel training algorithm to apply DPMs, maximizing a new variational lower bound. We also theoretically analyze how DPMs benefit node classification by enhancing the expressive power of GNNs based on proposing AGG-WL, which is strictly more powerful than the classic 1-WL test. We extensively verify the superiority of our DPM-SNC in diverse scenarios, which include not only the transductive setting on partially labeled graphs but also the inductive setting and unlabeled graphs.

Riemannian Residual Neural Networks
Isay Katsman Eric Ming Chen Sidhanth Holalkere Anna Asch Aaron Lou Ser-Nam Lim Christopher De Sa



Research question: How can residual neural networks (ResNets) be extended to general Riemannian manifolds, to learn better over hierarchically structured graphs or manifold-valued data from the natural sciences?
Motivation: Recent geometric deep learning methods introduce networks that operate on data lying on Riemannian manifolds, but extending Euclidean networks is difficult and has been done for only a select few manifolds.
Method: The authors examine the residual neural network (ResNet) and show how to extend the construction to general Riemannian manifolds in a geometrically principled manner.
Results: Compared with existing manifold neural networks designed for hyperbolic space and the manifold of symmetric positive definite matrices, the Riemannian ResNets outperform both kinds of networks in relevant test metrics and training dynamics.

Recent methods in geometric deep learning have introduced various neural networks to operate over data that lie on Riemannian manifolds. Such networks are often necessary to learn well over graphs with a hierarchical structure or to learn over manifold-valued data encountered in the natural sciences. These networks are often inspired by and directly generalize standard Euclidean neural networks. However, extending Euclidean networks is difficult and has only been done for a select few manifolds. In this work, we examine the residual neural network (ResNet) and show how to extend this construction to general Riemannian manifolds in a geometrically principled manner. Originally introduced to help solve the vanishing gradient problem, ResNets have become ubiquitous in machine learning due to their beneficial learning properties, excellent empirical results, and easy-to-incorporate nature when building varied neural networks. We find that our Riemannian ResNets mirror these desirable properties: when compared to existing manifold neural networks designed to learn over hyperbolic space and the manifold of symmetric positive definite matrices, we outperform both kinds of networks in terms of relevant testing metrics and training dynamics.

Latent Field Discovery in Interacting Dynamical Systems with Neural Fields
Miltiadis Kofinas Erik J Bekkers Naveen Shankar Nagaraja Efstratios Gavves



Research question: This study addresses the problem that existing models of system dynamics neglect the underlying field effects governing those dynamics.
Motivation: Current models typically assume that systems evolve in a vacuum, ignoring the influence of underlying fields on the dynamics.
Method: The study proposes a new model that discovers and infers these underlying field effects by learning latent force fields. Local object interactions are modeled with equivariant graph networks, which are combined with neural fields in a novel graph network that integrates field forces.
Results: Experiments show that the underlying fields can be accurately discovered in charged-particle settings, traffic scenes, and gravitational n-body problems, and effectively used to learn the systems and forecast their trajectories.

Systems of interacting objects often evolve under the influence of underlying field effects that govern their dynamics, yet previous works have abstracted away from such effects, and assume that systems evolve in a vacuum. In this work, we focus on discovering these fields, and infer them from the observed dynamics alone, without directly observing them. We theorize the presence of latent force fields, and propose neural fields to learn them. Since the observed dynamics constitute the net effect of local object interactions and global field effects, recently popularized equivariant networks are inapplicable, as they fail to capture global information. To address this, we propose to disentangle local object interactions --which are SE(3) equivariant and depend on relative states-- from external global field effects --which depend on absolute states. We model the interactions with equivariant graph networks, and combine them with neural fields in a novel graph network that integrates field forces. Our experiments show that we can accurately discover the underlying fields in charged particles settings, traffic scenes, and gravitational n-body problems, and effectively use them to learn the system and forecast future trajectories.

Two Sides of The Same Coin: Bridging Deep Equilibrium Models and Neural ODEs via Homotopy Continuation
Shutong Ding Tianyu Cui Jingya Wang Ye Shi



Research question: This paper establishes a connection between deep equilibrium models (DEQs) and neural ordinary differential equations (Neural ODEs) and, building on it, proposes a new implicit model, HomoODE.
Motivation: Although both DEQs and Neural ODEs are successful implicit models, they are derived from different mathematical formulations; inspired by homotopy continuation, the authors connect the two and show that they are actually two sides of the same coin.
Method: Based on this connection, the authors propose HomoODE, which inherits high accuracy from DEQs and stability from Neural ODEs. Unlike DEQs, which explicitly solve an equilibrium-point-finding problem with Newton's method in the forward pass, HomoODE solves it implicitly with a modified Neural ODE via homotopy continuation; an acceleration method with a shared learnable initial point is also developed.
Results: Comprehensive experiments on several image classification tasks show that HomoODE surpasses existing implicit models in both accuracy and memory consumption.

Deep Equilibrium Models (DEQs) and Neural Ordinary Differential Equations (Neural ODEs) are two branches of implicit models that have achieved remarkable success owing to their superior performance and low memory consumption. While both are implicit models, DEQs and Neural ODEs are derived from different mathematical formulations. Inspired by homotopy continuation, we establish a connection between these two models and illustrate that they are actually two sides of the same coin. Homotopy continuation is a classical method of solving nonlinear equations based on a corresponding ODE. Given this connection, we propose a new implicit model called HomoODE that inherits the property of high accuracy from DEQs and the property of stability from Neural ODEs. Unlike DEQs, which explicitly solve an equilibrium-point-finding problem via Newton's methods in the forward pass, HomoODE solves the equilibrium-point-finding problem implicitly using a modified Neural ODE via homotopy continuation. Further, we develop an acceleration method for HomoODE with a shared learnable initial point. It is worth noting that our model also provides a better understanding of why Augmented Neural ODEs work, provided the augmented part is regarded as the equilibrium point to be found. Comprehensive experiments with several image classification tasks demonstrate that HomoODE surpasses existing implicit models in terms of both accuracy and memory consumption.
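
A scalar sketch of the classical homotopy-continuation principle underlying the connection (this illustrates Davidenko's method, not HomoODE itself): the root of $H(x,t) = (1-t)(x - x_0) + t\,f(x)$ is tracked from the trivial problem at $t=0$ to $f(x)=0$ at $t=1$ by integrating the induced ODE.

```python
import numpy as np

def homotopy_root(f, fprime, x0, steps=1000):
    """Track the root of H(x,t) = (1-t)(x - x0) + t f(x) from t=0 to t=1.
    Differentiating H(x(t), t) = 0 gives the Davidenko ODE
        dx/dt = -(f(x) - (x - x0)) / ((1 - t) + t f'(x)),
    integrated here with forward Euler."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x += dt * -(f(x) - (x - x0)) / ((1 - t) + t * fprime(x))
    return x

# Solve x^3 - 2 = 0, starting from the trivial equation x - 1 = 0.
f = lambda x: x ** 3 - 2.0
fp = lambda x: 3.0 * x ** 2
print(homotopy_root(f, fp, x0=1.0), 2 ** (1 / 3))   # both ~1.2599
```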

Enhancing Adaptive History Reserving by Spiking Convolutional Block Attention Module in Recurrent Neural Networks
Qi Xu Yuyuan Gao Jiangrong Shen Yaxin Li Xuming Ran Huajin Tang Gang Pan



Research question: This paper develops a spiking neural network model that combines spatial and temporal features to process the spatio-temporal pattern data collected by dynamic vision sensors.
Motivation: Although convolutional spiking neural networks perform remarkably well on spatio-temporal pattern data, they ignore the temporal features associated with sequential time points.
Method: The paper proposes a recurrent spiking neural network (RSNN) model embedded with an advanced spiking convolutional block attention module (SCBAM) to combine the spatial and temporal features of spatio-temporal patterns.
Results: Experiments show that the proposed RSNN-SCBAM model makes more effective use of history information in the spatial and temporal dimensions and achieves higher accuracy than other models.

Spiking neural networks (SNNs) serve as one type of efficient model to process spatio-temporal patterns in time series, such as the Address-Event Representation data collected from Dynamic Vision Sensor (DVS). Although convolutional SNNs have achieved remarkable performance on these AER datasets, benefiting from the predominant spatial feature extraction ability of convolutional structure, they ignore temporal features related to sequential time points. In this paper, we develop a recurrent spiking neural network (RSNN) model embedded with an advanced spiking convolutional block attention module (SCBAM) component to combine both spatial and temporal features of spatio-temporal patterns. It invokes the history information in spatial and temporal channels adaptively through SCBAM, which brings the advantages of efficient memory calling and history redundancy elimination. The performance of our model was evaluated on the DVS128-Gesture dataset and other time-series datasets. The experimental results show that the proposed RSNN-SCBAM model makes better use of the history information in spatial and temporal dimensions with less memory space, and achieves higher accuracy compared to other models.

On Class Distributions Induced by Nearest Neighbor Graphs for Node Classification of Tabular Data
Federico Errica



Research question: This paper seeks to understand, when a graph structure is missing, the transformation of classical machine learning problems into node classification via nearest neighbor graphs, and the effectiveness of graph representation learning methods on them.
Motivation: Such artificial structures often reflect the homophily assumption, believed to be a key factor in the performance of deep graph networks; in light of recent results demystifying these beliefs, the author introduces a theoretical framework for understanding the benefits of nearest neighbor graphs.
Method: The author formally analyzes the Cross-Class Neighborhood Similarity (CCNS), used to empirically evaluate the usefulness of structures, in the context of nearest neighbor graphs, and also studies the class separability induced by deep graph networks on k-NN graphs.
Results: Quantitative experiments show that, under full supervision, a k-NN graph offers no benefit over a structure-agnostic baseline. Qualitative analyses suggest that the framework estimates the CCNS well and hint that k-NN graphs are never useful for such classification tasks under full supervision, advocating the study of alternative graph construction techniques combined with deep graph networks.

Researchers have used nearest neighbor graphs to transform classical machine learning problems on tabular data into node classification tasks to solve with graph representation learning methods. Such artificial structures often reflect the homophily assumption, believed to be a key factor in the performances of deep graph networks. In light of recent results demystifying these beliefs, we introduce a theoretical framework to understand the benefits of Nearest Neighbor (NN) graphs when a graph structure is missing. We formally analyze the Cross-Class Neighborhood Similarity (CCNS), used to empirically evaluate the usefulness of structures, in the context of nearest neighbor graphs. Moreover, we study the class separability induced by deep graph networks on a k-NN graph. Motivated by the theory, our quantitative experiments demonstrate that, under full supervision, employing a k-NN graph offers no benefits compared to a structure-agnostic baseline. Qualitative analyses suggest that our framework is good at estimating the CCNS and hint at k-NN graphs never being useful for such classification tasks under full supervision, thus advocating for the study of alternative graph construction techniques in combination with deep graph networks.
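
Below is a sketch of the empirical quantities involved, under one common formalisation of CCNS, assumed here since the paper's exact conventions may differ: neighbourhood label histograms are compared across classes on a k-NN graph built from tabular features.

```python
import numpy as np

def knn_graph(X, k):
    """Indices of the k nearest neighbours of each row of X (self excluded)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def ccns(X, y, k, n_classes):
    """Empirical Cross-Class Neighborhood Similarity on a k-NN graph:
    CCNS(c, c') = mean cosine similarity between the neighbourhood label
    histograms of nodes in class c and class c' (self-pairs kept for brevity)."""
    nbrs = knn_graph(X, k)
    hist = np.zeros((len(X), n_classes))
    for v in range(len(X)):
        for u in nbrs[v]:
            hist[v, y[u]] += 1
    hist /= np.linalg.norm(hist, axis=1, keepdims=True)   # unit-normalise rows
    S = hist @ hist.T                                     # pairwise cosine sims
    out = np.zeros((n_classes, n_classes))
    for c in range(n_classes):
        for cp in range(n_classes):
            out[c, cp] = S[np.ix_(y == c, y == cp)].mean()
    return out

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
print(np.round(ccns(X, y, k=5, n_classes=2), 3))  # high diagonal => homophilous graph
```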

Neural (Tangent Kernel) Collapse
Mariia Seleznova Dana Weitzner Raja Giryes Gitta Kutyniok Hung-Hsu Chou



Research question: This paper seeks to understand the evolution of deep neural networks (DNNs) during training via the Neural Tangent Kernel (NTK), and the emergence of symmetry and structure in the last-layer features of well-trained classifiers, known as Neural Collapse (NC).
Motivation: The NTK and NC are two important but so far separate lenses on deep learning; the work bridges them under the natural assumption that the empirical NTK develops a block structure aligned with the class labels, i.e., samples within the same class are more strongly correlated than samples from different classes.
Method: Under this assumption, the authors derive the dynamics of DNNs trained with mean squared error (MSE) loss, break them into interpretable phases, and identify an invariant that captures the essence of the dynamics, using it to prove the emergence of NC in DNNs with block-structured NTK.
Results: Large-scale numerical experiments on three common DNN architectures and three benchmark datasets support the theory.

This work bridges two important concepts: the Neural Tangent Kernel (NTK), which captures the evolution of deep neural networks (DNNs) during training, and the Neural Collapse (NC) phenomenon, which refers to the emergence of symmetry and structure in the last-layer features of well-trained classification DNNs. We adopt the natural assumption that the empirical NTK develops a block structure aligned with the class labels, i.e., samples within the same class have stronger correlations than samples from different classes. Under this assumption, we derive the dynamics of DNNs trained with mean squared (MSE) loss and break them into interpretable phases. Moreover, we identify an invariant that captures the essence of the dynamics, and use it to prove the emergence of NC in DNNs with block-structured NTK. We provide large-scale numerical experiments on three common DNN architectures and three benchmark datasets to support our theory.
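
A small sketch of the block-structure assumption being tested: compute the empirical NTK of a toy scalar-output MLP from per-sample gradients and compare within-class against between-class kernel values (the architecture and data here are illustrative assumptions, far from the paper's large-scale experiments).

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))

# Two well-separated "classes" of inputs.
X = torch.cat([torch.randn(8, 10) + 2.0, torch.randn(8, 10) - 2.0])
y = torch.tensor([0] * 8 + [1] * 8)

def grad_vector(x):
    """Flattened gradient of the scalar output w.r.t. all parameters."""
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

G = torch.stack([grad_vector(x) for x in X])
K = G @ G.T                                   # empirical NTK Gram matrix

same = (y[:, None] == y[None, :])
print("within-class mean :", K[same].mean().item())
print("between-class mean:", K[~same].mean().item())
```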

Spectral Invariant Learning for Dynamic Graphs under Distribution Shifts
Zeyang Zhang Xin Wang Ziwei Zhang Zhou Qin Weigao Wen Hui Xue' Haoyang Li Wenwu Zhu



Research question: Dynamic graph neural networks (DyGNNs) struggle to handle the distribution shifts inherent in dynamic graphs.
Motivation: Existing DyGNNs for out-of-distribution settings focus only on the time domain and cannot handle distribution shifts in the spectral domain; this paper is the first to study distribution shifts on dynamic graphs in the spectral domain.
Method: The paper proposes Spectral Invariant Learning for Dynamic Graphs under Distribution Shifts (SILD), which handles distribution shifts on dynamic graphs by capturing and exploiting invariant and variant spectral patterns. Specifically, a DyGNN with Fourier transform obtains ego-graph trajectory spectrums, transforming mixed dynamic-graph patterns into separate frequency components; a disentangled spectrum mask filters the graph dynamics from the various frequency components to discover the invariant and variant spectral patterns; and invariant spectral filtering encourages the model to rely on invariant patterns for generalization under distribution shifts.
Results: Experiments on synthetic and real-world dynamic graph datasets demonstrate superior performance on both node classification and link prediction, particularly under distribution shifts.

Dynamic graph neural networks (DyGNNs) currently struggle with handling distribution shifts that are inherent in dynamic graphs. Existing work on DyGNNs with out-of-distribution settings only focuses on the time domain, failing to handle cases involving distribution shifts in the spectral domain. In this paper, we discover that there exist cases with distribution shifts unobservable in the time domain while observable in the spectral domain, and propose to study distribution shifts on dynamic graphs in the spectral domain for the first time. However, this investigation poses two key challenges: i) it is non-trivial to capture different graph patterns that are driven by various frequency components entangled in the spectral domain; and ii) it remains unclear how to handle distribution shifts with the discovered spectral patterns. To address these challenges, we propose Spectral Invariant Learning for Dynamic Graphs under Distribution Shifts (SILD), which can handle distribution shifts on dynamic graphs by capturing and utilizing invariant and variant spectral patterns. Specifically, we first design a DyGNN with Fourier transform to obtain the ego-graph trajectory spectrums, allowing the mixed dynamic graph patterns to be transformed into separate frequency components. We then develop a disentangled spectrum mask to filter graph dynamics from various frequency components and discover the invariant and variant spectral patterns. Finally, we propose invariant spectral filtering, which encourages the model to rely on invariant patterns for generalization under distribution shifts. Experimental results on synthetic and real-world dynamic graph datasets demonstrate the superiority of our method for both node classification and link prediction tasks under distribution shifts.

Principled Weight Initialisation for Input-Convex Neural Networks
Pieter-Jan Hoedt Günter Klambauer



Research question: This paper addresses the problem of weight initialisation for input-convex neural networks (ICNNs) and its impact on learning speed and generalisation.
Motivation: Because of the peculiarities of ICNNs, conventional initialisation strategies, which implicitly assume centred weights, are not effective for them; a new initialisation matching their properties is needed to improve learning efficiency and model performance.
Method: By studying signal propagation through layers with non-negative weights, the authors derive a principled weight initialisation for ICNNs, and also find that correctly initialised ICNNs can be trained without skip connections.
Results: Experiments show that the new initialisation effectively accelerates learning in ICNNs and leads to better generalisation; the method also performs well on a real-world drug discovery task.

Input-Convex Neural Networks (ICNNs) are networks that guarantee convexity in their input-output mapping. These networks have been successfully applied for energy-based modelling, optimal transport problems and learning invariances. The convexity of ICNNs is achieved by using non-decreasing convex activation functions and non-negative weights. Because of these peculiarities, previous initialisation strategies, which implicitly assume centred weights, are not effective for ICNNs. By studying signal propagation through layers with non-negative weights, we are able to derive a principled weight initialisation for ICNNs. Concretely, we generalise signal propagation theory by removing the assumption that weights are sampled from a centred distribution. In a set of experiments, we demonstrate that our principled initialisation effectively accelerates learning in ICNNs and leads to better generalisation. Moreover, we find that, in contrast to common belief, ICNNs can be trained without skip-connections when initialised correctly. Finally, we apply ICNNs to a real-world drug discovery task and show that they allow for more effective molecular latent space exploration.
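
For context, here is a minimal sketch of an ICNN forward pass showing where the non-negativity constraint enters; the initialisation below is a naive positive-weight heuristic for illustration only, not the principled scheme derived in the paper.

```python
import torch
import torch.nn.functional as F

class ICNN(torch.nn.Module):
    """Input-convex net: z_{k+1} = act(Wz_k z_k + Wx_k x + b_k), with the
    Wz_k constrained non-negative and act convex, non-decreasing (softplus)."""
    def __init__(self, d_in, d_hidden, depth):
        super().__init__()
        self.Wx = torch.nn.ModuleList(
            torch.nn.Linear(d_in, d_hidden) for _ in range(depth))
        # Non-negativity enforced by softplus on raw parameters at forward time.
        # Naive init heuristic (NOT the paper's scheme): small positive weights.
        self.Wz_raw = torch.nn.ParameterList(
            torch.nn.Parameter(0.1 * torch.randn(d_hidden, d_hidden) - 3.0)
            for _ in range(depth - 1))
        self.wz_out_raw = torch.nn.Parameter(0.1 * torch.randn(d_hidden) - 3.0)
        self.wx_out = torch.nn.Linear(d_in, 1)   # signed direct path in x is fine

    def forward(self, x):
        z = F.softplus(self.Wx[0](x))
        for Wx_k, Wz_raw in zip(self.Wx[1:], self.Wz_raw):
            z = F.softplus(z @ F.softplus(Wz_raw).T + Wx_k(x))  # convex in x
        return (z @ F.softplus(self.wz_out_raw)).unsqueeze(1) + self.wx_out(x)

net = ICNN(d_in=4, d_hidden=32, depth=3)
print(net(torch.randn(5, 4)).shape)   # torch.Size([5, 1])
```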

Spontaneous symmetry breaking in generative diffusion models
Gabriel Raya Luca Ambrogioni



Research question: This paper explores the dynamics of generative diffusion models and their effect on high-dimensional data generation.
Motivation: Generative diffusion models excel at generating high-dimensional data, but their internal dynamics are not yet fully understood.
Method: Theoretical and empirical analysis shows that the dynamics exhibit a spontaneous symmetry breaking that divides generation into two phases: linear steady-state dynamics around a central fixed point, and attractor dynamics directed towards the data manifold. The two phases are separated by a change in the stability of the central fixed point, and the resulting window of instability accounts for the diversity of the generated samples.
Results: Based on this finding, the authors propose a Gaussian late initialization scheme that significantly improves model performance, achieving up to 3x FID improvement on fast samplers while increasing sample diversity. The work offers a new perspective on the generative dynamics of diffusion models, with the potential to yield higher-performing and less biased fast samplers.

Generative diffusion models have recently emerged as a leading approach for generating high-dimensional data. In this paper, we show that the dynamics of these models exhibit a spontaneous symmetry breaking that divides the generative dynamics into two distinct phases: 1) A linear steady-state dynamics around a central fixed-point and 2) an attractor dynamics directed towards the data manifold. These two "phases'' are separated by the change in stability of the central fixed-point, with the resulting window of instability being responsible for the diversity of the generated samples. Using both theoretical and empirical evidence, we show that an accurate simulation of the early dynamics does not significantly contribute to the final generation, since early fluctuations are reverted to the central fixed point. To leverage this insight, we propose a Gaussian late initialization scheme, which significantly improves model performance, achieving up to 3x FID improvements on fast samplers, while also increasing sample diversity (e.g., racial composition of generated CelebA images). Our work offers a new way to understand the generative dynamics of diffusion models that has the potential to bring about higher performance and less biased fast-samplers.

State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory
Shida Wang Beichen Xue



Research question: This paper investigates how introducing layer-wise nonlinear activations into state-space models improves their capacity to learn complex sequence patterns.
Motivation: Although state-space models are popular in sequence modelling thanks to their simple and efficient network structures, the absence of nonlinear activations along the temporal direction limits the model's capacity.
Method: The paper proves that stacking state-space models with layer-wise nonlinear activations suffices to approximate any continuous sequence-to-sequence relationship.
Results: Experiments show that adding layer-wise nonlinear activations enhances the model's ability to learn complex sequence patterns, while both theory and experiments indicate that state-space models do not fundamentally resolve the issue of exponentially decaying memory.

State-space models have gained popularity in sequence modelling due to their simple and efficient network structures. However, the absence of nonlinear activation along the temporal direction limits the model's capacity. In this paper, we prove that stacking state-space models with layer-wise nonlinear activation is sufficient to approximate any continuous sequence-to-sequence relationship. Our findings demonstrate that the addition of layer-wise nonlinear activation enhances the model's capacity to learn complex sequence patterns. Meanwhile, it can be seen both theoretically and empirically that the state-space models do not fundamentally resolve the issue of exponential decaying memory. Theoretical results are justified by numerical verifications.
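
A minimal sketch of the construction discussed: each layer is a purely linear state-space recurrence, and the only nonlinearity is applied between layers (dimensions and parameter scales are arbitrary choices for the demo).

```python
import numpy as np

def ssm_layer(x, A, B, C):
    """Linear SSM scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    Note: no nonlinearity along the temporal direction."""
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
d, T, layers = 8, 50, 3
x = rng.standard_normal((T, d))
for _ in range(layers):
    A = 0.9 * np.eye(d)                      # stable recurrence => decaying memory
    B, C = rng.standard_normal((2, d, d)) * 0.3
    x = np.tanh(ssm_layer(x, A, B, C))       # layer-wise nonlinearity between layers
print(x.shape)   # (50, 8)
```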

On Sparse Modern Hopfield Model
Jerry Yao-Chieh Hu Donglin Yang Dennis Wu Chenwei Xu Bo-Yu Chen Han Liu



Research question: This paper introduces the sparse modern Hopfield model as a sparse extension of the modern Hopfield model.
Motivation: Like its dense counterpart, the sparse modern Hopfield model is equipped with memory-retrieval dynamics whose one-step approximation corresponds to a sparse attention mechanism.
Method: A closed-form sparse Hopfield energy is derived using the convex conjugate of the sparse entropic regularizer; building on this, the sparse memory-retrieval dynamics are derived from the sparse energy function, and their one-step approximation is shown to be equivalent to sparse-structured attention.
Results: Experiments show that the sparse Hopfield model outperforms its dense counterpart in many situations, and the sparse modern Hopfield model is proven to retain the robust theoretical properties of its dense counterpart, including rapid fixed-point convergence and exponential memory capacity.

We introduce the sparse modern Hopfield model as a sparse extension of the modern Hopfield model. Like its dense counterpart, the sparse modern Hopfield model equips a memory-retrieval dynamics whose one-step approximation corresponds to the sparse attention mechanism. Theoretically, our key contribution is a principled derivation of a closed-form sparse Hopfield energy using the convex conjugate of the sparse entropic regularizer. Building upon this, we derive the sparse memory retrieval dynamics from the sparse energy function and show its one-step approximation is equivalent to the sparse-structured attention. Importantly, we provide a sparsity-dependent memory retrieval error bound which is provably tighter than its dense analog. The conditions for the benefits of sparsity to arise are therefore identified and discussed. In addition, we show that the sparse modern Hopfield model maintains the robust theoretical properties of its dense counterpart, including rapid fixed point convergence and exponential memory capacity. Empirically, we use both synthetic and real-world datasets to demonstrate that the sparse Hopfield model outperforms its dense counterpart in many situations.
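
A toy sketch of the contrast described: the dense modern Hopfield retrieval update is $\xi \leftarrow X^\top \mathrm{softmax}(\beta X\xi)$, and the sparse variant swaps softmax for sparsemax, which zeroes out low-scoring memories; the energy functions and error bounds from the paper are not reproduced here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (Martins & Astudillo, 2016); exactly zero on low-scoring entries."""
    zs = np.sort(z)[::-1]
    css = np.cumsum(zs)
    k = np.arange(1, len(z) + 1)
    k_max = k[1 + k * zs > css].max()
    tau = (css[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)

def retrieve(X, query, beta=4.0, probs=softmax, steps=3):
    """Modern Hopfield retrieval: xi <- X^T probs(beta * X @ xi)."""
    xi = query.copy()
    for _ in range(steps):
        xi = X.T @ probs(beta * X @ xi)
    return xi

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 16))              # 10 stored patterns (rows)
query = X[3] + 0.4 * rng.standard_normal(16)   # noisy cue for pattern 3
for name, p in [("dense ", softmax), ("sparse", sparsemax)]:
    xi = retrieve(X, query, probs=p)
    print(name, np.argmax(X @ xi), np.linalg.norm(xi - X[3]).round(4))
```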

Training Your Image Restoration Network Better with Random Weight Network as Optimization Function
Man Zhou Naishan Zheng Yuan Xu Chun-Le Guo Chongyi Li



Research question: This paper investigates new optimization functions to improve image restoration performance.
Motivation: Despite the remarkable progress of deep learning in image restoration, optimization functions such as L_1 and L_2 remain the de facto choices in practice.
Method: The key insight is that "a random weight network can act as a constraint for training better image restoration networks"; drawing on functional theory, the authors show that suitable alternative random weight networks should be represented in the form of a strict mathematical manifold.
Results: Prototypes satisfying this requirement are explored: Taylor's unfolding network, invertible neural network, central difference convolution, and zero-order filtering, studied along four axes: 1) random weight strategies; 2) network architectures; 3) network depths; and 4) combinations of random weight networks. Two random-weight variants are devised: weights initialized only once for the entire training procedure, and weights re-initialized in each training epoch. The approach integrates directly into existing networks without extra training or testing cost, and extensive experiments on image denoising, low-light image enhancement, and guided image super-resolution demonstrate consistent performance gains.

The blooming progress made in deep learning-based image restoration has been largely attributed to the availability of high-quality, large-scale datasets and advanced network structures. However, optimization functions such as L_1 and L_2 remain the de facto choices. In this study, we propose to investigate new optimization functions to improve image restoration performance. Our key insight is that ``a random weight network can act as a constraint for training better image restoration networks''. However, not all random weight networks are suitable as constraints. We draw inspiration from functional theory and show that alternative random weight networks should be represented in the form of a strict mathematical manifold. We explore the potential of our random weight network prototypes that satisfy this requirement: Taylor's unfolding network, invertible neural network, central difference convolution, and zero-order filtering. We investigate these prototypes from four aspects: 1) random weight strategies, 2) network architectures, 3) network depths, and 4) combinations of random weight networks. Furthermore, we devise the random weight in two variants: the weights are randomly initialized only once during the entire training procedure, and the weights are randomly initialized in each training epoch. Our approach can be directly integrated into existing networks without incurring additional training and testing computational costs. We perform extensive experiments across multiple image restoration tasks, including image denoising, low-light image enhancement, and guided image super-resolution to demonstrate the consistent performance gains achieved by our method. Upon acceptance of this paper, we will release the code.
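
Here is a sketch of the general idea under simple assumptions, a frozen, randomly initialised convnet supplying an auxiliary feature-space term on top of an L1 loss; the paper's manifold-structured prototypes (Taylor's unfolding network, invertible networks, etc.) are not reproduced.

```python
import torch

torch.manual_seed(0)
feat = torch.nn.Sequential(                      # random weight network, frozen
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1))
for p in feat.parameters():
    p.requires_grad_(False)                      # never trained: acts as a constraint

def restoration_loss(pred, target, weight=0.1):
    """L1 in pixel space plus L1 in the random network's feature space."""
    pixel = torch.nn.functional.l1_loss(pred, target)
    feature = torch.nn.functional.l1_loss(feat(pred), feat(target))
    return pixel + weight * feature

pred = torch.rand(2, 3, 32, 32, requires_grad=True)  # stand-in for a restorer output
target = torch.rand(2, 3, 32, 32)
loss = restoration_loss(pred, target)
loss.backward()                                  # gradients flow to the restorer only
print(loss.item(), pred.grad.abs().mean().item())
```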

Brain Dissection: fMRI-trained Networks Reveal Spatial Selectivity in the Processing of Natural Images
Gabriel Herbert Sarch Michael J. Tarr Katerina Fragkiadaki Leila Wehbe



Research question: How can the alignment between deep neural networks and cortical responses be used to explain the function of higher visual areas more interpretably?
Motivation: Current model features provide accurate quantitative explanations but have been criticized as uninterpretable "black boxes"; this paper trains networks to directly predict brain responses to natural scene images and applies the explainable-AI technique "network dissection" to improve interpretability.
Method: Networks are first trained from scratch on a large-scale natural scenes dataset (Allen et al., 2021) to directly predict brain responses; "network dissection" is then used to identify and localize the most important image features for individual units of the trained networks; finally, this approach is adapted into a hypothesis-neutral model for exploring the tuning properties of specific visual regions, termed "brain dissection".
Results: Brain regions show distinct preferences when interpreting visual scenes: ventro-lateral areas favor closer and curvier features, medial and parietal areas prefer more varied and flatter 3D elements, and the parietal region uniquely prefers spatial relations. Scene-selective regions vary, with the retrosplenial complex preferring distant and outdoor features, while the occipital and parahippocampal place areas favor proximity, verticality and, for the OPA, indoor elements. These findings show the potential of explainable AI to uncover spatial feature selectivity across visual cortex, supporting a deeper, finer-grained understanding of how the human visual cortex functions when viewing natural scenes.

The alignment between deep neural network (DNN) features and cortical responses currently provides the most accurate quantitative explanation for higher visual areas. At the same time, these model features have been critiqued as uninterpretable explanations, trading one black box (the human brain) for another (a neural network). In this paper, we train networks to directly predict, from scratch, brain responses to images from a large-scale dataset of natural scenes (Allen et al., 2021). We then use "network dissection" (Bau et al., 2017), an explainable AI technique used for enhancing neural network interpretability by identifying and localizing the most significant features in images for individual units of a trained network, and which has been used to study category selectivity in the human brain (Khosla & Wehbe, 2022). We adapt this approach to create a hypothesis-neutral model that is then used to explore the tuning properties of specific visual regions beyond category selectivity, which we call "brain dissection". We use brain dissection to examine a range of ecologically important, intermediate properties, including depth, surface normals, curvature, and object relations across sub-regions of the parietal, lateral, and ventral visual streams, and scene-selective regions. Our findings reveal distinct preferences in brain regions for interpreting visual scenes, with ventro-lateral areas favoring closer and curvier features, medial and parietal areas opting for more varied and flatter 3D elements, and the parietal region uniquely preferring spatial relations. Scene-selective regions exhibit varied preferences, as the retrosplenial complex prefers distant and outdoor features, while the occipital and parahippocampal place areas favor proximity, verticality, and in the case of the OPA, indoor elements. Such findings show the potential of using explainable AI to uncover spatial feature selectivity across the visual cortex, contributing to a deeper, more fine-grained understanding of the functional characteristics of human visual cortex when viewing natural scenes.

On the Implicit Bias of Linear Equivariant Steerable Networks
Ziyu Chen Wei Zhu



Research question: What is the implicit bias of gradient flow on linear equivariant steerable networks in group-invariant binary classification?
Motivation: Understanding which classifier gradient flow selects clarifies why steerable networks generalize; we find that the parameterized predictor converges in direction to the unique group-invariant classifier with a maximum margin defined by the input group action.
Method: Under a unitary assumption on the input representation, establish the equivalence between steerable networks and data augmentation.
Results: Demonstrate improved margin and generalization bounds for steerable networks over their non-invariant counterparts.

We study the implicit bias of gradient flow on linear equivariant steerable networks in group-invariant binary classification. Our findings reveal that the parameterized predictor converges in direction to the unique group-invariant classifier with a maximum margin defined by the input group action. Under a unitary assumption on the input representation, we establish the equivalence between steerable networks and data augmentation. Furthermore, we demonstrate the improved margin and generalization bound of steerable networks over their non-invariant counterparts.

Adversarial Examples Exist in Two-Layer ReLU Networks for Low Dimensional Linear Subspaces
Odelia Melamed Gilad Yehudai Gal Vardi



Research question: Despite a great deal of research, it remains unclear why trained neural networks are highly vulnerable to adversarial examples.
Motivation: This work focuses on two-layer neural networks trained on data lying on a low-dimensional linear subspace.
Method: Show that standard gradient methods lead to non-robust networks, i.e., networks with large gradients in directions orthogonal to the data subspace that are susceptible to small adversarial $L_2$-perturbations in those directions; further show that decreasing the initialization scale of the training algorithm or adding $L_2$ regularization makes the trained network more robust to perturbations orthogonal to the data subspace.
Results: Experiments confirm that reducing the initialization scale or adding $L_2$ regularization significantly improves robustness to adversarial perturbations orthogonal to the data.

Despite a great deal of research, it is still not well-understood why trained neural networks are highly vulnerable to adversarial examples. In this work we focus on two-layer neural networks trained using data which lie on a low dimensional linear subspace. We show that standard gradient methods lead to non-robust neural networks, namely, networks which have large gradients in directions orthogonal to the data subspace, and are susceptible to small adversarial $L_2$-perturbations in these directions. Moreover, we show that decreasing the initialization scale of the training algorithm, or adding $L_2$ regularization, can make the trained network more robust to adversarial perturbations orthogonal to the data.

ESSEN: Improving Evolution State Estimation for Temporal Networks using Von Neumann Entropy
Qiyao Huang Yingyue Zhang Zhihong Zhang Edwin Hancock



Research question: How to better understand and analyze the evolution states of the temporal networks that underlie real-world dynamic systems.
Motivation: Existing methods often fail to capture the time-varying nature of these network structures and perform poorly on networks with complex evolution states.
Method: Propose ESSEN, a framework that measures temporal network evolution with von Neumann entropy and thermodynamic temperature, encodes graphs using a von-Neumann-entropy-aware attention mechanism and evolution-state contrastive learning, and decodes with a dedicated Mixture of Thermodynamic Experts (MoTE) decoder.
Results: Evaluated on link prediction in both transductive and inductive settings, ESSEN proves effective against various state-of-the-art baselines.

Temporal networks are widely used as abstract graph representations for real-world dynamic systems. Indeed, recognizing the network evolution states is crucial in understanding and analyzing temporal networks. For instance, social networks will generate the clustering and formation of tightly-knit groups or communities over time, relying on the triadic closure theory. However, the existing methods often struggle to account for the time-varying nature of these network structures, hindering their performance when applied to networks with complex evolution states. To mitigate this problem, we propose a novel framework called ESSEN, an Evolution StateS awarE Network, to measure temporal network evolution using von Neumann entropy and thermodynamic temperature. The developed framework utilizes a von Neumann entropy aware attention mechanism and network evolution state contrastive learning in the graph encoding. In addition, it employs a unique decoder, the so-called Mixture of Thermodynamic Experts (MoTE), for decoding. ESSEN extracts local and global network evolution information using thermodynamic features and adaptively recognizes the network evolution states. Moreover, the proposed method is evaluated on link prediction tasks under both transductive and inductive settings, with the corresponding results demonstrating its effectiveness compared to various state-of-the-art baselines.
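
For reference, the quantity at the heart of ESSEN can be written down directly: treating the trace-normalized graph Laplacian as a density matrix, the von Neumann entropy is $S = -\mathrm{Tr}(\rho \ln \rho)$. The sketch below computes this textbook definition with NumPy; ESSEN's entropy-aware attention and thermodynamic features build on top of it, so this is the underlying measure, not the paper's full estimator.

```python
# Minimal sketch: von Neumann entropy of a graph from its trace-normalized
# Laplacian, the quantity ESSEN builds on.
import numpy as np

def von_neumann_entropy(adj: np.ndarray) -> float:
    """S = -Tr(rho ln rho), with rho = L / Tr(L) and L the graph Laplacian."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    rho = laplacian / np.trace(laplacian)      # density-matrix analogue
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]         # convention: 0 * ln 0 = 0
    return float(-(eigvals * np.log(eigvals)).sum())

# Toy usage: entropy of a 4-cycle.
cycle = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
print(von_neumann_entropy(cycle))              # about 1.04
```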

Multiplication-Free Transformer Training via Piecewise Affine Operations
Atli Kosson Martin Jaggi



Research question: How to reduce the computational cost of neural network training and inference.
Motivation: Multiplications account for most of the computational cost of training and inference, motivating ways to reduce the cost associated with them.
Method: Inspired by Mogami (2020), replace multiplications with a cheap piecewise affine approximation obtained by adding the bit representations of floating-point numbers together as integers.
Results: Transformers can be trained with the modified matrix multiplications on vision and language tasks with little to no performance impact and without changing training hyperparameters; all non-linearities can further be replaced, making the networks fully piecewise affine in both inputs and weights; finally, all multiplications can be eliminated from the entire training process, including the forward pass, backward pass, and optimizer update, demonstrating the first successful fully multiplication-free training of modern neural network architectures.

Multiplications are responsible for most of the computational cost involved in neural network training and inference. Recent research has thus looked for ways to reduce the cost associated with them. Inspired by Mogami 2020, we replace multiplication with a cheap piecewise affine approximation that is achieved by adding the bit representation of the floating point numbers together as integers. We show that transformers can be trained with the resulting modified matrix multiplications on both vision and language tasks with little to no performance impact, and without changes to the training hyperparameters. We further replace all non-linearities in the networks making them fully and jointly piecewise affine in both inputs and weights. Finally, we show that we can eliminate all multiplications in the entire training process, including operations in the forward pass, backward pass and optimizer update, demonstrating the first successful training of modern neural network architectures in a fully multiplication-free fashion.
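
The bit-level trick is compact enough to show directly. A minimal NumPy sketch, restricted to positive float32 inputs for simplicity: adding the IEEE-754 bit patterns as integers and subtracting the bit pattern of 1.0 yields a piecewise affine approximation of the product. Sign handling and the assembly into full matrix multiplications follow the paper and are omitted here.

```python
# Sketch of the piecewise affine multiplication: add the IEEE-754 bit
# patterns of two positive float32 values as int32 and subtract the bit
# pattern of 1.0. Overflow and negative inputs are not handled here.
import numpy as np

FLOAT32_ONE = np.int32(0x3F800000)  # bit pattern of float32 1.0

def approx_mul(a, b):
    ia = np.ascontiguousarray(a, dtype=np.float32).view(np.int32)
    ib = np.ascontiguousarray(b, dtype=np.float32).view(np.int32)
    return (ia + ib - FLOAT32_ONE).view(np.float32)

a = np.float32([1.7, 3.2, 0.25])
b = np.float32([2.0, 1.1, 8.0])
print(approx_mul(a, b))   # [3.4, 3.4, 2.0]: close to the true products
print(a * b)              # [3.4, 3.52, 2.0]
```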

Provable Guarantees for Neural Networks via Gradient Feature Learning
Zhenmei Shi Junyi Wei Yingyu Liang



Research question: Current theoretical analyses cannot adequately explain the success of neural networks; e.g., the Neural Tangent Kernel approach fails to capture their key feature learning ability.
Motivation: Propose a unified analysis framework to close this gap between theory and the observed success of networks.
Method: Center the framework on the principle of learning features from gradients, and demonstrate its effectiveness on several prototypical problems such as mixtures of Gaussians and parity functions.
Results: Beyond yielding provable guarantees in these prototypical problems, the framework sheds light on network learning phenomena such as feature learning beyond kernels and the lottery ticket hypothesis.

Neural networks have achieved remarkable empirical performance, while the current theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent Kernel approach fails to capture their key feature learning ability, while recent analyses on feature learning are typically problem-specific. This work proposes a unified analysis framework for two-layer networks trained by gradient descent. The framework is centered around the principle of feature learning from gradients, and its effectiveness is demonstrated by applications in several prototypical problems, such as mixtures of Gaussians and parity functions. The framework also sheds light on interesting network learning phenomena such as feature learning beyond kernels and the lottery ticket hypothesis.

Adaptive Topological Feature via Persistent Homology: Filtration Learning for Point Clouds
Naoki Nishikawa Yuichi Ike Kenji Yamanishi



Research question: How to improve the accuracy of machine learning methods for point clouds.
Motivation: Incorporating global topological features computed by persistent homology is often effective, but the output of persistent homology depends heavily on the choice of filtration.
Method: Propose a framework that adaptively learns a filtration with neural networks, develop a network architecture with isometry invariance, and give a theoretical result on finite-dimensional approximation of filtration functions that justifies the architecture.
Results: Experiments demonstrate the efficacy of the framework on several classification tasks.

Machine learning for point clouds has been attracting much attention, with many applications in various fields, such as shape recognition and material science. For enhancing the accuracy of such machine learning methods, it is often effective to incorporate global topological features, which are typically extracted by persistent homology. In the calculation of persistent homology for a point cloud, we choose a filtration for the point cloud, an increasing sequence of spaces. Since the performance of machine learning methods combined with persistent homology is highly affected by the choice of a filtration, we need to tune it depending on data and tasks. In this paper, we propose a framework that learns a filtration adaptively with the use of neural networks. In order to make the resulting persistent homology isometry-invariant, we develop a neural network architecture with such invariance. Additionally, we show a theoretical result on a finite-dimensional approximation of filtration functions, which justifies the proposed network architecture. Experimental results demonstrated the efficacy of our framework in several classification tasks.

Temporal Conditioning Spiking Latent Variable Models of the Neural Response to Natural Visual Scenes
Gehua Ma Runhao Jiang Rui Yan Huajin Tang



Research question: Developing computational models of neural responses is crucial for understanding sensory processing and neural computation.
Motivation: Current state-of-the-art neural network methods use temporal filters to handle temporal dependencies, an unrealistic and inflexible processing paradigm, and they target trial-averaged firing rates, failing to capture important features of spike trains.
Method: Propose the temporal conditioning spiking latent variable model (TeCoS-LVM) to simulate neural responses to natural visual stimuli: spiking neurons produce spike outputs that directly match recorded trains, avoiding loss of information embedded in the original spike trains; the temporal dimension is excluded from the model parameter space, and a temporal conditioning operation lets the model adaptively explore and exploit temporal dependencies in stimulus sequences in a natural way.
Results: TeCoS-LVM produces more realistic spike activity and fits spike statistics more accurately than powerful alternatives, and learned models generalize well to longer time scales; while remaining computationally tractable, the model captures key features of neural coding systems and provides a useful tool for building accurate predictive accounts of sensory perception circuits.

Developing computational models of neural response is crucial for understanding sensory processing and neural computations. Current state-of-the-art neural network methods use temporal filters to handle temporal dependencies, resulting in an **unrealistic and inflexible processing paradigm**. Meanwhile, these methods target **trial-averaged firing rates** and fail to capture important features in spike trains. This work presents the temporal conditioning spiking latent variable models (***TeCoS-LVM***) to simulate the neural response to natural visual stimuli. We use spiking neurons to produce spike outputs that directly match the recorded trains. This approach helps to avoid losing information embedded in the original spike trains. We exclude the temporal dimension from the model parameter space and introduce a temporal conditioning operation to allow the model to adaptively explore and exploit temporal dependencies in stimuli sequences in a **natural paradigm**. We show that TeCoS-LVM models can produce more realistic spike activities and accurately fit spike statistics than powerful alternatives. Additionally, learned TeCoS-LVM models can generalize well to longer time scales. Overall, while remaining computationally tractable, our model effectively captures key features of neural coding systems. It thus provides a useful tool for building accurate predictive computational accounts for various sensory perception circuits.

DISCOVER: Making Vision Networks Interpretable via Competition and Dissection
Konstantinos P. Panousis Sotirios Chatzis



Research question: How to improve the interpretability of deep networks, especially for safety-critical or bias-aware applications.
Motivation: The complexity of modern deep networks and the opacity of their inferential outcomes are major obstacles to transparent deployment.
Method: Leverage multimodal vision-text models and network layers built on a novel notion of stochastic local competition between linear units to propose a framework that discovers the individual functionality of each neuron in a network.
Results: The method retains or improves classification performance while providing a principled framework for text-based description and examination of the generated neuronal representations.

Modern deep networks are highly complex and their inferential outcome very hard to interpret. This is a serious obstacle to their transparent deployment in safety-critical or bias-aware applications. This work contributes to *post-hoc* interpretability, and specifically Network Dissection. Our goal is to present a framework that makes it easier to *discover* the individual functionality of each neuron in a network trained on a vision task; discovery is performed in terms of textual description generation. To achieve this objective, we leverage: (i) recent advances in multimodal vision-text models and (ii) network layers founded upon the novel concept of stochastic local competition between linear units. In this setting, only a *small subset* of layer neurons are activated *for a given input*, leading to extremely high activation sparsity (as low as only $\approx 4\%$). Crucially, our proposed method infers (sparse) neuron activation patterns that enables the neurons to activate/specialize to inputs with specific characteristics, diversifying their individual functionality. This capacity of our method supercharges the potential of dissection processes: human understandable descriptions are generated only for the very few active neurons, thus facilitating the direct investigation of the network's decision process. As we experimentally show, our approach: (i) yields Vision Networks that retain or improve classification performance, and (ii) realizes a principled framework for text-based description and examination of the generated neuronal representations.

SparseProp: Efficient Event-Based Simulation and Training of Sparse Recurrent Spiking Neural Networks
Rainer Engelken



Research question: Address the high computational cost of simulating and training spiking neural networks (SNNs).
Motivation: Simulating and training SNNs is computationally expensive because large systems of coupled differential equations must be solved.
Method: Propose SparseProp, a novel event-driven algorithm for simulating and training sparse SNNs that reduces the computational cost of both forward and backward operations from O(N) to O(log(N)) per network spike, enabling numerically exact simulation and efficient training of large spiking networks.
Results: By exploiting network sparsity, SparseProp avoids iterating over all neurons at every spike and uses efficient state updates; for several classical integrate-and-fire models, including a sparse SNN with one million LIF neurons, it runs more than four orders of magnitude faster than previous implementations, providing an efficient and exact solution for training large-scale SNNs and opening new possibilities for more sophisticated brain-inspired models.

Spiking Neural Networks (SNNs) are biologically-inspired models that are capable of processing information in streams of action potentials. However, simulating and training SNNs is computationally expensive due to the need to solve large systems of coupled differential equations. In this paper, we propose a novel event-based algorithm called SparseProp for simulating and training sparse SNNs. Our algorithm reduces the computational cost of both forward pass and backward pass operations from O(N) to O(log(N)) per network spike, enabling numerically exact simulations of large spiking networks and their efficient training using backpropagation through time. By exploiting the sparsity of the network, SparseProp avoids iterating through all neurons at every spike and uses efficient state updates. We demonstrate the effectiveness of SparseProp for several classical integrate-and-fire neuron models, including simulating a sparse SNN with one million LIF neurons, which is sped up by more than four orders of magnitude compared to previous implementations. Our work provides an efficient and exact solution for training large-scale spiking neural networks and opens up new possibilities for building more sophisticated brain-inspired models.
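
To make the per-spike cost concrete, here is a hedged event-driven sketch: a binary heap stores each neuron's next spike time, so processing one spike touches only the spiking neuron and its K postsynaptic targets, at O(K log N) heap cost, rather than sweeping all N neurons. The neuron model (perfect integrate-and-fire with constant drive, for which next spike times are available in closed form) and the lazy-deletion bookkeeping are choices of this sketch; SparseProp treats standard LIF-type models exactly via a change of variables.

```python
# Event-driven simulation sketch: dV/dt = drive[i], spike and reset at V = 1,
# each spike adds `weight` to the sparse set of postsynaptic targets.
import heapq

def simulate(drive, targets, weight, t_end):
    n = len(drive)
    v = [0.0] * n                 # potential at the time of the last update
    last = [0.0] * n              # time of each neuron's last state update
    version = [0] * n             # lazy deletion of stale heap entries
    heap = [((1.0 - v[i]) / drive[i], i, 0) for i in range(n)]
    heapq.heapify(heap)
    spikes = []
    while heap:
        t, i, ver = heapq.heappop(heap)
        if t > t_end:
            break
        if ver != version[i]:     # stale event, skip
            continue
        spikes.append((t, i))
        v[i], last[i] = 0.0, t    # reset the spiking neuron
        version[i] += 1
        heapq.heappush(heap, (t + 1.0 / drive[i], i, version[i]))
        for j in targets.get(i, []):          # only sparse out-edges: O(K log N)
            v[j] += drive[j] * (t - last[j]) + weight   # integrate, then kick
            last[j] = t
            version[j] += 1
            dt = max(0.0, 1.0 - v[j]) / drive[j]        # suprathreshold: fire now
            heapq.heappush(heap, (t + dt, j, version[j]))
    return spikes

# Toy network: 3 neurons in a ring with excitatory coupling.
print(simulate([0.9, 1.0, 1.1], {0: [1], 1: [2], 2: [0]}, weight=0.1, t_end=3.0))
```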

Finite-Time Analysis of Whittle Index based Q-Learning for Restless Multi-Armed Bandits with Neural Network Function Approximation
GUOJUN XIONG Jian Li



Research question: How to effectively solve the intractable restless multi-armed bandit (RMAB) problem when Whittle indices must be learned rather than computed.
Motivation: The Whittle index policy is provably asymptotically optimal for RMAB, but finding Whittle indices remains difficult, and the non-asymptotic convergence of Q-learning coupled with neural network function approximation for this task is largely unclear.
Method: Present Neural-Q-Whittle, a Whittle-index-based Q-learning algorithm with neural network function approximation, an instance of nonlinear two-timescale stochastic approximation in which Q-values are updated on a faster timescale and Whittle indices on a slower one; provide a finite-time analysis via a Lyapunov drift approach, with data generated from a Markov chain and the Q-function approximated by a ReLU network, characterizing the approximation error introduced by the nonlinearity.
Results: The analysis establishes an $\mathcal{O}(1/k^{2/3})$ convergence rate for Neural-Q-Whittle, where $k$ is the number of iterations.

The Whittle index policy is a heuristic for the intractable restless multi-armed bandits (RMAB) problem. Although it is provably asymptotically optimal, finding Whittle indices remains difficult. In this paper, we present Neural-Q-Whittle, a Whittle index based Q-learning algorithm for RMAB with neural network function approximation, which is an example of nonlinear two-timescale stochastic approximation with Q-function values updated on a faster timescale and Whittle indices on a slower timescale. Despite the empirical success of deep Q-learning, the non-asymptotic convergence rate of Neural-Q-Whittle, which couples neural networks with two-timescale Q-learning, largely remains unclear. This paper provides a finite-time analysis of Neural-Q-Whittle, where data are generated from a Markov chain, and the Q-function is approximated by a ReLU neural network. Our analysis leverages a Lyapunov drift approach to capture the evolution of two coupled parameters, and the nonlinearity in value function approximation further requires us to characterize the approximation error. Combining these provides Neural-Q-Whittle with an $\mathcal{O}(1/k^{2/3})$ convergence rate, where $k$ is the number of iterations.
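
The two-timescale coupling is easiest to see in a tabular toy, sketched below: Q-values move with a fast stepsize, while the per-state passive-action subsidy (the Whittle index estimate) moves with a slower one until the active and passive actions balance. The transition kernel, rewards, and stepsize exponents are illustrative stand-ins; the paper's algorithm replaces the table with a ReLU network.

```python
# Tabular sketch of the two-timescale structure in Whittle-index Q-learning:
# fast Q updates, slow subsidy (Whittle index) updates.
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, 2))   # P[s, a] = next-state distribution
R = rng.random((S, 2))                       # R[s, a] = reward
Q = np.zeros((S, 2))
lam = np.zeros(S)                            # Whittle index estimate per state
s = 0
for k in range(1, 50_000):
    alpha, beta = 1.0 / k**0.6, 1.0 / k      # beta/alpha -> 0: lam is slower
    a = rng.integers(2)                      # exploratory action
    s_next = rng.choice(S, p=P[s, a])
    subsidy = lam[s] if a == 0 else 0.0      # passive action earns the subsidy
    td = R[s, a] + subsidy + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td                    # fast timescale
    lam[s] += beta * (Q[s, 1] - Q[s, 0])     # slow timescale: balance actions
    s = s_next
print("Whittle index estimates:", lam.round(3))
```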

GraphPatcher: Mitigating Degree Bias for Graph Neural Networks via Test-time Augmentation
Mingxuan Ju Tong Zhao Wenhao Yu Neil Shah Yanfang Ye



Research question: Existing graph neural networks (GNNs) are biased against low-degree nodes, and prior remedies improve low-degree performance at the cost of degrading the high-degree nodes GNNs already handle well.
Motivation: Mitigate the degree bias of GNNs on low-degree nodes while preserving their strong performance on high-degree nodes.
Method: Propose GraphPatcher, a test-time augmentation framework that iteratively generates virtual nodes to patch artificially created low-degree nodes, progressively reconstructing the target GNN's predictions over a sequence of increasingly corrupted nodes.
Results: GraphPatcher learns to enhance low-degree nodes (when neighborhoods are heavily corrupted) while preserving the original superior performance on high-degree nodes (when lightly corrupted); experiments show it significantly improves both the overall and the low-degree performance of common GNNs, outperforming state-of-the-art baselines.

Recent studies have shown that graph neural networks (GNNs) exhibit strong biases towards the node degree: they usually perform satisfactorily on high-degree nodes with rich neighbor information but struggle with low-degree nodes. Existing works tackle this problem by deriving either designated GNN architectures or training strategies specifically for low-degree nodes. Though effective, these approaches unintentionally create an artificial out-of-distribution scenario, where models mainly or even only observe low-degree nodes during the training, leading to a downgraded performance for high-degree nodes that GNNs originally perform well at. In light of this, we propose a test-time augmentation framework, namely GraphPatcher, to enhance test-time generalization of any GNNs on low-degree nodes. Specifically, GraphPatcher iteratively generates virtual nodes to patch artificially created low-degree nodes via corruptions, aiming at progressively reconstructing target GNN's predictions over a sequence of increasingly corrupted nodes. Through this scheme, GraphPatcher not only learns how to enhance low-degree nodes (when the neighborhoods are heavily corrupted) but also preserves the original superior performance of GNNs on high-degree nodes (when lightly corrupted). Additionally, GraphPatcher is model-agnostic and can also mitigate the degree bias for either self-supervised or supervised GNNs. Comprehensive experiments are conducted over seven benchmark datasets and GraphPatcher consistently enhances common GNNs' overall performance by up to 3.6% and low-degree performance by up to 6.5%, significantly outperforming state-of-the-art baselines. The source code is publicly available at https://github.com/jumxglhf/GraphPatcher.

Transformers over Directed Acyclic Graphs
Yuankai Luo Veronika Thost Lei Shi



Research question: How to inject the structural bias of graphs into the transformer architecture.
Motivation: Transformer models have become popular in graph representation learning because they can learn complex relationships beyond those captured by regular graph neural networks.
Method: Study transformers over directed acyclic graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) an attention mechanism that is considerably more efficient than the regular quadratic-complexity attention of transformers while faithfully capturing the DAG structure, and (2) a positional encoding of the DAG's partial order, complementing the former.
Results: Rigorous evaluation on tasks ranging from classifying source code graphs to nodes in citation networks shows the approach is effective in two respects: it makes graph transformers generally outperform graph neural networks tailored to DAGs, and it improves the quality and efficiency of SOTA graph transformers.

Transformer models have recently gained popularity in graph representation learning as they have the potential to learn complex relationships beyond the ones captured by regular graph neural networks. The main research question is how to inject the structural bias of graphs into the transformer architecture, and several proposals have been made for undirected molecular graphs and, recently, also for larger network graphs. In this paper, we study transformers over directed acyclic graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) An attention mechanism that is considerably more efficient than the regular quadratic complexity of transformers and at the same time faithfully captures the DAG structure, and (2) a positional encoding of the DAG's partial order, complementing the former. We rigorously evaluate our approach over various types of tasks, ranging from classifying source code graphs to nodes in citation networks, and show that it is effective in two important aspects: in making graph transformers generally outperform graph neural networks tailored to DAGs and in improving SOTA graph transformer performance in terms of both quality and efficiency.

Evaluating the Robustness of Interpretability Methods through Explanation Invariance and Equivariance
Jonathan Crabbé Mihaela van der Schaar



Research question: How to make neural network explanations respect a model's symmetry group, thereby strengthening their robustness.
Motivation: An explanation must agree with the symmetry properties of the model it explains in order to describe the model faithfully.
Method: Using the formalism of geometric deep learning, introduce the notions of explanation invariance and equivariance, and derive two metrics that measure the robustness of any interpretability method with respect to the model's symmetry group.
Results: By empirically measuring these metrics for model explanations across modalities and symmetry groups, derive 5 guidelines that help users and developers of interpretability methods produce robust explanations.

Interpretability methods are valuable only if their explanations faithfully describe the explained model. In this work, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to be in agreement with this invariance property. We formalize this intuition through the notion of explanation invariance and equivariance by leveraging the formalism from geometric deep learning. Through this rigorous formalism, we derive (1) two metrics to measure the robustness of any interpretability method with respect to the model symmetry group; (2) theoretical robustness guarantees for some popular interpretability methods and (3) a systematic approach to increase the invariance of any interpretability method with respect to a symmetry group. By empirically measuring our metrics for explanations of models associated with various modalities and symmetry groups, we derive a set of 5 guidelines to allow users and developers of interpretability methods to produce robust explanations.
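
The two metrics admit a very small sketch. Below, the symmetry group is cyclic shifts of a 1-D input, the "explanation" is an analytic gradient saliency, and cosine similarity stands in for whatever similarity score the metrics use; all three are illustrative choices rather than the paper's exact instantiation. The toy also illustrates the expected outcome: gradient explanations of a shift-invariant model are equivariant but not invariant.

```python
# Sketch of the two robustness metrics: invariance compares e(g.x) with e(x);
# equivariance compares e(g.x) with g.e(x).
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def invariance(explain, x, shifts):
    base = explain(x)
    return float(np.mean([cosine(explain(np.roll(x, g)), base) for g in shifts]))

def equivariance(explain, x, shifts):
    base = explain(x)
    return float(np.mean([cosine(explain(np.roll(x, g)), np.roll(base, g))
                          for g in shifts]))

# Gradient saliency of a shift-invariant toy model f(x) = sum(x**2):
explain = lambda x: 2 * x          # analytic gradient, stands in for any method
x = np.random.default_rng(0).normal(size=16)
print(invariance(explain, x, shifts=range(16)))   # < 1: saliency not invariant
print(equivariance(explain, x, shifts=range(16))) # 1.0: saliency is equivariant
```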

On the Convergence of Encoder-only Shallow Transformers
Yongtao Wu Fanghui Liu Grigorios Chrysos Volkan Cevher



Research question: Build a global convergence theory for encoder-only shallow Transformers under a realistic setting, from the perspective of architecture, initialization, and scaling in the finite-width regime.
Motivation: The key difficulty is handling the softmax in the self-attention mechanism, the core ingredient of the Transformer; analyzing how scaling schemes and initialization affect training dynamics also deepens our understanding of modern Transformers.
Method: Carefully handle the input/output of the softmax and prove that quadratic overparameterization suffices for global convergence of shallow Transformers under the commonly used He/LeCun initializations; additionally, provide a neural tangent kernel (NTK) based analysis for comparison.
Results: The theory demonstrates a separation in the importance of different scaling schemes and initializations, offering a new perspective on modern Transformers and, in particular, their training dynamics.

In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling under a finite width regime. The difficulty lies in how to tackle the softmax in self-attention mechanism, the core ingredient of Transformer. In particular, we diagnose the scaling scheme, carefully tackle the input/output of softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under commonly-used He/LeCun initialization in practice. Besides, neural tangent kernel (NTK) based analysis is also given, which facilitates a comprehensive comparison. Our theory demonstrates the separation on the importance of different scaling schemes and initialization. We believe our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.

Towards the Difficulty for a Deep Neural Network to Learn Concepts of Different Complexities
Dongrui Liu Huiqi Deng Xu Cheng Qihan Ren Kangrui Wang Quanshi Zhang



Research question: Theoretically explain the intuition that deep neural networks (DNNs) learn simple concepts more readily than complex ones.
Motivation: Recent studies have found that a DNN usually encodes only a small number of interactive concepts and uses their interaction effects to compute inference scores; this study aims to explain theoretically why interactive concepts involving more input variables (i.e., more complex concepts) are harder to learn.
Method: Build on the observed and proven emergence of interactive concepts in DNNs, where each interactive concept encodes a collaboration among a set of input variables.
Results: The finding clarifies the exact conceptual complexity that raises learning difficulty.

This paper theoretically explains the intuition that simple concepts are more likely to be learned by deep neural networks (DNNs) than complex concepts. In fact, recent studies have observed [24, 15] and proved [26] the emergence of interactive concepts in a DNN, i.e., it is proven that a DNN usually only encodes a small number of interactive concepts, and can be considered to use their interaction effects to compute inference scores. Each interactive concept is encoded by the DNN to represent the collaboration between a set of input variables. Therefore, in this study, we aim to theoretically explain that interactive concepts involving more input variables (i.e., more complex concepts) are more difficult to learn. Our finding clarifies the exact conceptual complexity that boosts the learning difficulty.
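
For concreteness, the interaction effect of a concept $S$ can be computed by inclusion-exclusion over its subsets: $I(S) = \sum_{T \subseteq S} (-1)^{|S|-|T|} v(T)$, where $v(T)$ is the model output with only the variables in $T$ present. The sketch below uses a toy model and a zero baseline for masking; both are assumptions of the illustration, not the paper's setup.

```python
# Sketch of an interaction effect over a variable subset S, computed by
# inclusion-exclusion with the remaining variables masked to a baseline.
from itertools import chain, combinations

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def interaction(model, x, baseline, S):
    effect = 0.0
    for T in subsets(S):
        masked = list(baseline)
        for i in T:
            masked[i] = x[i]                   # keep variables in T, mask rest
        effect += (-1) ** (len(S) - len(T)) * model(masked)
    return effect

# An AND-like concept over inputs 0 and 1 yields a nonzero interaction:
model = lambda z: 5.0 * z[0] * z[1] + 2.0 * z[2]
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(interaction(model, x, baseline, S=(0, 1)))   # 5.0: genuine interaction
print(interaction(model, x, baseline, S=(0, 2)))   # 0.0: no interaction
```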

Characterizing Graph Datasets for Node Classification: Homophily-Heterophily Dichotomy and Beyond
Oleg Platonov Denis Kuznedelev Artem Babenko Liudmila Prokhorenkova



Research question: Existing measures of graph homophily have drawbacks that prevent comparing homophily levels across datasets.
Motivation: Homophily and heterophily are important properties of graph structure, but flawed homophily measures limit progress on graph neural networks for heterophilous graphs.
Method: Propose a new measure, adjusted homophily, and further introduce label informativeness (LI), a new graph characteristic that quantifies how much information a neighbor's label provides about a node's label.
Results: Adjusted homophily is shown to satisfy more desirable properties than existing measures, and LI agrees better with GNN performance, confirming its value as a characteristic of graph structure.

Homophily is a graph property describing the tendency of edges to connect similar nodes; the opposite is called heterophily. It is often believed that heterophilous graphs are challenging for standard message-passing graph neural networks (GNNs), and much effort has been put into developing efficient methods for this setting. However, there is no universally agreed-upon measure of homophily in the literature. In this work, we show that commonly used homophily measures have critical drawbacks preventing the comparison of homophily levels across different datasets. For this, we formalize desirable properties for a proper homophily measure and verify which measures satisfy which properties. In particular, we show that a measure that we call adjusted homophily satisfies more desirable properties than other popular homophily measures while being rarely used in graph machine learning literature. Then, we go beyond the homophily-heterophily dichotomy and propose a new characteristic that allows one to further distinguish different sorts of heterophily. The proposed label informativeness (LI) characterizes how much information a neighbor's label provides about a node's label. We prove that this measure satisfies important desirable properties. We also observe empirically that LI better agrees with GNN performance compared to homophily measures, which confirms that it is a useful characteristic of the graph structure.
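
Up to notation, adjusted homophily corrects edge homophily by the value expected under a degree-preserving null model, which is what makes levels comparable across datasets. The sketch below implements this correction for an undirected edge list; it follows one reading of the paper's definition, so consult the paper for directed or weighted variants.

```python
# Sketch: h_adj = (h_edge - sum_k p_k^2) / (1 - sum_k p_k^2), where
# p_k = D_k / 2|E| is the degree-weighted share of class k.
import numpy as np

def adjusted_homophily(edges, labels):
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    h_edge = float((labels[edges[:, 0]] == labels[edges[:, 1]]).mean())
    deg = np.bincount(edges.ravel(), minlength=len(labels))
    p = np.array([deg[labels == k].sum() for k in np.unique(labels)])
    p = p / (2 * len(edges))
    null = float((p ** 2).sum())               # expected under the null model
    return (h_edge - null) / (1 - null)

# Toy graph: two same-class triangles joined by one cross-class edge.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
labels = [0, 0, 0, 1, 1, 1]
print(adjusted_homophily(edges, labels))       # about 0.71: homophilous
```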

Performance-optimized deep neural networks are evolving into worse models of inferotemporal visual cortex
Drew Linsley Ivan F Rodriguez Rodriguez Thomas FEL Michael Arcaro Saloni Sharma Margaret Livingstone Thomas Serre



Research question: As deep neural networks become more accurate at object recognition, do they remain accurate predictors of inferotemporal (IT) cortex neuron responses to images?
Motivation: DNN object-recognition accuracy has historically correlated with the ability to predict IT neuron responses, but whether this relationship persists as accuracy keeps growing is unclear.
Method: Across three independent experiments, show that as DNNs become more accurate on ImageNet they become progressively worse at predicting IT neuron responses; address the problem with the neural harmonizer, a plug-and-play training routine that aligns DNN representations with humans.
Results: Harmonized DNNs break the trade-off between ImageNet accuracy and neural prediction accuracy, offering a path to more accurate models of biological vision; this suggests the standard approach of modeling IT with task-optimized DNNs needs revision, and that additional biological constraints, including human psychophysics data, are needed to accurately reverse-engineer visual cortex.

One of the most impactful findings in computational neuroscience over the past decade is that the object recognition accuracy of deep neural networks (DNNs) correlates with their ability to predict neural responses to natural images in the inferotemporal (IT) cortex. This discovery supported the long-held theory that object recognition is a core objective of the visual cortex, and suggested that more accurate DNNs would serve as better models of IT neuron responses to images. Since then, deep learning has undergone a revolution of scale: billion parameter-scale DNNs trained on billions of images are rivaling or outperforming humans at visual tasks including object recognition. Have today's DNNs become more accurate at predicting IT neuron responses to images as they have grown more accurate at object recognition? Surprisingly, across three independent experiments, we find that this is not the case. DNNs have become progressively worse models of IT as their accuracy has increased on ImageNet. To understand why DNNs experience this trade-off and evaluate if they are still an appropriate paradigm for modeling the visual system, we turn to recordings of IT that capture spatially resolved maps of neuronal activity elicited by natural images. These neuronal activity maps reveal that DNNs trained on ImageNet learn to rely on different visual features than those encoded by IT and that this problem worsens as their accuracy increases. We successfully resolved this issue with the neural harmonizer, a plug-and-play training routine for DNNs that aligns their learned representations with humans. Our results suggest that harmonized DNNs break the trade-off between ImageNet accuracy and neural prediction accuracy that assails current DNNs and offer a path to more accurate models of biological vision. Our work indicates that the standard approach for modeling IT with task-optimized DNNs needs revision, and other biological constraints, including human psychophysics data, are needed to accurately reverse-engineer the visual cortex.

SEENN: Towards Temporal Spiking Early Exit Neural Networks
Yuhang Li Tamar Geller Youngeun Kim Priyadarshini Panda



Research question: Address the accuracy-efficiency trade-off created by the number of timesteps in spiking neural networks (SNNs).
Motivation: SNNs process inputs in a biologically plausible, spike-based manner and are cost-efficient, but their information capacity depends on the number of timesteps, leading to a trade-off between accuracy and efficiency.
Method: Propose Spiking Early-Exit Neural Networks (SEENN), a fine-grained adjustment of the number of timesteps: SEENN-I filters out uncertain predictions with a confidence-score threshold, while SEENN-II determines the number of timesteps by reinforcement learning.
Results: By dynamically adjusting the number of timesteps, SEENN markedly reduces the average timestep count during inference; e.g., SEENN-II ResNet-19 reaches 96.1% accuracy with an average of 1.08 timesteps on the CIFAR-10 test set.

Spiking Neural Networks (SNNs) have recently become more popular as a biologically plausible substitute for traditional Artificial Neural Networks (ANNs). SNNs are cost-efficient and deployment-friendly because they process input in both spatial and temporal manner using binary spikes. However, we observe that the information capacity in SNNs is affected by the number of timesteps, leading to an accuracy-efficiency tradeoff. In this work, we study a fine-grained adjustment of the number of timesteps in SNNs. Specifically, we treat the number of timesteps as a variable conditioned on different input samples to reduce redundant timesteps for certain data. We call our method Spiking Early-Exit Neural Networks (**SEENNs**). To determine the appropriate number of timesteps, we propose SEENN-I which uses a confidence score thresholding to filter out the uncertain predictions, and SEENN-II which determines the number of timesteps by reinforcement learning. Moreover, we demonstrate that SEENN is compatible with both the directly trained SNN and the ANN-SNN conversion. By dynamically adjusting the number of timesteps, our SEENN achieves a remarkable reduction in the average number of timesteps during inference. For example, our SEENN-II ResNet-19 can achieve **96.1**\% accuracy with an average of **1.08** timesteps on the CIFAR-10 test dataset. Code is shared at https://github.com/Intelligent-Computing-Lab-Yale/SEENN.
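
SEENN-I's confidence thresholding is simple to sketch. Below, `snn_step` is a hypothetical stand-in for one timestep of any directly trained or converted SNN, and the threshold value is illustrative; SEENN-II would replace the fixed threshold with a policy learned by reinforcement learning.

```python
# Sketch of SEENN-I style inference: accumulate output evidence timestep by
# timestep and exit once the running prediction is confident enough.
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_predict(snn_step, x, t_max=8, threshold=0.95):
    logits_sum = None
    for t in range(1, t_max + 1):
        logits = snn_step(x, t)                    # output at timestep t
        logits_sum = logits if logits_sum is None else logits_sum + logits
        probs = F.softmax(logits_sum / t, dim=-1)  # rate-averaged prediction
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:               # confident: stop early
            return pred.item(), t
    return pred.item(), t_max

# Toy stand-in whose evidence for class 1 sharpens over time:
snn_step = lambda x, t: torch.tensor([[0.3, 1.2 * t]])
print(early_exit_predict(snn_step, x=None))        # exits before t_max
```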

The expressive power of pooling in Graph Neural Networks
Filippo Maria Bianchi Veronica Lachi



Research question: How graph pooling operations affect the expressive power of graph neural networks (GNNs), and how different pooling operations can be compared.
Motivation: Despite considerable progress on GNNs, the effect of pooling on expressiveness has barely been studied, and there is no principled criterion for comparing pooling operators.
Method: Derive sufficient conditions under which a pooling operator fully preserves the expressive power of the message-passing (MP) layers before it; using these conditions, analyze several existing pooling operators and identify those that fail to satisfy them.
Results: An experimental setup empirically verifies the expressive power of GNNs equipped with pooling layers in terms of their ability to perform graph isomorphism tests.

In Graph Neural Networks (GNNs), hierarchical pooling operators generate local summaries of the data by coarsening the graph structure and the vertex features. Considerable attention has been devoted to analyzing the expressive power of message-passing (MP) layers in GNNs, while a study on how graph pooling affects the expressiveness of a GNN is still lacking. Additionally, despite the recent advances in the design of pooling operators, there is not a principled criterion to compare them. In this work, we derive sufficient conditions for a pooling operator to fully preserve the expressive power of the MP layers before it. These conditions serve as a universal and theoretically-grounded criterion for choosing among existing pooling operators or designing new ones. Based on our theoretical findings, we analyze several existing pooling operators and identify those that fail to satisfy the expressiveness conditions. Finally, we introduce an experimental setup to verify empirically the expressive power of a GNN equipped with pooling layers, in terms of its capability to perform a graph isomorphism test.

Spike-driven Transformer
Man Yao JiaKui Hu Zhaokun Zhou Li Yuan Yonghong Tian Bo XU Guoqi Li



Research question: How to bring the spike-driven paradigm of spiking neural networks (SNNs) into the Transformer to make deep learning more efficient.
Motivation: Owing to their unique event-driven (i.e., spike-driven) mode of computation, SNNs offer an energy-efficient option for deep learning.
Method: Propose the Spike-driven Transformer, which incorporates the spike-driven paradigm into the Transformer with four unique properties: (1) event-driven, with no computation triggered when the input is zero; (2) binary spike communication, so all matrix multiplications involving spike matrices become sparse additions; (3) self-attention with linear complexity in both the token and channel dimensions; (4) operations between spike-form Query, Key, and Value reduce to mask and addition.
Results: The proposed Spike-Driven Self-Attention (SDSA) uses only mask and addition, with no multiplications, giving up to $87.2\times$ lower computation energy than vanilla self-attention; in particular, the matrix multiplication between Query, Key, and Value is designed as a mask operation, and all residual connections are rearranged before the activation functions so that every neuron transmits binary spikes. The Spike-driven Transformer reaches 77.1% top-1 accuracy on ImageNet-1K, the state-of-the-art result in the SNN field.

Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm. In this paper, we incorporate the spike-driven paradigm into Transformer by the proposed Spike-driven Transformer with four unique properties: (1) Event-driven, no calculation is triggered when the input of Transformer is zero; (2) Binary spike communication, all matrix multiplications associated with the spike matrix can be transformed into sparse additions; (3) Self-attention with linear complexity at both token and channel dimensions; (4) The operations between spike-form Query, Key, and Value are mask and addition. Together, there are only sparse addition operations in the Spike-driven Transformer. To this end, we design a novel Spike-Driven Self-Attention (SDSA), which exploits only mask and addition operations without any multiplication, and thus has up to $87.2\times$ lower computation energy than vanilla self-attention. Especially in SDSA, the matrix multiplication between Query, Key, and Value is designed as the mask operation. In addition, we rearrange all residual connections in the vanilla Transformer before the activation functions to ensure that all neurons transmit binary spike signals. It is shown that the Spike-driven Transformer can achieve 77.1\% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field.
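
Taking the abstract at its word, a mask-and-add self-attention over binary spikes might look like the sketch below: Hadamard products of binary tensors act as masks, the only reduction is a sum over the token dimension, and a threshold maps the result back to binary spikes, giving linear cost in both tokens and channels. This is one plausible reading with an arbitrary threshold and no surrogate gradient; the paper's exact SDSA operator may differ.

```python
# Hedged sketch of a spike-driven attention: only masks (Hadamard products of
# binary tensors) and additions, no multiplications in effect.
import torch

def heaviside(x, threshold=1.0):
    return (x >= threshold).float()            # surrogate gradient omitted

def spike_driven_attention(Q, K, V):           # Q, K, V: binary, (tokens, dim)
    kv = K * V                                 # mask: AND of binary spikes
    pooled = kv.sum(dim=0, keepdim=True)       # addition over the token dim
    gate = heaviside(pooled, threshold=2.0)    # back to binary spikes
    return Q * gate                            # mask again: output stays binary

T, d = 6, 8
gen = torch.Generator().manual_seed(0)
Q = (torch.rand(T, d, generator=gen) > 0.5).float()
K = (torch.rand(T, d, generator=gen) > 0.5).float()
V = (torch.rand(T, d, generator=gen) > 0.5).float()
out = spike_driven_attention(Q, K, V)
print(out.unique())                            # binary output, values in {0, 1}
```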

On the Ability of Graph Neural Networks to Model Interactions Between Vertices
Noam Razin Tom Verbin Nadav Cohen



Research question: Fill the gap in the theoretical analysis of graph neural networks' (GNNs') ability to model interactions between vertices.
Motivation: Despite many efforts to theoretically analyze the expressive power of GNNs, a formal characterization of their ability to model interactions is lacking.
Method: Using an established measure known as separation rank, quantify the ability of certain GNNs to model interaction between a given subset of vertices and its complement, i.e., between the two sides of a partition of the input vertices.
Results: The ability to model interaction is primarily determined by the partition's walk index, a graph-theoretical characteristic defined by the number of walks originating from the partition boundary. Building on this, the authors design Walk Index Sparsification (WIS), an edge sparsification algorithm that preserves a GNN's ability to model interactions when input edges are removed; WIS is simple, computationally efficient, and markedly outperforms alternative methods in induced prediction accuracy.

Graph neural networks (GNNs) are widely used for modeling complex interactions between entities represented as vertices of a graph. Despite recent efforts to theoretically analyze the expressive power of GNNs, a formal characterization of their ability to model interactions is lacking. The current paper aims to address this gap. Formalizing strength of interactions through an established measure known as separation rank, we quantify the ability of certain GNNs to model interaction between a given subset of vertices and its complement, i.e. between the sides of a given partition of input vertices. Our results reveal that the ability to model interaction is primarily determined by the partition's walk index --- a graph-theoretical characteristic defined by the number of walks originating from the boundary of the partition. Experiments with common GNN architectures corroborate this finding. As a practical application of our theory, we design an edge sparsification algorithm named Walk Index Sparsification (WIS), which preserves the ability of a GNN to model interactions when input edges are removed. WIS is simple, computationally efficient, and in our experiments has markedly outperformed alternative methods in terms of induced prediction accuracy. More broadly, it showcases the potential of improving GNNs by theoretically analyzing the interactions they can model.

Sharpness-Aware Minimization Leads to Low-Rank Features
Maksym Andriushchenko Dara Bahri Hossein Mobahi Nicolas Flammarion



Research question: Show that the SAM method, besides its well-known generalization benefits, also reduces the rank of features at various layers of a neural network.
Motivation: Although SAM's primary motivation is improved generalization, the authors uncover an additional effect: feature rank reduction across different network architectures and objectives.
Method: Apply SAM to different architectures (fully-connected networks, convolutional networks, vision transformers) and objectives (regression, classification, language-image contrastive training) and observe the effect on feature rank; complement this with a mechanistic understanding in a simple two-layer network, theoretical confirmation, and experiments on deep networks.
Results: SAM significantly reduces the feature rank at different layers; the effect also occurs in deep networks, although the overall rank-reduction mechanism can be more complex, especially for deep networks with pre-activation skip connections and self-attention layers.

Sharpness-aware minimization (SAM) is a recently proposed method that minimizes the sharpness of the training loss of a neural network. While its generalization improvement is well-known and is the primary motivation, we uncover an additional intriguing effect of SAM: reduction of the feature rank which happens at different layers of a neural network. We show that this low-rank effect occurs very broadly: for different architectures such as fully-connected networks, convolutional networks, vision transformers and for different objectives such as regression, classification, language-image contrastive training. To better understand this phenomenon, we provide a mechanistic understanding of how low-rank features arise in a simple two-layer network. We observe that a significant number of activations gets entirely pruned by SAM which directly contributes to the rank reduction. We confirm this effect theoretically and check that it can also occur in deep networks, although the overall rank reduction mechanism can be more complex, especially for deep networks with pre-activation skip connections and self-attention layers.
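
The measurement itself is straightforward to reproduce: collect a layer's activations over a batch and count singular values above a relative tolerance. The sketch below is the kind of probe one would run layer by layer on SAM- versus SGD-trained networks; the tolerance is an arbitrary choice of the sketch.

```python
# Sketch of a numerical feature-rank probe for one layer's activation matrix.
import torch

@torch.no_grad()
def feature_rank(features: torch.Tensor, rel_tol: float = 1e-3) -> int:
    """features: (num_examples, dim) activation matrix of one layer."""
    s = torch.linalg.svdvals(features - features.mean(dim=0))
    return int((s > rel_tol * s[0]).sum())

# Toy comparison: a full-rank feature matrix vs. one squeezed onto 5 directions.
x = torch.randn(512, 64)
basis = torch.randn(64, 5)
low_rank = x @ basis @ basis.T          # rank <= 5 by construction
print(feature_rank(x), feature_rank(low_rank))   # e.g. 64 and 5
```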

Tanh Works Better with Asymmetry
Dongjin Kim Woojeong Kim Suhyun Kim



Research question: How the placement of Batch Normalization before versus after the activation function affects model performance.
Motivation: The original paper recommends placing Batch Normalization before the activation, but for bounded activations such as Tanh, swapping the order has been found to achieve better performance.
Method: By inspecting the output distributions of individual activation functions, observe that many are asymmetrically saturated; experiments designed to induce varying degrees of asymmetric saturation support the hypothesis that asymmetric saturation improves performance. Moreover, Batch Normalization after a bounded activation relocates the asymmetrically saturated outputs near zero, giving the swapped model high sparsity and further gains.
Results: Extensive experiments with Tanh, LeCun Tanh, and Softsign confirm that the swapped models perform better under high asymmetric saturation; a shifted Tanh manipulated to have consistent asymmetry achieves even higher accuracy than the original Tanh in the swapped order, confirming the importance of asymmetry.

Batch Normalization is commonly located in front of activation functions, as proposed by the original paper. Swapping the order, i.e., using Batch Normalization after activation functions, has also been attempted, but its performance is generally not much different from the conventional order when ReLU or a similar activation function is used. However, in the case of bounded activation functions like Tanh, we discovered that the swapped order achieves considerably better performance than the conventional order on various benchmarks and architectures. This paper reports this remarkable phenomenon and closely examines what contributes to this performance improvement. By looking at the output distributions of individual activation functions, not the whole layers, we found that many of them are asymmetrically saturated. The experiments designed to induce a different degree of asymmetric saturation support the hypothesis that asymmetric saturation helps improve performance. In addition, Batch Normalization after bounded activation functions relocates the asymmetrically saturated output of activation functions near zero, enabling the swapped model to have high sparsity, further improving performance. Extensive experiments with Tanh, LeCun Tanh, and Softsign show that the swapped models achieve improved performance with a high degree of asymmetric saturation. Finally, based on this investigation, we test a Tanh function shifted to be asymmetric. This shifted Tanh function that is manipulated to have consistent asymmetry shows even higher accuracy than the original Tanh used in the swapped order, confirming the asymmetry's importance. The code is available at https://github.com/hipros/tanh_works_better_with_asymmetry.
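
A minimal PyTorch rendering of the two ingredients, with an illustrative shift value: the bounded activation is shifted so that it saturates asymmetrically, and Batch Normalization is placed after it rather than before.

```python
# Sketch of the swapped ordering with a shifted Tanh; shift value and block
# layout are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

class ShiftedTanh(nn.Module):
    def __init__(self, shift: float = 0.5):   # shift chosen for illustration
        super().__init__()
        self.shift = shift

    def forward(self, x):
        return torch.tanh(x + self.shift)     # saturates asymmetrically

def conv_block_swapped(cin, cout):
    # conventional order would be Conv -> BN -> Tanh; here BN comes last
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        ShiftedTanh(),
        nn.BatchNorm2d(cout),                 # recenters the saturated outputs
    )

x = torch.randn(8, 3, 32, 32)
print(conv_block_swapped(3, 16)(x).shape)     # torch.Size([8, 16, 32, 32])
```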

Parallel Spiking Neurons with High Efficiency and Ability to Learn Long-term Dependencies
Wei Fang Zhaofei Yu Zhaokun Zhou Ding Chen Yanqi Chen Zhengyu Ma Timothée Masquelier Yonghong Tian



Research question: Vanilla spiking neurons in spiking neural networks (SNNs) can only be simulated serially and can hardly learn long-term dependencies.
Motivation: Removing the reset operation allows the neuronal dynamics to be rewritten in a non-iterative, parallelizable form.
Method: Propose the Parallel Spiking Neuron (PSN), which generates hidden states that are independent of their predecessors, yielding parallelizable neuronal dynamics and extremely high simulation speed.
Results: Evaluated on simulation speed and temporal/static data classification, the PSN family shows an overwhelming advantage in efficiency and accuracy; this is the first study of parallelizing spiking neurons and is significant for spiking deep learning research.

Vanilla spiking neurons in Spiking Neural Networks (SNNs) use charge-fire-reset neuronal dynamics, which can only be simulated serially and can hardly learn long-time dependencies. We find that when removing reset, the neuronal dynamics can be reformulated in a non-iterative form and parallelized. By rewriting neuronal dynamics without reset to a general formulation, we propose the Parallel Spiking Neuron (PSN), which generates hidden states that are independent of their predecessors, resulting in parallelizable neuronal dynamics and extremely high simulation speed. The weights of inputs in the PSN are fully connected, which maximizes the utilization of temporal information. To avoid the use of future inputs for step-by-step inference, the weights of the PSN can be masked, resulting in the masked PSN. By sharing weights across time-steps based on the masked PSN, the sliding PSN is proposed to handle sequences of varying lengths. We evaluate the PSN family on simulation speed and temporal/static data classification, and the results show the overwhelming advantage of the PSN family in efficiency and accuracy. To the best of our knowledge, this is the first study of parallelizing spiking neurons and can serve as a cornerstone for spiking deep learning research. Our codes are available at https://github.com/fangwei123456/Parallel-Spiking-Neuron.
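
The non-iterative form admits a short sketch: with the reset removed, every timestep's hidden state is a learned linear mixture of the inputs, so the whole sequence reduces to one matrix product followed by a threshold. The layout below (a $T \times T$ mixing matrix, identity-initialized, with an optional lower-triangular mask for the masked PSN) is a simplified reading; the surrogate gradient needed for training through the threshold is omitted.

```python
# Sketch of the PSN idea: all timesteps computed in parallel, no reset.
import torch
import torch.nn as nn

class ParallelSpikingNeuron(nn.Module):
    def __init__(self, timesteps: int, causal: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.eye(timesteps))   # T x T time mixing
        self.threshold = nn.Parameter(torch.ones(1))
        self.causal = causal

    def forward(self, x):                  # x: (timesteps, batch, features)
        w = self.weight
        if self.causal:                    # masked PSN: no access to the future
            w = torch.tril(w)
        h = torch.einsum('ts,sbf->tbf', w, x)   # all timesteps at once
        return (h >= self.threshold).float()    # spikes; surrogate grad omitted

psn = ParallelSpikingNeuron(timesteps=4, causal=True)
x = torch.randn(4, 2, 8)
print(psn(x).shape)                        # torch.Size([4, 2, 8])
```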

Learning Curves for Deep Structured Gaussian Feature Models
Jacob A Zavatone-Veth Cengiz Pehlevan



Research question: Investigate when deep models that interpolate their training data can still generalize well to unseen examples.
Motivation: Most prior studies assume random features generated from independent and identically distributed Gaussian weights, allowing structure only in the input data and leaving the effect of weight anisotropy unexplored.
Method: Use the replica trick from statistical physics to derive learning curves for models with many layers of structured Gaussian features; the analysis shows that allowing correlations between the rows of the first feature layer can aid generalization, while structure in later layers is generally detrimental.
Results: The results shed light on how weight structure affects generalization in a simple class of solvable models.

In recent years, significant attention in deep learning theory has been devoted to analyzing when models that interpolate their training data can still generalize well to unseen examples. Many insights have been gained from studying models with multiple layers of Gaussian random features, for which one can compute precise generalization asymptotics. However, few works have considered the effect of weight anisotropy; most assume that the random features are generated using independent and identically distributed Gaussian weights, and allow only for structure in the input data. Here, we use the replica trick from statistical physics to derive learning curves for models with many layers of structured Gaussian features. We show that allowing correlations between the rows of the first layer of features can aid generalization, while structure in later layers is generally detrimental. Our results shed light on how weight structure affects generalization in a simple class of solvable models.

Towards Anytime Classification in Early-Exit Architectures by Enforcing Conditional Monotonicity
Metod Jazbec James Urquhart Allingham Dan Zhang Eric Nalisnick



Research question: How to adapt early-exit neural networks to anytime prediction settings with dynamic computational budgets.
Motivation: Current early-exit networks do not guarantee that prediction quality improves with longer computation, which real-time anytime settings require.
Method: Propose a post-hoc modification based on the Product-of-Experts that encourages an early-exit network to become gradually confident, endowing deep models with conditional monotonicity in prediction quality.
Results: Empirical results on standard image classification tasks show this behavior can be achieved while preserving competitive accuracy on average.

Modern predictive models are often deployed to environments in which computational budgets are dynamic. Anytime algorithms are well-suited to such environments as, at any point during computation, they can output a prediction whose quality is a function of computation time. Early-exit neural networks have garnered attention in the context of anytime computation due to their capability to provide intermediate predictions at various stages throughout the network. However, we demonstrate that current early-exit networks are not directly applicable to anytime settings, as the quality of predictions for individual data points is not guaranteed to improve with longer computation. To address this shortcoming, we propose an elegant post-hoc modification, based on the Product-of-Experts, that encourages an early-exit network to become gradually confident. This gives our deep models the property of *conditional monotonicity* in the prediction quality---an essential building block towards truly anytime predictive modeling using early-exit architectures. Our empirical results on standard image-classification tasks demonstrate that such behaviors can be achieved while preserving competitive accuracy on average.
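
The Product-of-Experts rule itself is nearly a one-liner: the anytime posterior after exit $k$ is the renormalized product of the class probabilities of the first $k$ exits, so later exits can refine but not erase accumulated evidence. A minimal sketch, with made-up exit outputs:

```python
# Sketch of the Product-of-Experts combination of early-exit predictions.
import torch

def poe_anytime(exit_probs):
    """exit_probs: (num_exits, num_classes) per-exit softmax outputs.
    Returns the anytime prediction available after each exit."""
    log_p = torch.log(exit_probs + 1e-12).cumsum(dim=0)  # running product
    return torch.softmax(log_p, dim=-1)                  # renormalize

# Three exits that individually waver; the PoE combination is steadier:
exits = torch.tensor([[0.55, 0.45],
                      [0.48, 0.52],
                      [0.70, 0.30]])
print(poe_anytime(exits))
```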

Learning Time-Invariant Representations for Individual Neurons from Population Dynamics
Lu Mi Trung Le Tianxing He Eli Shlizerman Uygar Sümbül



Research question: How to assign time-invariant representations to individual neurons that reflect the inputs they receive from the rest of the circuit.
Motivation: Neuronal activity is highly variable, yet gene expression is relatively stable in the adult brain, suggesting that neuronal activity combines a time-invariant identity with the inputs the neuron receives from the rest of the circuit.
Method: Propose a self-supervised learning method that assigns time-invariant representations to individual neurons based on permutation- and population-size-invariant summaries of population recordings, fitting dynamical models to the activity of both the individual neuron and its neighboring population.
Results: Demonstrated on a public multimodal dataset of mouse cortical neuronal activity and transcriptomic labels, with >35% improvement in predicting transcriptomic subclass identity and >20% improvement in predicting class identity relative to the state of the art.

Neurons can display highly variable dynamics. While such variability presumably supports the wide range of behaviors generated by the organism, their gene expressions are relatively stable in the adult brain. This suggests that neuronal activity is a combination of its time-invariant identity and the inputs the neuron receives from the rest of the circuit. Here, we propose a self-supervised learning based method to assign time-invariant representations to individual neurons based on permutation-, and population size-invariant summary of population recordings. We fit dynamical models to neuronal activity to learn a representation by considering the activity of both the individual and the neighboring population. Our self-supervised approach and use of implicit representations enable robust inference against imperfections such as partial overlap of neurons across sessions, trial-to-trial variability, and limited availability of molecular (transcriptomic) labels for downstream supervised tasks. We demonstrate our method on a public multimodal dataset of mouse cortical neuronal activity and transcriptomic labels. We report >35\% improvement in predicting the transcriptomic subclass identity and >20\% improvement in predicting class identity with respect to the state-of-the-art.

topic-5

Topic words :  adversarial,  model,  privacy,  models,  robustness,  data,  robust,  attacks

Conformal Meta-learners for Predictive Inference of Individual Treatment Effects
Ahmed Alaa Zaid Ahmad Mark van der Laan



Research question: Machine-learning-based predictive inference on individual treatment effects (ITEs).
Motivation: Prior work focuses on ML-based "meta-learners" that provide point estimates of the conditional average treatment effect (CATE), model-agnostic methods that combine intermediate nuisance estimates to produce CATE estimates.
Method: Develop conformal meta-learners, a general framework that issues predictive intervals for ITEs by applying the standard conformal prediction (CP) procedure on top of CATE meta-learners; focus on the broad class of meta-learners based on two-stage pseudo-outcome regression and develop a stochastic ordering framework to study their validity.
Results: Inference with conformal meta-learners is marginally valid if their (pseudo-outcome) conformity scores stochastically dominate the "oracle" conformity scores evaluated on the unobserved ITEs; commonly used meta-learners such as the doubly-robust learner satisfy a model- and distribution-free stochastic (or convex) dominance condition, making their conformal inference valid at practically relevant coverage levels. Unlike existing procedures that infer potential outcomes via weighted CP, conformal meta-learners enable direct inference on the target parameter (ITE); numerical experiments show they provide valid intervals with competitive efficiency while retaining the favorable point-estimation properties of CATE meta-learners.

We investigate the problem of machine learning-based (ML) predictive inference on individual treatment effects (ITEs). Previous work has focused primarily on developing ML-based “meta-learners” that can provide point estimates of the conditional average treatment effect (CATE)—these are model-agnostic approaches for combining intermediate nuisance estimates to produce estimates of CATE. In this paper, we develop conformal meta-learners, a general framework for issuing predictive intervals for ITEs by applying the standard conformal prediction (CP) procedure on top of CATE meta-learners. We focus on a broad class of meta-learners based on two-stage pseudo-outcome regression and develop a stochastic ordering framework to study their validity. We show that inference with conformal meta-learners is marginally valid if their (pseudo-outcome) conformity scores stochastically dominate “oracle” conformity scores evaluated on the unobserved ITEs. Additionally, we prove that commonly used CATE meta-learners, such as the doubly-robust learner, satisfy a model- and distribution-free stochastic (or convex) dominance condition, making their conformal inferences valid for practically-relevant levels of target coverage. Whereas existing procedures conduct inference on nuisance parameters (i.e., potential outcomes) via weighted CP, conformal meta-learners enable direct inference on the target parameter (ITE). Numerical experiments show that conformal meta-learners provide valid intervals with competitive efficiency while retaining the favorable point estimation properties of CATE meta-learners.
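
A minimal end-to-end sketch under simplifying assumptions: a randomized treatment with known propensity, the simple IPW pseudo-outcome rather than the doubly-robust one the paper analyzes in depth, and synthetic data. The recipe is to fit a CATE learner on pseudo-outcomes, compute conformity scores on a held-out calibration split, and widen point predictions by the conformal quantile.

```python
# Sketch of a conformal meta-learner with split conformal prediction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, alpha, e = 2000, 0.1, 0.5                  # e: known propensity (RCT)
X = rng.normal(size=(n, 2))
T = rng.binomial(1, e, size=n)
tau = 1.0 + X[:, 0]                           # true CATE (synthetic)
Y = X.sum(axis=1) + T * tau + rng.normal(size=n)

pseudo = (T / e - (1 - T) / (1 - e)) * Y      # IPW pseudo-outcome, E[.|X]=CATE
train, calib = np.arange(n // 2), np.arange(n // 2, n)

cate = RandomForestRegressor(random_state=0).fit(X[train], pseudo[train])
scores = np.abs(pseudo[calib] - cate.predict(X[calib]))   # conformity scores
q = np.quantile(scores, np.ceil((1 - alpha) * (len(calib) + 1)) / len(calib))

x_new = np.array([[0.5, -0.2]])
pred = cate.predict(x_new)[0]
print(f"ITE interval: [{pred - q:.2f}, {pred + q:.2f}]")  # marginally valid
```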

Evaluating Post-hoc Explanations for Graph Neural Networks via Robustness Analysis
Junfeng Fang Wei Liu Yuan Gao Zemin Liu An Zhang Xiang Wang Xiangnan He



Research question: Evaluate explanations of graph neural networks, which is crucial to the credibility of post-hoc explainability in practical use.
Motivation: Conventional evaluation metrics, and even explanation methods themselves, mainly follow the paradigm of feeding in an explanatory subgraph and measuring the output difference, and thus suffer from the notorious out-of-distribution (OOD) problem.
Method: Introduce a novel evaluation metric, OOD-resistant Adversarial Robustness (OAR): drawing inspiration from adversarial robustness, evaluate post-hoc explanatory subgraphs by computing their robustness under attack, with an elaborate OOD reweighting block inserted into the pipeline to confine evaluation to the original data distribution; for applications with large datasets, devise a simplified version, SimOAR, that greatly improves computational efficiency at the cost of a small amount of performance.
Results: Extensive empirical studies validate the effectiveness of OAR and SimOAR.

This work studies the evaluation of explaining graph neural networks (GNNs), which is crucial to the credibility of post-hoc explainability in practical usage. Conventional evaluation metrics, and even explanation methods -- which mainly follow the paradigm of feeding the explanatory subgraph and measuring output difference -- always suffer from the notorious out-of-distribution (OOD) issue. In this work, we endeavor to confront the issue by introducing a novel evaluation metric, termed **O**OD-resistant **A**dversarial **R**obustness (OAR). Specifically, we draw inspiration from the notion of adversarial robustness and evaluate post-hoc explanation subgraphs by calculating their robustness under attack. On top of that, an elaborate OOD reweighting block is inserted into the pipeline to confine the evaluation process to the original data distribution. For applications involving large datasets, we further devise a **Sim**plified version of **OAR** (SimOAR), which achieves a significant improvement in computational efficiency at the cost of a small amount of performance. Extensive empirical studies validate the effectiveness of our OAR and SimOAR.

Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei Nika Haghtalab Jacob Steinhardt



Research question: Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as shown by the prevalence of "jailbreak" attacks on early releases of ChatGPT.
Motivation: Going beyond recognition of the issue, investigate why such attacks succeed and how they can be created.
Method: Hypothesize two failure modes of safety training: competing objectives, which arise when a model's capabilities conflict with its safety goals, and mismatched generalization, which occurs when safety training fails to generalize to a domain where capabilities exist; use these failure modes to guide jailbreak design and evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks.
Results: Vulnerabilities persist despite the extensive red-teaming and safety training behind these models; notably, new attacks built on the two failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. The analysis emphasizes the need for safety-capability parity, i.e., safety mechanisms as sophisticated as the underlying model, and argues against the idea that scaling alone can resolve these safety failure modes.

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of “jailbreak” attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model’s capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI’s GPT-4 and Anthropic’s Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models’ red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity—that safety mechanisms should be as sophisticated as the underlying model—and argues against the idea that scaling alone can resolve these safety failure modes.

Privacy Auditing with One (1) Training Run
Thomas Steinke Milad Nasr Matthew Jagielski



Research question: Propose a scheme for auditing differentially private machine learning systems with a single training run.
Motivation: Exploit the parallelism of adding or removing multiple independent training examples at once, avoiding the cost of group privacy.
Method: Analyze the scheme via the connection between differential privacy and statistical generalization; it requires minimal assumptions about the algorithm and applies in both black-box and white-box settings.
Results: Applied to DP-SGD, the scheme achieves meaningful empirical privacy lower bounds by training only one model, whereas standard methods require training hundreds of models.

We propose a scheme for auditing differentially private machine learning systems with a single training run. This exploits the parallelism of being able to add or remove multiple training examples independently. We analyze this using the connection between differential privacy and statistical generalization, which avoids the cost of group privacy. Our auditing scheme requires minimal assumptions about the algorithm and can be applied in the black-box or white-box setting. We demonstrate the effectiveness of our framework by applying it to DP-SGD, where we can achieve meaningful empirical privacy lower bounds by training only one model. In contrast, standard methods would require training hundreds of models.
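
The mechanics of the one-run audit can be sketched without the accompanying theory: include each of $m$ canaries independently with probability 1/2, train once, score every canary, and guess membership. Converting the guess accuracy into a formal $\varepsilon$ lower bound requires the paper's analysis; the scoring step below is a toy stand-in for evaluating an actually trained model.

```python
# Sketch of the one-run auditing recipe: random canary inclusion, one
# (simulated) training run, then membership guesses.
import numpy as np

rng = np.random.default_rng(0)
m = 1000
included = rng.binomial(1, 0.5, size=m).astype(bool)

# Stand-in for "train once, then score each canary" (e.g., by its loss):
# included canaries tend to score higher than excluded ones.
scores = rng.normal(loc=included.astype(float), scale=1.0)

guesses = scores > 0.5                       # threshold membership guesses
accuracy = (guesses == included).mean()
print(f"correct guesses: {accuracy:.1%}")    # ~69% here; 50% = no leakage
```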

On the Role of Randomization in Adversarially Robust Classification
Lucas Gnecco Heredia Muni Sreenivas Pydi Laurent Meunier benjamin negrevergne Yann Chevaleyre



Research question: Deep neural networks are vulnerable to small adversarial perturbations of test data; probabilistic classifiers have been proposed as an alternative to deterministic ones, but the literature holds conflicting findings on their effectiveness.
Motivation: Clarify the role of randomization in building adversarially robust classifiers.
Method: Given a base hypothesis set of deterministic classifiers, show under which conditions a randomized ensemble outperforms the hypothesis set in adversarial risk, extending previous results; show further that for any probabilistic binary classifier (including randomized ensembles) there exists a deterministic classifier that outperforms it; finally, give an explicit description of a deterministic hypothesis set containing such a classifier for many common probabilistic classifiers, i.e., randomized ensembles and parametric/input noise injection.
Results: Under certain conditions randomized ensembles outperform the base hypothesis set in adversarial risk, yet every probabilistic binary classifier is outperformed by some deterministic classifier.

Deep neural networks are known to be vulnerable to small adversarial perturbations in test data. To defend against adversarial attacks, probabilistic classifiers have been proposed as an alternative to deterministic ones. However, literature has conflicting findings on the effectiveness of probabilistic classifiers in comparison to deterministic ones. In this paper, we clarify the role of randomization in building adversarially robust classifiers. Given a base hypothesis set of deterministic classifiers, we show the conditions under which a randomized ensemble outperforms the hypothesis set in adversarial risk, extending previous results. Additionally, we show that for any probabilistic binary classifier (including randomized ensembles), there exists a deterministic classifier that outperforms it. Finally, we give an explicit description of the deterministic hypothesis set that contains such a deterministic classifier for many types of commonly used probabilistic classifiers, *i.e.* randomized ensembles and parametric/input noise injection.

GLIME: General, Stable and Local LIME Explanation
Zeren Tan Yang Tian Jian Li



Research question: As black-box machine learning models become more complex and are deployed in high-stakes settings, providing explanations for their predictions becomes crucial.
Motivation: Although Local Interpretable Model-agnostic Explanations (LIME) is widely adopted for understanding model behavior, it is unstable with respect to random seeds and exhibits low local fidelity.
Method: Propose GLIME, an enhanced framework that extends LIME and unifies several prior methods; within GLIME, derive an equivalent formulation of LIME that converges significantly faster and is more stable, and adopt a local, unbiased sampling distribution so that explanations have higher local fidelity than LIME and are independent of the choice of reference.
Results: GLIME achieves faster convergence, improved stability, and higher local fidelity than LIME, while giving users the flexibility to choose a sampling distribution suited to their specific scenario.

As black-box machine learning models become more complex and are applied in high-stakes settings, the need for providing explanations for their predictions becomes crucial. Although Local Interpretable Model-agnostic Explanations (LIME) \cite{ribeiro2016should} is a widely adopted method for understanding model behavior, it suffers from instability with respect to random seeds \cite{zafar2019dlime, shankaranarayana2019alime, bansal2020sam} and exhibits low local fidelity (i.e., how the explanation explains model's local behaviors) \cite{rahnama2019study, laugel2018defining}. Our study demonstrates that this instability is caused by small sample weights, resulting in the dominance of regularization and slow convergence. Additionally, LIME's sampling approach is non-local and biased towards the reference, leading to diminished local fidelity and instability to references. To address these challenges, we propose \textsc{Glime}, an enhanced framework that extends LIME and unifies several previous methods. Within the \textsc{Glime} framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, \textsc{Glime} generates explanations with higher local fidelity compared to LIME, while being independent of the reference choice. Moreover, \textsc{Glime} offers users the flexibility to choose sampling distribution based on their specific scenarios.

A Privacy-Friendly Approach to Data Valuation
Jiachen T. Wang Yuqing Zhu Yu-Xiang Wang Ruoxi Jia Prateek Mittal



Research question: Address the privacy challenges of data valuation, focusing on KNN-Shapley, one of the most practical data valuation methods today.
Motivation: Data valuation, which quantifies the usefulness of individual data sources for training machine learning models, is a growing field in which privacy protection has often been overlooked.
Method: First highlight the inherent privacy risks of KNN-Shapley and the significant technical challenges of adapting it to differential privacy (DP); then propose TKNN-Shapley, a privacy-friendly refinement of KNN-Shapley that admits straightforward modifications to provide DP guarantees (DP-TKNN-Shapley).
Results: DP-TKNN-Shapley offers several advantages and a superior privacy-utility trade-off compared with naively privatized KNN-Shapley, and even non-private TKNN-Shapley matches KNN-Shapley's performance in discerning data quality; overall, TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.

Data valuation, a growing field that aims at quantifying the usefulness of individual data sources for training machine learning (ML) models, faces notable yet often overlooked privacy challenges. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowadays. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical challenges in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley. Moreover, even non-private TKNN-Shapley matches KNN-Shapley's performance in discerning data quality. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.

Group Fairness in Peer Review
Haris Aziz Evi Micha Nisarg Shah



Research question: The way submissions are assigned at large conferences such as NeurIPS and AAAI can give some communities a poor reviewing experience, since their submissions may be assigned to reviewers unfamiliar with the area.
Motivation: To address this, the paper introduces a group-fairness notion called the core, which requires that every possible community (subset of researchers) be treated so that it cannot unilaterally benefit by withdrawing from the large conference.
Method: The paper studies a simple peer-review model, proves that a reviewing assignment in the core always exists, and designs an efficient algorithm to find one.
Results: Using real data from the CVPR and ICLR conferences, the algorithm is compared to existing reviewing assignment algorithms on a number of metrics.

Large conferences such as NeurIPS and AAAI serve as crossroads of various AI fields, since they attract submissions from a vast number of communities. However, in some cases, this has resulted in a poor reviewing experience for some communities, whose submissions get assigned to less qualified reviewers outside of their communities. An often-advocated solution is to break up any such large conference into smaller conferences, but this can lead to isolation of communities and harm interdisciplinary research. We tackle this challenge by introducing a notion of group fairness, called the core, which requires that every possible community (subset of researchers) be treated in a way that prevents it from unilaterally benefiting by withdrawing from a large conference. We study a simple peer review model, prove that it always admits a reviewing assignment in the core, and design an efficient algorithm to find one such assignment. We use real data from CVPR and ICLR conferences to compare our algorithm to existing reviewing assignment algorithms on a number of metrics.

Delegated Classification
Eden Saig Inbal Talgam-Cohen Nir Rosenfeld



Research question: When machine learning tasks are outsourced to a rational agent, conflicts of interest may arise and severely impact predictive performance.
Motivation: Propose a theoretical framework for incentive-aware delegation of machine learning tasks.
Method: Model delegation as a principal-agent game in which the principal incentivizes accurate learning through performance-based contracts.
Results: Empirically, budget-optimal contracts can be constructed from small-scale data by leveraging recent advances in the study of learning curves and scaling laws; performance and economic outcomes are evaluated on synthetic and real-world classification tasks.

When machine learning is outsourced to a rational agent, conflicts of interest might arise and severely impact predictive performance. In this work, we propose a theoretical framework for incentive-aware delegation of machine learning tasks. We model delegation as a principal-agent game, in which accurate learning can be incentivized by the principal using performance-based contracts. Adapting the economic theory of contract design to this setting, we define budget-optimal contracts and prove they take a simple threshold form under reasonable assumptions. In the binary-action case, the optimality of such contracts is shown to be equivalent to the classic Neyman-Pearson lemma, establishing a formal connection between contract design and statistical hypothesis testing. Empirically, we demonstrate that budget-optimal contracts can be constructed using small-scale data, leveraging recent advances in the study of learning curves and scaling laws. Performance and economic outcomes are evaluated using synthetic and real-world classification tasks.
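The toy sketch below illustrates the threshold form of contracts discussed above: the principal pays a bonus only when measured accuracy clears a threshold, and a rational agent best-responds. The effort levels, costs, and the learning-curve model are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def agent_best_response(efforts, costs, accuracy_of, bonus, threshold):
    """Rational agent's effort choice under a threshold contract:
    the principal pays `bonus` iff accuracy_of(effort) >= threshold."""
    utility = [bonus * (accuracy_of(e) >= threshold) - c
               for e, c in zip(efforts, costs)]
    return efforts[int(np.argmax(utility))]

# Illustrative saturating learning curve: acc(e) = 1 - 0.5 / sqrt(e).
best_effort = agent_best_response(
    efforts=[1.0, 2.0, 4.0], costs=[0.10, 0.25, 0.60],
    accuracy_of=lambda e: 1.0 - 0.5 * e ** -0.5,
    bonus=1.0, threshold=0.75)  # picks effort 4.0 here
```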

Counterfactual Evaluation of Peer-Review Assignment Policies
Martin Saveski Steven Jecmen Nihar B Shah Johan Ugander



Research question: How can we evaluate the effect of changes to a peer-review assignment algorithm on review quality?
Motivation: A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality.
Method: The paper leverages recently proposed policies that introduce randomness into peer-review assignment (to mitigate fraud) as an opportunity to evaluate counterfactual assignment policies; such randomized assignments give a positive probability of observing the reviews of many assignment policies of interest. To address the challenges of applying standard off-policy evaluation methods, such as violations of positivity, it introduces novel partial-identification methods based on monotonicity and Lipschitz-smoothness assumptions on the mapping between reviewer-paper covariates and outcomes.
Results: Placing more weight on text similarity yields higher review quality, while introducing randomness into the reviewer-paper assignment only marginally reduces it. The partial-identification methods may be of independent interest, and the off-policy approach can be used to evaluate a broad class of algorithmic matching systems.

Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment—in order to mitigate fraud—as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the review's past papers and the submission), and (ii) the "cost of randomization", capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.

Participatory Personalization in Classification
Hailey James Chirag Nagpal Katherine A Heller Berk Ustun



Research question: Current machine learning models personalize predictions without facilitating or informing user consent.
Motivation: To address this, the paper introduces participatory systems, prediction models that let individuals opt into personalization at prediction time.
Method: A model-agnostic algorithm is designed for learning participatory systems in supervised learning tasks where models are personalized with categorical group attributes.
Results: A comprehensive empirical study on clinical prediction tasks shows that participatory systems can facilitate and inform consent while improving performance and privacy across all groups who report personal data.

Machine learning models are often personalized based on information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people, but do not facilitate nor inform their *consent*. Individuals cannot opt out of reporting information that a model needs to personalize their predictions nor tell if they benefit from personalization in the first place. We introduce a new family of prediction models, called participatory systems, that let individuals opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for supervised learning tasks where models are personalized with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, comparing them to common approaches for personalization and imputation. Our results show that participatory systems can facilitate and inform consent in a way that improves performance and privacy across all groups who report personal data.

Learning to Receive Help: Intervention-Aware Concept Embedding Models
Mateo Espinosa Zarlenga Katherine M. Collins Krishnamurthy Dj Dvijotham Adrian Weller Zohreh Shams Mateja Jamnik



Research question: Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures by constructing and explaining their predictions with a set of high-level concepts, but recent work shows that intervention efficacy can depend heavily on the order of concept interventions and on the model's architecture and training hyperparameters.
Motivation: The paper argues this stems from a CBM's lack of train-time incentives to respond appropriately to concept interventions, and proposes Intervention-aware Concept Embedding Models (IntCEMs), a novel CBM-based architecture and training paradigm that improves receptiveness to test-time interventions.
Method: The model learns a concept-intervention policy end-to-end, from which it samples meaningful intervention trajectories at train time; this conditions IntCEMs to effectively select and receive concept interventions when deployed at test time.
Results: Experiments show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of the approach.

Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures by constructing and explaining their predictions using a set of high-level concepts. A special property of these models is that they permit concept interventions, wherein users can correct mispredicted concepts and thus improve the model's performance. Recent work, however, has shown that intervention efficacy can be highly dependent on the order in which concepts are intervened on and on the model's architecture and training hyperparameters. We argue that this is rooted in a CBM's lack of train-time incentives for the model to be appropriately receptive to concept interventions. To address this, we propose Intervention-aware Concept Embedding models (IntCEMs), a novel CBM-based architecture and training paradigm that improves a model's receptiveness to test-time interventions. Our model learns a concept intervention policy in an end-to-end fashion from which it can sample meaningful intervention trajectories at train-time. This conditions IntCEMs to effectively select and receive concept interventions when deployed at test-time. Our experiments show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of our approach.

Evaluating the Moral Beliefs Encoded in LLMs
Nino Scherrer Claudia Shi Amir Feder David Blei



Research question: This paper investigates the moral beliefs encoded in large language models (LLMs) through the design, administration, post-processing, and evaluation of surveys.
Motivation: To understand the moral choices different LLMs make in ambiguous cases, particularly where the right choice is not obvious.
Method: A large-scale survey comprising 680 high-ambiguity and 687 low-ambiguity moral scenarios is designed and administered to 28 open- and closed-source LLMs.
Results: In unambiguous scenarios most models choose actions that align with commonsense, while in ambiguous cases most models express uncertainty; some models are uncertain about choosing the commonsense action because their responses are sensitive to question wording; and some models show clear preferences in ambiguous scenarios, with closed-source models in particular tending to agree with one another.

This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models "choose" actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.

Individual Arbitrariness and Group Fairness
Carol Xuan Long Hsiang Hsu Wael Alghamdi Flavio Calmon



Research question: This paper addresses predictive multiplicity in machine learning: multiple competing models achieve similar performance yet produce conflicting outputs for individual samples.
Motivation: Fairness interventions that optimize only group fairness and accuracy can exacerbate predictive multiplicity, so a third axis, "arbitrariness," should be considered when deploying models to aid decision-making.
Method: The paper proposes an ensemble algorithm, applicable to any fairness intervention, that provably ensures more consistent predictions.
Results: Experiments show the algorithm effectively mitigates predictive multiplicity and improves prediction consistency.

Machine learning tasks may admit multiple competing models that achieve similar performance yet produce conflicting outputs for individual samples---a phenomenon known as predictive multiplicity. We demonstrate that fairness interventions in machine learning optimized solely for group fairness and accuracy can exacerbate predictive multiplicity. Consequently, state-of-the-art fairness interventions can mask high predictive multiplicity behind favorable group fairness and accuracy metrics. We argue that a third axis of ``arbitrariness'' should be considered when deploying models to aid decision-making in applications of individual-level impact. To address this challenge, we propose an ensemble algorithm applicable to any fairness intervention that provably ensures more consistent predictions.

Anonymous and Copy-Robust Delegations for Liquid Democracy
Markus Utke Ulrike Schmidt-Kraepelin



Research question: This paper addresses the trade-off between anonymity and copy-robustness in delegation rules for liquid democracy.
Motivation: Existing delegation rules exhibit a tension between anonymity and copy-robustness, motivating the search for a voting rule satisfying both properties simultaneously.
Method: The paper studies two fractional delegation rules, mixed Borda branching and the random walk rule, and uses the Markov chain tree theorem to show they are equivalent and simultaneously satisfy generalized versions of the two properties. Combining the same theorem with Fulkerson's algorithm yields a polynomial-time algorithm for computing the outcome of the studied delegation rule.
Results: The algorithm is of independent interest, with applications in semi-supervised learning and graph theory.

Liquid democracy with ranked delegations is a novel voting scheme that unites the practicability of representative democracy with the idealistic appeal of direct democracy: Every voter decides between casting their vote on a question at hand or delegating their voting weight to some other, trusted agent. Delegations are transitive, and since voters may end up in a delegation cycle, they are encouraged to indicate not only a single delegate, but a set of potential delegates and a ranking among them. Based on the delegation preferences of all voters, a delegation rule selects one representative per voter. Previous work has revealed a trade-off between two properties of delegation rules called anonymity and copy-robustness. To overcome this issue we study two fractional delegation rules: Mixed Borda branching, which generalizes a rule satisfying copy-robustness, and the random walk rule, which satisfies anonymity. Using the Markov chain tree theorem, we show that the two rules are in fact equivalent, and simultaneously satisfy generalized versions of the two properties. Combining the same theorem with Fulkerson's algorithm, we develop a polynomial-time algorithm for computing the outcome of the studied delegation rule. This algorithm is of independent interest, having applications in semi-supervised learning and graph theory.

Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
Suraj Srinivas Sebastian Bordt Himabindu Lakkaraju



Research question: This paper seeks to explain why the input gradients of robust computer vision models align with human perception, a phenomenon known as perceptually-aligned gradients (PAGs).
Motivation: Despite being trained only for classification, robust models with PAGs exhibit rudimentary generative capabilities, including image generation, denoising, and in-painting; the mechanisms behind these phenomena remain unknown.
Method: The paper provides a first explanation of PAGs via off-manifold robustness, which states that models must be more robust off the data manifold than on it. It first shows theoretically that off-manifold robustness causes input gradients to lie approximately on the data manifold, explaining their perceptual alignment; it then shows that Bayes optimal models satisfy off-manifold robustness and confirms this empirically for robust models trained with gradient-norm regularization, randomized smoothing, and adversarial training with projected gradient descent.
Results: Quantifying the alignment of model gradients via their similarity to generative-model gradients, the paper finds that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, it identifies three regimes that affect both perceptual alignment and model accuracy: weak robustness, Bayes-aligned robustness, and excessive robustness.

One of the remarkable properties of robust computer vision models is that their input-gradients are often aligned with human perception, referred to in the literature as perceptually-aligned gradients (PAGs). Despite being trained only for classification, robust models with PAGs have rudimentary generative capabilities, including image generation, denoising, and in-painting. However, the underlying mechanisms behind these phenomena remain unknown. In this work, we provide a first explanation of PAGs via \emph{off-manifold robustness}, which states that models must be more robust off the data manifold than they are on-manifold. We first demonstrate theoretically that off-manifold robustness leads input gradients to lie approximately on the data manifold, explaining their perceptual alignment. We then show that Bayes optimal models satisfy off-manifold robustness, and confirm the same empirically for robust models trained via gradient norm regularization, randomized smoothing, and adversarial training with projected gradient descent. Quantifying the perceptual alignment of model gradients via their similarity with the gradients of generative models, we show that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, we identify three different regimes of robustness that affect both perceptual alignment and model accuracy: weak robustness, Bayes-aligned robustness, and excessive robustness. Code is available at https://github.com/tml-tuebingen/pags.

Auditing for Human Expertise
Rohan Alur Loren Laine Darrick K Li Manish Raghavan Devavrat Shah Dennis Shung



Research question: In high-stakes prediction tasks, expert judgment often competes with algorithms, raising the question of whether human experts add value that an algorithmic predictor could not capture.
Motivation: To answer this, the authors develop a statistical framework under which the question can be posed as a natural hypothesis test.
Method: They propose a simple procedure that tests whether expert predictions are statistically independent of the outcomes of interest after conditioning on the available inputs ("features"); a rejection of the test suggests that human experts may add value to any algorithm trained on the available data.
Results: On admissions data collected from the emergency department of a large academic hospital system, physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding appear to incorporate information unavailable to a standard algorithmic screening tool, even though the screening tool is arguably more accurate than physicians' discretionary decisions; this indicates that, even absent normative concerns about accountability or interpretability, accuracy alone is insufficient to justify algorithmic automation.

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises a natural question whether human experts add value which could not be captured by an algorithmic predictor. We develop a statistical framework under which we can pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs (‘features’). A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI ‘complementarity’ is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians’ admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to be incorporating information that is not available to a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians’ discretionary decisions, highlighting that – even absent normative concerns about accountability or interpretability – accuracy is insufficient to justify algorithmic automation.
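A hedged sketch of a test in this spirit follows: regress the outcome on the available features, then check by permutation whether expert predictions still correlate with the residuals. The paper's exact procedure may differ; `GradientBoostingClassifier` and the residual-correlation statistic are our illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

def expertise_pvalue(X, expert_pred, y, n_perm=2000, seed=0):
    """Permutation test: do expert predictions correlate with what a
    feature-based model cannot explain (its residuals)?"""
    rng = np.random.default_rng(seed)
    p_hat = cross_val_predict(GradientBoostingClassifier(), X, y,
                              cv=5, method="predict_proba")[:, 1]
    resid = y - p_hat  # what the algorithm misses
    stat = abs(np.corrcoef(expert_pred, resid)[0, 1])
    null = np.array([abs(np.corrcoef(rng.permutation(expert_pred),
                                     resid)[0, 1])
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)  # permutation p-value
```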

Topological Parallax: A Geometric Specification for Deep Perception Models
Abraham David Smith Michael J. Catanzaro Gabrielle Angeloro Nirav Patel Paul Bendich



Research question: How to determine whether a trained model and a reference dataset share similar multiscale geometric structure, toward safer and more robust AI systems.
Motivation: In typical deep-learning applications an explicit geometric description of the model is unavailable, yet geometric similarity between dataset and model is essential to trustworthy interpolation and perturbation.
Method: Introduce topological parallax as a theoretical and computational tool that estimates topological features of the model (components, cycles, voids, etc.) by examining the effect of geodesic distortions on the Rips complex of the reference dataset.
Results: Proofs and examples show that this geometric similarity between dataset and model is essential to trustworthy interpolation and perturbation, and the authors conjecture that the new concept will add value to the current debate on the unclear relationship between "overfitting" and "generalization" in deep learning.

For safety and robustness of AI systems, we introduce _topological parallax_ as a theoretical and computational tool that compares a trained model to a reference dataset to determine whether they have similar multiscale geometric structure. Our proofs and examples show that this geometric similarity between dataset and model is essential to trustworthy interpolation and perturbation, and we conjecture that this new concept will add value to the current debate regarding the unclear relationship between "overfitting" and "generalization" in applications of deep learning. In typical deep-learning applications, an explicit geometric description of the model is impossible, but parallax can estimate topological features (components, cycles, voids, etc.) in the model by examining the effect on the Rips complex of geodesic distortions using the reference dataset. Thus, parallax indicates whether the model shares similar multiscale geometric features with the dataset. Parallax presents theoretically via topological data analysis [TDA] as a bi-filtered persistence module, and the key properties of this module are stable under perturbation of the reference dataset.

On the Gini-impurity Preservation For Privacy Random Forests
XinRan Xie Man-Jie Yuan Xuetong Bai Wei Gao Zhi-Hua Zhou



Research question: This paper proposes a new encryption scheme that preserves the Gini impurity of data for random forests.
Motivation: Although many techniques exist for preserving the privacy of random forests, few take into account crucial ingredients of the learning algorithm itself.
Method: The basic idea is to modify the structure of the binary search tree to store several examples in each node, and to encrypt data features by incorporating label and order information.
Results: Theoretically, the scheme preserves the minimum Gini impurity in ciphertexts without decryption, with security guarantees for the encryption; experiments demonstrate its effectiveness, efficiency, and security.

Random forests have been one of the most successful ensemble algorithms in machine learning. Various techniques, from anonymization and differential privacy to homomorphic encryption, have been utilized to preserve the privacy of random forests, whereas they rarely take into account crucial ingredients of the learning algorithm. This work presents a new encryption scheme to preserve data's Gini impurity, which plays a crucial role during the construction of random forests. Our basic idea is to modify the structure of the binary search tree to store several examples in each node, and encrypt data features by incorporating label and order information. Theoretically, we prove that our scheme preserves the minimum Gini impurity in ciphertexts without decrypting, and present the security guarantee for encryption. For random forests, we encrypt data features based on our Gini-impurity-preserving scheme, and adopt the homomorphic encryption scheme CKKS to encrypt data labels due to their importance and privacy. We conduct extensive experiments to show the effectiveness, efficiency and security of our proposed method.

Uncertainty Quantification over Graph with Conformalized Graph Neural Networks
Kexin Huang Ying Jin Emmanuel Candes Jure Leskovec



Research question: How to equip graph neural networks with rigorous uncertainty estimates so they can be deployed reliably in settings where errors are costly.
Motivation: Existing graph neural networks lack rigorous uncertainty estimates, limiting their reliable deployment where the cost of errors is significant.
Method: Propose conformalized GNN (CF-GNN), which extends conformal prediction (CP) to graph-based models to obtain guaranteed uncertainty estimates.
Results: Experiments show that CF-GNN achieves any pre-defined target marginal coverage while reducing prediction set/interval size by up to 74% over baselines, and empirically attains satisfactory conditional coverage over various raw and network features.

Graph Neural Networks (GNNs) are powerful machine learning prediction models on graph-structured data. However, GNNs lack rigorous uncertainty estimates, limiting their reliable deployment in settings where the cost of errors is significant. We propose conformalized GNN (CF-GNN), extending conformal prediction (CP) to graph-based models for guaranteed uncertainty estimates. Given an entity in the graph, CF-GNN produces a prediction set/interval that provably contains the true label with pre-defined coverage probability (e.g. 90%). We establish a permutation invariance condition that enables the validity of CP on graph data and provide an exact characterization of the test-time coverage. Moreover, besides valid coverage, it is crucial to reduce the prediction set size/interval length for practical use. We observe a key connection between non-conformity scores and network structures, which motivates us to develop a topology-aware output correction model that learns to update the prediction and produces more efficient prediction sets/intervals. Extensive experiments show that CF-GNN achieves any pre-defined target marginal coverage while significantly reducing the prediction set/interval size by up to 74% over the baselines. It also empirically achieves satisfactory conditional coverage over various raw and network features.
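The following minimal sketch shows the split conformal prediction step that CF-GNN builds on; CF-GNN's contributions (the permutation-invariance condition for graphs and the topology-aware correction model) sit on top of this and are not shown.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for classification with
    nonconformity score 1 - p(true class); coverage is ~(1 - alpha)
    under exchangeability."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # A label enters the set iff its score is within the quantile.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```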

Differentially Private Image Classification by Learning Priors from Random Processes
Xinyu Tang Ashwinee Panda Vikash Sehwag Prateek Mittal



Research question: In privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) performs worse than SGD because of per-sample gradient clipping and noise addition.
Motivation: A recent focus of private learning research is improving DP-SGD on private data by incorporating priors learned from real-world public data.
Method: The paper explores improving the privacy-utility trade-off of DP-SGD by learning priors from images generated by random processes and transferring these priors to private data, proposing DP-RandP, a three-phase approach.
Results: Training from scratch on CIFAR10, CIFAR100, MedMNIST, and ImageNet across privacy budgets ε ∈ [1, 8], the method attains new state-of-the-art accuracy; in particular, at ε = 1 it improves the previous best reported CIFAR10 accuracy from 60.6% to 72.3%.

In privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) performs worse than SGD due to per-sample gradient clipping and noise addition. A recent focus in private learning research is improving the performance of DP-SGD on private data by incorporating priors that are learned on real-world public data. In this work, we explore how we can improve the privacy-utility tradeoff of DP-SGD by learning priors from images generated by random processes and transferring these priors to private data. We propose DP-RandP, a three-phase approach. We attain new state-of-the-art accuracy when training from scratch on CIFAR10, CIFAR100, MedMNIST and ImageNet for a range of privacy budgets $\varepsilon \in [1, 8]$. In particular, we improve the previous best reported accuracy on CIFAR10 from $60.6\%$ to $72.3\%$ for $\varepsilon=1$.
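For context, the sketch below shows the standard DP-SGD update the abstract refers to, with per-sample clipping and Gaussian noise; DP-RandP's three phases concern how the model is warm-started with random-process priors, not this mechanism.

```python
import numpy as np

def dp_sgd_update(per_sample_grads, lr=0.1, clip_norm=1.0,
                  noise_mult=1.0, rng=None):
    """One DP-SGD step: clip each per-sample gradient to clip_norm,
    sum, then add Gaussian noise scaled to the clipping sensitivity."""
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(
        1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy = clipped.sum(axis=0) + rng.normal(
        scale=noise_mult * clip_norm, size=per_sample_grads.shape[1])
    return -lr * noisy / len(per_sample_grads)  # averaged update
```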

A Scalable Neural Network for DSIC Affine Maximizer Auction Design
Zhijian Duan Haoran Sun Yurong Chen Xiaotie Deng



Research question: How to use machine learning to design automated auctions with empirically high revenue.
Motivation: Existing approaches to multi-item auction scenarios either cannot strictly guarantee dominant-strategy incentive compatibility (DSIC) or face scalability issues due to the large number of allocation candidates.
Method: Propose AMenuNet, a scalable neural network that constructs the AMA parameters (including the allocation menu) from bidder and item representations.
Results: Experiments show that AMenuNet outperforms strong baselines in both contextual and non-contextual multi-item auctions, scales well to larger auctions, generalizes well across settings, and identifies useful deterministic allocations, offering an effective solution to automated DSIC auction design.

Automated auction design aims to find empirically high-revenue mechanisms through machine learning. Existing works on multi-item auction scenarios can be roughly divided into RegretNet-like and affine maximizer auctions (AMAs) approaches. However, the former cannot strictly ensure dominant strategy incentive compatibility (DSIC), while the latter faces scalability issues due to the large number of allocation candidates. To address these limitations, we propose AMenuNet, a scalable neural network that constructs the AMA parameters (even including the allocation menu) from bidder and item representations. AMenuNet is always DSIC and individually rational (IR) due to the properties of AMAs, and it enhances scalability by generating candidate allocations through a neural network. Additionally, AMenuNet is permutation equivariant, and its number of parameters is independent of auction scale. We conduct extensive experiments to demonstrate that AMenuNet outperforms strong baselines in both contextual and non-contextual multi-item auctions, scales well to larger auctions, generalizes well to different settings, and identifies useful deterministic allocations. Overall, our proposed approach offers an effective solution to automated DSIC auction design, with improved scalability and strong revenue performance in various settings.

Vulnerabilities in Video Quality Assessment Models: The Challenge of Adversarial Attacks
Aoxiang Zhang Yu Ran Weixuan Tang Yuan-Gen Wang



Research question: This paper evaluates the robustness of no-reference video quality assessment (NR-VQA) models against adversarial attacks and proposes a patch-based random search method for black-box attacks.
Motivation: Evaluating the robustness of NR-VQA models is essential for building reliable and practical assessment systems, yet the issue has received little attention in the academic community.
Method: A novel loss function, the Score-Reversed Boundary Loss, is designed to push the estimated quality score far from its ground truth toward a specific boundary under a just-noticeable-difference (JND) constraint, enabling effective and imperceptible white-box and black-box attacks.
Results: Experiments show that both white-box and black-box attacks can be launched against NR-VQA models in an effective and imperceptible manner.

No-Reference Video Quality Assessment (NR-VQA) plays an essential role in improving the viewing experience of end-users. Driven by deep learning, recent NR-VQA models based on Convolutional Neural Networks (CNNs) and Transformers have achieved outstanding performance. To build a reliable and practical assessment system, it is of great necessity to evaluate their robustness. However, such issue has received little attention in the academic community. In this paper, we make the first attempt to evaluate the robustness of NR-VQA models against adversarial attacks, and propose a patch-based random search method for black-box attack. Specifically, considering both the attack effect on quality score and the visual quality of adversarial video, the attack problem is formulated as misleading the estimated quality score under the constraint of just-noticeable difference (JND). Built upon such formulation, a novel loss function called Score-Reversed Boundary Loss is designed to push the adversarial video’s estimated quality score far away from its ground-truth score towards a specific boundary, and the JND constraint is modeled as a strict $L_2$ and $L_\infty$ norm restriction. By this means, both white-box and black-box attacks can be launched in an effective and imperceptible manner. The source code is available at https://github.com/GZHU-DVL/AttackVQA.
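A hedged sketch of the loss idea follows: the estimated score is pushed toward the boundary opposite the ground truth, while an $L_\infty$ projection models the JND constraint. The exact functional form in the paper may differ; this is an illustration only.

```python
import torch

def score_reversed_boundary_loss(pred_score, gt_score, s_min=0.0, s_max=1.0):
    """Push predicted quality scores toward the boundary opposite
    their ground truth (illustrative form, not the paper's exact loss)."""
    mid = (s_min + s_max) / 2.0
    target = torch.where(gt_score >= mid,
                         torch.full_like(pred_score, s_min),
                         torch.full_like(pred_score, s_max))
    return torch.mean((pred_score - target) ** 2)

def project_jnd(adv, clean, eps=4.0 / 255.0):
    """Model the JND constraint as an L_inf ball around the clean video."""
    return clean + torch.clamp(adv - clean, -eps, eps)
```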

A One-Size-Fits-All Approach to Improving Randomness in Paper Assignment
Yixuan Even Xu Steven Jecmen Zimeng Song Fei Fang



Research question: How to assign papers to reviewers in the peer-review processes of large publication venues while meeting multiple desiderata: reviewer expertise, robustness to malicious behavior, the ability to evaluate alternative paper assignments, reviewer diversity, and reviewer anonymity.
Motivation: It is unclear how one should randomize the paper assignment to best satisfy all of these considerations simultaneously.
Method: This paper presents a practical, one-size-fits-all method for randomized paper assignment intended to perform well across different motivations for randomness.
Results: Theory and experiments show the method outperforms currently deployed randomized assignment methods on several intuitive randomness metrics, demonstrating that the randomized assignments it produces are general-purpose.

The assignment of papers to reviewers is a crucial part of the peer review processes of large publication venues, where organizers (e.g., conference program chairs) rely on algorithms to perform automated paper assignment. As such, a major challenge for the organizers of these processes is to specify paper assignment algorithms that find appropriate assignments with respect to various desiderata. Although the main objective when choosing a good paper assignment is to maximize the expertise of each reviewer for their assigned papers, several other considerations make introducing randomization into the paper assignment desirable: robustness to malicious behavior, the ability to evaluate alternative paper assignments, reviewer diversity, and reviewer anonymity. However, it is unclear in what way one should randomize the paper assignment in order to best satisfy all of these considerations simultaneously. In this work, we present a practical, one-size-fits-all method for randomized paper assignment intended to perform well across different motivations for randomness. We show theoretically and experimentally that our method outperforms currently-deployed methods for randomized paper assignment on several intuitive randomness metrics, demonstrating that the randomized assignments produced by our method are general-purpose.

Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness Interventions
Hao Wang Luxi He Rui Gao Flavio Calmon



Research question: Machine learning models can underperform on certain population groups due to choices made during model development and bias inherent in the data.
Motivation: The paper categorizes sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions made during model development.
Method: Aleatoric discrimination is quantified by determining a model's performance limits under fairness constraints, assuming perfect knowledge of the data distribution; epistemic discrimination is then measured as the gap between a model's accuracy under fairness constraints and the limit posed by aleatoric discrimination.
Results: The results indicate that state-of-the-art fairness interventions effectively remove epistemic discrimination on standard (overused) tabular datasets, but when data has missing values there remains significant room for improvement in handling aleatoric discrimination.

Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions made during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy when fairness constraints are applied and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing fairness interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination on standard (overused) tabular datasets. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.

Evaluating and Inducing Personality in Pre-trained Language Models
Guangyuan Jiang Manjie Xu Song-Chun Zhu Wenjuan Han Chi Zhang Yixin Zhu



Research question: Can machine behaviors be evaluated in a principled and quantitative manner, and can a specific personality be induced in machine learning models?
Motivation: Drawing on human personality theory, the paper uses psychometric tools to study machine behaviors systematically, as a step toward building human-like social machines.
Method: Propose the Machine Personality Inventory (MPI), a standardized evaluation built on the Big Five Personality Factors theory and personality assessment inventories, and devise a Personality Prompting (P$^2$) method to induce specific personalities in LLMs in a controllable way.
Results: The paper provides the first evidence of MPI's efficacy for studying LLM behaviors and achieves controllable induction of diverse, verifiable personality-oriented behaviors.

Standardized and quantified evaluation of machine behaviors is a crux of understanding LLMs. In this study, we draw inspiration from psychometric studies by leveraging human personality theory as a tool for studying machine behaviors. Originating as a philosophical quest for human behaviors, the study of personality delves into how individuals differ in thinking, feeling, and behaving. Toward building and understanding human-like social machines, we are motivated to ask: Can we assess machine behaviors by leveraging human psychometric tests in a **principled** and **quantitative** manner? If so, can we induce a specific personality in LLMs? To answer these questions, we introduce the Machine Personality Inventory (MPI) tool for studying machine behaviors; MPI follows standardized personality tests, built upon the Big Five Personality Factors (Big Five) theory and personality assessment inventories. By systematically evaluating LLMs with MPI, we provide the first piece of evidence demonstrating the efficacy of MPI in studying LLMs behaviors. We further devise a Personality Prompting (P$^2$) method to induce LLMs with specific personalities in a **controllable** way, capable of producing diverse and verifiable behaviors. We hope this work sheds light on future studies by adopting personality as the essential indicator for various downstream tasks, and could further motivate research into equally intriguing human-like machine behaviors.

Adversarial Examples Might be Avoidable: The Role of Data Concentration in Adversarial Robustness
Ambar Pal Jeremias Sulam Rene Vidal



Research question: Does the susceptibility of modern machine learning classifiers to adversarial examples mean such examples are unavoidable?
Motivation: Theoretical results suggesting adversarial examples are unavoidable may be too general to apply to natural data distributions, and humans are quite robust on vision tasks, creating an apparent conflict with those results.
Method: The paper shows theoretically that a key property of the data distribution, concentration on small-volume subsets of the input space, determines whether a robust classifier exists; for data concentrated on a union of low-dimensional linear subspaces, exploiting the data structure naturally yields classifiers with good robustness guarantees, improving on provable certification methods in certain regimes.
Results: The findings show that whether adversarial examples are avoidable depends on properties of the data distribution, offering a new perspective on human robustness in vision tasks and guidance for designing robust classifiers.

The susceptibility of modern machine learning classifiers to adversarial examples has motivated theoretical results suggesting that these might be unavoidable. However, these results can be too general to be applicable to natural data distributions. Indeed, humans are quite robust for tasks involving vision. This apparent conflict motivates a deeper dive into the question: Are adversarial examples truly unavoidable? In this work, we theoretically demonstrate that a key property of the data distribution -- concentration on small-volume subsets of the input space -- determines whether a robust classifier exists. We further demonstrate that, for a data distribution concentrated on a union of low-dimensional linear subspaces, exploiting data structure naturally leads to classifiers that enjoy good robustness guarantees, improving upon methods for provable certification in certain regimes.

Smooth Flipping Probability for Differential Private Sign Random Projection Methods
Ping Li Xiaoyun Li



Research question: How to develop a family of differential privacy (DP) algorithms from random projection (RP) and sign random projection (SignRP) methods.
Motivation: Improve existing DP-RP methods using the optimal Gaussian mechanism, and exploit the robustness of the sign flipping probability of random projections to propose a family of DP-SignRP algorithms.
Method: The paper first improves prior DP-RP approaches, then proposes DP-SignRP algorithms that leverage the robustness of the sign flipping probability: under a small modification of the data u, sign(x) is flipped only with fairly small probability. This robustness motivates the design of a "smooth flipping probability," giving SignRP-type algorithms better utility than the standard randomized response mechanism.
Results: Retrieval and classification experiments show that, among the proposed DP-RP algorithms, DP-SignOPORP (where OPORP improves on the celebrated count-sketch algorithm) performs best. Since the proposed algorithms significantly improve performance, the authors anticipate wider adoption of DP in practice; and because the methods apply to the original data (i.e., feature vectors), the privacy of downstream tasks is naturally protected.

We develop a series of differential privacy (DP) algorithms from a family of random projection (RP) and sign random projection (SignRP) methods. We first show how to improve the previous DP-RP approach using the ``optimal Gaussian mechanism''. Then, we propose a series of DP-SignRP algorithms that leverage the robustness of the ``sign flipping probability'' of random projections. That is, given $x = \sum_{i=1}^p u_i w_{i}$ where $u$ is a $p$-dimensional data vector and $w$ is a symmetric random vector, $\mathrm{sign}(x)$ only has a fairly small probability to be flipped if there is a small modification on data $u$, depending on the specific distribution of $w$. This robustness leads to our novel design of ``smooth flipping probability'' for SignRP-type algorithms with better utility than using the standard randomized response mechanism. Retrieval and classification experiments demonstrate that, among the presented DP-RP algorithms, \textbf{DP-SignOPORP} (where OPORP is an improvement over the celebrated count-sketch algorithm) performs the best in general. In industrial practice, DP methods have not been very popular for machine learning or search, largely because performance typically drops substantially once DP is applied. Since our proposed new DP algorithms have significantly improved the performance, it is anticipated that our work will motivate a wide adoption of DP in practice. Finally, we stress that, since our methods are applied to the original data (i.e., feature vectors), the privacy of downstream tasks is naturally protected.
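As a baseline illustration, the sketch below privatizes sign random projections with constant-rate randomized response; the paper's smooth flipping probability replaces this constant flip rate with one calibrated to each projection's sensitivity, which is the source of its utility gains.

```python
import numpy as np

def dp_sign_rp(u, W, epsilon, rng=None):
    """Sign random projection privatized with plain randomized
    response (a constant flip rate), shown only as a baseline."""
    rng = rng or np.random.default_rng(0)
    signs = np.sign(W @ u)              # W: (k, p) random projection matrix
    eps_bit = epsilon / len(signs)      # naive composition over k sign bits
    p_keep = np.exp(eps_bit) / (1.0 + np.exp(eps_bit))
    flip = rng.random(len(signs)) > p_keep
    return np.where(flip, -signs, signs)
```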

Data Market Design through Deep Learning
Sai Srivatsa Ravindranath Yanchen Jiang David C. Parkes



Research question: How to design a data market that maximizes the expected revenue of the information seller.
Motivation: In economic theory, the data market design problem is to find a set of signaling schemes (statistical experiments) that maximize the seller's expected revenue, where each experiment reveals some of the information known to the seller and has a corresponding price.
Method: The paper introduces deep learning for the design of revenue-optimal data markets, expanding the frontier of what can be understood and achieved. Relative to earlier work on deep learning for auction design, one must learn signaling schemes rather than allocation rules and handle obedience constraints, which arise from modeling buyers' downstream actions, in addition to incentive constraints on bids.
Results: Experiments show that the new deep learning framework can almost precisely replicate all known theoretical solutions, extends to more complex settings, and can be used to establish the optimality of new designs for data markets and to form conjectures about the structure of optimal designs.

The _data market design_ problem is a problem in economic theory to find a set of signaling schemes (statistical experiments) to maximize expected revenue to the information seller, where each experiment reveals some of the information known to a seller and has a corresponding price. Each buyer has their own decision to make in a world environment, and their subjective expected value for the information associated with a particular experiment comes from the improvement in this decision and depends on their prior and value for different outcomes. In a setting with multiple buyers, a buyer's expected value for an experiment may also depend on the information sold to others. We introduce the application of deep learning for the design of revenue-optimal data markets, looking to expand the frontiers of what can be understood and achieved. Relative to earlier work on deep learning for auction design, we must learn signaling schemes rather than allocation rules and handle _obedience constraints_ — these arising from modeling the downstream actions of buyers — in addition to incentive constraints on bids. Our experiments demonstrate that this new deep learning framework can almost precisely replicate all known solutions from theory, expand to more complex settings, and be used to establish the optimality of new designs for data markets and make conjectures in regard to the structure of optimal designs.

ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP
Lu Yan ZHUO ZHANG Guanhong Tao Kaiyuan Zhang Xuan Chen Guangyu Shen Xiangyu Zhang



Research question: This paper addresses backdoor attacks on natural language processing (NLP) models, particularly the stealthier style-based attacks.
Motivation: Current detection mechanisms cannot handle more covert backdoor strategies such as style-based attacks, so the paper proposes an innovative test-time poisoned-sample detection framework.
Method: ChatGPT, a state-of-the-art large language model, is used as the paraphraser, and trigger removal is cast as a prompt-engineering problem; fuzzing is adopted to discover optimal paraphrase prompts that effectively eliminate triggers while preserving input semantics.
Results: Experiments on 4 types of backdoor attacks and 4 distinct datasets show the approach surpasses baselines including STRIP, RAP, and ONION in both precision and recall.

Backdoor attacks have emerged as a prominent threat to natural language processing (NLP) models, where the presence of specific triggers in the input can lead poisoned models to misclassify these inputs to predetermined target classes. Current detection mechanisms are limited by their inability to address more covert backdoor strategies, such as style-based attacks. In this work, we propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) are not supposed to fundamentally alter the underlying semantic meanings of poisoned samples as they want to stay stealthy. Based on this observation, we hypothesize that while the model's predictions for paraphrased clean samples should remain stable, predictions for poisoned samples should revert to their true labels upon the mutations applied to triggers during the paraphrasing process. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem. We adopt fuzzing, a technique commonly used for unearthing software vulnerabilities, to discover optimal paraphrase prompts that can effectively eliminate triggers while concurrently maintaining input semantics. Experiments on 4 types of backdoor attacks, including the subtle style backdoors, and 4 distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall.

Equal Opportunity of Coverage in Fair Regression
Fangxin Wang Lu Cheng Ruocheng Guo Kay Liu Philip S. Yu



Research question: This work studies fair machine learning under predictive uncertainty, toward reliable and trustworthy decision-making.
Motivation: The seminal "equalized coverage" notion is uncertainty-aware, but it does not guarantee equal coverage for finer-grained groups (e.g., low-income females) conditioned on the true label, and it is biased in its assessment of uncertainty.
Method: The paper proposes a new uncertainty-aware fairness notion, Equal Opportunity of Coverage (EOC), aiming for two properties: (1) coverage rates for groups with similar outcomes are close, and (2) coverage for the entire population remains at a predetermined level, with prediction intervals kept narrow to stay informative. It further proposes Binned Fair Quantile Regression (BFQR), a distribution-free post-processing method that improves EOC with reasonable interval width for any trained ML model.
Results: Experimental results demonstrate the method's effectiveness in improving EOC.

We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making. The seminal work of 'equalized coverage' proposed an uncertainty-aware fairness notion. However, it does not guarantee equal coverage rates across more fine-grained groups (e.g., low-income females) conditioning on the true label and is biased in the assessment of uncertainty. To tackle these limitations, we propose a new uncertainty-aware fairness -- Equal Opportunity of Coverage (EOC) -- that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level. Further, the prediction intervals should be narrow to be informative. We propose Binned Fair Quantile Regression (BFQR), a distribution-free post-processing method to improve EOC with reasonable width for any trained ML models. It first calibrates a hold-out set to bound deviation from EOC, then leverages conformal prediction to maintain EOC on a test set, meanwhile optimizing prediction interval width. Experimental results demonstrate the effectiveness of our method in improving EOC.

On the Robustness of Removal-Based Feature Attributions
Chris Lin Ian Connick Covert Su-In Lee



Research question: Existing feature attribution methods are sensitive to input and model perturbations and lack robustness.
Motivation: To address this, the work theoretically analyzes the robustness of removal-based feature attribution methods.
Method: Through a unified analysis, the paper derives upper bounds on the difference between intact and perturbed attributions under both input and model perturbations.
Results: Experiments on synthetic and real-world data validate the theoretical results and demonstrate their practical implications, including the ability to increase attribution robustness by improving the model's Lipschitz regularity.

To explain predictions made by complex machine learning models, many feature attribution methods have been developed that assign importance scores to input features. Some recent work challenges the robustness of these methods by showing that they are sensitive to input and model perturbations, while other work addresses this issue by proposing robust attribution methods. However, previous work on attribution robustness has focused primarily on gradient-based feature attributions, whereas the robustness of removal-based attribution methods is not currently well understood. To bridge this gap, we theoretically characterize the robustness properties of removal-based feature attributions. Specifically, we provide a unified analysis of such methods and derive upper bounds for the difference between intact and perturbed attributions, under settings of both input and model perturbations. Our empirical results on synthetic and real-world data validate our theoretical results and demonstrate their practical implications, including the ability to increase attribution robustness by improving the model’s Lipschitz regularity.
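The sketch below shows the simplest member of the removal-based family the analysis covers, single-feature occlusion against a baseline; the paper's bounds concern how such scores move under input and model perturbations.

```python
import numpy as np

def occlusion_attributions(f, x, baseline):
    """Score feature j by the output drop when x[j] is replaced by a
    baseline value ("removed"). `f` maps a batch to scalar outputs."""
    base_out = f(x[None])[0]
    scores = np.empty(x.shape[0])
    for j in range(x.shape[0]):
        x_rm = x.copy()
        x_rm[j] = baseline[j]          # remove feature j
        scores[j] = base_out - f(x_rm[None])[0]
    return scores
```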

Django: Detecting Trojans in Object Detection Models via Gaussian Focus Calibration
Guangyu Shen Siyuan Cheng Guanhong Tao Kaiyuan Zhang Yingqi Liu Shengwei An Shiqing Ma Xiangyu Zhang



Research question: Existing trigger-inversion methods suffer from an optimization-objective misalignment in object detection models, because an injected malicious trigger can affect different bounding boxes to very different degrees.
Motivation: To address this, the paper proposes Django, a novel backdoor detection framework for object detection.
Method: Django uses a dynamic Gaussian weighting scheme that prioritizes more vulnerable victim bounding boxes and assigns appropriate coefficients to calibrate the optimization objective during trigger inversion; a novel label-proposal pre-processing technique is also incorporated to improve efficiency.
Results: Evaluated on 3 object-detection image datasets, 3 model architectures, and 2 attack types, covering 168 models in total, Django outperforms 6 state-of-the-art baselines with up to 38% higher accuracy and 10x less overhead.

Object detection models are vulnerable to backdoor or trojan attacks, where an attacker can inject malicious triggers into the model, leading to altered behavior during inference. As a defense mechanism, trigger inversion leverages optimization to reverse-engineer triggers and identify compromised models. While existing trigger inversion methods assume that each instance from the support set is equally affected by the injected trigger, we observe that the poison effect can vary significantly across bounding boxes in object detection models due to its dense prediction nature, leading to an undesired optimization objective misalignment issue for existing trigger reverse-engineering methods. To address this challenge, we propose the first object detection backdoor detection framework Django (Detecting Trojans in Object Detection Models via Gaussian Focus Calibration). It leverages a dynamic Gaussian weighting scheme that prioritizes more vulnerable victim boxes and assigns appropriate coefficients to calibrate the optimization objective during trigger inversion. In addition, we combine Django with a novel label proposal pre-processing technique to enhance its efficiency. We evaluate Django on 3 object detection image datasets, 3 model architectures, and 2 types of attacks, with a total of 168 models. Our experimental results show that Django outperforms 6 state-of-the-art baselines, with up to 38% accuracy improvement and 10x reduced overhead. The code is available at https://github.com/PurduePAML/DJGO.

Label Poisoning is All You Need
Rishi Dev Jha Jonathan Hayase Sewoong Oh



Research question: This paper asks whether a successful backdoor attack can be launched by corrupting labels alone.
Motivation: In many common machine learning scenarios, training labels are provided by potentially malicious third parties, including crowd-sourced annotation and knowledge distillation; this raises the fundamental question of label-only backdoor attacks.
Method: The authors introduce FLIP, a novel method for designing label-only backdoor attacks, and demonstrate its strengths on three datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32, ResNet-18, VGG-19, and Vision Transformer).
Results: With only 2% of CIFAR-10 labels corrupted, FLIP achieves a near-perfect attack success rate of 99.4% while suffering only a 1.8% drop in clean test accuracy. The approach builds on recent advances in trajectory matching, originally introduced for dataset distillation.

In a backdoor attack, an adversary injects corrupted data into a model's training dataset in order to gain control over its predictions on images with a specific attacker-defined trigger. A typical corrupted training example requires altering both the image, by applying the trigger, and the label. Models trained on clean images, therefore, were considered safe from backdoor attacks. However, in some common machine learning scenarios, the training labels are provided by potentially malicious third-parties. This includes crowd-sourced annotation and knowledge distillation. We, hence, investigate a fundamental question: can we launch a successful backdoor attack by only corrupting labels? We introduce a novel approach to design label-only backdoor attacks, which we call FLIP, and demonstrate its strengths on three datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32, ResNet-18, VGG-19, and Vision Transformer). With only 2% of CIFAR-10 labels corrupted, FLIP achieves a near-perfect attack success rate of 99.4% while suffering only a 1.8% drop in the clean test accuracy. Our approach builds upon the recent advances in trajectory matching, originally introduced for dataset distillation.

Understanding Deep Gradient Leakage via Inversion Influence Functions
Haobo Zhang Junyuan Hong Yuyang Deng Mehrdad Mahdavi Jiayu Zhou



Research question: Distributed learning with deep models suffers serious privacy leakage through shared gradients; how can such attacks be understood and defended against?
Motivation: Deep Gradient Leakage (DGL) attacks can recover private training images from gradient vectors, posing a major challenge for distributed learning from clients with sensitive data.
Method: The paper proposes a novel Inversion Influence Function (I$^2$F) that establishes a closed-form connection between the recovered images and the private gradients by implicitly solving the DGL problem.
Results: Empirically, I$^2$F effectively approximates DGL across model architectures, datasets, attack implementations, and noise-based defenses, and yields insights into effective gradient perturbation directions, the unfairness of privacy protection, and privacy-preferred model initialization.

Deep Gradient Leakage (DGL) is a highly effective attack that recovers private training images from gradient vectors. This attack casts significant privacy challenges on distributed learning from clients with sensitive data, where clients are required to share gradients. Defending against such attacks requires but lacks an understanding of when and how privacy leakage happens, mostly because of the black-box nature of deep networks. In this paper, we propose a novel Inversion Influence Function (I$^2$F) that establishes a closed-form connection between the recovered images and the private gradients by implicitly solving the DGL problem. Compared to directly solving DGL, I$^2$F is scalable for analyzing deep networks, requiring only oracle access to gradients and Jacobian-vector products. We empirically demonstrate that I$^2$F effectively approximates DGL across different model architectures, datasets, attack implementations, and noise-based defenses. With this novel tool, we provide insights into effective gradient perturbation directions, the unfairness of privacy protection, and privacy-preferred model initialization. Our code is available at https://github.com/illidanlab/inversion-influence-function.
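The only oracle access I$^2$F needs beyond gradients is Jacobian-vector products. The sketch below shows how such products (here, a Hessian-vector product of a scalar loss) are computed with reverse-mode autodiff, assuming a single flat parameter tensor; no Hessian is ever materialized.

```python
import torch

def hessian_vector_product(loss_fn, params, v):
    """Hessian-vector product via two reverse-mode passes, using
    only gradient and Jacobian-vector oracle access."""
    grad = torch.autograd.grad(loss_fn(params), params, create_graph=True)[0]
    return torch.autograd.grad(grad, params, grad_outputs=v)[0]

# Toy check: for loss(p) = sum(p^2), the Hessian is 2I, so HVP(v) = 2v.
theta = torch.randn(5, requires_grad=True)
hvp = hessian_vector_product(lambda p: (p ** 2).sum(), theta, torch.ones(5))
```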

Adversarially Robust Learning with Uncertain Perturbation Sets
Tosca Lechner Vinayak Pathak Ruth Urner



Research question: In many real-world settings, the exact perturbation set an adversary will use is not plausibly available to the learner; how should adversarially robust learning proceed under such uncertainty?
Motivation: Prior literature studies the extremes of completely known and completely unknown perturbation sets; this work proposes an in-between setting of learning with respect to a class of perturbation sets.
Method: The authors analyze learnability in this setting, including access to a perfect-attack oracle, and also consider learning with abstention, where a wrong prediction counts as a robustness violation only when it is made within the perturbation set.
Results: The setting improves on previous results for completely unknown perturbation sets while still addressing the lack of perfect knowledge of these sets in practice; the paper gives the first positive results for the learnability of infinite Littlestone classes with a perfect-attack oracle, and shows there are classes for which perturbation-set-unaware learning without query access is possible but abstention is required.

In many real-world settings exact perturbation sets to be used by an adversary are not plausibly available to a learner. While prior literature has studied both scenarios with completely known and completely unknown perturbation sets, we propose an in-between setting of learning with respect to a class of perturbation sets. We show that in this setting we can improve on previous results with completely unknown perturbation sets, while still addressing the concerns of not having perfect knowledge of these sets in real life. In particular, we give the first positive results for the learnability of infinite Littlestone classes when having access to a perfect-attack oracle. We also consider a setting of learning with abstention, where predictions are considered robustness violations, only when the wrong prediction is made within the perturbation set. We show there are classes for which perturbation-set unaware learning without query access is possible, but abstention is required.

DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification
Mintong Kang Dawn Song Bo Li



Research question: How to attack diffusion-based purification defenses effectively and efficiently.
Motivation: Recent studies show that even advanced attacks cannot break such defenses effectively, since the purification process induces an extremely deep computational graph, posing problems of gradient obfuscation, high memory cost, and unbounded randomness.
Method: The paper proposes DiffAttack, a unified framework covering both DDPM and score-based approaches. It introduces a deviated-reconstruction loss at intermediate diffusion steps to induce inaccurate density-gradient estimation and tackle vanishing/exploding gradients, together with a segment-wise forwarding-backwarding algorithm for memory-efficient gradient backpropagation.
Results: Compared with existing adaptive attacks, DiffAttack reduces the robust accuracy of models by over 20% on CIFAR-10 under $\ell_\infty$ attack ($\epsilon=8/255$) and by over 10% on ImageNet under $\ell_\infty$ attack ($\epsilon=4/255$).

Diffusion-based purification defenses leverage diffusion models to remove crafted perturbations of adversarial examples and achieve state-of-the-art robustness. Recent studies show that even advanced attacks cannot break such defenses effectively, since the purification process induces an extremely deep computational graph which poses the potential problem of gradient obfuscation, high memory cost, and unbounded randomness. In this paper, we propose a unified framework DiffAttack to perform effective and efficient attacks against diffusion-based purification defenses, including both DDPM and score-based approaches. In particular, we propose a deviated-reconstruction loss at intermediate diffusion steps to induce inaccurate density gradient estimation to tackle the problem of vanishing/exploding gradients. We also provide a segment-wise forwarding-backwarding algorithm, which leads to memory-efficient gradient backpropagation. We validate the attack effectiveness of DiffAttack compared with existing adaptive attacks on CIFAR-10 and ImageNet. We show that DiffAttack decreases the robust accuracy of models compared with SOTA attacks by over 20\% on CIFAR-10 under $\ell_\infty$ attack $(\epsilon=8/255)$, and over 10\% on ImageNet under $\ell_\infty$ attack $(\epsilon=4/255)$. We conduct a series of ablation studies, and we find 1) DiffAttack with the deviated-reconstruction loss added over uniformly sampled time steps is more effective than that added over only initial/final steps, and 2) diffusion-based purification with a moderate diffusion length is more robust under DiffAttack.

Training on Foveated Images Improves Robustness to Adversarial Attacks
Muhammad A Shah Aqsa Kashaf Bhiksha Raj



Research question: Deep neural networks are vulnerable to adversarial attacks; how can their robustness be improved?
Motivation: Constant exposure to low-fidelity visual stimuli in peripheral vision may be an important contributor to the robustness of human visual perception.
Method: Develop RBlur, an image transform that simulates the loss of fidelity in peripheral vision by blurring the image and reducing its color saturation based on the distance from a given fixation point.
Results: DNNs trained on RBlur-transformed images are substantially more robust to adversarial attacks and other non-adversarial corruptions than DNNs trained on the original images, achieving up to 25% higher accuracy on perturbed data.

Deep neural networks (DNNs) have been shown to be vulnerable to adversarial attacks -- subtle, perceptually indistinguishable perturbations of inputs that change the response of the model. In the context of vision, we hypothesize that an important contributor to the robustness of human visual perception is constant exposure to low-fidelity visual stimuli in our peripheral vision. To investigate this hypothesis, we develop RBlur, an image transform that simulates the loss in fidelity of peripheral vision by blurring the image and reducing its color saturation based on the distance from a given fixation point. We show that compared to DNNs trained on the original images, DNNs trained on images transformed by RBlur are substantially more robust to adversarial attacks, as well as other, non-adversarial, corruptions, achieving up to 25% higher accuracy on perturbed data.
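A hedged sketch of an RBlur-style transform follows: blur strength grows with distance from a fixation point by blending progressively blurred copies of the image. The color-desaturation component and the exact blur schedule of RBlur are omitted; shapes and parameters are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foveated_blur(img, fix_y, fix_x, max_sigma=8.0, n_levels=5):
    """Blend progressively blurred copies of an HxWx3 image so that
    blur grows with distance from the fixation point (fix_y, fix_x)."""
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - fix_y, xx - fix_x)
    dist = dist / dist.max()  # normalize to [0, 1]
    sigmas = [max_sigma * k / (n_levels - 1) for k in range(n_levels)]
    levels = [gaussian_filter(img.astype(float), sigma=(s, s, 0))
              for s in sigmas]  # sigma 0 leaves the image unchanged
    idx = np.minimum((dist * n_levels).astype(int), n_levels - 1)
    out = np.zeros_like(levels[0])
    for k in range(n_levels):  # piecewise-constant radial blend
        out[idx == k] = levels[k][idx == k]
    return out.astype(img.dtype)
```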

Creating a Public Repository for Joining Private Data
James Cook Milind Shyani Nina Mishra



Research question: How can a dataset with sensitive attributes be published in a way that preserves privacy yet enables joins with other datasets on those same sensitive attributes?
Motivation: The problem arises in many contexts; for example, a hospital and an airline may want to jointly determine whether people who take long-haul flights are more likely to catch respiratory infections. Joining their data on a common keyed user identifier such as email address answers the question, but breaks privacy.
Method: The paper shows how the hospital can generate a private sketch and how the airline can privately join with the hospital's sketch by email address; the proposed solution satisfies pure differential privacy and gives approximate answers to linear queries and optimization problems over those joins.
Results: The approach is non-interactive, so the sketch can be published to a repository for any organization to join with, facilitating data discovery; its accuracy is demonstrated through both theoretical analysis and extensive empirical evidence.

How can one publish a dataset with sensitive attributes in a way that both preserves privacy and enables joins with other datasets on those same sensitive attributes? This problem arises in many contexts, e.g., a hospital and an airline may want to jointly determine whether people who take long-haul flights are more likely to catch respiratory infections. If they join their data by a common keyed user identifier such as email address, they can determine the answer, though it breaks privacy. This paper shows how the hospital can generate a private sketch and how the airline can privately join with the hospital's sketch by email address. The proposed solution satisfies pure differential privacy and gives approximate answers to linear queries and optimization problems over those joins. Whereas prior work such as secure function evaluation requires sender/receiver interaction, a distinguishing characteristic of the proposed approach is that it is non-interactive. Consequently, the sketch can be published to a repository for any organization to join with, facilitating data discovery. The accuracy of the method is demonstrated through both theoretical analysis and extensive empirical evidence.

Robust Bayesian Satisficing
Artun Saday Y. Cahit Yıldırım Cem Tekin



Research question: Distributional shifts pose a significant challenge to achieving robustness in contemporary machine learning.
Motivation: To overcome this challenge, robust satisficing (RS) seeks a solution that is robust to an unspecified distributional shift while achieving a utility above a desired threshold.
Method: The paper focuses on RS in contextual Bayesian optimization when the true and reference distributions of the context differ, and proposes RoBOS, a novel robust Bayesian satisficing algorithm for noisy black-box optimization.
Results: RoBOS guarantees sublinear lenient regret under certain assumptions on the amount of distribution shift, and achieves a sublinear upper bound, independent of the shift amount, on a weaker notion called robust satisficing regret. Effectiveness is demonstrated by applying the method to various learning problems and comparing it to other approaches, such as distributionally robust optimization.

Distributional shifts pose a significant challenge to achieving robustness in contemporary machine learning. To overcome this challenge, robust satisficing (RS) seeks a robust solution to an unspecified distributional shift while achieving a utility above a desired threshold. This paper focuses on the problem of RS in contextual Bayesian optimization when there is a discrepancy between the true and reference distributions of the context. We propose a novel robust Bayesian satisficing algorithm called RoBOS for noisy black-box optimization. Our algorithm guarantees sublinear lenient regret under certain assumptions on the amount of distribution shift. In addition, we define a weaker notion of regret called robust satisficing regret, in which our algorithm achieves a sublinear upper bound independent of the amount of distribution shift. To demonstrate the effectiveness of our method, we apply it to various learning problems and compare it to other approaches, such as distributionally robust optimization.

Incentives in Federated Learning: Equilibria, Dynamics, and Mechanisms for Welfare Maximization
Aniket Murhekar Zhuowen Yuan Bhaskar Ray Chaudhury Bo Li Ruta Mehta



Research question: How to enable collaborative model learning while accounting for the privacy and communication costs agents incur from data sharing.
Motivation: Federated learning (FL) is a powerful collaborative learning scheme, but participating agents may incur privacy and communication costs alongside the benefits of the shared model.
Method: The paper models a collaborative FL framework in which every agent seeks an optimal trade-off between learning payoff and data-sharing cost, and designs FedBR-BG, a novel protocol that incorporates a budget-balanced payment mechanism together with best-response dynamics.
Results: On MNIST and CIFAR-10, FedBR-BG outperforms the basic best-response protocol without additional incentivization, the standard FL protocol FedAvg, and the recent baseline MWFed in achieving superior $p$-mean welfare.

Federated learning (FL) has emerged as a powerful scheme to facilitate the collaborative learning of models amongst a set of agents holding their own private data. Although the agents benefit from the global model trained on shared data, by participating in federated learning, they may also incur costs (related to privacy and communication) due to data sharing. In this paper, we model a collaborative FL framework, where every agent attempts to achieve an optimal trade-off between her learning payoff and data sharing cost. We show the existence of Nash equilibrium (NE) under mild assumptions on agents' payoff and costs. Furthermore, we show that agents can discover the NE via best response dynamics. However, some of the NE may be bad in terms of overall welfare for the agents, implying little incentive for some fraction of the agents to participate in the learning. To remedy this, we design a budget-balanced mechanism involving payments to the agents, that ensures that any $p$-mean welfare function of the agents' utilities is maximized at NE. In addition, we introduce a FL protocol FedBR-BG that incorporates our budget-balanced mechanism, utilizing best response dynamics. Our empirical validation on MNIST and CIFAR-10 substantiates our theoretical analysis. We show that FedBR-BG outperforms the basic best-response-based protocol without additional incentivization, the standard federated learning protocol FedAvg, as well as a recent baseline MWFed in terms of achieving superior $p$-mean welfare.

Robust and Actively Secure Serverless Collaborative Learning
Nicholas Franzese Adam Dziedzic Christopher A. Choquette-Choo Mark R. Thomas Muhammad Ahmad Kaleem Stephan Rabanser Congyu Fang Somesh Jha Nicolas Papernot Xiao Wang



Research question: how to realize a secure peer-to-peer learning scheme that defends against malicious servers and resists malicious clients.
Motivation: current collaborative machine learning approaches, while learning better models from distributed data, risk the server or the clients abusing their power.
Method: a peer-to-peer (P2P) learning scheme is proposed that guarantees security by transforming any compatible model-update aggregation algorithm to a setting that withstands both malicious servers and malicious clients.
Results: the method remains highly computationally efficient even when training models with one million parameters on standard datasets.

Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data. While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both, deviating from the protocol. Indeed, because the protocol is asymmetric, a malicious server can abuse its power to reconstruct client data points. Conversely, malicious clients can corrupt learning with malicious updates. Thus, both clients and servers require a guarantee when the other cannot be trusted to fully cooperate. In this work, we propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients. Our core contribution is a generic framework that transforms any (compatible) algorithm for robust aggregation of model updates to the setting where servers and clients can act maliciously. Finally, we demonstrate the computational efficiency of our approach even with 1-million parameter models trained by 100s of peers on standard datasets.

Recommender Systems with Generative Retrieval
Shashank Rajput Nikhil Mehta Anima Singh Raghunandan Hulikal Keshavan Trung Vu Lukasz Heldt Lichan Hong Yi Tay Vinh Q. Tran Jonah Samost Maciej Kula Ed H. Chi Maheswaran Sathiamoorthy



Research question: how to perform retrieval in recommender systems generatively, by autoregressively decoding item identifiers instead of searching a nearest-neighbor index.
Motivation: modern recommenders embed queries and items into one unified space and run approximate nearest neighbor search over it; a generative retrieval model could instead decode the target candidates directly.
Method: each item is assigned a semantically meaningful tuple of codewords (its Semantic ID), and a Transformer-based sequence-to-sequence model is trained on the Semantic IDs in a user session to predict the Semantic ID of the next item the user will interact with.
Results: recommenders trained with this paradigm significantly outperform current SOTA models on various datasets, and Semantic IDs improve generalization, notably retrieval of items with no prior interaction history.

Modern recommender systems perform large-scale retrieval by embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create a semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.
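A hedged sketch of how such Semantic IDs could be built via residual quantization is shown below. The paper uses a learned quantizer; plain k-means residual quantization, the codebook size, and the number of levels here are stand-in assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_ids(item_embeddings, levels=3, codebook_size=256, seed=0):
    """Residual quantization sketch: each level clusters the residual left by
    the previous level, yielding a tuple of codewords (a Semantic ID) per item."""
    residual = item_embeddings.astype(np.float64).copy()
    codes = []
    for level in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=4,
                    random_state=seed + level).fit(residual)
        codes.append(km.labels_)
        residual -= km.cluster_centers_[km.labels_]
    return np.stack(codes, axis=1)  # shape: (num_items, levels)
```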

Scalable Fair Influence Maximization
Xiaobin Rui Zhixiao Wang Jiayu Zhao Lichao Sun Wei Chen



Research question: given a graph $G$, a community structure $\mathcal{C}$, and a budget $k$, fair influence maximization selects a seed set $S$ ($|S|\leq k$) that maximizes influence spread while narrowing the influence gap between communities.
Motivation: among the various fairness notions, welfare fairness, which balances fairness level and influence spread, has shown promising effectiveness; however, the lack of efficient algorithms for optimizing its objective has restricted it to small networks with only a few hundred nodes.
Method: the paper adopts the welfare fairness objective, maximizing the exponentially weighted summation over the influenced fractions of all communities; it first introduces an unbiased estimator for the fractional power of the arithmetic mean, then adapts reverse influence sampling (RIS) to convert the optimization into a weighted maximum coverage problem, analyzes how many reverse reachable sets are needed to approximate the fair influence with high probability, and presents an efficient algorithm with a $1-1/e-\varepsilon$ approximation guarantee.
Results: the resulting algorithm retains the approximation guarantee while scaling fair influence maximization well beyond the few-hundred-node networks that limited prior work.

Given a graph $G$, a community structure $\mathcal{C}$, and a budget $k$, the fair influence maximization problem aims to select a seed set $S$ ($|S|\leq k$) that maximizes the influence spread while narrowing the influence gap between different communities. While various fairness notions exist, the welfare fairness notion, which balances fairness level and influence spread, has shown promising effectiveness. However, the lack of efficient algorithms for optimizing the welfare fairness objective function restricts its application to small-scale networks with only a few hundred nodes. In this paper, we adopt the objective function of welfare fairness to maximize the exponentially weighted summation over the influenced fraction of all communities. We first introduce an unbiased estimator for the fractional power of the arithmetic mean. Then, by adapting the reverse influence sampling (RIS) approach, we convert the optimization problem to a weighted maximum coverage problem. We also analyze the number of reverse reachable sets needed to approximate the fair influence at a high probability. Further, we present an efficient algorithm that guarantees $1-1/e - \varepsilon$ approximation.
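A minimal sketch of the reverse influence sampling (RIS) backbone the method adapts: sample reverse reachable (RR) sets under the independent cascade model, then greedily pick seeds that cover the most RR sets. The fairness-specific weighting and the unbiased fractional-power estimator from the paper are omitted; the graph format (in-neighbor lists and an edge-probability dict) is an assumption.

```python
import random
from collections import defaultdict

def rr_set(in_neighbors, probs, n):
    # One reverse reachable set under independent cascade: BFS backwards
    # from a random root, keeping each incoming edge with its probability.
    root = random.randrange(n)
    seen, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u in in_neighbors.get(v, []):
            if u not in seen and random.random() < probs[(u, v)]:
                seen.add(u)
                stack.append(u)
    return seen

def ris_greedy(in_neighbors, probs, n, k, num_rr=20000):
    rr_sets = [rr_set(in_neighbors, probs, n) for _ in range(num_rr)]
    covers = defaultdict(set)          # node -> indices of RR sets it covers
    for i, s in enumerate(rr_sets):
        for v in s:
            covers[v].add(i)
    seeds, covered = [], set()
    while len(seeds) < k and covers:   # greedy (weighted) max coverage
        best = max(covers, key=lambda v: len(covers[v] - covered))
        covered |= covers.pop(best)
        seeds.append(best)
    return seeds
```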

Explainable and Efficient Randomized Voting Rules
Soroush Ebadian Aris Filos-Ratsikas Mohamad Latifian Nisarg Shah



Research question: how to improve decision efficiency while retaining explainability by adding a simple randomization step to deterministic voting rules.
Motivation: as AI tools are increasingly deployed for critical decisions, there is growing demand to explain to stakeholders how these tools reach a decision; voting is therefore frequently used for such decisions owing to its inherent explainability.
Method: two families of simply randomized voting rules are studied, randomized positional scoring rules and random committee member rules, with theoretical and empirical evidence that they achieve explainability and efficiency simultaneously to some extent.
Results: experiments confirm that both families of rules improve decision efficiency while preserving explainability.

With a rapid growth in the deployment of AI tools for making critical decisions (or aiding humans in doing so), there is a growing demand to be able to explain to the stakeholders how these tools arrive at a decision. Consequently, voting is frequently used to make such decisions due to its inherent explainability. Recent work suggests that using randomized (as opposed to deterministic) voting rules can lead to significant efficiency gains measured via the distortion framework. However, rules that use intricate randomization can often become too complex to explain to the stakeholders; losing explainability can eliminate the key advantage of voting over black-box AI tools, which may outweigh the efficiency gains. We study the efficiency gains which can be unlocked by using voting rules that add a simple randomization step to a deterministic rule, thereby retaining explainability. We focus on two such families of rules, randomized positional scoring rules and random committee member rules, and show, theoretically and empirically, that they indeed achieve explainability and efficiency simultaneously to some extent.
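One natural instantiation of such a rule is sketched below: compute deterministic positional scores (Borda here) and then draw the winner with probability proportional to total score. This is an illustrative guess at the family's shape under assumed conventions, not the paper's exact rules.

```python
import numpy as np

def randomized_positional_winner(rankings, scores=None, rng=None):
    """rankings: (n_voters, m) array, rankings[v][i] = candidate in position i.
    Deterministic part: a positional scoring rule. Randomization: sample the
    winner with probability proportional to its total score."""
    rng = rng or np.random.default_rng()
    n_voters, m = rankings.shape
    scores = np.arange(m - 1, -1, -1) if scores is None else np.asarray(scores)
    total = np.zeros(m)
    for v in range(n_voters):
        total[rankings[v]] += scores    # position i contributes scores[i]
    return int(rng.choice(m, p=total / total.sum()))
```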

Robust Data Valuation with Weighted Banzhaf Values
Weida Li Yaoliang Yu



Research question: existing value-based approaches to data valuation struggle with stability and reliability.
Motivation: to address the instability of existing methods under the stochasticity inherent in data valuation, the family of weighted Banzhaf values is proposed.
Method: stochasticity is parameterized by Kronecker noise, under which the uniquely robust semi-value, minimizing the worst-case entropy, is proven to lie within the family of weighted Banzhaf values; an efficient estimator based on the maximum-sample-reuse principle is designed to approximate weighted Banzhaf values.
Results: the theory is verified under both synthetic and authentic noise; for the latter, a Kronecker noise is fitted to the inherent stochasticity and plugged in to generate the predicted most robust semi-value. The study suggests that weighted Banzhaf values are promising in the face of undue noise in data valuation.

Data valuation, a principled way to rank the importance of each training datum, has become increasingly important. However, existing value-based approaches (e.g., Shapley) are known to suffer from the stochasticity inherent in utility functions that render consistent and reliable ranking difficult. Recently, Wang and Jia (2023) proposed the noise-structure-agnostic framework to advocate the Banzhaf value for its robustness against such stochasticity as it achieves the largest safe margin among many alternatives. Surprisingly, our empirical study shows that the Banzhaf value is not always the most robust when compared with a broader family: weighted Banzhaf values. To analyze this scenario, we introduce the concept of Kronecker noise to parameterize stochasticity, through which we prove that the uniquely robust semi-value, which can be analytically derived from the underlying Kronecker noise, lies in the family of weighted Banzhaf values while minimizing the worst-case entropy. In addition, we adopt the maximum sample reuse principle to design an estimator to efficiently approximate weighted Banzhaf values, and show that it enjoys the best time complexity in terms of achieving an $(\epsilon, \delta)$-approximation. Our theory is verified under both synthetic and authentic noises. For the latter, we fit a Kronecker noise to the inherent stochasticity, which is then plugged in to generate the predicted most robust semi-value. Our study suggests that weighted Banzhaf values are promising when facing undue noises in data valuation.
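The maximum-sample-reuse idea is easy to sketch: draw subsets once, including each datum independently with weight w, and reuse every evaluated utility for all data points. A minimal version (the utility function and weight are placeholders):

```python
import numpy as np

def weighted_banzhaf_msr(utility, n, w=0.5, num_samples=2000, seed=0):
    """Estimate weighted-Banzhaf values via maximum sample reuse:
    phi_i ~= mean(U(S) | i in S) - mean(U(S) | i not in S),
    where S includes each of the n data points independently with prob w."""
    rng = np.random.default_rng(seed)
    member = rng.random((num_samples, n)) < w          # subset indicators
    vals = np.array([utility(np.flatnonzero(row)) for row in member])
    phi = np.empty(n)
    for i in range(n):
        inside, outside = vals[member[:, i]], vals[~member[:, i]]
        phi[i] = inside.mean() - outside.mean()        # every sample reused
    return phi
```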

A Path to Simpler Models Starts With Noise
Lesia Semenova Harry Chen Ronald Parr Cynthia Rudin



Research question: the Rashomon set is the set of models that perform approximately equally well on a given dataset, and the Rashomon ratio is the fraction of all models in a given hypothesis space that lie in the Rashomon set; for tabular data in criminal justice, healthcare, lending, and education this ratio is often large, and why it tends to be large is an open question.
Motivation: this study examines how the data generation process, together with the choices analysts typically make during learning, determines the size of the Rashomon ratio.
Method: a mechanism is proposed showing that, through the way practitioners train models, noisier datasets lead to larger Rashomon ratios; a measure called pattern diversity is also introduced, capturing the average difference in predictions between distinct classification patterns in the Rashomon set, with an explanation of why it tends to increase with label noise.
Results: the findings explain a key reason why simple models often perform as well as black-box models on complex, noisy datasets.

The Rashomon set is the set of models that perform approximately equally well on a given dataset, and the Rashomon ratio is the fraction of all models in a given hypothesis space that are in the Rashomon set. Rashomon ratios are often large for tabular datasets in criminal justice, healthcare, lending, education, and in other areas, which has practical implications about whether simpler models can attain the same level of accuracy as more complex models. An open question is why Rashomon ratios often tend to be large. In this work, we propose and study a mechanism of the data generation process, coupled with choices usually made by the analyst during the learning process, that determines the size of the Rashomon ratio. Specifically, we demonstrate that noisier datasets lead to larger Rashomon ratios through the way that practitioners train models. Additionally, we introduce a measure called pattern diversity, which captures the average difference in predictions between distinct classification patterns in the Rashomon set, and motivate why it tends to increase with label noise. Our results explain a key aspect of why simpler models often tend to perform as well as black box models on complex, noisier datasets.
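Under one direct reading of the definition, pattern diversity is the mean pairwise disagreement between the distinct classification patterns (prediction vectors) of models in the Rashomon set. A small sketch under that assumption:

```python
import numpy as np

def pattern_diversity(patterns):
    """patterns: (num_models, n_samples) 0/1 predictions from models in the
    Rashomon set. Returns the average pairwise fraction of points on which
    two *distinct* classification patterns disagree."""
    P = np.unique(np.asarray(patterns), axis=0)   # keep distinct patterns
    k = len(P)
    if k < 2:
        return 0.0
    diffs = [np.mean(P[i] != P[j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(diffs))
```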

Batchnorm Allows Unsupervised Radial Attacks
Amur Ghose Apurv Gupta Yaoliang Yu Pascal Poupart



Research question: generating adversarial examples from the intermediate latents of batch-normalized deep image recognition architectures without using any labels.
Motivation: existing methods for constructing adversarial examples typically require per-instance soft or hard labels; the proposed method relies on no labels at all.
Method: exploiting the geometry of batch-normalized representations, namely their concentration of norm on a hypersphere and their proximity to Gaussian distributions, adversarial examples are generated using only an intermediate loss based on angular deviations.
Results: experiments show the attack succeeds, implying that leaked intermediate representations can create a security breach for deployed models that persists even when the model is transferred to downstream use; removing batch normalization weakens the attack, indicating it contributes to the vulnerability, and the attack also succeeds empirically against LayerNorm, making it relevant for transformer architectures, notably vision transformers.

The construction of adversarial examples usually requires the existence of soft or hard labels for each instance, with respect to which a loss gradient provides the signal for construction of the example. We show that for batch normalized deep image recognition architectures, intermediate latents that are produced after a batch normalization step by themselves suffice to produce adversarial examples using an intermediate loss solely utilizing angular deviations, without relying on any label. We motivate our loss through the geometry of batch normed representations and their concentration of norm on a hypersphere and distributional proximity to Gaussians. Our losses expand intermediate latent based attacks that usually require labels. The success of our method implies that leakage of intermediate representations may create a security breach for deployed models, which persists even when the model is transferred to downstream usage. Removal of batch norm weakens our attack, indicating it contributes to this vulnerability. Our attacks also succeed against LayerNorm empirically, thus being relevant for transformer architectures, most notably vision transformers which we analyze.
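The label-free attack can be sketched as PGD on an angular loss between the latents of the clean and perturbed inputs. `feature_fn` (a hook returning post-batchnorm latents) and the step sizes are assumptions; the paper's full loss is richer than this sketch.

```python
import torch
import torch.nn.functional as F

def radial_attack(feature_fn, x, eps=8 / 255, alpha=2 / 255, steps=10):
    """Unsupervised attack: drive the intermediate latent of x + delta to a
    large angular deviation from the clean latent, using no labels."""
    with torch.no_grad():
        z0 = feature_fn(x).flatten(1)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        z = feature_fn(x + delta).flatten(1)
        cos = F.cosine_similarity(z, z0, dim=1).mean()
        grad, = torch.autograd.grad(cos, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()   # decrease cosine = grow the angle
            delta.clamp_(-eps, eps)
    return (x + delta).detach()
```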

Optimal and Fair Encouragement Policy Evaluation and Learning
Angela Zhou



Research question: how should optimal policy rules be formed in consequential domains where humans may not adhere to treatment recommendations?
Motivation: in these domains individuals may not follow recommendations, and there is heterogeneity both in who takes up treatment and in treatment efficacy; in social services, for instance, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. When the decision-maker additionally has distributional preferences over both access and average outcomes, the optimal decision rule changes.
Method: the paper studies identification, doubly robust estimation, and robust estimation under potential violations of positivity; fairness constraints such as demographic parity in treatment take-up, among others, are handled via constrained optimization; the framework extends to algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using robustness checks for lack of positivity in the recommendation; a two-stage online learning algorithm is developed for solving over parametrized policy classes under general constraints, yielding variance-sensitive regret bounds.
Results: improved recommendation rules are assessed in a stylized case study of optimizing supervised-release recommendations in the PSA-DMF pretrial risk-assessment tool while reducing surveillance disparities.

In consequential domains, it is often impossible to compel individuals to take treatment, so that optimal policy rules are merely suggestions in the presence of human non-adherence to treatment recommendations. In these same domains, there may be heterogeneity both in who responds in taking-up treatment, and heterogeneity in treatment efficacy. For example, in social services, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. When in addition the decision-maker has distributional preferences over both access and average outcomes, the optimal decision rule changes. We study identification, doubly-robust estimation, and robust estimation under potential violations of positivity. We consider fairness constraints such as demographic parity in treatment take-up, and other constraints, via constrained optimization. Our framework can be extended to handle algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using our robustness checks for lack of positivity in the recommendation. We develop a two-stage, online learning-based algorithm for solving over parametrized policy classes under general constraints to obtain variance-sensitive regret bounds. We assess improved recommendation rules in a stylized case study of optimizing recommendation of supervised release in the PSA-DMF pretrial risk-assessment tool while reducing surveillance disparities.

Bicriteria Multidimensional Mechanism Design with Side Information
Siddharth Prasad Nina Balcan Tuomas Sandholm



Research question: how to design multidimensional mechanisms that simultaneously generate high social welfare and high revenue by exploiting side information about agent types.
Motivation: in practice, prominent sources of side information include predictions from machine-learning models trained on historical agent data, advice from domain experts, and even the mechanism designer's own gut instinct; the paper adopts a prior-free perspective that makes no assumptions on the correctness, accuracy, or source of the side information.
Method: a meta-mechanism is designed that integrates the input side information with the classical VCG mechanism; its welfare, revenue, and incentive properties are characterized via novel constructions based on the notion of a weakest competitor, the agent with the smallest impact on welfare.
Results: carefully instantiated, the meta-mechanism simultaneously achieves strong welfare and revenue guarantees parameterized by errors in the side information; when the side information is highly informative and accurate, welfare and revenue are competitive with the total social surplus, and performance decays continuously and gradually as the side-information quality decreases. Finally, the meta-mechanism is applied to settings where each agent's type is determined by a constant number of parameters.

We develop a versatile new methodology for multidimensional mechanism design that incorporates side information about agent types to generate high social welfare and high revenue simultaneously. Prominent sources of side information in practice include predictions from a machine-learning model trained on historical agent data, advice from domain experts, and even the mechanism designer's own gut instinct. In this paper we adopt a prior-free perspective that makes no assumptions on the correctness, accuracy, or source of the side information. First, we design a meta-mechanism that integrates input side information with an improvement of the classical VCG mechanism. The welfare, revenue, and incentive properties of our meta-mechanism are characterized by novel constructions we introduce based on the notion of a weakest competitor, which is an agent that has the smallest impact on welfare. We show that our meta-mechanism, when carefully instantiated, simultaneously achieves strong welfare and revenue guarantees parameterized by errors in the side information. When the side information is highly informative and accurate, our mechanism achieves welfare and revenue competitive with the total social surplus, and its performance decays continuously and gradually as the quality of the side information decreases. Finally, we apply our meta-mechanism to a setting where each agent's type is determined by a constant number of parameters. Specifically, agent types lie on constant-dimensional subspaces (of the potentially high-dimensional ambient type space) that are known to the mechanism designer. We use our meta-mechanism to obtain the first known welfare and revenue guarantees in this setting.

Randomized and Deterministic Maximin-share Approximations for Fractionally Subadditive Valuations
Hannaneh Akrami Kurt Mehlhorn Masoud Seddighin Golnoosh Shahkarami



Research question: how to allocate indivisible items to agents with fractionally subadditive ($\XOS$) valuations while guaranteeing each agent a fraction of her maximin share ($\MMS$).
Motivation: for $\XOS$ valuations, some instances admit no allocation guaranteeing all agents more than $1/2$ of their maximin share, while a known deterministic allocation guarantees each agent $0.219225$ of it.
Method: both deterministic and randomized allocations are studied; on the deterministic side, the best approximation guarantee for fractionally subadditive valuations is improved to $3/13=0.230769$, with new ideas for allocating large items that may be of independent interest; on the randomized side, best-of-both-worlds fairness guarantees are investigated, giving an allocation that is $1/4$-$\MMS$ ex-ante and $1/8$-$\MMS$ ex-post for $\XOS$ valuations.
Results: an upper bound of $3/4$ is also proven on the ex-ante guarantee achievable for this class of valuations.

We consider the problem of guaranteeing maximin-share ($\MMS$) when allocating a set of indivisible items to a set of agents with fractionally subadditive ($\XOS$) valuations. For $\XOS$ valuations, it has been previously shown that for some instances no allocation can guarantee a fraction better than $1/2$ of maximin-share to all the agents. Also, a deterministic allocation exists that guarantees $0.219225$ of the maximin-share of each agent. Our results involve both deterministic and randomized allocations. On the deterministic side, we improve the best approximation guarantee for fractionally subadditive valuations to $3/13 = 0.230769$. We develop new ideas on allocating large items in our allocation algorithm which might be of independent interest. Furthermore, we investigate randomized algorithms and the Best-of-both-worlds fairness guarantees. We propose a randomized allocation that is $1/4$-$\MMS$ ex-ante and $1/8$-$\MMS$ ex-post for $\XOS$ valuations. Moreover, we prove an upper bound of $3/4$ on the ex-ante guarantee for this class of valuations.

GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection
Jinggang Chen Junjie Li Xiaoyang Qu Jianzong Wang Jiguang Wan Jing Xiao



Research question: how to detect out-of-distribution (OOD) examples for neural networks so as to ensure their reliability and safety in real-world settings.
Motivation: gradient-based attribution methods struggle to assign feature importance to OOD data, yielding divergent explanation patterns; the paper therefore studies how attribution gradients lead to uncertain explanation outcomes and introduces two forms of abnormality for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality.
Method: GAIA is proposed, a simple and effective approach combining Gradient Abnormality Inspection and Aggregation.
Results: GAIA is validated on the commonly used CIFAR benchmarks and the large-scale ImageNet-1k benchmark; compared with advanced post-hoc methods, it reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100.

Detecting out-of-distribution (OOD) examples is crucial to guarantee the reliability and safety of deep neural networks in real-world settings. In this paper, we offer an innovative perspective on quantifying the disparities between in-distribution (ID) and OOD data---analyzing the uncertainty that arises when models attempt to explain their predictive decisions. This perspective is motivated by our observation that gradient-based attribution methods encounter challenges in assigning feature importance to OOD data, thereby yielding divergent explanation patterns. Consequently, we investigate how attribution gradients lead to uncertain explanation outcomes and introduce two forms of abnormalities for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality. We then propose GAIA, a simple and effective approach that incorporates Gradient Abnormality Inspection and Aggregation. The effectiveness of GAIA is validated on both commonly utilized (CIFAR) and large-scale (ImageNet-1k) benchmarks. Specifically, GAIA reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100 compared to advanced post-hoc methods.

Perturbation Towards Easy Samples Improves Targeted Adversarial Transferability
Junqi Gao Biqing Qi Yao Li Zhichang Guo Dong Li Yuming Xing Dazhi Zhang



Research question: how to improve the transferability of adversarial perturbations, particularly in the targeted setting.
Motivation: transferable perturbations provide an effective shortcut for black-box attacks, but targeted perturbations are harder to transfer between models.
Method: it is shown experimentally and theoretically that networks trained on the same dataset behave more consistently in each class's high-sample-density regions (HSDR), so in the targeted setting adding perturbations towards the target class's HSDR improves transferability; since easy, low-loss samples tend to lie in HSDR, perturbing towards them sidesteps explicit density estimation, and a generative targeted attack strategy, the Easy Sample Matching Attack (ESMA), is proposed on this basis.
Results: ESMA attains a higher targeted success rate than the SOTA generative method while requiring far less storage and computation, since it attacks all classes with a single model.

The transferability of adversarial perturbations provides an effective shortcut for black-box attacks. Targeted perturbations have greater practicality but are more difficult to transfer between models. In this paper, we experimentally and theoretically demonstrated that neural networks trained on the same dataset have more consistent performance in High-Sample-Density-Regions (HSDR) of each class instead of low sample density regions. Therefore, in the target setting, adding perturbations towards HSDR of the target class is more effective in improving transferability. However, density estimation is challenging in high-dimensional scenarios. Further theoretical and experimental verification demonstrates that easy samples with low loss are more likely to be located in HSDR. Perturbations towards such easy samples in the target class can avoid density estimation for HSDR location. Based on the above facts, we verified that adding perturbations to easy samples in the target class improves targeted adversarial transferability of existing attack methods. A generative targeted attack strategy named Easy Sample Matching Attack (ESMA) is proposed, which has a higher success rate for targeted attacks and outperforms the SOTA generative method. Moreover, ESMA requires only $5\%$ of the storage space and much less computation time compared to the current SOTA, as ESMA attacks all classes with only one model instead of separate models for each class. Our code is available at https://github.com/gjq100/ESMA

Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning
Tyler Kastner Murat A Erdogdu Amir-massoud Farahmand



Research question: learning models for risk-sensitive reinforcement learning.
Motivation: existing value-equivalence approaches support optimal planning in the risk-neutral setting but are insufficient in the risk-sensitive setting.
Method: distributional reinforcement learning is leveraged to introduce two new notions of model equivalence: one that is general but intractable, and a practical variant that lets one choose which risk measures to plan optimally for.
Results: experiments show the resulting models can augment any model-free risk-sensitive algorithm, with both tabular and large-scale results demonstrating the method's capability.

We consider the problem of learning models for risk-sensitive reinforcement learning. We theoretically demonstrate that proper value equivalence, a method of learning models which can be used to plan optimally in the risk-neutral setting, is not sufficient to plan optimally in the risk-sensitive setting. We leverage distributional reinforcement learning to introduce two new notions of model equivalence, one which is general and can be used to plan for any risk measure, but is intractable; and a practical variation which allows one to choose which risk measures they may plan optimally for. We demonstrate how our models can be used to augment any model-free risk-sensitive algorithm, and provide both tabular and large-scale experiments to demonstrate our method’s ability.

Posthoc privacy guarantees for collaborative inference with modified Propose-Test-Release
Abhishek Singh Praneeth Vepakomma Vivek Sharma Ramesh Raskar



Research question: how to provide formal privacy guarantees for arbitrarily trained neural networks by linking their local Lipschitz constants to their local sensitivity.
Motivation: amid growing concern over data privacy, prior work proposed collaborative inference (CI) to learn privacy-preserving encodings of sensitive user data before sharing them with an untrusted service provider, but evaluated privacy only through empirical reconstruction attacks.
Method: a new framework is developed that provides formal privacy guarantees for an arbitrarily trained neural network by linking its local Lipschitz constant with its local sensitivity; to guarantee privacy using local sensitivity, the Propose-Test-Release (PTR) framework is extended to make it tractable for neural network queries.
Results: the framework's efficacy is verified on real-world datasets, and the role of adversarial representation learning (ARL) in improving the privacy-utility trade-off is elucidated.

Cloud-based machine learning inference is an emerging paradigm where users query by sending their data through a service provider who runs an ML model on that data and returns back the answer. Due to increased concerns over data privacy, recent works have proposed Collaborative Inference (CI) to learn a privacy-preserving encoding of sensitive user data before it is shared with an untrusted service provider. Existing works so far evaluate the privacy of these encodings through empirical reconstruction attacks. In this work, we develop a new framework that provides formal privacy guarantees for an arbitrarily trained neural network by linking its local Lipschitz constant with its local sensitivity. To guarantee privacy using local sensitivity, we extend the Propose-Test-Release (PTR) framework to make it tractable for neural network queries. We verify the efficacy of our framework experimentally on real-world datasets and elucidate the role of Adversarial Representation Learning (ARL) in improving the privacy-utility trade-off.

Human-Aligned Calibration for AI-Assisted Decision Making
Nina L. Corvelo Benz Manuel Gomez Rodriguez



Research question: when a binary classifier provides decision support, its confidence values often fail to help decision makers judge the accuracy of its predictions; this paper addresses that problem.
Motivation: although binary classifiers supply both predicted labels and confidence values, empirical evidence shows decision makers struggle to judge prediction accuracy from confidence values alone.
Method: the paper first shows that under some data distributions even an optimal decision maker may be unable to discover the optimal decision policy from conventional confidence values; it then proves that if the confidence values satisfy an alignment property with respect to the decision maker's confidence in her own predictions, there always exists an optimal decision policy under which the trust placed on predictions is a monotone function of the confidence values, which aids discovering that policy.
Results: experiments show that when classifier confidence values are aligned with the decision maker's confidence in her own predictions, decision makers make better decisions.

Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties at developing a good sense on when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exists data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values—an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker’s confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker’s confidence on her own prediction is a sufficient condition for alignment. Experiments on a real AI-assisted decision making scenario where a classifier provides decision support to human decision makers validate our theoretical results and suggest that alignment may lead to better decisions.

IBA: Towards Irreversible Backdoor Attacks in Federated Learning
Dung Thuy Nguyen Tuan Minh Nguyen Anh Tuan Tran Khoa D Doan KOK SENG WONG



Research question: how backdoor attacks can be mounted in federated learning, where models are trained without compromising end devices' personal, potentially sensitive data.
Motivation: existing FL backdoor attacks are limited: they require controlling a large fraction of clients or knowing honest clients' data distributions, their triggers are often visually apparent, and the backdoor effect dilutes quickly once the adversary leaves the training process.
Method: a new FL backdoor framework, the Irreversible Backdoor Attack (IBA), is proposed; it jointly learns an optimal, visually imperceptible trigger and gradually implants the backdoor into the global model, improving the attack's efficiency and durability.
Results: evaluated on benchmark datasets including MNIST, CIFAR-10, and Tiny ImageNet, the framework achieves high success rates while bypassing existing backdoor defenses, yielding a more effective, stealthy, and durable backdoor than other attacks.

Federated learning (FL) is a distributed learning approach that enables machine learning models to be trained on decentralized data without compromising end devices' personal, potentially sensitive data. However, the distributed nature and uninvestigated data intuitively introduce new security vulnerabilities, including backdoor attacks. In this scenario, an adversary implants backdoor functionality into the global model during training, which can be activated to cause the desired misbehaviors for any input with a specific adversarial pattern. Despite having remarkable success in triggering and distorting model behavior, prior backdoor attacks in FL often hold impractical assumptions, limited imperceptibility, and durability. Specifically, the adversary needs to control a sufficiently large fraction of clients or know the data distribution of other honest clients. In many cases, the trigger inserted is often visually apparent, and the backdoor effect is quickly diluted if the adversary is removed from the training process. To address these limitations, we propose a novel backdoor attack framework in FL, the \textbf{Irreversible Backdoor Attack} (\texttt{IBA}), that jointly learns the optimal and visually stealthy trigger and then gradually implants the backdoor into a global model. This approach allows the adversary to execute a backdoor attack that can evade both human and machine inspections. Additionally, we enhance the efficiency and durability of the proposed attack by selectively poisoning the model's parameters that are least likely updated by the main task's learning process and constraining the poisoned model update to the vicinity of the global model. Finally, we evaluate the proposed attack framework on several benchmark datasets, including MNIST, CIFAR-10, and Tiny ImageNet, achieving high success rates while simultaneously bypassing existing backdoor defenses and producing a more durable backdoor effect compared to other backdoor attacks. Overall, \texttt{IBA}\footnote{Code for this paper is published at \url{https://github.com/sail-research/iba}.} offers a more effective, stealthy, and durable approach to backdoor attacks in FL.

Certification of Distributional Individual Fairness
Matthew Robert Wicker Vihari Piratla Adrian Weller



Research question: how to provide formal guarantees of algorithmic fairness for neural networks.
Motivation: formal fairness guarantees are of paramount importance for the socially responsible deployment of machine learning algorithms.
Method: a novel convex approximation of individual fairness constraints is introduced that significantly reduces the computational cost of certifying local individual fairness; the paper further proposes certifying distributional individual fairness, ensuring individually fair predictions for a given empirical distribution and all distributions within a $\gamma$-Wasserstein ball.
Results: leveraging developments in quasi-convex optimization, novel and efficient certified bounds on distributional individual fairness are derived; experiments show the method can certify and regularize neural networks several orders of magnitude larger than those considered by prior work, and a study of real-world distribution shifts finds the bounds to be a scalable, practical, and sound source of IF guarantees.

Providing formal guarantees of algorithmic fairness is of paramount importance to socially responsible deployment of machine learning algorithms. In this work, we study formal guarantees, i.e., certificates, for individual fairness (IF) of neural networks. We start by introducing a novel convex approximation of IF constraints that exponentially decreases the computational cost of providing formal guarantees of local individual fairness. We highlight that prior methods are constrained by their focus on global IF certification and can therefore only scale to models with a few dozen hidden neurons, thus limiting their practical impact. We propose to certify \textit{distributional} individual fairness which ensures that for a given empirical distribution and all distributions within a $\gamma$-Wasserstein ball, the neural network has guaranteed individually fair predictions. Leveraging developments in quasi-convex optimization, we provide novel and efficient certified bounds on distributional individual fairness and show that our method allows us to certify and regularize neural networks that are several orders of magnitude larger than those considered by prior works. Moreover, we study real-world distribution shifts and find our bounds to be a scalable, practical, and sound source of IF guarantees.

Attacks on Online Learners: a Teacher-Student Analysis
Riccardo Giuseppe Margiotta Sebastian Goldt Guido Sanguinetti



Research question: how an attacker who perturbs data labels can manipulate the learning dynamics of an online learner.
Motivation: while test-time attacks on pre-trained models have been studied extensively, the important case of attacks in an online learning setting has received little attention so far.
Method: a control-theoretic perspective is adopted, with a theoretical analysis in a teacher-student setup that considers different attack strategies and yields analytical steady-state results for simple linear learners; attacks on learners with complex architectures are then studied empirically on real data.
Results: a discontinuous transition in the learner's accuracy is proven to occur when the attack strength exceeds a critical threshold, and greedy attacks are found to be extremely efficient, especially when data stream in small batches.

Machine learning models are famously vulnerable to adversarial attacks: small ad-hoc perturbations of the data that can catastrophically alter the model predictions. While a large literature has studied the case of test-time attacks on pre-trained models, the important case of attacks in an online learning setting has received little attention so far. In this work, we use a control-theoretical perspective to study the scenario where an attacker may perturb data labels to manipulate the learning dynamics of an online learner. We perform a theoretical analysis of the problem in a teacher-student setup, considering different attack strategies, and obtaining analytical results for the steady state of simple linear learners. These results enable us to prove that a discontinuous transition in the learner's accuracy occurs when the attack strength exceeds a critical threshold. We then study empirically attacks on learners with complex architectures using real data, confirming the insights of our theoretical analysis. Our findings show that greedy attacks can be extremely efficient, especially when data stream in small batches.
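A toy version of the setting: a linear student learns online from labels a greedy attacker nudges toward a "nefarious" target teacher; sweeping the attack strength exposes the change in steady-state accuracy. All constants below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, steps = 20, 0.02, 20000
w_teacher = rng.normal(size=d)          # clean labeler
w_nefarious = -w_teacher                # attacker's target teacher

for strength in (0.0, 0.3, 0.6, 0.9):
    w = np.zeros(d)                     # online student
    for _ in range(steps):
        x = rng.normal(size=d)
        y_clean = w_teacher @ x
        # Greedy attacker: move the observed label toward the target
        y = (1 - strength) * y_clean + strength * (w_nefarious @ x)
        w += eta * (y - w @ x) * x      # SGD on the squared loss
    overlap = w @ w_teacher / (np.linalg.norm(w) * np.linalg.norm(w_teacher))
    print(f"attack strength {strength:.1f}: teacher overlap {overlap:+.2f}")
```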

Use perturbations when learning from explanations
Juyeon Heo Vihari Piratla Matthew Robert Wicker Adrian Weller



Research question: how to use human-provided explanations to ensure that model predictions are right for the right reasons.
Motivation: existing machine learning from explanations (MLX) approaches rely on local model interpretation methods and require strong model smoothing, leading to sub-optimal performance.
Method: MLX is recast as a robustness problem in which human explanations specify a lower-dimensional manifold from which perturbations can be drawn; theory and experiments show this alleviates the need for strong model smoothing.
Results: several approaches to achieving robustness are considered, improving over prior MLX methods, and combining robustness with an earlier MLX method yields state-of-the-art results on both synthetic and real-world benchmarks.

Machine learning from explanations (MLX) is an approach to learning that uses human-provided explanations of relevant or irrelevant features for each input to ensure that model predictions are right for the right reasons. Existing MLX approaches rely on local model interpretation methods and require strong model smoothing to align model and human explanations, leading to sub-optimal performance. We recast MLX as a robustness problem, where human explanations specify a lower dimensional manifold from which perturbations can be drawn, and show both theoretically and empirically how this approach alleviates the need for strong model smoothing. We consider various approaches to achieving robustness, leading to improved performance over prior MLX methods. Finally, we show how to combine robustness with an earlier MLX method, yielding state-of-the-art results on both synthetic and real-world benchmarks.
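Under the robustness reading, a human explanation becomes a mask of irrelevant features, and training minimizes the worst-case loss over perturbations confined to that mask. A minimal adversarial-training-style sketch; the mask semantics and step sizes are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def mlx_robust_loss(model, x, y, irrelevant_mask, eps=0.1, alpha=0.05, steps=5):
    """Worst-case loss over perturbations restricted to human-marked
    irrelevant features (the lower-dimensional perturbation manifold)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta * irrelevant_mask), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascend only inside the mask
            delta.clamp_(-eps, eps)
    adv = (x + delta * irrelevant_mask).detach()
    return F.cross_entropy(model(adv), y)
```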

Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense
Zunzhi You Daochang Liu Bohyung Han Chang Xu



Research question: how to address the vulnerability of pretrained deep models to adversarial attacks, and whether self-supervised learning can provide adversarial robustness.
Motivation: although masked image modeling (MIM) has driven notable progress in self-supervised visual representation learning, its pretrained models, like most deep neural networks, remain vulnerable to adversarial attacks, limiting practical application.
Method: the authors find that noisy image modeling (NIM), a simple MIM variant that adopts denoising as the pretext task, reconstructs severely corrupted noisy images remarkably well, and accordingly propose an adversarial defense, De^3, that exploits the pretrained denoising decoder to enhance adversarial robustness.
Results: experiments show NIM outperforms MIM in adversarial robustness thanks to its effective denoising; moreover, the defense matches adversarial training in performance while offering an extra tunability advantage.

Recent advancements in masked image modeling (MIM) have made it a prevailing framework for self-supervised visual representation learning. The MIM pretrained models, like most deep neural network methods, remain vulnerable to adversarial attacks, limiting their practical application, and this issue has received little research attention. In this paper, we investigate how this powerful self-supervised learning paradigm can provide adversarial robustness to downstream classifiers. During the exploration, we find that noisy image modeling (NIM), a simple variant of MIM that adopts denoising as the pre-text task, reconstructs noisy images surprisingly well despite severe corruption. Motivated by this observation, we propose an adversarial defense method, referred to as De^3, by exploiting the pretrained decoder for denoising. Through De^3, NIM is able to enhance adversarial robustness beyond providing pretrained features. Furthermore, we incorporate a simple modification, sampling the noise scale hyperparameter from random distributions, and enable the defense to achieve a better and tunable trade-off between accuracy and robustness. Experimental results demonstrate that, in terms of adversarial robustness, NIM is superior to MIM thanks to its effective denoising capability. Moreover, the defense provided by NIM achieves performance on par with adversarial training while offering the extra tunability advantage. Source code and models are available at https://github.com/youzunzhi/NIM-AdvDef.
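The defense as described admits a compact sketch: re-noise the (possibly adversarial) input, denoise it with the NIM-pretrained decoder, and classify, sampling the noise scale from a random distribution for the accuracy-robustness trade-off. The `denoiser(x, sigma)` signature and the scale range are assumptions for illustration.

```python
import torch

@torch.no_grad()
def de3_defend(classifier, denoiser, x, sigma_range=(0.1, 0.5), samples=8):
    """Noise -> denoise -> classify, averaged over sampled noise scales."""
    probs = 0.0
    for _ in range(samples):
        sigma = float(torch.empty(1).uniform_(*sigma_range))
        noisy = x + sigma * torch.randn_like(x)
        probs = probs + classifier(denoiser(noisy, sigma)).softmax(dim=-1)
    return probs / samples
```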

Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning
Amit Dhurandhar Karthikeyan Natesan Ramamurthy Kartik Ahuja Vijay Arya



Research question: how to provide a simple, stable, and intuitive high-fidelity local explanation method for black-box models.
Motivation: despite many variants, existing local interpretable models often fail to produce explanations that are simultaneously high-fidelity, stable, and intuitive.
Method: a model-agnostic local explanation method inspired by the invariant risk minimization principle is proposed, based on a game-theoretic formulation that tends to eliminate features whose black-box gradient abruptly changes sign near the example being explained, thereby providing high-fidelity local explanations.
Results: on tabular, image, and text data the method outperforms LIME and in some cases rivals methods that use realistic neighbors sampled from the data manifold; it is also simple and efficient to train, and can ascertain stable input features for a black-box's local decisions without access to side information such as a (partial) causal graph.

Locally interpretable model agnostic explanations (LIME) method is one of the most popular methods used to explain black-box models at a per example level. Although many variants have been proposed, few provide a simple way to produce high fidelity explanations that are also stable and intuitive. In this work, we provide a novel perspective by proposing a model agnostic local explanation method inspired by the invariant risk minimization (IRM) principle -- originally proposed for (global) out-of-distribution generalization -- to provide such high fidelity explanations that are also stable and unidirectional across nearby examples. Our method is based on a game theoretic formulation where we theoretically show that our approach has a strong tendency to eliminate features where the gradient of the black-box function abruptly changes sign in the locality of the example we want to explain, while in other cases it is more careful and will choose a more conservative (feature) attribution, a behavior which can be highly desirable for recourse. Empirically, we show on tabular, image and text data that the quality of our explanations with neighborhoods formed using random perturbations are much better than LIME and in some cases even comparable to other methods that use realistic neighbors sampled from the data manifold. This is desirable given that learning a manifold to either create realistic neighbors or to project explanations is typically expensive or may even be impossible. Moreover, our algorithm is simple and efficient to train, and can ascertain stable input features for local decisions of a black-box without access to side information such as a (partial) causal graph as has been seen in some recent works.

Connecting Certified and Adversarial Training
Yuhao Mao Mark Niklas Mueller Marc Fischer Martin Vechev



Research question: how to train certifiably robust neural networks.
Motivation: existing adversarial training optimizes under-approximations of the worst-case loss, leading to insufficient regularization for certification, while sound certified training optimizes loose over-approximations, leading to over-regularization and poor standard accuracy.
Method: TAPS is proposed, an (unsound) certified training method combining IBP and PGD training to optimize more precise, though not necessarily sound, worst-case loss approximations, reducing over-regularization and increasing both certified and standard accuracy.
Results: empirically, TAPS achieves a new state of the art in many settings, e.g., a certified accuracy of $22\%$ on TinyImageNet for $\ell_\infty$ perturbations with radius $\epsilon=1/255$; the implementation and networks are public at https://github.com/eth-sri/taps.

Training certifiably robust neural networks remains a notoriously hard problem. While adversarial training optimizes under-approximations of the worst-case loss, which leads to insufficient regularization for certification, sound certified training methods optimize loose over-approximations, leading to over-regularization and poor (standard) accuracy. In this work, we propose TAPS, an (unsound) certified training method that combines IBP and PGD training to optimize more precise, although not necessarily sound, worst-case loss approximations, reducing over-regularization and increasing certified and standard accuracies. Empirically, TAPS achieves a new state-of-the-art in many settings, e.g., reaching a certified accuracy of $22$% on TinyImageNet for $\ell_\infty$-perturbations with radius $\epsilon=1/255$. We make our implementation and networks public at https://github.com/eth-sri/taps.

Decision Tree for Locally Private Estimation with Public Data
Yuheng Ma Han Zhang Yuchao Cai Hanfang Yang



Research question: how a small amount of public data can be used to boost the performance of private estimation.
Motivation: current locally private estimation leaves room for improvement in performance, which the paper seeks to close by introducing a small amount of public data.
Method: an efficient algorithm, the Locally differentially Private Decision Tree (LPDT), is proposed for LDP regression: public data is first used to grow a decision-tree partition, and an estimator is then fitted privately according to that partition.
Results: theory and experiments show LPDT performs strongly, converging faster than LDP estimators that use no public data and remaining effective even under considerable disparities between public and private data.

We propose conducting locally differentially private (LDP) estimation with the aid of a small amount of public data to enhance the performance of private estimation. Specifically, we introduce an efficient algorithm called Locally differentially Private Decision Tree (LPDT) for LDP regression. We first use the public data to grow a decision tree partition and then fit an estimator according to the partition privately. From a theoretical perspective, we show that LPDT is $\varepsilon$-LDP and has a mini-max optimal convergence rate under a mild assumption of similarity between public and private data, whereas the lower bound of the convergence rate of LPDT without public data is strictly slower, which implies that the public data helps to improve the convergence rates of LDP estimation. We conduct experiments on both synthetic and real-world data to demonstrate the superior performance of LPDT compared with other state-of-the-art LDP regression methods. Moreover, we show that LPDT remains effective despite considerable disparities between public and private data.
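A toy rendering of the two-stage recipe, not the authors' exact estimator: grow a tree partition on public data, then have each private user report a Laplace-perturbed one-hot contribution of their clipped response and leaf membership, which is ε-LDP by composition. The clipping bound, leaf count, and privacy split are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lpdt_sketch(X_pub, y_pub, X_priv, y_priv, eps=2.0, B=1.0, seed=0):
    rng = np.random.default_rng(seed)
    tree = DecisionTreeRegressor(max_leaf_nodes=16).fit(X_pub, y_pub)
    n_nodes = tree.tree_.node_count
    sums, counts = np.zeros(n_nodes), np.zeros(n_nodes)
    for leaf, y in zip(tree.apply(X_priv), np.clip(y_priv, -B, B)):
        # Each user privatizes locally: a one-hot value vector (L1
        # sensitivity 2B) and a one-hot count vector (L1 sensitivity 2),
        # each with an eps/2 budget, giving eps-LDP overall.
        s, c = np.zeros(n_nodes), np.zeros(n_nodes)
        s[leaf], c[leaf] = y, 1.0
        sums += s + rng.laplace(0.0, 2 * B / (eps / 2), n_nodes)
        counts += c + rng.laplace(0.0, 2 / (eps / 2), n_nodes)
    leaf_means = sums / np.maximum(counts, 1.0)
    return lambda X_new: leaf_means[tree.apply(X_new)]
```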

Hierarchical Randomized Smoothing
Yan Scholten Jan Schuchardt Aleksandar Bojchevski Stephan Günnemann



Research question: how can models handling complex real-world data guarantee both high accuracy and robustness to small input changes?
Motivation: existing randomized smoothing methods guarantee robustness, but are less effective against adversaries who perturb only a subset of an object's entities.
Method: hierarchical randomized smoothing is proposed, which smooths in a more targeted manner by adding random noise only to a randomly selected subset of an object's entities.
Results: experiments on image and node classification show hierarchical smoothing maintains high accuracy while markedly improving robustness, yielding superior robustness-accuracy trade-offs.

Real-world data is complex and often consists of objects that can be decomposed into multiple entities (e.g. images into pixels, graphs into interconnected nodes). Randomized smoothing is a powerful framework for making models provably robust against small changes to their inputs - by guaranteeing robustness of the majority vote when randomly adding noise before classification. Yet, certifying robustness on such complex data via randomized smoothing is challenging when adversaries do not arbitrarily perturb entire objects (e.g. images) but only a subset of their entities (e.g. pixels). As a solution, we introduce hierarchical randomized smoothing: We partially smooth objects by adding random noise only on a randomly selected subset of their entities. By adding noise in a more targeted manner than existing methods we obtain stronger robustness guarantees while maintaining high accuracy. We initialize hierarchical smoothing using different noising distributions, yielding novel robustness certificates for discrete and continuous domains. We experimentally demonstrate the importance of hierarchical smoothing in image and node classification, where it yields superior robustness-accuracy trade-offs. Overall, hierarchical smoothing is an important contribution towards models that are both - certifiably robust to perturbations and accurate.
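The core of the construction is easy to sketch for images: for each Monte Carlo sample, pick a random subset of pixels, add Gaussian noise only there, and take the majority vote. The selection probability, noise scale, and vote count below are illustrative assumptions.

```python
import torch

@torch.no_grad()
def hierarchical_smooth_predict(model, x, num_classes, p=0.5, sigma=0.25, n=500):
    """x: (C, H, W). Noise only a random subset of pixels (entities), then
    majority-vote the classifier over n such partially noised samples."""
    votes = torch.zeros(num_classes)
    for _ in range(n):
        mask = (torch.rand(1, *x.shape[1:]) < p).float()  # shared over channels
        noisy = x + mask * sigma * torch.randn_like(x)
        votes[model(noisy.unsqueeze(0)).argmax(dim=1)] += 1
    return int(votes.argmax())
```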

Understanding and Improving Ensemble Adversarial Defense
Yian Deng Tingting Mu



Research question: despite the practical success of ensemble strategies in adversarial defense, the theoretical explanation of why an ensemble of adversarially trained classifiers is more robust than single ones remains unclear.
Motivation: to fill this gap, a new error theory dedicated to understanding ensemble adversarial defense is developed.
Method: an effective approach for improving ensemble adversarial defense is proposed, interactive global adversarial training (iGAT), comprising (1) a probabilistic distributing rule that selectively allocates globally challenging adversarial examples to different base classifiers and (2) a regularization term that rescues the severest weaknesses of the base classifiers.
Results: tested across various existing ensemble adversarial defense techniques, iGAT boosts their performance by up to 17% on the CIFAR10 and CIFAR100 datasets under both white-box and black-box attacks.

The strategy of ensemble has become popular in adversarial defense, which trains multiple base classifiers to defend against adversarial attacks in a cooperative manner. Despite the empirical success, theoretical explanations on why an ensemble of adversarially trained classifiers is more robust than single ones remain unclear. To fill in this gap, we develop a new error theory dedicated to understanding ensemble adversarial defense, demonstrating a provable 0-1 loss reduction on challenging sample sets in adversarial defense scenarios. Guided by this theory, we propose an effective approach to improve ensemble adversarial defense, named interactive global adversarial training (iGAT). The proposal includes (1) a probabilistic distributing rule that selectively allocates to different base classifiers adversarial examples that are globally challenging to the ensemble, and (2) a regularization term to rescue the severest weaknesses of the base classifiers. Being tested over various existing ensemble adversarial defense techniques, iGAT is capable of boosting their performance by up to 17\% evaluated using CIFAR10 and CIFAR100 datasets under both white-box and black-box attacks.

H-nobs: Achieving Certified Fairness and Robustness in Distributed Learning on Heterogeneous Datasets
Guanqiang Zhou Ping Xu Yue Wang Zhi Tian



Research question: the paper targets fairness and robustness, two key design goals of modern distributed learning systems, asking (i) what makes jointly satisfying them difficult, (ii) whether theoretical guarantees can be established for the dual property, and (iii) how much fairness must be sacrificed when robustness is incorporated into the system.
Motivation: despite a few prior works attempting to achieve both fairness and robustness, key aspects of this direction remain underexplored.
Method: data heterogeneity is identified as the main difficulty in combining fairness and robustness; a fair and robust framework, H-nobs, is then proposed that offers certified fairness and robustness through two key components, a fairness-promoting objective function and a simple robust aggregation scheme called norm-based screening (NBS).
Results: three convergence theorems for H-nobs are derived for nonconvex, convex, and strongly convex learning models, providing theoretical guarantees for both fairness and robustness; the influence of the robust mechanism (NBS) on H-nobs's fairness performance is also studied empirically, the first such exploration.

Fairness and robustness are two important goals in the design of modern distributed learning systems. Despite a few prior works attempting to achieve both fairness and robustness, some key aspects of this direction remain underexplored. In this paper, we try to answer three largely unnoticed and unaddressed questions that are of paramount significance to this topic: (i) What makes jointly satisfying fairness and robustness difficult? (ii) Is it possible to establish theoretical guarantee for the dual property of fairness and robustness? (iii) How much does fairness have to sacrifice at the expense of robustness being incorporated into the system? To address these questions, we first identify data heterogeneity as the key difficulty of combining fairness and robustness. Accordingly, we propose a fair and robust framework called H-nobs which can offer certified fairness and robustness through the adoption of two key components, a fairness-promoting objective function and a simple robust aggregation scheme called norm-based screening (NBS). We explain in detail why NBS is the suitable scheme in our algorithm in contrast to other robust aggregation measures. In addition, we derive three convergence theorems for H-nobs in cases of the learning model being nonconvex, convex, and strongly convex respectively, which provide theoretical guarantees for both fairness and robustness. Further, we empirically investigate the influence of the robust mechanism (NBS) on the fairness performance of H-nobs, the very first attempt of such exploration.
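Norm-based screening itself fits in a few lines: drop the updates with the largest norms (the classic vehicle for poisoning) and average the survivors. A minimal sketch, with the screening fraction as an assumption:

```python
import torch

def norm_based_screening(updates, drop_frac=0.2):
    """updates: list of flattened client update tensors. Screen out the
    largest-norm updates, then average the rest."""
    stacked = torch.stack(updates)
    norms = stacked.norm(dim=1)
    num_keep = max(1, int(len(updates) * (1 - drop_frac)))
    keep = norms.argsort()[:num_keep]   # smallest norms survive
    return stacked[keep].mean(dim=0)
```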

Towards Unbounded Machine Unlearning
Meghdad Kurmanji Peter Triantafillou Jamie Hayes Eleni Triantafillou



Research question: how to 'remove' a subset of the training set from a trained neural network, i.e., deep machine unlearning.
Motivation: the problem is timely and has many applications, including removing biases (RB), resolving confusion (RC) caused by mislabelled data in trained models, and allowing users to exercise their 'right to be forgotten' to protect user privacy (UP).
Method: the paper is the first to study unlearning separately for these applications (RB, RC, UP), arguing each has its own desiderata, definitions of 'forgetting', and metrics for forget quality; for UP, a novel adaptation of a strong membership inference attack is proposed, along with SCRUB, a new unlearning algorithm that is the only method consistently top-performing on forget quality across the application-dependent metrics, while also leading on model-utility metrics (accuracy on retained data and generalization) and being more efficient than previous work.
Results: these claims are substantiated through a comprehensive empirical evaluation against the previous state of the art.

Deep machine unlearning is the problem of 'removing' from a trained neural network a subset of its training set. This problem is very timely and has many applications, including the key tasks of removing biases (RB), resolving confusion (RC) (caused by mislabelled data in trained models), as well as allowing users to exercise their 'right to be forgotten' to protect User Privacy (UP). This paper is the first, to our knowledge, to study unlearning for different applications (RB, RC, UP), with the view that each has its own desiderata, definitions for 'forgetting' and associated metrics for forget quality. For UP, we propose a novel adaptation of a strong Membership Inference Attack for unlearning. We also propose SCRUB, a novel unlearning algorithm, which is the only method that is consistently a top performer for forget quality across the different application-dependent metrics for RB, RC, and UP. At the same time, SCRUB is also consistently a top performer on metrics that measure model utility (i.e. accuracy on retained data and generalization), and is more efficient than previous work. The above are substantiated through a comprehensive empirical evaluation against previous state-of-the-art.
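The teacher-student flavor of SCRUB can be hedged into a few lines: keep the student close to the original (teacher) model on retained data while pushing it away on the forget set. This is a simplified single-objective rendering of the idea; the actual algorithm alternates maximization and minimization steps with its own loss weights.

```python
import torch
import torch.nn.functional as F

def scrub_style_step(student, teacher, optimizer, retain_batch, forget_batch,
                     alpha=1.0, gamma=1.0):
    (xr, yr), (xf, _) = retain_batch, forget_batch
    with torch.no_grad():
        t_retain = teacher(xr).log_softmax(dim=-1)
        t_forget = teacher(xf).log_softmax(dim=-1)
    s_retain, s_forget = student(xr), student(xf)
    loss = (F.kl_div(s_retain.log_softmax(-1), t_retain,
                     log_target=True, reduction="batchmean")
            + gamma * F.cross_entropy(s_retain, yr)          # stay useful
            - alpha * F.kl_div(s_forget.log_softmax(-1), t_forget,
                               log_target=True, reduction="batchmean"))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```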

BadTrack: A Poison-Only Backdoor Attack on Visual Object Tracking
Bin Huang Jiaqian Yu Yiwei Chen Siyang Pan Qiang Wang Zhi Wang



Research question: the paper shows that the way visual object tracking (VOT) trackers extract positive and negative training examples can be exploited, and uses this to propose a new poison-only backdoor attack on VOT.
Motivation: state-of-the-art VOT trackers rely on positive and negative examples to distinguish the object from the background, a characteristic whose threat potential had not been revealed before.
Method: a small part of the training data is poisoned by attaching a predefined trigger pattern to the background region of each video frame, so that the trigger appears almost exclusively in the extracted negative examples.
Results: experiments show the attack significantly degrades the performance of both two-stream Siamese and one-stream Transformer trackers on poisoned data while achieving performance comparable to benign trackers on clean data.

Visual object tracking (VOT) is one of the most fundamental tasks in computer vision community. State-of-the-art VOT trackers extract positive and negative examples that are used to guide the tracker to distinguish the object from the background. In this paper, we show that this characteristic can be exploited to introduce new threats and hence propose a simple yet effective poison-only backdoor attack. To be specific, we poison a small part of the training data by attaching a predefined trigger pattern to the background region of each video frame, so that the trigger appears almost exclusively in the extracted negative examples. To the best of our knowledge, this is the first work that reveals the threat of poison-only backdoor attack on VOT trackers. We experimentally show that our backdoor attack can significantly degrade the performance of both two-stream Siamese and one-stream Transformer trackers on the poisoned data while gaining comparable performance with the benign trackers on the clean data.
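The poisoning step described is simple to sketch: paste a trigger patch into the background of a frame, away from the object's bounding box, so it is harvested (almost) only into negative examples. The margin and retry count are assumptions.

```python
import numpy as np

def poison_frame(frame, bbox, trigger, margin=8, tries=100, rng=None):
    """frame: (H, W, 3) uint8; bbox: (x0, y0, x1, y1) of the tracked object;
    trigger: (th, tw, 3) patch pasted into the background region only."""
    rng = rng or np.random.default_rng()
    H, W, _ = frame.shape
    th, tw = trigger.shape[:2]
    x0, y0, x1, y1 = bbox
    for _ in range(tries):
        x, y = rng.integers(0, W - tw), rng.integers(0, H - th)
        outside = (x + tw < x0 - margin or x > x1 + margin or
                   y + th < y0 - margin or y > y1 + margin)
        if outside:
            out = frame.copy()
            out[y:y + th, x:x + tw] = trigger
            return out
    return frame   # no background slot found; leave the frame clean
```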

(Provable) Adversarial Robustness for Group Equivariant Tasks: Graphs, Point Clouds, Molecules, and More
Jan Schuchardt Yan Scholten Stephan Günnemann



Research question: a machine learning model is traditionally considered robust if its prediction remains (almost) constant under small-norm input perturbations, yet real-world tasks such as molecular property prediction or point cloud segmentation have inherent equivariances, e.g. rotation or permutation equivariance.
Motivation: in such tasks, even large-norm perturbations do not necessarily change an input's semantic content, while other perturbations require the model's prediction to change explicitly; a sound notion of adversarial robustness that accounts for task equivariance is therefore proposed.
Method: provable robustness is achieved by (1) choosing a model that matches the task's equivariances and (2) certifying traditional adversarial robustness; because certification methods are unavailable for many models, such as those with continuous equivariances, the framework of equivariance-preserving randomized smoothing is developed, enabling architecture-agnostic certification.
Results: the first architecture-specific graph edit distance certificates are derived, i.e. sound robustness guarantees for isomorphism-equivariant tasks such as node classification; overall, a sound notion of robustness is an important prerequisite for future work at the intersection of robust and geometric machine learning.

A machine learning model is traditionally considered robust if its prediction remains (almost) constant under input perturbations with small norm. However, real-world tasks like molecular property prediction or point cloud segmentation have inherent equivariances, such as rotation or permutation equivariance. In such tasks, even perturbations with large norm do not necessarily change an input's semantic content. Furthermore, there are perturbations for which a model's prediction explicitly needs to change. For the first time, we propose a sound notion of adversarial robustness that accounts for task equivariance. We then demonstrate that provable robustness can be achieved by (1) choosing a model that matches the task's equivariances (2) certifying traditional adversarial robustness. Certification methods are, however, unavailable for many models, such as those with continuous equivariances. We close this gap by developing the framework of equivariance-preserving randomized smoothing, which enables architecture-agnostic certification. We additionally derive the first architecture-specific graph edit distance certificates, i.e. sound robustness guarantees for isomorphism equivariant tasks like node classification. Overall, a sound notion of robustness is an important prerequisite for future work at the intersection of robust and geometric machine learning.

Minimax Risks and Optimal Procedures for Estimation under Functional Local Differential Privacy
Bonwoo Lee Jeongyoun Ahn Cheolwoo Park



Research question: how to maximize the statistical utility of data while guaranteeing its privacy.
Motivation: as concern for data privacy grows, differential privacy (DP) has emerged as a fundamental concept that guarantees privacy by ensuring individuals' indistinguishability in data analysis; local differential privacy (LDP) is a rigorous type of DP requiring individual data to be privatized before being sent to the collector, eliminating the need for a trusted third party.
Method: the study investigates how functional LDP preserves statistical utility by analyzing the minimax risks of univariate mean estimation and nonparametric density estimation, leveraging the contraction property of functional LDP mechanisms and classical information-theoretic bounds to derive private minimax lower bounds.
Results: the theory reveals an interpretable, continuous balance between statistical utility and privacy level that cannot be achieved under the $\epsilon$-LDP framework; minimax optimal mechanisms based on Gaussian LDP (a type of functional LDP) are proposed and shown in a numerical study to outperform their $\epsilon$-LDP counterparts, suggesting that Gaussian LDP should be regarded as a reliable standard for LDP.

As concerns about data privacy continue to grow, differential privacy (DP) has emerged as a fundamental concept that aims to guarantee privacy by ensuring individuals' indistinguishability in data analysis. Local differential privacy (LDP) is a rigorous type of DP that requires individual data to be privatized before being sent to the collector, thus removing the need for a trusted third party to collect data. Among the numerous (L)DP-based approaches, functional DP has gained considerable attention in the DP community because it connects DP to statistical decision-making by formulating it as a hypothesis-testing problem and also exhibits Gaussian-related properties. However, the utility of privatized data is generally lower than that of non-private data, prompting research into optimal mechanisms that maximize the statistical utility for given privacy constraints. In this study, we investigate how functional LDP preserves the statistical utility by analyzing minimax risks of univariate mean estimation as well as nonparametric density estimation. We leverage the contraction property of functional LDP mechanisms and classical information-theoretical bounds to derive private minimax lower bounds. Our theoretical study reveals that it is possible to establish an interpretable, continuous balance between the statistical utility and privacy level, which has not been achieved under the $\epsilon$-LDP framework. Furthermore, we suggest minimax optimal mechanisms based on Gaussian LDP (a type of functional LDP) that achieve the minimax upper bounds and show via a numerical study that they are superior to the counterparts derived under $\epsilon$-LDP. The theoretical and empirical findings of this work suggest that Gaussian LDP should be considered a reliable standard for LDP.
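For intuition, univariate mean estimation under Gaussian LDP reduces to each user releasing a clipped value plus Gaussian noise calibrated to the desired functional-privacy level (μ-GDP here, as one concrete functional-DP parameterization). The clipping bound and μ are assumptions.

```python
import numpy as np

def gaussian_ldp_mean(x, mu=1.0, B=1.0, seed=0):
    """Each user clips to [-B, B] and adds N(0, (2B/mu)^2) noise locally;
    with L2 sensitivity 2B this satisfies mu-Gaussian DP per report.
    The collector simply averages the noisy reports."""
    rng = np.random.default_rng(seed)
    reports = np.clip(x, -B, B) + rng.normal(0.0, 2 * B / mu, size=len(x))
    return reports.mean()
```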

LEACE: Perfect linear concept erasure in closed form
Nora Belrose David Schneider-Joseph Shauli Ravfogel Ryan Cotterell Edward Raff Stella Biderman



Research question: how to remove specified features from language model representations to improve fairness and interpretability.
Motivation: preventing classifiers from using features such as gender or race improves model fairness and interpretability.
Method: LEAst-squares Concept Erasure (LEACE) is proposed, a closed-form method that provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible.
Results: applied to large language models via a concept-scrubbing procedure that erases target-concept information from every layer of the network, the method proves effective at measuring language models' reliance on part-of-speech information and at reducing gender bias in BERT embeddings.

Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called concept scrubbing, which erases target concept information from _every_ layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Our code is available at https://github.com/EleutherAI/concept-erasure.
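Since the method is closed-form, the whole fitting procedure fits in a short function. The sketch below follows the published recipe as we read it (whiten, project onto the whitened cross-covariance, unwhiten); treat it as an unofficial rendering under that reading, not the reference implementation.

```python
import numpy as np

def fit_leace(X, Z, tol=1e-10):
    """X: (n, d) representations; Z: (n, k) concept labels (e.g. one-hot).
    Returns an eraser r(x) = x - A(x - mu) that zeroes Cov(r(X), Z), so no
    linear classifier can detect Z, while moving x as little as possible."""
    mu = X.mean(axis=0)
    Xc, Zc = X - mu, Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / len(X)
    sigma_xz = Xc.T @ Zc / len(X)
    vals, vecs = np.linalg.eigh(sigma_xx)
    keep = vals > tol
    W = (vecs[:, keep] * vals[keep] ** -0.5) @ vecs[:, keep].T   # whitening
    W_pinv = (vecs[:, keep] * vals[keep] ** 0.5) @ vecs[:, keep].T
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol]
    A = W_pinv @ (U @ U.T) @ W     # oblique projection removed from x
    return lambda x: x - (x - mu) @ A.T
```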

Strategic Behavior in Two-sided Matching Markets with Prediction-enhanced Preference-formation
Stefania Ionescu Yuhao Du Kenneth Joseph Aniko Hannak



Research question: how are agents matched across the two sides of a market in the absence of regulated exchanges?
Motivation: two-sided matching markets have long existed to pair agents without regulated exchanges, and in such settings forming preferences is both difficult and critical.
Method: a new type of strategic behavior, the adversarial interaction attack, is introduced, and a formal economic model is constructed that captures the feedback loop between prediction mechanisms designed to assist agents and the matching mechanism used to pair them.
Results: the analysis shows that agents returning to the market can benefit from adversarial interaction attacks and gain progressively more as trust in and accuracy of predictions increase; the attack also increases inequality in the student population.

Two-sided matching markets have long existed to pair agents in the absence of regulated exchanges. A common example is school choice, where a matching mechanism uses student and school preferences to assign students to schools. In such settings, forming preferences is both difficult and critical. Prior work has suggested various prediction mechanisms that help agents make decisions about their preferences. Although often deployed together, these matching and prediction mechanisms are almost always analyzed separately. The present work shows that at the intersection of the two lies a previously unexplored type of strategic behavior: agents returning to the market (e.g., schools) can attack future predictions by interacting short-term non-optimally with their matches. Here, we first introduce this type of strategic behavior, which we call an adversarial interaction attack. Next, we construct a formal economic model that captures the feedback loop between prediction mechanisms designed to assist agents and the matching mechanism used to pair them. Finally, in a simplified setting, we prove that returning agents can benefit from using adversarial interaction attacks and gain progressively more as the trust in and accuracy of predictions increases. We also show that this attack increases inequality in the student population.

Uncertainty Estimation for Safety-critical Scene Segmentation via Fine-grained Reward Maximization
Hongzheng Yang Cheng Chen Yueyao Chen Markus Scheppach Hon Chi Yip Qi Dou



Research question: how to make the deployment of deep segmentation models reliable in safety-critical scenarios such as medical applications.
Motivation: existing uncertainty estimation methods are limited by the lack of explicit guidance for calibrating prediction risk and model confidence.
Method: a novel fine-grained reward maximization (FGRM) framework is proposed that addresses uncertainty estimation by directly optimizing an uncertainty-metric-related reward function with a reinforcement-learning-based model tuning algorithm; a new calibration-based reward is designed and used to fine-tune an evidential-learning-pretrained segmentation model, with fine-grained reward weighting of each network parameter according to its importance as quantified by the Fisher information matrix.
Results: on two large safety-critical surgical scene segmentation datasets, the method outperforms state-of-the-art approaches on all calibration metrics of uncertainty estimation while maintaining high task accuracy for the segmentation results.

Uncertainty estimation plays an important role for future reliable deployment of deep segmentation models in safety-critical scenarios such as medical applications. However, existing methods for uncertainty estimation have been limited by the lack of explicit guidance for calibrating the prediction risk and model confidence. In this work, we propose a novel fine-grained reward maximization (FGRM) framework, to address uncertainty estimation by directly utilizing an uncertainty metric related reward function with a reinforcement learning based model tuning algorithm. This would benefit the model uncertainty estimation with direct optimization guidance for model calibration. Specifically, our method designs a new uncertainty estimation reward function using the calibration metric, which is maximized to fine-tune an evidential learning pre-trained segmentation model for calibrating prediction risk. Importantly, we innovate an effective fine-grained parameter update scheme, which imposes fine-grained reward-weighting of each network parameter according to the parameter importance quantified by the fisher information matrix. To the best of our knowledge, this is the first work exploring reward optimization for model uncertainty estimation in safety-critical vision tasks. The effectiveness of our method is demonstrated on two large safety-critical surgical scene segmentation datasets under two different uncertainty estimation settings. With real-time one forward pass at inference, our method outperforms state-of-the-art methods by a clear margin on all the calibration metrics of uncertainty estimation, while maintaining a high task accuracy for the segmentation results. Code is available at https://github.com/med-air/FGRM.

Counterfactually Fair Representation
Zhiqun Zuo Mohammad Mahdi Khalili Xueru Zhang



Research question: In high-stakes applications, machine learning models may be biased against protected social groups; how can such bias be addressed fairly?
Motivation: To mitigate bias in high-stakes applications, this paper focuses on Counterfactual Fairness (CF), a causal notion of fairness.
Method: A new algorithm is proposed that trains models using all available features, together with theoretical and empirical evidence that the approach satisfies CF.
Results: Experiments show that models trained with this method can satisfy CF.

The use of machine learning models in high-stake applications (e.g., healthcare, lending, college admission) has raised growing concerns due to potential biases against protected social groups. Various fairness notions and methods have been proposed to mitigate such biases. In this work, we focus on Counterfactual Fairness (CF), a fairness notion that is dependent on an underlying causal graph and first proposed by Kusner $\textit{et al.}$; it requires that the outcome an individual perceives is the same in the real world as it would be in a "counterfactual" world, in which the individual belongs to another social group. Learning fair models satisfying CF can be challenging. It was shown in (Kusner $\textit{et al.}$) that a sufficient condition for satisfying CF is to $\textbf{not}$ use features that are descendants of sensitive attributes in the causal graph. This implies a simple method that learns CF models only using non-descendants of sensitive attributes while eliminating all descendants. Although several subsequent works proposed methods that use all features for training CF models, there is no theoretical guarantee that they can satisfy CF. In contrast, this work proposes a new algorithm that trains models using all the available features. We theoretically and empirically show that models trained with this method can satisfy CF.

Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement
Avani Gupta Saurabh Saini P J Narayanan



Research question: How to understand and reduce biases in neural networks by leveraging human-centered concept explanations.
Motivation: Recent interpretability research has focused on human-centered concept explanations; the goal here is to reduce model bias via a concept loss applied during (ante-hoc) training.
Method: Concept Activation Vectors (CAVs) are extended from post-hoc analysis to ante-hoc training, reducing model bias through fine-tuning with an additional "Concept Loss"; "Concept Distillation" is also introduced, a method that uses a pre-trained knowledgeable model as a teacher to define rich and effective concepts.
Results: The method improves model interpretability, reduces bias, and can induce prior knowledge; applying concept-sensitive training to several classification problems shows that it effectively reduces model bias.

Humans use abstract *concepts* for understanding instead of hard features. Recent interpretability research has focused on human-centered concept explanations of neural networks. Concept Activation Vectors (CAVs) estimate a model's sensitivity and possible biases to a given concept. We extend CAVs from post-hoc analysis to ante-hoc training to reduce model bias through fine-tuning using an additional *Concept Loss*. Concepts are defined on the final layer of the network in the past. We generalize it to intermediate layers, including the last convolution layer. We also introduce *Concept Distillation*, a method to define rich and effective concepts using a pre-trained knowledgeable model as the teacher. Our method can sensitize or desensitize a model towards concepts. We show applications of concept-sensitive training to debias several classification problems. We also show a way to induce prior knowledge into a reconstruction problem. We show that concept-sensitive training can improve model interpretability, reduce biases, and induce prior knowledge.

RETVec: Resilient and Efficient Text Vectorizer
Elie Bursztein Marina Zhang Owen Skipper Vallis Xinyu Jia Alexey Kurakin



Research question: How to design an efficient, resilient, and multilingual text vectorizer for neural-based text processing.
Motivation: Existing text vectorizers perform poorly in the face of typos and character-level adversarial attacks.
Method: RETVec combines a novel character encoding with an optional small embedding model that embeds words into a 256-dimensional vector space; the embedding model is pre-trained with pair-wise metric learning to improve robustness to typos and character-level adversarial attacks.
Results: Experiments show that RETVec performs well on popular model architectures and datasets, yielding competitive multilingual models with significantly improved resilience to typos and adversarial text attacks.

This paper describes RETVec, an efficient, resilient, and multilingual text vectorizer designed for neural-based text processing. RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space. The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks. In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.

Stability Guarantees for Feature Attributions with Multiplicative Smoothing
Anton Xue Rajeev Alur Eric Wong



Research question: Explanation methods for machine learning models often come with no formal guarantees and may not reflect the underlying decision process.
Motivation: This work analyzes stability as a property of reliable feature attribution methods.
Method: We prove that relaxed variants of stability are guaranteed if the model is sufficiently Lipschitz with respect to the masking of features, and we develop a smoothing method called Multiplicative Smoothing (MuS) to achieve such models.
Results: We evaluate MuS on vision and language models, integrated with various feature attribution methods such as LIME and SHAP, and demonstrate that MuS endows feature attributions with non-trivial stability guarantees.

Explanation methods for machine learning models tend not to provide any formal guarantees and may not reflect the underlying decision-making process. In this work, we analyze stability as a property for reliable feature attribution methods. We prove that relaxed variants of stability are guaranteed if the model is sufficiently Lipschitz with respect to the masking of features. We develop a smoothing method called Multiplicative Smoothing (MuS) to achieve such a model. We show that MuS overcomes the theoretical limitations of standard smoothing techniques and can be integrated with any classifier and feature attribution method. We evaluate MuS on vision and language models with various feature attribution methods, such as LIME and SHAP, and demonstrate that MuS endows feature attributions with non-trivial stability guarantees.
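
To make the smoothing idea concrete, here is a minimal Monte Carlo sketch of a multiplicatively smoothed classifier; `model`, `keep_prob`, and `n_samples` are illustrative names, and the paper's actual MuS construction chooses the mask distribution so that exact masking-Lipschitz constants (and hence the stability certificates) follow, which this estimate alone does not provide.

```python
import numpy as np

def mus_smooth(model, x, keep_prob=0.5, n_samples=1000, seed=0):
    """Average model outputs over random multiplicative feature masks.

    A Monte Carlo sketch of multiplicative smoothing: each sample keeps
    every feature independently with probability keep_prob (zeroing the
    rest), and the smoothed prediction is the average over masks.
    """
    rng = np.random.default_rng(seed)
    masks = rng.random((n_samples, x.shape[0])) < keep_prob  # Bernoulli masks
    outputs = np.stack([model(x * m) for m in masks])
    return outputs.mean(axis=0)
```

A feature attribution method such as LIME or SHAP would then be run on the smoothed predictor rather than on the base model.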

Class-Conditional Conformal Prediction with Many Classes
Tiffany Ding Anastasios Nikolas Angelopoulos Stephen Bates Michael Jordan Ryan Tibshirani



Research question: How to obtain a stronger guarantee that, for test points of a specific class, the prediction set contains the true label with a user-chosen probability.
Motivation: In many classification problems, we would like the prediction set for test points of a specific class to contain the true label with the user-specified probability.
Method: A method called clustered conformal prediction is proposed, which clusters together classes having "similar" conformal scores and performs conformal prediction at the cluster level.
Results: Empirical evaluation on four image datasets with up to 1000 classes shows that clustered conformal prediction typically outperforms existing methods on class-conditional coverage and set size metrics.

Standard conformal prediction methods provide a marginal coverage guarantee, which means that for a random test point, the conformal prediction set contains the true label with a user-specified probability. In many classification problems, we would like to obtain a stronger guarantee--that for test points of a specific class, the prediction set contains the true label with the same user-chosen probability. For the latter goal, existing conformal prediction methods do not work well when there is a limited amount of labeled data per class, as is often the case in real applications where the number of classes is large. We propose a method called clustered conformal prediction that clusters together classes having "similar" conformal scores and performs conformal prediction at the cluster level. Based on empirical evaluation across four image data sets with many (up to 1000) classes, we find that clustered conformal typically outperforms existing methods in terms of class-conditional coverage and set size metrics.
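
The recipe is short enough to sketch. The version below assumes a nonconformity score where lower means more conforming, summarizes each class by a few score quantiles before clustering, and glosses over the paper's details on splitting calibration data between the clustering and calibration steps.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_conformal(cal_scores, cal_labels, n_clusters, alpha):
    """Cluster classes by empirical quantiles of their conformal scores,
    then compute one conformal threshold per cluster (a sketch)."""
    classes = np.unique(cal_labels)
    # Summarize each class by a few quantiles of its nonconformity scores.
    feats = np.stack([np.quantile(cal_scores[cal_labels == c], [0.5, 0.7, 0.9])
                      for c in classes])
    cluster_of_class = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    thresholds = {}
    for k in range(n_clusters):
        s = cal_scores[np.isin(cal_labels, classes[cluster_of_class == k])]
        q = np.ceil((len(s) + 1) * (1 - alpha)) / len(s)  # finite-sample correction
        thresholds[k] = np.quantile(s, min(q, 1.0))
    return classes, cluster_of_class, thresholds

def prediction_set(score_fn, x, classes, cluster_of_class, thresholds):
    """Include class y iff its score is below its cluster's threshold."""
    return [y for y, k in zip(classes, cluster_of_class)
            if score_fn(x, y) <= thresholds[k]]
```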

Optimal Unbiased Randomizers for Regression with Label Differential Privacy
Ashwinkumar Badanidiyuru Badih Ghazi Pritish Kamath Ravi Kumar Ethan Jacob Leeman Pasin Manurangsi Avinash V Varadarajan Chiyuan Zhang



Research question: How to train regression models under the constraint of label differential privacy (DP).
Motivation: To improve model utility while protecting the privacy of user labels.
Method: A new family of label randomizers is proposed that leverages the trade-off between label bias and variance to construct better randomizers based on a privately estimated prior distribution over the labels.
Results: These randomizers achieve state-of-the-art privacy-utility trade-offs on several datasets, highlighting the importance of reducing bias when training neural networks with label DP; theoretical results on the structural properties of optimal unbiased randomizers are also provided.

We propose a new family of label randomizers for training _regression_ models under the constraint of label differential privacy (DP). In particular, we leverage the trade-offs between bias and variance to construct better label randomizers depending on a privately estimated prior distribution over the labels. We demonstrate that these randomizers achieve state-of-the-art privacy-utility trade-offs on several datasets, highlighting the importance of reducing bias when training neural networks with label DP. We also provide theoretical results shedding light on the structural properties of the optimal unbiased randomizers.
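
As a concrete (if generic) instance of an unbiased label randomizer, the sketch below discretizes labels to a grid, applies ε-label-DP randomized response, and debiases the released value with the standard inverse-probability correction. This is not the paper's construction: its randomizers improve on this by optimizing the bias-variance trade-off against a privately estimated label prior, which here would at most inform the choice of grid.

```python
import numpy as np

def rr_unbiased_label(y_bin, grid, eps, rng):
    """eps-label-DP randomized response over label bins, followed by the
    standard inverse-probability debiasing, so the released value is an
    unbiased estimate of the true (binned) label grid[y_bin]."""
    grid = np.asarray(grid, dtype=float)
    K = len(grid)
    p = np.exp(eps) / (np.exp(eps) + K - 1)   # prob. of reporting the true bin
    q = (1.0 - p) / (K - 1)                   # prob. of each other bin
    if rng.random() < p:
        z = y_bin
    else:
        z = rng.choice([j for j in range(K) if j != y_bin])
    # E[(grid[z] - q * grid.sum()) / (p - q)] = grid[y_bin], hence unbiased.
    return (grid[z] - q * grid.sum()) / (p - q)
```

Unbiasedness comes at the price of released values that can fall outside the label range; that extra variance is exactly what the paper's prior-dependent randomizers target.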

What Distributions are Robust to Indiscriminate Poisoning Attacks for Linear Learners?
Fnu Suya Xiao Zhang Yuan Tian David Evans



Research question: Whether linear learners can resist indiscriminate poisoning attacks, in which an adversary injects a small number of carefully crafted examples into the training data so that the induced model incurs higher test error.
Motivation: Observing that linear learners on some datasets resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning of linear learners.
Method: For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum model risk at a given poisoning budget.
Results: The results prove that linear learners can indeed resist indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in the empirical performance of state-of-the-art poisoning attacks on linear learners across benchmark datasets, an important first step toward understanding why some learning tasks are vulnerable to data poisoning.

We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the state-of-the-art poisoning attacks on linear learners across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks.

Rethinking Incentives in Recommender Systems: Are Monotone Rewards Always Beneficial?
Fan Yao Chuanhao Li Karthik Abinav Sankararaman Yiming Liao Yan Zhu Qifan Wang Hongning Wang Haifeng Xu



Research question: How to design the reward mechanism of an online content recommendation platform so as to steer creators' competition toward desirable long-run welfare outcomes.
Motivation: The reward mechanisms widely adopted by current platforms induce competition among creators that shapes their production choices and the resulting content distribution, and thereby system welfare.
Method: The paper first reveals a fundamental limitation of the widely adopted class of "merit-based monotone mechanisms": they inevitably incur a constant-fraction loss of the optimal welfare. It then proposes "Backward Rewarding Mechanisms" (BRMs) and shows that the induced competition game has a potential game structure, so creators' strategic behavior collectively optimizes the potential function, which can be designed to match any given welfare metric.
Results: The BRM class can further be parameterized so that the platform can directly optimize welfare within the feasible mechanism space, even when the welfare metric is not explicitly defined.

The past decade has witnessed the flourishing of a new profession as media content creators, who rely on revenue streams from online content recommendation platforms. The reward mechanism employed by these platforms creates a competitive environment among creators which affects their production choices and, consequently, content distribution and system welfare. It is thus crucial to design the platform's reward mechanism in order to steer the creators' competition towards a desirable welfare outcome in the long run. This work makes two major contributions in this regard: first, we uncover a fundamental limit about a class of widely adopted mechanisms, coined \emph{Merit-based Monotone Mechanisms}, by showing that they inevitably lead to a constant fraction loss of the optimal welfare. To circumvent this limitation, we introduce \emph{Backward Rewarding Mechanisms} (BRMs) and show that the competition game resultant from BRMs possesses a potential game structure. BRMs thus naturally induce strategic creators' collective behaviors towards optimizing the potential function, which can be designed to match any given welfare metric. In addition, the class of BRM can be parameterized so that it allows the platform to directly optimize welfare within the feasible mechanism space even when the welfare metric is not explicitly defined.

Counterfactually Comparing Abstaining Classifiers
Yo Joong Choe Aditya Gangrade Aaditya Ramdas



Research question: How to evaluate and compare abstaining classifiers, in particular with respect to their abstained predictions.
Motivation: Abstaining classifiers are increasingly popular in high-stakes decision-making because withholding uncertain predictions improves reliability and safety; however, we lack a principled way to assess what a black-box abstaining classifier would have predicted on its abstentions, and these missing predictions matter when they can eventually be utilized, either directly or as a backup option in a failure mode.
Method: Treating abstentions as missing data, we introduce a novel approach and perspective for evaluating and comparing abstaining classifiers, centered on the counterfactual score of an abstaining classifier: its expected performance had it not been allowed to abstain. We specify when this score is identifiable: if abstentions are stochastic and the evaluation data are independent of the training data (ensuring the predictions are missing at random).
Results: Leveraging tools from observational causal inference, we develop nonparametric and doubly robust methods that estimate this quantity efficiently, examined in both simulated and real data experiments.

Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about. These classifiers are becoming increasingly popular in high-stake decision-making problems, as they can withhold uncertain predictions to improve their reliability and safety. When evaluating black-box abstaining classifier(s), however, we lack a principled approach that accounts for what the classifier would have predicted on its abstentions. These missing predictions matter when they can eventually be utilized, either directly or as a backup option in a failure mode. In this paper, we introduce a novel approach and perspective to the problem of evaluating and comparing abstaining classifiers by treating abstentions as missing data. Our evaluation approach is centered around defining the counterfactual score of an abstaining classifier, defined as the expected performance of the classifier had it not been allowed to abstain. We specify the conditions under which the counterfactual score is identifiable: if the abstentions are stochastic, and if the evaluation data is independent of the training data (ensuring that the predictions are missing at random), then the score is identifiable. Note that, if abstentions are deterministic, then the score is unidentifiable because the classifier can perform arbitrarily poorly on its abstentions. Leveraging tools from observational causal inference, we then develop nonparametric and doubly robust methods to efficiently estimate this quantity under identification. Our approach is examined in both simulated and real data experiments.
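
Under the paper's missing-at-random identification condition, the counterfactual score admits a standard doubly robust (AIPW) estimator. The sketch below uses off-the-shelf nuisance models as placeholders; the paper's estimators and their efficiency analysis are more careful.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def counterfactual_score_dr(X, predicted, scores):
    """Doubly robust (AIPW) estimate of the counterfactual score: the
    expected score had the classifier not been allowed to abstain.

    predicted[i] is 1 if the classifier made a prediction on X[i]; scores[i]
    is the realized score (e.g. 0/1 correctness) and is only used where
    predicted == 1 (it may be NaN elsewhere). Assumes stochastic abstentions,
    i.e. scores are missing at random given X.
    """
    prop = LogisticRegression().fit(X, predicted).predict_proba(X)[:, 1]
    mu = (RandomForestRegressor()
          .fit(X[predicted == 1], scores[predicted == 1])
          .predict(X))
    resid = np.where(predicted == 1, scores - mu, 0.0)  # 0 on abstentions
    return np.mean(mu + predicted * resid / prop)       # AIPW estimate
```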

Explain Any Concept: Segment Anything Meets Concept-Based Explanation
Ao Sun Pingchuan Ma Yuanyuan Yuan Shuai Wang



Research question: How to improve the explainability of deep neural networks so as to enhance human understanding of their black-box internals.
Motivation: Mainstream pixel-based XAI methods explain DNN decisions by identifying important pixels, while emerging concept-based XAI forms explanations from concepts (e.g., a head in an image); however, pixels are generally hard to interpret and sensitive to the imprecision of XAI methods, and the "concepts" in prior work require human annotation or are limited to pre-defined concept sets.
Method: This paper is the first to explore using SAM to augment concept-based XAI, offering an effective and flexible concept-based explanation method, Explain Any Concept (EAC), which explains DNN decisions with any concept.
Results: Evaluation on two popular datasets (ImageNet and COCO) shows highly encouraging performance of EAC compared with commonly used XAI methods.

EXplainable AI (XAI) is an essential topic to improve human understanding of deep neural networks (DNNs) given their black-box internals. For computer vision tasks, mainstream pixel-based XAI methods explain DNN decisions by identifying important pixels, and emerging concept-based XAI explores forming explanations with concepts (e.g., a head in an image). However, pixels are generally hard to interpret and sensitive to the imprecision of XAI methods, whereas “concepts” in prior works require human annotation or are limited to pre-defined concept sets. On the other hand, driven by large-scale pre-training, Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework for performing precise and comprehensive instance segmentation, enabling automatic preparation of concept sets from a given image. This paper for the first time explores using SAM to augment concept-based XAI. We offer an effective and flexible concept-based explanation method, namely Explain Any Concept (EAC), which explains DNN decisions with any concept. While SAM is highly effective and offers an “out-of-the-box” instance segmentation, it is costly when integrated into de facto XAI pipelines. We thus propose a lightweight per-input equivalent (PIE) scheme, enabling efficient explanation with a surrogate model. Our evaluation over two popular datasets (ImageNet and COCO) illustrates the highly encouraging performance of EAC over commonly-used XAI methods.

HQA-Attack: Toward High Quality Black-Box Hard-Label Adversarial Attack on Text
Han Liu Zhi Xu Xiaotong Zhang Feng Zhang Fenglong Ma Hongyang Chen Hong Yu Xianchao Zhang



Research question: Black-box hard-label adversarial attacks on text, a practical and challenging task because the text data space is inherently discrete and non-differentiable and only the predicted label is accessible.
Motivation: Existing methods rely on complex heuristic algorithms or unreliable gradient estimation strategies, tend to fall into local optima, and inevitably consume numerous queries, making it difficult to craft satisfactory adversarial examples with high semantic similarity and low perturbation rate under a limited query budget.
Method: We propose HQA-Attack, a simple yet effective framework for generating high-quality textual adversarial examples in the black-box hard-label setting. HQA-Attack first randomly initializes an adversarial example and then substitutes back as many original words as possible, shrinking the perturbation rate; it next uses the synonym sets of the remaining changed words to further optimize the adversarial example in a direction that improves semantic similarity while preserving the adversarial condition. During optimization, it searches for a transition synonym for each changed word, avoiding a traversal of the whole synonym set and reducing the number of queries.
Results: Extensive experiments on five text classification datasets, three natural language inference datasets, and two real-world APIs show that HQA-Attack significantly outperforms other strong baselines.

Black-box hard-label adversarial attack on text is a practical and challenging task, as the text data space is inherently discrete and non-differentiable, and only the predicted label is accessible. Research on this problem is still in the embryonic stage and only a few methods are available. Nevertheless, existing methods rely on complex heuristic algorithms or unreliable gradient estimation strategies, which probably fall into local optima and inevitably consume numerous queries, making it difficult to craft satisfactory adversarial examples with high semantic similarity and low perturbation rate within a limited query budget. To alleviate the above issues, we propose a simple yet effective framework, named HQA-Attack, to generate high quality textual adversarial examples under black-box hard-label attack scenarios. Specifically, after initializing an adversarial example randomly, HQA-Attack first substitutes back as many original words as possible, thus shrinking the perturbation rate. Then it leverages the synonym sets of the remaining changed words to further optimize the adversarial example in a direction that improves the semantic similarity and satisfies the adversarial condition simultaneously. In addition, during the optimization procedure, it searches for a transition synonym for each changed word, thus avoiding traversing the whole synonym set and reducing the query count to some extent. Extensive experimental results on five text classification datasets, three natural language inference datasets and two real-world APIs show that the proposed HQA-Attack method significantly outperforms other strong baselines.

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
Usha Bhalla Suraj Srinivas Himabindu Lakkaraju



Research question: How to make machine learning models more explainable so that their explanations better reflect model behavior.
Motivation: The two existing strategies, post hoc explanation methods and inherently interpretable models, both have drawbacks: post hoc explanations may be unfaithful, while inherently interpretable models tend to have poor predictive performance.
Method: Distractor Erasure Tuning (DiET) is proposed, which adapts black-box models to be robust to the erasure of distractor features, thereby providing discriminative and faithful feature attributions.
Results: Extensive experiments on semi-synthetic and real-world datasets show that DiET produces models that (1) closely approximate the original black-box models they are intended to explain and (2) yield explanations matching approximate ground truths available by construction.

With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by identifying features critical to model predictions; however, prior work has shown that these explanations may not be faithful, in that they incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful, but they often exhibit poor predictive performance due to their limited expressive power. In this work, we identify a key reason for the lack of faithfulness of feature attributions: the lack of robustness of the underlying black-box models, especially the erasure of unimportant distractor features in the input. To address this issue, we propose Distractor Erasure Tuning (DiET), a method that adapts black-box models to be robust to distractor erasure, thus providing discriminative and faithful attributions. This strategy naturally combines the ease-of-use of post hoc explanations with the faithfulness of inherently interpretable models. We perform extensive experiments on semi-synthetic and real-world datasets, and show that DiET produces models that (1) closely approximate the original black-box models they are intended to explain, and (2) yield explanations that match approximate ground truths available by construction.

Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly
Qizhang Li Yiwen Guo Wangmeng Zuo Hao Chen



Research question: The adversarial vulnerability of deep neural networks has drawn wide attention due to the security risks of deploying these models in real-world applications.
Motivation: Owing to the transferability of adversarial examples, a growing number of transfer-based methods have been developed to fool black-box DNN models whose architectures and parameters are inaccessible; however, there is no standardized benchmark for comparing these methods systematically, fairly, and practically.
Method: We establish a transfer-based attack benchmark (TA-Bench) implementing 30+ methods, and comprehensively evaluate and compare them on 10 popular substitute/victim models on ImageNet.
Results: New insights into the effectiveness of these methods are gained, and guidelines for future evaluations are provided.

The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, there still lacks a standardized benchmark that could be taken advantage of to compare these methods systematically, fairly, and practically. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 10 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided.

A Randomized Approach to Tight Privacy Accounting
Jiachen T. Wang Saeed Mahloujifar Tong Wu Ruoxi Jia Prateek Mittal



Research question: How to bound privacy leakage over compositions in differential privacy (DP), i.e., privacy accounting.
Motivation: In DP, the privacy parameter (ε or δ) is often easy to estimate but hard to bound.
Method: A new DP paradigm, estimate-verify-release (EVR), converts an estimate of the privacy parameter into a formal guarantee, addressing the challenge of providing a strict upper bound on the privacy parameter in DP compositions: it first verifies whether the mechanism meets the estimated privacy guarantee, and then releases the query output based on the verification result.
Results: Experiments show that the EVR paradigm improves the utility-privacy trade-off of privacy-preserving machine learning.

Bounding privacy leakage over compositions, i.e., privacy accounting, is a key challenge in differential privacy (DP). However, the privacy parameter ($\varepsilon$ or $\delta$) is often easy to estimate but hard to bound. In this paper, we propose a new differential privacy paradigm called estimate-verify-release (EVR), which tackles the challenges of providing a strict upper bound for the privacy parameter in DP compositions by converting an *estimate* of privacy parameter into a formal guarantee. The EVR paradigm first verifies whether the mechanism meets the *estimated* privacy guarantee, and then releases the query output based on the verification result. The core component of the EVR is privacy verification. We develop a randomized privacy verifier using Monte Carlo (MC) technique. Furthermore, we propose an MC-based DP accountant that outperforms existing DP accounting techniques in terms of accuracy and efficiency. MC-based DP verifier and accountant is applicable to an important and commonly used class of DP algorithms, including the famous DP-SGD. An empirical evaluation shows the proposed EVR paradigm improves the utility-privacy tradeoff for privacy-preserving machine learning.
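
The Monte Carlo accounting idea can be illustrated on the k-fold composition of the Gaussian mechanism, where the total privacy loss is normally distributed and δ(ε) has the standard privacy-loss-distribution form δ(ε) = E[(1 − e^{ε−L})₊]. This sketch is only the estimation half of EVR; the paper's verifier additionally certifies that the released estimate upper-bounds the true parameter.

```python
import numpy as np

def mc_delta_gaussian(sigma, k, eps, n=10**6, seed=0):
    """Monte Carlo estimate of delta(eps) for k adaptively composed Gaussian
    mechanisms (sensitivity 1, noise multiplier sigma).

    The total privacy loss L is N(k*eta, 2*k*eta) with eta = 1/(2*sigma^2),
    and delta(eps) = E[max(0, 1 - exp(eps - L))].
    """
    rng = np.random.default_rng(seed)
    eta = 1.0 / (2.0 * sigma ** 2)
    L = rng.normal(k * eta, np.sqrt(2.0 * k * eta), size=n)
    # Clip the exponent at 0: whenever eps >= L the term is 0 anyway,
    # and clipping avoids overflow for very small losses.
    return np.mean(np.maximum(0.0, 1.0 - np.exp(np.minimum(eps - L, 0.0))))
```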

Rethinking the Backward Propagation for Adversarial Transferability
Xiaosen Wang Kangheng Tong Kun He



Research question: How to improve the transferability of adversarial examples so as to mislead inaccessible black-box models and attack real-world applications.
Motivation: Existing transfer-based attack methods can improve the transferability of adversarial examples but usually overlook the role of the surrogate model.
Method: Observing that non-linear layers (e.g., ReLU, max-pooling) truncate the gradient during backward propagation, making the gradient of the loss w.r.t. the input image imprecise, we propose the Backward Propagation Attack (BPA): it adopts a non-monotonic function as the derivative of ReLU and incorporates softmax with temperature to smooth the derivative of max-pooling, reducing the information loss during gradient backpropagation.
Results: Experiments on ImageNet show that the method substantially boosts the transferability of adversarial examples and is general to existing transfer-based attacks.

Transfer-based attacks generate adversarial examples on the surrogate model, which can mislead other black-box models without access, making it promising to attack real-world applications. Recently, several works have been proposed to boost adversarial transferability, in which the surrogate model is usually overlooked. In this work, we identify that non-linear layers (e.g., ReLU, max-pooling, etc.) truncate the gradient during backward propagation, making the gradient w.r.t. input image imprecise to the loss function. We hypothesize and empirically validate that such truncation undermines the transferability of adversarial examples. Based on these findings, we propose a novel method called Backward Propagation Attack (BPA) to increase the relevance between the gradient w.r.t. input image and loss function so as to generate adversarial examples with higher transferability. Specifically, BPA adopts a non-monotonic function as the derivative of ReLU and incorporates softmax with temperature to smooth the derivative of max-pooling, thereby mitigating the information loss during the backward propagation of gradients. Empirical results on the ImageNet dataset demonstrate that not only does our method substantially boost the adversarial transferability, but it is also general to existing transfer-based attacks. Code is available at https://github.com/Trustworthy-AI-Group/RPA.
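
The mechanism is easy to prototype in PyTorch: keep the forward pass exact but swap the backward pass of the non-linear layers. The sigmoid surrogate derivative below is a stand-in, not the paper's non-monotonic function, and `soft_maxpool2d` assumes spatial dimensions divisible by the kernel size.

```python
import torch

class ReLUSurrogateBackward(torch.autograd.Function):
    """Exact ReLU forward; backward replaced by a smooth surrogate so that
    slightly negative pre-activations still pass gradient (the gradient
    un-truncation idea; the paper specifies the exact derivative)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.relu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * torch.sigmoid(5.0 * x)  # smooth stand-in for 1[x > 0]

def soft_maxpool2d(x, k, temperature=10.0):
    """Softmax-with-temperature pooling: the forward value approaches true
    max-pooling as temperature grows, while the gradient spreads smoothly
    over all entries of each window instead of only the argmax."""
    p = x.unfold(2, k, k).unfold(3, k, k)       # (N, C, H', W', k, k)
    p = p.contiguous().flatten(-2)              # (N, C, H', W', k*k)
    w = torch.softmax(temperature * p, dim=-1)  # soft argmax weights
    return (w * p).sum(dim=-1)
```

An attacker would patch these replacements into the surrogate model before running a standard gradient-based transfer attack.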

Functional Renyi Differential Privacy for Generative Modeling
Dihong Jiang Sun Sun Yaoliang Yu



Research question: How to quantify data privacy and develop practical protection mechanisms.
Motivation: To provide a more rigorous quantification of data privacy together with more flexible privacy-preserving mechanisms.
Method: Renyi differential privacy (RDP) is extended to infinite-dimensional functional output spaces, and the necessary tools, such as the (subsampled) Gaussian mechanism, composition, and post-processing rules, are developed.
Results: Applying functional RDP (f-RDP) to functions in a reproducing kernel Hilbert space (RKHS) yields a differentially private generative model (DPGM) with significantly improved privacy-utility trade-offs.

Differential privacy (DP) has emerged as a rigorous notion to quantify data privacy. Subsequently, Renyi differential privacy (RDP) becomes an alternative to the ordinary DP notion in both theoretical and empirical studies, for its convenient compositional rules and flexibility. However, most mechanisms with DP (RDP) guarantees are essentially based on randomizing a fixed, finite-dimensional vector output. In this work, following Hall et al. (2013) we further extend RDP to functional outputs, where the output space can be infinite-dimensional, and develop all necessary tools, *e.g.*, (subsampled) Gaussian mechanism, composition, and post-processing rules, to facilitate its practical adoption. As an illustration, we apply functional RDP (f-RDP) to functions in the reproducing kernel Hilbert space (RKHS) to develop a differentially private generative model (DPGM), where training can be interpreted as iteratively releasing loss functions (in an RKHS) with DP (RDP) guarantees. Empirically, the new training paradigm achieves a significant improvement in privacy-utility trade-off compared to existing alternatives, especially when $\epsilon=0.2$. Our code is available at https://github.com/dihjiang/DP-kernel.

On the Relationship Between Relevance and Conflict in Online Social Link Recommendations
Yanbang Wang Jon Kleinberg



Research question: In online social networks, link recommendations help users discover relevant links to people they may know, potentially increasing their engagement on the platform; however, adding links can also affect the level of conflict in the network (expressed as polarization and disagreement). To date, little is understood about how these two implications of link formation relate: are the goals of high relevance and conflict reduction aligned, or are the links users are most likely to accept fundamentally different from the ones with the greatest potential for reducing conflict?
Motivation: We provide the first analysis of this question, using the recently popular Friedkin-Johnsen model of opinion dynamics.
Method: We first show how link additions shift the level of opinion conflict and relate the amount of shift to structural features of the added links; we then characterize, on real data, the gap in conflict reduction between the set of links achieving the largest reduction and the set achieving the highest relevance.
Results: We find that some, but not all, of the more accurate algorithms actually lead to better conflict reduction; social links recommended to increase user engagement may not be as conflict-provoking as one might have thought.

In an online social network, link recommendations are a way for users to discover relevant links to people they may know, thereby potentially increasing their engagement on the platform. However, the addition of links to a social network can also have an effect on the level of conflict in the network --- expressed in terms of polarization and disagreement. To date, however, we have very little understanding of how these two implications of link formation relate to each other: are the goals of high relevance and conflict reduction aligned, or are the links that users are most likely to accept fundamentally different from the ones with the greatest potential for reducing conflict? Here we provide the first analysis of this question, using the recently popular Friedkin-Johnsen model of opinion dynamics. We first present a surprising result on how link additions shift the level of opinion conflict, followed by explanation work that relates the amount of shift to structural features of the added links. We then characterize the gap in conflict reduction between the set of links achieving the largest reduction and the set of links achieving the highest relevance. The gap is measured on real-world data, based on instantiations of relevance defined by 13 link recommendation algorithms. We find that some, but not all, of the more accurate algorithms actually lead to better reduction of conflict. Our work suggests that social links recommended for increasing user engagement may not be as conflict-provoking as people might have thought.
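
The quantities involved are standard for the Friedkin-Johnsen model and easy to compute: equilibrium opinions solve $(I + L)z = s$, and conflict is commonly measured as polarization plus disagreement. The sketch below (assuming a symmetric weighted adjacency matrix) can be re-run after adding a candidate link to measure the kind of shift the paper analyzes.

```python
import numpy as np

def fj_conflict(adjacency, innate):
    """Friedkin-Johnsen equilibrium and a conflict measure.

    Equilibrium opinions: z = (I + L)^{-1} s, with L the graph Laplacian and
    s the innate opinions. Conflict = polarization (variance of z around its
    mean) + disagreement (weighted squared opinion differences over edges).
    """
    A = np.asarray(adjacency, dtype=float)
    s = np.asarray(innate, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    z = np.linalg.solve(np.eye(len(A)) + L, s)
    polarization = np.sum((z - z.mean()) ** 2)
    disagreement = 0.5 * np.sum(A * (z[:, None] - z[None, :]) ** 2)
    return z, polarization + disagreement
```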

UltraRE: Enhancing RecEraser for Recommendation Unlearning via Error Decomposition
Yuyuan Li Chaochao Chen Yizhao Zhang Weiming Liu Lingjuan Lyu Xiaolin Zheng Dan Meng Jun Wang



Research question: With growing concerns about privacy in machine learning models, how to achieve recommendation unlearning with full completeness while preserving model utility and unlearning efficiency.
Motivation: Regulations grant individuals the right to be forgotten and mandate that companies develop non-discriminatory machine learning systems, motivating better unlearning capabilities for recommender systems.
Method: The existing RecEraser framework is rethought from an ensemble perspective, optimizing its three potential losses (redundancy, relevance, and combination), which yields a new framework named UltraRE.
Results: Extensive experiments on three real-world datasets demonstrate the effectiveness of UltraRE.

With growing concerns regarding privacy in machine learning models, regulations have committed to granting individuals the right to be forgotten while mandating companies to develop non-discriminatory machine learning systems, thereby fueling the study of the machine unlearning problem. Our attention is directed toward a practical unlearning scenario, i.e., recommendation unlearning. As the state-of-the-art framework, i.e., RecEraser, naturally achieves full unlearning completeness, our objective is to enhance it in terms of model utility and unlearning efficiency. In this paper, we rethink RecEraser from an ensemble-based perspective and focus on its three potential losses, i.e., redundancy, relevance, and combination. Under the theoretical guidance of the above three losses, we propose a new framework named UltraRE, which simplifies and powers RecEraser for recommendation tasks. Specifically, for redundancy loss, we incorporate transport weights in the clustering algorithm to optimize the equilibrium between collaboration and balance while enhancing efficiency; for relevance loss, we ensure that sub-models reach convergence on their respective group data; for combination loss, we simplify the combination estimator without compromising its efficacy. Extensive experiments on three real-world datasets demonstrate the effectiveness of UltraRE.

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Jiaqi Xue Mengxin Zheng Ting Hua Yilin Shen Yepeng Liu Ladislau Bölöni Qian Lou



Research question: Large language models (LLMs) are used as machine learning services and interface tools across applications, yet their security, particularly with respect to adversarial and Trojan attacks, remains insufficiently examined.
Motivation: This paper proposes TrojLLM, an automatic black-box framework for effectively generating universal and stealthy triggers that, when incorporated into input data, can maliciously manipulate LLM outputs.
Method: The framework also supports embedding Trojans within discrete prompts, enhancing the overall effectiveness and precision of the trigger attacks. Specifically, a trigger discovery algorithm generates universal triggers for various inputs by querying victim LLM-based APIs with few-shot data samples, and a novel progressive Trojan poisoning algorithm produces poisoned prompts that remain effective and transferable across models.
Results: Experiments show that TrojLLM can effectively insert Trojans into the text prompts of real-world black-box LLM APIs, including GPT-3.5 and GPT-4, while maintaining excellent performance on clean test sets; the work sheds light on potential security risks of current models and offers a potential defensive approach.

Large Language Models (LLMs) are progressively being utilized as machine learning services and interface tools for various applications. However, the security implications of LLMs, particularly in relation to adversarial and Trojan attacks, remain insufficiently examined. In this paper, we propose TrojLLM, an automatic and black-box framework to effectively generate universal and stealthy triggers. When these triggers are incorporated into the input data, the LLMs' outputs can be maliciously manipulated. Moreover, the framework also supports embedding Trojans within discrete prompts, enhancing the overall effectiveness and precision of the triggers' attacks. Specifically, we propose a trigger discovery algorithm for generating universal triggers for various inputs by querying victim LLM-based APIs using few-shot data samples. Furthermore, we introduce a novel progressive Trojan poisoning algorithm designed to generate poisoned prompts that retain efficacy and transferability across a diverse range of models. Our experiments and results demonstrate TrojLLM's capacity to effectively insert Trojans into text prompts in real-world black-box LLM APIs including GPT-3.5 and GPT-4, while maintaining exceptional performance on clean test sets. Our work sheds light on the potential security risks in current models and offers a potential defensive approach. The source code of TrojLLM is available at https://github.com/UCF-ML-Research/TrojLLM.

State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding
Devleena Das Sonia Chernova Been Kim



Research question: How to provide non-AI experts with understandable explanations of AI decision making, particularly in sequential decision making.
Motivation: As more non-AI experts use complex AI systems for daily tasks, there is a growing need for methods that produce explanations of AI decision making they can understand.
Method: A unified framework, State2Explanation (S2E), learns a joint embedding model between state-action pairs and concept-based explanations, and uses it both to inform reward shaping during an agent's training and to provide explanations to end users at deployment.
Results: Experimental validation in Connect 4 and Lunar Lander shows that S2E provides a dual benefit: it successfully informs reward shaping and improves the agent's learning rate, and it significantly improves end-user task performance at deployment.

As more non-AI experts use complex AI systems for daily tasks, there has been an increasing effort to develop methods that produce explanations of AI decision making that are understandable by non-AI experts. Towards this effort, leveraging higher-level concepts and producing concept-based explanations have become a popular method. Most concept-based explanations have been developed for classification techniques, and we posit that the few existing methods for sequential decision making are limited in scope. In this work, we first contribute a desiderata for defining ``concepts'' in sequential decision making settings. Additionally, inspired by the Protege Effect which states explaining knowledge often reinforces one's self-learning, we explore how concept-based explanations of an RL agent's decision making can in turn improve the agent's learning rate, as well as improve end-user understanding of the agent's decision making. To this end, we contribute a unified framework, State2Explanation (S2E), that involves learning a joint embedding model between state-action pairs and concept-based explanations, and leveraging such learned model to both (1) inform reward shaping during an agent's training, and (2) provide explanations to end-users at deployment for improved task performance. Our experimental validations, in Connect 4 and Lunar Lander, demonstrate the success of S2E in providing a dual-benefit, successfully informing reward shaping and improving agent learning rate, as well as significantly improving end user task performance at deployment time.

Online Ad Procurement in Non-stationary Autobidding Worlds
Jason Cheuk Nam Liang Haihao Lu Baoyu Zhou



Research question: How online advertisers can effectively optimize ad lever decisions through autobidding platforms.
Motivation: Non-stationary factors such as seasonal patterns, occasional system corruptions, and market trends make it difficult for advertisers to optimize lever decisions effectively in practice.
Method: An online learning framework is proposed, featuring a primal-dual algorithm for online decision making with multi-dimensional decision variables, bandit feedback, and long-term uncertain constraints.
Results: The algorithm achieves low regret in many worlds, when procurement outcomes are generated by stochastic, adversarial, adversarially corrupted, periodic, or ergodic procedures, respectively, without knowing which procedure is the ground truth; the proposed algorithm and theoretical results also extend beyond online advertising applications.

Today's online advertisers procure digital ad impressions through interacting with autobidding platforms: advertisers convey high-level procurement goals via setting levers such as budget, target return-on-investment, max cost per click, etc. Ad platforms then procure impressions on advertisers' behalf and report final procurement conversions (e.g. clicks) to advertisers. In practice, advertisers may receive minimal information on platforms' procurement details, and procurement outcomes are subject to non-stationary factors like seasonal patterns, occasional system corruptions, and market trends, which make it difficult for advertisers to optimize lever decisions effectively. Motivated by this, we present an online learning framework that helps advertisers dynamically optimize ad platform lever decisions while subject to general long-term constraints in a realistic bandit feedback environment with non-stationary procurement outcomes. In particular, we introduce a primal-dual algorithm for online decision making with multi-dimensional decision variables, bandit feedback and long-term uncertain constraints. We show that our algorithm achieves low regret in many worlds when procurement outcomes are generated through procedures that are stochastic, adversarial, adversarially corrupted, periodic, and ergodic, respectively, without having to know which procedure is the ground truth. Finally, we emphasize that our proposed algorithm and theoretical results extend beyond the applications of online advertising.
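
A toy version of such a primal-dual loop is sketched below, with a scalar budget constraint, ε-greedy reward/cost estimates, and hypothetical `reward_fn`/`cost_fn` callables standing in for bandit feedback; the paper's algorithm handles multi-dimensional levers and the non-stationary outcome processes listed above.

```python
import numpy as np

def primal_dual_bandit(levers, reward_fn, cost_fn, budget, T, eta=0.05, seed=0):
    """Primal-dual decision making under a long-term constraint (sketch).

    Primal step: pick the lever with the best Lagrangian score
    (reward estimate - dual price * cost estimate).
    Dual step: raise the price when observed spend exceeds the
    per-round budget, lower it otherwise (projected to be nonnegative).
    """
    rng = np.random.default_rng(seed)
    n = len(levers)
    r_hat, c_hat, counts = np.zeros(n), np.zeros(n), np.zeros(n)
    lam = 0.0
    for _ in range(T):
        a = rng.integers(n) if rng.random() < 0.1 \
            else int(np.argmax(r_hat - lam * c_hat))
        r, c = reward_fn(levers[a]), cost_fn(levers[a])  # bandit feedback
        counts[a] += 1
        r_hat[a] += (r - r_hat[a]) / counts[a]           # running means
        c_hat[a] += (c - c_hat[a]) / counts[a]
        lam = max(0.0, lam + eta * (c - budget / T))     # dual update
    return r_hat, c_hat, lam
```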

VeriX: Towards Verified Explainability of Deep Neural Networks
Min Wu Haoze Wu Clark Barrett



Research question: This paper presents VeriX, a system for producing optimal robust explanations and counterfactuals along the decision boundaries of machine learning models.
Motivation: Improving the explainability and trustworthiness of machine learning models requires methods that can produce optimal robust explanations and counterfactuals.
Method: Explanations and counterfactuals are built iteratively using constraint solving techniques and a heuristic based on feature-level sensitivity ranking.
Results: The method is evaluated on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.

We present **VeriX** (**Veri**fied e**X**plainability), a system for producing *optimal robust explanations* and generating *counterfactuals* along decision boundaries of machine learning models. We build such explanations and counterfactuals iteratively using constraint solving techniques and a heuristic based on feature-level sensitivity ranking. We evaluate our method on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.

Robust Concept Erasure via Kernelized Rate-Distortion Maximization
Somnath Basu Roy Chowdhury Nicholas Monath Kumar Avinava Dubey Amr Ahmed Snigdha Chaturvedi



Research question: How to remove an attribute from distributed representations while retaining as much of the other information in the original representation space as possible.
Motivation: Existing distributed representations entangle multiple attributes or concepts of data instances (e.g., the topic or sentiment of a text, characteristics of the author).
Method: A new distance-metric-learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), is proposed for concept erasure; KRaM fits a transformation of representations to match a specified distance measure (defined by the labeled concept to erase) using a modified rate-distortion function.
Results: Experiments show that KRaM effectively erases various types of concepts (categorical, continuous, and vector-valued variables) from distributed representations across diverse domains.

Distributed representations provide a vector space that captures meaningful relationships between data instances. The distributed nature of these representations, however, entangles together multiple attributes or concepts of data instances (e.g., the topic or sentiment of a text, characteristics of the author (age, gender, etc), etc). Recent work has proposed the task of concept erasure, in which rather than making a concept predictable, the goal is to remove an attribute from distributed representations while retaining other information from the original representation space as much as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM fits a transformation of representations to match a specified distance measure (defined by a labeled concept to erase) using a modified rate-distortion function. Specifically, KRaM's objective function aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts—categorical, continuous, and vector-valued variables—from data representations across diverse domains. We also provide a theoretical analysis of several properties of KRaM's objective. To assess the quality of the learned representations, we propose an alignment score to evaluate their similarity with the original representation space. Additionally, we conduct experiments to showcase KRaM's efficacy in various settings, from erasing binary gender variables in word embeddings to vector-valued variables in GPT-3 representations.

Double Auctions with Two-sided Bandit Feedback
Soumya Basu Abishek Sankararaman



Research question: This paper studies double auction markets in which buyers and sellers learn their own valuations through repeated interaction.
Motivation: Double auction markets underpin many online marketplaces; buyers and sellers compete through bids but typically do not know their own valuations a priori, so the profitability of participants, and hence the sustainability of the market, depends crucially on learning these valuations through repeated interactions.
Method: We propose confidence-bound-based bidding combined with an "Average Pricing" strategy to achieve efficient price discovery, and prove that the social regret, i.e., the regret on the combined valuation of the buyers and sellers, is $O(\log(T)/\Delta)$ over $T$ rounds, where $\Delta$ is the minimum price gap.
Results: Buyers and sellers who exchange goods attain $O(\sqrt{T})$ individual regret, while those who do not benefit from exchange experience only $O(\log{T}/\Delta)$ individual regret; moreover, $\omega(\sqrt{T})$ individual regret and $\omega(\log{T})$ social regret are shown to be unattainable in certain double auction markets.

Double Auction enables decentralized transfer of goods between multiple buyers and sellers, thus underpinning the functioning of many online marketplaces. Buyers and sellers compete in these markets through bidding, but do not often know their own valuation a priori. As the allocation and pricing happen through bids, the profitability of participants, and hence the sustainability of such markets, depends crucially on learning respective valuations through repeated interactions. We initiate the study of Double Auction markets under bandit feedback on both buyers' and sellers' side. We show that with confidence bound based bidding and `Average Pricing' there is efficient price discovery among the participants. In particular, the regret on combined valuation of the buyers and the sellers -- a.k.a. the social regret -- is $O(\log(T)/\Delta)$ in $T$ rounds, where $\Delta$ is the minimum price gap. Moreover, the buyers and sellers exchanging goods attain $O(\sqrt{T})$ regret, individually. The buyers and sellers who do not benefit from exchange in turn only experience $O(\log{T}/\Delta)$ regret individually in $T$ rounds. We augment our upper bound by showing that $\omega(\sqrt{T})$ individual regret, and $\omega(\log{T})$ social regret is unattainable in certain Double Auction markets. Our paper is the first to provide decentralized learning algorithms in a two-sided market where \emph{both sides have uncertain preferences} that need to be learned.
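
The market-clearing rule at the heart of the protocol is simple to state in code. The sketch below implements only the "Average Pricing" clearing step; the learning layer (buyers bidding upper confidence bounds on their valuations, sellers asking lower confidence bounds on their costs) is noted in the docstring rather than implemented.

```python
import numpy as np

def average_pricing_double_auction(bids, asks):
    """Clear a double auction with 'Average Pricing' (a sketch).

    Sort bids descending and asks ascending, find the largest K with
    bid_(K) >= ask_(K), trade the top K pairs at price
    (bid_(K) + ask_(K)) / 2. In the bandit protocol, bids/asks would be
    confidence bounds on unknown valuations/costs.
    """
    bids, asks = np.asarray(bids, float), np.asarray(asks, float)
    b_order, a_order = np.argsort(-bids), np.argsort(asks)
    b_sorted, a_sorted = bids[b_order], asks[a_order]
    K = 0
    while K < min(len(bids), len(asks)) and b_sorted[K] >= a_sorted[K]:
        K += 1
    if K == 0:
        return [], None                       # no trade possible
    price = (b_sorted[K - 1] + a_sorted[K - 1]) / 2.0
    return list(zip(b_order[:K], a_order[:K])), price
```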

Adaptive Privacy Composition for Accuracy-first Mechanisms
Ryan Rogers Gennady Samorodnitsky Steven Wu Aaditya Ramdas



Research question: How to combine ex-post private mechanisms with differentially private mechanisms under a single overall privacy guarantee.
Motivation: Existing ex-post private mechanisms can provide privacy subject to a target level of accuracy, but there has been no way to use them in conjunction with differentially private mechanisms, and no theory for how ex-post private mechanisms compose, so the privacy accumulated over several mechanisms could not be tracked.
Method: We develop privacy filters that allow an analyst to adaptively switch between differentially private mechanisms and ex-post private mechanisms subject to an overall privacy loss guarantee.
Results: Using a particular ex-post private mechanism, noise reduction mechanisms, can substantially outperform baselines that rely on existing privacy loss composition bounds; the motivating example is returning as many counts as possible subject to a relative error guarantee and an overall privacy budget.

Although there has been work to develop ex-post private mechanisms (Ligett et al. '17; Whitehouse et al. '22) that seek to provide privacy guarantees subject to a target level of accuracy, there has been no way to use them in conjunction with differentially private mechanisms. Furthermore, there has yet to be a theory for how these ex-post privacy mechanisms compose, so that we can track the accumulated privacy over several mechanisms. We develop privacy filters that allow an analyst to adaptively switch between differentially private mechanisms and ex-post private mechanisms subject to an overall privacy loss guarantee. We show that using a particular ex-post private mechanism --- noise reduction mechanisms --- can substantially outperform baseline approaches that use existing privacy loss composition bounds. We use the common task of returning as many counts as possible subject to a relative error guarantee and an overall privacy budget as a motivating example.
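
At its simplest, a privacy filter is a piece of bookkeeping that halts the interaction before the budget would be exceeded. The toy class below shows only that control flow; the paper's contribution is proving the composition remains valid when the reported losses come adaptively from a mix of DP and ex-post private (e.g., noise reduction) mechanisms.

```python
class PrivacyFilter:
    """Toy privacy filter: mechanisms report their realized privacy loss,
    and the filter refuses any release that would push the running total
    over the overall budget. A sketch of the bookkeeping only."""

    def __init__(self, eps_budget):
        self.eps_budget = eps_budget
        self.spent = 0.0

    def try_spend(self, eps):
        """Call before releasing a result with realized privacy loss eps;
        returns False (halt) if the overall budget would be exceeded."""
        if self.spent + eps > self.eps_budget:
            return False
        self.spent += eps
        return True
```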

The Adversarial Consistency of Surrogate Risks for Binary Classification
Natalie Frank Jonathan Niles-Weed



Research question: This work studies the consistency of surrogate risks for robust binary classification.
Motivation: Adversarial training, a common approach to learning robust classifiers, seeks to minimize the expected 0-1 loss when each example may be maliciously corrupted within a small ball.
Method: We give a simple and complete characterization of the set of surrogate loss functions that are "consistent", i.e., that can replace the 0-1 loss without affecting the minimizing sequences of the original adversarial risk, for any data distribution.
Results: Our results reveal that the class of adversarially consistent surrogates is substantially smaller than in the standard setting, where many common surrogates are known to be consistent.

We study the consistency of surrogate risks for robust binary classification. It is common to learn robust classifiers by adversarial training, which seeks to minimize the expected $0$-$1$ loss when each example can be maliciously corrupted within a small ball. We give a simple and complete characterization of the set of surrogate loss functions that are \emph{consistent}, i.e., that can replace the $0$-$1$ loss without affecting the minimizing sequences of the original adversarial risk, for any data distribution. We also prove a quantitative version of adversarial consistency for the $\rho$-margin loss. Our results reveal that the class of adversarially consistent surrogates is substantially smaller than in the standard setting, where many common surrogates are known to be consistent.

Improving the Privacy and Practicality of Objective Perturbation for Differentially Private Linear Learners
Rachel Emily Redberg Antti Koskela Yu-Xiang Wang



Research question: How to improve the performance of the objective perturbation mechanism in privacy-preserving machine learning.
Motivation: Although differentially private stochastic gradient descent (DP-SGD) is unrivaled in versatility, it requires a non-trivial privacy overhead (for privately tuning the model's hyperparameters) and a computational cost that may be extravagant for simple models such as linear and logistic regression.
Method: This paper revamps the objective perturbation mechanism with tighter privacy analyses and new computational tools, making it competitive with DP-SGD on unconstrained convex generalized linear problems.
Results: Experiments show that the improved objective perturbation mechanism performs competitively with DP-SGD.

In the arena of privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) has outstripped the objective perturbation mechanism in popularity and interest. Though unrivaled in versatility, DP-SGD requires a non-trivial privacy overhead (for privately tuning the model’s hyperparameters) and a computational complexity which might be extravagant for simple models such as linear and logistic regression. This paper revamps the objective perturbation mechanism with tighter privacy analyses and new computational tools that boost it to perform competitively with DP-SGD on unconstrained convex generalized linear problems.

Marich: A Query-efficient Distributionally Equivalent Model Extraction Attack
Pratik Karmakar Debabrota Basu



Research question: Design black-box model extraction attacks that send a minimal number of queries from a publicly available dataset to a target ML model through a predictive API, in order to create an informative and distributionally equivalent replica of the target.
Motivation: A useful stolen model should be both informative and match the target's predictive distribution, yet existing active-sampling attacks do not explicitly optimize for distributional equivalence and can be query-inefficient.
Method: First define distributionally equivalent and Max-Information model extraction attacks and reduce them to a variational optimization problem; the attacker sequentially solves this problem, selecting the most informative queries that simultaneously maximize entropy and reduce the mismatch between the target and the stolen model. This yields Marich, a model-oblivious, active-sampling-based query selection algorithm.
Results: Evaluated on text and image datasets and on models including CNNs and BERT, Marich extracts models achieving 60-95% of the true model's accuracy using 1,000-8,500 queries from public datasets that differ from the private training data; the extracted models' prediction distributions are 2-4x closer to the target's than those of existing active-sampling attacks, and they attain 84-96% accuracy under membership inference attacks, validating that Marich is query-efficient and performs task-accurate, high-fidelity, and informative model extraction.

We study design of black-box model extraction attacks that can *send minimal number of queries from* a *publicly available dataset* to a target ML model through a predictive API with an aim *to create an informative and distributionally equivalent replica* of the target. First, we define *distributionally equivalent* and *Max-Information model extraction* attacks, and reduce them into a variational optimisation problem. The attacker sequentially solves this optimisation problem to select the most informative queries that simultaneously maximise the entropy and reduce the mismatch between the target and the stolen models. This leads to *an active sampling-based query selection algorithm*, Marich, which is *model-oblivious*. Then, we evaluate Marich on different text and image data sets, and different models, including CNNs and BERT. Marich extracts models that achieve $\sim 60-95\%$ of true model's accuracy and uses $\sim 1,000 - 8,500$ queries from the publicly available datasets, which are different from the private training datasets. Models extracted by Marich yield prediction distributions, which are $\sim2-4\times$ closer to the target's distribution in comparison to the existing active sampling-based attacks. The extracted models also lead to 84-96$\%$ accuracy under membership inference attacks. Experimental results validate that Marich is *query-efficient*, and capable of performing task-accurate, high-fidelity, and informative model extraction.

Online Ad Allocation with Predictions
Fabian Christian Spaeh Alina Ene



Research question: This paper addresses two important online ad allocation problems: Display Ads and the generalized assignment problem.
Motivation: Worst-case algorithms achieving the ideal competitive ratio are known for both problems, but they may act overly conservatively given the predictable and usually tame nature of real-world input; the authors therefore develop an algorithm that incorporates machine-learned predictions to improve performance beyond the worst case.
Method: Building on the work of Feldman et al. (2009), the authors develop a learning-augmented algorithm that can capitalize on good predictions while remaining robust to poor ones.
Results: Experimental evaluation on synthetic and real-world data across a wide range of predictions shows that the algorithm consistently outperforms the worst-case algorithm without predictions.

Display Ads and the generalized assignment problem are two well-studied online packing problems with important applications in ad allocation and other areas. In both problems, ad impressions arrive online and have to be allocated immediately to budget-constrained advertisers. Worst-case algorithms that achieve the ideal competitive ratio are known for both problems, but might act overly conservatively given the predictable and usually tame nature of real-world input. Given this discrepancy, we develop an algorithm for both problems that incorporates machine-learned predictions and can thus improve the performance beyond the worst case. Our algorithm is based on the work of Feldman et al. (2009) and similar in nature to Mahdian et al. (2007), who were the first to develop a learning-augmented algorithm for the related, but more structured Ad Words problem. We use a novel analysis to show that our algorithm is able to capitalize on a good prediction, while being robust against poor predictions. We experimentally evaluate our algorithm on synthetic and real-world data on a wide range of predictions. Our algorithm consistently outperforms the worst-case algorithm without predictions.

Strategic Classification under Unknown Personalized Manipulation
Han Shao Avrim Blum Omar Montasser



Research question: This work studies the fundamental mistake bound and sample complexity of strategic classification, where agents can strategically manipulate their feature vectors to be predicted as positive.
Motivation: In many settings, such as college admission decisions, students may take easier classes to improve their GPA, retake the SAT, or change schools in an effort to fool the classifier; "ball manipulations", in which agents can modify their feature vector within a bounded-radius ball, are a widely studied class of manipulations in the literature.
Method: The learning problem is formalized in an interaction model where the learner first deploys a classifier and the agent then manipulates the feature vector within their manipulation set to game it; various scenarios are investigated regarding the information available to the learner during the interaction, such as observing the original feature vector before or after deployment, observing the manipulated feature vector, or observing neither.
Results: Online mistake bounds and PAC sample complexity are provided for these scenarios under ball manipulations; for non-ball manipulations, even in the simplest scenario where both the original and the manipulated feature vectors are revealed, the mistake bounds and sample complexity are lower bounded by $\Omega(|\mathcal H|)$ when the target function belongs to a known class $\mathcal H$.

We study the fundamental mistake bound and sample complexity in strategic classification, where agents can strategically manipulate their feature vector up to an extent in order to be predicted as positive. For example, given a classifier determining college admission, student candidates may try to take easier classes to improve their GPA, retake the SAT, or change schools in an effort to fool the classifier. *Ball manipulations* are a widely studied class of manipulations in the literature, where agents can modify their feature vector within a bounded radius ball. Unlike most prior work, our work considers manipulations to be *personalized*, meaning that agents can have different levels of manipulation abilities (e.g., varying radii for ball manipulations), and *unknown* to the learner. We formalize the learning problem in an interaction model where the learner first deploys a classifier and the agent manipulates the feature vector within their manipulation set to game the deployed classifier. We investigate various scenarios in terms of the information available to the learner during the interaction, such as observing the original feature vector before or after deployment, observing the manipulated feature vector, or not seeing either the original or the manipulated feature vector. We begin by providing online mistake bounds and PAC sample complexity in these scenarios for ball manipulations. We also explore non-ball manipulations and show that, even in the simplest scenario where both the original and the manipulated feature vectors are revealed, the mistake bounds and sample complexity are lower bounded by $\Omega(|\mathcal H|)$ when the target function belongs to a known class $\mathcal H$.

Scalable Membership Inference Attacks via Quantile Regression
Martin Andres Bertran Shuai Tang Aaron Roth Michael Kearns Jamie Heather Morgenstern Steven Wu



Research question: This paper addresses the problem of determining, using black-box access to a trained model, whether a particular example was used in training.
Motivation: Existing membership inference attacks estimate the distribution of a test statistic by training many shadow models, which is computationally expensive and requires knowledge of the attacked model's architecture.
Method: A new class of attacks is proposed, based on performing quantile regression on the distribution of confidence scores the attacked model induces on points not used in training.
Results: Experiments show that the method is competitive with state-of-the-art shadow model attacks at a much lower computational cost, since only a single model needs to be trained; moreover, unlike shadow model attacks, it requires no knowledge of the attacked model's architecture and is therefore a truly "black-box" attack.

Membership inference attacks are designed to determine, using black box access to trained models, whether a particular example was used in training or not. Membership inference can be formalized as a hypothesis testing problem. The most effective existing attacks estimate the distribution of some test statistic (usually the model's confidence on the true label) on points that were (and were not) used in training by training many \emph{shadow models}---i.e. models of the same architecture as the model being attacked, trained on a random subsample of data. While effective, these attacks are extremely computationally expensive, especially when the model under attack is large. We introduce a new class of attacks based on performing quantile regression on the distribution of confidence scores induced by the model under attack on points that are not used in training. We show that our method is competitive with state-of-the-art shadow model attacks, while requiring substantially less compute because our attack requires training only a single model. Moreover, unlike shadow model attacks, our proposed attack does not require any knowledge of the architecture of the model under attack and is therefore truly ``black-box". We show the efficacy of this approach in an extensive series of experiments on various datasets and model architectures. Our code is available at \href{https://github.com/amazon-science/quantile-mia}{github.com/amazon-science/quantile-mia.}
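
The attack reduces to fitting one conditional quantile model on known non-members; here is a sketch using an off-the-shelf quantile regressor, where the feature and model choices are placeholders rather than the paper's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def quantile_mia(public_X, public_conf, target_X, target_conf, q=0.95):
    """Quantile-regression membership inference (sketch).

    Fit the conditional q-quantile of the confidence scores the attacked
    model assigns to points known NOT to be in its training set, then flag
    a candidate as a member when its confidence exceeds the predicted
    per-example quantile.
    """
    qr = GradientBoostingRegressor(loss="quantile", alpha=q)
    qr.fit(public_X, public_conf)      # non-member confidence quantiles
    thresholds = qr.predict(target_X)  # per-example decision thresholds
    return target_conf > thresholds    # True => predicted member
```

Thresholding at the predicted q-quantile targets a per-example false-positive rate of roughly 1 − q.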

FedGame: A Game-Theoretic Defense against Backdoor Attacks in Federated Learning
Jinyuan Jia Zhuowen Yuan Dinuka Sahabandu Luyao Niu Arezoo Rajabi Bhaskar Ramasubramanian Bo Li Radha Poovendran



Research question: How to defend federated learning against dynamic attackers who exploit backdoor attacks to corrupt the global model.
Motivation: Existing federated learning defenses against backdoor attacks are usually based on a static attacker model and cannot effectively resist dynamic attackers who strategically adapt their attack strategies.
Method: The strategic interactions between the defender and dynamic attackers are modeled as a minimax game, based on which an interactive defense mechanism named FedGame is designed.
Results: Under mild assumptions, the global model trained with FedGame under backdoor attacks is provably close to one trained without attacks; compared with multiple state-of-the-art baselines, FedGame effectively defends against strategic attackers and achieves significantly higher robustness.

Federated learning (FL) provides a distributed training paradigm where multiple clients can jointly train a global model without sharing their local data. However, recent studies have shown that FL offers an additional surface for backdoor attacks. For instance, an attacker can compromise a subset of clients and thus corrupt the global model to misclassify an input with a backdoor trigger as the adversarial target. Existing defenses for FL against backdoor attacks usually detect and exclude the corrupted information from the compromised clients based on a static attacker model. However, such defenses are inadequate against dynamic attackers who strategically adapt their attack strategies. To bridge this gap, we model the strategic interactions between the defender and dynamic attackers as a minimax game. Based on the analysis of the game, we design an interactive defense mechanism FedGame. We prove that under mild assumptions, the global model trained with FedGame under backdoor attacks is close to that trained without attacks. Empirically, we compare FedGame with multiple state-of-the-art baselines on several benchmark datasets under various attacks. We show that FedGame can effectively defend against strategic attackers and achieves significantly higher robustness than baselines. Our code is available at: https://github.com/AI-secure/FedGame.

Lending Interaction Wings to Recommender Systems with Conversational Agents
Jiarui Jin Xianyu Chen Fanghua Ye Mengyue Yang Yue Feng Weinan Zhang Yong Yu Jun Wang



Research question: This paper proposes a new offline-training and online-checking framework that plugs a conversational agent into recommender systems.
Motivation: Existing recommender systems are trained mainly on users' offline historical behaviors, whereas a conversational agent can obtain user preferences online to overcome this limitation.
Method: A new framework named CORE bridges the conversational agent and the recommender system through a unified uncertainty minimization framework, rather than systematically combining the conversational and recommender parts through reinforcement learning as most prior conversational recommendation approaches do.
Results: Experiments show that CORE can be seamlessly employed with a variety of recommendation approaches and consistently brings significant improvements in both hot-start and cold-start settings.

An intelligent conversational agent (a.k.a., chat-bot) could embrace conversational technologies to obtain user preferences online, to overcome inherent limitations of recommender systems trained over the offline historical user behaviors. In this paper, we propose CORE, a new offline-training and online-checking framework to plug a COnversational agent into REcommender systems. Unlike most prior conversational recommendation approaches that systemically combine conversational and recommender parts through a reinforcement learning framework, CORE bridges the conversational agent and recommender system through a unified uncertainty minimization framework, which can be easily applied to any existing recommendation approach. Concretely, CORE treats a recommender system as an offline estimator to produce an estimated relevance score for each item, while CORE regards a conversational agent as an online checker that checks these estimated scores in each online session. We define uncertainty as the sum of unchecked relevance scores. In this regard, the conversational agent acts to minimize uncertainty via querying either attributes or items. Towards uncertainty minimization, we derive the certainty gain of querying each attribute and item, and develop a novel online decision tree algorithm to decide what to query at each turn. Our theoretical analysis reveals the bound of the expected number of turns of CORE in a cold-start setting. Experimental results demonstrate that CORE can be seamlessly employed on a variety of recommendation approaches, and can consistently bring significant improvements in both hot-start and cold-start settings.
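
The decision rule can be illustrated with a toy greedy version: uncertainty is the sum of unchecked relevance scores, and each candidate query is scored by how much of that sum it would check. The paper develops an online decision tree over such certainty gains; the names below (`scores`, `items_of_attribute`) are illustrative.

```python
def choose_query(scores, items_of_attribute):
    """Greedy certainty-gain query selection (toy sketch of CORE's checker).

    scores: dict item_id -> estimated relevance score, still unchecked.
    items_of_attribute: dict attribute -> iterable of item_ids carrying it.
    Querying an item checks that item; querying an attribute checks every
    item carrying it. Pick whichever query checks the most relevance mass.
    """
    best_item = max(scores, key=scores.get)
    item_gain = scores[best_item]
    attr_gains = {a: sum(scores.get(i, 0.0) for i in items)
                  for a, items in items_of_attribute.items()}
    if attr_gains:
        best_attr = max(attr_gains, key=attr_gains.get)
        if attr_gains[best_attr] > item_gain:
            return ("attribute", best_attr)
    return ("item", best_item)
```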

Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Florian E. Dorner Nikola Konstantinov Georgi Stoyanov Pashaliev Martin Vechev



Research question: How to incentivize competitors to update models honestly in collaborative learning so that high learning quality is achieved.
Motivation: Although collaborative learning can train machine learning models superior to those trained on any single entity's data, participants are often competitors on a downstream task, such as firms that attract customers by providing the best recommendations, which may lead them to damage other participants' models for their own benefit.
Method: A game is formulated to model such interactions, and two learning tasks are studied within this framework: single-round mean estimation and multi-round SGD on strongly convex objectives. For a natural class of player actions, rational clients are shown to be incentivized to strongly manipulate their updates, preventing learning; mechanisms are then proposed that incentivize honest communication and ensure learning quality comparable to full cooperation.
Results: The effectiveness of the incentive scheme is demonstrated empirically on a standard non-convex federated learning benchmark; explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.

Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Lastly, we empirically demonstrate the effectiveness of our incentive scheme on a standard non-convex federated learning benchmark. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.

Online Pricing for Multi-User Multi-Item Markets
Yigit Efe Erginbas Thomas Courtade Kannan Ramchandran Soham Rajesh Phade



Research question: How to efficiently offer and price multiple items for multiple users with online algorithms that learn user valuations from accept/reject feedback, so as to maximize revenue.
Motivation: Existing work on online pricing mostly concerns selling a single item to sequentially arriving users; in a multi-item, multi-user market, intelligently offering users the items they value most, at the highest prices they will accept, is a complex problem.
Method: Studies three user valuation models (fixed valuations, random experiences, and random valuations) and provides online algorithms with nearly optimal revenue-regret guarantees.
Results: Over $T$ rounds, the algorithm achieves regret $O(NM\log\log(LT))$ under the fixed valuations model, and $\widetilde{O}(\sqrt{NMLT})$ under both the random experiences and random valuations models.

Online pricing has been the focus of extensive research in recent years, particularly in the context of selling an item to sequentially arriving users. However, what if a provider wants to maximize revenue by selling multiple items to multiple users in each round? This presents a complex problem, as the provider must intelligently offer the items to those users who value them the most without exceeding their highest acceptable prices. In this study, we tackle this challenge by designing online algorithms that can efficiently offer and price items while learning user valuations from accept/reject feedback. We focus on three user valuation models (fixed valuations, random experiences, and random valuations) and provide algorithms with nearly-optimal revenue regret guarantees. In particular, for any market setting with $N$ users, $M$ items, and load $L$ (which roughly corresponds to the maximum number of simultaneous allocations possible), our algorithms achieve regret of order $O(NM\log\log(LT))$ under fixed valuations model, $\widetilde{O}(\sqrt{NMLT})$ under random experiences model and $\widetilde{O}(\sqrt{NMLT})$ under random valuations model in $T$ rounds.
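
For intuition on why the fixed-valuations model admits such small regret, here is a toy, single-user, single-item bisection sketch (entirely illustrative; `post_prices` and its parameters are ours, not the paper's): each accept/reject answer halves the interval containing the valuation, so the posted price converges to it very quickly. The paper's algorithms additionally coordinate offers across $N$ users and $M$ items under load $L$.

```python
# Toy sketch: learn one fixed valuation from accept/reject feedback by
# bisection. Accepting at price p reveals value >= p; rejecting reveals
# value < p. The posted price converges exponentially fast to the valuation.

def post_prices(true_value=0.731, rounds=20):
    lo, hi, revenue = 0.0, 1.0, 0.0
    for _ in range(rounds):
        price = (lo + hi) / 2
        if price <= true_value:   # accepted: collect revenue, probe higher
            revenue += price
            lo = price
        else:                     # rejected: no sale, probe lower
            hi = price
    return revenue, (lo + hi) / 2  # final estimate of the valuation

print(post_prices())  # estimate converges to 0.731 after 20 rounds
```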

Causal Fairness for Outcome Control
Drago Plecko Elias Bareinboim



Research question: How to make automated decision-making systems fair and equitable with respect to sensitive attributes such as gender, race, and religion.
Motivation: As society transitions towards an AI-based decision-making infrastructure, an ever-increasing number of decisions once made by humans are delegated to automated systems. Although such developments make various parts of society more efficient, a large body of evidence suggests that great care is needed to make these automated decision-making systems fair and equitable.
Method: Analyzes through causal lenses the notion of benefit, i.e., how much a specific individual would benefit, counterfactually speaking, from a positive decision contrasted with a negative one. Introduces benefit fairness, which can be seen as a minimal fairness requirement in decision-making, and develops an algorithm for satisfying it; then notes that the benefit itself may be influenced by the protected attribute and proposes causal tools that can be used to analyze this.
Results: If some of the variations of the protected attribute in the benefit are considered discriminatory, the notion of benefit fairness may need to be strengthened, leading to the articulation of a notion of causal benefit fairness; using this notion, a new optimization procedure is developed that maximizes $Y$ while ascertaining causal fairness in the decision process.

As society transitions towards an AI-based decision-making infrastructure, an ever-increasing number of decisions once under control of humans are now delegated to automated systems. Even though such developments make various parts of society more efficient, a large body of evidence suggests that a great deal of care needs to be taken to make such automated decision-making systems fair and equitable, namely, taking into account sensitive attributes such as gender, race, and religion. In this paper, we study a specific decision-making task called outcome control in which an automated system aims to optimize an outcome variable $Y$ while being fair and equitable. The interest in such a setting ranges from interventions related to criminal justice and welfare, all the way to clinical decision-making and public health. In this paper, we first analyze through causal lenses the notion of benefit, which captures how much a specific individual would benefit from a positive decision, counterfactually speaking, when contrasted with an alternative, negative one. We introduce the notion of benefit fairness, which can be seen as the minimal fairness requirement in decision-making, and develop an algorithm for satisfying it. We then note that the benefit itself may be influenced by the protected attribute, and propose causal tools which can be used to analyze this. Finally, if some of the variations of the protected attribute in the benefit are considered as discriminatory, the notion of benefit fairness may need to be strengthened, which leads us to articulating a notion of causal benefit fairness. Using this notion, we develop a new optimization procedure capable of maximizing $Y$ while ascertaining causal fairness in the decision process.

Unleashing the Power of Randomization in Auditing Differentially Private ML
Krishna Pillutla Galen Andrew Peter Kairouz Hugh Brendan McMahan Alina Oprea Sewoong Oh



Research question: How to rigorously audit differentially private machine learning by adding multiple carefully designed "canary" examples.
Motivation: Handling randomized datasets requires extending the definition of differential privacy and designing randomized canaries.
Method: Introduces Lifted Differential Privacy (LiDP) and audits it by trying to distinguish a model trained with $K$ canaries from one trained with only $K-1$; additionally, constructs novel confidence intervals that exploit the multiple test statistics by adapting to empirical higher-order correlations.
Results: Both theoretically and empirically, the new recipe significantly improves sample complexity, with good results on synthetic and real data; moreover, recently designed stronger canaries can be readily incorporated into the new framework.

We present a rigorous methodology for auditing differentially private machine learning by adding multiple carefully designed examples called canaries. We take a first principles approach based on three key components. First, we introduce Lifted Differential Privacy (LiDP) that expands the definition of differential privacy to handle randomized datasets. This gives us the freedom to design randomized canaries. Second, we audit LiDP by trying to distinguish between the model trained with $K$ canaries versus $K-1$ canaries in the dataset, leaving one canary out. By drawing the canaries i.i.d., LiDP can leverage the symmetry in the design and reuse each privately trained model to run multiple statistical tests, one for each canary. Third, we introduce novel confidence intervals that take advantage of the multiple test statistics by adapting to the empirical higher-order correlations. Together, this new recipe demonstrates significant improvements in sample complexity, both theoretically and empirically, using synthetic and real data. Further, recent advances in designing stronger canaries can be readily incorporated in the new framework.
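
The K-versus-(K-1) test is easy to state in code. Below is a toy illustration of ours (none of these names come from the paper; "training" is abstracted into a scorer that leaks a small per-canary signal) of why i.i.d. canaries let an auditor compare a held-out canary against an included one: the held-out canary should look like a fresh draw unless the mechanism leaks.

```python
import numpy as np

# A toy sketch of the leave-one-out canary test behind LiDP, with training
# abstracted as a scorer that leaks a little signal about each training canary.

rng = np.random.default_rng(0)

def train_with_canaries(canaries):
    memorized = {tuple(c) for c in canaries}
    def score(x):                               # higher = looks "memorized"
        leak = 1.0 if tuple(x) in memorized else 0.0
        return leak + rng.normal(scale=2.0)     # noise from the private mechanism
    return score

K, dim, trials = 8, 16, 400
stats_in, stats_out = [], []
for _ in range(trials):
    canaries = rng.integers(0, 2, size=(K, dim))
    stats_out.append(train_with_canaries(canaries[:-1])(canaries[-1]))  # K-1 in training
    stats_in.append(train_with_canaries(canaries)(canaries[-1]))        # all K in training

# A visible gap between the two populations stresses the claimed privacy
# level; LiDP pools one such statistic per canary per trained model.
print(np.mean(stats_in) - np.mean(stats_out))   # close to 1.0 in this toy setup
```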

Supply-Side Equilibria in Recommender Systems
Meena Jagadeesan Nikhil Garg Jacob Steinhardt



Research question: How personalized content recommender systems affect producer incentives, and what supply-side equilibria result.
Motivation: Algorithmic recommender systems such as Spotify and Netflix affect not only consumer behavior but also producer incentives: producers try to create content that the recommendation algorithm will show, which can affect both the diversity and quality of their content.
Method: Models producers' decisions as choosing multi-dimensional content vectors, with users having heterogeneous preferences, in contrast with classical low-dimensional models. Using a duality argument, derives necessary and sufficient conditions for whether specialization occurs, and then characterizes the equilibrium distribution of content in concrete settings with two user populations.
Results: The analysis shows that specialization can enable producers to achieve positive profit at equilibrium, meaning that specialization can reduce the competitiveness of the marketplace; conceptually, this analysis of supply-side competition helps elucidate how personalized recommendations shape the market for digital goods.

Algorithmic recommender systems such as Spotify and Netflix affect not only consumer behavior but also *producer incentives*. Producers seek to create content that will be shown by the recommendation algorithm, which can impact both the diversity and quality of their content. In this work, we investigate the resulting supply-side equilibria in personalized content recommender systems. We model the decisions of producers as choosing *multi-dimensional* content vectors and users as having *heterogenous* preferences, which contrasts with classical low-dimensional models. Multi-dimensionality and heterogeneity creates the potential for *specialization*, where different producers create different types of content at equilibrium. Using a duality argument, we derive necessary and sufficient conditions for whether specialization occurs. Then, we characterize the distribution of content at equilibrium in concrete settings with two populations of users. Lastly, we show that specialization can enable producers to achieve *positive profit at equilibrium*, which means that specialization can reduce the competitiveness of the marketplace. At a conceptual level, our analysis of supply-side competition takes a step towards elucidating how personalized recommendations shape the marketplace of digital goods.

Robust Contrastive Language-Image Pretraining against Data Poisoning and Backdoor Attacks
Wenhan Yang Jingdong Gao Baharan Mirzasoleiman



Research question: Contrastive vision-language representation learning achieves state-of-the-art zero-shot classification performance, but is extremely vulnerable to various types of targeted data poisoning and backdoor attacks.
Motivation: Although large multimodal models such as CLIP are susceptible to targeted data poisoning and backdoor attacks, robust contrastive vision-language pre-training against such attacks has remained unaddressed.
Method: Proposes RoCLIP, the first effective method for robust pre-training of multimodal vision-language models against targeted data poisoning and backdoor attacks. RoCLIP breaks the association between poisoned image-caption pairs by maintaining a relatively large and varying pool of random captions and, every few epochs, matching each image with the caption most similar to it in the pool rather than with its own caption.
Results: Experiments show that RoCLIP renders targeted data poisoning and backdoor attacks ineffective during CLIP pre-training; in particular, it decreases the success rate of targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks to 0%, while improving the model's linear-probe performance by 10% and maintaining zero-shot performance comparable to CLIP.

Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP, makes them extremely vulnerable to various types of targeted data poisoning and backdoor attacks. Despite this vulnerability, robust contrastive vision-language pre-training against such attacks has remained unaddressed. In this work, we propose RoCLIP, the first effective method for robust pre-training multimodal vision-language models against targeted data poisoning and backdoor attacks. RoCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions, and matching every image with the text that is most similar to it in the pool instead of its own caption, every few epochs. It also leverages image and text augmentations to further strengthen the defense and improve the performance of the model. Our extensive experiments show that RoCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models. In particular, RoCLIP decreases the success rate for targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while improving the model's linear probe performance by 10% and maintaining a similar zero-shot performance compared to CLIP. By increasing the frequency of matching, RoCLIP is able to defend strong attacks, which add up to 1% poisoned examples to the data, and successfully maintain a low attack success rate of 12.5%, while trading off the performance on some tasks.
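
The pool-matching step lends itself to a compact sketch. The following is a minimal illustration of ours (not the authors' code; the embeddings are random stand-ins) of matching each image to its nearest caption in a large random pool, which is what severs the fixed image-caption pairing that a poisoning attack relies on.

```python
import numpy as np

# A minimal sketch of RoCLIP-style pool matching: each image embedding is
# paired with its nearest neighbour in a random caption pool instead of its
# own (possibly poisoned) caption.

def pool_match(image_emb, caption_pool):
    """Return, for each image, the index of the most similar pooled caption."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    cap = caption_pool / np.linalg.norm(caption_pool, axis=1, keepdims=True)
    sims = img @ cap.T                 # cosine similarity matrix
    return sims.argmax(axis=1)         # nearest pooled caption per image

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 16))      # batch of image embeddings
pool = rng.normal(size=(64, 16))       # large, regularly refreshed caption pool
matches = pool_match(images, pool)
# The contrastive loss is then computed on (images[i], pool[matches[i]]) pairs,
# breaking the fixed image-caption association an attacker relies on.
print(matches)
```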

CBD: A Certified Backdoor Detector Based on Local Dominant Probability
Zhen Xiang Zidi Xiong Bo Li



Research question: How to detect backdoor attacks on deep neural networks with formal guarantees rather than purely empirical heuristics.
Motivation: Backdoor attacks are a common threat to deep neural networks: samples embedded with a backdoor trigger are misclassified as an adversarial target by a backdoored model, while clean samples are classified correctly, and existing detectors offer no certification of their decisions.
Method: Presents CBD, the first certified backdoor detector, based on a novel adjustable conformal prediction scheme built on the proposed local dominant probability statistic. For any classifier under inspection, CBD provides a detection inference, the condition under which attacks are guaranteed to be detectable for the same classification domain, and a probabilistic upper bound on the false positive rate.
Results: Theory shows that attacks whose triggers are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. On four benchmark datasets (GTSRB, SVHN, CIFAR-10, TinyImageNet) and diverse backdoor types such as BadNet, CB, and Blend, CBD matches or exceeds state-of-the-art detection accuracy while additionally providing certification, e.g., achieving 100% (98%), 100% (84%), 98% (98%), and 72% (40%) empirical (certified) detection true positive rates for random-perturbation triggers bounded by $\ell_2\leq0.75$, with low false positive rates.

Backdoor attack is a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified as an adversarial target by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is based on a novel, adjustable conformal prediction scheme based on our proposed statistic local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which the attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound for the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and it in addition provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ which achieves more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.

Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition
Meena Jagadeesan Michael Jordan Jacob Steinhardt Nika Haghtalab



Research question: As machine learning models grow, do improvements in predictive accuracy always follow, and how does this trend change when multiple model providers compete?
Motivation: Existing analyses take the perspective of a single model provider, whereas in reality providers compete with each other for users; this work studies how competition affects predictive accuracy.
Method: Defines a model of competition for classification tasks and uses data representations as a lens for studying the impact of increases in scale; in closed-form examples and in simulations with pretrained representations on CIFAR-10, improving representation quality (as measured by Bayes risk) can decrease the overall predictive accuracy across users.
Results: Scaling trends that are favorable for an individual model provider need not translate into improvements in social welfare in marketplaces with multiple competing model providers.

As the scale of machine learning models increases, trends such as scaling laws anticipate consistent downstream improvements in predictive accuracy. However, these trends take the perspective of a single model-provider in isolation, while in reality providers often compete with each other for users. In this work, we demonstrate that competition can fundamentally alter the behavior of these scaling trends, even causing overall predictive accuracy across users to be non-monotonic or decreasing with scale. We define a model of competition for classification tasks, and use data representations as a lens for studying the impact of increases in scale. We find many settings where improving data representation quality (as measured by Bayes risk) decreases the overall predictive accuracy across users (i.e., social welfare) for a marketplace of competing model-providers. Our examples range from closed-form formulas in simple settings to simulations with pretrained representations on CIFAR-10. At a conceptual level, our work suggests that favorable scaling trends for individual model-providers need not translate to downstream improvements in social welfare in marketplaces with multiple model providers.

Differentially Private Decoupled Graph Convolutions for Multigranular Topology Protection
Eli Chien Wei-Ning Chen Chao Pan Pan Li Ayfer Ozgur Olgica Milenkovic



Research question: How to protect the sensitive user information and interactions that graph neural networks (GNNs) can expose when trained on graph-structured data.
Motivation: Directly applying standard differential privacy (DP) methods to GNNs raises two main problems: node-label prediction based on neighboring attributes can leak privacy, and in practical applications the privacy requirements for node attributes and graph topology may differ.
Method: Proposes Graph Differential Privacy (GDP), a new framework tailored to graph learning that ensures both private model parameters and private predictions, together with a unified notion of graph-dataset adjacency for analyzing GDP under different levels of graph-topology privacy; as a GDP-compatible alternative to graph convolutions, proposes Differentially Private Decoupled Graph Convolutions (DPDGCs).
Results: Extensive experiments on seven node-classification benchmarks and illustrative synthetic datasets show that DPDGCs significantly outperform existing DP-GNNs in terms of privacy-utility trade-offs.

Graph Neural Networks (GNNs) have proven to be highly effective in solving real-world learning problems that involve graph-structured data. However, GNNs can also inadvertently expose sensitive user information and interactions through their model predictions. To address these privacy concerns, Differential Privacy (DP) protocols are employed to control the trade-off between provable privacy protection and model utility. Applying standard DP approaches to GNNs directly is not advisable due to two main reasons. First, the prediction of node labels, which relies on neighboring node attributes through graph convolutions, can lead to privacy leakage. Second, in practical applications, the privacy requirements for node attributes and graph topology may differ. In the latter setting, existing DP-GNN models fail to provide multigranular trade-offs between graph topology privacy, node attribute privacy, and GNN utility. To address both limitations, we propose a new framework termed Graph Differential Privacy (GDP), specifically tailored to graph learning. GDP ensures both provably private model parameters as well as private predictions. Additionally, we describe a novel unified notion of graph dataset adjacency to analyze the properties of GDP for different levels of graph topology privacy. Our findings reveal that DP-GNNs, which rely on graph convolutions, not only fail to meet the requirements for multigranular graph topology privacy but also necessitate the injection of DP noise that scales at least linearly with the maximum node degree. In contrast, our proposed Differentially Private Decoupled Graph Convolutions (DPDGCs) represent a more flexible and efficient alternative to graph convolutions that still provides the necessary guarantees of GDP. To validate our approach, we conducted extensive experiments on seven node classification benchmarking and illustrative synthetic datasets. The results demonstrate that DPDGCs significantly outperform existing DP-GNNs in terms of privacy-utility trade-offs.
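
The abstract does not spell out the decoupled convolution, so the following is only our simplified reading of the general idea: perform the neighborhood aggregation once, privatize that single release with noise, and let all subsequent training touch only the noised features. The names and noise calibration below are illustrative stand-ins, not the paper's GDP analysis.

```python
import numpy as np

# A heavily simplified sketch of a "decoupled" graph convolution for DP:
# aggregate neighbourhood features once, add noise to that single release,
# and train any downstream model on the noised features so that training
# never queries the graph again.

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.normal(size=(n, d))                          # node features
A = (rng.uniform(size=(n, n)) < 0.05).astype(float)  # toy adjacency matrix

deg = A.sum(axis=1, keepdims=True).clip(min=1)
agg = A @ X / deg                                    # one-shot mean aggregation
sigma = 0.5                                          # scale set by sensitivity/epsilon
agg_private = agg + rng.normal(scale=sigma, size=agg.shape)

features = np.concatenate([X, agg_private], axis=1)
# Downstream training consumes `features` only; the graph is released once.
print(features.shape)
```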

Bounding training data reconstruction in DP-SGD
Jamie Hayes Borja Balle Saeed Mahloujifar



Research question: How effective differentially private training is at protecting deep learning models against reconstruction attacks.
Motivation: Differentially private training is usually interpreted as a guarantee against membership inference attacks, but recent work suggests that if one only needs to protect training data against reconstruction, model utility can be improved because less noise is required.
Method: Taking differentially private stochastic gradient descent (DP-SGD) as the setting, provides an upper bound on the success of any reconstruction attack, together with an attack that empirically matches the predictions of the bound.
Results: Different DP-SGD parameter settings that guarantee the same DP level can yield significantly different reconstruction success rates, indicating that relying on the DP guarantee alone may not effectively control the threat of reconstruction attacks.

Differentially private training offers a protection which is usually interpreted as a guarantee against membership inference attacks. By proxy, this guarantee extends to other threats like reconstruction attacks attempting to extract complete training examples. Recent works provide evidence that if one does not need to protect against membership attacks but instead only wants to protect against a training data reconstruction, then utility of private models can be improved because less noise is required to protect against these more ambitious attacks. We investigate this question further in the context of DP-SGD, a standard algorithm for private deep learning, and provide an upper bound on the success of any reconstruction attack against DP-SGD together with an attack that empirically matches the predictions of our bound. Together, these two results open the door to fine-grained investigations on how to set the privacy parameters of DP-SGD in practice to protect against reconstruction attacks. Finally, we use our methods to demonstrate that different settings of the DP-SGD parameters leading to same DP guarantees can results in significantly different success rates for reconstruction, indicating that the DP guarantee alone might not be a good proxy for controlling the protection against reconstruction attacks.

Moral Responsibility for AI Systems
Sander Beckers



Research question: As more and more decisions with significant ethical dimensions are outsourced to AI systems, it becomes important to define moral responsibility for AI systems.
Motivation: Addressing questions of moral responsibility for AI systems requires a precise definition of that responsibility and its conditions.
Method: Presents a formal definition of both the causal and the epistemic conditions within the framework of causal models, and compares this approach with the existing ones of Braham and van Hees and of Halpern and Kleiman-Weiner.
Results: Generalizing the definition into a degree of responsibility makes it possible to assess the moral responsibility of AI systems in a graded way.

As more and more decisions that have a significant ethical dimension are being outsourced to AI systems, it is important to have a definition of _moral responsibility_ that can be applied to AI systems. Moral responsibility for an outcome of an agent who performs some action is commonly taken to involve both a _causal condition_ and an _epistemic condition_: the action should cause the outcome, and the agent should have been aware - in some form or other - of the possible moral consequences of their action. This paper presents a formal definition of both conditions within the framework of causal models. I compare my approach to the existing approaches of Braham and van Hees (BvH) and of Halpern and Kleiman-Weiner (HK). I then generalize my definition into a _degree of responsibility_.

Estimating and Controlling for Equalized Odds via Sensitive Attribute Predictors
Beepul Bharti Paul Yi Jeremias Sulam



Research question: As machine learning models are increasingly used in real-world high-stakes decision settings, it becomes important to audit and control the potential fairness violations these models may exhibit towards certain groups.
Motivation: In many settings, sensitive attribute information (demographics, biological sex, or other potentially sensitive features determining group membership) is unavailable; this work therefore studies the well-known equalized odds (EOD) definition of fairness in that setting.
Method: In the setting without sensitive attributes, first provides tight and computable upper bounds on the EOD violation of a predictor; second, demonstrates how the worst-case EOD can be provably controlled through a new post-processing correction method.
Results: The results characterize when directly controlling EOD with respect to the predicted sensitive attributes is, and when it is not, optimal for controlling worst-case EOD; they hold under milder assumptions than previous work and are illustrated with experiments on synthetic and real datasets.

As the use of machine learning models in real world high-stakes decision settings continues to grow, it is highly important that we are able to audit and control for any potential fairness violations these models may exhibit towards certain groups. To do so, one naturally requires access to sensitive attributes, such as demographics, biological sex, or other potentially sensitive features that determine group membership. Unfortunately, in many settings, this information is often unavailable. In this work we study the well known equalized odds (EOD) definition of fairness. In a setting without sensitive attributes, we first provide tight and computable upper bounds for the EOD violation of a predictor. These bounds precisely reflect the worst possible EOD violation. Second, we demonstrate how one can provably control the worst-case EOD by a new post-processing correction method. Our results characterize when directly controlling for EOD with respect to the predicted sensitive attributes is -- and when is not -- optimal when it comes to controlling worst-case EOD. Our results hold under assumptions that are milder than previous works, and we illustrate these results with experiments on synthetic and real datasets.

RECESS Vaccine for Federated Learning: Proactive Defense Against Model Poisoning Attacks
Haonan Yan Wenjing Zhang Qian Chen Xiaoguang Li Wenhai Sun HUI LI Xiaodong Lin



Research question: Model poisoning attacks greatly jeopardize the application of federated learning, and the effectiveness of existing defenses is susceptible to the latest poisoning attacks, leading to decreased prediction accuracy.
Motivation: Current defenses struggle to distinguish benign outliers from malicious gradients, which further compromises model generalization.
Method: Proposes RECESS, a novel defense comprising detection and aggregation that serves as a "vaccine" for federated learning against model poisoning attacks. RECESS proactively queries each participating client with a delicately constructed aggregation gradient and detects malicious clients from their responses with higher accuracy; it also adopts a newly proposed trust-scoring mechanism, which accounts for the correlation of clients' performance over multiple iterations, to robustly aggregate gradients.
Results: RECESS is extensively evaluated on typical model architectures and four datasets under various settings, including white/black-box and cross-silo/device federated learning; the results show it outperforms five classic and two state-of-the-art defenses in reducing the accuracy loss caused by the latest model poisoning attacks.

Model poisoning attacks greatly jeopardize the application of federated learning (FL). The effectiveness of existing defenses is susceptible to the latest model poisoning attacks, leading to a decrease in prediction accuracy. Besides, these defenses are intractable to distinguish benign outliers from malicious gradients, which further compromises the model generalization. In this work, we propose a novel defense including detection and aggregation, named RECESS, to serve as a “vaccine” for FL against model poisoning attacks. Different from the passive analysis in previous defenses, RECESS proactively queries each participating client with a delicately constructed aggregation gradient, accompanied by the detection of malicious clients according to their responses with higher accuracy. Further, RECESS adopts a newly proposed trust scoring based mechanism to robustly aggregate gradients. Rather than previous methods of scoring in each iteration, RECESS takes into account the correlation of clients’ performance over multiple iterations to estimate the trust score, bringing in a significant increase in detection fault tolerance. Finally, we extensively evaluate RECESS on typical model architectures and four datasets under various settings including white/black-box, cross-silo/device FL, etc. Experimental results show the superiority of RECESS in terms of reducing accuracy loss caused by the latest model poisoning attacks over five classic and two state-of-the-art defenses.

SALSA VERDE: a machine learning attack on LWE with sparse small secrets
Cathy Yuanchen Li Emily Wenger Zeyuan Allen-Zhu Francois Charton Kristin E. Lauter



Research question: How to assess the hardness of the LWE problem and the security of specific parameter choices.
Motivation: LWE is a hard problem underpinning post-quantum cryptography, and the security of homomorphic encryption (HE) schemes relies on its hardness, so continued assessment of the security of LWE and specific parameter choices is critical.
Method: Proposes VERDE, an improved ML attack that can recover sparse binary, ternary, and narrow Gaussian secrets. With improved preprocessing and secret-recovery techniques, VERDE can attack LWE with larger dimensions ($n=512$) and smaller moduli ($\log_2 q=12$ for $n=256$) using less time and power; novel architectures for scaling are also proposed.
Results: Experiments show that VERDE achieves significant improvements in recovering sparse secrets and can successfully attack LWE quickly and at modest computational cost.

Learning with Errors (LWE) is a hard math problem used in post-quantum cryptography. Homomorphic Encryption (HE) schemes rely on the hardness of the LWE problem for their security, and two LWE-based cryptosystems were recently standardized by NIST for digital signatures and key exchange (KEM). Thus, it is critical to continue assessing the security of LWE and specific parameter choices. For example, HE uses secrets with small entries, and the HE community has considered standardizing small sparse secrets to improve efficiency and functionality. However, prior work, SALSA and PICANTE, showed that ML attacks can recover sparse binary secrets. Building on these, we propose VERDE, an improved ML attack that can recover sparse binary, ternary, and narrow Gaussian secrets. Using improved preprocessing and secret recovery techniques, VERDE can attack LWE with larger dimensions ($n=512$) and smaller moduli ($\log_2 q=12$ for $n=256$), using less time and power. We propose novel architectures for scaling. Finally, we develop a theory that explains the success of ML LWE attacks.

Cookie Consent Has Disparate Impact on Estimation Accuracy
Erik Miehling Rahul Nair Elizabeth M. Daly Karthikeyan Natesan Ramamurthy Robert Nelson Redmond



Research question: How does a user's consent decision influence a recommender system's ability to learn their latent attributes, and does the system's estimation accuracy differ across demographics among users who consent to share cookies?
Motivation: Cookies enable more accurate identification and tracking of user behavior, raising questions of privacy and fairness: how do users' consent decisions affect what the recommender system learns about their latent attributes, and is this effect consistent across demographics?
Method: Conducts an empirical study by simulating an engagement-driven recommender system and analyzing, when consent rates exhibit demographic dependence, how consenting or declining affects the system's ability to learn users' latent attributes.
Results: When consent rates are demographic-dependent, a user declining to share their cookie may counter-intuitively cause the recommender system to learn more about them than if they had consented. Moreover, the gap in base consent rates acts as an amplifier: users from the lower-consent-rate demographic who consent to cookie sharing generally experience larger estimation errors than the same users from the higher-consent-rate demographic, and conversely for users who decline. This calls for new notions of fairness that encourage consistency between users' privacy decisions and the system's ability to estimate their latent attributes.

Cookies are designed to enable more accurate identification and tracking of user behavior, in turn allowing for more personalized ads and better performing ad campaigns. Given the additional information that is recorded, questions related to privacy and fairness naturally arise. How does a user's consent decision influence how much the system can learn about their demographic and tastes? Is the impact of a user's consent decision on the recommender system's ability to learn about their latent attributes uniform across demographics? We investigate these questions in the context of an engagement-driven recommender system using simulation. We empirically demonstrate that when consent rates exhibit demographic-dependence, user consent has a disparate impact on the recommender agent's ability to estimate users' latent attributes. In particular, we find that when consent rates are demographic-dependent, a user disagreeing to share their cookie may counter-intuitively cause the recommender agent to know more about the user than if the user agreed to share their cookie. Furthermore, the gap in base consent rates across demographics serves as an amplifier: users from the lower consent rate demographic who agree to cookie sharing generally experience higher estimation errors than the same users from the higher consent rate demographic, and conversely for users who choose to disagree to cookie sharing, with these differences increasing in consent rate gap. We discuss the need for new notions of fairness that encourage consistency between a user's privacy decisions and the system's ability to estimate their latent attributes.

Wasserstein distributional robustness of neural networks
Xingjian Bai Guangyi He Yifan Jiang Jan Obloj



Research question: Deep neural networks are vulnerable to adversarial attacks; designing such attacks, and defending against them, is an area of intense research.
Motivation: Traditional pointwise attacks assume a uniform bound on the perturbation of each input data point, whereas distributional threat models allow attackers to perturb inputs non-uniformly.
Method: Re-casts the problem using techniques from Wasserstein distributionally robust optimization (DRO), considers a set of distributional threat models, and links these more general attacks with questions of out-of-sample performance and Knightian uncertainty; proposes a first-order attack algorithm and its multistep version, which include FGSM and PGD as special cases.
Results: Numerical experiments on CIFAR-10, CIFAR-100, and ImageNet, using deep neural networks from RobustBench, illustrate the theoretical results.

Deep neural networks are known to be vulnerable to adversarial attacks (AA). For an image recognition task, this means that a small perturbation of the original can result in the image being misclassified. Design of such attacks as well as methods of adversarial training against them are subject of intense research. We re-cast the problem using techniques of Wasserstein distributionally robust optimization (DRO) and obtain novel contributions leveraging recent insights from DRO sensitivity analysis. We consider a set of distributional threat models. Unlike the traditional pointwise attacks, which assume a uniform bound on perturbation of each input data point, distributional threat models allow attackers to perturb inputs in a non-uniform way. We link these more general attacks with questions of out-of-sample performance and Knightian uncertainty. To evaluate the distributional robustness of neural networks, we propose a first-order AA algorithm and its multistep version. Our attack algorithms include Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) as special cases. Furthermore, we provide a new asymptotic estimate of the adversarial accuracy against distributional threat models. The bound is fast to compute and first-order accurate, offering new insights even for the pointwise AA. It also naturally yields out-of-sample performance guarantees. We conduct numerical experiments on CIFAR-10, CIFAR-100, ImageNet datasets using DNNs on RobustBench to illustrate our theoretical results. Our code is available at https://github.com/JanObloj/W-DRO-Adversarial-Methods.
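
To see the contrast between pointwise and distributional threat models, consider the schematic sketch below (ours, not the paper's algorithm; `grad` is a placeholder classifier gradient, and the gradient-norm allocation is merely a first-order heuristic): the pointwise attack gives every input the same budget, while the distributional attack spends one shared budget non-uniformly across inputs.

```python
import numpy as np

# Schematic contrast between a uniform pointwise budget (FGSM-style) and a
# shared budget spent non-uniformly across inputs (distributional-style).

def grad(X):
    return X                                   # placeholder for the loss gradient

def pointwise_fgsm(X, eps):
    return X + eps * np.sign(grad(X))          # identical budget eps per point

def distributional_step(X, eps):
    g = grad(X)
    norms = np.linalg.norm(g.reshape(len(X), -1), axis=1)
    budgets = eps * len(X) * norms / norms.sum()   # non-uniform allocation
    return X + budgets[:, None] * np.sign(g)

X = np.random.default_rng(0).normal(size=(5, 3))
print(np.abs(pointwise_fgsm(X, 0.1) - X).max(axis=1))      # all 0.1
print(np.abs(distributional_step(X, 0.1) - X).max(axis=1))  # varies per point
```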

FairLISA: Fair User Modeling with Limited Sensitive Attributes Information
Zheng Zhang Qi Liu Hao Jiang Fei Wang Yan Zhuang Le Wu Weibo Gao Enhong Chen



Research question: Traditional user models may unconsciously capture biases related to sensitive attributes (e.g., gender) from behavioral data, leading to unfairness and discrimination.
Motivation: To address this, prior work explicitly decorrelates user-modeling results from sensitive attributes to improve fairness, but most such methods require complete sensitive-attribute labels, which are hard to obtain in practice.
Method: Proposes FairLISA, a new framework that efficiently utilizes data with both known and unknown sensitive attributes for fair model training. First, a novel theoretical perspective relates data with known and unknown sensitive attributes to the fairness objective; based on this, a general adversarial framework is provided that effectively leverages the whole user dataset for fair user modeling.
Results: Experiments on representative user-modeling tasks, including recommender systems and cognitive diagnosis, show that FairLISA effectively improves fairness while retaining high accuracy under different ratios of missing sensitive attributes.

User modeling techniques profile users' latent characteristics (e.g., preference) from their observed behaviors, and play a crucial role in decision-making. Unfortunately, traditional user models may unconsciously capture biases related to sensitive attributes (e.g., gender) from behavior data, even when this sensitive information is not explicitly provided. This can lead to unfair issues and discrimination against certain groups based on these sensitive attributes. Recent studies have been proposed to improve fairness by explicitly decorrelating user modeling results and sensitive attributes. However, most existing approaches assume that fully sensitive attribute labels are available in the training set, which is unrealistic due to collection limitations like privacy concerns, and hence bear the limitation of performance. In this paper, we focus on a practical situation with limited sensitive data and propose a novel FairLISA framework, which can efficiently utilize data with known and unknown sensitive attributes to facilitate fair model training. We first propose a novel theoretical perspective to build the relationship between data with both known and unknown sensitive attributes with the fairness objective. Then, based on this, we provide a general adversarial framework to effectively leverage the whole user data for fair user modeling. We conduct experiments on representative user modeling tasks including recommender system and cognitive diagnosis. The results demonstrate that our FairLISA can effectively improve fairness while retaining high accuracy in scenarios with different ratios of missing sensitive attributes.

Towards Stable Backdoor Purification through Feature Shift Tuning
Rui Min Zeyu Qin Li Shen Minhao Cheng



Research question: Deep neural networks are vulnerable to backdoor attacks, in which attackers maliciously manipulate model behavior by tampering with a small set of training samples.
Motivation: Although a line of defense methods has been proposed, they either require complicated modifications to the training process or rely heavily on specific model architectures, making them hard to deploy in real-world applications.
Method: Starts from fine-tuning, the most common and easily deployed backdoor defense, and evaluates it comprehensively against diverse attack scenarios; then introduces Feature Shift Tuning (FST), which encourages feature shifts by actively deviating the classifier weights from the originally compromised weights.
Results: The evaluation shows that, in contrast to the promising defensive results at high poisoning rates, vanilla fine-tuning completely fails in low-poisoning-rate scenarios; the analysis indicates that at low poisoning rates the entanglement between backdoor and clean features undermines tuning-based defenses, so the two must be disentangled to improve backdoor purification, which FST achieves with consistently stable performance across attack settings.

It has been widely observed that deep neural networks (DNN) are vulnerable to backdoor attacks where attackers could manipulate the model behavior maliciously by tampering with a small set of training samples. Although a line of defense methods is proposed to mitigate this threat, they either require complicated modifications to the training process or heavily rely on the specific model architecture, which makes them hard to deploy into real-world applications. Therefore, in this paper, we instead start with fine-tuning, one of the most common and easy-to-deploy backdoor defenses, through comprehensive evaluations against diverse attack scenarios. Observations made through initial experiments show that in contrast to the promising defensive results on high poisoning rates, vanilla tuning methods completely fail at low poisoning rate scenarios. Our analysis shows that with the low poisoning rate, the entanglement between backdoor and clean features undermines the effect of tuning-based defenses. Therefore, it is necessary to disentangle the backdoor and clean features in order to improve backdoor purification. To address this, we introduce Feature Shift Tuning (FST), a method for tuning-based backdoor purification. Specifically, FST encourages feature shifts by actively deviating the classifier weights from the originally compromised weights. Extensive experiments demonstrate that our FST provides consistently stable performance under different attack settings. Without complex parameter adjustments, FST also achieves much lower tuning costs, only $10$ epochs. Our codes are available at https://github.com/AISafety-HKUST/stable_backdoor_purification.
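
Our reading of the weight-deviation idea, as a toy logistic-regression sketch (all names and constants are ours, not the paper's): fine-tune on clean data while adding a term that rewards distance from the compromised weights `w0`, plus a mild norm penalty so the weights stay bounded.

```python
import numpy as np

# Toy sketch of Feature-Shift-Tuning-style fine-tuning: minimize clean-data
# cross-entropy MINUS alpha * ||w - w0||^2 (i.e., push w away from the
# compromised weights w0), with a small norm penalty for stability.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fst_update(w, w0, X, y, lr=0.1, alpha=0.05, beta=0.01):
    p = sigmoid(X @ w)
    grad_ce = X.T @ (p - y) / len(y)    # clean-data cross-entropy gradient
    grad_shift = -2 * alpha * (w - w0)  # gradient of -alpha*||w - w0||^2
    grad_norm = 2 * beta * w            # keeps ||w|| from growing unboundedly
    return w - lr * (grad_ce + grad_shift + grad_norm)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))
y = (X[:, 0] > 0).astype(float)         # clean task labels
w0 = rng.normal(size=8)                 # "compromised" initial weights
w = w0.copy()
for _ in range(200):
    w = fst_update(w, w0, X, y)
print(np.linalg.norm(w - w0))           # weights have actively shifted away
```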

Balancing Risk and Reward: A Batched-Bandit Strategy for Automated Phased Release
Yufan Li Jialiang Mao Iavor Bojinov



Research question: How phased releases can balance the risk of a new product or update against the need to iterate and learn quickly.
Motivation: In the technology industry, phased release is a common strategy for gradually rolling out new products or updates through a sequence of A/B tests; the proportion of units assigned to the new release must be chosen in a principled way that balances the risk of an adverse effect against the need to learn from the experiment rapidly.
Method: Formalizes this problem and proposes an algorithm that automatically determines the release percentage at each stage of the schedule, balancing risk control with ramp-up speed. The framework models the challenge as a constrained batched bandit problem that ensures the pre-specified experimental budget is not depleted with high probability, using an adaptive Bayesian approach in which the maximal number of treated units is determined by the posterior distribution.
Results: The approach solves the ramp sizes analytically by inverting probability bounds, eliminating the need for challenging rare-event Monte Carlo simulation; it only requires computing means and variances of outcome subsets, making it highly efficient and parallelizable.

Phased releases are a common strategy in the technology industry for gradually releasing new products or updates through a sequence of A/B tests in which the number of treated units gradually grows until full deployment or deprecation. Performing phased releases in a principled way requires selecting the proportion of units assigned to the new release in a way that balances the risk of an adverse effect with the need to iterate and learn from the experiment rapidly. In this paper, we formalize this problem and propose an algorithm that automatically determines the release percentage at each stage in the schedule, balancing the need to control risk while maximizing ramp-up speed. Our framework models the challenge as a constrained batched bandit problem that ensures that our pre-specified experimental budget is not depleted with high probability. Our proposed algorithm leverages an adaptive Bayesian approach in which the maximal number of units assigned to the treatment is determined by the posterior distribution, ensuring that the probability of depleting the remaining budget is low. Notably, our approach analytically solves the ramp sizes by inverting probability bounds, eliminating the need for challenging rare-event Monte Carlo simulation. It only requires computing means and variances of outcome subsets, making it highly efficient and parallelizable.
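
A simplified sketch of the ramp-size rule as we understand it from the abstract (the names, the Gaussian posterior model, and the numbers are ours): given a posterior over the per-unit cost of the new release, pick the largest batch size whose probability of exhausting the remaining budget stays below a tolerance, by inverting a Gaussian tail bound rather than simulating rare events.

```python
import numpy as np
from scipy.stats import norm

# Given a Gaussian posterior over the per-unit effect of the release, choose
# the largest treatment size n such that P(total harm > remaining budget)
# stays below delta, by inverting the Gaussian tail analytically.

def max_ramp_size(mu, sigma, remaining_budget, delta, n_max):
    # Total harm of treating n units ~ Normal(n * mu, sqrt(n) * sigma).
    for n in range(n_max, 0, -1):
        p_deplete = 1 - norm.cdf(remaining_budget, loc=n * mu,
                                 scale=np.sqrt(n) * sigma)
        if p_deplete <= delta:
            return n          # largest n meeting the budget constraint
    return 0

# Posterior: the release costs ~0.2 budget units per treated user; we accept
# a 5% chance of blowing the remaining budget of 50.
print(max_ramp_size(mu=0.2, sigma=1.0, remaining_budget=50.0,
                    delta=0.05, n_max=500))   # roughly 149 in this toy setup
```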

Gaussian Membership Inference Privacy
Tobias Leemann Martin Pawelczyk Gjergji Kasneci



Research question: Proposes a new privacy notion, $f$-Membership Inference Privacy ($f$-MIP), that explicitly considers the capabilities of realistic adversaries under the membership inference attack threat model.
Motivation: Existing privacy notions often disregard the capabilities of realistic adversaries; by accounting for them, $f$-MIP offers interpretable privacy guarantees and improved utility.
Method: By theoretically analyzing likelihood-ratio-based membership inference attacks on stochastic gradient descent, derives a parametric family of $f$-MIP guarantees called $\mu$-Gaussian Membership Inference Privacy ($\mu$-GMIP), and shows how $f$-MIP can be amplified by adding noise to gradient updates.
Results: Experiments demonstrate the effectiveness of the approach on models trained on vision and tabular datasets.

We propose a novel and practical privacy notion called $f$-Membership Inference Privacy ($f$-MIP), which explicitly considers the capabilities of realistic adversaries under the membership inference attack threat model. Consequently, $f$-MIP offers interpretable privacy guarantees and improved utility (e.g., better classification accuracy). In particular, we derive a parametric family of $f$-MIP guarantees that we refer to as $\mu$-Gaussian Membership Inference Privacy ($\mu$-GMIP) by theoretically analyzing likelihood ratio-based membership inference attacks on stochastic gradient descent (SGD). Our analysis highlights that models trained with standard SGD already offer an elementary level of MIP. Additionally, we show how $f$-MIP can be amplified by adding noise to gradient updates. Our analysis further yields an analytical membership inference attack that offers two distinct advantages over previous approaches. First, unlike existing state-of-the-art attacks that require training hundreds of shadow models, our attack does not require any shadow model. Second, our analytical attack enables straightforward auditing of our privacy notion $f$-MIP. Finally, we quantify how various hyperparameters (e.g., batch size, number of model parameters) and specific data characteristics determine an attacker's ability to accurately infer a point's membership in the training set. We demonstrate the effectiveness of our method on models trained on vision and tabular datasets.
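
A minimal likelihood-ratio membership test of the general kind the abstract analyzes, as a toy of ours: the per-example statistic and both Gaussians are assumed known, which is what makes the attack shadow-model-free in spirit; the resulting TPR/FPR pairs trace out exactly the attacker ROC that an $f$-MIP guarantee constrains.

```python
import numpy as np
from scipy.stats import norm

# Gaussian likelihood-ratio membership test on a per-example statistic
# (e.g., loss): threshold log p(stat | member) - log p(stat | non-member).

def lr_membership_score(stat, mu_in, sigma_in, mu_out, sigma_out):
    return (norm.logpdf(stat, mu_in, sigma_in)
            - norm.logpdf(stat, mu_out, sigma_out))

rng = np.random.default_rng(0)
members = rng.normal(0.3, 0.1, size=1000)       # members tend to have lower loss
nonmembers = rng.normal(0.8, 0.2, size=1000)
scores_m = lr_membership_score(members, 0.3, 0.1, 0.8, 0.2)
scores_n = lr_membership_score(nonmembers, 0.3, 0.1, 0.8, 0.2)
# True/false positive rates at threshold 0: one point on the attacker's ROC.
print((scores_m > 0).mean(), (scores_n > 0).mean())
```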

A Theory of Transfer-Based Black-Box Attacks: Explanation and Implications
Yanbo Chen Weiwei Liu



Research question: Studies transfer-based attacks under a unified theoretical framework and proposes an explanatory model.
Motivation: Existing empirical works only provide ad hoc explanations from specific angles without quantitative analysis, so the theory behind transfer-based attacks remains a mystery.
Method: Proposes an explanatory model, called the manifold attack model, that formalizes popular beliefs and explains existing empirical results.
Results: The model explains why adversarial examples are transferable even when the source model is inaccurate; moreover, it implies that the existence of transferable adversarial examples depends on the "curvature" of the data manifold, which quantitatively explains why the success rates of transfer-based attacks are hard to improve.

Transfer-based attacks are a practical method of black-box adversarial attacks, in which the attacker aims to craft adversarial examples from a source (surrogate) model that is transferable to the target model. A wide range of empirical works has tried to explain the transferability of adversarial examples from different angles. However, these works only provide ad hoc explanations without quantitative analyses. The theory behind transfer-based attacks remains a mystery. This paper studies transfer-based attacks under a unified theoretical framework. We propose an explanatory model, called the manifold attack model, that formalizes popular beliefs and explains the existing empirical results. Our model explains why adversarial examples are transferable even when the source model is inaccurate. Moreover, our model implies that the existence of transferable adversarial examples depends on the “curvature” of the data manifold, which quantitatively explains why the success rates of transfer-based attacks are hard to improve. We also discuss the expressive power and the possible extensions of our model in general applications.

Enhancing Adversarial Robustness via Score-Based Optimization
Boya Zhang Weijian Luo Zhihua Zhang



Research question: How to mitigate adversarial attacks that mislead deep neural network classifiers by introducing slight perturbations, and thereby ensure the safe use of artificial intelligence.
Motivation: Existing diffusion-based defenses rely on sequentially simulating the reversed stochastic differential equations of diffusion models, which is computationally inefficient and yields suboptimal results.
Method: Introduces ScoreOpt, a novel adversarial defense scheme that optimizes adversarial samples at test time, moving them towards the original clean data in directions guided by score-based priors.
Results: Comprehensive experiments on multiple datasets, including CIFAR10, CIFAR100, and ImageNet, show the approach outperforms existing adversarial defenses in both robustness and inference speed.

Adversarial attacks have the potential to mislead deep neural network classifiers by introducing slight perturbations. Developing algorithms that can mitigate the effects of these attacks is crucial for ensuring the safe use of artificial intelligence. Recent studies have suggested that score-based diffusion models are effective in adversarial defenses. However, existing diffusion-based defenses rely on the sequential simulation of the reversed stochastic differential equations of diffusion models, which are computationally inefficient and yield suboptimal results. In this paper, we introduce a novel adversarial defense scheme named ScoreOpt, which optimizes adversarial samples at test-time, towards original clean data in the direction guided by score-based priors. We conduct comprehensive experiments on multiple datasets, including CIFAR10, CIFAR100 and ImageNet. Our experimental results demonstrate that our approach outperforms existing adversarial defenses in terms of both robustness performance and inference speed.
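
An illustrative sketch of test-time purification in this spirit (ours; a closed-form Gaussian score stands in for a learned score network): starting from an adversarial input, ascend the clean-data log-density so the sample drifts back toward the clean manifold before classification.

```python
import numpy as np

# Test-time purification: gradient ascent on the clean-data log-density,
# with the score (gradient of log-density) playing the role of the prior.

def gaussian_score(x, mu, sigma):
    return (mu - x) / sigma**2        # grad_x log N(x; mu, sigma^2 I)

def purify(x_adv, score_fn, steps=50, lr=0.1):
    x = x_adv.copy()
    for _ in range(steps):
        x = x + lr * score_fn(x)      # move toward higher clean-data density
    return x

mu = np.zeros(4)                      # "clean" data concentrated around mu
x_adv = mu + 2.0                      # perturbed input far from the mode
x_clean = purify(x_adv, lambda x: gaussian_score(x, mu, 1.0))
print(np.linalg.norm(x_adv - mu), np.linalg.norm(x_clean - mu))  # 4.0 -> ~0
```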

Fantastic Robustness Measures: The Secrets of Robust Generalization
Hoki Kim Jinseong Park Yujin Choi Jaewook Lee



Research question: Adversarial training has become the de facto standard for improving robustness against adversarial examples, yet robust overfitting remains a major challenge, leaving a large robustness gap between the training and test sets.
Motivation: To understand and improve robust generalization, various measures have been developed, including margin-, smoothness-, and flatness-based measures; this study uses a large-scale analysis to verify whether the relationship between these measures and robust generalization remains valid in diverse settings.
Method: Compares over 1,300 models trained on CIFAR-10 under the $L_\infty$ norm, and further evaluates more than 100 models from RobustBench across CIFAR-10, CIFAR-100, and ImageNet, to empirically examine whether these measures capture the robust generalization gap.
Results: The experiments demonstrate when and how these measures effectively capture the robust generalization gap, helping the community better understand adversarial robustness and motivating the development of more robust defenses against adversarial attacks.

Adversarial training has become the de-facto standard method for improving the robustness of models against adversarial examples. However, robust overfitting remains a significant challenge, leading to a large gap between the robustness on the training and test datasets. To understand and improve robust generalization, various measures have been developed, including margin, smoothness, and flatness-based measures. In this study, we present a large-scale analysis of robust generalization to empirically verify whether the relationship between these measures and robust generalization remains valid in diverse settings. We demonstrate when and how these measures effectively capture the robust generalization gap by comparing over 1,300 models trained on CIFAR-10 under the $L_\infty$ norm and further validate our findings through an evaluation of more than 100 models from RobustBench across CIFAR-10, CIFAR-100, and ImageNet. We hope this work can help the community better understand adversarial robustness and motivate the development of more robust defense methods against adversarial attacks.

One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning
Marc Rigter Bruno Lacerda Nick Hawes



Research question: How to make offline reinforcement learning (RL) both risk-averse and robust to distributional shift in safety-critical domains.
Motivation: In safety-critical domains, online exploration is infeasible and decision-making must account for the risk of catastrophic outcomes, i.e., it must be risk-averse; at the same time, offline RL must avoid distributional shift.
Method: Proposes a model-based approach that uses an ensemble of models to estimate epistemic in addition to aleatoric uncertainty, and trains a risk-averse policy that avoids high-uncertainty actions.
Results: Experiments show that the method achieves strong performance on deterministic benchmarks and outperforms existing approaches for risk-sensitive objectives in stochastic domains, achieving risk-aversion while avoiding distributional shift.

Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is not feasible. In such domains, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be *risk-averse*. An additional challenge of offline RL is avoiding *distributional shift*, i.e. ensuring that state-action pairs visited by the policy remain near those in the dataset. Previous offline RL algorithms that consider risk combine offline RL techniques (to avoid distributional shift), with risk-sensitive RL algorithms (to achieve risk-aversion). In this work, we propose risk-aversion as a mechanism to jointly address *both* of these issues. We propose a model-based approach, and use an ensemble of models to estimate epistemic uncertainty, in addition to aleatoric uncertainty. We train a policy that is risk-averse, and avoids high uncertainty actions. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that are risky due to environment stochasticity. Thus, by considering epistemic uncertainty via a model ensemble and introducing risk-aversion, our algorithm (1R2R) avoids distributional shift in addition to achieving risk-aversion to aleatoric risk. Our experiments show that 1R2R achieves strong performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
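
A toy sketch of the risk-aversion mechanism (ours; 1R2R's actual machinery differs, and "mean minus a multiple of the ensemble spread" is only one simple risk-averse statistic): actions poorly covered by the dataset make the model ensemble disagree, so a risk-averse score steers the policy away from them.

```python
import numpy as np

# Score actions by a risk-averse statistic over an ensemble of model
# predictions: high ensemble spread (epistemic + aleatoric proxy) is penalized.

def risk_averse_value(returns_per_model, beta=1.0):
    # returns_per_model: (n_models, n_actions) predicted returns.
    mean = returns_per_model.mean(axis=0)
    spread = returns_per_model.std(axis=0)
    return mean - beta * spread

rng = np.random.default_rng(0)
# Action 0 is well covered by the data (models agree); action 1 is
# off-distribution, so the models disagree despite a higher mean.
returns = np.stack([rng.normal(1.0, 0.05, size=5),     # action 0
                    rng.normal(1.3, 0.80, size=5)]).T  # action 1, high spread
print(risk_averse_value(returns))   # risk-aversion prefers action 0
```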

The Utility of “Even if” Semifactual Explanation to Optimise Positive Outcomes
Eoin M. Kenny Weipeng Fuzzy Huang



Research question: How Explainable AI (XAI) can be used to optimize the positive outcomes of automated systems, rather than only turning negative outcomes into positive ones.
Motivation: Current XAI focuses almost exclusively on mutating negative outcomes into positive ones by crossing a decision boundary using counterfactuals; this work instead focuses on positive outcomes and takes the novel step of using XAI to optimize them.
Method: Introduces "even if..." reasoning through semifactual explanations, instantiated via the concept of Gain (how much a user stands to benefit from the explanation), and considers the first causal formalization of semifactuals.
Results: Tests on benchmark datasets show the algorithms maximize gain better than prior work and that causality is important in the process; most importantly, a user study supports the main hypothesis that people find semifactual explanations more useful than counterfactuals when they receive the positive outcome of a loan acceptance.

When users receive either a positive or negative outcome from an automated system, Explainable AI (XAI) has almost exclusively focused on how to mutate negative outcomes into positive ones by crossing a decision boundary using counterfactuals (e.g., *"If you earn 2k more, we will accept your loan application"*). Here, we instead focus on positive outcomes, and take the novel step of using XAI to optimise them (e.g., *"Even if you wish to half your down-payment, we will still accept your loan application"*). Explanations such as these that employ "even if..." reasoning, and do not cross a decision boundary, are known as semifactuals. To instantiate semifactuals in this context, we introduce the concept of *Gain* (i.e., how much a user stands to benefit from the explanation), and consider the first causal formalisation of semifactuals. Tests on benchmark datasets show our algorithms are better at maximising gain compared to prior work, and that causality is important in the process. Most importantly however, a user study supports our main hypothesis by showing people find semifactual explanations more useful than counterfactuals when they receive the positive outcome of a loan acceptance.

Beyond Black-Box Advice: Learning-Augmented Algorithms for MDPs with Q-Value Predictions
Tongxin Li Yiheng Lin Shaolei Ren Adam Wierman



Research question: Studies the tradeoff between consistency and robustness in a single-trajectory time-varying Markov decision process (MDP) with untrusted machine-learned advice.
Motivation: Unlike the typical approach of treating advice as coming from a black-box source, considers a setting where additional information about how the advice is generated is available.
Method: Proves a first-of-its-kind consistency-robustness tradeoff given Q-value advice, under a general MDP model that includes both continuous and discrete state/action spaces.
Results: The results highlight that utilizing Q-value advice enables dynamic pursuit of the better of the machine-learned advice and a robust baseline, yielding near-optimal performance guarantees that provably improve on what can be obtained with black-box advice alone.

We study the tradeoff between consistency and robustness in the context of a single-trajectory time-varying Markov Decision Process (MDP) with untrusted machine-learned advice. Our work departs from the typical approach of treating advice as coming from black-box sources by instead considering a setting where additional information about how the advice is generated is available. We prove a first-of-its-kind consistency and robustness tradeoff given Q-value advice under a general MDP model that includes both continuous and discrete state/action spaces. Our results highlight that utilizing Q-value advice enables dynamic pursuit of the better of machine-learned advice and a robust baseline, thus resulting in near-optimal performance guarantees, which provably improve what can be obtained solely with black-box advice.

Causal Context Connects Counterfactual Fairness to Robust Prediction and Group Fairness
Jacy Reese Anthis Victor Veitch



Research question: How "causal context" can bridge the gaps between counterfactual fairness, robust prediction, and group fairness.
Motivation: Counterfactual fairness is an intuitive criterion, but counterfactuals cannot be directly observed in real-world data, which limits its application; group fairness metrics, while less intuitive, are more readily observed.
Method: Uses causal context to connect counterfactual fairness, robust prediction, and group fairness. First, motivates counterfactual fairness by showing that, under plausible conditions, the counterfactually fair predictor is in fact accuracy-optimal in an unbiased target distribution. Second, develops a correspondence between the causal graph of the data-generating process and which, if any, group fairness metrics are equivalent to counterfactual fairness. Third, shows that in three common fairness contexts (measurement error, selection on label, and selection on predictors) counterfactual fairness is equivalent to demographic parity, equalized odds, and calibration, respectively.
Results: In some settings, counterfactual fairness can be tested by measuring relatively simple group fairness metrics.

Counterfactual fairness requires that a person would have been classified in the same way by an AI or other algorithmic system if they had a different protected class, such as a different race or gender. This is an intuitive standard, as reflected in the U.S. legal system, but its use is limited because counterfactuals cannot be directly observed in real-world data. On the other hand, group fairness metrics (e.g., demographic parity or equalized odds) are less intuitive but more readily observed. In this paper, we use \textit{causal context} to bridge the gaps between counterfactual fairness, robust prediction, and group fairness. First, we motivate counterfactual fairness by showing that there is not necessarily a fundamental trade-off between fairness and accuracy because, under plausible conditions, the counterfactually fair predictor is in fact accuracy-optimal in an unbiased target distribution. Second, we develop a correspondence between the causal graph of the data-generating process and which, if any, group fairness metrics are equivalent to counterfactual fairness. Third, we show that in three common fairness contexts—measurement error, selection on label, and selection on predictors—counterfactual fairness is equivalent to demographic parity, equalized odds, and calibration, respectively. Counterfactual fairness can sometimes be tested by measuring relatively simple group fairness metrics.

Triple Eagle: Simple, Fast and Practical Budget-Feasible Mechanisms
Kai Han You Wu He Huang Shuang Cui



Research question: Revisits the classical problem of designing budget-feasible mechanisms (BFMs) for submodular valuation functions.
Motivation: The problem has been extensively studied since Singer's seminal paper [FOCS'10], owing to its wide applications in crowdsourcing and social-media marketing.
Method: Proposes TripleEagle, a new algorithmic framework for designing BFMs, and based on it presents several simple yet effective BFMs with better approximation ratios than the state of the art. Moreover, these are the first BFMs in the literature to achieve linear complexity while ensuring obvious strategyproofness, making them more practical than previous BFMs.
Results: Extensive experiments evaluating the empirical performance of the BFMs strongly demonstrate the efficiency and effectiveness of the approach.

We revisit the classical problem of designing Budget-Feasible Mechanisms (BFMs) for submodular valuation functions, which has been extensively studied since the seminal paper of Singer [FOCS'10] due to its wide applications in crowdsourcing and social marketing. We propose TripleEagle, a novel algorithmic framework for designing BFMs, based on which we present several simple yet effective BFMs that achieve better approximation ratios than the state-of-the-art work. Moreover, our BFMs are the first in the literature to achieve linear complexities while ensuring obvious strategyproofness, making them more practical than the previous BFMs. We conduct extensive experiments to evaluate the empirical performance of our BFMs, and the experimental results strongly demonstrate the efficiency and effectiveness of our approach.

REFINE: A Fine-Grained Medication Recommendation System Using Deep Learning and Personalized Drug Interaction Modeling
Suman Bhoi Mong-Li Lee Wynne Hsu Ngiap Chuan Tan



Research question: Existing medication recommendation systems only offer class-level medications and treat all drug-drug interactions as equally severe, limiting their ability to provide personalized and safe recommendations tailored to individual needs.
Motivation: Patients with co-morbidities often require multiple medications to manage their conditions, yet current systems cannot tailor recommendations to such individual needs.
Method: Introduces REFINE, a deep-learning-based fine-grained medication recommendation system designed to improve treatment outcomes and minimize adverse drug interactions. To better characterize patients' health conditions, it models trends in medication dosage titrations and lab-test responses and adapts the vision transformer to obtain effective patient representations; it also models drug-interaction severity levels as weighted graphs to learn safe drug combinations, and designs a balanced loss function that avoids overly conservative recommendations which would miss medications needed for certain conditions.
Results: Extensive experiments on two real-world datasets show that REFINE outperforms state-of-the-art techniques.

Patients with co-morbidities often require multiple medications to manage their conditions. However, existing medication recommendation systems only offer class-level medications and regard all interactions among drugs to have the same level of severity. This limits their ability to provide personalized and safe recommendations tailored to individual needs. In this work, we introduce a deep learning-based fine-grained medication recommendation system called REFINE, which is designed to improve treatment outcomes and minimize adverse drug interactions. In order to better characterize patients’ health conditions, we model the trend in medication dosage titrations and lab test responses, and adapt the vision transformer to obtain effective patient representations. We also model drug interaction severity levels as weighted graphs to learn safe drug combinations and design a balanced loss function to avoid overly conservative recommendations and miss medications that might be needed for certain conditions. Extensive experiments on two real-world datasets show that REFINE outperforms state-of-the-art techniques.

Calibration by Distribution Matching: Trainable Kernel Calibration Metrics
Charles Thomas Marx Sofian Zalouk Stefano Ermon



Research question: How to make probabilistic forecasts meaningfully capture uncertainty by requiring that predicted probabilities align with empirical frequencies.
Motivation: Many existing calibration methods are specialized for post-hoc recalibration, which can worsen the sharpness of forecasts.
Method: Introduces kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression; these metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
Results: Empirical evaluation shows that employing these metrics as regularizers enhances calibration, sharpness, and decision-making across a range of regression and classification tasks, outperforming methods that rely solely on post-hoc recalibration.

Calibration ensures that probabilistic forecasts meaningfully capture uncertainty by requiring that predicted probabilities align with empirical frequencies. However, many existing calibration methods are specialized for post-hoc recalibration, which can worsen the sharpness of forecasts. Drawing on the insight that calibration can be viewed as a distribution matching task, we introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression. These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization. Furthermore, we provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no regret decisions. Our empirical evaluation demonstrates that employing these metrics as regularizers enhances calibration, sharpness, and decision-making across a range of regression and classification tasks, outperforming methods relying solely on post-hoc recalibration.
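
A minimal numpy sketch of a kernel calibration error in this flavour (ours, not the authors' exact metric): treat calibration as matching the distribution of (prediction, outcome) pairs via an RBF kernel over predicted probabilities. The sample estimate is differentiable in the predictions, which is what lets it serve as a training regularizer.

```python
import numpy as np

# Kernel calibration error for binary outcomes: a biased sample estimate of
# E[(y - p) k(p, p') (y' - p')], which is ~0 exactly when E[y | p] = p.

def rbf(a, b, gamma=10.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def kernel_calibration_error(p, y):
    r = y - p                              # calibration residuals
    return r @ rbf(p, p) @ r / len(p) ** 2

rng = np.random.default_rng(0)
p_cal = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < p_cal).astype(float)  # outcomes drawn from p
p_over = np.clip(1.5 * p_cal - 0.25, 0.0, 1.0)      # distorted, overconfident
print(kernel_calibration_error(p_cal, y))    # ~0: calibrated
print(kernel_calibration_error(p_over, y))   # clearly positive: miscalibrated
```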

Static and Sequential Malicious Attacks in the Context of Selective Forgetting
CHENXU ZHAO Wei Qian Zhitao Ying Mengdi Huai



Research question: How to address the security vulnerabilities of selective forgetting (i.e., machine unlearning) systems to malicious data-update requests.
Motivation: Despite the notable success of selective forgetting in removing the influence of requested data from well-trained models, little attention has been paid to its security vulnerabilities concerning malicious data-update requests.
Method: Proposes a new class of malicious selective-forgetting attacks covering both a static scenario, in which the adversary provides all malicious data-update requests at once, and a sequential setting, in which requests arrive one by one; for the latter, designs a novel sequential-forgetting attack framework formulated as a stochastic optimal-control problem, along with optimization algorithms that find effective malicious data-update requests.
Results: Theoretical analyses and extensive experiments validate the effectiveness of the proposed selective-forgetting attacks.

With the growing demand for the right to be forgotten, there is an increasing need for machine learning models to forget sensitive data and its impact. To address this, the paradigm of selective forgetting (a.k.a machine unlearning) has been extensively studied, which aims to remove the impact of requested data from a well-trained model without retraining from scratch. Despite its significant success, limited attention has been given to the security vulnerabilities of the unlearning system concerning malicious data update requests. Motivated by this, in this paper, we explore the possibility and feasibility of malicious data update requests during the unlearning process. Specifically, we first propose a new class of malicious selective forgetting attacks, which involves a static scenario where all the malicious data update requests are provided by the adversary at once. Additionally, considering the sequential setting where the data update requests arrive sequentially, we also design a novel framework for sequential forgetting attacks, which is formulated as a stochastic optimal control problem. We also propose novel optimization algorithms that can find the effective malicious data update requests. We perform theoretical analyses for the proposed selective forgetting attacks, and extensive experimental results validate the effectiveness of our proposed selective forgetting attacks. The source code is available in the supplementary material.

Adversarial Robustness through Random Weight Sampling
Yanxiang Ma Minjing Dong Chang Xu



Research question: Deep neural networks are vulnerable in a variety of tasks; how can their adversarial robustness be improved?
Motivation: Current randomized defenses destabilize adversarial attacks by introducing different types of perturbations, but their defensive performance is quite sensitive to the randomness parameters, which are usually tuned manually without further analysis.
Method: Proposes incorporating random weights into the optimization to fully exploit the potential of randomized defense. Based on a theoretical analysis of the connections between randomness parameters, gradient similarity, and natural performance, suggests imposing theoretically guided constraints on the random weights during optimization.
Results: The resulting Constrained Trainable Random Weight (CTRW) model, evaluated on several datasets and benchmark convolutional neural networks, achieves robust accuracy approximately 16%-17% higher than the baseline under PGD-20 and 22%-25% higher under AutoAttack.

Deep neural networks have been found to be vulnerable in a variety of tasks. Adversarial attacks can manipulate network outputs, resulting in incorrect predictions. Adversarial defense methods aim to improve the adversarial robustness of networks by countering potential attacks. In addition to traditional defense approaches, randomized defense mechanisms have recently received increasing attention from researchers. These methods introduce different types of perturbations during the inference phase to destabilize adversarial attacks. Although promising empirical results have been demonstrated by these approaches, the defense performance is quite sensitive to the randomness parameters, which are always manually tuned without further analysis. On the contrary, we propose incorporating random weights into the optimization to fully exploit the potential of randomized defense. To perform better optimization of randomness parameters, we conduct a theoretical analysis of the connections between randomness parameters and gradient similarity as well as natural performance. From these two aspects, we suggest imposing theoretically-guided constraints on random weights during optimizations, as these weights play a critical role in balancing natural performance and adversarial robustness. We derive both the upper and lower bounds of random weight parameters by considering prediction bias and gradient similarity. In this study, we introduce the Constrained Trainable Random Weight (CTRW), which adds random weight parameters to the optimization and includes a constraint guided by the upper and lower bounds to achieve better trade-offs between natural and robust accuracy. We evaluate the effectiveness of CTRW on several datasets and benchmark convolutional neural networks. Our results indicate that our model achieves a robust accuracy approximately 16% to 17% higher than the baseline model under PGD-20 and 22% to 25% higher on Auto Attack.

Incentives in Private Collaborative Machine Learning
Rachael Hwee Ling Sim Yehong Zhang Trong Nghia Hoang Xinyi Xu Bryan Kian Hsiang Low Patrick Jaillet



Research question: How to incentivize parties to participate in collaborative machine learning while accounting for the privacy risks of sharing data or model parameters.
Motivation: Existing data-valuation methods fairly value and reward each party based on shared data or model parameters, but neglect the privacy risks involved.
Method: Introduces differential privacy (DP) as an incentive: each party selects its required DP guarantee and perturbs its sufficient statistic accordingly, and the mediator values the perturbed statistic by the Bayesian surprise it elicits about the model parameters, rewarding each party with different posterior samples of the model parameters.
Results: Because the valuation function enforces a privacy-valuation trade-off, parties are deterred from selecting excessive DP guarantees that would reduce the utility of the grand coalition's model; the rewards still satisfy existing incentives such as fairness while additionally preserving DP and high similarity to the grand coalition's posterior, and the approach is shown to be effective and practical on synthetic and real-world datasets.

Collaborative machine learning involves training models on data from multiple parties but must incentivize their participation. Existing data valuation methods fairly value and reward each party based on shared data or model parameters but neglect the privacy risks involved. To address this, we introduce _differential privacy_ (DP) as an incentive. Each party can select its required DP guarantee and perturb its _sufficient statistic_ (SS) accordingly. The mediator values the perturbed SS by the Bayesian surprise it elicits about the model parameters. As our valuation function enforces a _privacy-valuation trade-off_, parties are deterred from selecting excessive DP guarantees that reduce the utility of the grand coalition's model. Finally, the mediator rewards each party with different posterior samples of the model parameters. Such rewards still satisfy existing incentives like fairness but additionally preserve DP and a high similarity to the grand coalition's posterior. We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets.

Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes
Connor Toups Rishi Bommasani Kathleen Creel Sarah H Bana Dan Jurafsky Percy Liang



Research question: Study the societal impact of deployed machine learning models through ecosystem-level analysis of the contexts into which they are deployed.
Motivation: Traditional machine learning research focuses on the model level, e.g., accuracy and robustness; in practice, however, a model's societal impact partially depends on the environment in which it is deployed, which ecosystem-level analysis is introduced to capture.
Method: Consider the collection of models deployed in a given context rather than a single model. For example, ecosystem-level analysis in hiring recognizes that a job candidate's outcome is determined not by a single hiring algorithm or firm but by the collective decisions of all the firms to which the candidate applied.
Results: Across three modalities (text, images, speech) and 11 datasets, deployed machine learning is found to be prone to systemic failure: some users are misclassified by every available model. Even when individual models improve over time, these improvements rarely reduce the prevalence of systemic failure; their benefits predominantly accrue to users already correctly classified by other models.

Machine learning is traditionally studied at the model level: researchers measure and improve the accuracy, robustness, bias, efficiency, and other dimensions of specific models. In practice, however, the societal impact of any machine learning model is partially determined by the context into which it is deployed. To capture this, we introduce *ecosystem-level analysis:* rather than analyzing a single model, we consider the collection of models that are deployed in a given context. For example, ecosystem-level analysis in hiring recognizes that a job candidate’s outcomes are determined not only by a single hiring algorithm or firm but instead by the collective decisions of all the firms to which the candidate applied. Across three modalities (text, images, speech) and 11 datasets, we establish a clear trend: deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available. Even when individual models improve at the population level over time, we find these improvements rarely reduce the prevalence of systemic failure. Instead, the benefits of these improvements predominantly accrue to individuals who are already correctly classified by other models. In light of these trends, we analyze medical imaging for dermatology, a setting where the costs of systemic failure are especially high. While traditional analyses reveal that both models and humans exhibit racial performance disparities, ecosystem-level analysis reveals new forms of racial disparity in model predictions that do not present in human predictions. These examples demonstrate that ecosystem-level analysis has unique strengths in characterizing the societal impact of machine learning.
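
The systemic-failure metric described above is simple to compute once per-model correctness is tabulated: a user is a systemic failure exactly when no deployed model classifies them correctly. A minimal sketch (function name and toy data are illustrative):

```python
import numpy as np

def systemic_failure_rate(correct: np.ndarray) -> float:
    """correct: boolean array of shape (n_models, n_users), where
    correct[m, u] indicates model m classified user u correctly.
    A user is a systemic failure if *no* deployed model gets them right."""
    failed_everywhere = ~correct.any(axis=0)   # misclassified by all models
    return float(failed_everywhere.mean())

# Toy ecosystem: three deployed models, five users.
rng = np.random.default_rng(0)
correct = rng.random((3, 5)) > 0.3             # ~70% per-model accuracy
print(f"systemic failure rate: {systemic_failure_rate(correct):.2f}")
```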

Adversarial Resilience in Sequential Prediction via Abstention
Surbhi Goel Steve Hanneke Shay Moran Abhishek Shetty



Research question: How to perform sequential prediction in the stochastic setting when an adversary is allowed to inject clean-label adversarial (or out-of-distribution) examples.
Motivation: Algorithms designed for purely stochastic data tend to fail in the presence of adversarial examples, producing erroneous predictions that are unacceptable in high-stakes applications; on the other hand, assuming fully adversarial data yields very pessimistic bounds that are often vacuous in practice.
Method: Propose a new model of sequential prediction in which the learner may abstain, at no cost, from predicting on adversarial examples, and is thus required to predict only with certainty. Assuming access to the marginal distribution of the non-adversarial examples, design a learner whose error scales with the VC dimension of the hypothesis class (mirroring the stochastic setting) rather than the Littlestone dimension of the fully adversarial setting; additionally, design learners for VC dimension-1 classes and axis-aligned rectangles that work even without access to the marginal distribution. The key technical contribution is a novel measure of uncertainty for learning VC classes, which may be of independent interest.
Results: The proposed learners escape the pessimistic guarantees of the fully adversarial setting, achieving mistake bounds that scale with the VC dimension instead of the Littlestone dimension.

We study the problem of sequential prediction in the stochastic setting with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples. Algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions. This is undesirable in many high-stakes applications such as medical recommendations, where abstaining from predictions on adversarial examples is preferable to misclassification. On the other hand, assuming fully adversarial data leads to very pessimistic bounds that are often vacuous in practice. To move away from these pessimistic guarantees, we propose a new model of sequential prediction that sits between the purely stochastic and fully adversarial settings by allowing the learner to abstain from making a prediction at no cost on adversarial examples, thereby asking the learner to make predictions with certainty. Assuming access to the marginal distribution on the non-adversarial examples, we design a learner whose error scales with the VC dimension (mirroring the stochastic setting) of the hypothesis class, as opposed to the Littlestone dimension which characterizes the fully adversarial setting. Furthermore, we design learners for VC dimension~1 classes and the class of axis-aligned rectangles, which work even in the absence of access to the marginal distribution. Our key technical contribution is a novel measure for quantifying uncertainty for learning VC classes, which may be of independent interest.

Learning with Explanation Constraints
Rattana Pukdee Dylan Sam J Zico Kolter Nina Balcan Pradeep Kumar Ravikumar



Research question: How can explanation constraints improve the learning of deep models?
Motivation: Large deep learning models are hard to interpret, yet we may have a priori explanations of how models should behave.
Method: Propose a learning-theoretic framework that analyzes, from the perspective of explanation constraints, how such explanations can improve learning; also provide an algorithmic solution via a variational approximation that achieves better performance and satisfies the constraints more frequently than simpler augmented Lagrangian approaches.
Results: The benefits of the approach are demonstrated over a large array of synthetic and real-world experiments.

As larger deep learning models are hard to interpret, there has been a recent focus on generating explanations of these black-box models. In contrast, we may have apriori explanations of how models should behave. In this paper, we formalize this notion as learning from \emph{explanation constraints} and provide a learning theoretic framework to analyze how such explanations can improve the learning of our models. One may naturally ask, ``When would these explanations be helpful?" Our first key contribution addresses this question via a class of models that satisfies these explanation constraints in expectation over new data. We provide a characterization of the benefits of these models (in terms of the reduction of their Rademacher complexities) for a canonical class of explanations given by gradient information in the settings of both linear models and two layer neural networks. In addition, we provide an algorithmic solution for our framework, via a variational approximation that achieves better performance and satisfies these constraints more frequently, when compared to simpler augmented Lagrangian methods to incorporate these explanations. We demonstrate the benefits of our approach over a large array of synthetic and real-world experiments.

Breaking the Communication-Privacy-Accuracy Tradeoff with $f$-Differential Privacy
Richeng Jin Zhonggen Su Caijun Zhong Zhaoyang Zhang Tony Quek Huaiyu Dai



Research question: Federated data analytics in which a server coordinates collaborative data analysis across users with privacy concerns and limited communication capability.
Motivation: Commonly adopted compression schemes improve communication efficiency but introduce information loss into local data, and whether such discrete-valued mechanisms provide any privacy protection remains an open problem.
Method: Study the local differential privacy guarantees of discrete-valued mechanisms with finite output space through the lens of $f$-differential privacy (DP); derive tight $f$-DP guarantees in closed form for a variety of discrete-valued mechanisms, including the binomial noise and binomial mechanisms proposed for privacy preservation and the sign-based methods proposed for data compression.
Results: Privacy amplification by sparsification is further investigated and a ternary stochastic compressor is proposed. By leveraging compression for privacy amplification, the dependency of accuracy (in terms of mean square error) on communication cost is removed in the popular use case of distributed mean estimation, breaking the three-way tradeoff between privacy, communication, and accuracy.

We consider a federated data analytics problem in which a server coordinates the collaborative data analysis of multiple users with privacy concerns and limited communication capability. The commonly adopted compression schemes introduce information loss into local data while improving communication efficiency, and it remains an open problem whether such discrete-valued mechanisms provide any privacy protection. In this paper, we study the local differential privacy guarantees of discrete-valued mechanisms with finite output space through the lens of $f$-differential privacy (DP). More specifically, we advance the existing literature by deriving tight $f$-DP guarantees for a variety of discrete-valued mechanisms, including the binomial noise and the binomial mechanisms that are proposed for privacy preservation, and the sign-based methods that are proposed for data compression, in closed-form expressions. We further investigate the amplification in privacy by sparsification and propose a ternary stochastic compressor. By leveraging compression for privacy amplification, we improve the existing methods by removing the dependency of accuracy (in terms of mean square error) on communication cost in the popular use case of distributed mean estimation, therefore breaking the three-way tradeoff between privacy, communication, and accuracy.
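
As a rough illustration of the kind of mechanism studied here, the sketch below implements a generic unbiased ternary stochastic compressor that maps each coordinate of a bounded vector to {-B, 0, +B}. The sampling probabilities below are an assumption chosen to make the estimator unbiased; the paper's compressor and its $f$-DP calibration may differ.

```python
import numpy as np

def ternary_compress(x: np.ndarray, B: float, rng=np.random.default_rng()):
    """Unbiased ternary quantizer for x with |x_i| <= B.
    Each coordinate becomes +/-B with probability |x_i|/B (matching its
    sign) and 0 otherwise, so E[output] = x while each coordinate needs
    only ~log2(3) bits to transmit."""
    assert np.all(np.abs(x) <= B)
    keep = rng.random(x.shape) < np.abs(x) / B
    return np.where(keep, B * np.sign(x), 0.0)

x = np.array([0.2, -0.7, 0.0, 0.9])
samples = np.stack([ternary_compress(x, B=1.0) for _ in range(20000)])
print(samples.mean(axis=0))   # approaches x: the compressor is unbiased
```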

A3FL: Adversarially Adaptive Backdoor Attacks to Federated Learning
Hangfan Zhang Jinyuan Jia Jinghui Chen Lu Lin Dinghao Wu



Research question: Federated learning models are vulnerable to backdoor attacks, but existing backdoor triggers are fixed or optimized only on local data and models, leading to sub-optimal attack effectiveness.
Motivation: Improving the success rate and durability of backdoor attacks requires a trigger that adapts to the global training dynamics.
Method: Propose A3FL, a new backdoor attack that adversarially adapts the backdoor trigger to make it less likely to be removed by the global training dynamics.
Results: Extensive experiments on benchmark datasets against twelve existing defenses demonstrate the strong attack effectiveness of A3FL.

Federated Learning (FL) is a distributed machine learning paradigm that allows multiple clients to train a global model collaboratively without sharing their local training data. Due to its distributed nature, many studies have shown that it is vulnerable to backdoor attacks. However, existing studies usually used a predetermined, fixed backdoor trigger or optimized it based solely on the local data and model without considering the global training dynamics. This leads to sub-optimal and less durable attack effectiveness, i.e., their attack success rate is low when the attack budget is limited and decreases quickly if the attacker can no longer perform attacks. To address these limitations, we propose A3FL, a new backdoor attack which adversarially adapts the backdoor trigger to make it less likely to be removed by the global training dynamics. Our key intuition is that the difference between the global model and the local model in FL makes the locally optimized trigger much less effective when transferred to the global model. We solve this by optimizing the trigger to even survive the worst-case scenario where the global model was trained to directly unlearn the trigger. Extensive experiments on benchmark datasets are conducted against twelve existing defenses to comprehensively evaluate the effectiveness of our A3FL. Our code is available at https://github.com/hfzhang31/A3FL.

When Does Confidence-Based Cascade Deferral Suffice?
Wittawat Jitkrittum Neha Gupta Aditya Krishna Menon Harikrishna Narasimhan Ankit Singh Rawat Sanjiv Kumar



Research question: Understand when confidence-based deferral in cascades may fail and when alternative deferral strategies can perform better.
Motivation: Although confidence-based deferral works remarkably well in practice, it is oblivious to the structure of the cascade, e.g., it does not model the errors of downstream models.
Method: First present a theoretical characterization of the optimal deferral rule, which precisely characterizes the settings where confidence-based deferral may suffer; then study post-hoc deferral mechanisms.
Results: Post-hoc deferral can significantly improve upon confidence-based deferral when (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, or (iii) there is distribution shift between the train and test sets.

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade --- e.g., not modelling the errors of downstream models --- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
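
The confidence-based deferral rule in question is simple to state in code: invoke the next classifier only when the current one's maximum softmax probability falls below a threshold. The sketch below is a minimal illustration; the model and threshold names are made up.

```python
import numpy as np

def cascade_predict(x, models, thresholds):
    """Run a cascade: each model maps an input to a softmax vector.
    Defer to the next model when max-probability confidence falls below
    the stage's threshold; the final model always predicts."""
    for model, tau in zip(models[:-1], thresholds):
        probs = model(x)
        if probs.max() >= tau:           # confident enough: stop early
            return int(probs.argmax())
    return int(models[-1](x).argmax())   # last stage never defers

# Toy two-stage cascade with fixed softmax outputs.
small = lambda x: np.array([0.55, 0.45])   # low confidence -> defer
large = lambda x: np.array([0.10, 0.90])
print(cascade_predict(None, [small, large], thresholds=[0.8]))  # -> 1
```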

On the explainable properties of 1-Lipschitz Neural Networks: An Optimal Transport Perspective
Mathieu Serrurier Franck Mamalet Thomas FEL Louis Béthune Thibaut Boissin



Research question: Address the noisy, low-insight saliency maps produced by conventional neural networks.
Motivation: Input gradients play a pivotal role in many applications, including adversarial attack algorithms for evaluating model robustness, explainable AI techniques for generating saliency maps, and counterfactual explanations; however, saliency maps generated by traditional neural networks are often noisy and provide limited insight.
Method: Learn 1-Lipschitz neural networks with the dual loss of an optimal transport problem; their saliency maps exhibit desirable explainable-AI (XAI) properties: they are highly concentrated on the essential parts of the image with low noise, significantly outperforming state-of-the-art explanation approaches across various models and metrics.
Results: These saliency maps align with human explanations on ImageNet better than those of traditional methods. Moreover, learning with this loss jointly optimizes the classification objective and the alignment of the gradient (i.e., the saliency map) with the transportation-plan direction, and the resulting networks are certifiably robust by design.

Input gradients have a pivotal role in a variety of applications, including adversarial attack algorithms for evaluating model robustness, explainable AI techniques for generating saliency maps, and counterfactual explanations. However, saliency maps generated by traditional neural networks are often noisy and provide limited insights. In this paper, we demonstrate that, on the contrary, the saliency maps of 1-Lipschitz neural networks, learnt with the dual loss of an optimal transportation problem, exhibit desirable XAI properties: They are highly concentrated on the essential parts of the image with low noise, significantly outperforming state-of-the-art explanation approaches across various models and metrics. We also prove that these maps align unprecedentedly well with human explanations on ImageNet. To explain the particularly beneficial properties of the saliency map for such models, we prove that this gradient encodes both the direction of the transportation plan and the direction towards the nearest adversarial attack. Following the gradient down to the decision boundary is no longer considered an adversarial attack, but rather a counterfactual explanation that explicitly transports the input from one class to another. Thus, learning with such a loss jointly optimizes the classification objective and the alignment of the gradient, i.e., the saliency map, with the transportation plan direction. These networks were previously known to be certifiably robust by design, and we demonstrate that they scale well for large problems and models, and are tailored for explainability using a fast and straightforward method.

Black-Box Differential Privacy for Interactive ML
Haim Kaplan Yishay Mansour Shay Moran Kobbi Nissim Uri Stemmer



Research question: Revisit the interactive variant of joint differential privacy recently introduced by Naor et al. and generalize it to handle online processes for which existing privacy definitions seem too restrictive.
Motivation: Under traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], only a double-exponential overhead in the mistake bound is known; the proposed definition instead admits only a polynomial overhead.
Method: In the basic setting of online classification, show that any (possibly non-private) learning rule can be effectively transformed into a private learning rule with only a polynomial overhead in the mistake bound.
Results: The new privacy definition demonstrates a stark advantage over traditional differential privacy: the overhead in the mistake bound is polynomial rather than double exponential.

In this work we revisit an interactive variant of joint differential privacy, recently introduced by Naor et al. [2023], and generalize it towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants) of group privacy, composition, and post processing. In order to demonstrate the advantages of this privacy definition compared to traditional forms of differential privacy, we consider the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], where only a double exponential overhead in the mistake bound is known (via an information theoretic upper bound).

Top-Ambiguity Samples Matter: Understanding Why Deep Ensemble Works in Selective Classification
Qiang Ding Yixuan Cao Ping Luo



Research question: How to improve the reliability of a machine learning model's predictions by rejecting hard inputs.
Motivation: Although the ensemble method is powerful in practice for selective classification, there has been no solid analysis of why it works.
Method: Inspired by the empirical observation that the ensemble's improvement largely comes from top-ambiguity samples where its member models diverge, prove that, under some assumptions, the ensemble has a lower selective risk than any member model for every coverage within a range.
Results: The assumptions and theoretical results are supported by systematic experiments on both computer vision and natural language processing tasks.

Selective classification allows a machine learning model to reject some hard inputs and thus improve the reliability of its predictions. In this area, the ensemble method is powerful in practice, but there has been no solid analysis on why the ensemble method works. Inspired by an interesting empirical result that the improvement of the ensemble largely comes from top-ambiguity samples where its member models diverge, we prove that, based on some assumptions, the ensemble has a lower selective risk than the member model for any coverage within a range. The proof is nontrivial since the selective risk is a non-convex function of the model prediction. The assumptions and the theoretical results are supported by systematic experiments on both computer vision and natural language processing tasks.
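
For context, the two quantities the theorem relates, coverage and selective risk, can be computed as follows; a rejection rule that thresholds confidence is assumed purely for illustration.

```python
import numpy as np

def selective_risk(confidence, errors, coverage):
    """Accept the `coverage` fraction of most-confident samples and
    return the error rate (selective risk) among the accepted ones.
    confidence, errors: arrays of shape (n,); errors[i] in {0, 1}."""
    n_accept = max(1, int(round(coverage * len(confidence))))
    accepted = np.argsort(-confidence)[:n_accept]  # most confident first
    return errors[accepted].mean()

conf = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
err = np.array([0, 0, 1, 0, 1])
print(selective_risk(conf, err, coverage=0.6))  # risk on the top-3 samples
```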

RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion
Zhuoqun Huang Neil G Marchant Keane Lucas Lujo Bauer Olga Ohrimenko Benjamin I. P. Rubinstein



Research question: How to provide certified adversarial robustness for classifiers over discrete or variable-size inputs, such as source code.
Motivation: Existing randomized smoothing methods target classifiers with continuous inputs, such as images; classifiers with discrete or variable-size inputs have received limited attention and require different threat models and smoothing mechanisms.
Method: Propose a randomized smoothing mechanism for discrete sequence classifiers, randomized deletion (RS-Del), which applies random deletion edits that are (perhaps surprisingly) sufficient to certify robustness against adversarial deletion, insertion, and substitution edits.
Results: In a malware detection case study, applying RS-Del to the popular MalConv malware detection model achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.

Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection—a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.
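
A minimal sketch of the smoothing step: each byte of the input is independently deleted with probability p, and the smoothed classifier takes a majority vote of the base classifier over many such perturbed copies. The deletion rate and vote count below are placeholders, and the paper's certification math is omitted.

```python
import random
from collections import Counter

def rs_del_predict(seq: bytes, base_classifier, p_del=0.97, n_samples=100,
                   rng=random.Random(0)):
    """Smoothed prediction via randomized deletion: sample n_samples
    independently-deleted copies of seq and return the majority label."""
    votes = Counter()
    for _ in range(n_samples):
        kept = bytes(b for b in seq if rng.random() >= p_del)
        votes[base_classifier(kept)] += 1
    return votes.most_common(1)[0][0]

# Toy base classifier: label 1 iff the (possibly deleted) input is long.
toy = lambda s: int(len(s) > 2)
print(rs_del_predict(b"\x00" * 500, toy))
```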

Content-based Unrestricted Adversarial Attack
Zhaoyu Chen Bo Li Shuang Wu Kaixun Jiang Shouhong Ding Wenqiang Zhang



Research question: How to generate unrestricted adversarial examples that effectively and photorealistically deceive both human perception and deep neural networks.
Motivation: Current unrestricted adversarial attacks usually sacrifice unrestricted degrees and subjectively select some image content to guarantee the photorealism of the adversarial examples, which limits attack performance.
Method: Propose a novel framework called Content-based Unrestricted Adversarial Attack. Leveraging a low-dimensional manifold that represents natural images, the images are mapped onto the manifold and optimized along its adversarial direction; within this framework, Adversarial Content Attack (ACA) is implemented based on Stable Diffusion to generate highly transferable unrestricted adversarial examples with diverse adversarial content.
Results: Extensive experiments and visualizations demonstrate the efficacy of ACA, which surpasses state-of-the-art attacks by an average of 13.3-50.4% on normally trained models and 16.8-48.0% on defense methods.

Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic, demonstrating their ability to deceive human perception and deep neural networks with stealth and success. However, current works usually sacrifice unrestricted degrees and subjectively select some image content to guarantee the photorealism of unrestricted adversarial examples, which limits their attack performance. To ensure the photorealism of adversarial examples and boost attack performance, we propose a novel unrestricted attack framework called Content-based Unrestricted Adversarial Attack. By leveraging a low-dimensional manifold that represents natural images, we map the images onto the manifold and optimize them along its adversarial direction. Within this framework, we implement Adversarial Content Attack (ACA) based on Stable Diffusion and can generate highly transferable unrestricted adversarial examples with various adversarial contents. Extensive experimentation and visualization demonstrate the efficacy of ACA, which surpasses state-of-the-art attacks by an average of 13.3-50.4\% on normally trained models and 16.8-48.0\% on defense methods.

Adapting Fairness Interventions to Missing Values
Raymond Feng Flavio Calmon Hao Wang



Research question: Missing values in real-world data pose a significant and unique challenge to algorithmic fairness.
Motivation: Different demographic groups may be unequally affected by missing data, and the standard procedure of imputing values and then classifying on the imputed data ("impute-then-classify") can exacerbate discrimination.
Method: Analyze how missing values affect algorithmic fairness and present scalable, adaptive fair-classification algorithms for data with missing values; these algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving the information encoded within those patterns.
Results: Numerical experiments with state-of-the-art fairness interventions show that the adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify.

Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values where first data is imputed, then the imputed data is used for classification—a procedure referred to as "impute-then-classify"—can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputing data results in the loss of the missing pattern of the data, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
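
One simple way to preserve the missing pattern rather than discard it, in the spirit of (but not identical to) the paper's adaptive algorithms, is to augment the imputed features with binary missingness indicators before handing the data to any downstream fairness-intervention classifier. A hypothetical two-feature example:

```python
import numpy as np

def augment_with_missingness(X: np.ndarray) -> np.ndarray:
    """Mean-impute NaNs but keep the missing pattern as extra indicator
    columns, so a downstream classifier retains the information that
    plain impute-then-classify would destroy."""
    missing = np.isnan(X).astype(float)            # the missing pattern
    col_means = np.nanmean(X, axis=0)
    X_imp = np.where(np.isnan(X), col_means, X)    # plain imputation
    return np.hstack([X_imp, missing])             # pattern preserved

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])
print(augment_with_missingness(X))
```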

Model Shapley: Equitable Model Valuation with Black-box Access
Xinyi Xu Thanh Lam Chuan-Sheng Foo Bryan Kian Hsiang Low



Research question: How to equitably value and price pre-trained machine learning models.
Motivation: Existing AI marketplaces that trade pre-trained ML models call for an equitable model valuation method to price them, particularly in the black-box access setting where model-specific information may not be disclosed.
Method: Exploiting a Dirichlet abstraction of a model's predictions, propose a novel and equitable model valuation method called model Shapley; further leverage a Lipschitz continuity of model Shapley to design a learning approach for predicting the model Shapley values of many vendors' models (e.g., 150) in a large-scale marketplace.
Results: Extensive empirical validation on various real-world datasets and heterogeneous model types demonstrates the effectiveness of model Shapley.

Valuation methods of data and machine learning (ML) models are essential to the establishment of AI marketplaces. Importantly, certain practical considerations (e.g., operational constraints, legal restrictions) favor the use of model valuation over data valuation. Also, existing marketplaces that involve trading of pre-trained ML models call for an equitable model valuation method to price them. In particular, we investigate the black-box access setting which allows querying a model (to observe predictions) without disclosing model-specific information (e.g., architecture and parameters). By exploiting a Dirichlet abstraction of a model’s predictions, we propose a novel and equitable model valuation method called model Shapley. We also leverage a Lipschitz continuity of model Shapley to design a learning approach for predicting the model Shapley values (MSVs) of many vendors’ models (e.g., 150) in a large-scale marketplace. We perform extensive empirical validation on the effectiveness of model Shapley using various real-world datasets and heterogeneous model types.

Black-box Backdoor Defense via Zero-shot Image Purification
Yucheng Shi Mengnan Du Xuansheng Wu Zihan Guan Jin Sun Ninghao Liu



Research question: How to defend against backdoor attacks, especially for real-world black-box models where only query access is permitted.
Motivation: Backdoor attacks inject poisoned samples into the training data, causing misclassification of poisoned inputs during a model's deployment; defending against such attacks is challenging.
Method: Propose a novel defense framework, Zero-shot Image Purification (ZIP), applicable to poisoned models without internal model information or any prior knowledge of clean/poisoned samples. The framework has two steps: first, apply a linear transformation (e.g., blurring) to the poisoned image to destroy the backdoor pattern; then use a pre-trained diffusion model to recover the semantic information removed by the transformation.
Results: Evaluations of ZIP on multiple datasets show that it outperforms state-of-the-art backdoor defense baselines, providing valuable insights for future defenses of black-box models.

Backdoor attacks inject poisoned samples into the training data, resulting in the misclassification of the poisoned input during a model's deployment. Defending against such attacks is challenging, especially for real-world black-box models where only query access is permitted. In this paper, we propose a novel defense framework against backdoor attacks through Zero-shot Image Purification (ZIP). Our framework can be applied to poisoned models without requiring internal information about the model or any prior knowledge of the clean/poisoned samples. Our defense framework involves two steps. First, we apply a linear transformation (e.g., blurring) on the poisoned image to destroy the backdoor pattern. Then, we use a pre-trained diffusion model to recover the missing semantic information removed by the transformation. In particular, we design a new reverse process by using the transformed image to guide the generation of high-fidelity purified images, which works in zero-shot settings. We evaluate our ZIP framework on multiple datasets with different types of attacks. Experimental results demonstrate the superiority of our ZIP framework compared to state-of-the-art backdoor defense baselines. We believe that our results will provide valuable insights for future defense methods for black-box models. Our code is available at https://github.com/sycny/ZIP.

Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
Zhaohan Xi Tianyu Du Changjiang Li Ren Pang Shouling Ji Jinghui Chen Fenglong Ma Ting Wang



Research question: The security risks of pre-trained language models used as few-shot learners remain largely unexplored.
Motivation: Existing defenses are inadequate against the vulnerabilities of pre-trained language models in few-shot settings due to the unique challenges of those scenarios.
Method: Propose MDP, a novel lightweight, pluggable, and effective defense that exploits the gap in masking sensitivity between poisoned and clean samples to identify poisoned inputs.
Results: Empirical evaluation using benchmark datasets and representative attacks validates the efficacy of MDP.

Pre-trained language models (PLMs) have demonstrated remarkable performance as few-shot learners. However, their security risks under such settings are largely unexplored. In this work, we conduct a pilot study showing that PLMs as few-shot learners are highly vulnerable to backdoor attacks while existing defenses are inadequate due to the unique challenges of few-shot scenarios. To address such challenges, we advocate MDP, a novel lightweight, pluggable, and effective defense for PLMs as few-shot learners. Specifically, MDP leverages the gap between the masking-sensitivity of poisoned and clean samples: with reference to the limited few-shot data as distributional anchors, it compares the representations of given samples under varying masking and identifies poisoned samples as ones with significant variations. We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness. The empirical evaluation using benchmark datasets and representative attacks validates the efficacy of MDP. The code of MDP is publicly available.

Defending against Data-Free Model Extraction by Distributionally Robust Defensive Training
Zhenyi Wang Li Shen Tongliang Liu Tiehang Duan Yanjun Zhu Donglin Zhan David Doermann Mingchen Gao



Research question: How to prevent Data-Free Model Extraction (DFME), which clones a black-box model without knowledge of its original training data distribution.
Motivation: Existing defenses are computation- and memory-inefficient, require strong assumptions about the attack data distribution, or can only delay the attack or prove a theft after the model stealing has already happened.
Method: Propose MeCo, a memory- and computation-efficient defense that prevents DFME while maintaining model utility via distributionally robust defensive training of the target victim model. Specifically, the input is randomized so that it (1) causes a mismatch in the attacker's knowledge-distillation loss, (2) disturbs zeroth-order gradient estimation, and (3) changes the label predictions for attack queries; the attacker can thus only extract misleading information from the black-box model.
Results: Extensive experiments show that MeCo significantly reduces the effectiveness of existing DFME methods and substantially improves running efficiency.

Data-Free Model Extraction (DFME) aims to clone a black-box model without knowing its original training data distribution, making it much easier for attackers to steal commercial models. Defense against DFME faces several challenges: (i) effectiveness; (ii) efficiency; (iii) no prior on the attacker's query data distribution and strategy. However, existing defense methods: (1) are highly computation and memory inefficient; or (2) need strong assumptions about attack data distribution; or (3) can only delay the attack or prove a model theft after the model stealing has happened. In this work, we propose a Memory and Computation efficient defense approach, named MeCo, to prevent DFME from happening while maintaining the model utility simultaneously by distributionally robust defensive training on the target victim model. Specifically, we randomize the input so that it: (1) causes a mismatch of the knowledge distillation loss for attackers; (2) disturbs the zeroth-order gradient estimation; (3) changes the label prediction for the attack query data. Therefore, the attacker can only extract misleading information from the black-box model. Extensive experiments on defending against both decision-based and score-based DFME demonstrate that MeCo can significantly reduce the effectiveness of existing DFME methods and substantially improve running efficiency.

The Distortion of Binomial Voting Defies Expectation
Yannai Gonczarowski Gregory Kehne Ariel D. Procaccia Ben Schiffer Shirley Zhang



Research question: Study the distortion of voting rules in computational social choice, i.e., the degree to which a rule overcomes limited preference information to select a socially desirable outcome.
Motivation: Distortion has been investigated extensively, but only through a worst-case lens; this work instead studies expected distortion with respect to an underlying distribution over voter utilities.
Method: Design and analyze a novel and intuitive rule, binomial voting.
Results: Binomial voting provides strong distribution-independent guarantees for both expected distortion and expected welfare.

In computational social choice, the distortion of a voting rule quantifies the degree to which the rule overcomes limited preference information to select a socially desirable outcome. This concept has been investigated extensively, but only through a worst-case lens. Instead, we study the expected distortion of voting rules with respect to an underlying distribution over voter utilities. Our main contribution is the design and analysis of a novel and intuitive rule, binomial voting, which provides strong distribution-independent guarantees for both expected distortion and expected welfare.

Asymmetric Certified Robustness via Feature-Convex Neural Networks
Samuel Pfrommer Brendon G. Anderson Julien Piet Somayeh Sojoudi



Research question: How to achieve asymmetric certified robustness for machine learning models and improve their defense against attacks.
Motivation: Real-world adversarial attacks often have an asymmetric structure in which adversaries only attempt to induce false negatives, motivating a formalization of the asymmetric robustness certification problem and a corresponding solution.
Method: Present the feature-convex neural network architecture, which composes an input-convex neural network (ICNN) with a Lipschitz continuous feature map to achieve asymmetric adversarial robustness, with deterministic, closed-form, easily computable certified robust radii for arbitrary $\ell_p$-norms on the sensitive class.
Results: Experiments on malware classification and subsets of the MNIST, Fashion-MNIST, and CIFAR-10 datasets show that feature-convex classifiers attain substantial certified $\ell_1$, $\ell_2$, and $\ell_{\infty}$ radii while being far more computationally efficient than competitive baselines.

Real-world adversarial attacks on machine learning models often feature an asymmetric structure wherein adversaries only attempt to induce false negatives (e.g., classify a spam email as not spam). We formalize the asymmetric robustness certification problem and correspondingly present the feature-convex neural network architecture, which composes an input-convex neural network (ICNN) with a Lipschitz continuous feature map in order to achieve asymmetric adversarial robustness. We consider the aforementioned binary setting with one "sensitive" class, and for this class we prove deterministic, closed-form, and easily-computable certified robust radii for arbitrary $\ell_p$-norms. We theoretically justify the use of these models by characterizing their decision region geometry, extending the universal approximation theorem for ICNN regression to the classification setting, and proving a lower bound on the probability that such models perfectly fit even unstructured uniformly distributed data in sufficiently high dimensions. Experiments on Malimg malware classification and subsets of the MNIST, Fashion-MNIST, and CIFAR-10 datasets show that feature-convex classifiers attain substantial certified $\ell_1$, $\ell_2$, and $\ell_{\infty}$-radii while being far more computationally efficient than competitive baselines.
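
For reference, a minimal input-convex network of the kind composed here: convexity in the input follows from keeping the hidden-path weights nonnegative and using convex, nondecreasing activations. This is a generic ICNN sketch, not the paper's exact architecture; composing it with a Lipschitz feature map would give a feature-convex classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Scalar score g(x), convex in x: the z->z weights and the readout
    weights are kept nonnegative (via softplus), and ReLU is convex and
    nondecreasing, so every layer preserves convexity."""
    def __init__(self, dim, hidden, depth=2):
        super().__init__()
        self.first = nn.Linear(dim, hidden)
        self.Wz = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(hidden, hidden)) for _ in range(depth)])
        self.Wx = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(depth)])
        self.out_w = nn.Parameter(0.1 * torch.randn(hidden))
        self.out_b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        z = F.relu(self.first(x))
        for Wz, Wx in zip(self.Wz, self.Wx):
            # nonnegative weights on the convex path keep z convex in x
            z = F.relu(z @ F.softplus(Wz).T + Wx(x))
        return z @ F.softplus(self.out_w) + self.out_b  # convex scalar score

g = ICNN(dim=8, hidden=32)
print(g(torch.randn(4, 8)).shape)  # torch.Size([4])
```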

Data Minimization at Inference Time
Cuong Tran Ferdinando Fioretto



Research question: In high-stakes domains such as legal, banking, hiring, and healthcare, learning models frequently rely on sensitive user information for inference; are all input features actually needed for accurate predictions?
Motivation: Reducing the features disclosed at inference time would significantly lower individuals' privacy risks and reduce the substantial human effort organizations spend verifying information accuracy.
Method: Propose an efficient sequential algorithm that determines the appropriate attributes each individual should disclose.
Results: Evaluations across various learning tasks show that, in a personalized setting, individuals may need to report as little as 10% of their information while maintaining the same accuracy as a model that uses the full set of user information.

In high-stakes domains such as legal, banking, hiring, and healthcare, learning models frequently rely on sensitive user information for inference, necessitating the complete set of features. This not only poses significant privacy risks for individuals but also demands substantial human effort from organizations to verify information accuracy. This study asks whether it is necessary to use all input features for accurate predictions at inference time. The paper demonstrates that, in a personalized setting, individuals may only need to disclose a small subset of features without compromising decision-making accuracy. The paper also provides an efficient sequential algorithm to determine the appropriate attributes for each individual to provide. Evaluations across various learning tasks show that individuals can potentially report as little as 10\% of their information while maintaining the same accuracy level as a model that employs the full set of user information.
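
The paper's sequential algorithm is not reproduced here; the sketch below only illustrates the general idea under simplifying assumptions: greedily reveal one feature at a time (imputing the rest with population means) and stop once no remaining feature meaningfully moves the model's score, so each individual discloses a personalized subset.

```python
import numpy as np

def greedy_disclosure(x, score, col_means, tol=1e-3):
    """Illustrative greedy feature disclosure (not the paper's algorithm):
    evaluate the model on a mean-imputed view, repeatedly reveal the
    feature that changes the score the most, and stop once no remaining
    feature moves the score by more than tol."""
    revealed = np.zeros(len(x), dtype=bool)
    view = col_means.astype(float).copy()
    while True:
        base = score(view)
        gaps = {}
        for j in np.flatnonzero(~revealed):
            trial = view.copy()
            trial[j] = x[j]
            gaps[j] = abs(score(trial) - base)
        if not gaps or max(gaps.values()) <= tol:
            break
        j = max(gaps, key=gaps.get)
        revealed[j] = True
        view[j] = x[j]
    return revealed, score(view)

score = lambda v: 1 / (1 + np.exp(-(2 * v[0] - v[1])))  # toy logistic model
mask, final = greedy_disclosure(np.array([1.5, -0.5, 0.2]), score, np.zeros(3))
print(mask, round(float(final), 3))  # the third feature is never needed
```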

Fairness Aware Counterfactuals for Subgroups
Loukas Kavouras Konstantinos Tsopelas Giorgos Giannopoulos Dimitris Sacharidis Eleni Psaroudaki Nikolaos Theologitis Dimitrios Rontogiannis Dimitris Fotakis Ioannis Emiris



Research question: Audit subgroup fairness through counterfactual explanations, via a framework named FACTS.
Motivation: Revisit and generalize existing notions of subgroup fairness and introduce new, more refined notions.
Method: Build a model-agnostic, efficient, highly parameterizable, and explainable framework for evaluating subgroup fairness.
Results: A thorough experimental evaluation on different benchmark datasets demonstrates the advantages, wide applicability, and efficiency of the approach.

In this work, we present Fairness Aware Counterfactuals for Subgroups (FACTS), a framework for auditing subgroup fairness through counterfactual explanations. We start with revisiting (and generalizing) existing notions and introducing new, more refined notions of subgroup fairness. We aim to (a) formulate different aspects of the difficulty of individuals in certain subgroups to achieve recourse, i.e. receive the desired outcome, either at the micro level, considering members of the subgroup individually, or at the macro level, considering the subgroup as a whole, and (b) introduce notions of subgroup fairness that are robust, if not totally oblivious, to the cost of achieving recourse. We accompany these notions with an efficient, model-agnostic, highly parameterizable, and explainable framework for evaluating subgroup fairness. We demonstrate the advantages, the wide applicability, and the efficiency of our approach through a thorough experimental evaluation on different benchmark datasets.

Optimal privacy guarantees for a relaxed threat model: Addressing sub-optimal adversaries in differentially private machine learning
Georgios Kaissis Alexander Ziller Stefan Kolek Anneliese Riess Daniel Rueckert



Research question: How to bound the privacy leakage of machine learning models under a realistic threat-model relaxation in which (sub-optimal) adversaries lack access to the exact model training database but may possess related or partial data.
Motivation: Existing differentially private mechanisms target powerful (optimal) adversaries, which are rarely encountered in practice; this work therefore considers the more realistic relaxation of sub-optimal adversaries with only related or partial data.
Method: Formally characterize and experimentally validate adversarial membership inference capabilities in this setting in terms of hypothesis testing errors.
Results: The work helps users interpret the privacy properties of sensitive data processing systems under realistic threat-model relaxations and choose appropriate noise levels for their use case.

Differentially private mechanisms restrict the membership inference capabilities of powerful (optimal) adversaries against machine learning models. Such adversaries are rarely encountered in practice. In this work, we examine a more realistic threat model relaxation, where (sub-optimal) adversaries lack access to the exact model training database, but may possess related or partial data. We then formally characterise and experimentally validate adversarial membership inference capabilities in this setting in terms of hypothesis testing errors. Our work helps users to interpret the privacy properties of sensitive data processing systems under realistic threat model relaxations and choose appropriate noise levels for their use-case.

Fairly Recommending with Social Attributes: A Flexible and Controllable Optimization Approach
Jinqiu Jin Haoxuan Li Fuli Feng Sihao Ding Peng Wu Xiangnan He



Research question: Address item-side group fairness (IGF) in recommendation, which requires a recommendation model to treat different item groups similarly.
Motivation: Existing IGF notions only consider the direct utility of item exposures, i.e., the exposure numbers across item groups, while neglecting the social utility gained from neighboring users via social influence, such as information sharing on social media.
Method: Introduce two social attribute-aware IGF metrics, which require similar user social attributes on the items exposed across different item groups. In light of the trade-off between direct and social utility, formulate a new multi-objective optimization problem for training recommender models with a flexible trade-off while ensuring controllable accuracy, and develop a gradient-based optimization algorithm that provably finds Pareto optimal solutions with varying trade-offs and guaranteed accuracy.
Results: Extensive experiments on two real-world datasets validate the effectiveness of the approach.

Item-side group fairness (IGF) requires a recommendation model to treat different item groups similarly, and has a crucial impact on information diffusion, consumption activity, and market equilibrium. Previous IGF notions only focus on the direct utility of the item exposures, i.e., the exposure numbers across different item groups. Nevertheless, the item exposures also facilitate utility gained from the neighboring users via social influence, called social utility, such as information sharing on the social media. To fill this gap, this paper introduces two social attribute-aware IGF metrics, which require similar user social attributes on the exposed items across the different item groups. In light of the trade-off between the direct utility and social utility, we formulate a new multi-objective optimization problem for training recommender models with flexible trade-off while ensuring controllable accuracy. To solve this problem, we develop a gradient-based optimization algorithm and theoretically show that the proposed algorithm can find Pareto optimal solutions with varying trade-off and guaranteed accuracy. Extensive experiments on two real-world datasets validate the effectiveness of our approach.

Training Private Models That Know What They Don’t Know
Stephan Rabanser Anvith Thudi Abhradeep Guha Thakurta Krishnamurthy Dj Dvijotham Nicolas Papernot



Research question: Train reliable deep learning models that avoid overconfident but incorrect predictions, particularly under differential privacy where sensitive data must be protected.
Motivation: Under a differential privacy constraint, some popular selective prediction approaches can increase the risk of privacy leakage, whereas a recent approach that only uses checkpoints produced by an off-the-shelf private learning algorithm stands out as particularly suitable.
Method: Conduct a thorough empirical investigation of selective classifiers under differential privacy and propose a novel evaluation mechanism that isolates selective prediction performance across model utility levels at full coverage.
Results: Experiments show that recovering the performance level attainable by non-private models is possible, but it comes at a considerable coverage cost as the privacy budget decreases.

Training reliable deep learning models which avoid making overconfident but incorrect predictions is a longstanding challenge. This challenge is further exacerbated when learning has to be differentially private: protection provided to sensitive data comes at the price of injecting additional randomness into the learning process. In this work, we conduct a thorough empirical investigation of selective classifiers---that can abstain under uncertainty---under a differential privacy constraint. We find that some popular selective prediction approaches are ineffective in a differentially private setting because they increase the risk of privacy leakage. At the same time, we identify that a recent approach that only uses checkpoints produced by an off-the-shelf private learning algorithm stands out as particularly suitable under DP. Further, we show that differential privacy does not just harm utility but also degrades selective classification performance. To analyze this effect across privacy levels, we propose a novel evaluation mechanism which isolates selective prediction performance across model utility levels at full coverage. Our experimental results show that recovering the performance level attainable by non-private models is possible but comes at a considerable coverage cost as the privacy budget decreases.

Predict-then-Calibrate: A New Perspective of Robust Contextual LP
Chunlin Sun Linyu Liu Xiaocheng Li



Research question: Address optimization problems in the presence of covariates (context or side information), where a prediction model learned from training data predicts the objective function from the covariates at test time.
Motivation: Existing methods suffer from either a restricted choice of prediction model or strong assumptions on the underlying data; a new approach is needed that fully unleashes the potential of off-the-shelf machine learning methods and derives risk and robustness guarantees independently of the choice of prediction model.
Method: Propose a generic algorithm design paradigm called predict-then-calibrate: first develop a prediction model without concern for the downstream risk profile or robustness guarantee, then use calibration (or recalibration) methods to quantify the uncertainty of the prediction.
Results: The paradigm yields new generalization bounds for the contextual LP problem and sheds light on existing DRO results for contextual LP; numerical experiments show that improving either the prediction model or the calibration model leads to better final performance.

Contextual optimization, also known as predict-then-optimize or prescriptive analytics, considers an optimization problem with the presence of covariates (context or side information). The goal is to learn a prediction model (from the training data) that predicts the objective function from the covariates, and then in the test phase, solve the optimization problem with the covariates but without the observation of the objective function. In this paper, we consider a risk-sensitive version of the problem and propose a generic algorithm design paradigm called predict-then-calibrate. The idea is to first develop a prediction model without concern for the downstream risk profile or robustness guarantee, and then utilize calibration (or recalibration) methods to quantify the uncertainty of the prediction. While the existing methods suffer from either a restricted choice of the prediction model or strong assumptions on the underlying data, we show the disentangling of the prediction model and the calibration/uncertainty quantification has several advantages. First, it imposes no restriction on the prediction model and thus fully unleashes the potential of off-the-shelf machine learning methods. Second, the derivation of the risk and robustness guarantee can be made independent of the choice of the prediction model through a data-splitting idea. Third, our paradigm of predict-then-calibrate applies to both (risk-sensitive) robust and (risk-neutral) distributionally robust optimization (DRO) formulations. Theoretically, it gives new generalization bounds for the contextual LP problem and sheds light on the existing results of DRO for contextual LP. Numerical experiments further reinforce the advantage of the predict-then-calibrate paradigm in that an improvement on either the prediction model or the calibration model will lead to a better final performance.
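
A minimal sketch of the two-stage recipe under a split-conformal instantiation (one concrete choice of calibration; the paper covers more general schemes): fit any predictor on one data split, then use held-out residuals to build an uncertainty interval around the predicted objective coefficient that a downstream robust LP could consume. All names and data here are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
c = X @ rng.normal(size=5) + 0.3 * rng.normal(size=600)  # objective coefficient

# Step 1 (predict): fit any off-the-shelf model on a training split.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], c[:400])

# Step 2 (calibrate): a held-out residual quantile gives a coverage radius.
resid = np.abs(c[400:] - model.predict(X[400:]))
q = np.quantile(resid, 0.9)  # ~90% of true coefficients fall within +/- q

# At test time, [c_hat - q, c_hat + q] defines the uncertainty set that a
# robust / distributionally robust contextual LP would optimize against.
x_new = rng.normal(size=(1, 5))
c_hat = model.predict(x_new)[0]
print(f"predicted c: {c_hat:.2f}, uncertainty set: [{c_hat - q:.2f}, {c_hat + q:.2f}]")
```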

Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks
Jimmy Z. Di Jack Douglas Jayadev Acharya Gautam Kamath Ayush Sekhari



Research question: Introduce camouflaged data poisoning attacks, a new attack vector that arises when model retraining may be induced, e.g., by machine unlearning.
Motivation: When a model must be retrained, an attacker can influence its predictions through carefully crafted points whose later removal unleashes the attack.
Method: Realize clean-label targeted attacks on datasets including CIFAR-10, Imagenette, and Imagewoof by constructing camouflage datapoints that mask the effect of a poisoned dataset.
Results: The attack is effective both when unlearning is performed via retraining from scratch (the idealized setting of machine unlearning that other efficient methods attempt to emulate) and against the approximate unlearning approach of Graves et al. (2021).

We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset. We demonstrate efficacy of our attack when unlearning is performed via retraining from scratch, the idealized setting of machine unlearning which other efficient methods attempt to emulate, as well as against the approximate unlearning approach of Graves et al. (2021).

Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples
Shaokui Wei Mingda Zhang Hongyuan Zha Baoyuan Wu



Research question: Address backdoor attacks on machine learning models, in which an attacker injects poisoned samples into the training set so that the model predicts particular target classes for inputs carrying particular triggers.
Motivation: Backdoor attacks are a serious security threat to machine learning models, creating the need for a way to purify backdoored models.
Method: Propose a new method, Shared Adversarial Unlearning (SAU): first generate shared adversarial examples (SAEs), then unlearn them so that they are either correctly classified by the purified model or classified differently by the two models, thereby mitigating the backdoor effect of the backdoored model in the purified model.
Results: Experiments on various benchmark datasets and network architectures show state-of-the-art backdoor defense performance.

Backdoor attacks are serious security threats to machine learning models where an adversary can inject poisoned samples into the training set, causing a backdoored model which predicts poisoned samples with particular triggers to particular target classes, while behaving normally on benign samples. In this paper, we explore the task of purifying a backdoored model using a small clean dataset. By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk, which mainly captures the risk on the shared adversarial examples (SAEs) between the backdoored model and the purified model. This upper bound further suggests a novel bi-level optimization problem for mitigating backdoor using adversarial training techniques. To solve it, we propose Shared Adversarial Unlearning (SAU). Specifically, SAU first generates SAEs, and then, unlearns the generated SAEs such that they are either correctly classified by the purified model and/or differently classified by the two models, such that the backdoor effect in the backdoored model will be mitigated in the purified model. Experiments on various benchmark datasets and network architectures show that our proposed method achieves state-of-the-art performance for backdoor defense. The code is available at https://github.com/SCLBD/BackdoorBench (PyTorch) and https://github.com/shawkui/MindTrojan (MindSpore).

Neural Polarizer: A Lightweight and Effective Backdoor Defense via Purifying Poisoned Features
Mingli Zhu Shaokui Wei Hongyuan Zha Baoyuan Wu



Research question: The susceptibility of deep neural networks to backdoor attacks.
Motivation: Inspired by the mechanism of the optical polarizer, propose a new backdoor defense method.
Method: Insert a learnable neural polarizer into the backdoored model as an intermediate layer, purifying poisoned samples by filtering trigger information while preserving benign information.
Results: Experiments demonstrate the effectiveness and efficiency of the method in removing backdoors across various neural network architectures and datasets, especially when clean data is very limited.

Recent studies have demonstrated the susceptibility of deep neural networks to backdoor attacks. Given a backdoored model, its prediction of a poisoned sample with trigger will be dominated by the trigger information, though trigger information and benign information coexist. Inspired by the mechanism of the optical polarizer that a polarizer could pass light waves with particular polarizations while filtering light waves with other polarizations, we propose a novel backdoor defense method by inserting a learnable neural polarizer into the backdoored model as an intermediate layer, in order to purify the poisoned sample via filtering trigger information while maintaining benign information. The neural polarizer is instantiated as one lightweight linear transformation layer, which is learned through solving a well designed bi-level optimization problem, based on a limited clean dataset. Compared to other fine-tuning-based defense methods which often adjust all parameters of the backdoored model, the proposed method only needs to learn one additional layer, such that it is more efficient and requires less clean data. Extensive experiments demonstrate the effectiveness and efficiency of our method in removing backdoors across various neural network architectures and datasets, especially in the case of very limited clean data. Codes are available at \href{https://github.com/SCLBD/BackdoorBench}{https://github.com/SCLBD/BackdoorBench} (PyTorch) and \href{https://github.com/JulieCarlon/NPD-MindSpore}{https://github.com/JulieCarlon/NPD-MindSpore} (MindSpore).
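
A minimal sketch of the insertion step only: freeze the backdoored network, splice a single learnable linear layer between its two halves, and train just that layer on a small clean set. The paper learns the layer via a bi-level objective, which is omitted here; class and variable names are made up.

```python
import torch
import torch.nn as nn

class Polarized(nn.Module):
    """Backdoored model with one learnable linear 'polarizer' layer
    spliced between its frozen front and back halves."""
    def __init__(self, front: nn.Module, back: nn.Module, width: int):
        super().__init__()
        for p in list(front.parameters()) + list(back.parameters()):
            p.requires_grad = False                 # only the polarizer trains
        self.front, self.back = front, back
        self.polarizer = nn.Linear(width, width)
        nn.init.eye_(self.polarizer.weight)         # start as identity
        nn.init.zeros_(self.polarizer.bias)

    def forward(self, x):
        return self.back(self.polarizer(self.front(x)))

front = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
back = nn.Linear(32, 2)
model = Polarized(front, back, width=32)
print(sum(p.requires_grad for p in model.parameters()))  # 2: weight and bias
```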

Public Opinion Field Effect Fusion in Representation Learning for Trending Topics Diffusion
Junliang Li Yajun Yang Qinghua Hu Xin Wang Hong Gao



Research question: Address trending topic diffusion and prediction analysis in social media, where existing methods do not consider the public opinion field effect.
Motivation: In the real world, several trending topics or opinion leaders often coexist, each forming its own public opinion field; these fields compete for the public's attention and affect the development of public opinion, yet existing methods ignore this phenomenon.
Method: Propose a novel heterogeneous representation learning framework that incorporates the public opinion field effect and the social circle influence effect to predict trending topic diffusion more accurately.
Results: Extensive experiments on real-world datasets validate the superiority of the model.

Trending topic diffusion and prediction analysis is an important problem and has been well studied in social networks. Representation learning is an effective way to extract node embeddings, which can help for topic propagation analysis by completing downstream tasks such as link prediction and node classification. In real world, there are often several trending topics or opinion leaders in public opinion space at the same time and they can be regarded as different centers of public opinion. A public opinion field will be formed surrounding every center. These public opinion fields compete for public's attention and it will potentially affect the development of public opinion. However, the existing methods do not consider public opinion field effect for trending topics diffusion. In this paper, we introduce three well-known observations about public opinion field effect in media and communication studies, and propose a novel and effective heterogeneous representation learning framework to incorporate public opinion field effect and social circle influence effect. To the best of our knowledge, our work is the first to consider these effects in representation learning for trending topic diffusion. Extensive experiments on real-world datasets validate the superiority of our model.

Strategic Data Sharing between Competitors
Nikita Tsoy Nikola Konstantinov



Research question: How can data sharing improve machine learning models while protecting each firm's business interests?
Motivation: Although cross-organization data sharing can improve a firm's machine learning model, it may also benefit competitors and hence reduce profits.
Method: Introduce an analytical framework with three components representing the firms' production decisions, the effect of additional data on model quality, and the data-sharing negotiation process, and study an instantiation based on a conventional market model from economic theory.
Results: Market conditions have a profound impact on data-sharing incentives; in particular, reduced competition (in terms of the similarity between the firms' products) and harder learning tasks foster collaboration.

Collaborative learning techniques have significantly advanced in recent years, enabling private model training across multiple organizations. Despite this opportunity, firms face a dilemma when considering data sharing with competitors—while collaboration can improve a company’s machine learning model, it may also benefit competitors and hence reduce profits. In this work, we introduce a general framework for analyzing this data-sharing trade-off. The framework consists of three components, representing the firms’ production decisions, the effect of additional data on model quality, and the data-sharing negotiation process, respectively. We then study an instantiation of the framework, based on a conventional market model from economic theory, to identify key factors that affect collaboration incentives. Our findings indicate a profound impact of market conditions on the data-sharing incentives. In particular, we find that reduced competition, in terms of the similarities between the firms’ products, and harder learning tasks foster collaboration.

Offline Reinforcement Learning with Differential Privacy
Dan Qiao Yu-Xiang Wang



Research question: How to design offline reinforcement learning algorithms with differential privacy guarantees that protect individuals' sensitive information in the training data and provably prevent privacy risks.
Motivation: Policies learned by existing offline RL algorithms could retain sensitive information of individuals in the training data, making them susceptible to various privacy risks.
Method: Design offline RL algorithms with differential privacy guarantees that enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process settings.
Results: Theory and simulation suggest that, for a medium-size dataset, the privacy guarantee comes at almost no drop in utility compared with the non-private counterpart.

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.

Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger
Zhiqi Bu Yu-Xiang Wang Sheng Zha George Karypis



Research question: How to train deep learning models effectively under differential privacy.
Motivation: Choosing an appropriate clipping threshold $R$ is vital for achieving high accuracy under differential privacy, yet $R$ must be tuned for every DP optimizer, including DP-SGD, DP-Adam, and DP-LAMB.
Method: Propose an easy-to-use replacement called automatic clipping that eliminates the need to tune $R$ for any DP optimizer.
Results: The automatic variants are as private and computationally efficient as existing DP optimizers, require no DP-specific hyperparameters, and make DP training as amenable as standard non-private training. A rigorous convergence analysis of automatic DP-SGD in the non-convex setting shows an asymptotic convergence rate matching standard SGD under a symmetric per-sample gradient noise assumption (common in the non-DP literature). On various language and vision tasks, automatic clipping outperforms or matches the state-of-the-art and can be employed with minimal changes to existing codebases.

Per-example gradient clipping is a key algorithmic step that enables practical differential private (DP) training for deep learning models. The choice of clipping threshold $R$, however, is vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune $R$ for any DP optimizers, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as the standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it can enjoy an asymptotic convergence rate that matches the standard SGD, under a symmetric gradient noise assumption of the per-sample gradients (commonly used in the non-DP literature). We demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.
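
The per-sample step can be sketched as below. Per the abstract, automatic clipping removes the need to tune $R$; a published variant replaces the clip-at-$R$ operation with normalizing each per-sample gradient as $g / (\lVert g \rVert + \gamma)$. The exact form here (the stability constant gamma and the noise scale) should be read as an assumption based on that description.

```python
import numpy as np

def automatic_clip_step(per_sample_grads, noise_mult, lr, gamma=0.01,
                        rng=np.random.default_rng(0)):
    """One DP-SGD style step with automatic (per-sample) clipping:
    each gradient is normalized by (||g|| + gamma) instead of being
    clipped at a tuned threshold R, then Gaussian noise is added.
    Each normalized gradient has norm < 1, so sensitivity is bounded."""
    clipped = [g / (np.linalg.norm(g) + gamma) for g in per_sample_grads]
    summed = np.sum(clipped, axis=0)
    noisy = summed + noise_mult * rng.normal(size=summed.shape)
    return -lr * noisy / len(per_sample_grads)   # parameter update

grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]
print(automatic_clip_step(grads, noise_mult=1.0, lr=0.1))
```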

Unbounded Differentially Private Quantile and Maximum Estimation
David Durfee



Research question: How to efficiently compute differentially private quantiles of the data, especially the highest quantiles such as the maximum, when the dataset has no known upper bound.
Motivation: Current quantile computation methods are inefficient for unbounded datasets and cannot guarantee good privacy protection in that regime.
Method: A simple invocation of the AboveThreshold subroutine, which is iteratively called within the fundamental Sparse Vector Technique, suffices even when there is no upper bound on the data.
Results: The procedure gives more accurate and robust estimates of the highest quantiles, with applications to the clipping that is essential for differentially private sum and mean estimation; moreover, two invocations can handle the fully unbounded data setting.

In this work we consider the problem of differentially private computation of quantiles for the data, especially the highest quantiles such as maximum, but with an unbounded range for the dataset. We show that this can be done efficiently through a simple invocation of $\texttt{AboveThreshold}$, a subroutine that is iteratively called in the fundamental Sparse Vector Technique, even when there is no upper bound on the data. In particular, we show that this procedure can give more accurate and robust estimates on the highest quantiles with applications towards clipping that is essential for differentially private sum and mean estimation. In addition, we show how two invocations can handle the fully unbounded data setting. Within our study, we show that an improved analysis of $\texttt{AboveThreshold}$ can improve the privacy guarantees for the widely used Sparse Vector Technique that is of independent interest. We give a more general characterization of privacy loss for $\texttt{AboveThreshold}$ which we immediately apply to our method for improved privacy guarantees. Our algorithm only requires one $O(n)$ pass through the data, which can be unsorted, and each subsequent query takes $O(1)$ time. We empirically compare our unbounded algorithm with the state-of-the-art algorithms in the bounded setting. For inner quantiles, we find that our method often performs better on non-synthetic datasets. For the maximal quantiles, which we apply to differentially private sum computation, we find that our method performs significantly better.
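
For readers unfamiliar with the subroutine, a standard AboveThreshold sketch follows (the textbook Sparse Vector form, not the paper's improved analysis): it privately reports the index of the first query whose noisy value exceeds a noisy threshold. Query and parameter choices below are illustrative.

```python
import numpy as np

def above_threshold(queries, data, threshold, eps, rng=np.random.default_rng(0)):
    """Standard AboveThreshold for sensitivity-1 queries: return the index
    of the first query whose noisy answer exceeds the noisy threshold, or
    None. Satisfies eps-DP regardless of how many queries are asked."""
    noisy_t = threshold + rng.laplace(scale=2.0 / eps)
    for i, q in enumerate(queries):
        if q(data) + rng.laplace(scale=4.0 / eps) >= noisy_t:
            return i
    return None

data = np.array([3.1, 0.2, 8.9, 5.5, 7.7])
# Quantile-style counting queries: "how many points lie below b?"
queries = [lambda d, b=b: float((d < b).sum()) for b in (1, 4, 6, 8, 10)]
print(above_threshold(queries, data, threshold=3.0, eps=1.0))
```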

The Target-Charging Technique for Privacy Analysis across Interactive Computations
Edith Cohen Xin Lyu



Research question: Propose the Target Charging Technique (TCT), a unified privacy analysis framework for interactive settings where a sensitive dataset is accessed multiple times using differentially private algorithms.
Motivation: Under traditional composition, privacy guarantees deteriorate quickly with the number of accesses; TCT instead allows computations that do not hit a specified target, often the vast majority, to be essentially free, while incurring only a small overhead on those that do hit their targets.
Method: TCT generalizes tools such as the sparse vector technique and top-k selection from private candidates, extending their remarkable privacy enhancement benefits from noisy Lipschitz functions to general private algorithms.
Results: Across interactive computations, TCT charges privacy budget only to the (typically few) target-hitting accesses, so privacy guarantees degrade far more slowly than under traditional composition.

We propose the \emph{Target Charging Technique} (TCT), a unified privacy analysis framework for interactive settings where a sensitive dataset is accessed multiple times using differentially private algorithms. Unlike traditional composition, where privacy guarantees deteriorate quickly with the number of accesses, TCT allows computations that don't hit a specified \emph{target}, often the vast majority, to be essentially free (while incurring instead a small overhead on those that do hit their targets). TCT generalizes tools such as the sparse vector technique and top-k selection from private candidates and extends their remarkable privacy enhancement benefits from noisy Lipschitz functions to general private algorithms.

Calibrating “Cheap Signals” in Peer Review without a Prior
Yuxuan Lu Yuqing Kong



Research question: How to reduce noise and bias in peer review and make paper rankings fairer.
Motivation: Ranking papers by average ratings cannot effectively eliminate varying noise levels and systematic biases in reviews, leading to unfair rankings.
Method: Propose a one-shot noise calibration process that requires no prior information: reviewers are asked to predict others' scores, and these predictions are used for calibration.
Results: The calibrated score yields a more robust ranking than average ratings even under varying noise levels and biases; its error probability approaches zero as the number of reviewers increases and is significantly lower than that of average ratings when the number of reviewers is small.

Peer review lies at the core of the academic process, but even well-intentioned reviewers can still provide noisy ratings. While ranking papers by average ratings may reduce noise, varying noise levels and systematic biases stemming from ``cheap'' signals (e.g. author identity, proof length) can lead to unfairness. Detecting and correcting bias is challenging, as ratings are subjective and unverifiable. Unlike previous works relying on prior knowledge or historical data, we propose a one-shot noise calibration process without any prior information. We ask reviewers to predict others' scores and use these predictions for calibration. Assuming reviewers adjust their predictions according to the noise, we demonstrate that the calibrated score results in a more robust ranking compared to average ratings, even with varying noise levels and biases. In detail, we show that the error probability of the calibrated score approaches zero as the number of reviewers increases and is significantly lower compared to average ratings when the number of reviewers is small.

Strategyproof Voting under Correlated Beliefs
Daniel Halpern Rachel Li Ariel D. Procaccia



Research question: In voting theory, when voters have ranked preferences over candidates, the Gibbard-Satterthwaite Theorem essentially rules out reasonable strategyproof methods for picking a winner. What if strategyproofness is weakened to hold only for Bayesian voters with beliefs over others' preferences?
Motivation: When voters believe other participants' rankings are drawn independently from a fixed distribution, the impossibility persists; however, it is quite reasonable for a voter to believe that other votes are correlated, either with each other or with their own ranking.
Method: Consider beliefs induced by classic probabilistic models in social choice such as the Mallows, Plackett-Luce, and Thurstone-Mosteller models, and single out the plurality rule (choosing the candidate ranked first most often) as a particularly promising choice, since it is strategyproof for a large class of beliefs containing these specific ones.
Results: Plurality is shown to be unique among positional scoring rules in having this property: with sufficiently many voters, no other scoring rule is strategyproof for beliefs induced by the Mallows model. Further bolstering the case for plurality, prominent non-scoring voting rules fail to be strategyproof on beliefs in this class.

In voting theory, when voters have ranked preferences over candidates, the celebrated Gibbard-Satterthwaite Theorem essentially rules out the existence of reasonable strategyproof methods for picking a winner. What if we weaken strategyproofness to only hold for Bayesian voters with beliefs over others' preferences? When voters believe other participants' rankings are drawn independently from a fixed distribution, the impossibility persists. However, it is quite reasonable for a voter to believe that other votes are correlated, either to each other or to their own ranking. We consider such beliefs induced by classic probabilistic models in social choice such as the Mallows, Plackett-Luce, and Thurstone-Mosteller models. We single out the plurality rule (choosing the candidate ranked first most often) as a particularly promising choice as it is strategyproof for a large class of beliefs containing the specific ones we introduce. Further, we show that plurality is unique among positional scoring rules in having this property: no other scoring rule is strategyproof for beliefs induced by the Mallows model when there are a sufficient number of voters. Finally, we give examples of prominent non-scoring voting rules failing to be strategyproof on beliefs in this class, further bolstering the case for plurality.

A normative theory of social conflict
Sergey A. Shuvaev Evgeny M Amelchenko Dmitry Smagin Natalia Kudryavtseva Grigori Enikolopov Alexei A. Koulakov



Research question: This study seeks to understand the principles behind social conflict by collecting and analyzing behavioral and whole-brain neural data from mice advancing through stages of social conflict.
Motivation: Social conflict is a survival mechanism that yields both normal and pathological behaviors; to uncover its underlying principles, the authors ran experiments on mice.
Method: The animals' interactions are modeled as a normal-form game, using Bayesian inference to handle the partial observability of the animals' strengths. The behavioral and neural data are consistent with a first-level Theory of Mind (1-ToM) model in which mice form "primary" beliefs about the strengths of all mice involved and "secondary" estimates of their opponents' beliefs.
Results: The model identifies the brain regions that carry information about these beliefs and provides a framework for studying social behavior in partially observable settings.

Social conflict is a survival mechanism yielding both normal and pathological behaviors. To understand its underlying principles, we collected behavioral and whole-brain neural data from mice advancing through stages of social conflict. We modeled the animals’ interactions as a normal-form game using Bayesian inference to account for the partial observability of animals’ strengths. We find that our behavioral and neural data are consistent with the first-level Theory of Mind (1-ToM) model where mice form “primary” beliefs about the strengths of all mice involved and “secondary” beliefs that estimate the beliefs of their opponents. Our model identifies the brain regions that carry the information about these beliefs and offers a framework for studies of social behaviors in partially observable settings.

GlucoSynth: Generating Differentially-Private Synthetic Glucose Traces
Josephine Lamp Mark Derdzinski Christopher Hannemann Joost Van der Linden Lu Feng Tianhao Wang David Evans



Research question: This paper tackles the generation of high-quality, private synthetic glucose traces, a task that generalizes to many other time-series sources.
Motivation: Existing time-series synthesis methods, such as those based on Generative Adversarial Networks (GANs), cannot capture the innate characteristics of glucose data and cannot provide formal privacy guarantees without severely degrading the utility of the synthetic data.
Method: GlucoSynth, a novel privacy-preserving GAN framework for generating synthetic glucose traces; the core idea is to preserve the relationships among motifs (glucose events) within traces, in addition to temporal dynamics, with differential privacy mechanisms providing strong formal guarantees.
Results: A comprehensive evaluation on 1.2 million glucose traces shows GlucoSynth outperforming all previous methods in generating high-quality synthetic traces with strong privacy guarantees.

We focus on the problem of generating high-quality, private synthetic glucose traces, a task generalizable to many other time series sources. Existing methods for time series data synthesis, such as those using Generative Adversarial Networks (GANs), are not able to capture the innate characteristics of glucose data and cannot provide any formal privacy guarantees without severely degrading the utility of the synthetic data. In this paper we present GlucoSynth, a novel privacy-preserving GAN framework to generate synthetic glucose traces. The core intuition behind our approach is to conserve relationships amongst motifs (glucose events) within the traces, in addition to temporal dynamics. Our framework incorporates differential privacy mechanisms to provide strong formal privacy guarantees. We provide a comprehensive evaluation on the real-world utility of the data using 1.2 million glucose traces; GlucoSynth outperforms all previous methods in its ability to generate high-quality synthetic glucose traces with strong privacy guarantees.

Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings
Klim Kireev Maksym Andriushchenko Carmela Troncoso Nicolas Flammarion



Research question: Adversarial robustness research concentrates on image and text data, yet scenarios where a lack of robustness carries serious risk, such as fraud detection, medical diagnosis, and recommender systems, typically rely on tabular data.
Motivation: Adversarial robustness for tabular data poses two main challenges: tabular datasets often contain categorical features, which existing optimization procedures cannot handle directly, and the well-performing non-deep-network algorithms widely used in the tabular domain are not covered by robustness techniques tailored to neural networks (such as adversarial training).
Method: A method to train adversarially robust deep networks for tabular data and transfer that robustness to other classifiers via universal robust embeddings tailored to categorical data; the embeddings, built with a bilevel alternating-minimization framework, can be transferred to boosted trees or random forests, making them robust without adversarial training while preserving their high accuracy.
Results: Experiments show the method outperforms existing techniques within a practical threat model for tabular data.

Research on adversarial robustness is primarily focused on image and text data. Yet, many scenarios in which a lack of robustness can result in serious risks, such as fraud detection, medical diagnosis, or recommender systems, often do not rely on images or text but instead on tabular data. Adversarial robustness in tabular data poses two serious challenges. First, tabular datasets often contain categorical features, and therefore cannot be tackled directly with existing optimization procedures. Second, in the tabular domain, algorithms that are not based on deep networks are widely used and offer great performance, but algorithms to enhance robustness are tailored to neural networks (e.g. adversarial training). In this paper, we tackle both challenges. We present a method that allows us to train adversarially robust deep networks for tabular data and to transfer this robustness to other classifiers via universal robust embeddings tailored to categorical data. These embeddings, created using a bilevel alternating minimization framework, can be transferred to boosted trees or random forests, making them robust without the need for adversarial training while preserving their high accuracy on tabular data. We show that our methods outperform existing techniques within a practical threat model suitable for tabular data.

BERT Lost Patience Won't Be Robust to Adversarial Slowdown
Zachary Coalson Gabriel Ritter Rakesh B Bobba Sanghyun Hong



Research question: This paper evaluates the robustness of multi-exit language models against adversarial slowdown.
Motivation: To audit their robustness, a slowdown attack is designed that generates natural adversarial text bypassing early-exit points.
Method: Using this attack as a vehicle, three multi-exit mechanisms are comprehensively evaluated on the GLUE benchmark under adversarial slowdown.
Results: The attack significantly reduces the computational savings of all three methods in both white-box and black-box settings, and more complex mechanisms are more vulnerable. Adversarial training fails to defeat the attack, but input sanitization with a conversational model such as ChatGPT removes the perturbations effectively, suggesting future work on efficient yet robust multi-exit models.

In this paper, we systematically evaluate the robustness of multi-exit language models against adversarial slowdown. To audit their robustness, we design a slowdown attack that generates natural adversarial text bypassing early-exit points. We use the resulting WAFFLE attack as a vehicle to conduct a comprehensive evaluation of three multi-exit mechanisms with the GLUE benchmark against adversarial slowdown. We then show our attack significantly reduces the computational savings provided by the three methods in both white-box and black-box settings. The more complex a mechanism is, the more vulnerable it is to adversarial slowdown. We also perform a linguistic analysis of the perturbed text inputs, identifying common perturbation patterns that our attack generates, and comparing them with standard adversarial text attacks. Moreover, we show that adversarial training is ineffective in defeating our slowdown attack, but input sanitization with a conversational model, e.g., ChatGPT, can remove perturbations effectively. This result suggests that future work is needed for developing efficient yet robust multi-exit models. Our code is available at: https://github.com/ztcoalson/WAFFLE

On Measuring Fairness in Generative Models
Christopher T.H Teo Milad Abdollahzadeh Ngai-man Cheung



Research question: The existing fairness measurement framework suffers from considerable measurement error, undermining the evaluation of fair generative models.
Motivation: To address this, a new fairness measurement framework, CLEAM, is proposed to reduce the error introduced by sensitive-attribute classifiers.
Method: A statistical model accounts for the inaccuracies of the sensitive-attribute classifier, thereby lowering measurement error.
Results: Experiments show that CLEAM significantly reduces measurement error and reveals considerable biases in important text-to-image generators and GANs, raising concerns about their applications.

Recently, there has been increased interest in fair generative models. In this work, we conduct, for the first time, an in-depth study on fairness measurement, a critical component in gauging progress on fair generative models. We make three contributions. First, we conduct a study that reveals that the existing fairness measurement framework has considerable measurement errors, even when highly accurate sensitive attribute (SA) classifiers are used. These findings cast doubts on previously reported fairness improvements. Second, to address this issue, we propose CLassifier Error-Aware Measurement (CLEAM), a new framework which uses a statistical model to account for inaccuracies in SA classifiers. Our proposed CLEAM reduces measurement errors significantly, e.g., 4.98%→0.62% for StyleGAN2 w.r.t. Gender. Additionally, CLEAM achieves this with minimal additional overhead. Third, we utilize CLEAM to measure fairness in important text-to-image generators and GANs, revealing considerable biases in these models that raise concerns about their applications. Code and more resources: https://sutd-visual-computing-group.github.io/CLEAM/.
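
The statistical-model idea admits a one-function illustration via the classical Rogan-Gladen correction for an imperfect binary classifier; CLEAM's actual estimator is more elaborate, so treat this as a minimal stand-in for how knowing a classifier's error rates lets one undo measurement bias.

```python
import numpy as np

def corrected_proportion(p_hat, tpr, tnr):
    """Rogan-Gladen correction of a measured attribute proportion.

    p_hat: fraction of generated samples the SA classifier labels class 1
    tpr:   P(pred = 1 | true = 1) of the sensitive-attribute classifier
    tnr:   P(pred = 0 | true = 0)
    Solves p_hat = tpr * p + (1 - tnr) * (1 - p) for the true proportion p.
    """
    p = (p_hat - (1.0 - tnr)) / (tpr + tnr - 1.0)
    return float(np.clip(p, 0.0, 1.0))

# A 95%/97% accurate classifier reports 42% for class 1; even these small
# errors shift the naive estimate away from the corrected one:
print(corrected_proportion(0.42, tpr=0.95, tnr=0.97))  # ~0.424
```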

The Memory-Perturbation Equation: Understanding Model's Sensitivity to Data
Peter Nickl Lu Xu Dharmesh Tailor Thomas Möllenhoff Mohammad Emtiyaz Khan



Research question: Understanding a model's sensitivity to its training data is crucial but can be challenging and costly, especially during training.
Motivation: To simplify such analyses, the Memory-Perturbation Equation (MPE) is presented, relating a model's sensitivity to perturbations of its training data.
Method: Derived from Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and reveals useful properties of sensitivity.
Results: Empirical results show that sensitivity estimates obtained during training can faithfully predict generalization on unseen test data; the equation is expected to aid future research on robust and adaptive learning.

Understanding a model's sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such analyses, we present the Memory-Perturbation Equation (MPE), which relates a model's sensitivity to perturbations in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.

Enhancing Sharpness-Aware Optimization Through Variance Suppression
Bingcong Li Georgios B. Giannakis



Research question: How to improve the generalization of deep neural networks by minimizing the maximum loss in a neighborhood of the parameters.
Motivation: Sharpness-aware minimization (SAM) has well-documented merits for generalization even without sizable data augmentation, but an "over-friendly" adversary (the inner perturbation) can curtail the outmost level of generalization.
Method: A new approach stabilizes the adversary through variance suppression (VaSSO) to avoid such friendliness; VaSSO's provable stability safeguards its numerical improvement over SAM in model-agnostic tasks, including image classification and machine translation.
Results: Experiments confirm that VaSSO endows SAM with robustness against high levels of label noise.

Sharpness-aware minimization (SAM) has well documented merits in enhancing generalization of deep neural networks, even without sizable data augmentation. Embracing the geometry of the loss function, where neighborhoods of 'flat minima' heighten generalization ability, SAM seeks 'flat valleys' by minimizing the maximum loss caused by an *adversary* perturbing parameters within the neighborhood. Although critical to account for sharpness of the loss function, such an '*over-friendly* adversary' can curtail the outmost level of generalization. The novel approach of this contribution fosters stabilization of adversaries through *variance suppression* (VaSSO) to avoid such friendliness. VaSSO's *provable* stability safeguards its numerical improvement over SAM in model-agnostic tasks, including image classification and machine translation. In addition, experiments confirm that VaSSO endows SAM with robustness against high levels of label noise.
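
As a rough sketch of variance suppression, the step below replaces SAM's raw-gradient perturbation direction with an exponential moving average of past gradients. The EMA form, the coefficient theta, and the radius rho are our illustrative assumptions, not necessarily the authors' exact update; initialize `d_state = [torch.zeros_like(p) for p in model.parameters()]` once and pass the same list every step.

```python
import torch

def vasso_style_step(model, loss_fn, data, target, opt, d_state,
                     rho=0.05, theta=0.9):
    """One SAM-like step whose adversary direction is an EMA of gradients
    (a variance-suppression sketch), followed by the usual descent step
    taken with the gradient from the perturbed point."""
    loss = loss_fn(model(data), target)
    loss.backward()                                   # gradient at current point
    eps = []
    with torch.no_grad():
        params = list(model.parameters())
        for p, d in zip(params, d_state):             # update EMA direction
            if p.grad is not None:
                d.mul_(theta).add_(p.grad, alpha=1.0 - theta)
        norm = torch.sqrt(sum((d ** 2).sum() for d in d_state)) + 1e-12
        for p, d in zip(params, d_state):             # perturb along EMA
            e = rho * d / norm
            p.add_(e)
            eps.append(e)
    opt.zero_grad()
    loss_fn(model(data), target).backward()           # gradient at perturbed point
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):     # undo perturbation
            p.sub_(e)
    opt.step()                                        # descend with that gradient
    opt.zero_grad()
    return loss.item()
```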

Rehearsal Learning for Avoiding Undesired Future
Tian Qin Tian-Zuo Wang Zhi-Hua Zhou



Research question: How can decisions be derived from machine learning models so as to avoid an undesired future?
Motivation: Current ML models mostly make predictions, but in many situations we instead want decisions that avoid undesired outcomes.
Method: A rehearsal learning framework that characterizes the generative process of variables with influence relations and structural equations, and finds actionable decisions that can alter the outcome by reasoning under a Bayesian framework.
Results: Experiments validate the effectiveness of the rehearsal learning framework and the informativeness of its risk bound.

Machine learning (ML) models have been widely used to make predictions. Instead of a predictive statement about future outcomes, in many situations we want to pursue a decision: what can we do to avoid the undesired future if an ML model predicts so? In this paper, we present a rehearsal learning framework, in which decisions that can persuasively avoid the happening of undesired outcomes can be found and recommended. Based on the influence relation, we characterize the generative process of variables with structural rehearsal models, consisting of a probabilistic graphical model called rehearsal graphs and structural equations, and find actionable decisions that can alter the outcome by reasoning under a Bayesian framework. Moreover, we present a probably approximately correct bound to quantify the associated risk of a decision. Experiments validate the effectiveness of the proposed rehearsal learning framework and the informativeness of the bound.

Have it your way: Individualized Privacy Assignment for DP-SGD
Franziska Boenisch Christopher Mühl Adam Dziedzic Roy Rinberg Nicolas Papernot



Research question: Existing approaches to training ML models with differential privacy give all users a uniform privacy budget, which may not match individual users' privacy expectations.
Motivation: Since users may have different privacy expectations, a uniform budget across all points may be overly conservative for some users or insufficiently protective for others.
Method: These preferences are captured through individualized privacy budgets, supported by a variant of Differentially Private Stochastic Gradient Descent called Individualized DP-SGD (IDP-SGD), which modifies the data sampling and gradient noising mechanisms.
Results: Because IDP-SGD provides privacy guarantees tailored to individual users and their data points, it empirically improves privacy-utility trade-offs.

When training a machine learning model with differential privacy, one sets a privacy budget. This uniform budget represents an overall maximal privacy violation that any user is willing to face by contributing their data to the training set. We argue that this approach is limited because different users may have different privacy expectations. Thus, setting a uniform privacy budget across all points may be overly conservative for some users or, conversely, not sufficiently protective for others. In this paper, we capture these preferences through individualized privacy budgets. To demonstrate their practicality, we introduce a variant of Differentially Private Stochastic Gradient Descent (DP-SGD) which supports such individualized budgets. DP-SGD is the canonical approach to training models with differential privacy. We modify its data sampling and gradient noising mechanisms to arrive at our approach, which we call Individualized DP-SGD (IDP-SGD). Because IDP-SGD provides privacy guarantees tailored to the preferences of individual users and their data points, we empirically find it to improve privacy-utility trade-offs.
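
A schematic of the individualized-sampling half of the idea (the paper also individualizes noising; the rates, clipping, and noise values below are illustrative, and the real IDP-SGD calibrates them so each point exactly meets its own budget):

```python
import numpy as np

def idp_sgd_round(per_example_grads, budgets, base_rate=0.01,
                  clip=1.0, sigma=1.0, rng=None):
    """One schematic IDP-SGD round: points with larger privacy budgets
    are sampled more often, so they contribute more signal for the same
    shared noise draw."""
    rng = rng or np.random.default_rng()
    rates = base_rate * budgets / budgets.min()      # higher budget -> higher rate
    mask = rng.random(len(budgets)) < rates          # individualized Poisson sampling
    batch = per_example_grads[mask]
    norms = np.linalg.norm(batch, axis=1, keepdims=True)
    clipped = batch * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(    # standard DP-SGD noising
        scale=sigma * clip, size=per_example_grads.shape[1])
    return noisy_sum / max(rates.sum(), 1.0)         # normalize by expected batch size

grads = np.random.randn(1000, 10)                    # stand-in per-example gradients
budgets = np.random.choice([1.0, 2.0, 3.0], 1000)    # heterogeneous epsilons
update = idp_sgd_round(grads, budgets)
```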

Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
Jan Dubiński Stanisław Pawlak Franziska Boenisch Tomasz Trzcinski Adam Dziedzic



Research question: How can encoders served over Machine Learning as a Service (MLaaS) APIs be defended against model stealing attacks while the attack is happening?
Motivation: Encoders are very costly to train, which makes them lucrative targets for adversaries who use query access to replicate them locally at a fraction of the original cost, and a defense must not degrade representation quality for legitimate users.
Method: Bucks for Buckets (B4B), the first active defense, exploits the observation that adversaries stealing an encoder's functionality cover a much larger fraction of the embedding space than legitimate users solving a single downstream task; B4B adaptively adjusts the utility of returned representations according to a user's coverage of the embedding space and individually transforms each user's representations to block sybil attacks.
Results: B4B prevents stealing as it happens without degrading representations for legitimate users, opening a path towards securely sharing and democratizing encoders over public APIs.

Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose *Bucks for Buckets (B4B)*, the first *active defense* that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder's functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task. B4B leverages this to adaptively adjust the utility of the returned representations according to a user's coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user's representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
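
A minimal sketch of the coverage-tracking core, assuming random-hyperplane buckets and a noise schedule that grows with the fraction of buckets a user has touched; the bucket count, the noise scale, and the omitted per-user sybil transform are simplifications of the paper's design.

```python
import numpy as np

class CoverageDefense:
    """Track each user's embedding-space coverage and degrade outputs
    accordingly: legitimate users concentrated on one downstream task
    occupy few buckets, while stealing adversaries spread widely."""

    def __init__(self, dim, n_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_planes, dim))  # LSH hyperplanes
        self.buckets = {}                               # user -> occupied bucket ids
        self.n_buckets = 2 ** n_planes

    def _bucket(self, z):
        bits = (self.planes @ z > 0).astype(int)
        return int("".join(map(str, bits)), 2)

    def serve(self, user, z, rng=None):
        rng = rng or np.random.default_rng()
        occupied = self.buckets.setdefault(user, set())
        occupied.add(self._bucket(z))
        coverage = len(occupied) / self.n_buckets
        return z + rng.normal(scale=10.0 * coverage, size=z.shape)

defense = CoverageDefense(dim=8)
print(defense.serve("alice", np.random.randn(8))[:3])
```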

Beyond Confidence: Reliable Models Should Also Consider Atypicality
Mert Yuksekgonul Linjun Zhang James Zou Carlos Guestrin



Research question: This work investigates the relationship between the reliability of a model's predictions and how atypical (rare) a sample or class is.
Motivation: Although most machine learning models provide confidence in their predictions, confidence alone is insufficient to understand reliability; for example, a model may produce a low-confidence prediction when an input is poorly represented in the training data or inherently ambiguous.
Method: Atypicality is first shown to be strongly related to miscalibration and accuracy; empirically, predictions on atypical inputs or atypical classes are more overconfident and less accurate. Incorporating atypicality is then shown to improve uncertainty quantification and performance for neural networks and large language models.
Results: In a case study, atypicality improves a skin-lesion classifier across skin-tone groups without access to the group attributes. Overall, models should use not only confidence but also atypicality, and simple post-hoc atypicality estimators can provide significant value.

While most machine learning models can provide confidence in their predictions, confidence is insufficient to understand a prediction's reliability. For instance, the model may have a low confidence prediction if the input is not well-represented in the training dataset or if the input is inherently ambiguous. In this work, we investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions. We first demonstrate that atypicality is strongly related to miscalibration and accuracy. In particular, we empirically show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy. Using these insights, we show incorporating atypicality improves uncertainty quantification and model performance for discriminative neural networks and large language models. In a case study, we show that using atypicality improves the performance of a skin lesion classifier across different skin tone groups without having access to the group attributes. Overall, we propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance. Our results demonstrate that simple post-hoc atypicality estimators can provide significant value.
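
One plausible instance of a post-hoc atypicality estimator is the distance to the k-th nearest training example in feature space; the estimator choice and the temperature schedule below are our illustrative assumptions, not the paper's prescribed recipe.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def atypicality_scores(train_feats, test_feats, k=10):
    """Atypicality as k-th-nearest-neighbor distance in feature space:
    larger scores mean rarer inputs, whose predictions tend to be the
    most overconfident."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    dists, _ = nn.kneighbors(test_feats)
    return dists[:, -1]

def atypicality_aware_temperature(scores, t_min=1.0, t_max=3.0):
    """Recalibrate more aggressively (higher softmax temperature) for
    more atypical inputs."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    return t_min + s * (t_max - t_min)

train = np.random.randn(500, 16)
test = np.random.randn(5, 16)
print(atypicality_aware_temperature(atypicality_scores(train, test)))
```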

Temporal Robustness against Data Poisoning
Wenxiao Wang Soheil Feizi



Research question: Existing data poisoning threat models focus on the number of poisoned samples; if attackers can poison more samples than expected at affordable cost, they may quickly render existing defenses ineffective.
Motivation: To address this, the authors leverage timestamps denoting the birth dates of data, which are often available but have been neglected in the past.
Method: A temporal threat model is proposed with two new metrics, earliness and duration, which respectively measure how far in advance an attack started and how long it lasted.
Results: With this model, temporal robustness against data poisoning is defined, offering meaningful protection even with unbounded numbers of poisoned samples as long as attacks are temporally bounded; a baseline defense, temporal aggregation, is developed and verified, offering provable temporal robustness and highlighting the potential of the temporal threat model.

Data poisoning considers cases when an adversary manipulates the behavior of machine learning algorithms through malicious training data. Existing threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, if attackers can poison more samples than expected with affordable overhead, as in many practical scenarios, they may be able to render existing defenses ineffective in a short time. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning with two novel metrics, earliness and duration, which respectively measure how far in advance an attack started and how long an attack lasted. Using these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples when the attacks are temporally bounded. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal threat model for data poisoning.
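
The baseline defense lends itself to a short sketch: partition training data by birth-date timestamp, train one base model per window, and predict by majority vote, so an attack of bounded duration can corrupt only the windows it overlaps. The window count and base learner below are arbitrary choices.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def temporal_aggregation(X, y, timestamps, n_windows=10,
                         base=LogisticRegression(max_iter=1000)):
    """Train one base model per temporal window of the data."""
    edges = np.quantile(timestamps, np.linspace(0, 1, n_windows + 1))
    models = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (timestamps >= lo) & (timestamps <= hi)
        models.append(clone(base).fit(X[m], y[m]))
    return models

def predict_majority(models, X):
    """Majority vote over the per-window models (integer labels assumed)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X = np.random.randn(600, 5)
y = (X[:, 0] > 0).astype(int)
t = np.sort(np.random.rand(600))              # birth-date timestamps
models = temporal_aggregation(X, y, t, n_windows=6)
print(predict_majority(models, X[:5]))
```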

Enhancing User Intent Capture in Session-Based Recommendation with Attribute Patterns
Xin Liu Zheng Li Yifan Gao Jingfeng Yang Tianyu Cao Zhengyang Wang Bing Yin Yangqiu Song



Research question: Session-based recommendation in E-commerce aims to predict the next item an anonymous user will purchase based on browsing and purchase history.
Motivation: Constructing global or local transition graphs to supplement session data can introduce noisy correlations and cause user intent to vanish.
Method: The Frequent Attribute Pattern Augmented Transformer (FAPAT) characterizes user intent by building attribute transition graphs and matching attribute patterns; frequent and compact attribute patterns serve as memory to augment session representations, followed by a gate and a transformer block that fuse the whole session information.
Results: Extensive experiments on two public benchmarks and 100 million industrial data points in three domains show FAPAT outperforming state-of-the-art methods by an average of 4.5% across evaluation metrics (Hits, NDCG, MRR); the model's ability to capture user intent is further assessed via item attribute prediction and period-item recommendation.

The goal of session-based recommendation in E-commerce is to predict the next item that an anonymous user will purchase based on the browsing and purchase history. However, constructing global or local transition graphs to supplement session data can lead to noisy correlations and user intent vanishing. In this work, we propose the Frequent Attribute Pattern Augmented Transformer (FAPAT) that characterizes user intents by building attribute transition graphs and matching attribute patterns. Specifically, the frequent and compact attribute patterns serve as memory to augment session representations, followed by a gate and a transformer block to fuse the whole session information. Through extensive experiments on two public benchmarks and 100 million industrial data in three domains, we demonstrate that FAPAT consistently outperforms state-of-the-art methods by an average of 4.5% across various evaluation metrics (Hits, NDCG, MRR). Besides evaluating next-item prediction, we estimate the models' capabilities to capture user intents via predicting items' attributes and period-item recommendations.

topic-6

Topic words :  learning,  policy,  reinforcement,  rl,  agent,  reward,  offline,  state

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov Archit Sharma Eric Mitchell Christopher D Manning Stefano Ermon Chelsea Finn



Research question: How to make the behavior of large-scale unsupervised language models (LMs) precisely controllable.
Motivation: Existing methods collect human labels of the relative quality of model generations and fine-tune with reinforcement learning, a complex and often unstable procedure.
Method: Direct Preference Optimization (DPO) uses a mapping between reward functions and optimal policies to turn the constrained reward-maximization problem into a classification problem that can be optimized exactly.
Results: Experiments show DPO aligns LMs with human preferences as well as or better than existing methods while being simpler to implement and train, requiring no reward-model fitting, no sampling from the LM during fine-tuning, and little hyperparameter tuning.

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
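
The resulting objective is compact enough to state in a few lines: given summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, the loss is a logistic classification loss on the implicit reward margin. This sketch omits batching over sequences and the log-prob computation itself.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x));
    DPO maximizes the log-sigmoid of the reward margin between the chosen
    and rejected responses, i.e. binary classification on preference pairs.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy usage with precomputed sequence log-probs for a batch of 4 pairs
pc, pr, rc, rr = (torch.randn(4) for _ in range(4))
print(dpo_loss(pc, pr, rc, rr))
```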

DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models
Tsun-Hsuan Wang Juntian Zheng Pingchuan Ma Yilun Du Byungchul Kim Andrew Everett Spielberg Joshua B. Tenenbaum Chuang Gan Daniela Rus



Research question: How to co-optimize artificial creatures' morphology and control for physical soft robotics and virtual character creation.
Motivation: Nature evolves creatures with highly complex morphological and behavioral intelligence, while computational methods lag in approaching that diversity and efficacy.
Method: DiffuseBot, a physics-augmented diffusion model, generates soft robot morphologies capable of excelling in a wide spectrum of tasks; it bridges virtually generated content and physical utility by (i) augmenting the diffusion process with a physical dynamical simulation that provides a certificate of performance, and (ii) introducing a co-design procedure that jointly optimizes physical design and control using physical-sensitivity information from differentiable simulation.
Results: A range of simulated and fabricated robots and their capabilities are showcased.

Nature evolves creatures with a high complexity of morphological and behavioral intelligence; meanwhile, computational methods lag in approaching that diversity and efficacy. Co-optimization of artificial creatures' morphology and control in silico shows promise for applications in physical soft robotics and virtual character creation; such approaches, however, require developing new learning algorithms that can reason about function atop pure structure. In this paper, we present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks. DiffuseBot bridges the gap between virtually generated content and physical utility by (i) augmenting the diffusion process with a physical dynamical simulation which provides a certificate of performance, and (ii) introducing a co-design procedure that jointly optimizes physical design and control by leveraging information about physical sensitivities from differentiable simulation. We showcase a range of simulated and fabricated robots along with their capabilities. Check our website: https://diffusebot.github.io/

When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning
Siliang Zeng Chenliang Li Alfredo Garcia Mingyi Hong



Research question: This paper addresses offline inverse reinforcement learning (offline IRL): recovering the structure of rewards and environment dynamics underlying a fixed, finite set of demonstrations from an expert agent.
Motivation: Accurate models of expertise are valuable in safety-sensitive applications such as clinical decision making and autonomous driving; however, an expert's preferences are closely linked to the expert's model of the environment dynamics (the "world"), so an inaccurate world model learned from finite data with limited coverage can compound errors in the estimated reward.
Method: Reward estimation is formulated as a bi-level optimization in which the upper level maximizes likelihood based on a conservative model of the expert's policy (the lower level); the policy model is conservative in that it maximizes reward subject to a penalty that grows with the uncertainty of the estimated world model. A new algorithmic framework solves this problem with statistical and computational guarantees for the optimal reward estimator.
Results: The proposed algorithm outperforms state-of-the-art offline IRL and imitation learning benchmarks by a large margin on MuJoCo continuous control tasks and datasets in the D4RL benchmark.

Offline inverse reinforcement learning (Offline IRL) aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving. However, the structure of an expert's preferences implicit in observed actions is closely linked to the expert's model of the environment dynamics (i.e. the ``world''). Thus, inaccurate models of the world obtained from finite data with limited coverage could compound inaccuracy in estimated rewards. To address this issue, we propose a bi-level optimization formulation of the estimation task wherein the upper level is likelihood maximization based upon a conservative model of the expert's policy (lower level). The policy model is conservative in that it maximizes reward subject to a penalty that is increasing in the uncertainty of the estimated model of the world. We propose a new algorithmic framework to solve the bi-level optimization problem formulation and provide statistical and computational guarantees of performance for the associated optimal reward estimator. Finally, we demonstrate that the proposed algorithm outperforms the state-of-the-art offline IRL and imitation learning benchmarks by a large margin, over the continuous control tasks in MuJoCo and different datasets in the D4RL benchmark.

Bridging RL Theory and Practice with the Effective Horizon
Cassidy Laidlaw Stuart Russell Anca Dragan



Research question: Deep reinforcement learning works impressively in some environments and fails catastrophically in others; ideally, RL theory should explain why, i.e., provide bounds predictive of practical performance.
Motivation: Current RL theory does not explain deep RL's successes and failures well, so a new dataset, BRIDGE, is introduced to compare standard deep RL algorithms against prior sample complexity bounds.
Method: BRIDGE consists of 155 MDPs from common deep RL benchmarks with corresponding tabular representations, enabling exact computation of instance-dependent bounds. Deep RL tends to succeed when actions with the highest Q-values under the random policy are also those with the highest Q-values under the optimal policy, and tends to fail otherwise.
Results: This property is generalized into a new MDP complexity measure, the effective horizon; using BRIDGE, effective-horizon-based bounds reflect the empirical performance of PPO and DQN better than prior sample complexity bounds, and, unlike existing bounds, the effective horizon can predict the effects of reward shaping or a pre-trained exploration policy.

Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the *random* policy also have the highest Q-values under the *optimal* policy—i.e., when it is optimal to act greedily with respect to the random policy's Q function—deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the *effective horizon*, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
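
The success-predicting property is easy to probe on a tabular MDP: compute Q under the uniformly random policy and the optimal Q*, then check whether their greedy actions agree. This tests the one-step version of the property; the full effective horizon allows k steps of lookahead with random rollouts at the leaves.

```python
import numpy as np

def greedy_agreement(P, R, gamma=0.99, iters=1000):
    """P: [S, A, S] transition tensor, R: [S, A] rewards. Returns Q under
    the uniformly random policy, optimal Q*, and the fraction of states
    where acting greedily w.r.t. Q^random matches greedy w.r.t. Q*."""
    S, A, _ = P.shape
    Qr, Qs = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(iters):
        Qr = R + gamma * P @ Qr.mean(axis=1)   # random-policy evaluation
        Qs = R + gamma * P @ Qs.max(axis=1)    # value iteration
    agree = np.mean(Qr.argmax(axis=1) == Qs.argmax(axis=1))
    return Qr, Qs, agree

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))     # toy 3-state, 2-action MDP
R = rng.random((3, 2))
_, _, agreement = greedy_agreement(P, R)
print(f"greedy actions agree in {agreement:.0%} of states")
```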

When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
Tianwei Ni Michel Ma Benjamin Eysenbach Pierre-Luc Bacon



Research question: RL algorithms face two distinct challenges, learning representations of past observations and determining how actions influence future returns; both involve modeling long-term dependencies.
Motivation: The Transformer architecture has been very successful on problems with long-term dependencies, including in RL, but the reason for the strong performance of Transformer-based RL remains unclear: do these methods learn effective memory, or perform effective credit assignment?
Method: Formal definitions of memory length and credit assignment length are introduced, and simple configurable tasks are designed to measure these distinct quantities.
Results: Empirically, Transformers enhance the memory capability of RL algorithms, scaling to tasks that require memorizing observations from 1500 steps earlier, but they do not improve long-term credit assignment. These findings explain Transformers' success in RL while highlighting an important area for future research and benchmark design.

Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful to solve problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations $1500$ steps ago. However, Transformers do not improve long-term credit assignment. In summary, our results provide an explanation for the success of Transformers in RL, while also highlighting an important area for future research and benchmark design. Our code is open-sourced at https://github.com/twni2016/Memory-RL.

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Peter Shaw Mandar Joshi James Cohan Jonathan Berant Panupong Pasupat Hexiang Hu Urvashi Khandelwal Kenton Lee Kristina Toutanova



Research question: How to train digital agents that surpass human crowdworkers on graphical user interface (GUI) tasks using pixel-based screenshots and a generic keyboard-and-mouse action space.
Motivation: Existing digital agents mostly rely on text representations extracted from HTML or other structured sources, which are not always available, and these inputs are typically coupled with task-specific action spaces.
Method: Building on pixel-based pretraining, agents are created that interact with the digital world through the same conceptual interface humans commonly use: pixel-based screenshots and a generic action space of keyboard and mouse actions.
Results: For the first time, such agents are shown to outperform human crowdworkers on GUI-based instruction-following tasks.

Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have been often coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use — via pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.

Relax, it doesn’t matter how you get there: A new self-supervised approach for multi-timescale behavior analysis
Mehdi Azabou Michael Jacob Mendelson Nauman Ahad Maks Sorokin Shantanu Thakoor Carolina Urzay Eva L Dyer



Research question: How to model animal behavior in complex, unpredictable, naturalistic settings.
Motivation: Existing models that work under constrained or simplified task-based conditions perform poorly on free and naturalistic behavior.
Method: A multi-task representation learning model is developed that combines an action-prediction objective with a multi-scale architecture capturing both local and global dynamics.
Results: On robots and in the MABe 2022 Multi-Agent Behavior challenge, the model performs strongly throughout, ranking first overall, and its representations solve a wide range of downstream tasks.

Unconstrained and natural behavior consists of dynamics that are complex and unpredictable, especially when trying to predict what will happen multiple steps into the future. While some success has been found in building representations of animal behavior under constrained or simplified task-based conditions, many of these models cannot be applied to free and naturalistic settings where behavior becomes increasingly hard to model. In this work, we develop a multi-task representation learning model for animal behavior that combines two novel components: (i) an action-prediction objective that aims to predict the distribution of actions over future timesteps, and (ii) a multi-scale architecture that builds separate latent spaces to accommodate short- and long-term dynamics. After demonstrating the ability of the method to build representations of both local and global dynamics in robots in varying environments and terrains, we apply our method to the MABe 2022 Multi-Agent Behavior challenge, where our model ranks first overall on both mice and fly benchmarks. In all of these cases, we show that our model can build representations that capture the many different factors that drive behavior and solve a wide range of downstream tasks.

Double Gumbel Q-Learning
David Yu-Tung Hui Aaron Courville Pierre-Luc Bacon



Research question: This paper addresses the two heteroscedastic Gumbel noise sources that deep neural networks introduce into Q-Learning.
Motivation: To account for these noise sources, Double Gumbel Q-Learning (DoubleGum) is proposed, a deep Q-Learning algorithm applicable to both discrete and continuous control.
Method: In discrete control, a closed-form expression for the algorithm's loss function is derived; in continuous control this loss is intractable, so an approximation is derived with a hyperparameter that regulates the degree of pessimism in Q-Learning.
Results: DoubleGum outperforms DDPG, TD3, SAC, XQL, quantile regression, and Mixture-of-Gaussian Critics in aggregate over 33 tasks from DeepMind Control, MuJoCo, MetaWorld, and Box2D; tuning the pessimism hyperparameter may further improve sample efficiency.

We show that Deep Neural Networks introduce two heteroscedastic Gumbel noise sources into Q-Learning. To account for these noise sources, we propose Double Gumbel Q-Learning, a Deep Q-Learning algorithm applicable for both discrete and continuous control. In discrete control, we derive a closed-form expression for the loss function of our algorithm. In continuous control, this loss function is intractable and we therefore derive an approximation with a hyperparameter whose value regulates pessimism in Q-Learning. We present a default value for our pessimism hyperparameter that enables DoubleGum to outperform DDPG, TD3, SAC, XQL, quantile regression, and Mixture-of-Gaussian Critics in aggregate over 33 tasks from DeepMind Control, MuJoCo, MetaWorld, and Box2D and show that tuning this hyperparameter may further improve sample efficiency.

Sample Efficient Reinforcement Learning in Mixed Systems through Augmented Samples and Its Applications to Queueing Networks
Honghao Wei Xin Liu Weina Wang Lei Ying



Research question: This paper considers a class of reinforcement learning problems involving two types of states: stochastic and pseudo-stochastic.
Motivation: In such systems, stochastic states follow a stochastic transition kernel while pseudo-stochastic states transition deterministically given the stochastic states/transitions; these mixed systems are widely used in applications including manufacturing systems, communication networks, and queueing networks.
Method: A sample-efficient RL method accelerates learning by generating augmented data samples; the approach is data-driven (model-free) but learns the policy from both real and augmented samples, significantly reducing sample complexity so that the dataset only needs sufficient coverage of the stochastic states.
Results: Under Fitted Q Iteration (FQI), the optimality gap decreases as $O\left(\sqrt{\frac{1}{n}}+\sqrt{\frac{1}{m}}\right)$, where $n$ is the number of real samples and $m$ the number of augmented samples per real sample; without augmentation the gap is $O(1)$ due to insufficient coverage of the pseudo-stochastic states. Experiments on several queueing network applications confirm that the method significantly accelerates deep Q-learning and deep policy gradient methods.

This paper considers a class of reinforcement learning problems, which involve systems with two types of states: stochastic and pseudo-stochastic. In such systems, stochastic states follow a stochastic transition kernel while the transitions of pseudo-stochastic states are deterministic {\em given} the stochastic states/transitions. We refer to such systems as mixed systems, which are widely used in various applications, including manufacturing systems, communication networks, and queueing networks. We propose a sample-efficient RL method that accelerates learning by generating augmented data samples. The proposed algorithm is data-driven (model-free), but it learns the policy from both real and augmented data samples. This method significantly improves learning by reducing the sample complexity such that the dataset only needs to have sufficient coverage of the stochastic states. We analyze the sample complexity of the proposed method under Fitted Q Iteration (FQI) and demonstrate that the optimality gap decreases as $O\left(\sqrt{\frac{1}{n}}+\sqrt{\frac{1}{m}}\right),$ where $n$ represents the number of real samples, and $m$ is the number of augmented samples per real sample. It is important to note that without augmented samples, the optimality gap is $O(1)$ due to the insufficient data coverage of the pseudo-stochastic states. Our experimental results on multiple queueing network applications confirm that the proposed method indeed significantly accelerates both deep Q-learning and deep policy gradient.
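
In a queueing flavor the augmentation is simple: each real transition records a stochastic event (the arrivals), and the pseudo-stochastic queue length evolves deterministically given that event, so the same event can be replayed from m other queue lengths at no extra sampling cost. The update rule and reward below are a toy example, not the paper's benchmark systems.

```python
import numpy as np

def augment_transitions(real_transitions, pseudo_grid, m, reward_fn, rng=None):
    """For each real (queue, action, arrivals) sample, replay the observed
    stochastic arrivals from m alternative queue lengths, exploiting the
    deterministic update q' = max(q + arrivals - service, 0)."""
    rng = rng or np.random.default_rng()
    augmented = []
    for (q, action, arrivals) in real_transitions:
        for q_alt in rng.choice(pseudo_grid, size=m, replace=False):
            q_next = max(int(q_alt) + arrivals - action, 0)
            augmented.append((int(q_alt), action, reward_fn(q_alt, action), q_next))
    return augmented

reward = lambda q, a: -q - 0.5 * a              # holding cost plus service cost
real = [(5, 1, 2), (3, 2, 0)]                   # (queue, service action, arrivals)
aug = augment_transitions(real, np.arange(20), m=4, reward_fn=reward)
print(len(aug), aug[0])                         # 8 augmented transitions
```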

Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
Masatoshi Uehara Haruka Kiyohara Andrew Bennett Victor Chernozhukov Nan Jiang Nathan Kallus Chengchun Shi Wen Sun



Research question: This paper studies off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs).
Motivation: Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs.
Method: A novel model-free OPE method is developed by introducing future-dependent value functions that take future proxies as inputs, along with a new off-policy Bellman equation and a minimax learning procedure.
Results: A PAC result shows the OPE estimator is close to the true policy value as long as futures and histories contain sufficient information about the latent states and Bellman completeness holds.

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles as classical value functions in fully-observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain a PAC result, which implies our OPE estimator is close to the true policy value as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope

Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis
Alexander Meulemans Simon Schug Seijin Kobayashi Nathaniel Daw Greg Wayne



Research question: How to make reinforcement learning more sample efficient through better credit assignment.
Motivation: Current credit assignment methods suffer from bias and high variance; more precise measurement of an action's influence on future rewards is needed.
Method: Building on Hindsight Credit Assignment (HCA), Counterfactual Contribution Analysis (COCOA) is introduced, a new family of model-based credit assignment algorithms that measure the contribution of actions to subsequent rewards by quantifying the counterfactual query: "Would the agent still have reached this reward if it had taken another action?"
Results: Experiments show lower bias and variance than HCA and common baselines, improving sample efficiency and opening a new path towards sample-efficient reinforcement learning.

To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action’s influence on future rewards. Building upon Hindsight Credit Assignment (HCA), we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring the contribution of actions upon obtaining subsequent rewards, by quantifying a counterfactual query: ‘Would the agent still have reached this reward if it had taken another action?’. We show that measuring contributions w.r.t. rewarding _states_, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.

Regularized Behavior Cloning for Blocking the Leakage of Past Action Information
Seokin Seo HyeongJoo Hwang Hongseok Yang Kee-Eung Kim



Research question: In partially observable environments, when past action information leaks into observation histories, imitation learning via behavior cloning often ends up imitating its own past actions, causing catastrophic failures.
Motivation: To address this, a principled regularization for behavior cloning named Past Action Leakage Regularization (PALR) is proposed.
Method: The main idea is to leverage the classical notion of conditional independence to mitigate the leakage; comparing different conditional independence metrics and their estimators shows that a particular kernel-based estimator works best.
Results: Extensive experiments on benchmark datasets show the method significantly outperforming prior related approaches, highlighting its potential to imitate expert actions successfully when past action information leaks into observation histories.

For partially observable environments, imitation learning with observation histories (ILOH) assumes that control-relevant information is sufficiently captured in the observation histories for imitating the expert actions. In the offline setting where the agent is required to learn to imitate without interaction with the environment, behavior cloning (BC) has been shown to be a simple yet effective method for imitation learning. However, when the information about the actions executed in the past timesteps leaks into the observation histories, ILOH via BC often ends up imitating its own past actions. In this paper, we address this catastrophic failure by proposing a principled regularization for BC, which we name Past Action Leakage Regularization (PALR). The main idea behind our approach is to leverage the classical notion of conditional independence to mitigate the leakage. We compare different instances of our framework with natural choices of conditional independence metric and its estimator. The result of our comparison advocates the use of a particular kernel-based estimator for the conditional independence metric. We conduct an extensive set of experiments on benchmark datasets in order to assess the effectiveness of our regularization method. The experimental results show that our method significantly outperforms prior related approaches, highlighting its potential to successfully imitate expert actions when the past action information leaks into the observation histories.
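
The abstract advocates a kernel-based estimator of a conditional independence metric; as a stand-in, the sketch below shows the standard biased HSIC estimator used as a dependence penalty between predicted and past actions. PALR's actual estimator targets conditional independence and may differ in form.

```python
import torch

def rbf_gram(x, sigma=1.0):
    return torch.exp(-torch.cdist(x, x) ** 2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2, a differentiable
    kernel measure of dependence usable as a regularizer."""
    n = x.shape[0]
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# e.g. loss = bc_nll + lam * hsic(predicted_actions, past_actions)
a_pred, a_past = torch.randn(64, 4), torch.randn(64, 4)
print(hsic(a_pred, a_past))
```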

HIQL: Offline Goal-Conditioned RL with Latent States as Actions
Seohong Park Dibya Ghosh Benjamin Eysenbach Sergey Levine



Research question: How to learn directly from large quantities of unlabeled (reward-free) data in reinforcement learning, specifically through goal-conditioned RL.
Motivation: Although goal-conditioned RL can potentially exploit large amounts of unlabeled data, building effective algorithms that learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals.
Method: A hierarchical algorithm for goal-conditioned RL from offline data: using one action-free value function, two policies are learned that exploit the goal-reaching structure, a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal, and a low-level policy that predicts the action for reaching this subgoal.
Results: Analysis and didactic examples show that this hierarchical decomposition is robust to noise in the estimated value function; on offline goal-reaching benchmarks, the method solves long-horizon tasks that stymie prior methods, scales to high-dimensional image observations, and readily exploits action-free data.

Unsupervised pre-training has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. Our code is available at https://seohong.me/projects/hiql/

Behavior Alignment via Reward Function Optimization
Dhawal Gupta Yash Chandak Scott M. Jordan Philip S. Thomas Bruno Castro da Silva



Research question: Designing reward functions that efficiently guide reinforcement learning (RL) agents toward specific behaviors is a complex task.
Motivation: It requires identifying reward structures that are not sparse while avoiding inadvertently inducing undesirable behaviors.
Method: A new framework uses a bi-level objective to learn "behavior alignment reward functions", which integrate auxiliary rewards reflecting a designer's heuristics and domain knowledge with the environment's primary rewards.
Results: Evaluations on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges, show that the method addresses key shortcomings of existing approaches and consistently yields high-performing solutions, even when the given auxiliary reward functions are misaligned or of poor quality.

Designing reward functions for efficiently guiding reinforcement learning (RL) agents toward specific behaviors is a complex task. This is challenging since it requires the identification of reward structures that are not sparse and that avoid inadvertently inducing undesirable behaviors. Naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer's intended goal. Although potential-based reward shaping is often suggested as a remedy, we systematically investigate settings where deploying it often significantly impairs performance. To address these issues, we introduce a new framework that uses a bi-level objective to learn \emph{behavior alignment reward functions}. These functions integrate auxiliary rewards reflecting a designer's heuristics and domain knowledge with the environment's primary rewards. Our approach automatically determines the most effective way to blend these types of feedback, thereby enhancing robustness against heuristic reward misspecification. Remarkably, it can also adapt an agent's policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying RL algorithms. We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. We investigate heuristic auxiliary rewards of varying quality---some of which are beneficial and others detrimental to the learning process. Our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. It not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly-specified auxiliary reward functions.

Learning Universal Policies via Text-Guided Video Generation
Yilun Du Sherry Yang Bo Dai Hanjun Dai Ofir Nachum Joshua B. Tenenbaum Dale Schuurmans Pieter Abbeel



Research question: This paper uses text-guided image synthesis to cast sequential decision making as a text-conditioned video generation problem, aiming to build more general-purpose agents.
Motivation: Motivated by the success of text-guided image synthesis in generating complex novel images, the authors investigate whether such tools can be used to construct more general-purpose agents.
Method: Given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions, after which control actions are extracted from the generated video.
Results: The approach generalizes naturally and combinatorially to novel goals, can represent environments with different state and action spaces in a unified image space (enabling learning and generalization across robot manipulation tasks), and, by leveraging pretrained language embeddings and widely available internet videos, enables knowledge transfer through highly realistic video plans for real robots.

A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision making problem as a text-conditioned video generation problem, where, given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions in the future, after which control actions are extracted from the generated video. By leveraging text as the underlying goal specification, we are able to naturally and combinatorially generalize to novel goals. The proposed policy-as-video formulation can further represent environments with different state and action spaces in a unified space of images, which, for example, enables learning and generalization across a variety of robot manipulation tasks. Finally, by leveraging pretrained language embeddings and widely available videos from the internet, the approach enables knowledge transfer through predicting highly realistic video plans for real robots.

Continual Learning for Instruction Following from Realtime Feedback
Alane Suhr Yoav Artzi



Research question: How to continually train an instruction-following agent from feedback provided by users during collaborative interactions.
Motivation: During interaction, human users instruct the agent in natural language and provide realtime binary feedback as they observe it following their instructions, offering an abundant learning signal without expert annotation.
Method: A contextual bandit learning approach converts user feedback into immediate reward.
Results: Evaluation over thousands of human-agent interactions shows a 15.4% absolute improvement in instruction execution accuracy over time; the approach is robust to several design variations, and the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.

We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language, and provide realtime binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach, converting user feedback to immediate reward. We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time. We also show our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.

SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks
Bill Yuchen Lin Yicheng Fu Karina Yang Faeze Brahman Shiyu Huang Chandra Bhagavatula Prithviraj Ammanabrolu Yejin Choi Xiang Ren



Research question: This paper designs SwiftSage, a novel agent framework for action planning in complex interactive reasoning tasks.
Motivation: Inspired by the dual-process theory of human cognition, SwiftSage integrates the strengths of behavior cloning and large language models (LLMs) to improve task-completion performance.
Method: The framework comprises two modules: the Swift module, representing fast and intuitive thinking, is a small encoder-decoder LM fine-tuned on the oracle agent's action trajectories; the Sage module, emulating deliberate thought, employs LLMs such as GPT-4 for subgoal planning and grounding.
Results: Across 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms methods such as SayCan, ReAct, and Reflexion, demonstrating its effectiveness on complex interactive tasks.

We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting large language models (LLMs) to enhance task completion performance. The framework comprises two primary modules: the Swift module, representing fast and intuitive thinking, and the Sage module, emulating deliberate thought processes. The Swift module is a small encoder-decoder LM fine-tuned on the oracle agent's action trajectories, while the Sage module employs LLMs such as GPT-4 for subgoal planning and grounding. We develop a heuristic method to harmoniously integrate the two modules, resulting in a more efficient and robust problem-solving process. In 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms other methods such as SayCan, ReAct, and Reflexion, demonstrating its effectiveness in solving complex interactive tasks.
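
The control flow can be sketched with stubs: the small Swift model acts by default, and the LLM-backed Sage module is consulted when Swift looks unsure or stuck. The confidence threshold and the "stuck" test are our guesses at a switching heuristic; the paper's integration method is more elaborate.

```python
class Swift:
    """Stand-in for the small encoder-decoder LM fine-tuned on oracle
    trajectories (stub returns a fixed action and confidence)."""
    def predict(self, obs, history):
        return "open door", 0.6

class Sage:
    """Stand-in for the GPT-4-based subgoal planning and grounding (stub)."""
    def plan(self, obs, history):
        return ["find key", "unlock door", "open door"]
    def ground(self, subgoals, obs):
        return subgoals[0]

def swiftsage_step(obs, swift, sage, history, conf_threshold=0.8):
    action, confidence = swift.predict(obs, history)       # fast path
    stuck = len(history) >= 2 and history[-1] == history[-2] == action
    if confidence < conf_threshold or stuck:               # slow, deliberate path
        action = sage.ground(sage.plan(obs, history), obs)
    history.append(action)
    return action

history = []
print(swiftsage_step("kitchen, door closed", Swift(), Sage(), history))
```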

Honesty Is the Best Policy: Defining and Mitigating AI Deception
Francis Rhys Ward Francesca Toni Francesco Belardinelli Tom Everitt



Research question: This paper addresses the challenge that deceptive behavior in AI systems poses to safety, trustworthiness, and cooperation.
Motivation: Existing definitions cannot fully account for deception by learning agents in games, so a formal definition applicable to real-world machine learning systems is needed.
Method: A formal definition of deception, grounded in the philosophy literature, is introduced in structural causal games, together with graphical criteria for deception.
Results: Experiments show these results can be used to mitigate deception in reinforcement learning agents and language models.

Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

Learning from Active Human Involvement through Proxy Value Propagation
Zhenghao Peng Wenjie Mo Chenda Duan Quanyi Li Bolei Zhou



Research question: How to let humans actively intervene and demonstrate to an AI agent during training so that the interaction brings safety and alignment to the learning process.
Motivation: Learning from active human involvement provides corrective feedback during training, but encoding human intent without a hand-crafted reward function remains a challenge.
Method: Proxy Value Propagation, a reward-free method for policy optimization: a proxy value function expresses human intents by labeling state-action pairs from human demonstrations with high values and intervened agent actions with low values; TD-learning propagates these labels to unlabeled data from the agent's exploration, inducing a policy that faithfully emulates human behavior.
Results: Human-in-the-loop experiments show the method's generality and efficiency; with minimal modification to existing RL algorithms, it learns continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V.

Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from humans bring safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents, wherein state-action pairs in the human demonstration are labeled with high values, while those agents' actions that are intervened receive low values. Through the TD-learning framework, labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from agents' exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: https://metadriverse.github.io/pvp.
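
Our reading of the objective fits in a few lines: labeled human data is regressed to fixed proxy values, and a reward-free TD backup spreads those values to the agent's unlabeled exploration data. The masks, the +1/-1 labels, and the critic interface are illustrative assumptions, not the paper's exact losses.

```python
import torch

def proxy_value_loss(q_net, batch, gamma=0.99):
    """Sketch of Proxy Value Propagation: demonstrated state-action pairs
    are pushed to a high proxy value, human-intervened agent actions to a
    low one, and ordinary (reward-free) TD learning propagates the labels
    to unlabeled exploration data. Masks are assumed disjoint."""
    q = q_net(batch["s"], batch["a"])
    demo, interv = batch["is_demo"], batch["is_intervened"]
    labeled = demo + interv
    label = demo * 1.0 + interv * (-1.0)                 # human-intent labels
    pv_loss = (labeled * (q - label) ** 2).mean()
    with torch.no_grad():                                # reward-free TD target
        target = gamma * q_net(batch["s2"], batch["a2"])
    td_loss = ((1 - labeled) * (q - target) ** 2).mean()
    return pv_loss + td_loss

q_net = lambda s, a: (s * a).sum(-1)                     # stand-in critic
n = 32
demo = torch.rand(n) < 0.2
interv = ~demo & (torch.rand(n) < 0.1)
batch = dict(s=torch.randn(n, 4), a=torch.randn(n, 4),
             s2=torch.randn(n, 4), a2=torch.randn(n, 4),
             is_demo=demo.float(), is_intervened=interv.float())
print(proxy_value_loss(q_net, batch))
```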

Calibrated Stackelberg Games: Learning Optimal Commitments Against Calibrated Agents
Nika Haghtalab Chara Podimata Kunhe Yang



Research question: This paper generalizes the standard Stackelberg Games (SGs) framework by introducing Calibrated Stackelberg Games (CSGs).
Motivation: In standard SGs the agent has direct access to the principal's action; in CSGs the agent instead best responds to calibrated forecasts of it, a model that more robustly addresses the real-life applications SGs were intended to capture.
Method: A general approach is given for obtaining adaptive calibration algorithms, specialized to finite CSGs; a stronger notion of calibration, adaptive calibration, is also introduced, providing fine-grained any-time guarantees against adversarial sequences.
Results: The main technical result shows that in CSGs the principal can achieve utility converging to the optimum Stackelberg value of the game in both finite and continuous settings, and that no higher utility is achievable; the results apply immediately to learning in Stackelberg Security Games and to strategic classification, both against calibrated agents.

In this paper, we introduce a generalization of the standard Stackelberg Games (SGs) framework: _Calibrated Stackelberg Games_. In CSGs, a principal repeatedly interacts with an agent who (contrary to standard SGs) does not have direct access to the principal's action but instead best responds to _calibrated forecasts_ about it. CSG is a powerful modeling tool that goes beyond assuming that agents use ad hoc and highly specified algorithms for interacting in strategic settings to infer the principal's actions and thus more robustly addresses real-life applications that SGs were originally intended to capture. Along with CSGs, we also introduce a stronger notion of calibration, termed _adaptive calibration_, that provides fine-grained any-time calibration guarantees against adversarial sequences. We give a general approach for obtaining adaptive calibration algorithms and specialize them for finite CSGs. In our main technical result, we show that in CSGs, the principal can achieve utility that converges to the optimum Stackelberg value of the game both in _finite_ and _continuous_ settings and that no higher utility is achievable. Two prominent and immediate applications of our results are the settings of learning in Stackelberg Security Games and strategic classification, both against _calibrated_ agents.

A Robust and Opponent-Aware League Training Method for StarCraft II
Ruozi Huang Xipeng Wu Hongsheng Yu Zhong Fan Haobo Fu QIANG FU Yang Wei



Research question: Training a superhuman Artificial Intelligence (AI) for games the size of StarCraft II is extremely difficult.
Motivation: AlphaStar was the first AI to beat human professionals in the full game of StarCraft II, using a league training framework inspired by a game-theoretic approach; this paper improves AlphaStar's league training.
Method: Goal-conditioned exploiters are trained whose ability to spot weaknesses in the main agent and the entire league is greatly improved over AlphaStar's unconditioned exploiters; league agents are further endowed with opponent modeling, making them more responsive to the opponent's real-time strategy.
Results: With these improvements, a better, superhuman AI is trained with orders of magnitude fewer resources than AlphaStar (see Table 1 for a full comparison); given StarCraft II's iconic role in game AI research, the method and results offer valuable design principles for using the general league training framework to obtain a least-exploitable strategy in large-scale real-world games.

It is extremely difficult to train a superhuman Artificial Intelligence (AI) for games of similar size to StarCraft II. AlphaStar is the first AI that beat human professionals in the full game of StarCraft II, using a league training framework that is inspired by a game-theoretic approach. In this paper, we improve AlphaStar's league training in two significant aspects. We train goal-conditioned exploiters, whose abilities of spotting weaknesses in the main agent and the entire league are greatly improved compared to the unconditioned exploiters in AlphaStar. In addition, we endow the agents in the league with the new ability of opponent modeling, which makes the agent more responsive to the opponent's real-time strategy. Based on these improvements, we train a better and superhuman AI with orders of magnitude less resources than AlphaStar (see Table 1 for a full comparison). Considering the iconic role of StarCraft II in game AI research, we believe our method and results on StarCraft II provide valuable design principles on how one would utilize the general league training framework for obtaining a least-exploitable strategy in various, large-scale, real-world games.

Coherent Soft Imitation Learning
Joe Watson Sandy Huang Nicolas Heess



Research question: This work addresses the choice of strategy in imitation learning: behavioral cloning (BC) of the policy versus inverse reinforcement learning (IRL) of the reward.
Motivation: The choice between BC and IRL depends on the quality and state-action coverage of the demonstrations and on additional access to the Markov decision process; hybrid strategies combining BC and IRL are rare because initial policy optimization against inaccurate rewards erases the benefit of pretraining the policy with BC.
Method: In the entropy-regularized ("soft") RL setting, the behavioral-cloned policy is shown to serve as both a shaped reward and a critic hypothesis space by inverting the regularized policy update; this coherency enables fine-tuning the cloned policy using the reward estimate and additional environment interactions, achieving imitation via initial BC and subsequent RL refinement with online or offline data.
Results: The approach's simplicity allows graceful scaling to high-dimensional and vision-based tasks, with stable learning and minimal hyperparameter tuning, in contrast to adversarial methods.

Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) for the policy or inverse reinforcement learning (IRL) for the reward. Such methods enable agents to learn complex tasks from humans that are difficult to capture with hand-designed reward functions. Choosing between BC or IRL for imitation depends on the quality and state-action coverage of the demonstrations, as well as additional access to the Markov decision process. Hybrid strategies that combine BC and IRL are rare, as initial policy optimization against inaccurate rewards diminishes the benefit of pretraining the policy with BC. Our work derives an imitation method that captures the strengths of both BC and IRL. In the entropy-regularized (`soft') reinforcement learning setting, we show that the behavioral-cloned policy can be used as both a shaped reward and a critic hypothesis space by inverting the regularized policy update. This coherency facilitates fine-tuning cloned policies using the reward estimate and additional interactions with the environment. This approach conveniently achieves imitation learning through initial behavioral cloning and subsequent refinement via RL with online or offline data sources. The simplicity of the approach enables graceful scaling to high-dimensional and vision-based tasks, with stable learning and minimal hyperparameter tuning, in contrast to adversarial approaches. For the open-source implementation and simulation results, see https://joemwatson.github.io/csil/.
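The inversion the abstract refers to admits a compact reading: in entropy-regularized RL an optimal policy satisfies pi(a|s) proportional to exp(Q(s,a)/alpha), so the log-density of a behavior-cloned policy can itself serve as a shaped reward for subsequent fine-tuning. Below is a minimal, hedged sketch of that reading; the network architecture, the temperature alpha, and all names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: a BC policy's log-density reused as a shaped reward,
# in the spirit of coherent soft imitation learning (not the authors' code).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Gaussian policy head; its log_prob supplies the shaped reward."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.mean = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

def bc_pretrain(policy, obs, act, epochs=100, lr=1e-3):
    """Standard behavioral cloning: maximize log-likelihood of expert actions."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = -policy.dist(obs).log_prob(act).sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def shaped_reward(policy_bc, obs, act, alpha=0.1):
    """Reward implied by the cloned policy, used during soft-RL fine-tuning."""
    with torch.no_grad():
        return alpha * policy_bc.dist(obs).log_prob(act).sum(-1)

policy = PolicyNet(obs_dim=11, act_dim=3)
demo_obs, demo_act = torch.randn(128, 11), torch.randn(128, 3)
bc_pretrain(policy, demo_obs, demo_act)
rewards = shaped_reward(policy, demo_obs, demo_act)  # feed into soft RL
```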

Survival Instinct in Offline Reinforcement Learning
Anqi Li Dipendra Misra Andrey Kolobov Ching-An Cheng



Research question: This paper examines the behavior of offline reinforcement learning (RL) algorithms, in particular when they are trained with incorrect reward labels.
Motivation: The authors find that offline RL can produce well-performing and safe policies on many benchmark datasets even when trained with "wrong" reward labels (e.g., zero everywhere, or negatives of the true rewards), a phenomenon that cannot be explained by offline RL's return-maximization objective.
Method: The authors attribute this surprising robustness to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. Pessimism endows the agent with a survival instinct, an incentive to stay within the data support in the long term, while limited and biased data coverage further constrains the set of survival policies.
Results: Theory and experiments show that, given a reward class (which may not even contain the true reward), offline RL can learn a near-optimal and safe policy from any reward in the class, provided the training data distribution satisfies certain conditions. The authors recommend accounting for the survival instinct when interpreting existing offline RL benchmark results and when creating future benchmarks.

We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of *pessimism* in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a *survival instinct*, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. Formally, given a reward class -- which may not even contain the true reward -- we identify conditions on the training data distribution that enable offline RL to learn a near-optimal and safe policy from any reward within the class. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for offline RL, whereby an agent is "nudged" to learn a desirable behavior with imperfect reward but purposely biased data coverage. Please visit our website https://survival-instinct.github.io for accompanying code and videos.

Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration
Zhihan Liu Miao Lu Wei Xiong Han Zhong Hao Hu Shenao Zhang Sirui Zheng Zhuoran Yang Zhaoran Wang



Research question: In reinforcement learning, how to balance exploration and exploitation to achieve an optimal policy.
Motivation: Existing sample-efficient algorithms typically require data-dependent level-set constraints or complicated sampling procedures to incentivize exploration, which are impractical to implement.
Method: We propose Maximize to Explore (MEX), an easy-to-implement RL framework that only needs to optimize a single objective integrating the estimation and planning components while automatically balancing exploration and exploitation.
Results: Theoretically, MEX achieves sublinear regret with general function approximators and extends to the zero-sum Markov game setting. Practically, by adapting deep RL baselines we design model-based and model-free versions of MEX that stably outperform the baselines in various sparse-reward MuJoCo environments. Compared with existing sample-efficient algorithms with general function approximators, MEX attains similar sample efficiency at a lower computational cost and is more compatible with modern deep RL methods.

In reinforcement learning (RL), balancing exploration and exploitation is crucial for achieving an optimal policy in a sample-efficient way. To this end, existing sample-efficient algorithms typically consist of three components: estimation, planning, and exploration. However, to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as data-dependent level-set constraints or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called Maximize to Explore (MEX), which only needs to optimize unconstrainedly a single objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that MEX achieves a sublinear regret with general function approximators and is extendable to the zero-sum Markov game setting. Meanwhile, we adapt deep RL baselines to design practical versions of MEX in both the model-based and model-free settings, which outperform baselines in various MuJoCo environments with sparse rewards by a stable margin. Compared with existing sample-efficient algorithms with general function approximators, MEX achieves similar sample efficiency while also enjoying a lower computational cost and is more compatible with modern deep RL methods.
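To make the single-objective idea concrete, here is a toy, hedged illustration: among candidate hypotheses, score each by its estimated optimal value minus a coefficient eta times its data-fitting loss, and pick the maximizer. The candidate values, losses, and eta below are invented for illustration and are not the paper's construction.

```python
# Toy illustration of a MEX-style single objective: value minus eta * loss.
import numpy as np

def mex_select(values, nll_losses, eta=1.0):
    """Pick the hypothesis maximizing value - eta * loss. Optimism is implicit:
    high-value hypotheses win unless the data refutes them via a large loss."""
    objective = np.asarray(values) - eta * np.asarray(nll_losses)
    return int(np.argmax(objective))

# Three hypothetical models: optimistic but ill-fitting, balanced, pessimistic.
values = [10.0, 7.0, 4.0]       # estimated optimal values under each model
nll_losses = [8.0, 2.0, 1.0]    # how badly each model fits the observed data
print(mex_select(values, nll_losses, eta=1.0))  # -> 1 (the balanced model)
```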

Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning
Shenzhi Wang Qisen Yang Jiawei Gao Matthieu Gaetan Lin HAO CHEN Liwei Wu Ning Jia Shiji Song Gao Huang



Research question: This paper addresses the distributional shift problem in offline-to-online reinforcement learning with a general and effective framework.
Motivation: Existing solutions typically apply a single improvement-constraint balance across all collected data, which may underuse each sample given the significant variation in data quality across states.
Method: This paper proposes Family Offline-to-Online RL (FamO2O), which uses a universal model to train a family of policies with different improvement/constraint intensities and a balance model to select a suitable policy for each state.
Results: Experiments show that FamO2O yields significant improvements over various existing methods, achieving state-of-the-art performance on the D4RL benchmark.

Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, the incorporation of online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. They typically advocate a single balance between policy improvement and constraints across diverse data collections. This one-size-fits-all manner may not optimally leverage each collected sample due to the significant variation in data quality across different states. To this end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O utilizes a universal model to train a family of policies with different improvement/constraint intensities, and a balance model to select a suitable policy for each state. Theoretically, we prove that state-adaptive balances are necessary for achieving a higher policy performance upper bound. Empirically, extensive experiments show that FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark. Codes are available at https://github.com/LeapLabTHU/FamO2O.
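A minimal sketch of the two components the abstract names, a universal policy conditioned on a balance coefficient and a balance model that picks a coefficient per state, is shown below; the network shapes, the sigmoid-bounded coefficient range, and all names are assumptions rather than the paper's exact design.

```python
# Hedged sketch of FamO2O-style components: a policy family indexed by a
# balance coefficient, plus a per-state balance selector.
import torch
import torch.nn as nn

class UniversalPolicy(nn.Module):
    """pi(a | s, beta): one network represents a family of policies,
    indexed by the improvement/constraint balance beta."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, obs, beta):
        return self.net(torch.cat([obs, beta], dim=-1))

class BalanceModel(nn.Module):
    """beta(s): selects a state-adaptive balance in (0, 1)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, obs):
        return self.net(obs)

obs = torch.randn(32, 17)                 # a batch of states
policy, balance = UniversalPolicy(17, 6), BalanceModel(17)
actions = policy(obs, balance(obs))       # state-adaptive family member
```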

Convex-Concave Zero-Sum Stochastic Stackelberg Games
Denizalp Goktas Arjun Prakash Amy Greenwald



Research question: This paper aims to develop policy gradient methods that solve a large class of problems, ranging from economics to human-robot interaction, from noisy gradient estimates computed from observed trajectories of play.
Motivation: Zero-sum stochastic Stackelberg games can model a large class of problems from economics to human-robot interaction; the authors aim to solve such problems more effectively with new algorithms.
Method: The authors develop new policy gradient methods for these games and prove that, when the game is convex-concave, their algorithms converge to a Stackelberg equilibrium in polynomial time.
Results: Experiments show that modeling reach-avoid problems as Stackelberg games yields solutions that are safer (less likely to result in collisions) and more likely to reach their goals than alternative solutions, in particular Nash equilibria.

Zero-sum stochastic Stackelberg games can be used to model a large class of problems, ranging from economics to human robot interaction. In this paper, we develop policy gradient methods to solve these games from noisy gradient estimates computed from observed trajectories of play. We prove that our algorithms converge to Stackelberg equilibrium in polynomial time when the games are convex-concave. We also prove that reach-avoid problems are naturally modeled as convex-concave zero-sum stochastic Stackelberg games. Finally, we run experiments which demonstrate that modeling reach-avoid problems as Stackelberg games leads to solutions which are safer, thus less likely to result in collisions, and livelier, thus more likely to reach their goals, than alternative solutions, in particular Nash equilibrium.

Combining Behaviors with the Successor Features Keyboard
Wilka Carvalho Andre Saraiva Angelos Filos Andrew Kyle Lampinen Loic Matthey Richard Lewis Honglak Lee Satinder Singh Danilo Jimenez Rezende Daniel Zoran



Research question: How to transfer behavioral knowledge effectively across tasks?
Motivation: Existing transfer methods rely on hand-designed state features and task encodings, which are cumbersome to design for every new environment.
Method: We propose the Successor Features Keyboard (SFK) and the Categorical Successor Feature Approximator (CSFA), which enable transfer with discovered state features and task encodings.
Results: With SFK and CSFA we achieve the first demonstration of transfer with SFs in a challenging 3D environment, transferring to long-horizon tasks more quickly than alternative methods.

The Option Keyboard (OK) was recently proposed as a method for transferring behavioral knowledge across tasks. OK transfers knowledge by adaptively combining subsets of known behaviors using Successor Features (SFs) and Generalized Policy Improvement (GPI). However, it relies on hand-designed state-features and task encodings which are cumbersome to design for every new environment. In this work, we propose the "Successor Features Keyboard" (SFK), which enables transfer with discovered state-features and task encodings. To enable discovery, we propose the "Categorical Successor Feature Approximator" (CSFA), a novel learning algorithm for estimating SFs while jointly discovering state-features and task encodings. With SFK and CSFA, we achieve the first demonstration of transfer with SFs in a challenging 3D environment where all the necessary representations are discovered. We first compare CSFA against other methods for approximating SFs and show that only CSFA discovers representations compatible with SF&GPI at this scale. We then compare SFK against transfer learning baselines and show that it transfers most quickly to long-horizon tasks.
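For readers unfamiliar with the SF&GPI machinery the keyboard builds on, the following sketch shows standard generalized policy improvement over successor features; this is the textbook SF&GPI step, not SFK's discovery components, and the shapes and values are illustrative.

```python
# Standard SF & GPI action selection (the primitive SFK builds on).
import numpy as np

def gpi_action(psi, w):
    """psi: (n_policies, n_actions, d) successor features at the current state.
    w: (d,) task weights. Returns the GPI action argmax_a max_i psi_i(s,a)^T w."""
    q = psi @ w                 # (n_policies, n_actions) values on the new task
    return int(np.argmax(q.max(axis=0)))

rng = np.random.default_rng(0)
psi = rng.normal(size=(3, 4, 8))   # 3 known behaviors, 4 actions, 8 features
w_new = rng.normal(size=8)         # encoding of a new task
print(gpi_action(psi, w_new))
```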

f-Policy Gradients: A General Framework for Goal-Conditioned RL using f-Divergences
Siddhant Agarwal Ishan Durugkar Peter Stone Amy Zhang



Research question: In goal-conditioned reinforcement learning (RL), rewards are sparse, received only when the goal is achieved, which makes policy optimization difficult.
Motivation: Existing methods compensate for sparse rewards by learning dense reward functions, but a misaligned reward can lead to suboptimal policies. Moreover, recent work shows that effective shaping rewards for a given problem can depend on the underlying learning algorithm.
Method: This paper proposes $f$-Policy Gradients ($f$-PG), a new way to encourage exploration. $f$-PG minimizes the f-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various f-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse-reward settings. We also introduce an entropy-regularized policy optimization objective, $state$-MaxEnt RL (or $s$-MaxEnt RL), as a special case of our objective.
Results: We find that $f$-PG outperforms standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments. More information is available at https://agarwalsiddhant10.github.io/projects/fpg.html.

Goal-Conditioned Reinforcement Learning (RL) problems often have access to sparse rewards where the agent receives a reward signal only when it has achieved the goal, making policy optimization a difficult problem. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent works have demonstrated that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG minimizes the f-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various f-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse reward settings. We further introduce an entropy-regularized policy optimization objective, that we call $state$-MaxEnt RL (or $s$-MaxEnt RL) as a special case of our objective. We show that several metric-based shaping rewards like L2 can be used with $s$-MaxEnt RL, providing a common ground to study such metric-based shaping rewards with efficient exploration. We find that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments. More information is available on our website: https://agarwalsiddhant10.github.io/projects/fpg.html.
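A toy, hedged illustration of the objective on a tabular state space: estimate the state-visitation distribution from rollout counts and score it against the goal distribution with an f-divergence (KL shown as one member of the family). The environment, counts, and smoothing constant below are invented.

```python
# Toy f-divergence objective between state visitation and goal distributions.
import numpy as np

def f_divergence_kl(p_goal, p_visit, eps=1e-8):
    """KL(p_goal || p_visit), one member of the f-divergence family f-PG can
    minimize; lower means the visitation distribution covers the goal better."""
    p_goal = p_goal + eps
    p_visit = p_visit + eps
    return float(np.sum(p_goal * np.log(p_goal / p_visit)))

n_states = 5
p_goal = np.eye(n_states)[4]                 # all goal mass on state 4
visits = np.array([10, 6, 3, 2, 1], float)   # counts under the current policy
p_visit = visits / visits.sum()
print(f_divergence_kl(p_goal, p_visit))      # dense signal even before success
```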

A Diffusion-Model of Joint Interactive Navigation
Matthew Niedoba Jonathan Wilder Lavington Yunpeng Liu Vasileios Lioutas Justice Sefas Xiaoxuan Liang Dylan Green Setareh Dabiri Berend Zwartsenberg Adam Scibior Frank Wood



Research question: How to simulate diverse and realistic behaviors for autonomous driving systems?
Motivation: Using prerecorded real-world traffic scenarios in simulation ensures realism, but the rarity of safety-critical events makes large-scale collection of driving scenarios expensive.
Method: We propose DJINN, a diffusion-based method for generating traffic scenarios. The approach jointly diffuses the trajectories of all agents, conditioned on a flexible set of state observations from the past, present, or future.
Results: On popular trajectory forecasting datasets we report state-of-the-art performance on joint trajectory metrics. In addition, we show how DJINN flexibly enables direct test-time sampling from a variety of valuable conditional distributions, including goal-based sampling, behavior-class sampling, and scenario editing.

Simulation of autonomous vehicle systems requires that simulated traffic participants exhibit diverse and realistic behaviors. The use of prerecorded real-world traffic scenarios in simulation ensures realism but the rarity of safety critical events makes large scale collection of driving scenarios expensive. In this paper, we present DJINN -- a diffusion based method of generating traffic scenarios. Our approach jointly diffuses the trajectories of all agents, conditioned on a flexible set of state observations from the past, present, or future. On popular trajectory forecasting datasets, we report state of the art performance on joint trajectory metrics. In addition, we demonstrate how DJINN flexibly enables direct test-time sampling from a variety of valuable conditional distributions including goal-based sampling, behavior-class sampling, and scenario editing.

ELDEN: Exploration via Local Dependencies
Zizhao Wang Jiaheng Hu Peter Stone Roberto Martín-Martín



Research question: How to explore effectively in reinforcement learning tasks with large state spaces and sparse rewards.
Motivation: In such tasks an agent must explore the state space efficiently until it finds a reward. To address this, the community has proposed augmenting the reward function with an intrinsic reward, a bonus signal that encourages the agent to visit interesting states.
Method: We propose a new way of defining interesting states for environments with factored state spaces and complex chained dependencies, where an agent's actions may change the value of one entity that, in turn, may affect the value of another. Our insight is that in these environments the interesting states for exploration are those where the agent is uncertain whether entities (such as the agent or objects) influence one another. We present ELDEN, Exploration via Local DepENdencies, a novel intrinsic reward that encourages the discovery of new interactions between entities.
Results: We evaluate ELDEN on four domains with complex dependencies, ranging from 2D grid worlds to 3D robotic tasks. In all domains, ELDEN correctly identifies local dependencies and learns successful policies, significantly outperforming previous state-of-the-art exploration methods.

Tasks with large state space and sparse rewards present a longstanding challenge to reinforcement learning. In these tasks, an agent needs to explore the state space efficiently until it finds a reward. To deal with this problem, the community has proposed to augment the reward function with intrinsic reward, a bonus signal that encourages the agent to visit interesting states. In this work, we propose a new way of defining interesting states for environments with factored state spaces and complex chained dependencies, where an agent's actions may change the value of one entity that, in turn, may affect the value of another entity. Our insight is that, in these environments, interesting states for exploration are states where the agent is uncertain whether (as opposed to how) entities such as the agent or objects have some influence on each other. We present ELDEN, Exploration via Local DepENdencies, a novel intrinsic reward that encourages the discovery of new interactions between entities. ELDEN utilizes a novel scheme, the partial derivative of the learned dynamics, to model the local dependencies between entities accurately and computationally efficiently. The uncertainty of the predicted dependencies is then used as an intrinsic reward to encourage exploration toward new interactions. We evaluate the performance of ELDEN on four different domains with complex dependencies, ranging from 2D grid worlds to 3D robotic tasks. In all domains, ELDEN correctly identifies local dependencies and learns successful policies, significantly outperforming previous state-of-the-art exploration methods.
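A hedged sketch of the two ingredients the abstract names: partial derivatives of a learned dynamics model as local dependency estimates, and disagreement across an ensemble of such estimates as the intrinsic reward. The model sizes and the variance-based bonus are assumptions about how "uncertainty of the predicted dependencies" might be computed.

```python
# Hedged sketch of ELDEN-style local-dependency uncertainty as a bonus.
import torch
import torch.nn as nn

def dependency_matrix(dynamics, state):
    """|d next_state / d state|: entry (i, j) measures how entity j's value
    locally influences entity i under the learned dynamics."""
    jac = torch.autograd.functional.jacobian(dynamics, state)
    return jac.abs()

def elden_bonus(ensemble, state):
    """Intrinsic reward: variance of predicted local dependencies across an
    ensemble, high where the agent is uncertain whether entities interact."""
    deps = torch.stack([dependency_matrix(m, state) for m in ensemble])
    return deps.var(dim=0).mean().item()

state_dim = 4
ensemble = [nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                          nn.Linear(32, state_dim)) for _ in range(5)]
print(elden_bonus(ensemble, torch.randn(state_dim)))
```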

Model-free Posterior Sampling via Learning Rate Randomization
Daniil Tiapkin Denis Belomestny Daniele Calandriello Eric Moulines Remi Munos Alexey Naumov pierre perrault Michal Valko Pierre MENARD



Research question: This paper introduces RandQL, a novel randomized model-free algorithm for regret minimization in episodic Markov decision processes (MDPs).
Motivation: Existing posterior-sampling approaches are not tractable in the model-free setting; RandQL is the first tractable model-free posterior-sampling-based algorithm.
Method: RandQL achieves optimistic exploration through learning rate randomization, without using bonuses.
Results: In baseline exploration environments, RandQL outperforms existing approaches.

In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{\mathcal{O}}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
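A toy, hedged sketch of learning-rate randomization in the tabular case: an ensemble of optimistically initialized Q-tables, each updated with its own randomly drawn step size, with actions taken on the ensemble maximum. The Beta step-size distribution, ensemble size, and initialization below are illustrative choices, not the paper's calibrated scheme.

```python
# Toy tabular sketch of learning-rate randomization (RandQL-flavored).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_ensemble, gamma, H = 6, 2, 10, 0.9, 10.0

Q = np.full((n_ensemble, n_states, n_actions), H)  # optimistic initialization

def randql_update(s, a, r, s_next):
    for k in range(n_ensemble):
        lr = rng.beta(1.0, 3.0)                    # randomized learning rate
        target = r + gamma * Q[k, s_next].max()
        Q[k, s, a] += lr * (target - Q[k, s, a])

def act(s):
    return int(Q[:, s, :].max(axis=0).argmax())    # optimism via ensemble max

randql_update(0, act(0), 1.0, 1)
```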

MIMEx: Intrinsic Rewards from Masked Input Modeling
Toru Lin Allan Jabri



Research question: Exploring in environments with high-dimensional observations is hard; how can intrinsic rewards enable effective exploration?
Motivation: Intrinsic rewards often boil down to estimating the "novelty" of states, transitions, or trajectories with deep networks; prior work has shown that conditional prediction objectives such as masked autoencoding can be viewed as stochastic estimation of pseudo-likelihood.
Method: We propose a general framework, Masked Input Modeling for Exploration (MIMEx), in which the mask distribution can be flexibly tuned to control the difficulty of the underlying conditional prediction task.
Results: MIMEx achieves superior results compared with competitive baselines on a suite of challenging sparse-reward visuomotor tasks.

Exploring in environments with high-dimensional observations is hard. One promising approach for exploration is to use intrinsic rewards, which often boils down to estimating "novelty" of states, transitions, or trajectories with deep networks. Prior works have shown that conditional prediction objectives such as masked autoencoding can be seen as stochastic estimation of pseudo-likelihood. We show how this perspective naturally leads to a unified view on existing intrinsic reward approaches: they are special cases of conditional prediction, where the estimation of novelty can be seen as pseudo-likelihood estimation with different mask distributions. From this view, we propose a general framework for deriving intrinsic rewards -- Masked Input Modeling for Exploration (MIMEx) -- where the mask distribution can be flexibly tuned to control the difficulty of the underlying conditional prediction task. We demonstrate that MIMEx can achieve superior results when compared against competitive baselines on a suite of challenging sparse-reward visuomotor tasks.
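A hedged sketch of the recipe as the abstract describes it: mask part of an input, reconstruct it, and use the prediction error as the intrinsic reward, with the mask ratio controlling the difficulty of the conditional prediction task. The tiny MLP below stands in for whatever encoder the method actually uses.

```python
# Hedged sketch of a MIMEx-style masked-prediction intrinsic reward.
import torch
import torch.nn as nn

class MaskedPredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, dim))

    def forward(self, x):
        return self.net(x)

def mimex_bonus(model, x, mask_ratio=0.5):
    """Intrinsic reward: reconstruction error on masked entries. A higher
    mask_ratio makes the conditional prediction task harder."""
    mask = torch.rand_like(x) < mask_ratio
    pred = model(x.masked_fill(mask, 0.0))
    return ((pred - x)[mask] ** 2).mean().item()

model = MaskedPredictor(dim=16)
print(mimex_bonus(model, torch.randn(16)))
```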

Egocentric Planning for Scalable Embodied Task Achievement
Xiaotian Liu Hector Palacios Christian Muise



Research question: How to enable agents to perform tasks in diverse environments, particularly generalizing across object types and executing suitable actions to accomplish tasks.
Motivation: Existing methods face significant challenges on tasks in complex environments, especially in generalizing across object types and executing suitable actions to complete tasks.
Method: We present Egocentric Planning, an innovative approach that combines symbolic planning and object-oriented partially observable Markov decision processes (POMDPs) to solve tasks in complex environments, harnessing existing models for visual perception and natural language processing.
Results: Evaluated in the ALFRED simulated environment, the approach demonstrates high scalability, achieving a 36.07% unseen success rate on the ALFRED benchmark and winning the ALFRED challenge at the CVPR Embodied AI workshop. The method requires reliable perception and the specification or learning of symbolic descriptions of the preconditions and effects of the agent's actions, as well as which object types reveal information about others. It scales naturally to new tasks as long as they can be solved with the available skills. This work offers a solid baseline for studying end-to-end and hybrid methods that aim to generalize to new tasks, including recent LLM-based approaches, which often struggle to scale to long action sequences or to produce robust plans for novel tasks.

Embodied agents face significant challenges when tasked with performing actions in diverse environments, particularly in generalizing across object types and executing suitable actions to accomplish tasks. Furthermore, agents should exhibit robustness, minimizing the execution of illegal actions. In this work, we present Egocentric Planning, an innovative approach that combines symbolic planning and Object-oriented POMDPs to solve tasks in complex environments, harnessing existing models for visual perception and natural language processing. We evaluated our approach in ALFRED, a simulated environment designed for domestic tasks, and demonstrated its high scalability, achieving an impressive 36.07% unseen success rate in the ALFRED benchmark and winning the ALFRED challenge at CVPR Embodied AI workshop. Our method requires reliable perception and the specification or learning of a symbolic description of the preconditions and effects of the agent's actions, as well as what object types reveal information about others. It can naturally scale to solve new tasks beyond ALFRED, as long as they can be solved using the available skills. This work offers a solid baseline for studying end-to-end and hybrid methods that aim to generalize to new tasks, including recent approaches relying on LLMs, which often struggle to scale to long sequences of actions or to produce robust plans for novel tasks.

A State Representation for Diminishing Rewards
Ted Moskovitz Samo Hromadka Ahmed Touati Diana L Borsa Maneesh Sahani



Research question: In multitask reinforcement learning, how can an agent rapidly adapt to various stationary reward functions randomly sampled from a fixed distribution.
Motivation: In the natural world, sequential tasks are rarely independent; instead they reflect shifting priorities based on the availability and subjective perception of rewarding stimuli.
Method: We introduce a new state representation, the lambda representation (λR), which is required for policy evaluation in such settings and which generalizes the SR as well as several other state representations from the literature.
Results: We establish the λR's formal properties and examine its normative advantages in machine learning as well as its usefulness for studying natural behaviors, particularly foraging.

A common setting in multitask reinforcement learning (RL) demands that an agent rapidly adapt to various stationary reward functions randomly sampled from a fixed distribution. In such situations, the successor representation (SR) is a popular framework which supports rapid policy evaluation by decoupling a policy's expected discounted, cumulative state occupancies from a specific reward function. However, in the natural world, sequential tasks are rarely independent, and instead reflect shifting priorities based on the availability and subjective perception of rewarding stimuli. Reflecting this disjunction, in this paper we study the phenomenon of diminishing marginal utility and introduce a novel state representation, the $\lambda$ representation ($\lambda$R) which, surprisingly, is required for policy evaluation in this setting and which generalizes the SR as well as several other state representations from the literature. We establish the $\lambda$R's formal properties and examine its normative advantages in the context of machine learning, as well as its usefulness for studying natural behaviors, particularly foraging.
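The abstract only states that the λR generalizes the SR under diminishing marginal utility; one natural reading, sketched below as a hedged reconstruction (the paper's exact definition may differ in details), discounts each repeated visit to a state by a factor λ.

```latex
% Hedged reconstruction of the lambda representation from the abstract's
% description; the paper's exact definition may differ in details.
\psi^{\lambda}_{\pi}(s,\tilde{s})
  = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty}
      \gamma^{t}\,\lambda^{n_t(\tilde{s})}\,
      \mathbb{1}\{s_t=\tilde{s}\} \,\middle|\, s_0 = s\right],
\qquad
n_t(\tilde{s}) = \sum_{k<t}\mathbb{1}\{s_k=\tilde{s}\}
```

Under this reading, setting λ = 1 recovers the successor representation, while λ = 0 counts only the first visit to each state.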

Online POMDP Planning with Anytime Deterministic Guarantees
Moran Barenboim Vadim Indelman



Research question: How to effectively plan under uncertainty for autonomous agents operating in real-world scenarios.
Motivation: Partially observable Markov decision processes (POMDPs) formalize planning under uncertainty, but finding an optimal plan is computationally prohibitive for large problems.
Method: We derive a novel algorithm from a deterministic relationship between a simplified, easier-to-obtain solution and the theoretically optimal one. First, we derive bounds for selecting a subset of observations to branch from while computing a complete belief at each posterior node. Then, since a complete belief update may be computationally demanding, we extend the bounds to support reduction of both the state and observation spaces.
Results: Our guarantees can be integrated with existing state-of-the-art solvers that sample states and observations, so the returned solution carries deterministic bounds relative to the optimal policy. Finally, we substantiate our findings with experimental results.

Autonomous agents operating in real-world scenarios frequently encounter uncertainty and make decisions based on incomplete information. Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs). However, finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks. In recent years, approximate algorithms, such as tree search and sample-based methodologies, have emerged as state-of-the-art POMDP solvers for larger problems. Despite their effectiveness, these algorithms offer only probabilistic and often asymptotic guarantees toward the optimal solution due to their dependence on sampling. To address these limitations, we derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one. First, we derive bounds for selecting a subset of the observations to branch from while computing a complete belief at each posterior node. Then, since a complete belief update may be computationally demanding, we extend the bounds to support reduction of both the state and the observation spaces. We demonstrate how our guarantees can be integrated with existing state-of-the-art solvers that sample a subset of states and observations. As a result, the returned solution holds deterministic bounds relative to the optimal policy. Lastly, we substantiate our findings with supporting experimental results.

Residual Q-Learning: Offline and Online Policy Customization without Value
Chenran Li Chen Tang Haruki Nishimura Jean Mercat Masayoshi Tomizuka Wei Zhan



Research question: How to use imitation learning (IL) to train a customized policy that both inherits the behavioral characteristics of a prior policy and satisfies the requirements of different downstream tasks.
Motivation: IL is appealing for complex real-world tasks where handcrafting a reward function is difficult, or where the goal is to mimic human expert behavior. However, a learned imitative policy can only follow the behavior in the demonstrations, and we may need to customize the policy's behavior to meet diverse requirements from different downstream tasks.
Method: We formulate a new problem setting, policy customization, which defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov decision process (MDP) whose reward function combines 1) the inherent reward of the demonstration and 2) an add-on reward specified by the downstream task.
Results: We propose Residual Q-learning, a novel framework that solves the formulated MDP without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms for offline and online policy customization and show that they effectively accomplish customization tasks in various environments.

Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations. It is especially appealing for solving complex real-world tasks where handcrafting reward function is difficult, or when the goal is to mimic human expert behavior. However, the learned imitative policy can only follow the behavior in the demonstration. When applying the imitative policy, we may need to customize the policy behavior to meet different requirements coming from diverse downstream tasks. Meanwhile, we still want the customized policy to maintain its imitative nature. To this end, we formulate a new problem setting called policy customization. It defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying some additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov Decision Process (MDP) with a reward function that combines 1) the inherent reward of the demonstration; and 2) the add-on reward specified by the downstream task. We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms that can realize offline and online policy customization, and show that the proposed algorithms can effectively accomplish policy customization tasks in various environments. Demo videos and code are available on our website: https://sites.google.com/view/residualq-learning.

Is RLHF More Difficult than Standard RL? A Theoretical Perspective
Yuanhao Wang Qinghua Liu Chi Jin



Research question: How to perform reinforcement learning from human feedback.
Motivation: Standard reinforcement learning learns directly from reward signals, whereas reinforcement learning from human feedback learns from preferences; preferences arguably contain less information than rewards, which makes preference-based RL seemingly more challenging.
Method: This paper theoretically proves that, for a wide range of preference models, preference-based RL can be solved directly with existing reward-based RL algorithms and techniques at small or no extra cost. Specifically, (1) for preferences drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that tolerates small errors in rewards; (2) for general arbitrary preferences, where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL, which finds Nash equilibria of factored Markov games under a restricted set of policies.
Results: We instantiate all reward-based RL subroutines with concrete provable algorithms and apply our theory to a large class of models, including tabular MDPs and MDPs with generic function approximation. We also provide guarantees when K-wise comparisons are available.

Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games under a restricted set of policies. The latter case can be further reduced to an adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines with concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.

Instructing Goal-Conditioned Reinforcement Learning Agents with Temporal Logic Objectives
Wenjie Qiu Wensen Mao He Zhu



Research question: Existing methods for learning task-conditioned policies from long-horizon instructions written in Linear Temporal Logic (LTL) generalize poorly to complex tasks.
Motivation: To handle out-of-distribution LTL objectives, we need a way for simple goal-conditioned reinforcement learning agents to follow arbitrary LTL specifications without additional training over the LTL task space.
Method: This paper proposes a novel approach by which simple goal-conditioned RL agents can follow arbitrary LTL specifications without additional training; the technique is unrestricted and generalizes to ω-regular expressions.
Results: Experiments demonstrate that the approach effectively adapts goal-conditioned RL agents to satisfy complex temporal logic task specifications zero-shot.

Goal-conditioned reinforcement learning (RL) is a powerful approach for learning general-purpose skills by reaching diverse goals. However, it has limitations when it comes to task-conditioned policies, where goals are specified by temporally extended instructions written in the Linear Temporal Logic (LTL) formal language. Existing approaches for finding LTL-satisfying policies rely on sampling a large set of LTL instructions during training to adapt to unseen tasks at inference time. However, these approaches do not guarantee generalization to out-of-distribution LTL objectives, which may have increased complexity. In this paper, we propose a novel approach to address this challenge. We show that simple goal-conditioned RL agents can be instructed to follow arbitrary LTL specifications without additional training over the LTL task space. Unlike existing approaches that focus on LTL specifications expressible as regular expressions, our technique is unrestricted and generalizes to $\omega$-regular expressions. Experiment results demonstrate the effectiveness of our approach in adapting goal-conditioned RL agents to satisfy complex temporal logic task specifications zero-shot.

Language Model Alignment with Elastic Reset
Michael Noukhovitch Samuel Lavoie Florian Strub Aaron Courville



Research question: How to optimize the trade-off between reward and drift when fine-tuning language models?
Motivation: Commonly used test metrics do not adequately measure the reward-drift trade-off, and the standard approach of modifying the reward function can degrade performance.
Method: We propose Elastic Reset, an algorithm that periodically resets the online model to an exponential moving average (EMA) of itself and then resets the EMA model to the initial model, achieving higher reward with less drift.
Results: Fine-tuning language models with Elastic Reset achieves state-of-the-art performance on a small-scale translation benchmark, outperforms all baselines on a medium-scale IMDB mock sentiment task, and yields a more performant and better-aligned technical QA chatbot with LLaMA-7B.

Finetuning language models with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. First, we argue that commonly-used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift. The standard method modifies the reward with a Kullback-Leibler (KL) penalty between the online and initial model. We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. We periodically reset the online model to an exponential moving average (EMA) of itself, then reset the EMA model to the initial model. Through the use of an EMA, our model recovers quickly after resets and achieves higher reward with less drift in the same number of steps. We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small scale pivot-translation benchmark, outperforms all baselines in a medium-scale RLHF-like IMDB mock sentiment task and leads to a more performant and more aligned technical QA chatbot with LLaMA-7B. Code is available at https://github.com/mnoukhov/elastic-reset.
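The reset schedule is simple enough to state in a few lines; below is a minimal sketch under assumed hyperparameters (decay rate, reset interval), with the RLHF update itself elided.

```python
# Minimal sketch of the Elastic Reset schedule described in the abstract.
import copy
import torch.nn as nn

def ema_update(ema_model, online_model, decay=0.999):
    for p_ema, p in zip(ema_model.parameters(), online_model.parameters()):
        p_ema.data.mul_(decay).add_(p.data, alpha=1 - decay)

def elastic_reset(online_model, ema_model, init_model):
    online_model.load_state_dict(ema_model.state_dict())   # online <- EMA
    ema_model.load_state_dict(init_model.state_dict())     # EMA <- init

model = nn.Linear(8, 8)             # stand-in for the language model
init_model = copy.deepcopy(model)   # frozen pretrained weights
ema_model = copy.deepcopy(model)

reset_every, steps = 1000, 3000
for step in range(1, steps + 1):
    # ... one RLHF finetuning update on `model` would go here ...
    ema_update(ema_model, model)
    if step % reset_every == 0:
        elastic_reset(model, ema_model, init_model)
```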

Multi-Objective Intrinsic Reward Learning for Conversational Recommender Systems
Zhendong Chu Nan Wang Hongning Wang



Research question: How to design task-specific rewards to facilitate policy learning in conversational recommender systems (CRS).
Motivation: Mainstream reinforcement-learning-based CRS solutions rely heavily on handcrafted reward functions, which may not align with user intent in CRS tasks; designing task-specific rewards is therefore critical for CRS policy learning.
Method: We propose a novel approach that learns intrinsic rewards from interactions with users. Specifically, we formulate intrinsic reward learning as a multi-objective bi-level optimization problem: the inner level optimizes the CRS policy augmented by the learned intrinsic rewards, while the outer level drives the intrinsic rewards to optimize two CRS-specific objectives, maximizing the success rate and minimizing the number of turns to reach a successful recommendation.
Results: Extensive experiments on three public CRS benchmarks show that our algorithm significantly improves CRS performance by exploiting informative learned intrinsic rewards.

Conversational Recommender Systems (CRS) actively elicit user preferences to generate adaptive recommendations. Mainstream reinforcement learning-based CRS solutions heavily rely on handcrafted reward functions, which may not be aligned with user intent in CRS tasks. Therefore, the design of task-specific rewards is critical to facilitate CRS policy learning, which remains largely under-explored in the literature. In this work, we propose a novel approach to address this challenge by learning intrinsic rewards from interactions with users. Specifically, we formulate intrinsic reward learning as a multi-objective bi-level optimization problem. The inner level optimizes the CRS policy augmented by the learned intrinsic rewards, while the outer level drives the intrinsic rewards to optimize two CRS-specific objectives: maximizing the success rate and minimizing the number of turns to reach a successful recommendation in conversations. To evaluate the effectiveness of our approach, we conduct extensive experiments on three public CRS benchmarks. The results show that our algorithm significantly improves CRS performance by exploiting informative learned intrinsic rewards.

Online learning of long-range dependencies
Nicolas Zucchet Robert Meier Simon Schug Asier Mujika Joao Sacramento



Research question: How to perform long-term credit assignment effectively in recurrent neural networks.
Motivation: Current online learning algorithms are either not scalable or fail to learn long-range dependencies.
Method: We leverage independent recurrent modules in multi-layer networks, an architectural motif that has recently been shown to be particularly powerful.
Results: Experiments on synthetic memory problems and on the challenging Long Range Arena benchmark suite show that the algorithm performs competitively, setting a new standard for what online learning can achieve.

Online learning holds the promise of enabling efficient long-term credit assignment in recurrent neural networks. However, current algorithms fall short of offline backpropagation by either not being scalable or failing to learn long-range dependencies. Here we present a high-performance online learning algorithm that merely doubles the memory and computational requirements of a single inference pass. We achieve this by leveraging independent recurrent modules in multi-layer networks, an architectural motif that has recently been shown to be particularly powerful. Experiments on synthetic memory problems and on the challenging long-range arena benchmark suite reveal that our algorithm performs competitively, establishing a new standard for what can be achieved through online learning. This ability to learn long-range dependencies offers a new perspective on learning in the brain and opens a promising avenue in neuromorphic computing.

Finding Counterfactually Optimal Action Sequences in Continuous State Spaces
Stratis Tsirtsis Manuel Gomez Rodriguez



Research question: How to effectively analyze sequential decision-making processes in continuous environments.
Motivation: Existing methods for analyzing sequential decision making target discrete-state environments, whereas in many practical applications the environment state is inherently continuous.
Method: We formally characterize sequences of discrete actions and continuous states using finite-horizon Markov decision processes and a broad class of bijective structural causal models. Building on this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, it cannot be solved in polynomial time. We then develop a search method based on the A* algorithm that, under a natural form of Lipschitz continuity of the environment dynamics, is guaranteed to return the optimal solution.
Results: Experiments on real clinical data show that the method is very efficient in practice and has the potential to offer interesting insights for sequential decision-making tasks.

Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to aid human experts, as the clinician above, to *retrospectively* analyze sequential decision making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the A* algorithm that, under a natural form of Lipschitz continuity of the environment’s dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision making tasks.

Neural Multi-Objective Combinatorial Optimization with Diversity Enhancement
Jinbiao Chen Zizhen Zhang Zhiguang Cao Yaoxin Wu Yining Ma Te Ye Jiahai Wang



Research question: Existing neural methods for multi-objective combinatorial optimization (MOCO) rely mainly on decomposition, which often yields repetitive solutions across subproblems and thus a limited Pareto set.
Motivation: To generate more Pareto solutions, we propose a novel neural heuristic with diversity enhancement (NHDE).
Method: On the one hand, to hinder duplicated solutions across subproblems, we propose an indicator-enhanced deep reinforcement learning method to guide the model and design a heterogeneous graph attention mechanism to capture the relations between the instance graph and the Pareto front graph. On the other hand, to excavate more solutions in the neighborhood of each subproblem, we present a multiple Pareto optima strategy to sample and preserve desirable solutions.
Results: Experiments show that NHDE generates a Pareto front with higher diversity, achieving superior overall performance. Moreover, NHDE is generic and can be applied to different neural methods for MOCO.

Most existing neural methods for multi-objective combinatorial optimization (MOCO) problems solely rely on decomposition, which often leads to repetitive solutions for the respective subproblems, thus a limited Pareto set. Beyond decomposition, we propose a novel neural heuristic with diversity enhancement (NHDE) to produce more Pareto solutions from two perspectives. On the one hand, to hinder duplicated solutions for different subproblems, we propose an indicator-enhanced deep reinforcement learning method to guide the model, and design a heterogeneous graph attention mechanism to capture the relations between the instance graph and the Pareto front graph. On the other hand, to excavate more solutions in the neighborhood of each subproblem, we present a multiple Pareto optima strategy to sample and preserve desirable solutions. Experimental results on classic MOCO problems show that our NHDE is able to generate a Pareto front with higher diversity, thereby achieving superior overall performance. Moreover, our NHDE is generic and can be applied to different neural methods for MOCO.

Provably Efficient Offline Reinforcement Learning in Regular Decision Processes
Roberto Cipollone Anders Jonsson Alessandro Ronca Mohammad Sadegh Talebi



Research question: This paper studies how to perform reinforcement learning from pre-collected non-Markov observation sequences when the dependence on history is captured by an (unknown) finite-state automaton.
Motivation: Existing offline reinforcement learning methods mainly target Markov decision processes (MDPs); regular decision processes (RDPs), whose dependence on the history of past events can be captured by a finite-state automaton, have received less attention.
Method: This paper presents RegORL, an algorithm that combines automata learning techniques with state-of-the-art offline RL algorithms, suitable for RDPs whose underlying finite-state automaton is unknown.
Results: RegORL provably yields near-optimal policies, and it is, to our knowledge, the first provably efficient algorithm for offline learning in RDPs.

This paper deals with offline (or batch) Reinforcement Learning (RL) in episodic Regular Decision Processes (RDPs). RDPs are the subclass of Non-Markov Decision Processes where the dependency on the history of past events can be captured by a finite-state automaton. We consider a setting where the automaton that underlies the RDP is unknown, and a learner strives to learn a near-optimal policy using pre-collected data, in the form of non-Markov sequences of observations, without further exploration. We present RegORL, an algorithm that suitably combines automata learning techniques and state-of-the-art algorithms for offline RL in MDPs. RegORL has a modular design allowing one to use any off-the-shelf offline RL algorithm in MDPs. We report a non-asymptotic high-probability sample complexity bound for RegORL to yield an $\varepsilon$-optimal policy, which surfaces a notion of concentrability relevant for RDPs. Furthermore, we present a sample complexity lower bound for offline RL in RDPs. To the best of our knowledge, this is the first work to present a provably efficient algorithm for offline learning in RDPs.

Information Maximizing Curriculum: A Curriculum-Based Approach for Learning Versatile Skills
Denis Blessing Onur Celik Xiaogang Jia Moritz Reuss Maximilian Xiling Li Rudolf Lioutikov Gerhard Neumann



Research question: When training data comes from human demonstrators, imitation learning often faces multimodal distributions due to the variability of human behavior. Most imitation learning methods rely on a maximum likelihood (ML) objective to learn a parameterized policy, but this can lead to suboptimal or unsafe behavior because of the mode-averaging property of the ML objective.
Motivation: This paper proposes Information Maximizing Curriculum, a curriculum-based approach that assigns a weight to each data point and encourages the model to specialize in the data it can represent, effectively mitigating the mode-averaging problem by allowing the model to ignore data from modes it cannot represent.
Method: To cover all modes and enable versatile behavior, we extend the approach to a mixture-of-experts (MoE) policy, where each mixture component selects its own subset of the training data for learning. A novel maximum-entropy objective is proposed to achieve full coverage of the dataset, enabling the policy to encompass all modes in the data distribution.
Results: We demonstrate the effectiveness of the approach on complex simulated control tasks using versatile human demonstrations, achieving superior performance compared to state-of-the-art methods.

Imitation learning uses data for training policies to solve complex tasks. However, when the training data is collected from human demonstrators, it often leads to multimodal distributions because of the variability in human actions. Most imitation learning methods rely on a maximum likelihood (ML) objective to learn a parameterized policy, but this can result in suboptimal or unsafe behavior due to the mode-averaging property of the ML objective. In this work, we propose Information Maximizing Curriculum, a curriculum-based approach that assigns a weight to each data point and encourages the model to specialize in the data it can represent, effectively mitigating the mode-averaging problem by allowing the model to ignore data from modes it cannot represent. To cover all modes and thus enable versatile behavior, we extend our approach to a mixture of experts (MoE) policy, where each mixture component selects its own subset of the training data for learning. A novel, maximum entropy-based objective is proposed to achieve full coverage of the dataset, thereby enabling the policy to encompass all modes within the data distribution. We demonstrate the effectiveness of our approach on complex simulated control tasks using versatile human demonstrations, achieving superior performance compared to state-of-the-art methods.

Accelerating Exploration with Unlabeled Prior Data
Qiyang Li Jason Zhang Dibya Ghosh Amy Zhang Sergey Levine



Research question: How can reward-free prior data be used to guide and accelerate exploration for an agent solving a new sparse-reward task?
Motivation: In the real world, agents rarely need to solve sparse-reward tasks entirely from scratch. We may possess prior experience that provides considerable guidance about which actions and outcomes are possible in the world, and we can exploit this experience to explore new tasks more effectively.
Method: We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization.
Results: Our approach achieves rapid exploration in several challenging sparse-reward domains, including the AntMaze domain, the Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of integrating unlabeled prior data into existing online RL algorithms and the (perhaps surprising) effectiveness of doing so.

Learning to solve tasks from a sparse reward signal is a major challenge for standard reinforcement learning (RL) algorithms. However, in the real world, agents rarely need to solve sparse reward tasks entirely from scratch. More often, we might possess prior experience to draw on that provides considerable guidance about which actions and outcomes are possible in the world, which we can use to explore more effectively for new tasks. In this work, we study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task. We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains where tabula rasa exploration is insufficient, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of incorporating unlabeled prior data into existing online RL algorithms, and the (perhaps surprising) effectiveness of doing so.
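A hedged sketch of the labeling recipe: fit a reward model on online experience (here, a small ensemble) and give the reward-free prior transitions optimistic labels. The ensemble-disagreement bonus is one assumption about what "optimistic rewards" means; the paper may implement optimism differently.

```python
# Hedged sketch: optimistic relabeling of reward-free prior data.
import torch
import torch.nn as nn

def make_reward_net(obs_dim):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def fit(ensemble, obs, rew, epochs=200, lr=1e-3):
    for net in ensemble:  # each member differs only by its random init here
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(epochs):
            loss = ((net(obs).squeeze(-1) - rew) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

def optimistic_labels(ensemble, prior_obs, beta=1.0):
    with torch.no_grad():
        preds = torch.stack([net(prior_obs).squeeze(-1) for net in ensemble])
    return preds.mean(0) + beta * preds.std(0)   # optimism via disagreement

ensemble = [make_reward_net(obs_dim=10) for _ in range(5)]
online_obs, online_rew = torch.randn(256, 10), torch.randn(256)
fit(ensemble, online_obs, online_rew)
prior_obs = torch.randn(1024, 10)                    # reward-free prior data
relabeled = optimistic_labels(ensemble, prior_obs)   # merge with online buffer
```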

Direct Preference-based Policy Optimization without Reward Modeling
Gaon An Junhyeok Lee Xingdong Zuo Norio Kosaka Kyung-Min Kim Hyun Oh Song



Research question: This paper tackles the difficulty existing reinforcement learning algorithms face when the reward function is hard to define, proposing a preference-based RL (PbRL) algorithm that learns directly from preferences.
Motivation: Existing PbRL methods first learn a reward model from preference data and then run reinforcement learning with it, but obtaining an accurate reward model can be difficult when the preferences come from human teachers.
Method: The proposed PbRL algorithm learns directly from preferences without relying on any reward model. To achieve this, we adopt a contrastive learning framework to design a novel policy scoring metric that assigns high scores to policies aligned with the given preferences.
Results: On offline RL tasks with actual human preference labels, our algorithm outperforms or matches existing PbRL methods. Notably, on high-dimensional control tasks it surpasses offline RL methods that learn with ground-truth reward information. Finally, we show that the algorithm can successfully fine-tune large language models.

Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to learn from preference, which is particularly useful when formulating a reward function is challenging. Existing PbRL methods generally involve a two-step procedure: they first learn a reward model based on given preference data and then employ off-the-shelf reinforcement learning algorithms using the learned reward model. However, obtaining an accurate reward model solely from preference information, especially when the preference is from human teachers, can be difficult. Instead, we propose a PbRL algorithm that directly learns from preference without requiring any reward modeling. To achieve this, we adopt a contrastive learning framework to design a novel policy scoring metric that assigns a high score to policies that align with the given preferences. We apply our algorithm to offline RL tasks with actual human preference labels and show that our algorithm outperforms or is on par with the existing PbRL methods. Notably, on high-dimensional control tasks, our algorithm surpasses offline RL methods that learn with ground-truth reward information. Finally, we show that our algorithm can be successfully applied to fine-tune large language models.

Compositional Policy Learning in Stochastic Control Systems with Formal Guarantees
Đorđe Žikelić Mathias Lechner Abhinav Verma Krishnendu Chatterjee Thomas A Henzinger



Research question: Reinforcement learning shows promise for complex control tasks, but the lack of formal guarantees about policy behavior impedes its deployment.
Motivation: This paper proposes a novel method for learning a composition of neural network policies in stochastic environments, together with a formal certificate guaranteeing that a specification over the policy's behavior is satisfied with the desired probability.
Method: The approach leverages the compositional nature of logical specifications provided in SpectRL to learn over graphs of probabilistic reach-avoid specifications. Formal guarantees are provided by learning neural network policies together with reach-avoid supermartingales (RASMs) for the graph's sub-tasks and composing them into a global policy.
Results: We implement a prototype of the approach and evaluate it on a Stochastic Nine Rooms environment.

Reinforcement learning has shown promising results in learning neural network policies for complicated control tasks. However, the lack of formal guarantees about the behavior of such policies remains an impediment to their deployment. We propose a novel method for learning a composition of neural network policies in stochastic environments, along with a formal certificate which guarantees that a specification over the policy's behavior is satisfied with the desired probability. Unlike prior work on verifiable RL, our approach leverages the compositional nature of logical specifications provided in SpectRL, to learn over graphs of probabilistic reach-avoid specifications. The formal guarantees are provided by learning neural network policies together with reach-avoid supermartingales (RASM) for the graph’s sub-tasks and then composing them into a global policy. We also derive a tighter lower bound compared to previous work on the probability of reach-avoidance implied by a RASM, which is required to find a compositional policy with an acceptable probabilistic threshold for complex tasks with multiple edge policies. We implement a prototype of our approach and evaluate it on a Stochastic Nine Rooms environment.

Learning to Influence Human Behavior with Offline Reinforcement Learning
Joey Hong Sergey Levine Anca Dragan



Research question: When interacting with people, how can AI agents influence human actions, intentions, and strategies, particularly when human behavior is suboptimal.
Motivation: Most prior work assumes near-optimal human behavior, as in competitive games or autonomous driving. We instead focus on influence in settings that require capturing human suboptimality. For instance, in a collaborative task where people perform poorly due to cognitive biases or lack of information, how could an AI agent steer them toward more optimal behavior?
Method: We learn from an offline dataset of human-human interactions, showing that offline reinforcement learning can effectively influence suboptimal humans by extending and combining elements of observed human-human behavior.
Results: We demonstrate that offline RL solves two challenges of effective influence. First, even when the dataset contains no examples of successful influence, an agent can learn influence strategies from suboptimal human-human interaction data on a variety of tasks and steer humans toward better performance on new tasks. Second, by also modeling and conditioning on human behavior, offline RL can affect not only the human's actions but also their underlying strategy, and adapt to changes in that strategy.

When interacting with people, AI agents do not just influence the state of the world -- they also influence the actions people take in response to the agent, and even their underlying intentions and strategies. Accounting for and leveraging this influence has mostly been studied in settings where it is sufficient to assume that human behavior is near-optimal: competitive games, or general-sum settings like autonomous driving alongside human drivers. Instead, we focus on influence in settings where there is a need to capture human suboptimality. For instance, imagine a collaborative task in which, due either to cognitive biases or lack of information, people do not perform very well -- how could an agent influence them towards more optimal behavior? Assuming near-optimal human behavior will not work here, and so the agent needs to learn from real human data. But experimenting online with humans is potentially unsafe, and creating a high-fidelity simulator of the environment is often impractical. Hence, we focus on learning from an offline dataset of human-human interactions. Our observation is that offline reinforcement learning (RL) can learn to effectively influence suboptimal humans by extending and combining elements of observed human-human behavior. We demonstrate that offline RL can solve two challenges with effective influence. First, we show that by learning from a dataset of suboptimal human-human interaction on a variety of tasks -- none of which contains examples of successful influence -- an agent can learn influence strategies to steer humans towards better performance even on new tasks. Second, we show that by also modeling and conditioning on human behavior, offline RL can learn to affect not just the human's actions but also their underlying strategy, and adapt to changes in their strategy.

Posterior Sampling for Competitive RL: Function Approximation and Partial Observation
Shuang Qiu Ziyu Dai Han Zhong Zhaoran Wang Zhuoran Yang Tong Zhang



Research question: This paper studies posterior sampling algorithms for competitive reinforcement learning with general function approximation.
Motivation: Focusing on zero-sum Markov games (MGs) in the two key settings of self-play and adversarial learning, the paper proposes the self-play and adversarial generalized eluder coefficients (GEC) as complexity measures for function approximation, capturing the exploration-exploitation trade-off in MGs.
Method: Based on the self-play GEC, we propose a model-based self-play posterior sampling method that controls both players to learn a Nash equilibrium and can successfully handle partial observability of states. We further identify a set of partially observable MG models suited to MG learning against the adversarial policies of the opponent and, incorporating the adversarial GEC, propose a model-based posterior sampling method for learning adversarial MGs with potential partial observability.
Results: We provide low regret bounds for the proposed algorithms that scale sublinearly with the proposed GEC and the number of episodes T. To the best of our knowledge, these are the first generic model-based posterior sampling algorithms for competitive RL applicable to the majority of tractable zero-sum MG classes, for both fully and partially observable MGs with self-play and adversarial learning.

This paper investigates posterior sampling algorithms for competitive reinforcement learning (RL) in the context of general function approximations. Focusing on zero-sum Markov games (MGs) under two critical settings, namely self-play and adversarial learning, we first propose the self-play and adversarial generalized eluder coefficient (GEC) as complexity measures for function approximation, capturing the exploration-exploitation trade-off in MGs. Based on self-play GEC, we propose a model-based self-play posterior sampling method to control both players to learn Nash equilibrium, which can successfully handle the partial observability of states. Furthermore, we identify a set of partially observable MG models suited to MG learning against the adversarial policies of the opponent. Incorporating the adversarial GEC, we propose a model-based posterior sampling method for learning adversarial MG with potential partial observability. We further provide low regret bounds for proposed algorithms that can scale sublinearly with the proposed GEC and the number of episodes $T$. To the best of our knowledge, we develop, for the first time, generic model-based posterior sampling algorithms for competitive RL that can be applied to a majority of tractable zero-sum MG classes in both fully observable and partially observable MGs with self-play and adversarial learning.

Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees
Sharan Vaswani Amirreza Kazemi Reza Babanezhad Harikandeh Nicolas Le Roux



Research question: This paper addresses the mismatch between the training objective of actor-critic methods and the true reward-maximization objective in reinforcement learning.
Motivation: In current actor-critic methods, the critic's training objective (minimizing the TD error) may deviate from the true goal of achieving high reward with the actor.
Method: We propose a decision-aware joint objective for training the actor and critic together, and design a generic actor-critic algorithm that can handle any form of function approximation.
Results: Experiments show that the algorithm performs well on simple reinforcement learning problems and guarantees monotonic policy improvement.

Actor-critic (AC) methods are widely used in reinforcement learning (RL), and benefit from the flexibility of using any policy gradient method as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO), and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.

Similarity-based cooperative equilibrium
Caspar Oesterheld Johannes Treutlein Roger Baker Grosse Vincent Conitzer Jakob Nicolaus Foerster



Research question: How to enable machine learning agents to cooperate in the one-shot Prisoner's Dilemma.
Motivation: Standard game theory predicts that in many social dilemmas, such as the one-shot Prisoner's Dilemma, machine learning agents will fail to cooperate with each other.
Method: We introduce a more realistic setting in which agents only observe a single number indicating how similar they are to each other.
Results: We prove that this setting allows the same set of cooperative outcomes as the full transparency setting, and we demonstrate experimentally that cooperation can be learned using simple machine learning methods.

As machine learning agents act more autonomously in the world, they will increasingly interact with each other. Unfortunately, in many social dilemmas like the one-shot Prisoner’s Dilemma, standard game theory predicts that ML agents will fail to cooperate with each other. Prior work has shown that one way to enable cooperative outcomes in the one-shot Prisoner’s Dilemma is to make the agents mutually transparent to each other, i.e., to allow them to access one another’s source code (Rubinstein, 1998; Tennenholtz, 2004) – or weights in the case of ML agents. However, full transparency is often unrealistic, whereas partial transparency is commonplace. Moreover, it is challenging for agents to learn their way to cooperation in the full transparency setting. In this paper, we introduce a more realistic setting in which agents only observe a single number indicating how similar they are to each other. We prove that this allows for the same set of cooperative outcomes as the full transparency setting. We also demonstrate experimentally that cooperation can be learned using simple ML methods.
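As a toy, hedged illustration of the setting (not the paper's construction), the simulation below has two agents play a one-shot Prisoner's Dilemma where each observes only a similarity score and follows a threshold policy; identical agents always match actions, cooperating whenever the observed similarity clears the threshold.

```python
# Toy simulation: one-shot Prisoner's Dilemma with a similarity signal.
import numpy as np

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def threshold_policy(similarity, tau=0.9):
    """Cooperate iff the observed similarity clears the threshold tau."""
    return "C" if similarity >= tau else "D"

rng = np.random.default_rng(0)
totals = np.zeros(2)
for _ in range(1000):
    sim = rng.uniform(0.85, 1.0)         # both agents observe the same score
    a1, a2 = threshold_policy(sim), threshold_policy(sim)
    r1, r2 = PAYOFF[(a1, a2)]
    totals += (r1, r2)
print(totals / 1000)  # average payoffs; high similarity yields (C, C) rounds
```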

BQ-NCO: Bisimulation Quotienting for Efficient Neural Combinatorial Optimization
Darko Drakulic Sofia Michel Florian Mai Arnaud Sors Jean-Marc Andreoli



Research question: Despite the success of neural networks for end-to-end heuristic learning, out-of-distribution generalization remains a challenge.
Motivation: This paper proposes a novel Markov decision process (MDP) formulation of combinatorial optimization problems that effectively leverages their common symmetries to improve out-of-distribution robustness.
Method: Starting from a direct MDP formulation of a constructive method, we introduce a generic state space reduction based on bisimulation quotienting (BQ) in MDPs. For combinatorial optimization problems with a recursive nature, we then specialize the bisimulation and show how the reduced state exploits the symmetries of these problems and facilitates MDP solving.
Results: We illustrate the approach on five classic problems: the Euclidean and asymmetric traveling salesman, capacitated vehicle routing, orienteering, and knapsack problems. For each problem, we introduce a simple attention-based policy network for the BQ-MDPs, trained by imitating (near-)optimal solutions of small instances. We obtain new state-of-the-art results for the five problems on both synthetic and realistic benchmarks. Notably, in contrast to most existing neural approaches, our learned policies generalize excellently to much larger instances than seen during training, without any additional search procedure.

Despite the success of neural-based combinatorial optimization methods for end-to-end heuristic learning, out-of-distribution generalization remains a challenge. In this paper, we present a novel formulation of Combinatorial Optimization Problems (COPs) as Markov Decision Processes (MDPs) that effectively leverages common symmetries of COPs to improve out-of-distribution robustness. Starting from a direct MDP formulation of a constructive method, we introduce a generic way to reduce the state space, based on Bisimulation Quotienting (BQ) in MDPs. Then, for COPs with a recursive nature, we specialize the bisimulation and show how the reduced state exploits the symmetries of these problems and facilitates MDP solving. Our approach is principled and we prove that an optimal policy for the proposed BQ-MDP actually solves the associated COPs. We illustrate our approach on five classical problems: the Euclidean and Asymmetric Traveling Salesman, Capacitated Vehicle Routing, Orienteering and Knapsack Problems. Furthermore, for each problem, we introduce a simple attention-based policy network for the BQ-MDPs, which we train by imitation of (near) optimal solutions of small instances from a single distribution. We obtain new state-of-the-art results for the five COPs on both synthetic and realistic benchmarks. Notably, in contrast to most existing neural approaches, our learned policies show excellent generalization performance to much larger instances than seen during training, without any additional search procedure. Our code is available at https://github.com/naver/bq-nco.

Learning Shared Safety Constraints from Multi-task Demonstrations
Konwoo Kim Gokul Swamy Zuxin Liu Ding Zhao Sanjiban Choudhury Steven Wu



Research question: How to learn and enforce shared safety constraints in an environment so that agents respect them.
Motivation: Manually specifying safety constraints is both time-consuming and error-prone, so these constraints should be learned from expert demonstrations.
Method: We learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints.
Results: Leveraging the diverse demonstrations that naturally occur in the multi-task setting lets us learn a tighter set of constraints, effectively addressing the ill-posedness of the constraint learning problem.

Regardless of the particular task we want to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task setting to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks.

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization
Jinxin Liu Hongyin Zhang Zifeng Zhuang Yachen Kang Donglin Wang Bin Wang



Research question: How to avoid iterative error propagation by decoupling the bi-level optimization of offline RL (value estimation and policy extraction) and performing the outer-level optimization at test time?
Motivation: Existing non-iterative offline RL methods do not fully answer three core questions: 1. What information should be transferred from the inner level to the outer level? 2. What should be taken care of when exploiting the transferred information for safe/confident outer-level optimization? 3. What are the benefits of concurrently conducting outer-level optimization during testing?
Method: Motivated by model-based optimization (MBO), DROP (design from policies) is proposed, which fully answers the above questions. In the inner level, DROP decomposes the offline data into multiple subsets and learns an MBO score model. To safely exploit the score model in the outer level, a behavior embedding is explicitly learned and a conservative regularization is introduced. At test time, DROP permits deployment adaptation, enabling adaptive inference across states.
Results: Empirically, DROP is evaluated on various tasks and achieves comparable or better performance than prior methods.

In this work, we decouple the iterative bi-level optimization of offline RL (value estimation and policy extraction) from the offline training phase, forming a non-iterative bi-level paradigm and avoiding iterative error propagation over the two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) in training, while performing outer-level optimization (policy extraction) in testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts like reward-conditioned policy: (q1) What information should we transfer from the inner-level to the outer-level? (q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? (q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP (design from policies), which fully answers the above questions. Specifically, in the inner-level, DROP decomposes offline data into multiple subsets, and learns an MBO score model (a1). To safely exploit the score model in the outer-level, we explicitly learn a behavior embedding and introduce a conservative regularization (a2). During testing, we show that DROP permits deployment adaptation, enabling an adaptive inference across states (a3). Empirically, we evaluate DROP on various tasks, showing that DROP gains comparable or better performance compared to prior methods.

SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations
Youngsoo Jang Geon-Hyeong Kim Jongmin Lee Sungryull Sohn Byoungjip Kim Honglak Lee Moontae Lee



Research question: How to learn, via imitation, a safe policy that mimics preferred behavior while avoiding non-preferred behavior.
Motivation: In many real-world scenarios, satisfying safety constraints matters more than maximizing the expected return, yet learning to avoid constraint-violating (i.e., non-preferred) behavior is very challenging.
Method: SafeDICE, a hyperparameter-free offline safe imitation learning method, learns a safe policy by leveraging non-preferred demonstrations in the space of stationary distributions.
Results: Experiments show that, compared to baseline algorithms, the method learns a safer policy that satisfies the cost constraint without degrading reward performance.

We consider offline safe imitation learning (IL), where the agent aims to learn a safe policy that mimics preferred behavior while avoiding non-preferred behavior, from non-preferred demonstrations and unlabeled demonstrations. This problem setting corresponds to various real-world scenarios, where satisfying safety constraints is more important than maximizing the expected return. However, it is very challenging to learn the policy to avoid constraint-violating (i.e. non-preferred) behavior, as opposed to standard imitation learning which learns the policy to mimic given demonstrations. In this paper, we present a hyperparameter-free offline safe IL algorithm, SafeDICE, that learns a safe policy by leveraging the non-preferred demonstrations in the space of stationary distributions. Our algorithm directly estimates the stationary distribution corrections of the policy that imitates the demonstrations excluding the non-preferred behavior. In the experiments, we demonstrate that our algorithm learns a safer policy that satisfies the cost constraint without degrading the reward performance, compared to baseline algorithms.

BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning
Xuan Chen Wenbo Guo Guanhong Tao Xiangyu Zhang Dawn Song



Research question: Backdoor attacks pose a severe threat to the supply chain management of deep reinforcement learning (DRL) policies.
Motivation: Although initial defenses have been proposed in recent studies, their generalizability and scalability are very limited.
Method: BIRD is proposed, a technique that detects and removes backdoors from a pretrained DRL policy without any knowledge of the attack specifications or access to its training process.
Results: BIRD is evaluated against three backdoor attacks in ten different single-agent or multi-agent environments; the results verify its effectiveness, efficiency, and generalizability, as well as its robustness to different attack variations and adaptations.

Backdoor attacks pose a severe threat to the supply chain management of deep reinforcement learning (DRL) policies. Despite initial defenses proposed in recent studies, these methods have very limited generalizability and scalability. To address this issue, we propose BIRD, a technique to detect and remove backdoors from a pretrained DRL policy in a clean environment without requiring any knowledge of the attack specifications or access to its training process. By analyzing the unique properties and behaviors of backdoor attacks, we formulate trigger restoration as an optimization problem and design a novel metric to detect backdoored policies. We also design a finetuning method to remove the backdoor, while maintaining the agent's performance in the clean environment. We evaluate BIRD against three backdoor attacks in ten different single-agent or multi-agent environments. Our results verify the effectiveness, efficiency, and generalizability of BIRD, as well as its robustness to different attack variations and adaptations.

Team-PSRO for Learning Approximate TMECor in Large Team Games via Cooperative Reinforcement Learning
Stephen Marcus McAleer Gabriele Farina Gaoyue Zhou Mingzhi Wang Yaodong Yang Tuomas Sandholm



Research question: How to improve algorithmic performance in multi-player zero-sum games.
Motivation: Current algorithms perform superbly in two-player zero-sum games but poorly in multi-player games such as bridge and football.
Method: Two new algorithms are proposed, Team-PSRO and Team-PSRO Mix-and-Match, in which teams learn best responses to the opponent's meta-strategy via reinforcement learning.
Results: Experiments show that both algorithms converge to a TMECor and outperform self-play reinforcement learning in large games.

Recent algorithms have achieved superhuman performance at a number of two-player zero-sum games such as poker and Go. However, many real-world situations are multi-player games. Zero-sum two-team games, such as bridge and football, involve two teams where each member of the team shares the same reward with every other member of that team, and each team has the negative of the reward of the other team. A popular solution concept in this setting, called TMECor, assumes that teams can jointly correlate their strategies before play, but are not able to communicate during play. This setting is harder than two-player zero-sum games because each player on a team has different information and must use their public actions to signal to other members of the team. Prior works either have game-theoretic guarantees but only work in very small games, or are able to scale to large games but do not have game-theoretic guarantees. In this paper we introduce two algorithms: Team-PSRO, an extension of PSRO from two-player games to team games, and Team-PSRO Mix-and-Match which improves upon Team-PSRO by better using population policies. In Team-PSRO, in every iteration both teams learn a joint best response to the opponent's meta-strategy via reinforcement learning. As the reinforcement learning joint best response approaches the optimal best response, Team-PSRO is guaranteed to converge to a TMECor. In experiments on Kuhn poker and Liar's Dice, we show that a tabular version of Team-PSRO converges to TMECor, and a version of Team-PSRO using deep cooperative reinforcement learning beats self-play reinforcement learning in the large game of Google Research Football.
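
The outer loop of PSRO-style methods is compact enough to sketch. Below is an illustrative skeleton (ours, not the authors' code) on a zero-sum matrix game: exact enumeration stands in for the RL joint best response of Team-PSRO, and a uniform meta-strategy stands in for the meta-solver that would compute a TMECor in the team setting.

```python
import numpy as np

# PSRO-style skeleton on a zero-sum matrix game. Populations of pure
# strategies grow by adding each side's best response to the opponent's
# current meta-strategy.
rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6))          # row player's payoff; column player gets -G

pop_row, pop_col = [0], [0]          # populations of pure strategies (indices)

for _ in range(10):
    meta_row = np.ones(len(pop_row)) / len(pop_row)   # uniform meta-strategy
    meta_col = np.ones(len(pop_col)) / len(pop_col)
    # Expected payoff of each candidate action against the opponent's mixture.
    avg_col = sum(p * G[:, j] for p, j in zip(meta_col, pop_col))
    avg_row = sum(p * G[i, :] for p, i in zip(meta_row, pop_row))
    br_row = int(np.argmax(avg_col))  # row maximizes, column minimizes G
    br_col = int(np.argmin(avg_row))
    if br_row not in pop_row: pop_row.append(br_row)
    if br_col not in pop_col: pop_col.append(br_col)

print("row population:", pop_row, "col population:", pop_col)
```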

Reference-Based POMDPs
Edward Kim Yohan Karunanayake Hanna Kurniawati



Research question: How to enable robots to make good decisions under partial observability and non-determinism.
Motivation: Despite advances in solving POMDPs (Partially Observable Markov Decision Processes), problems with long planning horizons and evolving environments remain hard to solve.
Method: A modified POMDP problem, the Reference-Based POMDP, is proposed, in which the objective function is altered to balance the expected total reward against closeness to a given reference (stochastic) policy.
Results: Experimental results show that the Reference-Based POMDP algorithm substantially outperforms POMCP on long-horizon navigation problems.

Making good decisions in partially observable and non-deterministic scenarios is a crucial capability for robots. A Partially Observable Markov Decision Process (POMDP) is a general framework for the above problem. Despite advances in POMDP solving, problems with long planning horizons and evolving environments remain difficult to solve even by the best approximate solvers today. To alleviate this difficulty, we propose a slightly modified POMDP problem, called a Reference-Based POMDP, where the POMDP objective function is slightly modified to balance between maximizing the expected total reward and being close to a given reference (stochastic) policy. The optimal policy of a Reference-Based POMDP can be computed via iterative expectations using the given reference policy, thereby avoiding exhaustive enumeration of actions at each belief node of the search tree. We demonstrate theoretically that the standard POMDP under stochastic policies is related to the Reference-Based POMDP under suitable conditions. To demonstrate the feasibility of exploiting the Reference-Based POMDP formulation, we present a basic algorithm RefSolver. Results from experiments on long-horizon navigation problems indicate that this basic algorithm substantially outperforms POMCP.
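
The modified objective has a convenient closed-form backup. As a hedged, fully observable analog (the paper works on POMDPs and belief states), the sketch below runs a KL-regularized Bellman backup that trades expected reward against closeness to a reference policy $\pi_0$; crucially, the backup is an expectation under $\pi_0$ rather than a max over actions, which is the mechanism that avoids exhaustive action enumeration.

```python
import numpy as np

# KL-regularized value iteration on a random MDP: V(s) is a soft backup,
# (1/lam) * log E_{a~pi0}[exp(lam * Q(s, a))], the optimum of
# max_pi E[Q] - (1/lam) KL(pi || pi0) at each state.
nS, nA, gamma, lam = 4, 3, 0.9, 5.0
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state dist.
R = rng.uniform(size=(nS, nA))
pi0 = np.full((nS, nA), 1.0 / nA)               # uniform reference policy

V = np.zeros(nS)
for _ in range(200):
    Q = R + gamma * P @ V                        # Q[s, a]
    V = np.log((pi0 * np.exp(lam * Q)).sum(axis=1)) / lam

# The induced policy tilts pi0 toward high-value actions.
pi = pi0 * np.exp(lam * (R + gamma * P @ V))
pi /= pi.sum(axis=1, keepdims=True)
print(np.round(pi, 3))
```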

Persuading Farsighted Receivers in MDPs: the Power of Honesty
Martino Bernasconi Matteo Castiglioni Alberto Marchesi Mirco Mutti



Research question: This paper studies how an informed sender can strategically disclose information to influence the behavior of an uninformed receiver, in particular when the receiver interacts sequentially.
Motivation: Existing research focuses on computing optimal information-revelation policies (i.e., signaling schemes) while mostly assuming the receiver maximizes only one-step utility and disregards future rewards; when the receiver is farsighted and accounts for future rewards, finding an optimal Markovian signaling scheme is NP-hard.
Method: An algorithm is proposed that computes optimal and ε-persuasive history-dependent signaling schemes in polynomial time. A convenient subclass of history-dependent schemes, the promise form, is introduced; it is as powerful as general history-dependent schemes and can be represented efficiently.
Results: The results show that history-dependent signaling schemes are more effective than Markovian ones, and that promise-form history-dependent schemes excel in both efficiency and effectiveness.

Bayesian persuasion studies the problem faced by an informed sender who strategically discloses information to influence the behavior of an uninformed receiver. Recently, growing attention has been devoted to settings where the sender and the receiver interact sequentially, in which the receiver's decision-making problem is usually modeled as a Markov decision process (MDP). However, the literature focuses on computing optimal information-revelation policies (a.k.a. signaling schemes) under the restrictive assumption that the receiver acts myopically, selecting actions to maximize the one-step utility and disregarding future rewards. This is justified by the fact that, when the receiver is farsighted and thus considers future rewards, finding an optimal Markovian signaling scheme is NP-hard. In this paper, we show that Markovian signaling schemes do not constitute the "right" class of policies. Indeed, differently from most MDP settings, we show that Markovian signaling schemes are not optimal, and general history-dependent signaling schemes should be considered. Moreover, we also show that history-dependent signaling schemes circumvent the negative complexity results affecting Markovian signaling schemes. Formally, we design an algorithm that computes an optimal and $\epsilon$-persuasive history-dependent signaling scheme in time polynomial in ${1}/{\epsilon}$ and in the instance size. The crucial challenge is that general history-dependent signaling schemes cannot be represented in polynomial space. Nevertheless, we introduce a convenient subclass of history-dependent signaling schemes, called promise-form, which are as powerful as general history-dependent ones and efficiently representable. Intuitively, promise-form signaling schemes compactly encode histories in the form of honest promises about the receiver's future rewards.

Distributional Policy Evaluation: a Maximum Entropy approach to Representation Learning
Riccardo Zamboni Alberto Maria Metelli Marcello Restelli



Research question: This paper proposes a new maximum-entropy framework for policy evaluation in distributional reinforcement learning.
Motivation: The maximum-entropy framework has been employed effectively in a variety of reinforcement learning tasks, but not yet for policy evaluation in a distributional RL setting.
Method: A new maximum-entropy framework named Distributional Maximum Entropy Policy Evaluation (D-Max-Ent PE) is proposed and used to drive representation learning of the state space.
Results: Numerical simulations show that the algorithm matches the expected theoretical behavior and highlight the relationship between aggregations and sample regimes.

The Maximum Entropy (Max-Ent) framework has been effectively employed in a variety of Reinforcement Learning (RL) tasks. In this paper, we first propose a novel Max-Ent framework for policy evaluation in a distributional RL setting, named *Distributional Maximum Entropy Policy Evaluation* (D-Max-Ent PE). We derive a generalization-error bound that depends on the complexity of the representation employed, showing that this framework can explicitly take into account the features used to represent the state space while evaluating a policy. Then, we exploit these favorable properties to drive the representation learning of the state space in a Structural Risk Minimization fashion. We employ state-aggregation functions as feature functions and we specialize the D-Max-Ent approach into an algorithm, named *D-Max-Ent Progressive Factorization*, which constructs a progressively finer-grained representation of the state space by balancing the trade-off between preserving information (bias) and reducing the effective number of states, i.e., the complexity of the representation space (variance). Finally, we report the results of some illustrative numerical simulations, showing that the proposed algorithm matches the expected theoretical behavior and highlighting the relationship between aggregations and sample regimes.

Constrained Policy Optimization with Explicit Behavior Density For Offline Reinforcement Learning
Jing Zhang Chi Zhang Wenjia Wang Bingyi Jing



Research question: Offline reinforcement learning (RL), unable to interact with the environment, faces the challenge of estimating out-of-distribution (OOD) points.
Motivation: Existing methods either constrain the policy to exclude OOD actions or make the Q-function pessimistic, but they can be overly conservative or fail to identify OOD regions accurately.
Method: A Constrained Policy optimization with Explicit behavior Density (CPED) method is proposed that uses a flow-GAN model to explicitly estimate the density of the behavior policy. With an explicit density estimate, CPED can accurately identify the safe region and explore within it, yielding less conservative learned policies.
Results: Experiments show that CPED outperforms existing alternatives on various standard offline RL tasks, achieving higher expected returns.

Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Existing methods for addressing this issue either constrain the policy to exclude OOD actions or make the $Q$ function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To overcome this problem, we propose a Constrained Policy optimization with Explicit Behavior density (CPED) method that utilizes a flow-GAN model to explicitly estimate the density of the behavior policy. By estimating the explicit density, CPED can accurately identify the safe region and enable exploration within the region, resulting in less conservative learning policies. We further provide theoretical results for both the flow-GAN estimator and a performance guarantee for CPED by showing that CPED can find the optimal $Q$-function value. Empirically, CPED outperforms existing alternatives on various standard offline reinforcement learning tasks, yielding higher expected returns.
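
The core mechanism is easy to sketch with a stand-in density estimator. Below, a kernel density estimator plays the role of the paper's flow-GAN (an assumption made purely to keep the example short): the behavior density of $(s, a)$ pairs is estimated explicitly, and greedy action selection is restricted to candidates whose estimated log-density clears a floor.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Explicit-behavior-density idea with a KDE stand-in for the flow-GAN:
# an action is only considered at a state if the estimated behavior density
# of (s, a) clears a threshold; the policy maximizes Q inside that region.
rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(2000, 1))
actions = np.tanh(2 * states) + 0.05 * rng.normal(size=(2000, 1))  # behavior data
kde = KernelDensity(bandwidth=0.1).fit(np.hstack([states, actions]))

def toy_q(s, a):
    # Stand-in Q-function that (wrongly) favors extreme, likely-OOD actions.
    return a.ravel() + 0.1 * s.ravel()

def cped_action(s, n_candidates=256, log_density_floor=0.0):
    cand = rng.uniform(-2, 2, size=(n_candidates, 1))
    sa = np.hstack([np.full_like(cand, s), cand])
    scores = kde.score_samples(sa)                 # explicit log-density
    safe = scores >= log_density_floor
    if not safe.any():                             # fall back to densest point
        safe = scores >= scores.max()
    q = toy_q(np.full_like(cand, s), cand)
    q[~safe] = -np.inf
    return float(cand[np.argmax(q), 0])

print(cped_action(0.5))   # stays near tanh(1.0) ~= 0.76 rather than 2.0
```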

Learning to Discover Skills through Guidance
Hyunseung Kim Byungkun Lee Hojoon Lee Dongyoon Hwang Sejik Park Kyushik Min Jaegul Choo



Research question: In unsupervised skill discovery (USD), a major challenge is limited exploration, mainly because of the substantial penalties incurred when skills deviate from their initial trajectories.
Motivation: To enhance exploration, existing methods add auxiliary rewards that maximize the epistemic uncertainty or entropy of states, but the effectiveness of these rewards declines as environment complexity rises.
Method: A new USD algorithm, DISCO-DANCE, is proposed: it first selects the guide skill most likely to reach unexplored states, then guides the other skills to follow the guide skill, and finally disperses the guided skills to maximize their discriminability in unexplored states.
Results: Experiments show that DISCO-DANCE outperforms other USD baselines in challenging environments, including two navigation benchmarks and a continuous control benchmark.

In the field of unsupervised skill discovery (USD), a major challenge is limited exploration, primarily due to substantial penalties when skills deviate from their initial trajectories. To enhance exploration, recent methodologies employ auxiliary rewards to maximize the epistemic uncertainty or entropy of states. However, we have identified that the effectiveness of these rewards declines as the environmental complexity rises. Therefore, we present a novel USD algorithm, skill **disco**very with gui**dance** (**DISCO-DANCE**), which (1) selects the guide skill that possesses the highest potential to reach unexplored states, (2) guides other skills to follow the guide skill, and then (3) disperses the guided skills to maximize their discriminability in unexplored states. Empirical evaluation demonstrates that DISCO-DANCE outperforms other USD baselines in challenging environments, including two navigation benchmarks and a continuous control benchmark. Qualitative visualizations and code of DISCO-DANCE are available at https://mynsng.github.io/discodance/.

Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models
Xingyuan Zhang Philip Becker-Ehmck Patrick van der Smagt Maximilian Karl



Research question: How to learn new behaviours quickly by observing and imitating others, particularly given that reinforcement learning requires large amounts of environment interaction.
Motivation: Humans can learn new behaviours quickly by observing and imitating others, largely because they possess a model of their own embodiment that lets them infer the most likely actions behind an observed behaviour.
Method: Action Inference by Maximising Evidence (AIME) replicates this ability using world models. AIME has two phases: in the first, the agent learns a world model from past experience by maximizing the ELBO, in order to understand its own body; in the second, given observation-only demonstrations of an expert performing a novel task, the agent tries to imitate the expert's behaviour.
Results: The method is validated empirically on the Walker and Cheetah embodiments of the DeepMind Control Suite, where its zero-shot imitation performance exceeds state-of-the-art baselines.

Unlike most reinforcement learning agents which require an unrealistic amount of environment interactions to learn a new behaviour, humans excel at learning quickly by merely observing and imitating others. This ability highly depends on the fact that humans have a model of their own embodiment that allows them to infer the most likely actions that led to the observed behaviour. In this paper, we propose Action Inference by Maximising Evidence (AIME) to replicate this behaviour using world models. AIME consists of two distinct phases. In the first phase, the agent learns a world model from its past experience to understand its own body by maximising the ELBO. In the second phase, the agent is given some observation-only demonstrations of an expert performing a novel task and tries to imitate the expert's behaviour. AIME achieves this by defining a policy as an inference model and maximising the evidence of the demonstration under the policy and world model. Our method is "zero-shot" in the sense that it does not require further training for the world model or online interactions with the environment once the demonstration is given. We empirically validate the zero-shot imitation performance of our method on the Walker and Cheetah embodiments of the DeepMind Control Suite and find it outperforms the state-of-the-art baselines. Code is available at: https://github.com/argmax-ai/aime.

Hybrid Policy Optimization from Imperfect Demonstrations
Hanlin Yang Chao Yu peng sun Siji Chen



Research question: How to use a small number of imperfect demonstrations to accelerate an agent's online learning in reinforcement learning.
Motivation: In real-world applications, high-quality expert demonstrations are usually costly or even impossible to collect, so a method is needed that works around this.
Method: A new reinforcement learning algorithm, HYbrid Policy Optimization (HYPO), is proposed, which uses a small number of imperfect demonstrations to guide an online agent toward efficient exploration.
Results: Experiments show that HYPO significantly outperforms several baselines on a range of challenging tasks, such as sparse-reward MuJoCo environments, Google Research Football, and the AirSim drone simulation.

Exploration is one of the main challenges in Reinforcement Learning (RL), especially in environments with sparse rewards. Learning from Demonstrations (LfD) is a promising approach to solving this problem by leveraging expert demonstrations. However, expert demonstrations of high quality are usually costly or even impossible to collect in real-world applications. In this work, we propose a novel RL algorithm called HYbrid Policy Optimization (HYPO), which uses a small number of imperfect demonstrations to accelerate an agent's online learning process. The key idea is to train an offline guider policy using imitation learning in order to instruct an online agent policy to explore efficiently. Through mutual update of the guider policy and the agent policy, the agent can leverage suboptimal demonstrations for efficient exploration while avoiding the conservative policy caused by imperfect demonstrations. Empirical results show that HYPO significantly outperforms several baselines in various challenging tasks, such as MuJoCo with sparse rewards, Google Research Football, and the AirSim drone simulation.

Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
Nathan Rahn Pierluca D'Oro Harley Wiltzer Pierre-Luc Bacon Marc G Bellemare



Research question: Deep reinforcement learning agents exhibit unstable performance on continuous control tasks.
Motivation: Studying the return landscape, the mapping between policies and returns, offers a fresh perspective for understanding these behaviors.
Method: Take a distributional view of returns, map out failure-prone regions of policy space, and reveal a hidden dimension of policy quality.
Results: Finding simple paths in parameter space improves the stability of a policy and thereby its robustness.

Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.

Semantic HELM: A Human-Readable Memory for Reinforcement Learning
Fabian Paischer Thomas Adler Markus Hofmarcher Sepp Hochreiter



Research question: Reinforcement learning agents deployed in the real world must often cope with partially observable environments, and existing methods lack interpretability.
Motivation: To make reinforcement learning agents more interpretable, a novel memory mechanism is proposed that represents past events in human language.
Method: CLIP is used to associate visual inputs with language tokens; these tokens are then fed to a pretrained language model that serves as the agent's memory, providing a coherent and comprehensible representation of the past.
Results: Trained on a set of partially observable environments, the memory mechanism excels on tasks requiring a memory component while mostly matching strong baselines on tasks that do not. On a challenging continuous recognition task it converges two orders of magnitude faster than prior methods. Because the memory is human-readable, an agent's memory can be inspected to check whether crucial information has been stored, which significantly improves troubleshooting and paves the way toward more interpretable agents.

Reinforcement learning agents deployed in the real world often have to cope with partially observable environments. Therefore, most agents employ memory mechanisms to approximate the state of the environment. Recently, there have been impressive success stories in mastering partially observable environments, mostly in the realm of computer games like Dota 2, StarCraft II, or Minecraft. However, existing methods lack interpretability in the sense that it is not comprehensible for humans what the agent stores in its memory. In this regard, we propose a novel memory mechanism that represents past events in human language. Our method uses CLIP to associate visual inputs with language tokens. Then we feed these tokens to a pretrained language model that serves the agent as memory and provides it with a coherent and human-readable representation of the past. We train our memory mechanism on a set of partially observable environments and find that it excels on tasks that require a memory component, while mostly attaining performance on par with strong baselines on tasks that do not. On a challenging continuous recognition task, where memorizing the past is crucial, our memory mechanism converges two orders of magnitude faster than prior methods. Since our memory mechanism is human-readable, we can peek at an agent's memory and check whether crucial pieces of information have been stored. This significantly enhances troubleshooting and paves the way toward more interpretable agents.

Consistent Aggregation of Objectives with Diverse Time Preferences Requires Non-Markovian Rewards
Silviu Pitis



Research question: How should the reward function of a multi-objective agent be set in a principled way?
Motivation: As the capabilities of artificial agents improve, they are deployed to serve diverse objectives and stakeholders, yet these objectives are usually composed ad hoc, with no clear justification.
Method: Starting from a set of intuitively appealing axioms, it is proven that Markovian aggregation of Markovian reward functions is impossible when the time preference (discount factor) of each objective may vary; optimal multi-objective agents must therefore accept rewards that are non-Markovian with respect to the individual objectives. A practical non-Markovian aggregation scheme is proposed that overcomes the impossibility with only one extra parameter per objective.
Results: The work offers new insights into sequential multi-objective agency and intertemporal choice, with practical implications for designing AI systems that serve multiple principals with differing time preferences.

As the capabilities of artificial agents improve, they are being increasingly deployed to service multiple diverse objectives and stakeholders. However, the composition of these objectives is often performed ad hoc, with no clear justification. This paper takes a normative approach to multi-objective agency: from a set of intuitively appealing axioms, it is shown that Markovian aggregation of Markovian reward functions is not possible when the time preference (discount factor) for each objective may vary. It follows that optimal multi-objective agents must admit rewards that are non-Markovian with respect to the individual objectives. To this end, a practical non-Markovian aggregation scheme is proposed, which overcomes the impossibility with only one additional parameter for each objective. This work offers new insights into sequential, multi-objective agency and intertemporal choice, and has practical implications for the design of AI systems deployed to serve multiple generations of principals with varying time preference.

A Definition of Continual Reinforcement Learning
David Abel Andre Barreto Benjamin Van Roy Doina Precup Hado van Hasselt Satinder Singh



Research question: This paper aims to define the continual reinforcement learning problem rigorously, highlighting its commitments and making its main concepts precise.
Motivation: Despite the importance of continual reinforcement learning, the field lacks a simple, clear problem definition, leaving its primary concepts imprecise.
Method: Through a new mathematical language for analyzing and cataloging agents, the notion of agents that "never stop learning" is formalized: a continual learning agent is defined as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents.
Results: These definitions and perspectives formalize many intuitive concepts at the heart of learning and open new research pathways around continual learning agents.

In a standard view of the reinforcement learning problem, an agent’s goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that “never stop learning” through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.

Conservative State Value Estimation for Offline Reinforcement Learning
Liting Chen Jie Yan Zhengdao Shao Lu Wang Qingwei Lin Saravan Rajmohan Thomas Moscibroda Dongmei Zhang



Research question: In offline reinforcement learning, the distributional drift between the dataset and the currently learned policy causes value over-estimation and, in turn, learning failure.
Motivation: To address this distributional drift, a new method, Conservative State Value Estimation (CSVE), is proposed that learns a conservative V-function by directly penalizing OOD states.
Method: CSVE incorporates a penalty term into the reward or value estimation in the Bellman iterations while avoiding extrapolation over OOD states and actions, enabling more effective state-value estimation.
Results: On the classic continuous control tasks of D4RL, the method outperforms conservative Q-function learning approaches and is strongly competitive with recent SOTA methods.

Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, leading to learning failure in practice. The common approach is to incorporate a penalty term into the reward or value estimation in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function via directly imposing penalty on OOD states. Compared to prior work, CSVE allows more effective state value estimation with conservative guarantees and, in turn, better policy optimization. Further, we apply CSVE and develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states around the dataset, and the actor applies advantage weighted updates extended with state exploration to improve the policy. We evaluate our method on the classic continuous control tasks of D4RL, showing that it performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
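
A minimal sketch of the conservative state-value loss (ours; the exact sampling scheme and penalty weighting in the paper differ): a TD objective plus a term that pushes $V$ down on states sampled around the dataset and up on dataset states.

```python
import torch
import torch.nn as nn

# Conservative V-learning sketch: OOD states are crudely simulated by Gaussian
# perturbations of dataset states; the conservative term pushes V down on them
# and up on in-dataset states (a CQL-style penalty on states).
torch.manual_seed(0)
V = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(V.parameters(), lr=3e-4)
gamma, beta = 0.99, 1.0

s = torch.randn(256, 3)                  # dataset states
s_next = s + 0.1 * torch.randn_like(s)   # dataset next states
r = torch.randn(256, 1)                  # dataset rewards

for _ in range(200):
    td_target = r + gamma * V(s_next).detach()
    td_loss = ((V(s) - td_target) ** 2).mean()
    s_ood = s + 0.5 * torch.randn_like(s)          # states around the dataset
    conservative = V(s_ood).mean() - V(s).mean()   # push OOD down, data up
    loss = td_loss + beta * conservative
    opt.zero_grad(); loss.backward(); opt.step()

# Typically positive after training: dataset states are valued above OOD ones.
print(float(V(s).mean() - V(s + 0.5 * torch.randn_like(s)).mean()))
```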

Hybrid Search for Efficient Planning with Completeness Guarantees
Kalle Kujanpää Joni Pajarinen Alexander Ilin



Research question: Solving complex planning problems, a long-standing challenge in computer science.
Motivation: Learning-based subgoal search methods show promise on such problems but often lack completeness guarantees: they may fail to find a solution even when one exists.
Method: An efficient approach is proposed for augmenting subgoal search methods to achieve completeness in discrete action spaces: the high-level search is augmented with low-level actions to execute a multi-level (hybrid) search, called complete subgoal search.
Results: Experiments show that complete subgoal search not only guarantees completeness but can even improve search-expansion performance on instances the high-level search alone can solve. The approach makes subgoal-level planning applicable in critical systems where completeness is required.

Solving complex planning problems has been a long-standing challenge in computer science. Learning-based subgoal search methods have shown promise in tackling these problems, but they often suffer from a lack of completeness guarantees, meaning that they may fail to find a solution even if one exists. In this paper, we propose an efficient approach to augment a subgoal search method to achieve completeness in discrete action spaces. Specifically, we augment the high-level search with low-level actions to execute a multi-level (hybrid) search, which we call complete subgoal search. This solution achieves the best of both worlds: the practical efficiency of high-level search and the completeness of low-level search. We apply the proposed search method to a recently proposed subgoal search algorithm and evaluate the algorithm, trained on offline data, on complex planning problems. We demonstrate that our complete subgoal search not only guarantees completeness but can even improve performance in terms of search expansions on instances that the high-level search could solve without low-level augmentations. Our approach makes it possible to apply subgoal-level planning for systems where completeness is a critical requirement.
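
A small grid-world sketch (ours, with a deliberately unreliable stand-in for the learned subgoal proposer) shows why the hybrid successor set restores completeness: primitive moves are always expanded alongside subgoal proposals, so the best-first search can still reach the goal even if every proposal fails.

```python
import heapq
from collections import deque

WALLS = {(1, 1), (1, 2), (1, 3), (2, 3), (3, 1)}
SIZE, START, GOAL = 5, (0, 0), (4, 4)

def valid(c):
    return 0 <= c[0] < SIZE and 0 <= c[1] < SIZE and c not in WALLS

def primitive_successors(c):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        n = (c[0] + dx, c[1] + dy)
        if valid(n):
            yield n, 1                       # one step per primitive action

def low_level_cost(src, dst, max_steps=4):
    """Bounded BFS standing in for a low-level policy reaching a subgoal."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        cur, d = frontier.popleft()
        if cur == dst:
            return d
        if d < max_steps:
            for n, _ in primitive_successors(cur):
                if n not in seen:
                    seen.add(n); frontier.append((n, d + 1))
    return None

def subgoal_successors(c):
    # Stand-in for a learned proposer: jump two cells toward the goal.
    n = (min(c[0] + 2, SIZE - 1), min(c[1] + 2, SIZE - 1))
    if n != c and valid(n):
        cost = low_level_cost(c, n)
        if cost is not None:
            yield n, cost

def complete_subgoal_search(start, goal):
    frontier, best = [(0, start)], {start: 0}
    while frontier:
        cost, cur = heapq.heappop(frontier)
        if cur == goal:
            return cost
        # Hybrid successor set: subgoal proposals PLUS primitive actions.
        for succ, step in list(subgoal_successors(cur)) + list(primitive_successors(cur)):
            if succ not in best or cost + step < best[succ]:
                best[succ] = cost + step
                heapq.heappush(frontier, (cost + step, succ))
    return None

print(complete_subgoal_search(START, GOAL))  # 8 primitive steps on this grid
```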

Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning
Seungyong Moon Junyoung Yeom Bumsoo Park Hyun Oh Song



Research question: Discovering achievements with a hierarchical structure in procedurally generated environments is a significant challenge.
Motivation: This requires an agent with a broad range of abilities, including generalization and long-term reasoning. Many prior methods build on model-based or hierarchical approaches, believing an explicit long-term planning module helps learn hierarchical dependencies; however, they demand excessive environment interactions or large models, limiting their practicality.
Method: It is shown that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms previous methods when optimized with recent implementation practices. Moreover, the PPO agent can predict, to some extent and with limited confidence, the next achievement to be unlocked. Building on this observation, a novel contrastive learning method called achievement distillation is introduced, which strengthens the agent's ability to predict the next achievement.
Results: The method shows a strong capacity for discovering hierarchical achievements and achieves state-of-the-art, sample-efficient performance on the challenging Crafter environment while using fewer model parameters.

Discovering achievements with a hierarchical structure in procedurally generated environments presents a significant challenge. This requires an agent to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods have been built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be advantageous for learning hierarchical dependencies. However, these methods demand an excessive number of environment interactions or large model sizes, limiting their practicality. In this work, we demonstrate that proximal policy optimization (PPO), a simple yet versatile model-free algorithm, outperforms previous methods when optimized with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, albeit with limited confidence. Based on this observation, we introduce a novel contrastive learning method, called achievement distillation, which strengthens the agent's ability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment in a sample-efficient manner while utilizing fewer model parameters.

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach
Riccardo Poiani Nicole Nobili Alberto Maria Metelli Marcello Restelli



Research question: In existing reinforcement learning algorithms, policy evaluation is mostly done via Monte Carlo simulation with fixed-length trajectories, but is this data-collection strategy the best option?
Motivation: Improving the quality of policy evaluation calls for more effective data-collection strategies.
Method: An adaptive data-collection optimization algorithm named RIDO is proposed, which splits the available interaction budget into mini-batches and, at each round, determines the trajectory schedule that minimizes an empirical and robust estimate of the estimator's variance.
Results: Experiments show that RIDO adapts its trajectory schedule toward timesteps that require more sampling, improving the quality of the final estimate.

Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of *fixed length* within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we consider as quality index the variance of an unbiased policy return estimator that uses trajectories of different lengths, i.e., *truncated*. We first derive a closed-form expression of this variance that clearly shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially might be able to allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the variance of the final estimate. Building on these findings, we present an *adaptive* algorithm called **R**obust and **I**terative **D**ata collection strategy **O**ptimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust estimate of the estimator's variance. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.
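
The schedule-optimization step can be illustrated numerically. The sketch below (ours; it assumes independent rewards with known per-timestep variances, where the paper uses robust empirical estimates refreshed each mini-batch) greedily grows a truncation schedule by spending each extra trajectory where it most reduces the estimator's variance per transition.

```python
import numpy as np

# Truncated-trajectory schedule sketch: m[h] trajectories are truncated after
# h+1 steps, so a trajectory of that length covers timesteps 0..h. The
# estimator variance decomposes per timestep under an independence assumption.
H, gamma, budget = 10, 0.99, 400           # horizon, discount, transition budget
sigma2 = np.linspace(0.1, 2.0, H)          # per-timestep reward variances

def estimator_variance(m):
    n = np.array([m[t:].sum() for t in range(H)])  # samples covering step t
    return sum(gamma ** (2 * t) * sigma2[t] / max(n[t], 1) for t in range(H))

m = np.zeros(H, dtype=int)
m[-1] = 1                                  # seed with one full-length trajectory
spent = H
while spent < budget:
    gains = []
    for h in range(H):                     # marginal gain per transition spent
        m2 = m.copy(); m2[h] += 1
        gains.append((estimator_variance(m) - estimator_variance(m2)) / (h + 1))
    h = int(np.argmax(gains))
    m[h] += 1
    spent += h + 1

print(m)   # more budget flows to truncation lengths covering high-variance steps
```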

Self-Predictive Universal AI
Elliot Catt Jordi Grau-Moya Marcus Hutter Matthew Aitchison Tim Genewein Gregoire Deletang Li Kevin Wenliang Joel Veness



Research question: This paper proposes Self-AIXI, a new universal agent that maximally exploits learning to obtain good policies.
Motivation: Existing reinforcement learning algorithms typically combine learning and planning techniques to derive effective policies; Self-AIXI takes the opposite route, obtaining good policies by self-predicting its own action data.
Method: Self-AIXI generates its policy by self-predicting its own stream of action data which, similarly to other TD(0) agents, is produced by an action-maximization step over the current on-policy Q-value estimates.
Results: It is proven that Self-AIXI converges to AIXI and inherits a series of desirable properties, such as maximal Legg-Hutter intelligence and the self-optimizing property.

Reinforcement Learning (RL) algorithms typically utilize learning and/or planning techniques to derive effective policies. The integration of both approaches has proven to be highly successful in addressing complex sequential decision-making challenges, as evidenced by algorithms such as AlphaZero and MuZero, which consolidate the planning process into a parametric search-policy. AIXI, the most potent theoretical universal agent, leverages planning through comprehensive search as its primary means to find an optimal policy. Here we define an alternative universal agent, which we call Self-AIXI, that, contrary to AIXI, maximally exploits learning to obtain good policies. It does so by self-predicting its own stream of action data, which is generated, similarly to other TD(0) agents, by taking an action maximization step over the current on-policy (universal mixture-policy) Q-value estimates. We prove that Self-AIXI converges to AIXI, and inherits a series of properties like maximal Legg-Hutter intelligence and the self-optimizing property.

Model-Free Active Exploration in Reinforcement Learning
Alessio Russo Alexandre Proutiere



Research question: This paper addresses the exploration problem in reinforcement learning and presents a novel model-free solution.
Motivation: Existing sample-optimal exploration algorithms rely on estimating a model of the system; the proposed method requires no model.
Method: Starting from the instance-specific lower bound on the number of samples, the optimal exploration strategy is derived, and an ensemble-based model-free exploration strategy is designed.
Results: Numerical results show that the strategy identifies efficient policies faster than state-of-the-art exploration approaches.

We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound of the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.

Self-Supervised Reinforcement Learning that Transfers using Random Features
Boyuan Chen Chuning Zhu Pulkit Agrawal Kaiqing Zhang Abhishek Gupta



Research question: Solving single-task sequential decision-making problems with high-dimensional observations and long horizons while enabling behavior transfer across tasks.
Motivation: Model-free reinforcement learning algorithms have great potential for such problems but struggle to generalize across tasks, while model-based methods naturally transfer across reward functions but are hard to scale to complex environments.
Method: A self-supervised reinforcement learning method is proposed: model-free RL is pre-trained in a self-supervised way with random features as rewards, implicitly modeling long-horizon environment dynamics; planning with these implicit models then enables fast adaptation to new problems and reward functions.
Results: The method transfers behaviors across a variety of simulated manipulation and locomotion tasks, opening the door to generalist decision-making agents.

Model-free reinforcement learning algorithms have exhibited great potential in solving single-task sequential decision-making problems with high-dimensional observations and long horizons, but are known to be hard to generalize across tasks. Model-based RL, on the other hand, learns task-agnostic models of the world that naturally enable transfer across different reward functions, but struggles to scale to complex environments due to the compounding error. To get the best of both worlds, we propose a self-supervised reinforcement learning method that enables the transfer of behaviors across tasks with different rewards, while circumventing the challenges of model-based RL. In particular, we show that self-supervised pre-training of model-free reinforcement learning with a number of random features as rewards allows implicit modeling of long-horizon environment dynamics. Then, planning techniques like model-predictive control using these implicit models enable fast adaptation to problems with new reward functions. Our method is self-supervised in that it can be trained on offline datasets without reward labels, but can then be quickly deployed on new tasks. We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains in simulation, opening the door to generalist decision-making agents.
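
The key linearity argument admits a compact tabular illustration (a simplification of the paper's deep, model-free variant): policy evaluation is linear in the reward, so Q-functions pre-trained on random reward features can be linearly recombined, via a least-squares fit of the new reward on those features, into the value of the same policy under the new reward.

```python
import numpy as np

# Tabular sketch: pre-train one Q per random reward feature by solving the
# linear Bellman system, then evaluate any new reward by regressing it on the
# features and recombining the pre-trained Q's with the fitted weights.
rng = np.random.default_rng(0)
nS, nA, d, gamma = 20, 4, 32, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))       # dynamics P[s, a] -> s'
pi = rng.dirichlet(np.ones(nA), size=nS)            # fixed evaluation policy
phi = rng.normal(size=(nS, nA, d))                  # random reward features

# Pre-training: solve (I - gamma * P_pi) Q_i = phi_i for every feature i.
P_pi = np.einsum("san,nb->sanb", P, pi).reshape(nS * nA, nS * nA)
A = np.eye(nS * nA) - gamma * P_pi
Q_features = np.linalg.solve(A, phi.reshape(nS * nA, d))

# New task: fit its reward on the random features, recombine the Q's.
r_new = rng.uniform(size=(nS, nA))
w, *_ = np.linalg.lstsq(phi.reshape(-1, d), r_new.ravel(), rcond=None)
Q_fast = (Q_features @ w).reshape(nS, nA)

Q_true = np.linalg.solve(A, r_new.ravel()).reshape(nS, nA)
print(np.abs(Q_fast - Q_true).max())  # error governed by the reward-regression fit
```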

FlowPG: Action-constrained Policy Gradient with Normalizing Flows
Janaka Chathuranga Brahmanage Jiajing Ling Akshat Kumar



Research question: How to guarantee that actions in reinforcement learning satisfy constraints while improving training speed and convergence.
Motivation: In action-constrained reinforcement learning, ensuring the validity of every action is a major challenge. The common projection-layer approach requires solving an optimization problem, which can cause long training times, slow convergence, and the zero-gradient problem.
Method: First, a normalizing flow model learns an invertible, differentiable mapping between the feasible action space and the support of a latent variable (e.g., a Gaussian). Second, since learning the flow requires sampling from the feasible action space, itself a challenge, sampling methods based on Hamiltonian Monte Carlo and probabilistic sentential decision diagrams are developed for convex and non-convex constraints. Finally, the learned flow is integrated with the DDPG algorithm; by design, a well-trained flow transforms policy outputs into valid actions without an optimization solver.
Results: Experiments show significantly fewer constraint violations (up to an order of magnitude in some cases) and several-fold speedups on a variety of continuous control tasks.

Action-constrained reinforcement learning (ACRL) is a popular approach for solving safety-critical and resource-allocation related decision making problems. A major challenge in ACRL is to ensure that the agent takes a valid action satisfying the constraints at each RL step. The commonly used approach of adding a projection layer on top of the policy network requires solving an optimization program, which can result in longer training times, slow convergence, and the zero-gradient problem. To address this, first we use a normalizing flow model to learn an invertible, differentiable mapping between the feasible action space and the support of a simple distribution on a latent variable, such as Gaussian. Second, learning the flow model requires sampling from the feasible action space, which is also challenging. We develop multiple methods, based on Hamiltonian Monte-Carlo and probabilistic sentential decision diagrams, for such action sampling for convex and non-convex constraints. Third, we integrate the learned normalizing flow with the DDPG algorithm. By design, a well-trained normalizing flow will transform policy output into a valid action without requiring an optimization solver. Empirically, our approach results in significantly fewer constraint violations (up to an order of magnitude for several instances) and is multiple times faster on a variety of continuous control tasks.
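
A tiny RealNVP-style flow makes the mechanism concrete. In this sketch (ours, not the paper's model), the feasible set is a unit disk, rejection sampling stands in for the paper's HMC/PSDD samplers, and the flow is trained by maximum likelihood; afterwards, mapping a latent policy output to an action is a single differentiable pass with no projection solver.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Coupling(nn.Module):
    """One affine coupling layer of a 2-D RealNVP-style flow."""
    def __init__(self, flip):
        super().__init__()
        self.flip = flip
        self.net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))

    def inverse(self, x):            # action -> latent, with log|det|
        a, b = (x[:, 1:], x[:, :1]) if self.flip else (x[:, :1], x[:, 1:])
        s, t = self.net(a).chunk(2, dim=1)
        b = (b - t) * torch.exp(-s)
        return torch.cat([b, a] if self.flip else [a, b], dim=1), -s.sum(dim=1)

    def forward(self, z):            # latent -> action
        a, b = (z[:, 1:], z[:, :1]) if self.flip else (z[:, :1], z[:, 1:])
        s, t = self.net(a).chunk(2, dim=1)
        b = b * torch.exp(s) + t
        return torch.cat([b, a] if self.flip else [a, b], dim=1)

flow = nn.ModuleList([Coupling(i % 2 == 1) for i in range(4)])

def log_prob(x):
    logdet = 0.0
    for layer in reversed(flow):
        x, ld = layer.inverse(x)
        logdet = logdet + ld
    base = -0.5 * (x ** 2).sum(dim=1) - 0.5 * x.shape[1] * torch.log(torch.tensor(2 * torch.pi))
    return base + logdet

def sample_disk(n):                  # rejection sampling from the feasible set
    pts = torch.rand(4 * n, 2) * 2 - 1
    return pts[(pts ** 2).sum(dim=1) <= 1][:n]

opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(500):              # MLE on feasible-set samples
    loss = -log_prob(sample_disk(256)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

z = torch.randn(1000, 2)             # pretend these are raw policy outputs
with torch.no_grad():
    actions = z
    for layer in flow:
        actions = layer(actions)
violations = ((actions ** 2).sum(dim=1) > 1.0).float().mean()
print(f"fraction outside feasible set: {violations:.3f}")  # should be small
```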

Doubly Robust Augmented Transfer for Meta-Reinforcement Learning
Yuankun Jiang Nuowen Kan Chenglin Li Wenrui Dai Junni Zou Hongkai Xiong



Research question: Meta-reinforcement learning (meta-RL) suffers performance degradation in sparse-reward settings.
Motivation: Current hindsight-based sample-transfer methods can alleviate this issue, but they are constrained by the unrealistic assumption that tasks differ only in their reward functions.
Method: A doubly robust augmented transfer (DRaT) approach is proposed, targeting the more general sparse-reward meta-RL scenario with both dynamics mismatches and varying reward functions across tasks.
Results: Experiments show that DRaT significantly outperforms other hindsight-based approaches on various sparse-reward MuJoCo locomotion tasks with varying dynamics and reward functions.

Meta-reinforcement learning (Meta-RL), though enabling a fast adaptation to learn new skills by exploiting the common structure shared among different tasks, suffers performance degradation in the sparse-reward setting. Current hindsight-based sample transfer approaches can alleviate this issue by transferring relabeled trajectories from other tasks to a new task so as to provide informative experience for the target reward function, but are unfortunately constrained with the unrealistic assumption that tasks differ only in reward functions. In this paper, we propose a doubly robust augmented transfer (DRaT) approach, aiming at addressing the more general sparse reward meta-RL scenario with both dynamics mismatches and varying reward functions across tasks. Specifically, we design a doubly robust augmented estimator for efficient value-function evaluation, which tackles dynamics mismatches with the optimal importance weight of transition distributions achieved by minimizing the theoretically derived upper bound of mean squared error (MSE) between the estimated values of transferred samples and their true values in the target task. Due to its intractability, we then propose an interval-based approximation to this optimal importance weight, which is guaranteed to cover the optimum with a constrained and sample-independent upper bound on the MSE approximation error. Based on our theoretical findings, we finally develop a DRaT algorithm for transferring informative samples across tasks during the training of meta-RL. We implement DRaT on an off-policy meta-RL baseline, and empirically show that it significantly outperforms other hindsight-based approaches on various sparse-reward MuJoCo locomotion tasks with varying dynamics and reward functions.

Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization
Nathan Grinsztajn Daniel Furelos-Blanco Shikha Surana Clément Bonnet Thomas D Barrett



Research question: How to apply reinforcement learning to combinatorial optimization problems given their inherent complexity.
Motivation: Because of this inherent complexity, an agent cannot be expected to solve such problems in a single shot, so additional search strategies are needed.
Method: A method for learning a population of complementary policies is proposed: the Poppy training procedure induces unsupervised specialization by maximizing the performance of the population.
Results: Poppy obtains state-of-the-art reinforcement learning results on three popular NP-hard problems: the traveling salesman, capacitated vehicle routing, and job-shop scheduling problems.

Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam-search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on three popular NP-hard problems: traveling salesman, capacitated vehicle routing, and job-shop scheduling.
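
The "winner takes it all" training signal is simple to sketch on a toy problem (ours, not the authors' TSP setup): every policy in the population is rolled out on each instance, but only the best-performing one receives a REINFORCE update, which induces specialization without any explicit diversity bonus.

```python
import torch

# Winner-takes-all population training on a toy contextual task with two
# instance families (optimal actions +1 and -1): per instance, roll out all
# policies, update only the winner with REINFORCE.
torch.manual_seed(0)
K = 4
policies = [torch.nn.Linear(2, 1) for _ in range(K)]
opts = [torch.optim.Adam(p.parameters(), lr=1e-2) for p in policies]
baselines = [0.0] * K

def reward(ctx, action):
    target = torch.sign(ctx[:, :1])            # family decided by the context
    return -(action - target) ** 2

for step in range(2000):
    ctx = torch.randn(1, 2)                    # one problem instance
    samples = []
    for p in policies:
        dist = torch.distributions.Normal(p(ctx), 0.3)
        act = dist.sample()
        samples.append((dist.log_prob(act), reward(ctx, act)))
    winner = max(range(K), key=lambda k: samples[k][1].item())
    log_prob, r = samples[winner]
    adv = r.item() - baselines[winner]          # simple running baseline
    baselines[winner] = 0.9 * baselines[winner] + 0.1 * r.item()
    loss = -(log_prob * adv).sum()              # REINFORCE on the winner only
    opts[winner].zero_grad(); loss.backward(); opts[winner].step()

for k, p in enumerate(policies):                # policies specialize by family
    left = p(torch.tensor([[-1.0, 0.0]])).item()
    right = p(torch.tensor([[1.0, 0.0]])).item()
    print(k, round(left, 2), round(right, 2))
```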

Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning
Zih-Yun Chiu Yi-Lin Tuan William Yang Wang Michael C. Yip



Research question: How can reinforcement learning agents approach human learning efficiency while flexibly using and generalizing external knowledge?
Motivation: Humans learn by observing the strategies of others attempting a task, whereas existing reinforcement learning models struggle to combine and apply external knowledge policies.
Method: A Knowledge-Grounded Reinforcement Learning (KGRL) framework fusing multiple knowledge policies is presented, along with a new actor architecture, the Knowledge-Inclusive Attention Network (KIAN), which enables flexible combination and replacement of external knowledge.
Results: Experiments show that KIAN outperforms alternative methods incorporating external knowledge policies and achieves efficient and flexible learning.

Reinforcement learning (RL) agents have long sought to approach the efficiency of human learning. Humans are great observers who can learn by aggregating external knowledge from various sources, including observations from others' policies of attempting a task. Prior studies in RL have incorporated external knowledge policies to help agents improve sample efficiency. However, it remains non-trivial to perform arbitrary combinations and replacements of those policies, an essential feature for generalization and transferability. In this work, we present Knowledge-Grounded RL (KGRL), an RL paradigm fusing multiple knowledge policies and aiming for human-like efficiency and flexibility. We propose a new actor architecture for KGRL, Knowledge-Inclusive Attention Network (KIAN), which allows free knowledge rearrangement due to embedding-based attentive action prediction. KIAN also addresses entropy imbalance, a problem arising in maximum entropy KGRL that hinders an agent from efficiently exploring the environment, through a new design of policy distributions. The experimental results demonstrate that KIAN outperforms alternative methods incorporating external knowledge policies and achieves efficient and flexible learning. Our implementation is available at https://github.com/Pascalson/KGRL.git .

Diffused Task-Agnostic Milestone Planner
Mineui Hong Minjae Kang Songhwai Oh



Research question: In recent years, using sequence models to predict future trajectories for decision-making has shown promising results.
Motivation: This paper pushes sequence-prediction methods further into wider areas such as long-term planning, vision-based control, and multi-task decision-making.
Method: A diffusion-based generative sequence model plans a series of milestones in a latent space, and the agent follows these milestones to accomplish a given task.
Results: Demonstrated on offline reinforcement learning benchmarks and a visual manipulation environment, the method outperforms offline RL approaches on long-horizon, sparse-reward tasks and multi-task problems, while also achieving state-of-the-art performance on the most challenging vision-based manipulation benchmark.

Addressing decision-making problems using sequence modeling to predict future trajectories shows promising results in recent years. In this paper, we take a step further to leverage the sequence predictive method in wider areas such as long-term planning, vision-based control, and multi-task decision-making. To this end, we propose a method to utilize a diffusion-based generative sequence model to plan a series of milestones in a latent space and to have an agent follow the milestones to accomplish a given task. The proposed method can learn control-relevant, low-dimensional latent representations of milestones, which makes it possible to efficiently perform long-term planning and vision-based control. Furthermore, our approach exploits the generation flexibility of the diffusion model, which makes it possible to plan diverse trajectories for multi-task decision-making. We demonstrate the proposed method across offline reinforcement learning (RL) benchmarks and a visual manipulation environment. The results show that our approach outperforms offline RL methods in solving long-horizon, sparse-reward tasks and multi-task problems, while also achieving the state-of-the-art performance on the most challenging vision-based manipulation benchmark.

ODE-based Recurrent Model-free Reinforcement Learning for POMDPs
Xuanle Zhao Duzhen Zhang Han Liyuan Tielin Zhang Bo XU



Research question: How to infer information about unknown physical or biological environments from raw observations, particularly under partial observability.
Motivation: Extracting unobserved information in partially observable environments is difficult; context-based reinforcement learning with a recurrent policy over a compact context offers a flexible way to extract such information from historical transitions.
Method: A novel ODE-based recurrent model is combined with a model-free reinforcement learning framework to solve partially observable Markov decision processes (POMDPs).
Results: Experiments show the method is effective across various partially observable continuous control and meta-RL tasks and, because ODEs can model irregularly sampled time series, robust to irregular observations.

Neural ordinary differential equations (ODEs) are widely recognized as a standard for modeling physical mechanisms, helping to perform approximate inference in unknown physical or biological environments. In partially observable (PO) environments, inferring unseen information from raw observations is a major challenge for agents. By using a recurrent policy with a compact context, context-based reinforcement learning provides a flexible way to extract unobservable information from historical transitions. To help the agent extract more dynamics-related information, we present a novel ODE-based recurrent model combined with a model-free reinforcement learning (RL) framework to solve partially observable Markov decision processes (POMDPs). We experimentally demonstrate the efficacy of our method across various PO continuous control and meta-RL tasks. Furthermore, our experiments illustrate that our method is robust against irregular observations, owing to the ability of ODEs to model irregularly-sampled time series.
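
An ODE-based recurrent cell in this spirit can be sketched as follows (the architecture details are our assumptions, not the paper's): between observations the hidden state is evolved by Euler-integrating a learned ODE over the actual elapsed time, which is what makes irregular sampling natural to handle; each arriving observation is then folded in with a GRU cell.

```python
import torch
import torch.nn as nn

class ODERecurrentCell(nn.Module):
    """Hidden state follows h' = f(h) between observations (fixed-step Euler),
    then a GRU cell incorporates each new observation."""
    def __init__(self, obs_dim, hidden_dim, n_euler_steps=5):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.Tanh(), nn.Linear(64, hidden_dim))
        self.update = nn.GRUCell(obs_dim, hidden_dim)
        self.n = n_euler_steps

    def forward(self, h, obs, dt):
        step = dt / self.n
        for _ in range(self.n):                  # integrate over the real gap
            h = h + step * self.dynamics(h)
        return self.update(obs, h)               # fold in the new observation

cell = ODERecurrentCell(obs_dim=3, hidden_dim=16)
h = torch.zeros(1, 16)
times = [0.0, 0.4, 0.5, 1.7]                     # irregular observation times
obs_seq = [torch.randn(1, 3) for _ in times]
prev_t = 0.0
for t, o in zip(times, obs_seq):
    h = cell(h, o, dt=t - prev_t)
    prev_t = t
print(h.shape)                                   # context for a recurrent policy
```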

Offline RL with Discrete Proxy Representations for Generalizability in POMDPs
Pengjie Gu Xinyu Cai Dong Xing Xinrun Wang Mengchen Zhao Bo An



Research question: This paper addresses the partial observability that offline reinforcement learning encounters in real scenarios: data is fully observable at training time, but observations may be masked at execution time, and which parts will be masked is unknown during training.
Motivation: Existing offline reinforcement learning methods handle partial observability poorly; the trained models perform badly when observations are masked.
Method: A new offline reinforcement learning method, Offline RL with DiscrEte pRoxy representations (ORDER), is proposed. It learns discrete state representations to improve robustness against diverse masked observations and uses proxy representations to recover states from masked, partially observable trajectories.
Results: Experiments show that ORDER performs well across diverse partially observable offline RL scenarios, demonstrating the importance of discrete proxy representations for generalization. Moreover, since ORDER is a flexible framework applicable to any offline RL algorithm, it may help advance the deployment of RL policies against various forms of partial observability in the real world.

Offline Reinforcement Learning (RL) has demonstrated promising results in various applications by learning policies from previously collected datasets, reducing the need for online exploration and interactions. However, real-world scenarios usually involve partial observability, which poses crucial challenges for the deployment of offline RL methods: i) the policy trained on data with full observability is not robust against the masked observations during execution, and ii) the information of which parts of observations are masked is usually unknown during training. In order to address these challenges, we present Offline RL with DiscrEte pRoxy representations (ORDER), a probabilistic framework which leverages novel state representations to improve the robustness against diverse masked observabilities. Specifically, we propose a discrete representation of the states and use a proxy representation to recover the states from masked partial observable trajectories. The training of ORDER can be compactly described as the following three steps: i) learning the discrete state representations on data with full observations, ii) training the decision module based on the discrete representations, and iii) training the proxy discrete representations on the data with various partial observations, aligning with the discrete representations. We conduct extensive experiments to evaluate ORDER, showcasing its effectiveness in offline RL for diverse partially observable scenarios and highlighting the significance of discrete proxy representations in generalization performance. ORDER is a flexible framework that can employ any offline RL algorithm, and we hope that ORDER can pave the way for the deployment of RL policies against various partial observabilities in the real world.

CEIL: Generalized Contextual Imitation Learning
Jinxin Liu Li He Yachen Kang Zifeng Zhuang Donglin Wang Huazhe Xu



Research question: This paper presents ContExtual Imitation Learning (CEIL), a general and broadly applicable imitation learning (IL) algorithm.
Motivation: Inspired by hindsight information matching, CEIL is derived by explicitly learning a hindsight embedding function together with a contextual policy that uses the hindsight embeddings.
Method: To achieve the expert-matching objective of IL, a contextual variable is optimized so that it biases the contextual policy toward mimicking expert behaviors.
Results: Evaluated empirically on popular MuJoCo tasks (online) and the D4RL dataset (offline), CEIL is more sample-efficient than prior state-of-the-art baselines on most online IL tasks and achieves better or comparable performance on offline tasks.

In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight information matching, we derive CEIL by explicitly learning a hindsight embedding function together with a contextual policy using the hindsight embeddings. To achieve the expert matching objective for IL, we advocate for optimizing a contextual variable such that it biases the contextual policy towards mimicking expert behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL is a generalist that can be effectively applied to multiple settings including: 1) learning from observations (LfO), 2) offline IL, 3) cross-domain IL (mismatched experts), and 4) one-shot IL settings. Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performances in offline tasks.

SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning
Dohyeok Lee Seungyub Han Taehyun Cho Jungwoo Lee



Research question: How can deep reinforcement learning mitigate overestimation bias to succeed on more complex tasks or offline datasets containing out-of-distribution data?
Motivation: To overcome overestimation bias, researchers have explored ensemble Q-learning methods that exploit the diversity of multiple Q-functions.
Method: By introducing a novel Q-ensemble independence regularization loss based on random matrix theory, spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning is proposed.
Results: Experiments show that SPQR outperforms baseline algorithms on both online and offline RL benchmarks.

Alleviating overestimation bias is a critical challenge for deep reinforcement learning to achieve successful performance on more complex tasks or offline datasets containing out-of-distribution data. In order to overcome overestimation bias, ensemble methods for Q-learning have been investigated to exploit the diversity of multiple Q-functions. Since network initialization has been the predominant approach to promote diversity in Q-functions, heuristically designed diversity injection methods have been studied in the literature. However, previous studies have not attempted to approach guaranteed independence over an ensemble from a theoretical perspective. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we modify the intractable hypothesis testing criterion for the Q-ensemble independence into a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner's semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In the experiments, SPQR outperforms the baseline algorithms in both online and offline RL benchmarks.

VOCE: Variational Optimization with Conservative Estimation for Offline Safe Reinforcement Learning
Jiayi Guan Guang Chen Jiaming Ji Long Yang Ao Zhou Zhijun Li changjun jiang



Research question: How to learn policies that satisfy safety constraints directly from offline datasets, for scenarios with high sampling costs and potential dangers.
Motivation: Existing methods struggle to achieve high returns while guaranteeing safety, so a new approach to optimizing safe policies from offline datasets is needed.
Method: A Variational Optimization with Conservative Estimation algorithm (VOCE) is proposed, which introduces probabilistic inference and pessimistic estimation to optimize policies and reduce the extrapolation errors induced by OOD actions.
Results: Experiments show that VOCE achieves competitive performance across multiple tasks, particularly surpassing state-of-the-art algorithms in terms of safety.

Offline safe reinforcement learning (RL) algorithms promise to learn policies that satisfy safety constraints directly from offline datasets without interacting with the environment. This arrangement is particularly important in scenarios with high sampling costs and potential dangers, such as autonomous driving and robotics. However, the influence of safety constraints and out-of-distribution (OOD) actions has made it challenging for previous methods to achieve high reward returns while ensuring safety. In this work, we propose a Variational Optimization with Conservative Estimation algorithm (VOCE) to solve the problem of optimizing safety policies in the offline dataset. Concretely, we reframe the problem of offline safe RL using probabilistic inference, which introduces variational distributions to make the optimization of policies more flexible. Subsequently, we utilize pessimistic estimation methods to estimate the Q-value of cost and reward, which mitigates the extrapolation errors induced by OOD actions. Finally, extensive experiments demonstrate that the VOCE algorithm achieves competitive performance across multiple experimental tasks, particularly outperforming state-of-the-art algorithms in terms of safety.

CaMP: Causal Multi-policy Planning for Interactive Navigation in Multi-room Scenes
Xiaohan Wang Yuehu Liu Xinhang Song Beibei Wang Shuqiang Jiang



Research question: How to navigate interactively and effectively in realistic scenes, such as cluttered rooms, where there may be no clear route to the goal.
Motivation: Traditional visual navigation assumes several clear routes exist, but in complex scenes the causality between actions and outcomes is easily confounded because obstacle attributes are diverse and hard to measure, leading to inefficient exploration.
Method: A causal diagram of interactive navigation is introduced to clarify the confounding bias caused by obstacles, and a multi-policy model is designed that explores counterfactual interactions while reducing unnecessary exploration.
Results: A large-scale dataset with 600k task episodes in 12k multi-room scenes is built on the ProcTHOR simulator, and experiments demonstrate the effectiveness of the method.

Visual navigation has been widely studied under the assumption that there may be several clear routes to reach the goal. However, in more practical scenarios such as a house with several messy rooms, there may not. Interactive Navigation (InterNav) considers agents navigating to their goals more effectively with object interactions, posing new challenges of learning interaction dynamics and extra action space. Previous works learn single vision-to-action policy with the guidance of designed representations. However, the causality between actions and outcomes is prone to be confounded when the attributes of obstacles are diverse and hard to measure. Learning policy for long-term action planning in complex scenes also leads to extensive inefficient exploration. In this paper, we introduce a causal diagram of InterNav clarifying the confounding bias caused by obstacles. To address the problem, we propose a multi-policy model that enables the exploration of counterfactual interactions as well as reduces unnecessary exploration. We develop a large-scale dataset containing 600k task episodes in 12k multi-room scenes based on the ProcTHOR simulator and showcase the effectiveness of our method with the evaluations on our dataset.

Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL
Peng Cheng Xianyuan Zhan Zhihao Wu Wenjia Zhang Youfang Lin Shou cheng Song Han Wang Li Jiang



Research question: The performance of existing offline reinforcement learning algorithms on small datasets depends heavily on dataset scale and state-action space coverage, which poses major challenges for practical deployment.
Motivation: Exploiting the fundamental symmetry of system dynamics can substantially improve offline reinforcement learning performance on small datasets.
Method: A Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM) is proposed, which establishes consistency between a pair of forward and reverse latent dynamics.
Results: Extensive experiments show that TSRL performs well on small benchmark datasets with as little as 1% of the original samples, greatly outperforming recent offline RL algorithms in data efficiency and generalizability.

Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
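
The T-symmetry consistency idea can be sketched as follows (architecture and loss weights are our assumptions, not the paper's): forward and reverse latent dynamics are fit jointly, a cycle term ties them together, and the resulting residual doubles as the OOD reliability measure mentioned above.

```python
import torch
import torch.nn as nn

# Forward/reverse latent dynamics with a T-symmetry cycle term. A decoder
# reconstruction loss prevents the encoder from collapsing to zero.
torch.manual_seed(0)
S, A, Z = 4, 2, 8
enc = nn.Linear(S, Z)
dec = nn.Linear(Z, S)
fwd = nn.Sequential(nn.Linear(Z + A, 64), nn.ReLU(), nn.Linear(64, Z))
rev = nn.Sequential(nn.Linear(Z + A, 64), nn.ReLU(), nn.Linear(64, Z))
params = [*enc.parameters(), *dec.parameters(), *fwd.parameters(), *rev.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

s = torch.randn(512, S)
a = torch.randn(512, A)
s_next = s + 0.1 * torch.cat([a, -a], dim=1)      # toy reversible dynamics

for _ in range(500):
    z, z_next = enc(s), enc(s_next)
    z_fwd = fwd(torch.cat([z, a], dim=1))         # forward latent prediction
    z_rev = rev(torch.cat([z_next, a], dim=1))    # reverse latent prediction
    cycle = rev(torch.cat([z_fwd, a], dim=1))     # reverse should undo forward
    loss = ((z_fwd - z_next) ** 2).mean() + ((z_rev - z) ** 2).mean() \
         + ((cycle - z) ** 2).mean() + ((dec(z) - s) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def reliability(s, a, s_next):
    """Small residual = transition consistent with the T-symmetric model."""
    z, z_next = enc(s), enc(s_next)
    return -(((fwd(torch.cat([z, a], 1)) - z_next) ** 2).mean(1)
             + ((rev(torch.cat([z_next, a], 1)) - z) ** 2).mean(1))

print(reliability(s[:4], a[:4], s_next[:4]))
```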

Guide Your Agent with Adaptive Multimodal Rewards
Changyeon Kim Younggyo Seo Hao Liu Lisa Lee Jinwoo Shin Honglak Lee Kimin Lee



Research question: How to enable agents to adapt to unseen environments and improve generalization in imitation learning.
Motivation: Existing imitation learning methods generalize poorly when facing unseen environments.
Method: Propose ARP, an adaptive return-conditioned policy framework that uses a pre-trained multimodal encoder and natural-language task descriptions, computing the similarity between visual observations and language instructions in the multimodal embedding space as a reward signal and training on expert demonstrations.
Results: Experiments show that ARP effectively mitigates goal misgeneralization and generalizes well even to unseen text instructions; fine-tuning the pre-trained multimodal encoder further improves reward quality and hence performance.

Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to calculate a similarity between visual observations and natural language instructions in the pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization. This results in superior generalization performance on unseen text instructions, compared to existing text-conditioned policies. To improve the quality of rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing performance. Video demonstrations and source code are available on the project website: \url{https://sites.google.com/view/2023arp}.
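
The reward computation itself is simple to sketch. Below is a minimal Python illustration of a multimodal reward as cosine similarity between observation and instruction embeddings; the random linear "encoders" are hypothetical stand-ins for the frozen towers of a pre-trained model such as CLIP:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# Placeholder encoders standing in for a pre-trained multimodal model;
# in practice these would be the frozen image and text towers of CLIP.
rng = np.random.default_rng(0)
W_img = rng.normal(size=(512, 3 * 64 * 64))
W_txt = rng.normal(size=(512, 128))

def encode_image(obs):        # obs: flattened RGB observation
    return W_img @ obs

def encode_text(instr_vec):   # instr_vec: tokenized instruction features
    return W_txt @ instr_vec

def multimodal_reward(obs, instr_vec):
    """Per-timestep reward = similarity of observation and instruction
    in the shared embedding space."""
    return cosine_similarity(encode_image(obs), encode_text(instr_vec))

obs = rng.normal(size=3 * 64 * 64)
instr = rng.normal(size=128)
print(multimodal_reward(obs, instr))
```

Because this reward is recomputed at every timestep against the current observation, it adapts automatically when the instruction changes.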

AdaPlanner: Adaptive Planning from Feedback with Language Models
Haotian Sun Yuchen Zhuang Lingkai Kong Bo Dai Chao Zhang



Research question: How can large language models plan effectively with feedback loops in complex sequential decision-making tasks?
Motivation: Existing methods either act greedily or rely on static plans that cannot adapt to environmental feedback, so performance degrades on complex tasks with long planning horizons.
Method: Propose AdaPlanner, a closed-loop method that lets an LLM agent adaptively refine its self-generated plan based on environmental feedback, with both in-plan and out-of-plan refinement strategies plus a skill-discovery mechanism that uses successful plans as few-shot exemplars.
Results: Experiments in the ALFWorld and MiniWoB++ environments show AdaPlanner outperforms state-of-the-art baselines by 3.73% and 4.11% while using 2x and 600x fewer samples, respectively.

Large language models (LLMs) have recently demonstrated the potential to act as autonomous agents for sequential decision-making tasks. However, most existing methods either take actions greedily without planning or rely on static plans that are not adaptable to environmental feedback. Consequently, the sequential decision-making performance of LLM agents degenerates as problem complexity and plan horizon increase. We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback. In AdaPlanner, the LLM agent adaptively refines its plan from feedback with both in-plan and out-of-plan refinement strategies. To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities. Furthermore, we propose a skill discovery mechanism that leverages successful plans as few-shot exemplars, enabling the agent to plan and refine with fewer task demonstrations. Our experiments in the ALFWorld and MiniWoB++ environments demonstrate that AdaPlanner outperforms state-of-the-art baselines by 3.73% and 4.11% while utilizing 2x and 600x fewer samples, respectively. The implementation of AdaPlanner is available at https://github.com/haotiansun14/AdaPlanner.
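
To illustrate what a code-style plan prompt might look like, here is a hypothetical sketch in the spirit of AdaPlanner; the agent API (agent.goto, agent.take, ...) and the refinement wrapper are invented for illustration and are not the paper's actual prompts:

```python
# A hypothetical code-style plan prompt: the plan is expressed as a
# Python-like function so the LLM's output is easy to parse, execute
# step by step, and refine when an assertion (environment feedback) fails.
PLAN_PROMPT = '''
# Task: put a clean mug on the desk.
def solve(agent, start_from=1):
    # step 1: locate the mug
    obs = agent.goto("countertop")
    assert "mug" in obs, f"mug not found, replan from step 1: {obs}"
    # step 2: clean it
    agent.take("mug"); agent.goto("sink"); agent.clean("mug")
    # step 3: deliver
    agent.goto("desk"); agent.put("mug")
'''

def refine_prompt(plan, failed_step, feedback):
    """Out-of-plan refinement: ask the LLM to revise from the failed step."""
    return (plan
            + f"\n# Execution failed at step {failed_step}: {feedback}\n"
            + f"# Rewrite solve() so execution can resume via start_from={failed_step}.\n")

print(refine_prompt(PLAN_PROMPT, 1, "countertop is empty"))
```

The structured format constrains the model's output space, which is one plausible reason such prompts help mitigate hallucination.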

Long-Term Fairness with Unknown Dynamics
Tongxin Yin Reilly Raab Mingyan Liu Yang Liu



Research question: How to achieve long-term fairness with machine learning, particularly for policies that affect human populations.
Motivation: Current machine learning models tend to focus on short-term effects and neglect long-term fairness.
Method: Formalize long-term fairness as an online reinforcement learning problem in which a policy affects a human population, accommodating dynamical control objectives such as reaching equitable population states.
Results: Experiments show the algorithm adapts to unknown dynamics and, by sacrificing short-term incentives, drives the policy-population system towards more desirable equilibria; in classification tasks it outperforms baselines on group fairness.

While machine learning can myopically reinforce social inequalities, it may also be used to dynamically seek equitable outcomes. In this paper, we formalize long-term fairness as an online reinforcement learning problem for a policy affecting human populations. This formulation accommodates dynamical control objectives, such as achieving equitable population states, that cannot be incorporated into static formulations of fairness. We demonstrate that algorithmic solutions to the proposed fairness problem can adapt to unknown dynamics and, by sacrificing short-term incentives, drive the policy-population system towards more desirable equilibria. For the proposed setting, we develop an algorithm that adapts recent work in online learning and prove that this algorithm achieves simultaneous probabilistic bounds on cumulative loss and cumulative violations of fairness. In the classification setting subject to group fairness, we compare our proposed algorithm to several baselines, including the repeated retraining of myopic or distributionally robust classifiers, and to a deep reinforcement learning algorithm that lacks fairness guarantees. Our experiments model human populations according to evolutionary game theory and integrate real-world datasets.

Arbitrarily Scalable Environment Generators via Neural Cellular Automata
Yulun Zhang Matthew Christopher Fontaine Varun Bhatt Stefanos Nikolaidis Jiaoyang Li



Research question: How to generate arbitrarily large environments to improve the throughput of multi-robot systems.
Motivation: Existing methods optimize only relatively small warehouse environments and cannot replicate real-world warehouse scales; the search space grows exponentially with environment size, and prior methods were tested with at most 350 robots in simulation while real warehouses can host thousands.
Method: Instead of optimizing environments directly, optimize Neural Cellular Automata (NCA) environment generators via Quality Diversity (QD) algorithms: train a collection of NCA generators on small environments with QD, then generate arbitrarily large environments from the generators at test time.
Results: NCA generators maintain consistent, regularized patterns regardless of environment size, significantly enhancing the scalability of multi-robot systems in two domains with up to 2,350 robots; the method also scales a single-agent reinforcement learning policy to arbitrarily large environments with similar patterns.

We study the problem of generating arbitrarily large environments to improve the throughput of multi-robot systems. Prior work proposes Quality Diversity (QD) algorithms as an effective method for optimizing the environments of automated warehouses. However, these approaches optimize only relatively small environments, falling short when it comes to replicating real-world warehouse sizes. The challenge arises from the exponential increase in the search space as the environment size increases. Additionally, the previous methods have only been tested with up to 350 robots in simulations, while practical warehouses could host thousands of robots. In this paper, instead of optimizing environments, we propose to optimize Neural Cellular Automata (NCA) environment generators via QD algorithms. We train a collection of NCA generators with QD algorithms in small environments and then generate arbitrarily large environments from the generators at test time. We show that NCA environment generators maintain consistent, regularized patterns regardless of environment size, significantly enhancing the scalability of multi-robot systems in two different domains with up to 2,350 robots. Additionally, we demonstrate that our method scales a single-agent reinforcement learning policy to arbitrarily large environments with similar patterns. We include the source code at https://github.com/lunjohnzhang/warehouse_env_gen_nca_public.
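
A neural cellular automaton is just a local update rule applied uniformly to every cell, which is what makes generation size-agnostic. The sketch below uses a hand-rolled linear rule with random weights purely to illustrate the mechanics; in the paper the rule is a trained network whose parameters are found by QD search:

```python
import numpy as np

rng = np.random.default_rng(0)

def nca_step(grid, weights, bias):
    """One NCA update: each cell looks at its 3x3 neighborhood and applies
    a shared linear rule followed by a threshold nonlinearity."""
    h, w = grid.shape
    padded = np.pad(grid, 1)
    new = np.empty_like(grid)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3].ravel()
            new[i, j] = 1.0 if patch @ weights + bias > 0 else 0.0
    return new

# Toy generator: these weights would normally come from QD optimization;
# random values here just demonstrate size-agnostic generation.
weights, bias = rng.normal(size=9), -0.5

small = (rng.random((8, 8)) < 0.2).astype(float)     # training-size seed
large = (rng.random((64, 64)) < 0.2).astype(float)   # test-time, much larger

for _ in range(10):                                   # same rule, any size
    small, large = nca_step(small, weights, bias), nca_step(large, weights, bias)
print(large.mean())
```

Because the rule is purely local, the same trained generator can be rolled out on a grid of any size, which is the key to the arbitrary scalability claimed above.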

Contextual Bandits and Imitation Learning with Preference-Based Active Queries
Ayush Sekhari Karthik Sridharan Wen Sun Runzhe Wu



Research question: When the learner cannot observe rewards directly, how can it minimize both the regret of executed actions and the number of queries by asking an expert to compare two actions and receiving noisy preference feedback?
Motivation: In many settings the learner cannot directly observe the reward of an executed action and must instead obtain preference feedback through interaction with an expert.
Method: Assume the learner has access to a function class that can represent the expert's preference model under an appropriate link function, and propose an algorithm that leverages an online regression oracle.
Results: For the contextual bandit setting, the algorithm achieves a best-of-both-worlds regret bound scaling as O(min{√T, d/Δ}), where T is the number of interactions, d is the eluder dimension of the function class, and Δ is the minimum preference of the optimal action over any suboptimal action across all contexts. The algorithm does not require knowledge of Δ, the bound is comparable to the standard contextual bandit setting where rewards are observed every round, and only O(min{T, d²/Δ²}) expert queries are made. The algorithm extends to the imitation learning setting, where the agent interacts with an unknown environment in episodes of length H, with similar regret and query-complexity guarantees; notably, with preference-based feedback it can learn a policy that outperforms a sub-optimal expert, matching interactive imitation learning algorithms that require access to the expert's actions and reward signals.

We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively request the expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is two-fold: to minimize regret associated with the executed actions, while simultaneously, minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions and present an algorithm that leverages an online regression oracle with respect to this function class. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$ represents the number of interactions, $d$ represents the eluder dimension of the function class, and $\Delta$ represents the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require the knowledge of $\Delta$, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandits setting where the learner observes reward signals at each round. Additionally, our algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We then extend our algorithm to the imitation learning setting, where the agent engages with an unknown environment in episodes of length $H$, and provide similar guarantees regarding regret and query complexity. Interestingly, with preference-based feedback, our imitation learning algorithm can learn a policy outperforming a sub-optimal expert, matching the result from interactive imitation learning algorithms [Ross and Bagnell, 2014] that require access to the expert's actions and also reward signals.

POMDP Planning for Object Search in Partially Unknown Environment
Yongbo Chen Hanna Kurniawati



Research question: How to search efficiently for target objects in complex environments containing furniture such as shelves, tables, and beds.
Motivation: Localization errors, a limited field of view, and visual occlusion make it challenging for mobile robots to find target objects in complex environments.
Method: Propose a POMDP (partially observable Markov decision process) formulation for object search in a 3D region, together with a carefully designed perception module and a planning algorithm (GPOMCP).
Results: Gazebo simulation experiments show the method finds target objects faster and with a higher success rate than POMCP-based baselines under the same computational budget.

Efficiently searching for target objects in complex environments that contain various types of furniture, such as shelves, tables, and beds, is crucial for mobile robots, but it poses significant challenges due to various factors such as localization errors, limited field of view, and visual occlusion. To address this problem, we propose a Partially Observable Markov Decision Process (POMDP) formulation with a growing state space for object search in a 3D region. We solve this POMDP by carefully designing a perception module and developing a planning algorithm, called Growing Partially Observable Monte-Carlo Planning (GPOMCP), based on online Monte-Carlo tree search and belief tree reuse with a novel upper confidence bound. We have demonstrated that belief tree reuse is reasonable and achieves good performance when the belief differences are limited. Additionally, we introduce a guessed target object with an updating grid world to guide the search in the information-less and reward-less cases, like the absence of any detected objects. We tested our approach using Gazebo simulations on four scenarios of target finding in a realistic indoor living environment with the Fetch robot simulator. Compared to the baseline approaches, which are based on POMCP, our results indicate that our approach enables the robot to find the target object faster and with a higher success rate under the same computational budget.

Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective
Zeyu Zhang Yi Su Hui Yuan Yiran Wu Rishab Balasubramanian Qingyun Wu Huazheng Wang Mengdi Wang



Research question: Existing off-policy learning-to-rank methods make strong assumptions about the click model generating user click data and must be tailored to each click model.
Motivation: Unify the ranking process as a Markov decision process and learn the optimal ranking directly via offline reinforcement learning.
Method: Propose a click-model-agnostic unified off-policy learning-to-rank (CUOLR) method that can easily be applied to a wide range of click models.
Results: Experiments on various large-scale datasets show that CUOLR consistently outperforms state-of-the-art off-policy learning-to-rank algorithms while remaining consistent and robust across different click models.

Off-policy Learning to Rank (LTR) aims to optimize a ranker from data collected by a deployed logging policy. However, existing off-policy learning to rank methods often make strong assumptions about how users generate the click data, i.e., the click model, and hence need to tailor their methods specifically under different click models. In this paper, we unify the ranking process under general stochastic click models as a Markov Decision Process (MDP), so that the optimal ranking can be learned directly with offline reinforcement learning (RL). Building upon this, we leverage offline RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified Off-policy Learning to Rank (CUOLR) method, which could be easily applied to a wide range of click models. Through a dedicated formulation of the MDP, we show that offline RL algorithms can adapt to various click models without complex debiasing techniques and prior knowledge of the model. Results on various large-scale datasets demonstrate that CUOLR consistently outperforms the state-of-the-art off-policy learning to rank algorithms while maintaining consistency and robustness under different click models.
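
The MDP view of ranking is easy to sketch: the state is the query plus the partial ranking, the action picks the next document, and clicks provide rewards. The snippet below is a toy illustration with a position-based click model; all names and the click model itself are illustrative assumptions, not the paper's formulation:

```python
import random

def rank_episode(query_docs, policy, click_prob):
    """Roll out one ranking episode; returns (trajectory, total clicks)."""
    remaining = list(query_docs)
    placed, traj, total = [], [], 0.0
    while remaining:
        state = (tuple(placed), tuple(remaining))
        doc = policy(state, remaining)           # action: choose next doc
        position = len(placed)
        reward = float(random.random() < click_prob(doc, position))
        traj.append((state, doc, reward))
        placed.append(doc)
        remaining.remove(doc)
        total += reward
    return traj, total

# Position-based click model: relevance times a position bias; one of many
# click models the unified MDP view can absorb without special casing.
relevance = {"d1": 0.9, "d2": 0.5, "d3": 0.2}
click_prob = lambda doc, pos: relevance[doc] * (1.0 / (pos + 1))

greedy = lambda state, remaining: max(remaining, key=lambda d: relevance[d])
random.seed(0)
print(rank_episode(["d1", "d2", "d3"], greedy, click_prob)[1])
```

Swapping in a cascade or dependent-click model only changes click_prob, not the MDP; this is the sense in which the formulation is click-model-agnostic.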

Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation
Ruida Zhou Tao Liu Min Cheng Dileep Kalathil Panganamala Kumar Chao Tian



Research question: Study robustness in reinforcement learning, seeking a well-performing policy that is robust to model mismatch between the training simulator and the testing environment.
Motivation: Existing policy-based robust RL algorithms mainly focus on the tabular setting with uncertainty sets that make robust policy evaluation easy, but these become intractable as the number of states grows.
Method: Propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric, both of which make large-scale robust RL tractable even with only simulator access; also propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation.
Results: Prove finite-time convergence of RNAC to the optimal robust policy within the function-approximation error, and demonstrate the robust performance of the learned policies in multiple MuJoCo environments and a real-world TurtleBot navigation task.

We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but are no longer tractable when the number of states scales up. To this end, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.

ReDS: Offline RL With Heteroskedastic Datasets via Support Constraints
Anikait Singh Aviral Kumar Quan Vuong Yevgen Chebotar Sergey Levine



Research question: Existing offline reinforcement learning (RL) methods struggle on datasets where the variability of demonstrated behaviors changes non-uniformly across the state space, because they must stay close to the behavior policy to the same extent everywhere.
Motivation: To address this, we propose a new offline RL method, a reweighted form of conservative Q-learning (CQL) called ReDS.
Method: Reweight the data distribution to obtain an approximate support-constraint formulation; the reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy.
Results: Experiments show significant performance gains across a wide range of offline RL problems, including games, navigation, and pixel-based manipulation.

Offline reinforcement learning (RL) learns policies entirely from static datasets. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability due to the requirement to stay close to the behavior policy **to the same extent** across the state space. Ideally, the learned policy should be free to choose **per state** how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is theoretically motivated, and improves performance across a wide range of offline RL problems in games, navigation, and pixel-based manipulation.

On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling and Beyond
Thanh Nguyen-Tang Raman Arora



Research question: Understand what enables sample-efficient sequential decision-making from historical datasets, i.e., offline reinforcement learning.
Motivation: We are interested in sample-efficient algorithms that leverage (value) function approximation.
Method: Propose a notion of data diversity that subsumes all previous coverage measures in offline RL, and use it to unify three classes of offline RL algorithms based on version spaces (VS), regularized optimization (RO), and posterior sampling (PS).
Results: Under standard assumptions, VS-, RO-, and PS-based algorithms are shown to achieve comparable sample efficiency, recovering the state-of-the-art sub-optimality bounds. This is surprising, since prior work suggested that RO-based algorithms have unfavorable sample complexity relative to VS-based ones, while posterior sampling is rarely considered in offline RL due to its explorative nature. Notably, the proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds of a frequentist (i.e., worst-case) nature.

We seek to understand what facilitates sample-efficient learning from historical datasets for sequential decision-making, a problem that is popularly known as offline reinforcement learning (RL). Further, we are interested in algorithms that enjoy sample efficiency while leveraging (value) function approximation. In this paper, we address these fundamental questions by (i) proposing a notion of data diversity that subsumes the previous notions of coverage measures in offline RL and (ii) using this notion to \emph{unify} three distinct classes of offline RL algorithms based on version spaces (VS), regularized optimization (RO), and posterior sampling (PS). We establish that VS-based, RO-based, and PS-based algorithms, under standard assumptions, achieve \emph{comparable} sample efficiency, which recovers the state-of-the-art sub-optimality bounds for finite and linear model classes with the standard assumptions. This result is surprising, given that the prior work suggested an unfavorable sample complexity of the RO-based algorithm compared to the VS-based algorithm, whereas posterior sampling is rarely considered in offline RL due to its explorative nature. Notably, our proposed model-free PS-based algorithm for offline RL is \emph{novel}, with sub-optimality bounds that are \emph{frequentist} (i.e., worst-case) in nature.

Decision Stacks: Flexible Reinforcement Learning via Modular Generative Models
Siyan Zhao Aditya Grover



Research question: Reinforcement learning is attractive for sequential decision-making, but faces algorithmic challenges such as retaining maximal expressivity while keeping the flexibility in modeling choices needed for efficient learning and inference.
Motivation: This paper proposes Decision Stacks, a generative framework that decomposes goal-conditioned policy agents into three generative modules to address these challenges.
Method: Simulate the temporal evolution of observations, rewards, and actions with independent generative models that can be learned in parallel via teacher forcing. The framework guarantees expressivity and flexibility, so individual modules can be designed around key factors such as architectural bias, optimization objectives and dynamics, transferability across domains, and inference speed.
Results: Empirical results show that Decision Stacks outperforms existing methods for offline policy optimization across several MDP and POMDP environments and enables flexible generative decision making.

Reinforcement learning presents an attractive paradigm to reason about several distinct aspects of sequential decision making, such as specifying complex goals, planning future observations and actions, and critiquing their utilities. However, the combined integration of these capabilities poses competing algorithmic challenges in retaining maximal expressivity while allowing for flexibility in modeling choices for efficient learning and inference. We present Decision Stacks, a generative framework that decomposes goal-conditioned policy agents into 3 generative modules. These modules simulate the temporal evolution of observations, rewards, and actions via independent generative models that can be learned in parallel via teacher forcing. Our framework guarantees both expressivity and flexibility in designing individual modules to account for key factors such as architectural bias, optimization objective and dynamics, transferability across domains, and inference speed. Our empirical results demonstrate the effectiveness of Decision Stacks for offline policy optimization for several MDP and POMDP environments, outperforming existing methods and enabling flexible generative decision making.

A Long $N$-step Surrogate Stage Reward for Deep Reinforcement Learning
Junmin Zhong Ruofan Wu Jennie Si



Research question: High variance in deep reinforcement learning impedes convergence, hurts task performance, and hinders applications to continuous control problems.
Motivation: To address this, we propose a new stage reward estimator, the long N-step surrogate stage (LNSS) reward.
Method: LNSS utilizes a long trajectory of future-step rewards, providing consistent improvements as measured by average reward, convergence speed, learning success rate, and variance reduction in Q values and rewards.
Results: Evaluations using LNSS in baseline deep RL algorithms such as DDPG, D4PG, and TD3 on a variety of DeepMind Control Suite and OpenAI Gym environments show that LNSS enables good results that were previously hard to obtain, and exponentially reduces the upper bound on the variance of Q values.

We introduce a new stage reward estimator named the long $N$-step surrogate stage (LNSS) reward for deep reinforcement learning (RL). It aims to mitigate the high-variance problem, which has been shown to impede successful convergence of learning, hurt task performance, and hinder applications of deep RL in continuous control problems. In this paper we show that LNSS, which utilizes a long trajectory of rewards from future steps, provides consistent performance improvement measured by average reward, convergence speed, learning success rate, and variance reduction in $Q$ values and rewards. Our evaluations are based on a variety of environments in DeepMind Control Suite and OpenAI Gym by using LNSS in baseline deep RL algorithms such as DDPG, D4PG, and TD3. We show that the LNSS reward has enabled good results that have been challenging to obtain by deep RL previously. Our analysis also shows that LNSS exponentially reduces the upper bound on the variances of $Q$ values relative to the respective single-step methods.
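
The following sketch illustrates the general flavor of an N-step surrogate stage reward: each reward is replaced by a normalized discounted sum over the next N rewards, which smooths noise and reduces variance. The exact scaling used by LNSS may differ, so treat this as an assumption-laden approximation rather than the paper's formula:

```python
import numpy as np

def surrogate_rewards(rewards, gamma=0.99, N=10):
    """Replace each stage reward with a normalized discounted sum of the
    next N rewards (truncated near the end of the trajectory). The
    normalization keeps the surrogate on the scale of a one-step reward."""
    T = len(rewards)
    weights = gamma ** np.arange(N)
    out = np.empty(T)
    for t in range(T):
        window = np.asarray(rewards[t:t + N])
        w = weights[:len(window)]
        out[t] = (window * w).sum() / w.sum()
    return out

rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.5, 200)
smoothed = surrogate_rewards(noisy)
print(noisy.var(), smoothed.var())   # the surrogate rewards have lower variance
```

The variance reduction of the per-step reward is what propagates into the lower variance bound on the $Q$ values reported above.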

Guarantees for Self-Play in Multiplayer Games via Polymatrix Decomposability
Revan MacQueen James R. Wright



Research question: Study machine learning via self-play in multi-agent systems.
Motivation: Self-play can generate large quantities of learning data, but the agents a learner faces post-training may behave very differently from what the learner came to expect by interacting with itself.
Method: For multiplayer games that approximately decompose into a set of two-player zero-sum games (constant-sum polymatrix games) in which global ε-Nash equilibria are boundedly far from Nash equilibria in each subgame (subgame stability), any no-external-regret self-play algorithm produces strategies with bounded vulnerability.
Results: For the first time, a structural property of multiplayer games is identified that yields performance guarantees for strategies produced by a broad class of self-play algorithms; the findings are demonstrated with experiments on Leduc poker.

Self-play is a technique for machine learning in multi-agent systems where a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of data for learning, but has the drawback that the agents the learner will face post-training may have dramatically different behavior than the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games) where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enables performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.

State-Action Similarity-Based Representations for Off-Policy Evaluation
Brahma S Pavse Josiah P. Hanna



Research question: Address off-policy evaluation (OPE) in reinforcement learning, i.e., estimating the expected return of an evaluation policy from a fixed dataset.
Motivation: Existing OPE algorithms typically feed the raw fixed dataset directly into FQE to learn the evaluation policy's action-value function, which is not data-efficient.
Method: Enhance the data efficiency of FQE by first transforming the fixed dataset with a learned encoder; to learn the encoder, introduce an OPE-tailored state-action behavioral similarity metric and use this metric together with the fixed dataset to learn an encoder that models it.
Results: Theory and experiments show the approach improves FQE's data efficiency, lowers OPE error, and outperforms other OPE-based representation learning methods on challenging OPE tasks; the learned representations also significantly mitigate divergence of FQE under varying distribution shifts.

In reinforcement learning, off-policy evaluation (OPE) is the problem of estimating the expected return of an evaluation policy given a fixed dataset that was collected by running one or more different policies. One of the more empirically successful algorithms for OPE has been the fitted q-evaluation (FQE) algorithm that uses temporal difference updates to learn an action-value function, which is then used to estimate the expected return of the evaluation policy. Typically, the original fixed dataset is fed directly into FQE to learn the action-value function of the evaluation policy. Instead, in this paper, we seek to enhance the data-efficiency of FQE by first transforming the fixed dataset using a learned encoder, and then feeding the transformed dataset into FQE. To learn such an encoder, we introduce an OPE-tailored state-action behavioral similarity metric, and use this metric and the fixed dataset to learn an encoder that models this metric. Theoretically, we show that this metric allows us to bound the error in the resulting OPE estimate. Empirically, we show that other state-action similarity metrics lead to representations that cannot represent the action-value function of the evaluation policy, and that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks. We also empirically show that the learned representations significantly mitigate divergence of FQE under varying distribution shifts. Our code is available here: https://github.com/Badger-RL/ROPE.

Game Solving with Online Fine-Tuning
Ti-Rong Wu Hung Guei Ting Han Wei Chung-Chin Shih Jui-Te Chin I-Chen Wu



Research question: How to optimize the AlphaZero algorithm via online fine-tuning so that all possible moves by the losing player are answered and a full strategy is obtained.
Motivation: AlphaZero plays at a superhuman level, but its powerful policy and value predictions can be misleading when searching for a complete game solution, especially when evaluating weak moves that rarely arise during self-play.
Method: Propose two online fine-tuning methods that learn heuristics tailor-designed for game solving.
Results: Experiments show that online fine-tuning can solve a series of challenging 7x7 Killall-Go problems using only 23.54% of the computation time of a baseline without online fine-tuning; the savings grow with problem size, and the approach extends to any tree-search algorithm for problem solving.

Game solving is a similar, yet more difficult task than mastering a game. Solving a game typically means to find the game-theoretic value (outcome given optimal play), and optionally a full strategy to follow in order to achieve that outcome. The AlphaZero algorithm has demonstrated super-human level play, and its powerful policy and value predictions have also served as heuristics in game solving. However, to solve a game and obtain a full strategy, a winning response must be found for all possible moves by the losing player. This includes very poor lines of play from the losing side, which the AlphaZero self-play process will not encounter. AlphaZero-based heuristics can be highly inaccurate when evaluating these out-of-distribution positions, which occur throughout the entire search. To address this issue, this paper investigates applying online fine-tuning while searching and proposes two methods to learn tailor-designed heuristics for game solving. Our experiments show that using online fine-tuning can solve a series of challenging 7x7 Killall-Go problems, using only 23.54\% of computation time compared to the baseline without online fine-tuning. Results suggest that the savings scale with problem size. Our method can further be extended to any tree search algorithm for problem solving. Our code is available at https://rlg.iis.sinica.edu.tw/papers/neurips2023-online-fine-tuning-solver.

Weakly Coupled Deep Q-Networks
Ibrahim El Shar Daniel R. Jiang



Research question: How to improve reinforcement learning performance on the class of weakly coupled Markov decision processes (WCMDPs).
Motivation: WCMDPs arise frequently in practice but quickly become intractable as the number of subproblems grows.
Method: Propose weakly coupled deep Q-networks (WCDQN), which uses a single network to train multiple DQN "subagents" and then combines their solutions to establish an upper bound on the optimal action value, guiding the main DQN agent towards optimality.
Results: Numerical experiments show faster convergence than DQN and related techniques in settings with up to 10 subproblems, $3^{10}$ total actions, and a continuous state space.

We propose weakly coupled deep Q-networks (WCDQN), a novel deep reinforcement learning algorithm that enhances performance in a class of structured problems called weakly coupled Markov decision processes (WCMDP). WCMDPs consist of multiple independent subproblems connected by an action space constraint, which is a structural property that frequently emerges in practice. Despite this appealing structure, WCMDPs quickly become intractable as the number of subproblems grows. WCDQN employs a single network to train multiple DQN ``subagents,'' one for each subproblem, and then combines their solutions to establish an upper bound on the optimal action value. This guides the main DQN agent towards optimality. We show that the tabular version, weakly coupled Q-learning (WCQL), converges almost surely to the optimal action value. Numerical experiments show faster convergence compared to DQN and related techniques in settings with as many as 10 subproblems, $3^{10}$ total actions, and a continuous state space.
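
The upper-bounding step can be sketched directly from the weakly coupled structure: a Lagrangian relaxation of the shared constraint decomposes across subproblems, so subagent Q-values sum to an upper bound that can clip the main agent's TD target. The numbers, the fixed multiplier, and all names below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: subproblems are independent except for a shared budget
# constraint, so relaxing the constraint with multiplier lam decomposes
# the problem, and summing subagent values yields an upper bound.
rng = np.random.default_rng(0)
n_sub, n_actions, budget, lam = 3, 4, 2.0, 0.7

# Per-subproblem Q-tables at a fixed state (stand-ins for DQN subagents),
# assumed trained under the Lagrangian reward r_i - lam * cost_i.
q_sub = rng.uniform(0, 1, size=(n_sub, n_actions))

def q_upper_bound(actions):
    """Lagrangian upper bound for a joint action (one action per subproblem)."""
    return lam * budget + sum(q_sub[i, a] for i, a in enumerate(actions))

# Use the bound to clip an otherwise unconstrained TD target for the
# main agent's joint Q-estimate.
joint_action = (1, 0, 3)
td_target = 2.9
clipped_target = min(td_target, q_upper_bound(joint_action))
print(q_upper_bound(joint_action), clipped_target)
```

Clipping the target this way injects the structural knowledge without changing the main agent's architecture.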

Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion
Taehyun Cho Seungyub Han Heesoo Lee Kyungjae Lee Jungwoo Lee



Research question: Distributional reinforcement learning algorithms attempt to exploit estimated uncertainty for exploration, e.g., optimism in the face of uncertainty; however, optimistic exploration based on estimated variance can bias data collection and hinder convergence or performance.
Motivation: This paper proposes a new distributional reinforcement learning method that selects actions by randomizing the risk criterion without losing the risk-neutral objective.
Method: Provide a perturbed distributional Bellman optimality operator by distorting the risk measure, and prove convergence and optimality of the method under a weaker contraction property.
Results: The theoretical results support that the method avoids biased exploration and is guaranteed to converge to an optimal return; empirically it outperforms other distribution-based algorithms in various environments, including 55 Atari games.

Distributional reinforcement learning algorithms have attempted to utilize estimated uncertainty for exploration, such as optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may cause biased data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning method that selects actions by randomizing the risk criterion without losing the risk-neutral objective. We provide a perturbed distributional Bellman optimality operator by distorting the risk measure. Also, we prove the convergence and optimality of the proposed method under a weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments, including 55 Atari games.

Large Language Models Are Semi-Parametric Reinforcement Learning Agents
Danyang Zhang Lu Chen Situo Zhang Hongshen Xu Zihan Zhao Kai Yu



Research question: Propose Rememberer, a novel evolvable agent framework built on a large language model (LLM) that mimics human memory and reasoning mechanisms.
Motivation: Equipping the LLM with a long-term experience memory lets Rememberer exploit past experiences even for different task goals, surpassing LLM-based agents with fixed exemplars or only a transient working memory.
Method: Introduce Reinforcement Learning with Experience Memory (RLEM) to update the memory, so the whole system learns from both successful and failed experiences and improves without fine-tuning the LLM's parameters.
Results: Extensive experiments on two RL task sets show that Rememberer exceeds the prior state of the art by 4% and 2% in success rate, demonstrating its superiority and robustness.

Inspired by insights from cognitive science on human memory and reasoning mechanisms, a novel evolvable LLM-based (Large Language Model) agent framework is proposed as Rememberer. By equipping the LLM with a long-term experience memory, Rememberer is capable of exploiting the experiences from past episodes even for different task goals, which surpasses an LLM-based agent with fixed exemplars or a transient working memory. We further introduce **R**einforcement **L**earning with **E**xperience **M**emory (**RLEM**) to update the memory. Thus, the whole system can learn from the experiences of both success and failure, and evolve its capability without fine-tuning the parameters of the LLM. In this way, the proposed Rememberer constitutes a semi-parametric RL agent. Extensive experiments are conducted on two RL task sets to evaluate the proposed framework. Averaged over different initializations and training sets, the success rate exceeds the prior SOTA by 4% and 2% on the two task sets, demonstrating the superiority and robustness of Rememberer.

Robust Multi-Agent Reinforcement Learning via Adversarial Regularization: Theoretical Foundation and Stable Algorithms
Alexander Bukharin Yan Li Yue Yu Qingru Zhang Zhehui Chen Simiao Zuo Chao Zhang Songan Zhang Tuo Zhao



Research question: Multi-agent reinforcement learning (MARL) performs well across many domains, but its policies often lack robustness and are sensitive to small changes in the environment.
Motivation: Improve robustness to the environment discrepancies MARL may face in real-world deployment.
Method: Propose ERNIE, a new robust MARL framework that promotes Lipschitz continuity of the policies with respect to state observations and actions via adversarial regularization, and reformulate the adversarial regularization as a Stackelberg game to reduce training instability.
Results: Experiments show the ERNIE framework is robust to noisy observations, changing transition dynamics, and malicious actions, performing well on traffic light control and particle environments.

Multi-Agent Reinforcement Learning (MARL) has shown promising results across several domains. Despite this promise, MARL policies often lack robustness and are therefore sensitive to small changes in their environment. This presents a serious concern for the real world deployment of MARL algorithms, where the testing environment may slightly differ from the training environment. In this work we show that we can gain robustness by controlling a policy’s Lipschitz constant, and under mild conditions, establish the existence of a Lipschitz and close-to-optimal policy. Motivated by these insights, we propose a new robust MARL framework, ERNIE, that promotes the Lipschitz continuity of the policies with respect to the state observations and actions by adversarial regularization. The ERNIE framework provides robustness against noisy observations, changing transition dynamics, and malicious actions of agents. However, ERNIE’s adversarial regularization may introduce some training instability. To reduce this instability, we reformulate adversarial regularization as a Stackelberg game. We demonstrate the effectiveness of the proposed framework with extensive experiments in traffic light control and particle environments. In addition, we extend ERNIE to mean-field MARL with a formulation based on distributionally robust optimization that outperforms its non-robust counterpart and is of independent interest. Our code is available at https://github.com/abukharin3/ERNIE.
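
The adversarial regularization component is straightforward to sketch: an inner optimization finds a bounded perturbation of the observations that maximally changes the policy's output, and the resulting change is penalized in the outer loss. The PyTorch snippet below is a minimal sketch with assumed network sizes and inner-loop hyperparameters, not the paper's implementation:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

def adversarial_regularizer(states, eps=0.1, steps=3, lr=0.05):
    """Lipschitz-promoting regularizer: an inner loop (the 'follower' in
    the Stackelberg view) finds a bounded state perturbation that maximally
    changes the policy output; the returned penalty discourages that
    sensitivity, promoting smoothness in the observations."""
    delta = torch.zeros_like(states, requires_grad=True)
    for _ in range(steps):
        diff = (policy(states + delta) - policy(states)).pow(2).mean()
        grad, = torch.autograd.grad(diff, delta)
        with torch.no_grad():
            delta += lr * grad.sign()
            delta.clamp_(-eps, eps)          # keep the perturbation bounded
    return (policy(states + delta.detach()) - policy(states)).pow(2).mean()

states = torch.randn(32, 8)
reg = adversarial_regularizer(states)
# total_loss = rl_loss + beta * reg, for some trade-off coefficient beta
reg.backward()
print(float(reg))
```

Treating the perturbation finder as a follower that best-responds to the policy (rather than a simultaneous adversary) is what the Stackelberg reformulation stabilizes.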

Policy Space Diversity for Non-Transitive Games
Jian Yao Weiming Liu Haobo Fu Yaodong Yang Stephen Marcus McAleer Qiang Fu Yang Wei



Research question: How to improve the efficiency and quality of algorithms for approximating Nash equilibria in multi-agent non-transitive games.
Motivation: Existing Policy-Space Response Oracles (PSRO) algorithms promote policy diversity insufficiently, and more diversity does not necessarily yield a better approximation to a Nash equilibrium.
Method: Propose a new policy-diversity metric and develop a method for optimizing it from state-action samples; incorporating this diversity regularization into PSRO's best-response solving yields a new PSRO variant, "Policy Space Diversity" PSRO (PSD-PSRO).
Results: Experiments on single-state games, Leduc, and Goofspiel show that PSD-PSRO produces significantly less exploitable policies than state-of-the-art PSRO variants.

Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have tried to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily yield (as we prove in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving of PSRO, we obtain a new PSRO variant, \textit{Policy Space Diversity} PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on single-state games, Leduc, and Goofspiel demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.

Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation
Wenhao Ding Laixi Shi Yuejie Chi Ding Zhao



Research question: Address a critical type of robustness in reinforcement learning, robustness against spurious correlation, where different portions of the state are correlated through unobserved confounders rather than any causal relationship.
Motivation: Spurious correlations are ubiquitous in real-world tasks; for example, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlations can fail catastrophically when the test-time confounder deviates from the training one.
Method: Propose Robust State-Confounded Markov Decision Processes (RSC-MDPs), theoretically demonstrate their superiority over other robust RL counterparts in avoiding spurious correlations, and design an empirical algorithm to learn the robust optimal policy for RSC-MDPs.
Results: The learned policy outperforms all baselines on eight realistic self-driving and manipulation tasks, showing strong robustness and generalization.

Robustness has been extensively studied in reinforcement learning (RL) to handle various forms of uncertainty such as random perturbations, rare events, and malicious attacks. In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have causal relationships but are correlated through unobserved confounders. These spurious correlations are ubiquitous in real-world tasks, for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlation could catastrophically fail when the confounder in the test case deviates from the training one. Although well-motivated, enabling robustness against spurious correlation poses significant challenges since the uncertainty set, shaped by the unobserved confounder and causal structure, is difficult to characterize and identify. Existing robust algorithms that assume simple and unstructured uncertainty sets are therefore inadequate to address this challenge. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate its superiority in avoiding learning spurious correlations compared with other robust RL counterparts. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks.

Autonomous Capability Assessment of Sequential Decision-Making Systems in Stochastic Settings
Pulkit Verma Rushang Karia Siddharth Srivastava



Research question: How to let users safely use AI systems, in particular black-box AI systems with sequential decision-making capabilities.
Motivation: Although AI systems are increasingly widespread, users' understanding of their capabilities and limitations remains insufficient, which can lead to misuse and risk.
Method: Propose a new approach that interacts with a black-box SDM system via active learning and learns an interpretable probabilistic model of its capabilities.
Results: Theoretical analysis and experiments show the approach generalizes from few samples, effectively describes the capabilities of arbitrary black-box SDM agents, and guarantees that the learning process converges to the correct agent model.

It is essential for users to understand what their AI systems can and can't do in order to use them safely. However, the problem of enabling users to assess AI systems with sequential decision-making (SDM) capabilities is relatively understudied. This paper presents a new approach for modeling the capabilities of black-box AI systems that can plan and act, along with the possible effects and requirements for executing those capabilities in stochastic settings. We present an active-learning approach that can effectively interact with a black-box SDM system and learn an interpretable probabilistic model describing its capabilities. Theoretical analysis of the approach identifies the conditions under which the learning process is guaranteed to converge to the correct model of the agent; empirical evaluations on different agents and simulated scenarios show that this approach is few-shot generalizable and can effectively describe the capabilities of arbitrary black-box SDM agents in a sample-efficient manner.

Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management
Dhawal Gupta Yinlam Chow Azamat Tulepbergenov Mohammad Ghavamzadeh Craig Boutilier



Research question: How to use reinforcement learning to drive dialogue management effectively, conducting non-myopic, rich conversations that maximize overall user satisfaction.
Motivation: Despite advances in reinforcement learning and language models, using RL to drive conversational chatbots still poses significant challenges, such as the high cost and potential unsafety of online exploration.
Method: Develop RL algorithms specialized for dialogue planning that leverage recent Mixture-of-Expert Language Models (MoE-LMs); exploiting the MoE-LM structure significantly reduces the size of the action space and improves the efficacy of RL-based dialogue management.
Results: Evaluations in open-domain dialogue demonstrate effectiveness with respect to the diversity of intents in generated utterances and overall dialogue-management performance.

Reinforcement learning (RL) has shown great promise for developing agents for dialogue management (DM) that are non-myopic, conduct rich conversations, and maximize overall user satisfaction. Despite the advancements in RL and language models (LMs), employing RL to drive conversational chatbots still poses significant challenges. A primary issue stems from RL’s dependency on online exploration for effective learning, a process that can be costly. Moreover, engaging in online interactions with humans during the training phase can raise safety concerns, as the LM can potentially generate unwanted outputs. This issue is exacerbated by the combinatorial action spaces facing these algorithms, as most LM agents generate responses at the word level. We develop various RL algorithms, specialized in dialogue planning, that leverage recent Mixture-of-Expert Language Models (MoE-LMs)---models that capture diverse semantics, generate utterances reflecting different intents, and are amenable for multi-turn DM. By exploiting the MoE-LM structure, our methods significantly reduce the size of the action space and improve the efficacy of RL-based DM. We evaluate our methods in open-domain dialogue to demonstrate their effectiveness with respect to the diversity of intent in generated utterances and overall DM performance.

Online Nonstochastic Model-Free Reinforcement Learning
Udaya Ghai Arushi Gupta Wenhan Xia Karan Singh Elad Hazan



Research question: Investigate robust model-free reinforcement learning algorithms for environments that may be dynamic or even adversarial.
Motivation: Traditional state-based policies struggle in the presence of unmodeled disturbances, and optimizing linear state-based policies yields nonconvex objectives even in benign settings such as linear dynamical systems.
Method: Drawing inspiration from recent advances in model-based control, introduce a new class of policies centered on disturbance signals; define several categories of such signals, termed pseudo-disturbances, develop corresponding policy classes, and provide efficient, practical algorithms for optimizing them.
Results: Study the online adaptation of RL agents facing adversarial disturbances. The methods integrate seamlessly with any black-box model-free approach and yield provable regret guarantees under linear dynamics, unconditionally improving the best-known results for bandit linear control by removing the dependence on the state-space dimension. Evaluations on standard RL benchmarks show improved robustness.

We investigate robust model-free reinforcement learning algorithms designed for environments that may be dynamic or even adversarial. Traditional state-based policies often struggle to accommodate the challenges imposed by the presence of unmodeled disturbances in such settings. Moreover, optimizing linear state-based policies poses an obstacle to efficient optimization, leading to nonconvex objectives, even in benign environments like linear dynamical systems. Drawing inspiration from recent advancements in model-based control, we introduce a novel class of policies centered on disturbance signals. We define several categories of these signals, which we term pseudo-disturbances, and develop corresponding policy classes based on them. We provide efficient and practical algorithms for optimizing these policies. Next, we examine the task of online adaptation of reinforcement learning agents in the face of adversarial disturbances. Our methods seamlessly integrate with any black-box model-free approach, yielding provable regret guarantees when dealing with linear dynamics. These regret guarantees unconditionally improve the best-known results for bandit linear control in having no dependence on the state-space dimension. We evaluate our method over various standard RL benchmarks and demonstrate improved robustness.

Provably Efficient Algorithm for Nonstationary Low-Rank MDPs
Yuan Cheng Jing Yang Yingbin Liang



Research question: Study nonstationary Markov decision processes for reinforcement learning in changing environments, addressing the unknown representations that arise in deep RL.
Motivation: Existing theoretical studies focus mainly on tabular and linear (mixture) MDPs, which do not capture the unknown representations in deep RL.
Method: Make the first effort to study nonstationary RL in episodic low-rank MDPs, where both transition kernels and rewards may vary over time and the low-rank model contains an unknown representation in addition to the linear state embedding function. Propose a parameter-dependent policy optimization algorithm, PORTAL, and improve it to a parameter-free version, Ada-PORTAL, which needs no prior knowledge of the nonstationarity.
Results: For both algorithms, upper bounds on the average dynamic suboptimality gap show that, as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.

Reinforcement learning (RL) under changing environment models many real-world applications via nonstationary Markov Decision Processes (MDPs), and hence gains considerable interest. However, theoretical studies on nonstationary MDPs in the literature have mainly focused on tabular and linear (mixture) MDPs, which do not capture the nature of unknown representation in deep RL. In this paper, we make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time, and the low-rank model contains unknown representation in addition to the linear state embedding function. We first propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version of Ada-PORTAL, which is able to tune its hyper-parameters adaptively without any prior knowledge of nonstationarity. For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.

Trust Region-Based Safe Distributional Reinforcement Learning for Multiple Constraints
Dohyeong Kim Kyungjae Lee Songhwai Oh



Research question: In safety-critical robotic tasks, how to reduce potential failures while meeting multiple constraints such as avoiding collisions, limiting energy consumption, and maintaining balance.
Motivation: Conventional reinforcement learning algorithms often cannot handle such multiple constraints, so a safe RL method that can is needed.
Method: Propose a trust-region-based safe RL algorithm for multiple constraints, the safe distributional actor-critic (SDAC), which introduces a gradient integration method to manage infeasibility issues and develops a TD(λ) target distribution to estimate risk-averse constraints.
Results: Extensive experiments show that, compared with safe RL baselines, SDAC requires 1.93 times fewer steps to satisfy all constraints in multi-constrained tasks and incurs 1.78 times fewer constraint violations in single-constrained tasks.

In safety-critical robotic tasks, potential failures must be reduced, and multiple constraints must be met, such as avoiding collisions, limiting energy consumption, and maintaining balance. Thus, applying safe reinforcement learning (RL) in such robotic tasks requires handling multiple constraints and using risk-averse rather than risk-neutral constraints. To this end, we propose a trust region-based safe RL algorithm for multiple constraints called a safe distributional actor-critic (SDAC). Our main contributions are as follows: 1) introducing a gradient integration method to manage infeasibility issues in multi-constrained problems, ensuring theoretical convergence, and 2) developing a TD($\lambda$) target distribution to estimate risk-averse constraints with low biases. We evaluate SDAC through extensive experiments involving multi- and single-constrained robotic tasks. While maintaining high scores, SDAC shows 1.93 times fewer steps to satisfy all constraints in multi-constrained tasks and 1.78 times fewer constraint violations in single-constrained tasks compared to safe RL baselines. Code is available at: https://github.com/rllab-snu/Safe-Distributional-Actor-Critic.

Cross-Episodic Curriculum for Transformer Agents
Lucy Xiaoyang Shi Yunfan Jiang Jake Grigsby Linxi Fan Yuke Zhu



Research question: How to improve the learning efficiency and generalization of Transformer agents.
Motivation: Placing cross-episodic experiences into a Transformer's context forms a curriculum that can boost learning efficiency and generalization.
Method: Propose a new algorithm, Cross-Episodic Curriculum (CEC), which sequentially structures online learning trials and mixed-quality demonstrations to build curricula that capture learning progression and increasing proficiency.
Results: CEC is demonstrated in two representative scenarios, multi-task reinforcement learning in DeepMind Lab and imitation learning in RoboMimic; in both cases, policies produced by CEC show superior performance and strong generalization.

We present a new algorithm, Cross-Episodic Curriculum (CEC), to boost the learning efficiency and generalization of Transformer agents. Central to CEC is the placement of cross-episodic experiences into a Transformer’s context, which forms the basis of a curriculum. By sequentially structuring online learning trials and mixed-quality demonstrations, CEC constructs curricula that encapsulate learning progression and proficiency increase across episodes. Such synergy combined with the potent pattern recognition capabilities of Transformer models delivers a powerful cross-episodic attention mechanism. The effectiveness of CEC is demonstrated under two representative scenarios: one involving multi-task reinforcement learning with discrete control, such as in DeepMind Lab, where the curriculum captures the learning progression in both individual and progressively complex settings; and the other involving imitation learning with mixed-quality data for continuous control, as seen in RoboMimic, where the curriculum captures the improvement in demonstrators' expertise. In all instances, policies resulting from CEC exhibit superior performance and strong generalization. Code is open-sourced on the project website https://cec-agent.github.io/ to facilitate research on Transformer agent learning.

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning
Gen Li Wenhao Zhan Jason D. Lee Yuejie Chi Yuxin Chen



Research question: Study tabular reinforcement learning in the hybrid setting, i.e., effective policy fine-tuning given both a known offline dataset and online interaction with an unknown environment.
Motivation: The central question is how to efficiently use online data to strengthen and complement the offline dataset for effective policy fine-tuning.
Method: Leveraging recent advances in reward-agnostic exploration and offline reinforcement learning, design a three-stage hybrid RL algorithm whose sample complexity beats both pure offline RL and pure online RL.
Results: The proposed algorithm requires no reward information during data collection. The theory is built on a new notion, single-policy partial concentrability, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds --- pure offline RL and pure online RL --- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection. Our theory is developed based on a new notion called **single-policy partial concentrability**, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.

Bayesian Risk-Averse Q-Learning with Streaming Observations
Yuhao Wang Enlu Zhou



Research question: How to learn via a simulated training environment while addressing the model mismatch between the training and true environments caused by lack of data.
Motivation: To account for model uncertainty, adopt an infinite-horizon Bayesian risk MDP (BRMDP) formulation that estimates the transition model with a Bayesian posterior and imposes a risk functional.
Method: Develop a multi-stage Bayesian risk-averse Q-learning algorithm that solves the BRMDP using streaming observations from the real environment.
Results: Theoretical analysis shows BRMDP balances robustness and conservativeness; the algorithm learns a risk-averse yet optimal policy that depends on the availability of real-world observations, with a strong convergence guarantee.

We consider a robust reinforcement learning problem, where a learning agent learns from a simulated training environment. To account for the model mis-specification between this training environment and the true environment due to lack of data, we adopt a formulation of Bayesian risk MDP (BRMDP) with infinite horizon, which uses Bayesian posterior to estimate the transition model and impose a risk functional to account for the model uncertainty. Observations from the real environment, which are outside the agent's control, arrive periodically and are utilized by the agent to update the Bayesian posterior to reduce model uncertainty. We theoretically demonstrate that BRMDP balances the trade-off between robustness and conservativeness, and we further develop a multi-stage Bayesian risk-averse Q-learning algorithm to solve BRMDP with streaming observations from real environment. The proposed algorithm learns a risk-averse yet optimal policy that depends on the availability of real-world observations. We provide a theoretical guarantee of strong convergence for the proposed algorithm.
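
A tabular sketch of the idea: maintain a Dirichlet posterior over transitions, update it from streaming real-environment observations, and form the Q-learning target with a risk functional over posterior samples. The mean-minus-standard-deviation functional and all constants below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, kappa = 3, 2, 0.9, 1.0

alpha = np.ones((n_states, n_actions, n_states))   # Dirichlet counts
R = rng.uniform(0, 1, size=(n_states, n_actions))  # rewards assumed known, for brevity
Q = np.zeros((n_states, n_actions))

def risk_averse_target(s, a, n_samples=50):
    """Bellman target under a risk functional over posterior model samples."""
    targets = []
    for _ in range(n_samples):
        p = rng.dirichlet(alpha[s, a])              # sample a transition model
        targets.append(R[s, a] + gamma * (p * Q.max(axis=1)).sum())
    t = np.asarray(targets)
    return t.mean() - kappa * t.std()               # penalize model uncertainty

# Streaming loop: observe (s, a, s') from the real environment, refresh
# the posterior, then take one risk-averse Q-learning step.
for step in range(1000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s_next = rng.integers(n_states)                 # stand-in for real data
    alpha[s, a, s_next] += 1.0                      # posterior update
    Q[s, a] += 0.1 * (risk_averse_target(s, a) - Q[s, a])
print(Q)
```

As real observations accumulate, the posterior concentrates, the uncertainty penalty shrinks, and the policy becomes less conservative, which is the robustness/conservativeness trade-off described above.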

Bi-Level Offline Policy Optimization with Limited Exploration
Wenzhuo Zhou



Research question: Address the distributional shift in offline reinforcement learning caused by datasets lacking sufficient exploration.
Motivation: Existing offline RL methods struggle with distributional shift, especially under function approximation.
Method: Propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the two levels: the lower level constructs a confidence set of value estimates while controlling the uncertainty arising from distribution mismatch, and the upper level maximizes a conservative value estimate drawn from that confidence set.
Results: Experiments on synthetic, benchmark, and real-world offline RL datasets show the method performs competitively with state-of-the-art approaches.

We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. A fundamental challenge behind this task is the distributional shift due to the dataset lacking sufficient exploration, especially under function approximation. To tackle this issue, we propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level). The lower level focuses on constructing a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors, while controlling uncertainty arising from distribution mismatch. Subsequently, at the upper level, the policy aims to maximize a conservative value estimate from the confidence set formed at the lower level. This novel formulation preserves the maximum flexibility of the implicitly induced exploratory data distribution, enabling the power of model extrapolation. In practice, it can be solved through a computationally efficient, penalized adversarial estimation procedure. Our theoretical regret guarantees do not rely on any data-coverage and completeness-type assumptions, only requiring realizability. These guarantees also demonstrate that the learned policy represents the ``best effort'' among all policies, as no other policies can outperform it. We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.

Efficient Adversarial Attacks on Online Multi-agent Reinforcement Learning
Guanlin Liu Lifeng Lai



Research question: Understand the impact of adversarial attacks on multi-agent reinforcement learning (MARL) models in order to safeguard their application.
Motivation: Since MARL is widely applied across many domains, understanding the effects of adversarial attacks on it is essential for its safe use.
Method: Consider an exogenous attacker who can modify rewards before agents receive them or manipulate actions before the environment receives them, aiming to guide each agent into a target policy or to maximize cumulative reward under an attacker-chosen reward function while minimizing the amount of manipulation. First show the limitations of action-poisoning-only and reward-poisoning-only attacks, then introduce an attack strategy that mixes action poisoning and reward poisoning.
Results: Experiments show the mixed attack strategy can efficiently attack MARL agents even when the attacker has no prior information about the underlying environment or the agents' algorithms.

Due to the broad range of applications of multi-agent reinforcement learning (MARL), understanding the effects of adversarial attacks against MARL model is essential for the safe applications of this model. Motivated by this, we investigate the impact of adversarial attacks on MARL. In the considered setup, there is an exogenous attacker who is able to modify the rewards before the agents receive them or manipulate the actions before the environment receives them. The attacker aims to guide each agent into a target policy or maximize the cumulative rewards under some specific reward function chosen by the attacker, while minimizing the amount of manipulation of feedback and actions. We first show the limitations of the action poisoning only attacks and the reward poisoning only attacks. We then introduce a mixed attack strategy with both the action poisoning and reward poisoning. We show that the mixed attack strategy can efficiently attack MARL agents even if the attacker has no prior information about the underlying environment and the agents’ algorithms.

Successor-Predecessor Intrinsic Exploration
Changmin Yu Neil Burgess Maneesh Sahani Samuel Gershman



Research question: How to explore effectively in reinforcement learning, especially in environments with sparse external rewards.
Motivation: Existing methods construct intrinsic rewards mainly from measures of the future prospects of states, ignoring the information contained in the retrospective structure of transition sequences.
Method: Propose Successor-Predecessor Intrinsic Exploration (SPIE), a new intrinsic-reward algorithm that combines prospective and retrospective information, enabling the agent to use retrospective information to produce structure-aware exploratory behavior.
Results: Experiments show SPIE yields more efficient and ethologically plausible exploration in environments with sparse rewards and bottleneck states, and deep RL agents using SPIE achieve stronger empirical performance than existing methods on sparse-reward Atari games.

Exploration is essential in reinforcement learning, particularly in environments where external rewards are sparse. Here we focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards. Although the study of intrinsic rewards has a long history, existing methods focus on composing the intrinsic reward based on measures of future prospects of states, ignoring the information contained in the retrospective structure of transition sequences. Here we argue that the agent can utilise retrospective information to generate explorative behaviour with structure-awareness, facilitating efficient exploration based on global instead of local information. We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information. We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods. We also implement SPIE in deep reinforcement learning agents, and show that the resulting agent achieves stronger empirical performance than existing methods on sparse-reward Atari games.
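
One way to picture the prospective/retrospective combination is with tabular successor and predecessor representations, where the intrinsic reward is high for states with low combined occupancy. This is a minimal sketch under assumed update rules and an assumed combination; SPIE's exact construction may differ:

```python
import numpy as np

# Successor representation (SR) learned by TD on the forward stream, a
# predecessor representation (PR) learned on the time-reversed stream,
# and an intrinsic reward favoring states that are novel with respect to
# both where the agent goes and where it comes from.
rng = np.random.default_rng(0)
n, gamma, lr = 6, 0.95, 0.1
SR = np.eye(n)          # SR[s, s']: expected discounted future occupancy
PR = np.eye(n)          # PR[s, s']: retrospective analogue

def intrinsic_reward(s):
    return 1.0 / np.linalg.norm(SR[s] + PR[s])

s = 0
for t in range(5000):
    s_next = (s + rng.choice([-1, 1])) % n          # random walk on a ring
    one_hot = np.eye(n)[s]
    # TD updates: SR looks forward, PR looks backward.
    SR[s] += lr * (one_hot + gamma * SR[s_next] - SR[s])
    PR[s_next] += lr * (np.eye(n)[s_next] + gamma * PR[s] - PR[s_next])
    s = s_next
print([round(intrinsic_reward(i), 3) for i in range(n)])
```

Because both representations aggregate occupancy over whole trajectories, the resulting bonus reflects global structure (e.g., bottlenecks) rather than purely local visit counts.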

Effectively Learning Initiation Sets in Hierarchical Reinforcement Learning
Akhil Bagaria Ben M Abbatematteo Omer Gottesman Matt Corsaro Sreehari Rammohan George Konidaris



Research question: In hierarchical reinforcement learning, an agent learning an option must solve three problems: identify the option's subgoal (termination condition), learn a policy, and learn where that policy will succeed (initiation set).
Motivation: The termination condition is typically identified first, but the option policy and initiation set must be learned simultaneously, which is challenging because the initiation set depends on the option policy, which changes as the agent learns; data from option execution therefore becomes invalid over time, yielding inaccurate initiation sets that harm downstream task performance.
Method: Propose to address three issues specific to learning initiation sets, namely data non-stationarity, temporal credit assignment, and pessimism, using tools from off-policy value estimation and classification.
Results: The method learns higher-quality initiation sets faster than existing methods (in MiniGrid and Montezuma's Revenge), automatically discovers promising grasps for robot manipulation (in Robosuite), and improves the performance of a state-of-the-art option discovery method on a challenging maze navigation task in MuJoCo.

An agent learning an option in hierarchical reinforcement learning must solve three problems: identify the option's subgoal (termination condition), learn a policy, and learn where that policy will succeed (initiation set). The termination condition is typically identified first, but the option policy and initiation set must be learned simultaneously, which is challenging because the initiation set depends on the option policy, which changes as the agent learns. Consequently, data obtained from option execution becomes invalid over time, leading to an inaccurate initiation set that subsequently harms downstream task performance. We highlight three issues---data non-stationarity, temporal credit assignment, and pessimism---specific to learning initiation sets, and propose to address them using tools from off-policy value estimation and classification. We show that our method learns higher-quality initiation sets faster than existing methods (in MiniGrid and Montezuma's Revenge), can automatically discover promising grasps for robot manipulation (in Robosuite), and improves the performance of a state-of-the-art option discovery method in a challenging maze navigation task in MuJoCo.

StateMask: Explaining Deep Reinforcement Learning through State Mask
Zelei Cheng Xian Wu Jiahao Yu Wenhai Sun Wenbo Guo Xinyu Xing



Research question: Despite strong performance in many challenging scenarios, the black-box nature of deep reinforcement learning (DRL) agents greatly limits their application in critical domains.
Motivation: To address this, we propose StateMask, a new method for identifying the states most critical to the agent's final reward.
Method: The core idea of StateMask is to learn a mask net that blinds the target agent and forces it to take random actions at some steps without degrading its performance; through careful design, the masked agent is theoretically guaranteed to perform similarly to the original agent.
Results: Evaluations in various popular RL environments show StateMask outperforms existing explainers in explanation fidelity, and it offers further utilities such as launching adversarial attacks and patching policy errors.

Despite the promising performance of deep reinforcement learning (DRL) agents in many challenging scenarios, the black-box nature of these agents greatly limits their applications in critical domains. Prior research has proposed several explanation techniques to understand the deep learning-based policies in RL. Most existing methods explain why an agent takes individual actions rather than pinpointing the critical steps to its final reward. To fill this gap, we propose StateMask, a novel method to identify the states most critical to the agent's final reward. The high-level idea of StateMask is to learn a mask net that blinds a target agent and forces it to take random actions at some steps without compromising the agent's performance. Through careful design, we can theoretically ensure that the masked agent performs similarly to the original agent. We evaluate StateMask in various popular RL environments and show its superiority over existing explainers in explanation fidelity. We also show that StateMask has better utilities, such as launching adversarial attacks and patching policy errors.
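
The masking mechanism can be sketched in a few lines: a mask net emits, per state, the probability of blinding the agent (forcing a random action), and states where blinding must be avoided to preserve return are the critical ones. Everything below, including the hand-coded stand-in networks and toy reward, is illustrative:

```python
import random

ACTIONS = [0, 1]

def target_policy(state):          # stand-in for the trained agent
    return 0 if state < 5 else 1

def mask_net(state):               # stand-in: probability of blinding
    return 0.9 if state != 5 else 0.0   # state 5 is treated as critical

def rollout(env_steps=10):
    """Run one masked episode; collect the states the mask net refuses
    to blind, which serve as the explanation of the final reward."""
    total, critical = 0.0, []
    state = 0
    for _ in range(env_steps):
        if random.random() < mask_net(state):
            action = random.choice(ACTIONS)      # blinded: random action
        else:
            action = target_policy(state)        # unmasked: agent acts
            critical.append(state)               # low mask prob => critical
        total += 1.0 if (state >= 5) == (action == 1) else 0.0
        state += 1
    return total, critical

random.seed(0)
print(rollout())
```

Training then pushes the mask net to blind as often as possible subject to the masked return staying close to the original, so the surviving unmasked states localize where the reward is actually earned.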

Diverse Conventions for Human-AI Collaboration
Bidipta Sarkar Andy Shih Dorsa Sadigh



Research question: How to generate diverse conventions in cooperative multi-agent games so as to generalize better when interacting with new partners.
Motivation: Standard multi-agent reinforcement learning techniques such as self-play converge to arbitrary, non-diverse conventions, leading to poor performance when interacting with new partners.
Method: Generate diverse conventions by maximizing their rewards during self-play while minimizing their rewards when playing with previously discovered conventions (cross-play).
Results: Across various multi-agent collaborative games, including Overcooked, the technique adapts to human conventions and surpasses human-level performance when paired with real users.

Conventions are crucial for strong performance in cooperative multi-agent games, because they allow players to coordinate on a shared strategy without explicit communication. Unfortunately, standard multi-agent reinforcement learning techniques, such as self-play, converge to conventions that are arbitrary and non-diverse, leading to poor generalization when interacting with new partners. In this work, we present a technique for generating diverse conventions by (1) maximizing their rewards during self-play, while (2) minimizing their rewards when playing with previously discovered conventions (cross-play), stimulating conventions to be semantically different. To ensure that learned policies act in good faith despite the adversarial optimization of cross-play, we introduce mixed-play, where an initial state is randomly generated by sampling self-play and cross-play transitions and the player learns to maximize the self-play reward from this initial state. We analyze the benefits of our technique on various multi-agent collaborative games, including Overcooked, and find that our technique can adapt to the conventions of humans, surpassing human-level performance when paired with real users.
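To make the self-play/cross-play trade-off concrete, here is a minimal sketch on a two-action coordination matrix game: each new convention maximizes its self-play payoff minus a penalty on its cross-play payoff against previously found conventions. The softmax parameterization, finite-difference ascent, and the weight `lam` are illustrative assumptions, not the paper's setup (which also uses mixed-play).

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])  # payoff: coordinate on either action
lam = 0.5  # assumed diversity weight

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def objective(theta, prev_policies):
    pi = softmax(theta)
    self_play = pi @ A @ pi                      # reward with a copy of itself
    cross_play = sum(pi @ A @ q for q in prev_policies)
    return self_play - lam * cross_play          # diverse-conventions objective

def train_convention(prev_policies, steps=500, lr=0.5, eps=1e-4):
    rng = np.random.default_rng(len(prev_policies))
    theta = rng.normal(size=2)
    for _ in range(steps):
        # Finite-difference gradient ascent keeps the sketch dependency-free.
        grad = np.array([
            (objective(theta + eps * e, prev_policies)
             - objective(theta - eps * e, prev_policies)) / (2 * eps)
            for e in np.eye(2)])
        theta += lr * grad
    return softmax(theta)

conventions = []
for _ in range(2):
    conventions.append(train_convention(conventions))
print(conventions)  # the second convention should coordinate on the other action
```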

Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion
Ethan Pronovost Meghana Reddy Ganesina Noureldin Hendy Zeyu Wang Andres Morales Kai Wang Nicholas Roy



Research question: How to effectively generate controllable synthetic traffic scenarios to scale the safety validation of autonomous vehicles.
Motivation: Existing automated methods for generating synthetic traffic scenarios offer insufficient controllability and do not adapt to different geographical regions.
Method: Propose Scenario Diffusion, a novel diffusion-based architecture that combines latent diffusion, object detection, and trajectory regression to simultaneously generate distributions of agent poses, orientations, and trajectories. The distribution is conditioned on the map and on sets of tokens describing the desired scenario, providing additional control over the generated scene.
Results: Experiments show that the method is expressive enough to model diverse traffic patterns and generalizes to different geographical regions.

Automated creation of synthetic traffic scenarios is a key part of scaling the safety validation of autonomous vehicles (AVs). In this paper, we propose Scenario Diffusion, a novel diffusion-based architecture for generating traffic scenarios that enables controllable scenario generation. We combine latent diffusion, object detection and trajectory regression to generate distributions of synthetic agent poses, orientations and trajectories simultaneously. This distribution is conditioned on the map and sets of tokens describing the desired scenario to provide additional control over the generated scenario. We show that our approach has sufficient expressive capacity to model diverse traffic patterns and generalizes to different geographical regions.

Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback
Han Shao Lee Cohen Avrim Blum Yishay Mansour Aadirupa Saha Matthew Walter



Research question: Propose a multi-objective decision-making framework that accommodates user preferences over objectives, with preferences learned through policy comparisons.
Motivation: Each user weighs the objectives differently, but these weights are typically unknown; the goal is to compute a near-optimal policy for a given user.
Method: The model consists of a known Markov decision process with a vector-valued reward function, where each user has an unknown preference vector expressing the relative importance of each objective. Two user feedback models are considered: in one, the user is shown two policies and returns the preferred one; in the other, the user is shown two small weighted sets of representative trajectories and selects the preferred set. In both cases, the proposed algorithm uses a number of comparison queries that scales quasilinearly in the number of objectives.
Results: The method efficiently computes a near-optimal policy for the user while respecting their preferences.

In this work, we propose a multi-objective decision making framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our model consists of a known Markov decision process with a vector-valued reward function, with each user having an unknown preference vector that expresses the relative importance of each objective. The goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and returns their preferred policy as feedback. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a number of comparison queries that scales quasilinearly in the number of objectives.

On Imitation in Mean-field Games
Giorgia Ramponi Pavel Kolev Olivier Pietquin Niao He Mathieu Lauriere Matthieu Geist



Research question: Explore imitation learning (IL) in mean-field games (MFGs), where the goal is to imitate the behavior of a population of agents following a Nash equilibrium policy under an unknown payoff function.
Motivation: Compared with single-agent IL, IL in MFGs poses new challenges, particularly when both the reward function and the transition kernel depend on the population distribution.
Method: We introduce a new solution concept called the Nash imitation gap. We then show that when only the reward depends on the population distribution, IL in MFGs reduces to single-agent IL with similar guarantees. When the dynamics are population-dependent, however, we provide a novel upper bound suggesting that IL is harder in this setting. To address this, we propose a new adversarial formulation in which the reinforcement learning problem is replaced by a mean-field control (MFC) problem.
Results: These findings suggest that progress in IL for MFGs may have to build upon MFC.

We explore the problem of imitation learning (IL) in the context of mean-field games (MFGs), where the goal is to imitate the behavior of a population of agents following a Nash equilibrium policy according to some unknown payoff function. IL in MFGs presents new challenges compared to single-agent IL, particularly when both the reward function and the transition kernel depend on the population distribution. In this paper, departing from the existing literature on IL for MFGs, we introduce a new solution concept called the Nash imitation gap. Then we show that when only the reward depends on the population distribution, IL in MFGs can be reduced to single-agent IL with similar guarantees. However, when the dynamics is population-dependent, we provide a novel upper-bound that suggests IL is harder in this setting. To address this issue, we propose a new adversarial formulation where the reinforcement learning problem is replaced by a mean-field control (MFC) problem, suggesting progress in IL within MFGs may have to build upon MFC.

Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates
Sarath Sreedharan Michael Katz



Research question: How to combine symbolic models with reinforcement learning to solve problems in sparse-reward settings.
Motivation: Existing methods are limited by their assumption of access to a symbolic approximation of the underlying problem; more efficient exploration is needed.
Method: Propose a new method for learning optimistic symbolic approximations of the world model, coupling the learned model dynamics with fast diverse planners developed by the automated planning community to drive optimistic exploration.
Results: The method is evaluated on multiple benchmark domains and compared with other RL strategies, demonstrating its effectiveness.

There has been increasing interest in using symbolic models alongside reinforcement learning (RL), where these coarser abstract models provide RL agents with higher-level guidance. However, most of these works are inherently limited by their assumption of having access to a symbolic approximation of the underlying problem. To address this issue, we introduce a new method for learning optimistic symbolic approximations of the underlying world model. We show how these representations, coupled with fast diverse planners developed by the automated planning community, provide a new paradigm for optimistic exploration in sparse-reward settings. We also investigate the possibility of speeding up the learning process by generalizing learned model dynamics across similar actions with minimal human input. Finally, we evaluate the method on multiple benchmark domains and compare it with other RL strategies.

RoboCLIP: One Demonstration is Enough to Learn Robot Policies
Sumedh Anand Sontakke Jesse Zhang Séb Arnold Karl Pertsch Erdem Biyik Dorsa Sadigh Chelsea Finn Laurent Itti



Research question: Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions.
Motivation: Inspired by advances in Video-and-Language Models (VLMs), we propose RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large-data requirement), in the form of a video demonstration or a textual task description, to generate rewards without manual reward-function design.
Method: RoboCLIP uses pretrained VLMs for reward generation without any finetuning, and can also exploit out-of-domain demonstrations, such as videos of humans solving the task.
Results: Reinforcement learning agents trained with RoboCLIP rewards achieve 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, using only one video/text demonstration.

Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods attempt to circumvent these problems by utilizing expert demonstrations instead of using an extrinsic reward function but typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of Video-and-Language Models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large data requirement) in the form of a video demonstration or a textual description of the task to generate rewards without manual reward function design. Additionally, RoboCLIP can also utilize out-of-domain demonstrations, like videos of humans solving the task for reward generation, circumventing the need to have the same demonstration and deployment domains. RoboCLIP utilizes pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, doing so using only one video/text demonstration. Visit our website at https://sites.google.com/view/roboclip/home for experiment videos.
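As a rough illustration of the VLM-similarity idea, the sketch below scores an episode by the cosine similarity between the embedding of the agent's episode video and a precomputed demonstration embedding. `encode_video` is a hypothetical placeholder for a pretrained video-language encoder; the terminal-reward setup is an illustrative simplification, not RoboCLIP's exact pipeline.

```python
import numpy as np

def encode_video(frames: np.ndarray) -> np.ndarray:
    # Placeholder for a pretrained VLM encoder: mean-pool pixels over
    # time/height/width and apply a fixed random projection.
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(frames.shape[-1], 64))
    return frames.mean(axis=(0, 1, 2)) @ proj

def vlm_similarity_reward(episode_frames, demo_embedding):
    # Sparse terminal reward: cosine similarity between the agent's episode
    # embedding and the demonstration embedding.
    z = encode_video(episode_frames)
    return float(z @ demo_embedding /
                 (np.linalg.norm(z) * np.linalg.norm(demo_embedding) + 1e-8))

# Usage on synthetic (T, H, W, C) video tensors.
demo = encode_video(np.random.default_rng(1).random((16, 32, 32, 3)))
episode = np.random.default_rng(2).random((16, 32, 32, 3))
print(vlm_similarity_reward(episode, demo))
```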

Demo2Code: From Summarizing Demonstrations to Synthesizing Code via Extended Chain-of-Thought
Huaxiaoyue Wang Gonzalo Gonzalez-Pumariega Yash Sharma Sanjiban Choudhury



Research question: How to translate language instructions and demonstrations into code for personalized robot tasks.
Motivation: While large language models excel at translating language instructions into robot task code, translating demonstrations into task code remains challenging because demonstrations and code are both long and complex, making a direct mapping intractable.
Method: This paper proposes Demo2Code, a novel framework that generates robot task code from demonstrations via an extended chain-of-thought, defining a common latent specification connecting the two. The framework employs a robust two-stage process: (1) a recursive summarization technique that condenses demonstrations into concise specifications, and (2) a code synthesis approach that recursively expands each function from the generated specifications.
Results: We conduct extensive evaluations on various robot task benchmarks, including Robotouille, a new game benchmark designed to simulate diverse cooking tasks in a kitchen environment.

Language instructions and demonstrations are two natural ways for users to teach robots personalized tasks. Recent progress in Large Language Models (LLMs) has shown impressive performance in translating language instructions into code for robotic tasks. However, translating demonstrations into task code continues to be a challenge due to the length and complexity of both demonstrations and code, making learning a direct mapping intractable. This paper presents Demo2Code, a novel framework that generates robot task code from demonstrations via an extended chain-of-thought and defines a common latent specification to connect the two. Our framework employs a robust two-stage process: (1) a recursive summarization technique that condenses demonstrations into concise specifications, and (2) a code synthesis approach that expands each function recursively from the generated specifications. We conduct extensive evaluation on various robot task benchmarks, including a novel game benchmark Robotouille, designed to simulate diverse cooking tasks in a kitchen environment.

Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms
Yashaswini Murthy Mehrdad Moharrami R. Srikant



Research question: How to obtain meaningful performance bounds for approximate policy iteration and reinforcement learning algorithms in the average-reward setting.
Motivation: In applications where the average-reward objective is the meaningful metric, discounted formulations with a discount factor close to 1 are typically used, but the corresponding performance bounds scale with the square of the horizon. Obtaining finite-time error bounds has therefore been an open problem.
Method: Solve this open problem by obtaining the first non-trivial finite-time error bounds for average-reward MDPs that go to zero as the policy evaluation and policy improvement errors go to zero.
Results: The resulting finite-time error bounds vanish in the limit of exact policy evaluation and policy improvement, resolving the open problem.

Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, often discounted reward formulations are used with the discount factor being close to $1,$ which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first non-trivial finite time error bounds for average-reward MDPs which go to zero in the limit as policy evaluation and policy improvement errors go to zero.

Information-guided Planning: An Online Approach for Partially Observable Problems
Matheus Aparecido Do Carmo Alves Amokh Varma Yehia Elkhatib Leandro Soriano Marcolino



Research question: This paper presents IB-POMCP, a novel algorithm for online planning under partial observability.
Motivation: Existing planning algorithms struggle in scenarios with sparse reward configurations; the goal is to improve decision-making by using estimates of the entropy of the world belief to guide the tree search.
Method: The approach, termed information-guided planning, incorporates a novel I-UCB function and comes with theoretical convergence guarantees.
Results: Experiments show significant improvements over state-of-the-art baselines in both reward and reasoning time across several benchmark scenarios.

This paper presents IB-POMCP, a novel algorithm for online planning under partial observability. Our approach enhances the decision-making process by using estimations of the world belief's entropy to guide a tree search process and surpass the limitations of planning in scenarios with sparse reward configurations. By performing what we denominate as an *information-guided planning process*, the algorithm, which incorporates a novel I-UCB function, shows significant improvements in reward and reasoning time compared to state-of-the-art baselines in several benchmark scenarios, along with theoretical convergence guarantees.
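The paper's I-UCB formula is not reproduced in the abstract, so the sketch below only illustrates the general idea of entropy-modulated tree search: scale a standard UCB exploration bonus by the normalized entropy of the current belief, so the planner explores more when the world state is uncertain. This specific formula is a hypothetical stand-in, not IB-POMCP's I-UCB.

```python
import numpy as np

def belief_entropy(belief):
    p = belief[belief > 0]
    return -(p * np.log(p)).sum()

def entropy_weighted_ucb(q_values, visit_counts, belief, c=1.0):
    # Normalized entropy in [0, 1]: 1 means maximal uncertainty about the state.
    h = belief_entropy(belief) / np.log(len(belief))
    n_total = visit_counts.sum()
    # Standard UCB bonus, inflated when the belief is high-entropy.
    bonus = c * (1.0 + h) * np.sqrt(np.log(n_total + 1) / (visit_counts + 1e-8))
    return int(np.argmax(q_values + bonus))

print(entropy_weighted_ucb(np.array([0.2, 0.1]), np.array([10, 2]),
                           belief=np.array([0.5, 0.25, 0.25])))
```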

Regularity as Intrinsic Reward for Free Play
Cansu Sancaktar Justus Piater Georg Martius



Research question: Propose regularity, a new intrinsic reward signal to guide exploration in reinforcement learning.
Motivation: Taking inspiration from child development, we postulate that striving for structure and order helps bias exploration toward a subspace of tasks that naive uncertainty-based intrinsic rewards do not favor.
Method: Operationalize Regularity as Intrinsic Reward (RaIR) within model-based reinforcement learning. In a synthetic environment, we showcase the structured patterns that emerge from pursuing the regularity objective, and we demonstrate the method's strength in a multi-object robotic manipulation environment.
Results: Incorporating RaIR into free play to complement the model's epistemic uncertainty as an intrinsic reward, we observe the autonomous construction of towers and other regular structures, leading to a substantial improvement in zero-shot downstream performance on assembly tasks.

We propose regularity as a novel reward signal for intrinsically-motivated reinforcement learning. Taking inspiration from child development, we postulate that striving for structure and order helps guide exploration towards a subspace of tasks that are not favored by naive uncertainty-based intrinsic rewards. Our generalized formulation of Regularity as Intrinsic Reward (RaIR) allows us to operationalize it within model-based reinforcement learning. In a synthetic environment, we showcase the plethora of structured patterns that can emerge from pursuing this regularity objective. We also demonstrate the strength of our method in a multi-object robotic manipulation environment. We incorporate RaIR into free play and use it to complement the model’s epistemic uncertainty as an intrinsic reward. Doing so, we witness the autonomous construction of towers and other regular structures during free play, which leads to a substantial improvement in zero-shot downstream task performance on assembly tasks.

Tempo Adaptation in Non-stationary Reinforcement Learning
Hyunin Lee Yuhao Ding Jongmin Lee Ming Jin Javad Lavaei Somayeh Sojoudi



Research question: This paper addresses the "time synchronization" problem between the agent and the environment in non-stationary reinforcement learning, a key obstacle to its real-world application.
Motivation: In reality, environmental changes occur over wall-clock time ($t$) rather than episode progress ($k$). In existing work, at each episode $k$ the agent rolls out a trajectory and trains a policy before moving to episode $k+1$. In a time-desynchronized environment, however, the agent at time $t_k$ allocates $\Delta t$ for trajectory generation and training and then moves to the next episode at $t_k+\Delta t$. Although the total number of episodes ($K$) is fixed, the agent accumulates different trajectories depending on the chosen interaction times ($t_1,t_2,...,t_K$), which significantly affects the suboptimality gap of the policy.
Method: We propose a Proactively Synchronizing Tempo ($\texttt{ProST}$) framework that computes a suboptimal sequence {$t_1,t_2,...,t_K$} by minimizing an upper bound on its performance measure, the dynamic regret. Our main contribution is showing that a suboptimal {$t_{1:K}$} trades off between the policy training time (agent tempo) and how fast the environment changes (environment tempo). Theoretically, this work derives a suboptimal {$t_{1:K}$} as a function of the degree of the environment's non-stationarity while also achieving sublinear dynamic regret.
Results: Experiments on various high-dimensional non-stationary environments show that the $\texttt{ProST}$ framework achieves a higher online return at the suboptimal {$t_{1:K}$} than existing methods.

We first raise and tackle a ``time synchronization'' issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time ($t$) rather than episode progress ($k$), where wall-clock time signifies the actual elapsed time within the fixed duration $t \in [0, T]$. In existing works, at episode $k$, the agent rolls a trajectory and trains a policy before transitioning to episode $k+1$. In the context of the time-desynchronized environment, however, the agent at time $t_{k}$ allocates $\Delta t$ for trajectory generation and training, subsequently moves to the next episode at $t_{k+1}=t_{k}+\Delta t$. Despite a fixed total number of episodes ($K$), the agent accumulates different trajectories influenced by the choice of interaction times ($t_1,t_2,...,t_K$), significantly impacting the suboptimality gap of the policy. We propose a Proactively Synchronizing Tempo ($\texttt{ProST}$) framework that computes a suboptimal sequence {$t_1,t_2,...,t_K$} (= { $t_{1:K}$}) by minimizing an upper bound on its performance measure, i.e., the dynamic regret. Our main contribution is that we show that a suboptimal {$t_{1:K}$} trades-off between the policy training time (agent tempo) and how fast the environment changes (environment tempo). Theoretically, this work develops a suboptimal {$t_{1:K}$} as a function of the degree of the environment's non-stationarity while also achieving a sublinear dynamic regret. Our experimental evaluation on various high-dimensional non-stationary environments shows that the $\texttt{ProST}$ framework achieves a higher online return at suboptimal {$t_{1:K}$} than the existing methods.

Conformal Prediction for Uncertainty-Aware Planning with Diffusion Dynamics Model
Jiankai Sun Yiqi Jiang Jianing Qiu Parth Talpur Nobel Mykel Kochenderfer Mac Schwager



Research question: How to quantify the uncertainty of diffusion models used for robot task planning.
Motivation: Robotic applications in uncertain, dynamic, and partially observable environments require quantifying the uncertainty of trajectory prediction models.
Method: Use Conformal Prediction (CP) to quantify the uncertainty of diffusion dynamics models: modify the training loss to encourage more robust performance, then calibrate with a CP procedure at test time to obtain trajectory-prediction coverage sets with a guaranteed coverage level.
Results: Experiments show that the method reduces the uncertainty of the learned trajectory prediction model and outperforms prior algorithms on existing offline RL benchmarks and challenging continuous planning tasks.

Robotic applications often involve working in environments that are uncertain, dynamic, and partially observable. Recently, diffusion models have been proposed for learning trajectory prediction models trained from expert demonstrations, which can be used for planning in robot tasks. Such models have demonstrated a strong ability to overcome challenges such as multi-modal action distributions, high-dimensional output spaces, and training instability. It is crucial to quantify the uncertainty of these dynamics models when using them for planning. In this paper, we quantify the uncertainty of diffusion dynamics models using Conformal Prediction (CP). Given a finite number of exchangeable expert trajectory examples (called the “calibration set”), we use CP to obtain a set in the trajectory space (called the “coverage region”) that is guaranteed to contain the output of the diffusion model with a user-defined probability (called the “coverage level”). In PlanCP, inspired by concepts from conformal prediction, we modify the loss function for training the diffusion model to include a quantile term to encourage more robust performance across the variety of training examples. At test time, we then calibrate PlanCP with a conformal prediction process to obtain coverage sets for the trajectory prediction with guaranteed coverage level. We evaluate our algorithm on various planning tasks and model-based offline reinforcement learning tasks and show that it reduces the uncertainty of the learned trajectory prediction model. As a by-product, our algorithm PlanCP outperforms prior algorithms on existing offline RL benchmarks and challenging continuous planning tasks. Our method can be combined with most model-based planning approaches to produce uncertainty estimates of the closed-loop system.
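The calibration step is standard split conformal prediction, which can be sketched independently of the diffusion model: compute a nonconformity score for each calibration trajectory, then take a finite-sample-corrected quantile as the coverage-region radius. The score function used here (mean Euclidean error to the expert trajectory) and the dummy predictor are illustrative choices, not PlanCP's exact components.

```python
import numpy as np

def calibrate(model_predict, calib_inputs, calib_trajs, alpha=0.1):
    # Nonconformity score: mean Euclidean error between the predicted and the
    # held-out expert trajectory (an illustrative choice of score).
    scores = np.array([
        np.linalg.norm(model_predict(x) - y, axis=-1).mean()
        for x, y in zip(calib_inputs, calib_trajs)])
    n = len(scores)
    # Finite-sample-corrected quantile level from split conformal prediction.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

# Usage with a stand-in predictor on synthetic trajectories of shape (T, 2).
rng = np.random.default_rng(0)
inputs = rng.normal(size=(200, 2))
trajs = np.cumsum(rng.normal(size=(200, 20, 2)), axis=1)
predict = lambda x: np.zeros((20, 2)) + x  # hypothetical trajectory predictor
radius = calibrate(predict, inputs, trajs)
print("coverage-region radius at 90% coverage:", radius)
```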

Reward Finetuning for Faster and More Accurate Unsupervised Object Discovery
Katie Z Luo Zhenzhen Liu Xiangyu Chen Yurong You Sagie Benaim Cheng Perng Phoo Mark Campbell Wen Sun Bharath Hariharan Kilian Q Weinberger



Research question: How to use reinforcement learning from human feedback to improve machine learning methods for autonomous vehicles so that they better match human expectations.
Motivation: Although RLHF has been successful for large language models, it has not had a comparable impact in autonomous vehicle research, where alignment with human expectations can be imperative.
Method: Adapt similar RL-based methods to unsupervised object discovery, i.e., learning to detect objects from LiDAR points without any training labels. Simple heuristics mimic human feedback and are combined into a reward function whose score correlates positively with bounding-box accuracy.
Results: Experiments show that the method is not only more accurate but also orders of magnitude faster to train than prior work on object discovery.

Recent advances in machine learning have shown that Reinforcement Learning from Human Feedback (RLHF) can improve machine learning models and align them with human preferences. Although very successful for Large Language Models (LLMs), these advancements have not had a comparable impact in research for autonomous vehicles—where alignment with human expectations can be imperative. In this paper, we propose to adapt similar RL-based methods to unsupervised object discovery, i.e. learning to detect objects from LiDAR points without any training labels. Instead of labels, we use simple heuristics to mimic human feedback. More explicitly, we combine multiple heuristics into a simple reward function that positively correlates its score with bounding box accuracy, i.e., boxes containing objects are scored higher than those without. We start from the detector’s own predictions to explore the space and reinforce boxes with high rewards through gradient updates. Empirically, we demonstrate that our approach is not only more accurate, but also orders of magnitudes faster to train compared to prior works on object discovery. Code is available at https://github.com/katieluo88/DRIFT.

$\texttt{TACO}$: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning
Ruijie Zheng Xiyao Wang Yanchao Sun Shuang Ma Jieyu Zhao Huazhe Xu Hal Daumé III Furong Huang



Research question: Despite progress in reinforcement learning, sample inefficiency remains a major obstacle.
Motivation: Existing attempts address this by creating self-supervised auxiliary tasks, but these objectives are often insufficient to learn representations of the optimal policy or value function, and they typically target tasks with small, abstract, discrete action spaces, overlooking the importance of action representation learning in continuous control.
Method: This paper introduces TACO, a simple yet powerful temporal contrastive learning method that jointly learns state and action representations by optimizing the mutual information between representations of the current state paired with an action sequence and representations of the corresponding future state.
Results: Theoretically, TACO learns state and action representations that contain sufficient control information, improving sample efficiency. For online RL, TACO achieves a 40% average performance boost after one million environment interaction steps across nine challenging visual continuous control tasks from the DeepMind Control Suite. TACO can also serve as a plug-and-play module for existing offline visual RL methods, establishing new state-of-the-art offline visual RL performance across offline datasets of varying quality.

Despite recent progress in reinforcement learning (RL) from raw pixel data, sample inefficiency continues to present a substantial obstacle. Prior works have attempted to address this challenge by creating self-supervised auxiliary tasks, aiming to enrich the agent's learned representations with control-relevant information for future state prediction. However, these objectives are often insufficient to learn representations that can represent the optimal policy or value function, and they often consider tasks with small, abstract discrete action spaces and thus overlook the importance of action representation learning in continuous control. In this paper, we introduce $\texttt{TACO}$: $\textbf{T}$emporal $\textbf{A}$ction-driven $\textbf{CO}$ntrastive Learning, a simple yet powerful temporal contrastive learning approach that facilitates the concurrent acquisition of latent state and action representations for agents. $\texttt{TACO}$ simultaneously learns a state and an action representation by optimizing the mutual information between representations of current states paired with action sequences and representations of the corresponding future states. Theoretically, $\texttt{TACO}$ can be shown to learn state and action representations that encompass sufficient information for control, thereby improving sample efficiency. For online RL, $\texttt{TACO}$ achieves 40% performance boost after one million environment interaction steps on average across nine challenging visual continuous control tasks from Deepmind Control Suite. In addition, we show that $\texttt{TACO}$ can also serve as a plug-and-play module adding to existing offline visual RL methods to establish the new state-of-the-art performance for offline visual RL across offline datasets with varying quality.
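The mutual-information objective can be sketched as an InfoNCE loss: treat the representation of (state, action sequence) and the representation of the corresponding future state as a positive pair, and all other pairings in the batch as negatives. The random linear encoders and the temperature below are placeholder assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d = 32, 16, 8
W_sa = rng.normal(size=(d_in, d))  # placeholder (state, action-seq) encoder
W_s = rng.normal(size=(d_in, d))   # placeholder future-state encoder

def info_nce(state_action, future_state, temperature=0.1):
    z = state_action @ W_sa            # (B, d) anchor embeddings
    z_pos = future_state @ W_s         # (B, d) positive embeddings
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / temperature  # (B, B) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # positives lie on the diagonal

loss = info_nce(rng.normal(size=(B, d_in)), rng.normal(size=(B, d_in)))
print("InfoNCE loss:", loss)
```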

On the Importance of Exploration for Generalization in Reinforcement Learning
Yiding Jiang J Zico Kolter Roberta Raileanu



Research question: This paper addresses the fact that existing approaches for improving generalization in deep reinforcement learning focus mainly on representation learning, neglecting RL-specific aspects such as exploration.
Motivation: The authors hypothesize that the agent's exploration strategy plays a key role in its ability to generalize to new environments.
Method: Motivated by a series of experiments in tabular contextual Markov decision processes (MDPs), they propose EDE, a method that encourages exploration of states with high epistemic uncertainty via an ensemble of Q-value distributions.
Results: The proposed algorithm is the first value-based method to achieve strong performance on Procgen and Crafter, two generalization benchmarks for RL with high-dimensional observations.

Existing approaches for improving generalization in deep reinforcement learning (RL) have mostly focused on representation learning, neglecting RL-specific aspects such as exploration. We hypothesize that the agent's exploration strategy plays a key role in its ability to generalize to new environments. Through a series of experiments in a tabular contextual MDP, we show that exploration is helpful not only for efficiently finding the optimal policy for the training environments but also for acquiring knowledge that helps decision making in unseen environments. Based on these observations, we propose EDE: Exploration via Distributional Ensemble, a method that encourages the exploration of states with high epistemic uncertainty through an ensemble of Q-value distributions. The proposed algorithm is the first value-based approach to achieve strong performance on both Procgen and Crafter, two benchmarks for generalization in RL with high-dimensional observations. The open-sourced implementation can be found at https://github.com/facebookresearch/ede.
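A minimal sketch of ensemble-based epistemic exploration: act greedily on the mean Q-value plus a bonus proportional to the disagreement across ensemble members. Using the ensemble standard deviation as the uncertainty estimate is an illustrative simplification of the distributional-ensemble machinery in EDE.

```python
import numpy as np

def uncertainty_bonus_action(q_ensemble, beta=1.0):
    # q_ensemble: (n_members, n_actions) Q-value estimates for one state.
    mean_q = q_ensemble.mean(axis=0)
    epistemic = q_ensemble.std(axis=0)  # member disagreement ~ epistemic uncertainty
    return int(np.argmax(mean_q + beta * epistemic))

# Usage: three ensemble members disagree about action 1, boosting its bonus.
q = np.array([[1.0, 0.5],
              [0.2, 0.9],
              [0.8, 0.7]])
print(uncertainty_bonus_action(q))
```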

Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability
Hanlin Zhu Amy Zhang



Research question: Although offline goal-conditioned reinforcement learning (GCRL) has demonstrated empirical success in many prior works, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn.
Motivation: This paper provides a rigorous theoretical analysis of an existing empirically successful offline GCRL algorithm to address this gap.
Method: With a slight modification exploiting the (semi-)strong convexity of the objective functions, the algorithm achieves an $\tilde{O}(\text{poly}(1/\epsilon))$ sample complexity (where $\epsilon$ is the desired suboptimality of the learned policy) under nearly minimal assumptions on the dataset (single-policy concentrability) and the function class (realizability). Moreover, the algorithm consists of two uninterleaved optimization steps, $V$-learning and policy learning, and is computationally stable because it involves no minimax optimization.
Results: Experiments in various real-world environments show that the modified algorithm outperforms the previous one. To our knowledge, this is the first algorithm that is provably efficient under general function approximation and single-policy concentrability, empirically successful, and free of minimax optimization.

Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that aim to reach diverse goals. In particular, offline GCRL only requires purely pre-collected datasets to perform training tasks without additional interactions with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we provide a rigorous theoretical analysis of an existing empirically successful offline GCRL algorithm. We prove that under slight modification, this algorithm enjoys an $\tilde{O}(\text{poly}(1/\epsilon))$ sample complexity (where $\epsilon$ is the desired suboptimality of the learned policy) with general function approximation thanks to the property of (semi-)strong convexity of the objective functions. We only require nearly minimal assumptions on the dataset (single-policy concentrability) and the function class (realizability). Moreover, this algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. We also empirically validate our theory by showing that the modified algorithm outperforms the previous algorithm in various real-world environments. To the best of our knowledge, this is the first algorithm that is both provably efficient with general function approximation and single-policy concentrability, and empirically successful without requiring solving minimax optimization problems.

Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning
James Queeney Mouhacine Benosman



Research question: How to make safe decisions in uncertain environments.
Motivation: Many real-world domains require safe decision making under uncertainty.
Method: Introduce a deep reinforcement learning framework that considers a distribution over transition models and takes a risk-averse view of model uncertainty through coherent distortion risk measures.
Results: In experiments on continuous control tasks with safety constraints, the framework produces robust performance and safety at deployment time across a range of perturbed test environments.

Many real-world domains require safe decision making in uncertain environments. In this work, we introduce a deep reinforcement learning framework for approaching this important problem. We consider a distribution over transition models, and apply a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures. We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems. Unlike existing approaches to robustness in deep reinforcement learning, however, our formulation does not involve minimax optimization. This leads to an efficient, model-free implementation of our approach that only requires standard data collection from a single training environment. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.
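CVaR is one member of the class of coherent distortion risk measures this framework admits; the dependency-free sketch below shows how such a measure turns the distribution of returns under sampled transition models into a risk-averse objective. The sampling setup is illustrative, not the paper's implementation.

```python
import numpy as np

def cvar(returns, alpha=0.2):
    # Conditional value-at-risk: expected value of the worst alpha-fraction
    # of returns, i.e., a risk-averse view of model uncertainty.
    sorted_returns = np.sort(returns)
    k = max(1, int(np.ceil(alpha * len(returns))))
    return sorted_returns[:k].mean()

# Usage: returns of one policy evaluated under 100 sampled transition models.
rng = np.random.default_rng(0)
model_returns = rng.normal(loc=10.0, scale=3.0, size=100)
print("mean return:", model_returns.mean())
print("CVaR(0.2) objective:", cvar(model_returns, alpha=0.2))
```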

Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning
Hanlin Zhu Paria Rashidinejad Jiantao Jiao



Research question: This paper proposes A-Crab, a new offline reinforcement learning (RL) algorithm for complex environments with insufficient data coverage.
Motivation: Existing methods struggle in complex environments when data coverage is insufficient, calling for a more effective algorithm.
Method: A-Crab combines the marginalized importance sampling framework with the actor-critic paradigm, where the critic returns evaluations of the actor (policy) that are pessimistic relative to the offline data and have a small average (importance-weighted) Bellman error.
Results: A-Crab achieves the optimal statistical rate of $1/\sqrt{N}$ in converging to the best policy covered by the offline dataset, and outperforms the data-collecting behavior policy over a wide range of hyperparameters.

We propose A-Crab (Actor-Critic Regularized by Average Bellman error), a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage. Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm, where the critic returns evaluations of the actor (policy) that are pessimistic relative to the offline data and have a small average (importance-weighted) Bellman error. Compared to existing methods, our algorithm simultaneously offers a number of advantages: (1) It achieves the optimal statistical rate of $1/\sqrt{N}$---where $N$ is the size of offline dataset---in converging to the best policy covered in the offline dataset, even when combined with general function approximators. (2) It relies on a weaker \textit{average} notion of policy coverage (compared to the $\ell_\infty$ single-policy concentrability) that exploits the structure of policy visitations. (3) It outperforms the data-collection behavior policy over a wide range of specific hyperparameters. We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm. The code is available at https://github.com/zhuhl98/ACrab.

Creating Multi-Level Skill Hierarchies in Reinforcement Learning
Joshua Benjamin Evans Özgür Şimşek



Research question: What is a useful skill hierarchy for an autonomous agent?
Motivation: Propose an answer based on a graphical representation of how the interaction between an agent and its environment may unfold, using modularity maximization as the central organizing principle.
Method: The skill hierarchy is generated automatically, including the skills themselves (their behaviour, when they can be called, and when they terminate) as well as the dependency structure between them.
Results: Across a wide range of environments, the approach generates skill hierarchies that are intuitively appealing and considerably improve the agent's learning performance.

What is a useful skill hierarchy for an autonomous agent? We propose an answer based on a graphical representation of how the interaction between an agent and its environment may unfold. Our approach uses modularity maximisation as a central organising principle to expose the structure of the interaction graph at multiple levels of abstraction. The result is a collection of skills that operate at varying time scales, organised into a hierarchy, where skills that operate over longer time scales are composed of skills that operate over shorter time scales. The entire skill hierarchy is generated automatically, with no human input, including the skills themselves (their behaviour, when they can be called, and when they terminate) as well as the dependency structure between them. In a wide range of environments, this approach generates skill hierarchies that are intuitively appealing and that considerably improve the learning performance of the agent.
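As a rough illustration of the organizing principle, the sketch below (assuming networkx is available) runs modularity-based community detection on a toy interaction graph and builds the quotient graph, whose edges suggest candidate higher-level "move between regions" skills. This is only an illustration of modularity maximization on an interaction graph, not the paper's full pipeline.

```python
import networkx as nx

# Toy interaction graph: states of a 6x6 gridworld, edges = transitions.
G = nx.grid_2d_graph(6, 6)

# Modularity maximization (Louvain) exposes densely connected regions;
# each region suggests one low-level navigation skill.
communities = nx.community.louvain_communities(G, seed=0)
print(f"{len(communities)} low-level regions (one skill per region)")

# Quotient graph: one node per region; its edges indicate which
# region-to-region transitions exist, i.e., candidate higher-level skills.
Q = nx.quotient_graph(G, [frozenset(c) for c in communities])
print(f"{Q.number_of_edges()} candidate higher-level skills between regions")
```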

Iterative Reachability Estimation for Safe Reinforcement Learning
Milan Ganai Zheng Gong Chenning Yu Sylvia Lee Herbert Sicun Gao



Research question: This paper addresses safety in practical reinforcement learning (RL): handling stochastic environments, providing rigorous guarantees of persistent state-wise safety, and avoiding overly conservative behaviors that sacrifice performance.
Motivation: For safety-constrained RL in general stochastic settings, we propose a new framework, Reachability Estimation for Safe Policy Optimization (RESPO).
Method: In the feasible set where violation-free policies exist, we optimize rewards while maintaining persistent safety. Outside this feasible set, the optimization produces the safest behavior by guaranteeing entry into the feasible set whenever possible with the least cumulative discounted violations.
Results: We introduce a class of algorithms using our novel reachability estimation function to optimize within the proposed framework and within similar frameworks, such as those that simultaneously handle multiple hard and soft constraints. We prove that our algorithms almost surely converge to locally optimal policies of our safe optimization framework. Evaluations on diverse safe RL environments from Safety Gym, PyBullet, and MuJoCo show improvements in both reward performance and safety over state-of-the-art baselines.

Ensuring safety is important for the practical deployment of reinforcement learning (RL). Various challenges must be addressed, such as handling stochasticity in the environments, providing rigorous guarantees of persistent state-wise safety satisfaction, and avoiding overly conservative behaviors that sacrifice performance. We propose a new framework, Reachability Estimation for Safe Policy Optimization (RESPO), for safety-constrained RL in general stochastic settings. In the feasible set where there exist violation-free policies, we optimize for rewards while maintaining persistent safety. Outside this feasible set, our optimization produces the safest behavior by guaranteeing entrance into the feasible set whenever possible with the least cumulative discounted violations. We introduce a class of algorithms using our novel reachability estimation function to optimize in our proposed framework and in similar frameworks such as those concurrently handling multiple hard and soft constraints. We theoretically establish that our algorithms almost surely converge to locally optimal policies of our safe optimization framework. We evaluate the proposed methods on a diverse suite of safe RL environments from Safety Gym, PyBullet, and MuJoCo, and show the benefits in improving both reward performance and safety compared with state-of-the-art baselines.

Imitation Learning from Vague Feedback
Xin-Qiang Cai Yu-Jie Zhang Chao-Kai Chiang Masashi Sugiyama



Research question: How to perform imitation learning from human feedback when clear pairwise comparisons cannot be provided.
Motivation: Traditional imitation learning requires perfect expert data, which is expensive or even impossible to obtain in many real applications.
Method: Model the underlying demonstration pool as a mixture of expert and non-expert data; when the proportion $\alpha$ of expert data is known, the expert policy distribution can be recovered. For the unknown-$\alpha$ case, a mixture proportion estimation method is proposed. The recovered expert policy distribution is then integrated with generative adversarial imitation learning to form an end-to-end algorithm.
Results: Experiments show that the methods outperform standard and preference-based imitation learning methods on various tasks.

Imitation learning from human feedback studies how to train well-performed imitation agents with an annotator's relative comparison of two demonstrations (one demonstration is better/worse than the other), which is usually easier to collect than the perfect expert data required by traditional imitation learning. However, in many real-world applications, it is still expensive or even impossible to provide a clear pairwise comparison between two demonstrations with similar quality. This motivates us to study the problem of imitation learning with vague feedback, where the data annotator can only distinguish the paired demonstrations correctly when their quality differs significantly, i.e., one from the expert and another from the non-expert. By modeling the underlying demonstration pool as a mixture of expert and non-expert data, we show that the expert policy distribution can be recovered when the proportion $\alpha$ of expert data is known. We also propose a mixture proportion estimation method for the unknown $\alpha$ case. Then, we integrate the recovered expert policy distribution with generative adversarial imitation learning to form an end-to-end algorithm. Experiments show that our methods outperform standard and preference-based imitation learning methods on various tasks.

Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design
Matthew Thomas Jackson Minqi Jiang Jack Parker-Holder Risto Vuorio Chris Lu Gregory Farquhar Shimon Whiteson Jakob Nicolaus Foerster



Research question: How to improve the generalization of learned deep reinforcement learning algorithms to unseen environments.
Motivation: Although existing meta-learning approaches such as Learned Policy Gradient (LPG) achieve impressive initial results, a generalization gap remains when they are applied to unseen environments.
Method: Automatically generate curricula that maximize the regret of the meta-learned optimizer, together with a new regret approximation called algorithmic regret (AR); the resulting method is General RL Optimizers Obtained Via Environment Design (GROOVE).
Results: Experiments show that GROOVE achieves superior generalization to LPG, with AR identified as a critical component of environment design in this setting. This approach is a step towards discovering truly general RL algorithms capable of solving a wide range of real-world environments.

The past decade has seen vast progress in deep reinforcement learning (RL) on the back of algorithms manually designed by human researchers. Recently, it has been shown that it is possible to meta-learn update rules, with the hope of discovering algorithms that can perform well on a wide range of RL tasks. Despite impressive initial results from algorithms such as Learned Policy Gradient (LPG), there remains a generalization gap when these algorithms are applied to unseen environments. In this work, we examine how characteristics of the meta-training distribution impact the generalization performance of these algorithms. Motivated by this analysis and building on ideas from Unsupervised Environment Design (UED), we propose a novel approach for automatically generating curricula to maximize the regret of a meta-learned optimizer, in addition to a novel approximation of regret, which we name algorithmic regret (AR). The result is our method, General RL Optimizers Obtained Via Environment Design (GROOVE). In a series of experiments, we show that GROOVE achieves superior generalization to LPG, and evaluate AR against baseline metrics from UED, identifying it as a critical component of environment design in this setting. We believe this approach is a step towards the discovery of truly general RL algorithms, capable of solving a wide range of real-world environments.

Adjustable Robust Reinforcement Learning for Online 3D Bin Packing
Yuxin Pan Yize Chen Fangzhen Lin



Research question: Designing effective policies for the online 3D bin packing problem (3D-BPP) is a long-standing challenge due to the unpredictability of incoming box sequences and strict physical constraints.
Motivation: Although current deep reinforcement learning (DRL) methods excel at optimizing average performance, they often fail in real-world settings where worst-case scenarios can materialize.
Method: We first introduce a permutation-based attacker to study the practical robustness of DRL and heuristic methods for online 3D-BPP. We then propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights to balance performance in average and worst-case environments.
Results: Experiments show that AR2L is versatile: it improves policy robustness while maintaining acceptable performance in the nominal case.

Designing effective policies for the online 3D bin packing problem (3D-BPP) has been a long-standing challenge, primarily due to the unpredictable nature of incoming box sequences and stringent physical constraints. While current deep reinforcement learning (DRL) methods for online 3D-BPP have shown promising results in optimizing average performance over an underlying box sequence distribution, they often fail in real-world settings where some worst-case scenarios can materialize. Standard robust DRL algorithms tend to overly prioritize optimizing the worst-case performance at the expense of performance under normal problem instance distribution. To address these issues, we first introduce a permutation-based attacker to investigate the practical robustness of both DRL-based and heuristic methods proposed for solving online 3D-BPP. Then, we propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights to achieve the desired balance of the policy's performance in average and worst-case environments. Specifically, we formulate the objective function as a weighted sum of expected and worst-case returns, and derive the lower performance bound by relating to the return under a mixture dynamics. To realize this lower bound, we adopt an iterative procedure that searches for the associated mixture dynamics and improves the corresponding policy. We integrate this procedure into two popular robust adversarial algorithms to develop the exact and approximate AR2L algorithms. Experiments demonstrate that AR2L is versatile in the sense that it improves policy robustness while maintaining an acceptable level of performance for the nominal case.

Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation
David Brandfonbrener Ofir Nachum Joan Bruna



Research question: This paper evaluates how pretraining should be done in imitation learning, where both pretraining and finetuning data are trajectories collected by experts interacting with an unknown environment.
Motivation: The pretraining corpus consists of multitask demonstrations whose task is set by an unobserved latent context variable; the goal is to learn a low-dimensional representation of the high-dimensional (e.g., visual) observation space that transfers to a novel context for finetuning on a limited set of demonstrations.
Method: Among the possible pretraining objectives, the paper argues that inverse dynamics modeling, i.e., predicting an action given the observations appearing before and after it in the demonstration, is well suited to this setting, and derives a novel analysis using a simple but general environment model.
Results: Empirical evaluations on a variety of simulated visuomotor manipulation problems support this claim, while previous theoretical explanations are found insufficient to explain the empirical advantages observed in these settings.

In recent years, domains such as natural language processing and image recognition have popularized the paradigm of using large datasets to pretrain representations that can be effectively transferred to downstream tasks. In this work we evaluate how such a paradigm should be done in imitation learning, where both pretraining and finetuning data are trajectories collected by experts interacting with an unknown environment. Namely, we consider a setting where the pretraining corpus consists of multitask demonstrations and the task for each demonstration is set by an unobserved latent context variable. The goal is to use the pretraining corpus to learn a low dimensional representation of the high dimensional (e.g., visual) observation space which can be transferred to a novel context for finetuning on a limited dataset of demonstrations. Among a variety of possible pretraining objectives, we argue that inverse dynamics modeling -- i.e., predicting an action given the observations appearing before and after it in the demonstration -- is well-suited to this setting. We provide empirical evidence of this claim through evaluations on a variety of simulated visuomotor manipulation problems. While previous work has attempted various theoretical explanations regarding the benefit of inverse dynamics modeling, we find that these arguments are insufficient to explain the empirical advantages often observed in our settings, and so we derive a novel analysis using a simple but general environment model.
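The inverse dynamics objective itself is simple to state: regress the action from the observations surrounding it. The sketch below uses a linear least-squares version on synthetic data to keep it dependency-free; the paper, by contrast, trains deep encoders on visual observations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_obs, d_act = 1000, 20, 4
obs_t = rng.normal(size=(N, d_obs))
actions = rng.normal(size=(N, d_act))
# Toy linear dynamics: the next observation reflects the action taken.
obs_next = obs_t + 0.5 * (actions @ rng.normal(size=(d_act, d_obs)))

# Inverse dynamics pretraining: predict a_t from (o_t, o_{t+1}).
X = np.concatenate([obs_t, obs_next], axis=1)
W, *_ = np.linalg.lstsq(X, actions, rcond=None)
pred = X @ W
print("action prediction MSE:", np.mean((pred - actions) ** 2))
```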

Beyond Average Return in Markov Decision Processes
Alexandre Marthe Aurélien Garivier Claire Vernade



Research question: Which functionals of the reward can be computed and optimized exactly in Markov decision processes?
Motivation: In the finite-horizon, undiscounted setting, dynamic programming (DP) can handle these operations efficiently only for certain classes of statistics. We characterize these classes for policy evaluation and give a new answer for the planning problem; interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of distributional reinforcement learning (DistRL).
Method: DistRL does, however, permit approximate evaluation of other functionals. We provide error bounds on the resulting estimators and discuss the potential and limitations of this approach.
Results: These results advance the theory of Markov decision processes by examining overall characteristics of the return, particularly risk-conscious strategies.

What are the functionals of the reward that can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation, and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL). DistRL permits, however, to evaluate other functionals approximately. We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations. These results contribute to advancing the theory of Markov Decision Processes by examining overall characteristics of the return, and particularly risk-conscious strategies.
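For concreteness, assuming the standard quasi-arithmetic definition, a generalized mean of the random return $Z$ under policy $\pi$ is

$$M_f^{\pi} = f^{-1}\bigl(\mathbb{E}_{\pi}[f(Z)]\bigr),$$

where $f$ is a continuous, strictly monotone function. Taking $f(z)=z$ recovers the usual expected return, while $f(z)=e^{\beta z}$ yields the entropic (exponential-utility) risk objective, a standard example of a risk-conscious criterion.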

Higher-Order Uncoupled Dynamics Do Not Lead to Nash Equilibrium - Except When They Do
Sarah Asad Toonsi Jeff S Shamma



Research question: In multi-agent learning, how an agent's strategies evolve in response to the evolving strategies of other agents.
Motivation: Explore whether strategies converge to well-known solution concepts such as Nash equilibrium (NE).
Method: Introduce higher-order learning dynamics with auxiliary states that can capture phenomena such as path dependencies. The dynamics are "payoff based" and "uncoupled": each agent's dynamics depend on its own evolving payoff, with no explicit dependence on the utilities of other agents.
Results: For any specific game with an isolated completely mixed-strategy NE, there exist higher-order gradient play dynamics that lead (locally) to that NE. Conversely, for any higher-order gradient play dynamics, there exists a game with a unique isolated completely mixed-strategy NE for which the dynamics do not lead to NE. Finally, in coordination games, convergence to the mixed-strategy equilibrium comes at the cost of the dynamics being inherently internally unstable.

The framework of multi-agent learning explores the dynamics of how an agent's strategies evolve in response to the evolving strategies of other agents. Of particular interest is whether or not agent strategies converge to well known solution concepts such as Nash Equilibrium (NE). In "higher order'' learning, agent dynamics include auxiliary states that can capture phenomena such as path dependencies. We introduce higher-order gradient play dynamics that resemble projected gradient ascent with auxiliary states. The dynamics are "payoff based'' and "uncoupled'' in that each agent's dynamics depend on its own evolving payoff and has no explicit dependence on the utilities of other agents. We first show that for any specific game with an isolated completely mixed-strategy NE, there exist higher-order gradient play dynamics that lead (locally) to that NE, both for the specific game and nearby games with perturbed utility functions. Conversely, we show that for any higher-order gradient play dynamics, there exists a game with a unique isolated completely mixed-strategy NE for which the dynamics do not lead to NE. Finally, we show that convergence to the mixed-strategy equilibrium in coordination games, comes at the expense of the dynamics being inherently internally unstable.

Optimistic Active Exploration of Dynamical Systems
Bhavya Sukhija Lenart Treven Cansu Sancaktar Sebastian Blaes Stelian Coros Andreas Krause



Research question: How to explore an unknown dynamical system so that the estimated model can solve multiple downstream tasks in a zero-shot manner.
Motivation: Existing reinforcement learning algorithms typically optimize a policy for one specific task, leaving the exploration of unknown dynamical systems for multiple tasks an open challenge.
Method: This paper proposes OPAX, an active exploration algorithm that uses well-calibrated probabilistic models to quantify the epistemic uncertainty about the unknown dynamics and optimistically maximizes the information gain between the unknown dynamics and state observations.
Results: Experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.

Reinforcement learning algorithms commonly seek to optimize policies for solving one particular task. How should we explore an unknown dynamical system such that the estimated model allows us to solve multiple downstream tasks in a zero-shot manner? In this paper, we address this challenge by developing an algorithm -- OPAX -- for active exploration. OPAX uses well-calibrated probabilistic models to quantify the epistemic uncertainty about the unknown dynamics. It optimistically---w.r.t. plausible dynamics---maximizes the information gain between the unknown dynamics and state observations. We show how the resulting optimization problem can be reduced to an optimal control problem that can be solved at each episode using standard approaches. We analyze our algorithm for general models, and, in the case of Gaussian process dynamics, we give a sample complexity bound and show that the epistemic uncertainty converges to zero. In our experiments, we compare OPAX with other heuristic active exploration approaches on several environments. Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.

Recurrent Hypernetworks are Surprisingly Strong in Meta-RL
Jacob Beck Risto Vuorio Zheng Xiong Shimon Whiteson



Research question: Deep reinforcement learning (RL) is hard to deploy due to sample inefficiency; meta-RL addresses this by learning to perform few-shot learning.
Motivation: Although many specialized meta-RL methods have been proposed, recent work suggests that end-to-end learning combined with an off-the-shelf sequential model, such as a recurrent network, is a surprisingly strong baseline. These claims have been controversial due to limited supporting evidence.
Method: This paper conducts an empirical investigation and finds that recurrent networks can indeed achieve strong performance, but that hypernetworks are crucial to maximizing their potential.
Results: Surprisingly, when combined with hypernetworks, recurrent baselines that are far simpler than existing specialized methods achieve the strongest performance of all methods evaluated.

Deep reinforcement learning (RL) is notoriously impractical to deploy due to sample inefficiency. Meta-RL directly addresses this sample inefficiency by learning to perform few-shot learning when a distribution of related tasks is available for meta-training. While many specialized meta-RL methods have been proposed, recent work suggests that end-to-end learning in conjunction with an off-the-shelf sequential model, such as a recurrent network, is a surprisingly strong baseline. However, such claims have been controversial due to limited supporting evidence, particularly in the face of prior work establishing precisely the opposite. In this paper, we conduct an empirical investigation. While we likewise find that a recurrent network can achieve strong performance, we demonstrate that the use of hypernetworks is crucial to maximizing their potential. Surprisingly, when combined with hypernetworks, the recurrent baselines that are far simpler than existing specialized methods actually achieve the strongest performance of all methods evaluated.
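Structurally, a recurrent hypernetwork uses the recurrent state, which summarizes the task from experience, to generate the weights of the policy head rather than feeding it in as an extra input. The sketch below shows a single forward step with a plain RNN cell and random weights; all shapes and the cell choice are illustrative simplifications, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_h, d_act = 8, 16, 3
W_in = rng.normal(size=(d_obs, d_h)) * 0.1
W_h = rng.normal(size=(d_h, d_h)) * 0.1
# Hypernetwork: hidden state -> flattened weights and bias of the policy head.
W_hyper = rng.normal(size=(d_h, d_obs * d_act + d_act)) * 0.1

def step(h, obs):
    h = np.tanh(obs @ W_in + h @ W_h)   # recurrent task inference
    params = h @ W_hyper                # generate the policy-head parameters
    W_pi = params[: d_obs * d_act].reshape(d_obs, d_act)
    b_pi = params[d_obs * d_act:]
    logits = obs @ W_pi + b_pi          # task-conditioned policy head
    return h, logits

h = np.zeros(d_h)
for t in range(5):
    h, logits = step(h, rng.normal(size=d_obs))
print("action logits:", logits)
```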

Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning
Yihang Yao Zuxin Liu Zhepeng Cen Jiacheng Zhu Wenhao Yu Tingnan Zhang Ding Zhao



Research question: How to train versatile safe reinforcement learning policies that adapt to varying safety constraint requirements without retraining.
Motivation: Existing safe RL methods guarantee safety but give little consideration to versatility and adaptability.
Method: Propose the Conditioned Constrained Policy Optimization (CCPO) framework with two key modules: Versatile Value Estimation (VVE) for approximating value functions under unseen threshold conditions, and Conditioned Variational Inference (CVI) for encoding arbitrary constraint thresholds during policy optimization.
Results: Experiments show that CCPO outperforms baselines in safety and task performance while preserving data-efficient, zero-shot adaptation to different constraint thresholds, making it suitable for real-world dynamic applications.

Safe reinforcement learning (RL) focuses on training reward-maximizing agents subject to pre-defined safety constraints. Yet, learning versatile safe policies that can adapt to varying safety constraint requirements during deployment without retraining remains a largely unexplored and challenging area. In this work, we formulate the versatile safe RL problem and consider two primary requirements: training efficiency and zero-shot adaptation capability. To address them, we introduce the Conditioned Constrained Policy Optimization (CCPO) framework, consisting of two key modules: (1) Versatile Value Estimation (VVE) for approximating value functions under unseen threshold conditions, and (2) Conditioned Variational Inference (CVI) for encoding arbitrary constraint thresholds during policy optimization. Our extensive experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance while preserving zero-shot adaptation capabilities to different constraint thresholds data-efficiently. This makes our approach suitable for real-world dynamic applications.

Probabilistic inverse optimal control for non-linear partially observable systems disentangles perceptual uncertainty and behavioral costs
Dominik Straub Matthias Schultheis Heinz Koeppl Constantin A. Rothkopf



Research question: This paper addresses inverse optimal control for partially observable stochastic non-linear systems, particularly when the action signals are unobserved.
Motivation: Most existing work is limited to fully observable or linear systems, or requires the action signals to be known; we therefore propose a probabilistic approach for partially observable stochastic non-linear systems with unobserved action signals.
Method: By combining local linearization techniques with an explicit model of the noise characteristics, we derive an approximate likelihood function for the model parameters that can be computed within a single forward pass.
Results: We present quantitative evaluations on stochastic, partially observable versions of two classic control tasks and two human behavioral tasks. Although epistemic and pragmatic actions are intertwined in sequential decision-making under uncertainty, the method can disentangle perceptual factors from behavioral costs. The approach has broad applicability, from imitation learning to sensorimotor neuroscience.

Inverse optimal control can be used to characterize behavior in sequential decision-making tasks. Most existing work, however, is limited to fully observable or linear systems, or requires the action signals to be known. Here, we introduce a probabilistic approach to inverse optimal control for partially observable stochastic non-linear systems with unobserved action signals, which unifies previous approaches to inverse optimal control with maximum causal entropy formulations. Using an explicit model of the noise characteristics of the sensory and motor systems of the agent in conjunction with local linearization techniques, we derive an approximate likelihood function for the model parameters, which can be computed within a single forward pass. We present quantitative evaluations on stochastic and partially observable versions of two classic control tasks and two human behavioral tasks. Importantly, we show that our method can disentangle perceptual factors and behavioral costs despite the fact that epistemic and pragmatic actions are intertwined in sequential decision-making under uncertainty, such as in active sensing and active learning. The proposed method has broad applicability, ranging from imitation learning to sensorimotor neuroscience.

Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach
Yudi Zhang Yali Du Biwei Huang Ziyan Wang Jun Wang Meng Fang Mykola Pechenizkiy



Research question: In reinforcement learning, how to determine which state-action pairs are responsible for delayed future rewards.
Motivation: Most current approaches construct reward redistribution in an uninterpretable manner; we propose to explicitly model the contributions of state and action from a causal perspective, yielding an interpretable reward redistribution while preserving policy invariance.
Method: Propose Generative Return Decomposition (GRD), a framework for policy optimization in delayed-reward scenarios. GRD first identifies the unobservable Markovian rewards and the causal relations in the generative process, then uses the identified causal generative model to form a compact representation and trains the policy over the most favorable subspace of the agent's state space.
Results: We prove that the unobservable Markovian reward function is identifiable, as are the underlying causal structure and causal models. Experimental results outperform state-of-the-art methods, and visualizations further demonstrate the interpretability of the method.

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models. Experimental results show that our method outperforms state-of-the-art methods and the provided visualization further demonstrates the interpretability of our method. The project page is located at [https://reedzyd.github.io/GenerativeReturnDecomposition/](https://reedzyd.github.io/GenerativeReturnDecomposition/).

Multi-task Graph Neural Architecture Search with Task-aware Collaboration and Curriculum
Yijian Qin Xin Wang Ziwei Zhang Hong Chen Wenwu Zhu



Research question: This paper tackles multi-task graph neural architecture search: simultaneously discovering optimal architectures for different tasks and learning the collaborative relationships among them.
Motivation: Multi-task graph neural architecture search remains largely unexplored, and capturing the complex relations and influences among different tasks is highly challenging.
Method: We propose a novel multi-task graph neural architecture search method with task-aware collaboration and curriculum (MTGC3). It manages multiple architectures in a unified framework and learns transferability relationships between tasks via our soft task-collaborative module. We further develop a task-wise curriculum training strategy that improves the architecture search by reweighing the influence of different tasks according to task difficulty.
Results: Experiments show that MTGC3 outperforms several baselines in multi-task scenarios, demonstrating its ability to discover effective architectures and capture the collaborative relationships across tasks.

Graph neural architecture search (GraphNAS) has shown great potential for automatically designing graph neural architectures for graph related tasks. However, multi-task GraphNAS capable of handling multiple tasks simultaneously has been largely unexplored in literature, posing great challenges to capture the complex relations and influences among different tasks. To tackle this problem, we propose a novel multi-task graph neural architecture search with task-aware collaboration and curriculum (MTGC3), which is able to simultaneously discover optimal architectures for different tasks and learn the collaborative relationships among different tasks in a joint manner. Specifically, we design the layer-wise disentangled supernet capable of managing multiple architectures in a unified framework, which combines with our proposed soft task-collaborative module to learn the transferability relationships between tasks. We further develop the task-wise curriculum training strategy to improve the architecture search procedure via reweighing the influence of different tasks based on task difficulties. Extensive experiments show that our proposed MTGC3 model achieves state-of-the-art performance against several baselines in multi-task scenarios, demonstrating its ability to discover effective architectures and capture the collaborative relationships for multiple tasks.

Explore to Generalize in Zero-Shot RL
Ev Zisselman Itai Lavie Daniel Soudry Aviv Tamar



Research question: This paper studies zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks so that it performs well on similar but unseen tasks.
Motivation: Previous work mitigates overfitting by exploring different notions of invariance to the task, but on problems such as the ProcGen Maze no adequate invariant solution exists, so invariance-based approaches fail.
Method: The authors argue that a policy that effectively explores the domain is harder to memorize than a policy that maximizes reward for a specific task, so such learned behavior is expected to generalize well; this is demonstrated empirically on several domains that are difficult for invariance-based approaches. Their Explore to Generalize algorithm (ExpGen) builds on this insight by additionally training an ensemble of reward-optimizing agents. At test time, either the ensemble agrees on an action, and generalization is good, or the agent takes exploratory actions, which generalize well and drive it to a novel part of the state space where the ensemble may agree again.
Results: The method achieves state-of-the-art results on ProcGen tasks that have so far eluded effective generalization, with a success rate of 83% on the Maze task and 74% on Heist with 200 training levels. ExpGen can also be combined with an invariance-based approach to get the best of both worlds, setting new state-of-the-art results on ProcGen.

We study zero-shot generalization in reinforcement learning - optimizing a policy on a set of training tasks to perform well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as the ProcGen Maze, an adequate solution that is invariant to the task visualization does not exist, and therefore invariance-based approaches fail. Our insight is that learning a policy that effectively $\textit{explores}$ the domain is harder to memorize than a policy that maximizes reward for a specific task, and therefore we expect such learned behavior to generalize well; we indeed demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our $\textit{Explore to Generalize}$ algorithm (ExpGen) builds on this insight: we train an additional ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action, and we generalize well, or we take exploratory actions, which generalize well and drive us to a novel part of the state space, where the ensemble may potentially agree again. We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of 83% on the Maze task and 74% on Heist with $200$ training levels. ExpGen can also be combined with an invariance based approach to gain the best of both worlds, setting new state-of-the-art results on ProcGen. Code available at [https://github.com/EvZissel/expgen](https://github.com/EvZissel/expgen).
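The test-time rule lends itself to a very small sketch: if the ensemble of reward-trained policies is unanimous, take its action; otherwise fall back to the exploration policy's action. The exact agreement test (unanimity here) is an illustrative choice.

```python
import numpy as np

def ensemble_or_explore(ensemble_actions, explore_action):
    # ensemble_actions: actions proposed by each reward-trained ensemble member.
    values, counts = np.unique(ensemble_actions, return_counts=True)
    if counts.max() == len(ensemble_actions):   # unanimous ensemble: exploit
        return int(values[np.argmax(counts)])
    return int(explore_action)                  # disagreement: explore instead

print(ensemble_or_explore(np.array([2, 2, 2]), explore_action=0))  # -> 2
print(ensemble_or_explore(np.array([2, 1, 2]), explore_action=0))  # -> 0
```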

Hierarchical Multi-Agent Skill Discovery
Mingyu Yang Yaodong Yang Zhenbo Lu Wengang Zhou Houqiang Li



Research question: How to effectively apply unsupervised skill learning to multi-agent reinforcement learning (MARL).
Motivation: Unsupervised skill learning has seen limited use in MARL; the main challenges are learning and coordinating both individual and team skills.
Method: Propose Hierarchical Multi-Agent Skill Discovery (HMASD), in which a high-level policy performs skill assignment and a low-level policy learns to discover valuable team and individual skills.
Results: On sparse-reward multi-agent benchmarks, HMASD achieves significant performance improvements over strong MARL baselines.

Skill discovery has shown significant progress in unsupervised reinforcement learning. This approach enables the discovery of a wide range of skills without any extrinsic reward, which can be effectively combined to tackle complex tasks. However, such unsupervised skill learning has not been well applied to multi-agent reinforcement learning (MARL) due to two primary challenges. One is how to learn skills not only for the individual agents but also for the entire team, and the other is how to coordinate the skills of different agents to accomplish multi-agent tasks. To address these challenges, we present Hierarchical Multi-Agent Skill Discovery (HMASD), a two-level hierarchical algorithm for discovering both team and individual skills in MARL. The high-level policy employs a transformer structure to realize sequential skill assignment, while the low-level policy learns to discover valuable team and individual skills. We evaluate HMASD on sparse reward multi-agent benchmarks, and the results show that HMASD achieves significant performance improvements compared to strong MARL baselines.

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
Jose Blanchet Miao Lu Tong Zhang Han Zhong



Research question: This paper studies distributionally robust offline reinforcement learning, seeking an optimal robust policy purely from an offline dataset.
Motivation: Robust offline RL must overcome the distribution shift caused both by the mismatch between the behavior policy and the family of target policies and by the perturbation of the nominal model; a general, provably efficient algorithmic principle for this setting has been lacking.
Method: Propose a generic algorithm framework, Doubly Pessimistic Model-based Policy Optimization ($\texttt{P}^2\texttt{MPO}$), which combines a flexible model-estimation subroutine with a doubly pessimistic policy-optimization step.
Results: Under certain accuracy assumptions on the model-estimation subroutine, $\texttt{P}^2\texttt{MPO}$ is shown to be sample-efficient with robust partial coverage data, i.e., the offline dataset only needs good coverage of the distributions induced by the optimal robust policy and by perturbed models around the nominal model.

We study distributionally robust offline reinforcement learning (RL), which seeks to find an optimal robust policy purely from an offline dataset that can perform well in perturbed environments. We propose a generic algorithm framework Doubly Pessimistic Model-based Policy Optimization ($\texttt{P}^2\texttt{MPO}$) for robust offline RL, which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Here the double pessimism principle is crucial to overcome the distribution shift incurred by i) the mismatch between behavior policy and the family of target policies; and ii) the perturbation of the nominal model. Under certain accuracy assumptions on the model estimation subroutine, we show that $\texttt{P}^2\texttt{MPO}$ is provably sample-efficient with robust partial coverage data, which means that the offline dataset has good coverage of the distributions induced by the optimal robust policy and perturbed models around the nominal model. By tailoring specific model estimation subroutines for concrete examples including tabular Robust Markov Decision Process (RMDP), factored RMDP, and RMDP with kernel and neural function approximations, we show that $\texttt{P}^2\texttt{MPO}$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the number of trajectories in the offline dataset. Notably, these models, except for the tabular case, are first identified and proven tractable by this paper. To the best of our knowledge, we are the first to propose a general learning principle --- double pessimism --- for robust offline RL and to show that it is provably efficient in the context of general function approximations.

Adaptive Online Replanning with Diffusion Models
Siyuan Zhou Yilun Du Shun Zhang Mengdi Xu Yikang Shen Wei Xiao Dit-Yan Yeung Chuang Gan



Research question: How to replan effectively with diffusion models.
Motivation: Executing a plan without replanning lets action errors accumulate and ignores environment changes, while replanning at every timestep incurs substantial computational cost and can prevent successful task execution.
Method: Propose a principled rule for deciding when to replan, based on the diffusion model's estimated likelihood of the existing generated plan, together with a scheme for replanning existing trajectories that ensures the new plan keeps the same goal state as the original trajectory.
Results: Combined, these proposals significantly improve diffusion planners, yielding 38% gains over past diffusion planning approaches on Maze2D and enabling stochastic and long-horizon robotic control tasks.

Diffusion models have emerged as a promising approach to data-driven planning, and have demonstrated impressive performance in robotic control, reinforcement learning, and video planning. Given an effective planner, an important question to consider is replanning -- when given plans should be regenerated due to both action execution error and external environment changes. Direct plan execution, without replanning, is problematic as errors from individual actions rapidly accumulate and environments are partially observable and stochastic. Simultaneously, replanning at each timestep incurs a substantial computational cost, and may prevent successful task execution, as different generated plans prevent consistent progress to any particular goal. In this paper, we explore how we may effectively replan with diffusion models. We propose a principled approach to determine when to replan, based on the diffusion model's estimated likelihood of existing generated plans. We further present an approach to replan existing trajectories to ensure that new plans follow the same goal state as the original trajectory, which may efficiently bootstrap off previously generated plans. We illustrate how a combination of our proposed additions significantly improves the performance of diffusion planners, leading to 38\% gains over past diffusion planning approaches on Maze2D, and further enables handling of stochastic and long-horizon robotic control tasks.
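The when-to-replan rule reduces to a likelihood threshold test. Below is a minimal sketch, assuming a stand-in `log_likelihood_fn` for the diffusion model's (approximate) plan likelihood, e.g., an ELBO estimate; the threshold and the toy likelihood are illustrative, not the paper's.

```python
import numpy as np

def should_replan(plan, observed_state, log_likelihood_fn, threshold):
    """Sketch of the replanning trigger: regenerate the plan only when the
    diffusion model assigns low likelihood to the remaining plan, given the
    state actually reached."""
    return log_likelihood_fn(plan, observed_state) < threshold

# Toy stand-in: likelihood decays with the distance between the observed
# state and the state the plan predicted for this timestep (an assumption).
def toy_loglik(plan, observed_state):
    predicted = plan[0]
    return -np.sum((predicted - observed_state) ** 2)

plan = np.zeros((10, 2))                  # 10 planned 2-D states
print(should_replan(plan, np.array([0.1, 0.0]), toy_loglik, threshold=-0.5))
print(should_replan(plan, np.array([2.0, 2.0]), toy_loglik, threshold=-0.5))
```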

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes
Han Zhong Tong Zhang



Research question: Although PPO is highly successful in reinforcement learning, its theoretical understanding remains limited, in particular whether PPO or its optimistic variants can solve linear Markov decision processes (MDPs).
Motivation: To bridge this gap, an optimistic PPO variant is proposed for episodic adversarial linear MDPs with full-information feedback, and a state-of-the-art regret bound is established for it.
Method: A novel multi-batched updating mechanism is designed, and the theoretical analysis uses a new covering-number argument for the value and policy classes.
Results: Compared with existing policy-based algorithms, the algorithm achieves the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information.

The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback, and establish a $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$ regret for it. Here $d$ is the ambient dimension of linear MDPs, $H$ is the length of each episode, and $K$ is the number of episodes. Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information. Additionally, our algorithm design features a novel multi-batched updating mechanism and the theoretical analysis utilizes a new covering number argument of value and policy classes, which might be of independent interest.

Multi-Agent First Order Constrained Optimization in Policy Space
Youpeng Zhao Yaodong Yang Zhenbo Lu Wengang Zhou Houqiang Li



Research question: In multi-agent reinforcement learning (MARL), how to achieve high performance while avoiding unsafe actions.
Motivation: For real-life applications, the ability to avoid unsafe actions is urgent and necessary, yet developing a safety-aware method for MARL remains challenging.
Method: This paper proposes Multi-Agent First Order Constrained Optimization in Policy Space (MAFOCOPS), which addresses the dual objectives of attaining satisfactory performance and enforcing safety constraints. Using data generated by the current policy, the method first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space, and then projects the update policy back into the parametric policy space to obtain a feasible policy. Notably, the method is first-order in nature, easy to implement, and exhibits an approximate upper bound on the worst-case constraint violation.
Results: Empirical results show the approach achieves remarkable performance while satisfying safety constraints on several safe MARL benchmarks.

In the realm of multi-agent reinforcement learning (MARL), achieving high performance is crucial for a successful multi-agent system. Meanwhile, the ability to avoid unsafe actions is becoming an urgent and imperative problem to solve for real-life applications. Whereas, it is still challenging to develop a safety-aware method for multi-agent systems in MARL. In this work, we introduce a novel approach called Multi-Agent First Order Constrained Optimization in Policy Space (MAFOCOPS), which effectively addresses the dual objectives of attaining satisfactory performance and enforcing safety constraints. Using data generated from the current policy, MAFOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. Then, the update policy is projected back into the parametric policy space to achieve a feasible policy. Notably, our method is first-order in nature, ensuring the ease of implementation, and exhibits an approximate upper bound on the worst-case constraint violation. Empirical results show that our approach achieves remarkable performance while satisfying safe constraints on several safe MARL benchmarks.

Two Heads are Better Than One: A Simple Exploration Framework for Efficient Multi-Agent Reinforcement Learning
Jiahui Li Kun Kuang Baoxiang Wang Xingchen Li Fei Wu Jun Xiao Long Chen



Research question: Exploration strategies play an important role in reinforcement learning, especially in sparse-reward tasks; in MARL, the large state space and complex interactions among agents make designing a suitable exploration strategy even more challenging.
Motivation: Mainstream MARL exploration methods either focus on exploring large, sparse unknown state spaces, or measure inter-agent interactions at high computational cost. Different exploration strategies play different roles in different MARL scenarios, and choosing a suitable one is often more effective than designing an elaborate algorithm.
Method: We propose COIN, which combines curiosity-based and influence-based exploration. First, COIN measures each agent's influence on the other agents via mutual information theory and uses it as an intrinsic reward applied to each individual value function. Second, COIN computes curiosity-based intrinsic rewards from prediction errors and adds them to the extrinsic reward. To integrate the two kinds of intrinsic rewards, COIN uses a novel framework in which they complement each other, yielding sufficient and effective exploration on cooperative MARL tasks.
Results: Extensive experiments on various challenging benchmarks show the superiority of the method across different scenarios.

Exploration strategy plays an important role in reinforcement learning, especially in sparse-reward tasks. In cooperative multi-agent reinforcement learning~(MARL), designing a suitable exploration strategy is much more challenging due to the large state space and the complex interaction among agents. Currently, mainstream exploration methods in MARL either contribute to exploring the unfamiliar states, which are large and sparse, or to measuring the interaction among agents, at high computational cost. We found an interesting phenomenon that different kinds of exploration play different roles in different MARL scenarios, and choosing a suitable one is often more effective than designing an exquisite algorithm. In this paper, we propose an exploration method that incorporates the \underline{C}uri\underline{O}sity-based and \underline{IN}fluence-based exploration~(COIN), which is simple but effective in various situations. First, COIN measures the influence of each agent on the other agents based on mutual information theory and designs it as intrinsic rewards which are applied to each individual value function. Moreover, COIN computes the curiosity-based intrinsic rewards via prediction errors which are added to the extrinsic reward. To integrate the two kinds of intrinsic rewards, COIN utilizes a novel framework in which they complement each other and lead to sufficient and effective exploration on cooperative MARL tasks. We perform extensive experiments on different challenging benchmarks, and results across different scenarios show the superiority of our method.
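The way the two bonuses are routed is the core of COIN: influence feeds the individual value functions, curiosity augments the shared reward. A minimal sketch follows; the mixing coefficients `beta_c` and `beta_i` are assumptions for illustration.

```python
import numpy as np

def coin_rewards(r_ext, curiosity_bonus, influence_bonus, beta_c=0.1, beta_i=0.1):
    """Sketch of COIN's reward composition (coefficients are assumptions):
    the curiosity term (prediction error) augments the shared extrinsic
    reward, while the influence term is routed to each agent's individual
    value function."""
    shared = r_ext + beta_c * curiosity_bonus      # trains the joint critic
    individual = beta_i * influence_bonus          # one entry per agent
    return shared, individual

# Toy usage for 3 agents at one timestep.
r_ext = 1.0
curiosity = 0.8                                    # e.g., forward-model error
influence = np.array([0.2, 0.05, 0.6])             # e.g., MI-based estimates
print(coin_rewards(r_ext, curiosity, influence))
```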

Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep Ensemble Agents
Woojun Kim Yongjae Shin Jongeui Park Youngchul Sung



Research question: Deep reinforcement learning (RL) has achieved remarkable success on complex tasks by using deep neural networks (DNNs) as function approximators, but this reliance introduces primacy bias: the function approximators tend to prioritize early experiences, leading to overfitting.
Motivation: To alleviate this bias, a reset method has been proposed that periodically resets part or all of a deep RL agent while preserving the replay buffer. However, resets can cause performance collapse immediately after they are executed, which is concerning from the perspectives of safe RL and regret minimization.
Method: This paper proposes a novel reset-based method that leverages deep ensemble learning to address the limitations of the vanilla reset method and enhance sample efficiency.
Results: The method's effectiveness is validated through various experiments, including in the safe RL domain; numerical results show its potential for real-world applications requiring high sample efficiency and safety considerations.

Deep reinforcement learning (RL) has achieved remarkable success in solving complex tasks through its integration with deep neural networks (DNNs) as function approximators. However, the reliance on DNNs has introduced a new challenge called primacy bias, whereby these function approximators tend to prioritize early experiences, leading to overfitting. To alleviate this bias, a reset method has been proposed, which involves periodic resets of a portion or the entirety of a deep RL agent while preserving the replay buffer. However, the use of this method can result in performance collapses after executing the reset, raising concerns from the perspective of safe RL and regret minimization. In this paper, we propose a novel reset-based method that leverages deep ensemble learning to address the limitations of the vanilla reset method and enhance sample efficiency. The effectiveness of the proposed method is validated through various experiments including those in the domain of safe RL. Numerical results demonstrate its potential for real-world applications requiring high sample efficiency and safety considerations.
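One natural way to combine resets with an ensemble is to reset members one at a time on a staggered schedule while acting through the ensemble, so a freshly reset member cannot collapse behavior on its own. The sketch below illustrates this idea under stated assumptions (the staggered schedule, action averaging, and all names are illustrative, not the paper's exact design).

```python
import numpy as np

class ResetEnsembleAgent:
    """Sketch of a reset-ensemble agent (schedule and names are assumptions):
    keep K learners sharing one replay buffer, reset one member at a time,
    and act through the ensemble so resets do not collapse behavior."""

    def __init__(self, make_learner, k=3, reset_period=10_000):
        self.make_learner = make_learner
        self.learners = [make_learner() for _ in range(k)]
        self.reset_period, self.steps, self.next_to_reset = reset_period, 0, 0
        self.replay = []                    # preserved across resets

    def observe(self, transition):
        self.replay.append(transition)
        self.steps += 1
        if self.steps % self.reset_period == 0:
            # Reset only one member; the others retain their knowledge.
            self.learners[self.next_to_reset] = self.make_learner()
            self.next_to_reset = (self.next_to_reset + 1) % len(self.learners)

    def act(self, obs):
        # Ensemble action: average the members' proposals (continuous control).
        return np.mean([l.act(obs) for l in self.learners], axis=0)

class ToyLearner:
    def act(self, obs):
        return np.tanh(obs)                 # placeholder policy

agent = ResetEnsembleAgent(ToyLearner, k=3, reset_period=5)
for _ in range(12):
    agent.observe(("s", "a", 0.0, "s'"))
print(agent.act(np.array([0.3, -0.1])))
```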

Distributional Pareto-Optimal Multi-Objective Reinforcement Learning
Xin-Qiang Cai Pushi Zhang Li Zhao Jiang Bian Masashi Sugiyama Ashley Juan Llorens



Research question: Existing multi-objective reinforcement learning algorithms fail to account for distributional preferences over returns, which are particularly important in real-world scenarios such as autonomous driving.
Motivation: To address this, the concept of Pareto-optimality in multi-objective RL is extended to distributional Pareto-optimality, capturing the optimality of return distributions rather than just their expectations.
Method: Propose Distributional Pareto-Optimal Multi-Objective Reinforcement Learning (DPMORL), which learns distributional Pareto-optimal policies that balance multiple objectives while accounting for return uncertainty.
Results: Evaluations on several benchmark problems demonstrate its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences, compared with existing multi-objective RL methods.

Multi-objective reinforcement learning (MORL) has been proposed to learn control policies over multiple competing objectives with each possible preference over returns. However, current MORL algorithms fail to account for distributional preferences over the multi-variate returns, which are particularly important in real-world scenarios such as autonomous driving. To address this issue, we extend the concept of Pareto-optimality in MORL into distributional Pareto-optimality, which captures the optimality of return distributions, rather than the expectations. Our proposed method, called Distributional Pareto-Optimal Multi-Objective Reinforcement Learning~(DPMORL), is capable of learning distributional Pareto-optimal policies that balance multiple objectives while considering the return uncertainty. We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods.

Efficient Diffusion Policies For Offline Reinforcement Learning
Bingyi Kang Xiao Ma Chao Du Tianyu Pang Shuicheng YAN



Research question: This paper addresses policy parameterization in offline reinforcement learning (RL), a crucial but often overlooked design choice, and two key limitations of Diffusion-QL: (1) running the whole Markov chain forward and backward during training is computationally inefficient; (2) it is incompatible with maximum-likelihood-based RL algorithms (e.g., policy gradient methods).
Motivation: Diffusion-QL significantly improves offline RL by representing the policy with a diffusion model, but it relies on a parameterized Markov chain with hundreds of sampling steps that must be run through during training, making it computationally inefficient and incompatible with likelihood-based RL algorithms.
Method: To overcome these issues, we propose Efficient Diffusion Policy (EDP), which approximately constructs actions from corrupted actions during training, avoiding running the full sampling chain. We conduct extensive experiments on the D4RL benchmark.
Results: Experiments show EDP reduces diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art results on D4RL by large margins over previous methods.

Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL has significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parametrized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to forward and backward through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) as the likelihood of diffusion models is intractable. Therefore, we propose efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones at training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods.
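"Constructing actions from corrupted ones" can be illustrated with the standard one-step DDPM reconstruction: corrupt a dataset action to a random diffusion step, then invert that single step from the predicted noise instead of running the full reverse chain. A hedged sketch follows; the noise schedule, the placeholder `eps_model`, and all names are assumptions, not EDP's exact recipe.

```python
import torch

def edp_action_approximation(eps_model, a0, t, alpha_bar):
    """Sketch of EDP-style action approximation (schedule details assumed):
    corrupt a dataset action to step t, then reconstruct an action estimate
    in a single denoising step instead of running the full reverse chain."""
    noise = torch.randn_like(a0)
    ab = alpha_bar[t].unsqueeze(-1)                    # \bar{alpha}_t per sample
    a_t = ab.sqrt() * a0 + (1 - ab).sqrt() * noise     # forward corruption
    eps_hat = eps_model(a_t, t)                        # predicted noise
    a_hat = (a_t - (1 - ab).sqrt() * eps_hat) / ab.sqrt()
    return a_hat                                       # e.g., feed to Q-training

# Toy usage: a linear "noise model" over 4-D actions, T = 100 steps.
T, batch = 100, 8
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)
eps_model = lambda a_t, t: 0.1 * a_t                   # placeholder network
a0 = torch.randn(batch, 4)
t = torch.randint(0, T, (batch,))
print(edp_action_approximation(eps_model, a0, t, alpha_bar).shape)
```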

Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents
Wonje Choi Woo Kyung Kim SeungHyun Kim Honguk Woo



Research question: How to let embodied reinforcement learning agents rapidly adapt to unseen environmental visual observations, achieving zero-shot adaptation.
Motivation: In embodied RL, rapid adaptation to unseen environments is a challenging problem.
Method: Propose a novel contrastive prompt ensemble (ConPE) framework that uses a pretrained vision-language model and a set of visual prompts, enabling the agent to learn and adapt efficiently under a wide range of environmental and physical changes.
Results: Experiments show that ConPE outperforms other state-of-the-art algorithms on several embodied agent tasks, including navigation in AI2THOR, manipulation in Metaworld, and autonomous driving in CARLA, while also improving the sample efficiency of policy learning and adaptation.

For embodied reinforcement learning (RL) agents interacting with the environment, it is desirable to have rapid policy adaptation to unseen visual observations, but achieving zero-shot adaptation capability is considered as a challenging problem in the RL context. To address the problem, we present a novel contrastive prompt ensemble (ConPE) framework which utilizes a pretrained vision-language model and a set of visual prompts, thus enabling efficient policy learning and adaptation upon a wide range of environmental and physical changes encountered by embodied agents. Specifically, we devise a guided-attention-based ensemble approach with multiple visual prompts on the vision-language model to construct robust state representations. Each prompt is contrastively learned in terms of an individual domain factor that significantly affects the agent's egocentric perception and observation. For a given task, the attention-based ensemble and policy are jointly learned so that the resulting state representations not only generalize to various domains but are also optimized for learning the task. Through experiments, we show that ConPE outperforms other state-of-the-art algorithms for several embodied agent tasks including navigation in AI2THOR, manipulation in Metaworld, and autonomous driving in CARLA, while also improving the sample efficiency of policy learning and adaptation.

Taylor TD-learning
Michele Garibbo Maxime Robeyns Laurence Aitchison



Research question: Many reinforcement learning methods rely on TD-learning to learn a critic, but TD updates can be high variance.
Motivation: This paper proposes a model-based RL framework, Taylor TD, to reduce this variance in continuous state-action settings.
Method: Taylor TD uses a first-order Taylor series expansion of TD updates, which allows it to analytically integrate over stochasticity in the action choice, and over some stochasticity in the state distribution, for the initial state and action of each TD update.
Results: Theoretical and empirical evidence shows that Taylor TD updates indeed have lower variance than standard TD updates; moreover, under a reasonable assumption, Taylor TD retains the same stable learning guarantees as standard TD-learning with linear function approximation. Combining Taylor TD with the TD3 algorithm yields TaTD3, which performs as well as, if not better than, several state-of-the-art model-free and model-based baselines on a set of standard benchmark tasks.

Many reinforcement learning approaches rely on temporal-difference (TD) learning to learn a critic. However, TD-learning updates can be high variance. Here, we introduce a model-based RL framework, Taylor TD, which reduces this variance in continuous state-action settings. Taylor TD uses a first-order Taylor series expansion of TD updates. This expansion allows Taylor TD to analytically integrate over stochasticity in the action-choice, and some stochasticity in the state distribution for the initial state and action of each TD update. We include theoretical and empirical evidence that Taylor TD updates are indeed lower variance than standard TD updates. Additionally, we show Taylor TD has the same stable learning guarantees as standard TD-learning with linear function approximation under a reasonable assumption. Next, we combine Taylor TD with the TD3 algorithm, forming TaTD3. We show TaTD3 performs as well, if not better, than several state-of-the art model-free and model-based baseline algorithms on a set of standard benchmark tasks.
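The flavor of "analytically integrating a first-order expansion" can be seen in a simple identity: if the TD error is expanded to first order in Gaussian action noise, $\delta(a+\varepsilon) \approx \delta(a) + \varepsilon^\top \nabla_a \delta(a)$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then $\mathbb{E}[\delta(a+\varepsilon)^2] \approx \delta(a)^2 + \sigma^2 \|\nabla_a \delta(a)\|^2$, replacing a Monte Carlo average over perturbed actions with a closed-form expression. The sketch below implements this identity; it conveys the variance-reduction idea under these assumptions but is not the paper's exact estimator.

```python
import torch

def taylor_expected_sq_td_error(q_net, target, s, a, sigma):
    """Sketch of analytic integration over action noise (Taylor-TD flavor):
    E[delta(a+eps)^2] ~= delta(a)^2 + sigma^2 * ||grad_a delta(a)||^2
    for eps ~ N(0, sigma^2 I), instead of sampling perturbed actions."""
    a = a.clone().requires_grad_(True)
    delta = target - q_net(s, a)                   # TD error at the mean action
    grad_a = torch.autograd.grad(delta.sum(), a, create_graph=True)[0]
    return (delta ** 2 + sigma ** 2 * grad_a.pow(2).sum(-1, keepdim=True)).mean()

# Toy usage with a placeholder critic Q(s, a).
q_net = lambda s, a: (s * a).sum(-1, keepdim=True)
s, a = torch.randn(16, 3), torch.randn(16, 3)
target = torch.randn(16, 1)
print(taylor_expected_sq_td_error(q_net, target, s, a, sigma=0.1))
```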

Necessary and Sufficient Conditions for Optimal Decision Trees using Dynamic Programming
Jacobus G.M. van der Linden Mathijs de Weerdt Emir Demirović



Research question: How to globally optimize decision trees for accuracy, size, and human comprehensibility.
Motivation: Global optimization of decision trees is promising for accuracy, size, and comprehensibility, but many methods rely on general-purpose solvers whose scalability is limited.
Method: Dynamic programming exploits the tree structure far better, but only when subtrees can be optimized independently. We explore this relationship in detail, give necessary and sufficient conditions for such separability, and propose a framework that can optimize any combination of separable objectives and constraints.
Results: Experiments on five application domains show the general applicability of the framework, while outperforming the scalability of general-purpose solvers by a large margin.

Global optimization of decision trees has been shown to be promising in terms of accuracy, size, and consequently human comprehensibility. However, many of the methods used rely on general-purpose solvers, for which scalability remains an issue. Dynamic programming methods have been shown to scale much better because they exploit the tree structure by solving subtrees as independent subproblems. However, this only works when an objective can be optimized separately for subtrees. We explore this relationship in detail, show necessary and sufficient conditions for such separability, and generalize previous dynamic programming approaches into a framework that can optimize any combination of separable objectives and constraints. Experiments on five application domains show the general applicability of this framework, while outperforming the scalability of general-purpose solvers by a large margin.

Efficient Subgame Refinement for Extensive-form Games
Zhenxing Ge Zheng Xu Tianyu Ding Wenbin Li Yang Gao



Research question: This paper addresses subgame solving in large imperfect-information games, where the intricate nature and substantial size of many real-world games make it difficult to directly apply existing subgame-solving techniques.
Motivation: Recent subgame-solving methods allow solving on limited-knowledge-order subgames, increasing their applicability in large games; however, they can still be blocked by excessively large information sets.
Method: We propose a generative subgame solving (GS2) framework, which uses a generation function to identify a subset of the earliest-reached nodes, thereby reducing the size of the subgame. The method is supported by theoretical analysis and employs a diversity-based generation function to enhance safety.
Results: Experiments on medium-sized games, as well as the challenging large game of GuanDan, show a significant improvement over the blueprint.

Subgame solving is an essential technique in addressing large imperfect information games, with various approaches developed to enhance the performance of refined strategies in the abstraction of the target subgame. However, directly applying existing subgame solving techniques may be difficult, due to the intricate nature and substantial size of many real-world games. To overcome this issue, recent subgame solving methods allow for subgame solving on limited knowledge order subgames, increasing their applicability in large games; yet this may still face obstacles due to extensive information set sizes. To address this challenge, we propose a generative subgame solving (GS2) framework, which utilizes a generation function to identify a subset of the earliest-reached nodes, reducing the size of the subgame. Our method is supported by a theoretical analysis and employs a diversity-based generation function to enhance safety. Experiments conducted on medium-sized games as well as the challenging large game of GuanDan demonstrate a significant improvement over the blueprint.

Extracting Reward Functions from Diffusion Models
Felipe Pinto Coelho Nuti Tim Franzmeyer Joao F. Henriques



Research question: How to extract a relative reward function between two diffusion models for steering decision-making.
Motivation: Diffusion models excel at image generation and sequential decision-making tasks, but steering them requires an effective reward function.
Method: Define and learn a relative reward function by comparing a diffusion model of low-reward behavior with one of high-reward behavior.
Results: The method recovers correct reward functions in navigation environments and yields significant improvements on locomotion benchmarks and image generation tasks, demonstrating that the approach generalizes.

Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn high-performing policies in sequential decision-making tasks. Decision-making diffusion models can be trained on lower-quality data, and then be steered with a reward function to generate near-optimal trajectories. We consider the problem of extracting a reward function by comparing a decision-making diffusion model that models low-reward behavior and one that models high-reward behavior; a setting related to inverse reinforcement learning. We first define the notion of a \emph{relative reward function of two diffusion models} and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function -- parametrized by a neural network -- to the difference in outputs of both diffusion models. Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks. Finally, we demonstrate that our approach generalizes beyond sequential decision-making by learning a reward-like function from two large-scale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images.
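The learning algorithm aligns the input-gradient of a reward network with the difference in the two diffusion models' outputs. A minimal sketch of that alignment objective follows; the exact loss form, the placeholder diffusion nets, and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

def relative_reward_loss(reward_net, eps_low, eps_high, x, t):
    """Sketch of gradient alignment (loss form is an assumption): train r(x)
    so that grad_x r(x) matches the difference between the two diffusion
    models' noise predictions, which encodes how the high-reward model's
    behavior departs from the low-reward model's."""
    x = x.clone().requires_grad_(True)
    r = reward_net(x).sum()
    grad_r = torch.autograd.grad(r, x, create_graph=True)[0]
    target = eps_low(x, t) - eps_high(x, t)        # difference in outputs
    return (grad_r - target).pow(2).mean()

reward_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
eps_low = lambda x, t: 0.5 * x                     # placeholder diffusion nets
eps_high = lambda x, t: 0.2 * x
x, t = torch.randn(32, 4), torch.zeros(32, dtype=torch.long)
loss = relative_reward_loss(reward_net, eps_low, eps_high, x, t)
loss.backward()
print(float(loss))
```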

DIFFER: Decomposing Individual Reward for Fair Experience Replay in Multi-Agent Reinforcement Learning
Xunhan Hu Jian Zhao Wengang Zhou Ruili Feng Houqiang Li



Research question: In multi-agent reinforcement learning (MARL), how to effectively decompose the team reward into individual rewards so as to enable fair experience replay.
Motivation: Existing methods struggle to decompose the team reward into individual rewards, making it hard to distinguish and exploit important individual experiences.
Method: Propose the DIFFER framework: by enforcing the invariance of network gradients, a partial differential equation is established whose solution yields the individual reward function; from the solved individual rewards, the importance of each piece of experience for the learning task is computed and used to guide training.
Results: Experiments on several popular benchmarks validate the theory and method, showing significant improvements in learning efficiency and fairness.

Cooperative multi-agent reinforcement learning (MARL) is a challenging task, as agents must learn complex and diverse individual strategies from a shared team reward. However, existing methods struggle to distinguish and exploit important individual experiences, as they lack an effective way to decompose the team reward into individual rewards. To address this challenge, we propose DIFFER, a powerful theoretical framework for decomposing individual rewards to enable fair experience replay in MARL. By enforcing the invariance of network gradients, we establish a partial differential equation whose solution yields the underlying individual reward function. The individual TD-error can then be computed from the solved closed-form individual rewards, indicating the importance of each piece of experience in the learning task and guiding the training process. Our method elegantly achieves an equivalence to the original learning framework when individual experiences are homogeneous, while also adapting to achieve greater efficiency and fairness when diversity is observed. Our extensive experiments on popular benchmarks validate the effectiveness of our theory and method, demonstrating significant improvements in learning efficiency and fairness. Code is available in the supplementary material.

Efficient Potential-based Exploration in Reinforcement Learning using Inverse Dynamic Bisimulation Metric
YIMING WANG Ming Yang Renzhi Dong Binbin Sun Furui Liu Leong Hou U



Research question: How to effectively integrate domain knowledge into reinforcement learning, improving exploration efficiency while reducing human cognitive biases.
Motivation: Traditional potential-based reward shaping relies entirely on manually designed shaping reward functions, which greatly restricts exploration efficiency and introduces human cognitive biases.
Method: Propose a general end-to-end potential-based exploration bonus for deep RL via potentials of state discrepancy: the novelty of adjacent states is measured by their distance, encouraging the agent to discover novel states and providing denser rewards without manual intervention.
Results: Extensive experiments on MuJoCo and the Arcade Learning Environments verify the superiority and scalability of the method compared with other competitive approaches.

Reward shaping is an effective technique for integrating domain knowledge into reinforcement learning (RL). However, traditional approaches like potential-based reward shaping rely entirely on manually designed shaping reward functions, which significantly restricts exploration efficiency and introduces human cognitive biases. A number of RL methods have been proposed to boost exploration by designing an intrinsic reward signal as an exploration bonus; nevertheless, these methods heavily rely on the count-based episodic term in their exploration bonus, which falls short in scalability. To address these limitations, we propose a general end-to-end potential-based exploration bonus for deep RL via potentials of state discrepancy, which motivates the agent to discover novel states and provides them with denser rewards without manual intervention. Specifically, we measure the novelty of adjacent states by calculating their distance using the bisimulation metric-based potential function, which enhances the agent's exploration and ensures policy invariance. In addition, we offer a theoretical guarantee on our inverse dynamic bisimulation metric, bounding the value difference and ensuring that the agent explores states with higher TD error, thus significantly improving training efficiency. The proposed approach is named \textbf{LIBERTY} (exp\textbf{L}oration v\textbf{I}a \textbf{B}isimulation m\textbf{E}t\textbf{R}ic-based s\textbf{T}ate discrepanc\textbf{Y}), and it is comprehensively evaluated on the MuJoCo and the Arcade Learning Environments. Extensive experiments have verified the superiority and scalability of our algorithm compared with other competitive methods.
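The policy-invariance property the abstract cites is the classical potential-based shaping identity: adding $F(s, s') = \gamma\,\phi(s') - \phi(s)$ to the environment reward preserves the optimal policy for any potential $\phi$. A minimal sketch, with a toy stand-in potential in place of LIBERTY's learned bisimulation-metric-based one:

```python
import numpy as np

def shaped_reward(r_env, phi_s, phi_s_next, gamma=0.99):
    """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s) preserves
    the optimal policy for any potential phi. LIBERTY derives the potential
    from a learned bisimulation-metric state discrepancy; `phi` below is a
    toy stand-in."""
    return r_env + gamma * phi_s_next - phi_s

# Toy usage: potential = distance of the state from a reference state, so
# moving away from already-visited territory earns a denser bonus.
reference = np.zeros(2)
phi = lambda s: float(np.linalg.norm(s - reference))
s, s_next = np.array([0.1, 0.0]), np.array([0.8, 0.5])
print(shaped_reward(0.0, phi(s), phi(s_next)))
```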

Iteratively Learn Diverse Strategies with State Distance Information
Wei Fu Weihua Du Jingwei Li Sunli Chen Jingzhao Zhang Yi Wu



Research question: In complex reinforcement learning problems, how to optimize reward while discovering as many diverse strategies as possible.
Motivation: Strategic diversity is crucial in many practical applications, but existing diversity measures fail to accurately capture the behavioral differences between policies.
Method: Propose a new diversity measure that incorporates state-space distance information, and compare two common computation frameworks: population-based training (PBT) and iterative learning (ITR).
Results: Experiments show that the new algorithm, SIPO, consistently produces strategically diverse and human-interpretable policies across all tested environments, policies that existing baselines fail to discover.

In complex reinforcement learning (RL) problems, policies with similar rewards may have substantially different behaviors. It remains a fundamental challenge to optimize rewards while also discovering as many *diverse* strategies as possible, which can be crucial in many practical applications. Our study examines two design choices for tackling this challenge, i.e., *diversity measure* and *computation framework*. First, we find that with existing diversity measures, visually indistinguishable policies can still yield high diversity scores. To accurately capture the behavioral difference, we propose to incorporate the state-space distance information into the diversity measure. In addition, we examine two common computation frameworks for this problem, i.e., population-based training (PBT) and iterative learning (ITR). We show that although PBT is the precise problem formulation, ITR can achieve comparable diversity scores with higher computation efficiency, leading to improved solution quality in practice. Based on our analysis, we further combine ITR with two tractable realizations of the state-distance-based diversity measures and develop a novel diversity-driven RL algorithm, *State-based Intrinsic-reward Policy Optimization* (SIPO), with provable convergence properties. We empirically examine SIPO across three domains from robot locomotion to multi-agent games. In all of our testing environments, SIPO consistently produces strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.

RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization
Siqi Shen Chennan Ma Chao Li Weiquan Liu Yongquan Fu Songzhu Mei Xinwang Liu Cheng Wang



Research question: Multi-agent systems face significant risks from environmental uncertainty, varying agent policies, and partial observability; learning coordinated, decentralized, risk-sensitive policies in multi-agent reinforcement learning (MARL) is challenging.
Motivation: To formulate coordination requirements in risk-sensitive MARL, we propose the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires the collection of each agent's risk-sensitive action selections to be equivalent to the risk-sensitive action selection of the central policy.
Method: We propose RiskQ to address this problem: it models the joint return distribution by modeling its quantiles as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for VaR and distorted risk measures.
Results: Experiments show that RiskQ obtains promising performance. The source code of RiskQ is available at https://github.com/xmu-rl-3dv/RiskQ.

Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as the Value at Risk (VaR) metric or distorted risk measurements. Therefore, we propose RiskQ to address this limitation, which models the joint return distribution by modeling quantiles of it as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for the VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available in https://github.com/xmu-rl-3dv/RiskQ.

Sample-efficient Multi-objective Molecular Optimization with GFlowNets
Yiheng Zhu Jialu Wu Chaowen Hu Jiahuan Yan Chang-Yu Hsieh Tingjun Hou Jian Wu



Research question: Designing novel molecules with desired properties, a black-box optimization problem over the discrete chemical space.
Motivation: In practice, multiple conflicting objectives and costly evaluations (e.g., wet-lab experiments) make the diversity of candidates paramount; existing computational methods have had initial success but still struggle with diversity in both objective and search space.
Method: We propose a multi-objective Bayesian optimization (MOBO) algorithm that uses hypernetwork-based GFlowNets (HN-GFN) as the acquisition function optimizer, with the goal of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front.
Results: Experiments show that HN-GFN has adequate capacity to generalize over preferences; moreover, across various real-world MOBO settings, the framework significantly outperforms existing methods in candidate quality and sample efficiency.

Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as a black-box optimization problem over the *discrete* chemical space. In practice, multiple conflicting objectives and costly evaluations (e.g., wet-lab experiments) make the *diversity* of candidates paramount. Computational methods have achieved initial success but still struggle with considering diversity in both objective and search space. To fill this gap, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. We further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. We empirically illustrate that HN-GFN has adequate capacity to generalize over preferences. Moreover, experiments in various real-world MOBO settings demonstrate that our framework predominantly outperforms existing methods in terms of candidate quality and sample efficiency. The code is available at https://github.com/violet-sto/HN-GFN.

Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL
Chen Sun Wannan Yang Thomas Jiralerspong Dane Malenfant Benjamin Alsbury-Nealy Yoshua Bengio Blake Aaron Richards



Research question: How to more accurately identify and exploit the critical steps underlying success in reinforcement learning.
Motivation: Traditional reinforcement learning methods struggle to identify critical steps, because success is often contingent on multiple critical steps that are distant in time from each other.
Method: Propose a new RL algorithm, Contrastive Retrospection (ConSpec), which uses offline contrastive learning to hone in on these critical steps: prototypes of the critical steps in a task are learned via a novel contrastive loss, and an intrinsic reward is delivered when the current state matches one of the prototypes.
Results: Experiments show that ConSpec rapidly identifies all critical steps and generalizes out of distribution when sensory features change, significantly improving learning across a diverse set of RL tasks.

In real life, success is often contingent upon multiple critical steps that are distant in time from each other and from the final reward. These critical steps are challenging to identify with traditional reinforcement learning (RL) methods that rely on the Bellman equation for credit assignment. Here, we present a new RL algorithm that uses offline contrastive learning to hone in on these critical steps. This algorithm, which we call Contrastive Retrospection (ConSpec), can be added to any existing RL algorithm. ConSpec learns a set of prototypes for the critical steps in a task by a novel contrastive loss and delivers an intrinsic reward when the current state matches one of the prototypes. The prototypes in ConSpec provide two key benefits for credit assignment: (i) They enable rapid identification of all the critical steps. (ii) They do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. Distinct from other contemporary RL approaches to credit assignment, ConSpec takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon (and ignoring other states) than it is to prospectively predict reward at every taken step. ConSpec greatly improves learning in a diverse set of RL tasks. The code is available at the link: https://github.com/sunchipsster1/ConSpec
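The intrinsic-reward mechanism is a prototype match on state embeddings. A minimal sketch follows; the cosine-similarity criterion, threshold, and scale are illustrative assumptions rather than ConSpec's exact formulation.

```python
import torch
import torch.nn.functional as F

def conspec_intrinsic_reward(state_embedding, prototypes, threshold=0.8, scale=1.0):
    """Sketch of a ConSpec-style intrinsic reward (threshold/scale assumed):
    compare the current state's embedding with each learned critical-step
    prototype and pay a bonus when the best match is close enough."""
    sims = F.cosine_similarity(
        state_embedding.unsqueeze(0), prototypes, dim=-1)   # one per prototype
    best = sims.max()
    return scale * best if best > threshold else torch.tensor(0.0)

# Toy usage: 5 prototypes over a 16-D embedding space.
torch.manual_seed(0)
prototypes = F.normalize(torch.randn(5, 16), dim=-1)
state = prototypes[2] + 0.05 * torch.randn(16)              # near prototype 2
print(conspec_intrinsic_reward(state, prototypes))
```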

Multi-Modal Inverse Constrained Reinforcement Learning from a Mixture of Demonstrations
Guanren Qiao Guiliang Liu Pascal Poupart zhiqiang xu



Research question: Existing inverse constrained reinforcement learning algorithms typically assume the demonstration data is generated by a single type of expert; in practice, demonstrations often comprise a mixture of trajectories collected from different expert agents respecting different constraints, making it challenging to explain expert behavior with a unified constraint function.
Motivation: To address this, we propose a Multi-Modal Inverse Constrained Reinforcement Learning (MMICRL) algorithm for simultaneously estimating multiple constraints corresponding to different types of experts.
Method: MMICRL constructs a flow-based density estimator that enables unsupervised expert identification from demonstrations, from which agent-specific constraints are inferred. Following these constraints, MMICRL imitates expert policies with a novel multi-modal constrained policy optimization objective that minimizes the agent-conditioned policy entropy and maximizes the unconditioned one. To enhance robustness, this objective is incorporated into a contrastive learning framework, which enables imitation policies to capture the diversity of behaviors among expert agents.
Results: Extensive experiments in both discrete and continuous environments show that MMICRL outperforms other baselines in terms of constraint recovery and control performance.

Inverse Constraint Reinforcement Learning (ICRL) aims to recover the underlying constraints respected by expert agents in a data-driven manner. Existing ICRL algorithms typically assume that the demonstration data is generated by a single type of expert. However, in practice, demonstrations often comprise a mixture of trajectories collected from various expert agents respecting different constraints, making it challenging to explain expert behaviors with a unified constraint function. To tackle this issue, we propose a Multi-Modal Inverse Constrained Reinforcement Learning (MMICRL) algorithm for simultaneously estimating multiple constraints corresponding to different types of experts. MMICRL constructs a flow-based density estimator that enables unsupervised expert identification from demonstrations, so as to infer the agent-specific constraints. Following these constraints, MMICRL imitates expert policies with a novel multi-modal constrained policy optimization objective that minimizes the agent-conditioned policy entropy and maximizes the unconditioned one. To enhance robustness, we incorporate this objective into the contrastive learning framework. This approach enables imitation policies to capture the diversity of behaviors among expert agents. Extensive experiments in both discrete and continuous environments show that MMICRL outperforms other baselines in terms of constraint recovery and control performance.

Provably Safe Reinforcement Learning with Step-wise Violation Constraints
Nuoya Xiong Yihan Du Longbo Huang



Research question: This paper studies a novel safe reinforcement learning problem with step-wise violation constraints; unlike existing work, it focuses on stricter step-wise constraints and does not assume the existence of safe actions.
Motivation: The formulation suits strictly safety-critical applications that must ensure safety at every decision step but may not always possess safe actions, e.g., robot control and autonomous driving.
Method: We propose an efficient algorithm, SUCBVI, which guarantees optimal performance in both step-wise violation and regret. We further study an innovative safe reward-free exploration problem and design the SRF-UCRL algorithm to find a near-optimal safe policy.
Results: Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results.

We investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we focus on stricter step-wise violation constraints and do not assume the existence of safe actions, making our formulation more suitable for safety-critical applications that need to ensure safety in all decision steps but may not always possess safe actions, e.g., robot control and autonomous driving. We propose an efficient algorithm SUCBVI, which guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ or gap-dependent $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}} + S^2AH^2)$ step-wise violation and $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$ regret. Lower bounds are provided to validate the optimality in both violation and regret performance with respect to the number of states $S$ and the total number of steps $T$. Moreover, we further study an innovative safe reward-free exploration problem with step-wise violation constraints. For this problem, we design algorithm SRF-UCRL to find a near-optimal safe policy, which achieves nearly state-of-the-art sample complexity $\widetilde{\mathcal{O}}((\frac{S^2AH^2}{\varepsilon}+\frac{H^4SA}{\varepsilon^2})(\log(\frac{1}{\delta})+S))$, and guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation during exploration. Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results.

Unsupervised Behavior Extraction via Random Intent Priors
Hao Hu Yiqin Yang Jianing Ye Ziqing Mai Chongjie Zhang



Research question: How to fully exploit the rich prior knowledge of human behaviors contained in abundant reward-free data to improve offline reinforcement learning.
Motivation: Although reward-free data is plentiful, existing offline RL algorithms do not exploit it well.
Method: Propose UBER, which assigns different pseudo-rewards, sampled from a given prior distribution, to different agents so as to extract a diverse set of behaviors, which are then reused as candidate policies to facilitate the learning of new tasks.
Results: Experiments show that rewards generated by random neural networks suffice to extract diverse and useful behaviors, some even close to expert level. Across multiple benchmarks, UBER learns effective and diverse behavior sets with better sample efficiency than existing baselines, broadening the applicability of RL to real-world scenarios rich in reward-free data.

Reward-free data is abundant and contains rich prior knowledge of human behaviors, but it is not well exploited by offline reinforcement learning (RL) algorithms. In this paper, we propose UBER, an unsupervised approach to extract useful behaviors from offline reward-free datasets via diversified rewards. UBER assigns different pseudo-rewards sampled from a given prior distribution to different agents to extract a diverse set of behaviors, and reuse them as candidate policies to facilitate the learning of new tasks. Perhaps surprisingly, we show that rewards generated from random neural networks are sufficient to extract diverse and useful behaviors, some even close to expert ones. We provide both empirical and theoretical evidences to justify the use of random priors for the reward function. Experiments on multiple benchmarks showcase UBER's ability to learn effective and diverse behavior sets that enhance sample efficiency for online RL, outperforming existing baselines. By reducing reliance on human supervision, UBER broadens the applicability of RL to real-world scenarios with abundant reward-free data.
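The random intent prior can be realized as a frozen, randomly initialized network per agent, used to relabel the reward-free dataset. A minimal sketch under stated assumptions (the MLP architecture and names are illustrative):

```python
import torch
import torch.nn as nn

def make_random_reward_fns(obs_dim, act_dim, n_agents, seed=0):
    """Sketch of UBER-style random intent priors: each agent gets a
    pseudo-reward from a frozen, randomly initialized network over
    (state, action). Training one offline-RL agent per reward then yields
    a diverse behavior library (architecture is an assumption)."""
    torch.manual_seed(seed)
    fns = []
    for _ in range(n_agents):
        net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64),
                            nn.Tanh(), nn.Linear(64, 1))
        for p in net.parameters():          # freeze: the prior is never trained
            p.requires_grad_(False)
        fns.append(lambda s, a, net=net: net(torch.cat([s, a], dim=-1)))
    return fns

# Toy usage: relabel an offline batch with 3 different pseudo-rewards.
rewards = make_random_reward_fns(obs_dim=4, act_dim=2, n_agents=3)
s, a = torch.randn(10, 4), torch.randn(10, 2)
print([float(r(s, a).mean()) for r in rewards])
```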

Mutual Information Regularized Offline Reinforcement Learning
Xiao Ma Bingyi Kang Zhongwen Xu Min Lin Shuicheng YAN



Research question: In offline reinforcement learning, querying out-of-distribution actions causes distribution shift, which biases the policy improvement direction through extrapolation errors.
Motivation: Most existing methods address this by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation.
Method: This paper proposes the MISA framework, which approaches offline RL from the perspective of mutual information between states and actions in the dataset, directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values.
Results: Optimizing this lower bound is shown to be equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset, so the policy improvement direction is constrained to lie in the data manifold; the resulting algorithm augments both policy evaluation and improvement with mutual information regularization. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC); three variants are introduced, and experiments show that tighter mutual information lower bounds yield better offline RL performance. Extensive experiments further show MISA significantly outperforms a wide range of baselines on various D4RL tasks, e.g., achieving 742.9 total points on gym-locomotion tasks.

The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding mutual information regularizations. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is attached and will be released upon publication.

An Efficient End-to-End Training Approach for Zero-Shot Human-AI Coordination
Xue Yan Jiaxian Guo Xingzhou Lou Jun Wang Haifeng Zhang Yali Du



Research question: Developing an agent that can collaborate with humans without relying on human data.
Motivation: Prevailing two-stage population-based methods require a population of mutually distinct policies to simulate diverse human behaviors, and this requirement severely limits their computational efficiency.
Method: Propose E3T, an end-to-end training approach for zero-shot human-AI coordination. E3T mixes the ego policy with a random policy to construct the partner policy, making it both coordination-skilled and diverse. The ego agent can thus be trained end-to-end against this mixture without a pre-trained population, significantly improving training efficiency. In addition, a partner modeling module predicts the partner's action from historical information; with the predicted partner action, the ego policy can adapt its strategy and act accordingly when collaborating with humans of different behavior patterns.
Results: Empirical results on the Overcooked environment show the method significantly improves training efficiency while preserving comparable or superior performance relative to population-based baselines.

The goal of zero-shot human-AI coordination is to develop an agent that can collaborate with humans without relying on human data. Prevailing two-stage population-based methods require a diverse population of mutually distinct policies to simulate diverse human behaviors. The necessity of such populations severely limits their computational efficiency. To address this issue, we propose E3T, an **E**fficient **E**nd-to-**E**nd **T**raining approach for zero-shot human-AI coordination. E3T employs a mixture of ego policy and random policy to construct the partner policy, making it both coordination-skilled and diverse. In this way, the ego agent is end-to-end trained with this mixture policy without the need of a pre-trained population, thus significantly improving the training efficiency. In addition, a partner modeling module is proposed to predict the partner's action from historical information. With the predicted partner's action, the ego policy is able to adapt its policy and take actions accordingly when collaborating with humans of different behavior patterns. Empirical results on the Overcooked environment show that our method significantly improves the training efficiency while preserving comparable or superior performance than the population-based baselines. Demo videos are available at https://sites.google.com/view/e3t-overcooked.
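The partner construction is a simple policy mixture. A minimal sketch follows; the mixing probability `epsilon` and the greedy ego action are illustrative assumptions.

```python
import numpy as np

def e3t_partner_action(obs, ego_policy, n_actions, epsilon, rng):
    """Sketch of E3T's partner construction (epsilon is an assumption): with
    probability epsilon act uniformly at random (diversity), otherwise follow
    the ego policy (coordination skill). Training the ego agent against this
    mixture removes the need for a pre-trained partner population."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # random component: diversity
    return int(np.argmax(ego_policy(obs)))    # self-play component: skill

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
ego_policy = lambda obs: obs @ W              # placeholder ego policy logits
print(e3t_partner_action(np.ones(6), ego_policy, n_actions=4,
                         epsilon=0.3, rng=rng))
```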

Diversify & Conquer: Outcome-directed Curriculum RL via Out-of-Distribution Disagreement
Daesol Cho Seungjae Lee H. Jin Kim



Research question: In reinforcement learning, how an agent should handle uninformed search, exploring without domain knowledge such as environment characteristics or external rewards.
Motivation: To tackle these challenges, this paper proposes a new curriculum reinforcement learning method called D2C (Diversify for Disagreement & Conquer).
Method: D2C diversifies goal-conditional classifiers to identify similarities between visited states and desired outcome states, and ensures the classifiers disagree on out-of-distribution states; this quantifies the unexplored region and yields an arbitrary goal-conditioned intrinsic reward signal in a simple and intuitive way. D2C then employs bipartite matching to define a curriculum learning objective that produces a sequence of well-adjusted intermediate goals, enabling the agent to automatically explore and conquer the unexplored region.
Results: Experiments show that D2C outperforms prior curriculum RL methods both quantitatively and qualitatively, even when the desired outcome examples are arbitrarily distributed.

Reinforcement learning (RL) often faces the challenges of uninformed search problems where the agent should explore without access to the domain knowledge such as characteristics of the environment or external rewards. To tackle these challenges, this work proposes a new approach for curriculum RL called $\textbf{D}$iversify for $\textbf{D}$isagreement \& $\textbf{C}$onquer ($\textbf{D2C}$). Unlike previous curriculum learning methods, D2C requires only a few examples of desired outcomes and works in any environment, regardless of its geometry or the distribution of the desired outcome examples. The proposed method performs diversification of the goal-conditional classifiers to identify similarities between visited and desired outcome states and ensures that the classifiers disagree on states from out-of-distribution, which enables quantifying the unexplored region and designing an arbitrary goal-conditioned intrinsic reward signal in a simple and intuitive way. The proposed method then employs bipartite matching to define a curriculum learning objective that produces a sequence of well-adjusted intermediate goals, which enable the agent to automatically explore and conquer the unexplored region. We present experimental results demonstrating that D2C outperforms prior curriculum RL methods in both quantitative and qualitative aspects, even with the arbitrarily distributed desired outcome examples.
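The "disagreement quantifies the unexplored region" idea can be illustrated with an ensemble whose prediction variance serves as the bonus. A minimal sketch under stated assumptions (the variance-as-bonus form and the toy classifiers are illustrative):

```python
import numpy as np

def disagreement_bonus(state_features, classifiers):
    """Sketch of a D2C-style intrinsic signal (bonus form is an assumption):
    diversified goal-conditional classifiers agree on visited/desired states
    but are trained to disagree out of distribution, so their prediction
    variance marks the unexplored region."""
    preds = np.array([clf(state_features) for clf in classifiers])
    return float(preds.var())          # high variance -> unexplored -> explore

# Toy usage: an ensemble of random linear probability outputs.
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
classifiers = [
    (lambda x, w=rng.normal(size=4): sigmoid(float(w @ x)))
    for _ in range(5)
]
print(disagreement_bonus(np.ones(4), classifiers))      # OOD-ish state
print(disagreement_bonus(np.zeros(4), classifiers))     # all agree at 0.5
```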

Compositional Foundation Models for Hierarchical Planning
Anurag Ajay Seungwook Han Yilun Du Shuang Li Abhi Gupta Tommi S. Jaakkola Joshua B. Tenenbaum Leslie Pack Kaelbling Akash Srivastava Pulkit Agrawal



Research question: How to make effective decisions in novel environments through hierarchical reasoning across spatial and temporal scales.
Motivation: Solving long-horizon goals requires planning abstract subgoal sequences, reasoning visually about the underlying plans, and executing actions through visual-motor control.
Method: Propose Compositional Foundation Models for Hierarchical Planning (HiP), which jointly leverages multiple expert foundation models trained separately on language, vision, and action data to solve the problem.
Results: A large language model constructs plans grounded in the environment via a video model, and an inverse dynamics model translates the generated videos into actions, enabling effective hierarchical reasoning; the method's efficacy and adaptability are demonstrated on three different long-horizon table-top manipulation tasks.

To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model which leverages multiple expert foundation models, trained individually on language, vision and action data, that work jointly to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded in visual-motor control through an inverse dynamics model that infers actions from generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach in three different long-horizon table-top manipulation tasks.

Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition
Suyoung Lee Myungsik Cho Youngchul Sung



Research question: How to make meta-reinforcement learning (meta-RL) methods generalize effectively to tasks beyond parametric variations.
Motivation: Existing meta-RL methods often struggle to generalize well on non-parametric task variations.
Method: Propose Subtask Decomposition and Virtual Training (SDVT), a new meta-RL approach that decomposes each non-parametric task into a collection of elementary subtasks and parameterizes the task based on its decomposition. A Gaussian mixture VAE meta-learns the decomposition process, enabling the agent to reuse policies acquired from common subtasks. In addition, a virtual training procedure designed specifically for non-parametric task variability generates hypothetical subtask compositions, enhancing generalization to previously unseen compositions.
Results: The method significantly improves performance on the Meta-World ML-10 and ML-45 benchmarks, surpassing current state-of-the-art techniques.

Meta-reinforcement learning (meta-RL) techniques have demonstrated remarkable success in generalizing deep reinforcement learning across a range of tasks. Nevertheless, these methods often struggle to generalize beyond tasks with parametric variations. To overcome this challenge, we propose Subtask Decomposition and Virtual Training (SDVT), a novel meta-RL approach that decomposes each non-parametric task into a collection of elementary subtasks and parameterizes the task based on its decomposition. We employ a Gaussian mixture VAE to meta-learn the decomposition process, enabling the agent to reuse policies acquired from common subtasks. Additionally, we propose a virtual training procedure, specifically designed for non-parametric task variability, which generates hypothetical subtask compositions, thereby enhancing generalization to previously unseen subtask compositions. Our method significantly improves performance on the Meta-World ML-10 and ML-45 benchmarks, surpassing current state-of-the-art techniques.

Recovering from Out-of-sample States via Inverse Dynamics in Offline Reinforcement Learning
Ke Jiang Jia-Yu Yao Xiaoyang Tan



Research question: This paper addresses the state distributional shift commonly encountered when testing offline reinforcement learning agents, which tend to take unreliable actions at unseen states.
Motivation: The proposed remedy encourages the agent to follow a state recovery principle: when acting, besides the long-term return, the immediate consequences of the current action should also be considered, preferring actions that recover the state distribution of the behavior policy.
Method: To this end, an inverse dynamics model is learned and used to guide the state-recovery behavior of the new policy. Theoretically, the method is shown to align the transited state distribution of the new policy with the offline dataset at out-of-sample states, without explicitly predicting the transited state distribution, which is usually difficult in high-dimensional and complicated environments.
Results: The effectiveness and feasibility of the method are demonstrated with state-of-the-art performance on general offline RL benchmarks.

In this paper we deal with the state distributional shift problem commonly encountered in offline reinforcement learning during test, where the agent tends to take unreliable actions at out-of-sample (unseen) states. Our idea is to encourage the agent to follow the so-called state recovery principle when taking actions, i.e., besides long-term return, the immediate consequences of the current action should also be taken into account, and those capable of recovering the state distribution of the behavior policy are preferred. For this purpose, an inverse dynamics model is learned and employed to guide the state recovery behavior of the new policy. Theoretically, we show that the proposed method helps align the transited state distribution of the new policy with the offline dataset at out-of-sample states, without the need of explicitly predicting the transited state distribution, which is usually difficult in high-dimensional and complicated environments. The effectiveness and feasibility of the proposed method are demonstrated with state-of-the-art performance on the general offline RL benchmarks.

Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization
Xiangsen Wang Haoran Xu Yinan Zheng Xianyuan Zhan



Research question: How to effectively learn multi-agent reinforcement learning policies from offline datasets, while handling the large joint state-action space and the complexity of coupled multi-agent behaviors.
Motivation: Despite some success in the single-agent setting, offline multi-agent RL remains challenging; most existing methods apply offline data-related regularization only at the level of individual agents, without fully considering the multi-agent system at the global level.
Method: Propose OMIGA, a new offline multi-agent RL algorithm with implicit global-to-local value regularization. OMIGA converts global-level value regularization into equivalent implicit local value regularizations while simultaneously enabling in-sample learning, elegantly bridging multi-agent value decomposition and policy learning with offline regularization.
Results: Comprehensive experiments on offline multi-agent MuJoCo and StarCraft II micro-management tasks show that OMIGA outperforms state-of-the-art offline multi-agent RL methods in almost all tasks.

Offline reinforcement learning (RL) has received considerable attention in recent years due to its attractive capability of learning policies from offline datasets without environmental interactions. Despite some success in the single-agent setting, offline multi-agent RL (MARL) remains a challenge. The large joint state-action space and the coupled multi-agent behaviors pose extra complexities for offline policy optimization. Most existing offline MARL studies simply apply offline data-related regularizations on individual agents, without fully considering the multi-agent system at the global level. In this work, we present OMIGA, a new offline multi-agent RL algorithm with implicit global-to-local value regularization. OMIGA provides a principled framework to convert global-level value regularization into equivalent implicit local value regularizations and simultaneously enables in-sample learning, thus elegantly bridging multi-agent value decomposition and policy learning with offline regularizations. Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks.

Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration
Dongyoung Kim Jinwoo Shin Pieter Abbeel Younggyo Seo



Research question: In supervised setups with a task reward, how to balance exploration and exploitation while avoiding exploration that is biased towards low-value state regions.
Motivation: Existing methods that maximize the entropy of the visited state distribution struggle in setups with a task reward, because they tend to bias exploration towards low-value state regions.
Method: Propose a novel exploration technique that maximizes the value-conditional state entropy: state entropies conditioned on the value estimate of each state are estimated separately, and their average is maximized. By computing the intrinsic bonus only over visited states with similar value estimates, the distribution of low-value states is prevented from affecting exploration around high-value states.
Results: Experiments show that the method significantly accelerates various reinforcement learning algorithms across tasks in the MiniGrid, DeepMind Control Suite, and Meta-World benchmarks.

A promising technique for exploration is to maximize the entropy of visited state distribution, i.e., state entropy, by encouraging uniform coverage of visited state space. While it has been effective for an unsupervised setup, it tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states to exploit the task reward. Such a preference can cause an imbalance between the distributions of high-value states and low-value states, which biases exploration towards low-value state regions as a result of the state entropy increasing when the distribution becomes more uniform. This issue is exacerbated when high-value states are narrowly distributed within the state space, making it difficult for the agent to complete the tasks. In this paper, we present a novel exploration technique that maximizes the value-conditional state entropy, which separately estimates the state entropies that are conditioned on the value estimates of each state, then maximizes their average. By only considering the visited states with similar value estimates for computing the intrinsic bonus, our method prevents the distribution of low-value states from affecting exploration around high-value states, and vice versa. We demonstrate that the proposed alternative to the state entropy baseline significantly accelerates various reinforcement learning algorithms across a variety of tasks within MiniGrid, DeepMind Control Suite, and Meta-World benchmarks. Source code is available at https://sites.google.com/view/rl-vcse.
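The conditioning step can be approximated by grouping visited states by value and computing a k-NN state-entropy bonus inside each group. A minimal sketch follows; the hard value-binning is a simplification of the paper's conditioning, and the k-NN log-distance estimator is the standard state-entropy proxy.

```python
import numpy as np

def value_conditional_entropy_bonus(states, values, k=3, n_bins=4):
    """Sketch of value-conditional state entropy (binning is a simplification;
    the paper conditions on value estimates rather than hard bins): compute a
    k-NN state-entropy bonus only among visited states with similar values,
    so low-value regions cannot dominate exploration around high-value ones."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)
    bonus = np.zeros(len(states))
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        if len(idx) <= k:
            continue
        group = states[idx]
        # k-NN entropy estimate ~ log distance to the k-th nearest neighbor.
        dists = np.linalg.norm(group[:, None, :] - group[None, :, :], axis=-1)
        dists.sort(axis=1)
        bonus[idx] = np.log(dists[:, k] + 1e-8)
    return bonus

rng = np.random.default_rng(0)
states, values = rng.normal(size=(64, 2)), rng.normal(size=64)
print(value_conditional_entropy_bonus(states, values)[:5])
```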

Breadcrumbs to the Goal: Supervised Goal Selection from Human-in-the-Loop Feedback
Marcel Torne Villasevil Max Balsells I Pamies Zihan Wang Samedh Desai Tao Chen Pulkit Agrawal Abhishek Gupta



Research question: How to handle exploration and reward specification in reinforcement learning effectively, particularly when human guidance is needed.
Motivation: Existing RL methods for sequential decision-making tasks with a non-trivial element of exploration require either carefully designed reward functions or indiscriminate novelty-seeking exploration bonuses. Human supervisors can provide effective in-the-loop guidance to direct exploration, but prior methods require constant, synchronous, high-quality feedback, which is expensive and impractical to obtain.
Method: We propose Human Guided Exploration (HUGE), which leverages low-quality feedback from non-expert users (infrequent, asynchronous, and noisy) to guide exploration in RL, without requiring careful reward specification. The key idea is to separate the challenges of directed exploration and policy learning: human feedback is used to direct exploration, while self-supervised policy learning independently learns unbiased behaviors from the collected data.
Results: The procedure learns tasks with no hand-crafted reward design or exploration bonuses; in simulation, HUGE learns a variety of challenging multi-stage robotic navigation and manipulation tasks using crowdsourced feedback from non-expert users, and the paradigm extends directly to real-world robots.

Exploration and reward specification are fundamental and intertwined challenges for reinforcement learning. Solving sequential decision making tasks with a non-trivial element of exploration requires either specifying carefully designed reward functions or relying on indiscriminate, novelty seeking exploration bonuses. Human supervisors can provide effective guidance in the loop to direct the exploration process, but prior methods to leverage this guidance require constant synchronous high-quality human feedback, which is expensive and impractical to obtain. In this work, we propose a technique - Human Guided Exploration (HUGE), that is able to leverage low-quality feedback from non-expert users, which is infrequent, asynchronous and noisy, to guide exploration for reinforcement learning, without requiring careful reward specification. The key idea is to separate the challenges of directed exploration and policy learning - human feedback is used to direct exploration, while self-supervised policy learning is used to independently learn unbiased behaviors from the collected data. We show that this procedure can leverage noisy, asynchronous human feedback to learn tasks with no hand-crafted reward design or exploration bonuses. We show that HUGE is able to learn a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm can be scaled to learning directly on real-world robots.

Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets
Anirudhan Badrinath Yannis Flet-Berliac Allen Nie Emma Brunskill



Research question: Despite progress in offline reinforcement learning via supervised learning (RvS) and the success of the decision transformer (DT) architecture across domains, DTs underperform on several challenging benchmarks.
Motivation: The root cause of this underperformance is DTs' inability to seamlessly stitch together segments of suboptimal trajectories.
Method: We propose a novel approach that enhances RvS methods by integrating intermediate targets: the Waypoint Transformer (WT), an architecture that builds upon the DT framework and is conditioned on automatically generated waypoints.
Results: Experiments show a significant increase in final return over existing RvS methods, with performance on par with or exceeding state-of-the-art temporal-difference learning-based methods. Moreover, the improvements in performance and stability are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.

Despite the recent advancements in offline reinforcement learning via supervised learning (RvS) and the success of the decision transformer (DT) architecture in various domains, DTs have fallen short in several challenging benchmarks. The root cause of this underperformance lies in their inability to seamlessly connect segments of suboptimal trajectories. To overcome this limitation, we present a novel approach to enhance RvS methods by integrating intermediate targets. We introduce the Waypoint Transformer (WT), an architecture that builds upon the DT framework and is conditioned on automatically generated waypoints. The results show a significant increase in the final return compared to existing RvS methods, with performance on par with or greater than existing state-of-the-art temporal difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.

Mixed-Initiative Multiagent Apprenticeship Learning for Human Training of Robot Teams
Esmaeil Seraj Jerry Yuyang Xiong Mariah L Schrum Matthew Gombolay



Research question: Extending recent advances in Learning from Demonstration (LfD) frameworks to multi-robot settings faces key challenges, such as environment non-stationarity due to partial observability, which undermines the applicability of existing methods.
Motivation: Although prior work has shown that enabling communication among the agents of a robot team can alleviate these issues, creating inter-agent communication under existing multi-agent LfD (MA-LfD) frameworks requires the human expert to provide demonstrations for both environment actions and communication actions, which presupposes an effective communication strategy over a known message space.
Method: We propose Mixed-Initiative Multi-Agent Apprenticeship Learning (MixTURE). MixTURE enables robot teams to learn, from human expert-generated data, a preferred policy for accomplishing a collaborative task, while simultaneously learning emergent inter-agent communication to enhance team coordination. The key ingredient of MixTURE's success is automatically learning a communication policy, enhanced by a mutual-information-maximizing reverse model that rationalizes the underlying expert demonstrations without requiring human-generated communication data or an auxiliary reward function.
Results: MixTURE outperforms a variety of relevant baselines on diverse human-expert data in complex heterogeneous domains. It is the first MA-LfD framework to learn multi-robot collaborative policies directly from real human data, reducing human workload by about 44% and increasing usability scores by about 46%.

Extending recent advances in Learning from Demonstration (LfD) frameworks to multi-robot settings poses critical challenges such as environment non-stationarity due to partial observability which is detrimental to the applicability of existing methods. Although prior work has shown that enabling communication among agents of a robot team can alleviate such issues, creating inter-agent communication under existing Multi-Agent LfD (MA-LfD) frameworks requires the human expert to provide demonstrations for both environment actions and communication actions, which necessitates an efficient communication strategy on a known message space. To address this problem, we propose Mixed-Initiative Multi-Agent Apprenticeship Learning (MixTURE). MixTURE enables robot teams to learn from human expert-generated data a preferred policy to accomplish a collaborative task, while simultaneously learning emergent inter-agent communication to enhance team coordination. The key ingredient to MixTURE's success is automatically learning a communication policy, enhanced by a mutual-information maximizing reverse model that rationalizes the underlying expert demonstrations without the need for human-generated data or an auxiliary reward function. MixTURE outperforms a variety of relevant baselines on diverse data generated by human experts in complex heterogeneous domains. MixTURE is the first MA-LfD framework to enable learning multi-robot collaborative policies directly from real human data, resulting in ~44% less human workload, and ~46% higher usability score.

Gradient Informed Proximal Policy Optimization
Sanghyun Son Laura Yu Zheng Ryan Sullivan Yi-Ling Qiao Ming Lin



Research question: How to combine analytical gradients from differentiable environments with the PPO algorithm to improve reinforcement learning.
Motivation: Many reinforcement learning algorithms require extensive trial and error on complex tasks; analytical gradients can guide the policy optimization process more effectively.
Method: We propose a novel policy learning method that integrates analytical gradients from differentiable environments with PPO, introducing an α-policy that stands as a locally superior policy and adaptively adjusting the α value to manage the influence of the analytical gradients. We also propose metrics for assessing the variance and bias of analytical gradients, reducing the dependence on them when high variance or bias is detected.
Results: The method outperforms baseline algorithms in scenarios including function optimization, physics simulation, and traffic control.

We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm. To incorporate analytical gradients into the PPO framework, we introduce the concept of an α-policy that stands as a locally superior policy. By adaptively modifying the α value, we can effectively manage the influence of analytical policy gradients during learning. To this end, we suggest metrics for assessing the variance and bias of analytical gradients, reducing dependence on these gradients when high variance or bias is detected. Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments. Our code can be found online: https://github.com/SonSang/gippo.
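
A minimal sketch of the blending step, assuming a PyTorch setup: the PPO surrogate loss and a pathwise objective from the differentiable simulator are interpolated by α, and α shrinks when the analytic gradient looks unreliable. The variance threshold and shrink schedule below are crude stand-ins for the paper's variance/bias metrics, not its exact criteria.

```python
import torch

def blended_policy_loss(ppo_surrogate_loss, analytic_return, alpha):
    """Interpolate the standard PPO clipped-surrogate loss with a pathwise
    objective obtained by backpropagating through the simulator."""
    return (1.0 - alpha) * ppo_surrogate_loss + alpha * (-analytic_return)

def adapt_alpha(alpha, grad_samples, var_threshold=1.0, shrink=0.5):
    """Shrink alpha when the empirical variance of the analytic gradient is
    high (a crude stand-in for the paper's variance/bias metrics)."""
    g = torch.stack(grad_samples)          # [n_samples, param_dim]
    if g.var(dim=0).mean().item() > var_threshold:
        return alpha * shrink
    return alpha
```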

Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior
Adam Block Ali Jadbabaie Daniel Pfrommer Max Simchowitz Russ Tedrake



Research question: This paper proposes a theoretical framework for studying behavior cloning of complex expert demonstrations using generative models.
Motivation: Existing behavior cloning methods often fail to generate trajectories that match expert trajectories when the demonstrations are complex.
Method: The framework invokes low-level controllers to stabilize imitation around expert demonstrations and uses a powerful generative model as the imitation learner. Combined with data augmentation and a novel algorithmic trick, adding augmentation noise at execution time, it ensures that the generated imitator trajectories stay close to the demonstrator distribution.
Results: Experiments show that the algorithm generates trajectories matching the expert's and achieves good results under an optimal transport cost.

We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. Our framework invokes low-level controllers - either learned or implicit in position-command control - to stabilize imitation around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a powerful enough generative model as our imitation learner, pure supervised behavior cloning can generate trajectories matching the per-time step distribution of essentially arbitrary expert trajectories in an optimal transport cost. Our analysis relies on a stochastic continuity property of the learned policy we call "total variation continuity" (TVC). We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations, and discussing implications for future research directions for better behavior cloning with generative modeling.
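
The "augmentation noise at execution time" trick is simple enough to sketch directly: train the cloner on noise-augmented states, then inject the same noise when rolling out, so the policy always sees inputs from its training distribution. The squared-error loss, the Gaussian noise, and the scale below are illustrative assumptions.

```python
import torch

SIGMA = 0.1  # augmentation noise scale, shared by training and execution

def augmented_bc_loss(policy, states, expert_actions):
    # Train the cloner on noise-augmented states.
    noisy = states + SIGMA * torch.randn_like(states)
    return ((policy(noisy) - expert_actions) ** 2).mean()

def act(policy, state):
    # Inject the same noise at execution time so train and test inputs match,
    # which is what keeps the rollout distribution close to the demonstrator's.
    noisy = state + SIGMA * torch.randn_like(state)
    with torch.no_grad():
        return policy(noisy)
```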

Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making
Ting Li Chengchun Shi Jianing Wang Fan Zhou Hongtu Zhu



Research question: This paper studies how to maximize the information obtained from online experiments so as to estimate treatment effects accurately.
Motivation: In modern technology companies, A/B testing is critical for evaluating the effectiveness of newly developed products against standard baselines.
Method: We propose three optimal allocation strategies in a dynamic setting where treatments are assigned sequentially over time. These strategies minimize the variance of the treatment-effect estimator when the data follow a non-Markov decision process or a (time-varying) Markov decision process.
Results: Extensive experiments in various environments demonstrate the effectiveness of the proposed methods. Theoretically, we prove the optimality of the proposed treatment allocation designs and establish upper bounds on the mean squared errors of the resulting treatment-effect estimators.

A/B testing is critical for modern technological companies to evaluate the effectiveness of newly developed products against standard baselines. This paper studies optimal designs that aim to maximize the amount of information obtained from online experiments to estimate treatment effects accurately. We propose three optimal allocation strategies in a dynamic setting where treatments are sequentially assigned over time. These strategies are designed to minimize the variance of the treatment effect estimator when data follow a non Markov decision process or a (time-varying) Markov decision process. We further develop estimation procedures based on existing off-policy evaluation (OPE) methods and conduct extensive experiments in various environments to demonstrate the effectiveness of the proposed methodologies. In theory, we prove the optimality of the proposed treatment allocation design and establish upper bounds for the mean squared errors of the resulting treatment effect estimators.

Thinker: Learning to Plan and Act
Stephen Chung Ivan Anokhin David Krueger



Research question: How can reinforcement learning agents autonomously interact with and exploit a learned world model?
Motivation: Existing reinforcement learning approaches require hand-crafted planning algorithms and are difficult to interpret.
Method: We propose the Thinker algorithm, which wraps the environment with a world model and introduces new actions for interacting with that model. These model-interaction actions let the agent plan by proposing alternative plans to the world model before selecting the final action to execute in the environment.
Results: Thinker achieves state-of-the-art performance in the game of Sokoban and competitive results on the Atari 2600 benchmark. Visualizations show that agents trained with Thinker learn to plan effectively with the world model to select better actions. This is the first work showing that a reinforcement learning agent can learn to plan with a learned world model in complex environments.

We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.

Reinforcement Learning with Simple Sequence Priors
Tankred Saanum Noemi Elteto Peter Dayan Marcel Binz Eric Schulz



Research question: In reinforcement learning, simplicity is typically quantified on an action-by-action basis, a timescale that ignores temporal regularities, such as repetitions, often present in sequential strategies.
Motivation: We therefore propose a reinforcement learning algorithm that learns to solve tasks with compressible action sequences.
Method: We explore two possible sources of simple action sequences: sequences that can be learned by autoregressive models, and sequences that are compressible with off-the-shelf data compression algorithms. Distilling these preferences into sequence priors yields a novel information-theoretic objective that incentivizes agents to maximize reward while conforming to the priors.
Results: Experiments show the resulting algorithm learns faster and attains higher returns than state-of-the-art model-free approaches on a series of continuous control tasks from the DeepMind Control Suite. The priors also produce a powerful information-regularized agent that is robust to noisy observations and can perform open-loop control.

In reinforcement learning (RL), simplicity is typically quantified on an action-by-action basis -- but this timescale ignores temporal regularities, like repetitions, often present in sequential strategies. We therefore propose an RL algorithm that learns to solve tasks with sequences of actions that are compressible. We explore two possible sources of simple action sequences: Sequences that can be learned by autoregressive models, and sequences that are compressible with off-the-shelf data compression algorithms. Distilling these preferences into sequence priors, we derive a novel information-theoretic objective that incentivizes agents to learn policies that maximize rewards while conforming to these priors. We show that the resulting RL algorithm leads to faster learning, and attains higher returns than state-of-the-art model-free approaches in a series of continuous control tasks from the DeepMind Control Suite. These priors also produce a powerful information-regularized agent that is robust to noisy observations and can perform open-loop control.
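
The information-theoretic objective can be written as return minus the extra code length of the executed action sequence under the prior. A minimal sketch, assuming per-step log-probabilities are available from both the policy and a learned autoregressive sequence prior; the trade-off coefficient lam is an illustrative assumption.

```python
import torch

def sequence_prior_objective(rewards, logprobs_policy, logprobs_prior,
                             lam=0.1):
    """rewards: [T]; logprobs_*: [T] log-probabilities of the executed
    actions under the policy and under the sequence prior."""
    # Actions that the prior finds improbable are hard to compress; the
    # penalty pushes the policy toward simple, compressible sequences.
    info_cost = (logprobs_policy - logprobs_prior).sum()
    return rewards.sum() - lam * info_cost
```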

Learning Multi-agent Behaviors from Distributed and Streaming Demonstrations
Shicheng Liu Minghui Zhu



Research question: This paper addresses inferring the behaviors of multiple interacting experts by estimating their reward functions and constraints.
Motivation: How to accurately infer multi-agent behaviors when distributed demonstration trajectories are revealed sequentially to a group of learners.
Method: The problem is formulated as a distributed online bi-level optimization problem, where the outer level estimates the reward functions and the inner level learns the constraints and corresponding policies. We propose a novel "multi-agent behavior inference from distributed and streaming demonstrations" (MA-BIRDS) algorithm that lets the learners solve the outer- and inner-level problems in a single loop through intermittent communication.
Results: We formally guarantee that the distributed learners reach consensus on reward functions, constraints, and policies; the average local regret (over $N$ online iterations) decreases at the rate $O(1/N^{1-\eta_1}+1/N^{1-\eta_2}+1/N)$; and the cumulative constraint violation grows sub-linearly at the rate $O(N^{\eta_2}+1)$, where $\eta_1,\eta_2\in (1/2,1)$.

This paper considers the problem of inferring the behaviors of multiple interacting experts by estimating their reward functions and constraints where the distributed demonstrated trajectories are sequentially revealed to a group of learners. We formulate the problem as a distributed online bi-level optimization problem where the outer-level problem is to estimate the reward functions and the inner-level problem is to learn the constraints and corresponding policies. We propose a novel "multi-agent behavior inference from distributed and streaming demonstrations" (MA-BIRDS) algorithm that allows the learners to solve the outer-level and inner-level problems in a single loop through intermittent communications. We formally guarantee that the distributed learners achieve consensus on reward functions, constraints, and policies, the average local regret (over $N$ online iterations) decreases at the rate of $O(1/N^{1-\eta_1}+1/N^{1-\eta_2}+1/N)$, and the cumulative constraint violation increases sub-linearly at the rate of $O(N^{\eta_2}+1)$ where $\eta_1,\eta_2\in (1/2,1)$.

Mutual-Information Regularized Multi-Agent Policy Iteration
Jiangxing Wang Deheng Ye Zongqing Lu



Research question: Most cooperative multi-agent reinforcement learning algorithms focus on a single team composition, which prevents their use in more realistic scenarios with dynamic team compositions.
Motivation: To address this, we propose using mutual information as an augmented reward, preventing individual policies from relying too heavily on team-related information and encouraging agents to learn policies that are robust across different team compositions.
Method: We first propose a multi-agent policy iteration algorithm with a fixed marginal distribution and prove its convergence and optimality. As the practical implementation, we then employ the Blahut–Arimoto algorithm and an imaginary team composition distribution for optimization with an approximate marginal distribution.
Results: Empirically, our method demonstrates strong zero-shot generalization to dynamic team compositions in complex cooperative tasks.

Despite the success of cooperative multi-agent reinforcement learning algorithms, most of them focus on a single team composition, which prevents them from being used in more realistic scenarios where dynamic team composition is possible. While some studies attempt to solve this problem via multi-task learning in a fixed set of team compositions, there is still a risk of overfitting to the training set, which may lead to catastrophic performance when facing dramatically varying team compositions during execution. To address this problem, we propose to use mutual information (MI) as an augmented reward to prevent individual policies from relying too much on team-related information and encourage agents to learn policies that are robust in different team compositions. Optimizing this MI-augmented objective in an off-policy manner can be intractable due to the existence of dynamic marginal distribution. To alleviate this problem, we first propose a multi-agent policy iteration algorithm with a fixed marginal distribution and prove its convergence and optimality. Then, we propose to employ the Blahut–Arimoto algorithm and an imaginary team composition distribution for optimization with approximate marginal distribution as the practical implementation. Empirically, our method demonstrates strong zero-shot generalization to dynamic team compositions in complex cooperative tasks.

Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets
Zhang-Wei Hong Aviral Kumar Sathwik Karnik Abhishek Bhandwaldar Akash Srivastava Joni Pajarinen Romain Laroche Abhishek Gupta Pulkit Agrawal



Research question: This paper addresses the distributional mismatch in offline reinforcement learning between the state-action distribution of the learned policy and that of the dataset.
Motivation: Offline reinforcement learning enables decision making without environment interaction, but the distributional mismatch can severely hurt performance. Existing solutions constrain the policy to align with the dataset's state-action pairs, which works poorly on datasets dominated by trajectories collected by low-performing policies with only a few from high-performing ones.
Method: We optimize importance-sampling weights so that data sampling resembles drawing from a distribution generated by a nearly optimal policy, constraining the policy to imitate only the good parts of the dataset rather than all of it.
Results: On 72 imbalanced datasets of varying types, the method improves over the best existing offline RL algorithms by up to five times.

Offline reinforcement learning (RL) enables learning a decision-making policy without interaction with the environment. This makes it particularly beneficial in situations where such interactions are costly. However, a known challenge for offline RL algorithms is the distributional mismatch between the state-action distributions of the learned policy and the dataset, which can significantly impact performance. State-of-the-art algorithms address it by constraining the policy to align with the state-action pairs in the dataset. However, this strategy struggles on datasets that predominantly consist of trajectories collected by low-performing policies and only a few trajectories from high-performing ones. Indeed, the constraint to align with the data leads the policy to imitate low-performing behaviors predominating the dataset. Our key insight to address this issue is to constrain the policy to the policy that collected the good parts of the dataset rather than all data. To this end, we optimize the importance sampling weights to emulate sampling data from a data distribution generated by a nearly optimal policy. Our method exhibits considerable performance gains (up to five times better) over the existing approaches in state-of-the-art offline RL algorithms over 72 imbalanced datasets with varying types of imbalance.
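
As a toy illustration of non-uniform sampling on imbalanced data, the sketch below weights trajectories by a softmax over their returns, so high-return behavior is sampled more often. Note the paper instead *optimizes* the importance weights; the softmax and its temperature are simplifying assumptions, not the proposed method.

```python
import numpy as np

def trajectory_sampling_weights(returns, temperature=10.0):
    """Return a sampling distribution over trajectories that favors
    high-return behavior (a stand-in for optimized importance weights)."""
    z = (np.asarray(returns, dtype=float) - np.max(returns)) / temperature
    w = np.exp(z)
    return w / w.sum()

returns = [10.0, 12.0, 95.0]   # mostly low-performing trajectories
probs = trajectory_sampling_weights(returns)
idx = np.random.choice(len(returns), size=32, p=probs)  # batch indices
```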

Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
Minshuo Chen Yu Bai H. Vincent Poor Mengdi Wang



Research question: This paper investigates how impaired observability, caused by delayed or lossy channels, affects decision making in real-world reinforcement learning systems.
Motivation: In practical control systems, network latency or channel losses prevent the agent from observing the most recent system state, yet the agent must still make real-time decisions.
Method: The paper introduces a theoretical framework for efficient reinforcement learning with delayed and missing state observations.
Results: The results show that although impaired observability poses significant challenges to policy design and planning, learning remains efficient, with regret bounds depending optimally on the state-action size of the original system. The paper also characterizes the performance of the optimal policy under impaired observability, comparing it to the optimal value under full observability.

In real-world reinforcement learning (RL) systems, various forms of {\it impaired observability} can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.

Multi-Step Generalized Policy Improvement by Leveraging Approximate Models
Lucas Nunes Alegre Ana L. C. Bazzan Ann Nowe Bruno Castro da Silva



Research question: This paper introduces a principled method for zero-shot transfer in reinforcement learning by exploiting approximate models of the environment.
Motivation: Although methods based on generalized policy improvement (GPI) and successor features (SFs) are computationally efficient, they are model-free: they analyze a library of policies, each solving a particular task, and identify which action the agent should take. We investigate the more general setting where, in addition to a policy library, the agent has access to an approximate environment model.
Method: We introduce $h$-GPI, a multi-step extension of GPI that interpolates between standard model-free GPI and fully model-based planning as a function of a parameter $h$ regulating the amount of time the agent has to reason.
Results: We prove that $h$-GPI's performance lower bound is strictly better than GPI's, that $h$-GPI generally outperforms GPI as $h$ increases, and that its performance becomes arbitrarily less susceptible to sub-optimality in the agent's policy library. We also introduce novel bounds characterizing the gains achievable by $h$-GPI as a function of the approximation errors in both the policy library and the (possibly learned) model; these bounds strictly generalize those known in the literature. Evaluated on challenging tabular and continuous-state problems, $h$-GPI consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation error.

We introduce a principled method for performing zero-shot transfer in reinforcement learning (RL) by exploiting approximate models of the environment. Zero-shot transfer in RL has been investigated by leveraging methods rooted in generalized policy improvement (GPI) and successor features (SFs). Although computationally efficient, these methods are model-free: they analyze a library of policies---each solving a particular task---and identify which action the agent should take. We investigate the more general setting where, in addition to a library of policies, the agent has access to an approximate environment model. Even though model-based RL algorithms can identify near-optimal policies, they are typically computationally intensive. We introduce $h$-GPI, a multi-step extension of GPI that interpolates between these extremes---standard model-free GPI and fully model-based planning---as a function of a parameter, $h$, regulating the amount of time the agent has to reason. We prove that $h$-GPI's performance lower bound is strictly better than GPI's, and show that $h$-GPI generally outperforms GPI as $h$ increases. Furthermore, we prove that as $h$ increases, $h$-GPI's performance becomes arbitrarily less susceptible to sub-optimality in the agent's policy library. Finally, we introduce novel bounds characterizing the gains achievable by $h$-GPI as a function of approximation errors in both the agent's policy library and its (possibly learned) model. These bounds strictly generalize those known in the literature. We evaluate $h$-GPI on challenging tabular and continuous-state problems under value function approximation and show that it consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation errors.
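
A hedged tabular sketch of the interpolation: roll the approximate model forward $h$ steps over all action sequences, then bootstrap with the best value across the policy library, which is exactly standard GPI at depth zero. It assumes a deterministic model interface, discrete actions, and $h \geq 1$; all interfaces are illustrative.

```python
def h_gpi_action(state, actions, model, value_fns, h, gamma=0.99):
    """model(s, a) -> (next_state, reward); value_fns: one V(s) per
    library policy. Returns the first action of the best h-step plan."""
    def best_value(s):
        return max(v(s) for v in value_fns)   # standard GPI at depth 0

    def lookahead(s, depth):
        if depth == 0:
            return best_value(s)
        best = float("-inf")
        for a in actions:
            s2, r = model(s, a)
            best = max(best, r + gamma * lookahead(s2, depth - 1))
        return best

    scores = {}
    for a in actions:
        s2, r = model(state, a)
        scores[a] = r + gamma * lookahead(s2, h - 1)
    return max(scores, key=scores.get)
```

Larger $h$ trades more compute (the search is exponential in $h$) for less reliance on the possibly sub-optimal policy library, which mirrors the interpolation the paper describes.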

Finite-Time Analysis of Single-Timescale Actor-Critic
Xuyang Chen Lin Zhao



Research question: The finite-time convergence of single-timescale actor-critic methods on continuous state spaces is not well understood.
Motivation: Existing analyses are limited to i.i.d. sampling or tabular settings; the finite-time convergence of the online single-timescale actor-critic algorithm on continuous state spaces had not been established.
Method: We propose a new framework that systematically evaluates and controls the error propagation between the actor and the critic, proving that the online single-timescale actor-critic method finds an ε-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which improves to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling.
Results: The new framework offers a promising approach for analyzing other single-timescale reinforcement learning algorithms.

Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in the most practical single-timescale form. Existing works on analyzing single-timescale actor-critic have been limited to i.i.d. sampling or tabular setting for simplicity. We investigate the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic assumes linear function approximation and updates with a single Markovian sample per actor step. Previous analysis has been unable to establish the convergence for such a challenging scenario. We demonstrate that the online single-timescale actor-critic method provably finds an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under the i.i.d. sampling. Our novel framework systematically evaluates and controls the error propagation between the actor and critic. It offers a promising approach for analyzing other single-timescale reinforcement learning algorithms as well.
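
The analyzed algorithm is simple to state: both the linear critic and the actor update from the same single Markovian transition each step, with comparable step sizes. A minimal sketch, where the feature map, score function, and step sizes are illustrative assumptions.

```python
import numpy as np

def actor_critic_step(w, theta, phi, grad_logpi, s, a, r, s_next,
                      alpha=1e-2, beta=1e-2, gamma=0.99):
    """One update from a single Markovian transition (s, a, r, s_next).
    w: linear critic weights; phi(s): feature vector;
    grad_logpi(theta, s, a): score function of the policy."""
    td_error = r + gamma * phi(s_next) @ w - phi(s) @ w   # TD(0) error
    w = w + alpha * td_error * phi(s)                     # critic update
    theta = theta + beta * td_error * grad_logpi(theta, s, a)  # actor update
    return w, theta   # both advance on the same (single) timescale
```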

AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation
Daiki E. Matsunaga Jongmin Lee Jaeseok Yoon Stefanos Leonardos Pieter Abbeel Kee-Eung Kim



Research question: In offline reinforcement learning, the distribution shift arising from the learned policy deviating from the data-collection policy is a major challenge.
Motivation: The problem is amplified in multi-agent reinforcement learning (MARL), where the joint action space grows exponentially with the number of agents.
Method: We propose AlberDICE, a novel offline MARL algorithm that alternately performs centralized training of individual agents based on stationary distribution optimization, computing the best response of one agent at a time while avoiding the selection of out-of-distribution (OOD) joint actions.
Results: Experiments show that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternatively performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.

STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning
Weipu Zhang Gang Wang Jian Sun Yetian Yuan Gao Huang



Research question: How to improve model-based reinforcement learning in complex unknown environments by introducing stochasticity.
Motivation: The real-environment performance of existing model-based reinforcement learning algorithms is limited by the accuracy of the world model, and building a perfectly accurate model of a complex unknown environment is nearly impossible.
Method: We propose the Stochastic Transformer-based wORld Model (STORM), an efficient world-model architecture that combines the strong sequence modeling and generation capabilities of Transformers with the stochasticity of variational autoencoders.
Results: STORM achieves a mean human performance of 126.7% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ lookahead search. Moreover, training an agent with 1.85 hours of real-time interaction experience takes only 4.3 hours on a single NVIDIA GeForce RTX 3090, showing improved efficiency over previous methods.

Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments. These approaches begin by constructing a parameterized simulation world model of the real environment through self-supervised learning. By leveraging the imagination of the world model, the agent's policy is enhanced without the constraints of sampling from the real environment. The performance of these algorithms heavily relies on the sequence modeling and generation capabilities of the world model. However, constructing a perfectly accurate model of a complex unknown environment is nearly impossible. Discrepancies between the model and reality may cause the agent to pursue virtual goals, resulting in subpar performance in the real environment. Introducing random noise into model-based reinforcement learning has been proven beneficial. In this work, we introduce Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines the strong sequence modeling and generation capabilities of Transformers with the stochastic nature of variational autoencoders. STORM achieves a mean human performance of $126.7\%$ on the Atari $100$k benchmark, setting a new record among state-of-the-art methods that do not employ lookahead search techniques. Moreover, training an agent with $1.85$ hours of real-time interaction experience on a single NVIDIA GeForce RTX 3090 graphics card requires only $4.3$ hours, showcasing improved efficiency compared to previous methodologies.

Conservative Offline Policy Adaptation in Multi-Agent Games
Chengjie Wu Pingzhong Tang Jun Yang Yujing Hu Tangjie Lv Changjie Fan Chongjie Zhang



Research question: This paper studies offline policy adaptation in multi-agent games, using a target agent's behavior data to exploit its weaknesses or enable effective cooperation.
Motivation: Existing research on policy adaptation in multi-agent games typically relies on online interaction with the target agent during training, which can be expensive and impractical in real-world scenarios. Inspired by recent progress in offline reinforcement learning, this paper studies offline policy adaptation.
Method: We propose a novel learning objective, conservative offline adaptation, which optimizes the worst-case performance against any dataset-consistent proxy model, and an efficient algorithm called Constrained Self-Play (CSP) that incorporates dataset information into regularized policy learning.
Results: Experiments show that CSP outperforms non-conservative baselines in various environments, including Maze, predator-prey, MuJoCo, and Google Football.

Prior research on policy adaptation in multi-agent games has often relied on online interaction with the target agent in training, which can be expensive and impractical in real-world scenarios. Inspired by recent progress in offline reinforcement learning, this paper studies offline policy adaptation, which aims to utilize the target agent's behavior data to exploit its weakness or enable effective cooperation. We investigate its distinct challenges of distributional shift and risk-free deviation, and propose a novel learning objective, conservative offline adaptation, that optimizes the worst-case performance against any dataset-consistent proxy model. We propose an efficient algorithm called Constrained Self-Play (CSP) that incorporates dataset information into regularized policy learning. We prove that CSP learns a near-optimal risk-free offline adaptation policy upon convergence. Empirical results demonstrate that CSP outperforms non-conservative baselines in various environments, including Maze, predator-prey, MuJoCo, and Google Football.

CQM: Curriculum Reinforcement Learning with a Quantized World Model
Seungjae Lee Daesol Cho Jonghae Park H. Jin Kim



Research question: Existing curriculum reinforcement learning methods struggle to generate curriculum goals in high-dimensional spaces and usually rely on manually specified goal spaces.
Motivation: To alleviate this limitation and improve the scalability of curricula, we propose a novel curriculum method that automatically defines a semantic goal space containing the vital information for the curriculum process, and suggests curriculum goals over it.
Method: Our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations with a graph. It then suggests uncertainty- and temporal-distance-aware curriculum goals that converge to the final goal over the automatically composed goal space.
Results: Experiments show the proposed method enables efficient exploration in uninformed environments with only raw goal examples, and it outperforms state-of-the-art curriculum reinforcement learning methods in both data efficiency and performance across various goal-reaching tasks, even with ego-centric visual inputs.

Recent curriculum Reinforcement Learning (RL) has shown notable progress in solving complex tasks by proposing sequences of surrogate tasks. However, the previous approaches often face challenges when they generate curriculum goals in a high-dimensional space. Thus, they usually rely on manually specified goal spaces. To alleviate this limitation and improve the scalability of the curriculum, we propose a novel curriculum method that automatically defines the semantic goal space which contains vital information for the curriculum process, and suggests curriculum goals over it. To define the semantic goal space, our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations by a graph. Concurrently, our method suggests uncertainty- and temporal-distance-aware curriculum goals that converge to the final goals over the automatically composed goal space. We demonstrate that the proposed method allows efficient explorations in an uninformed environment with raw goal examples only. Also, our method outperforms the state-of-the-art curriculum RL methods on data efficiency and performance, in various goal-reaching tasks even with ego-centric visual inputs.

Macro Placement by Wire-Mask-Guided Black-Box Optimization
Yunqi Shi Ke Xue Lei Song Chao Qian



Research question: The development of very-large-scale integration (VLSI) technology has posed new challenges for electronic design automation (EDA) techniques in chip floorplanning.
Motivation: Within floorplanning, macro placement is an important subproblem that aims to minimize half-perimeter wirelength (HPWL) while avoiding overlaps.
Method: This paper proposes a new black-box optimization (BBO) framework, called WireMask-BBO, for macro placement, using a wire-mask-guided greedy procedure for objective evaluation.
Results: Equipped with different BBO algorithms, WireMask-BBO significantly outperforms previous methods in practice, achieving markedly shorter HPWL in much less time. It can also fine-tune existing placements by treating them as initial solutions, bringing up to 50% improvement in HPWL. WireMask-BBO has the potential to substantially improve the quality and efficiency of chip floorplanning, making it appealing to EDA researchers and practitioners and promoting the application of BBO.

The development of very large-scale integration (VLSI) technology has posed new challenges for electronic design automation (EDA) techniques in chip floorplanning. During this process, macro placement is an important subproblem, which tries to determine the positions of all macros with the aim of minimizing half-perimeter wirelength (HPWL) and avoiding overlapping. Previous methods include packing-based, analytical and reinforcement learning methods. In this paper, we propose a new black-box optimization (BBO) framework (called WireMask-BBO) for macro placement, by using a wire-mask-guided greedy procedure for objective evaluation. Equipped with different BBO algorithms, WireMask-BBO empirically achieves significant improvements over previous methods, i.e., achieves significantly shorter HPWL by using much less time. Furthermore, it can fine-tune existing placements by treating them as initial solutions, which can bring up to 50% improvement in HPWL. WireMask-BBO has the potential to significantly improve the quality and efficiency of chip floorplanning, which makes it appealing to researchers and practitioners in EDA and will also promote the application of BBO. Our code is available at https://github.com/lamda-bbo/WireMask-BBO.

Reward Imputation with Sketching for Contextual Batched Bandits
Xiao Zhang Ninglu Shao Zihua Si Jun Xu Wenhan Wang Hanjing Su Ji-Rong Wen



Research question: This paper addresses how to make better use of the rewards of non-executed actions under partial-information feedback.
Motivation: Existing approaches for partial-information feedback often ignore the rewards of non-executed actions, underutilizing the feedback information.
Method: We propose Sketched Policy Updating with Imputed Rewards (SPUIR), which imputes the unobserved rewards using sketching so as to approximate full-information feedback.
Results: Experiments show that SPUIR outperforms state-of-the-art baselines on synthetic, public benchmark, and real-world datasets.

Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode, but the rewards of the non-executed actions are unobserved, resulting in partial-information feedback. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. In this paper, we propose an efficient approach called Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching, which approximates the full-information feedbacks. We formulate reward imputation as an imputation regularized ridge regression problem that captures the feedback mechanisms of both executed and non-executed actions. To reduce time complexity, we solve the regression problem using randomized sketching. We prove that our approach achieves an instantaneous regret with controllable bias and smaller variance than approaches without reward imputation. Furthermore, our approach enjoys a sublinear regret bound against the optimal policy. We also present two extensions, a rate-scheduled version and a version for nonlinear rewards, making our approach more practical. Experimental results show that SPUIR outperforms state-of-the-art baselines on synthetic, public benchmark, and real-world datasets.
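
The imputation step reduces to ridge regression compressed by a random sketch. A minimal sketch of that computation, assuming a Gaussian sketching matrix and a fixed regularizer; both choices, and the variable names, are illustrative rather than the paper's exact construction.

```python
import numpy as np

def impute_rewards(X_exec, r_exec, X_all, sketch_dim=64, lam=1.0,
                   rng=np.random.default_rng(0)):
    """X_exec: [n, d] contexts of executed actions with observed rewards
    r_exec: [n]; X_all: contexts of all actions to impute rewards for."""
    n, d = X_exec.shape
    S = rng.normal(size=(sketch_dim, n)) / np.sqrt(sketch_dim)  # sketch
    Xs, rs = S @ X_exec, S @ r_exec        # compress the regression rows
    w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ rs)
    return X_all @ w    # imputed rewards approximating full feedback
```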

Off-Policy Evaluation for Human Feedback
Qitong Gao Ge Gao Juncheng Dong Vahid Tarokh Min Chi Miroslav Pajic



Research question: How to accurately evaluate human feedback signals in reinforcement learning via off-policy evaluation.
Motivation: Existing off-policy evaluation methods fall short in estimating human feedback signals, which are sparse and conditioned on multiple underlying factors, making accurate off-policy estimates hard to extrapolate.
Method: We introduce an off-policy evaluation for human feedback (OPEHF) framework that develops an immediate human reward (IHR) reconstruction approach, regularized by environmental knowledge distilled into a latent space that captures the underlying dynamics of state transitions and the issuing of human feedback signals.
Results: Tested on two real-world experiments (adaptive in-vivo neurostimulation and intelligent tutoring) and a simulated environment (visual Q&A), the method significantly improves the accuracy of estimating human feedback signals compared with directly applying existing off-policy evaluation methods.

Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL), by estimating performance and/or rank of target (evaluation) policies using offline trajectories only. It can improve the safety and efficiency of data collection and policy testing procedures in situations where online deployments are expensive, such as healthcare. However, existing OPE methods fall short in estimating human feedback (HF) signals, as HF may be conditioned over multiple underlying factors and are only sparsely available; as opposed to the agent-defined environmental rewards (used in policy optimization), which are usually determined over parametric functions or distributions. Consequently, the nature of HF signals makes extrapolating accurate OPE estimations to be challenging. To resolve this, we introduce an OPE for HF (OPEHF) framework that revives existing OPE methods in order to accurately evaluate the HF signals. Specifically, we develop an immediate human reward (IHR) reconstruction approach, regularized by environmental knowledge distilled in a latent space that captures the underlying dynamics of state transitions as well as issuing HF signals. Our approach has been tested over *two real-world experiments*, adaptive *in-vivo* neurostimulation and intelligent tutoring, and a simulation environment (visual Q&A). Results show that our approach significantly improves the performance toward estimating HF signals accurately, compared to directly applying (variants of) existing OPE methods.

Goal-conditioned Offline Planning from Curious Exploration
Marco Bagatella Georg Martius



Research question: How to extract goal-conditioned behavior from the products of unsupervised exploration techniques without any additional environment interaction.
Motivation: In this difficult offline setting, conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short.
Method: By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. To mitigate their occurrence, we propose model-based planning over the learned value landscape combined with a graph-based value aggregation scheme.
Results: This combination corrects both local and global artifacts and yields significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning
Mitsuhiko Nakamoto Yuexiang Zhai Anikait Singh Max Sobol Mark Yi Ma Chelsea Finn Aviral Kumar Sergey Levine



Research question: Existing offline reinforcement learning methods tend to behave poorly during online fine-tuning.
Motivation: Design an effective method that learns an initial policy from offline data while enabling fast online fine-tuning.
Method: We propose calibrated Q-learning (Cal-QL), which learns a conservative value-function initialization while keeping it calibrated, i.e., the learned Q-values remain at a reasonable scale.
Results: Cal-QL outperforms existing methods on 9 of 11 online fine-tuning benchmark tasks.

A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets followed by fast online fine-tuning with limited interaction. However, existing offline RL methods tend to behave poorly during fine-tuning. In this paper, we devise an approach for learning an effective initialization from offline data that also enables fast online fine-tuning capabilities. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, in the sense that the learned Q-values are at a reasonable scale. We refer to this property as calibration, and define it formally as providing a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy, which may simply be the behavior policy. We show that offline RL algorithms that learn such calibrated value functions lead to effective online fine-tuning, enabling us to take the benefits of offline initializations in online fine-tuning. In practice, Cal-QL can be implemented on top of the conservative Q learning (CQL) for offline RL within a one-line code change. Empirically, Cal-QL outperforms state-of-the-art methods on 9/11 fine-tuning benchmark tasks that we study in this paper. Code and video are available at https://nakamotoo.github.io/Cal-QL
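
Since the abstract notes Cal-QL amounts to a one-line change on top of CQL, here is a hedged sketch of that change: clamp the conservative push-down term with a reference value so learned Q-values are never driven below the reference policy's value. The interfaces (and using a Monte-Carlo return as the reference value) are illustrative assumptions.

```python
import torch

def cal_ql_regularizer(q_pi, q_data, v_ref):
    """q_pi: Q(s, a ~ pi); q_data: Q(s, a_dataset); v_ref: reference value
    at the same states (e.g., a Monte-Carlo return of the behavior policy).
    CQL pushes q_pi down and q_data up; the torch.maximum clamp is the
    calibration: values are not pushed below the reference scale."""
    return (torch.maximum(q_pi, v_ref) - q_data).mean()
```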

Anytime-Competitive Reinforcement Learning with Policy Prior
Jianyi Yang Pengfei Li Tongxin Li Adam Wierman Shaolei Ren



Research question: This paper addresses the Anytime-Competitive Markov Decision Process (A-CMDP) problem.
Motivation: Constrained Markov Decision Processes (CMDPs) optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. A-CMDP instead aims to optimize the expected reward while guaranteeing a bounded cost in each round of any episode.
Method: We propose Anytime-Competitive Reinforcement Learning (ACRL), a new algorithm that provably guarantees the anytime cost constraints. The regret analysis shows that the policy asymptotically matches the optimal reward achievable under the anytime competitive constraints.
Results: Experiments on carbon-intelligent computing verify the reward performance and cost-constraint guarantees of ACRL.

This paper studies the problem of Anytime-Competitive Markov Decision Process (A-CMDP). Existing works on Constrained Markov Decision Processes (CMDPs) aim to optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. In contrast, the goal of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior. We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints. The regret analysis shows the policy asymptotically matches the optimal reward achievable under the anytime competitive constraints. Experiments on the application of carbon-intelligent computing verify the reward performance and cost constraint guarantee of ACRL.

Budgeting Counterfactual for Offline RL
Yao Liu Pratik Chaudhari Rasool Fakoor



Research question: In offline reinforcement learning, where data is limited, the main challenge arises from counterfactual reasoning dilemmas over the space of possible action sequences.
Motivation: What if we were to choose a different course of action? Such situations frequently give rise to extrapolation errors, which accumulate exponentially with the problem horizon. It is therefore crucial to recognize that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy makes in order to control extrapolation.
Method: Unlike existing approaches that regularize the policy or value function, we explicitly bound the number of out-of-distribution actions during training. Our method uses dynamic programming to decide where to extrapolate and where not to, with an upper bound on the number of decisions that differ from the behavior policy.
Results: Theoretically, we justify the method by the constrained optimality of the fixed-point solution to our $Q$ update rules. Empirically, it outperforms state-of-the-art offline RL methods on tasks in the widely used D4RL benchmarks.

The main challenge of offline reinforcement learning, where data is limited, arises from a sequence of counterfactual reasoning dilemmas within the realm of potential actions: What if we were to choose a different course of action? These circumstances frequently give rise to extrapolation errors, which tend to accumulate exponentially with the problem horizon. Hence, it becomes crucial to acknowledge that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy makes in order to control the extrapolation. Contrary to existing approaches that use regularization on either the policy or value function, we propose an approach to explicitly bound the amount of out-of-distribution actions during training. Specifically, our method utilizes dynamic programming to decide where to extrapolate and where not to, with an upper bound on the decisions different from behavior policy. It balances between the potential for improvement from taking out-of-distribution actions and the risk of making errors due to extrapolation. Theoretically, we justify our method by the constrained optimality of the fixed point solution to our $Q$ updating rules. Empirically, we show that the overall performance of our method is better than the state-of-the-art offline RL methods on tasks in the widely-used D4RL benchmarks.
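
A toy tabular rendering of the budgeting idea: augment the state with a remaining budget $b$; the backup may maximize over all actions (a counterfactual, possibly out-of-distribution choice) only by spending one budget unit, otherwise it bootstraps through the dataset's behavior action. This is a simplification for intuition, not the paper's exact operator.

```python
# Q is a dict keyed by (state, budget, action); transitions come from the
# offline dataset, which also supplies the behavior action at s_next.
def budgeted_target(Q, r, s_next, b, behavior_a_next, actions, gamma=0.99):
    follow = Q[(s_next, b, behavior_a_next)]       # stay on-distribution
    if b > 0:
        # Spending one budget unit allows a counterfactual (OOD) choice.
        deviate = max(Q[(s_next, b - 1, a)] for a in actions)
        return r + gamma * max(follow, deviate)
    return r + gamma * follow                      # budget exhausted
```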

Provably (More) Sample-Efficient Offline RL with Options
Xiaoyan Hu Ho-fung Leung



Research question: This paper addresses settings where exploring the environment online is risky, such as automated driving and healthcare.
Motivation: Although the options framework has achieved empirical success in long-horizon planning problems in reinforcement learning, existing results no longer apply when online exploration is risky.
Method: We propose the PEssimistic Value Iteration for Learning with Options (PEVIO) algorithm, derive a novel information-theoretic lower bound, and establish near-optimal suboptimality bounds for two popular data-collection procedures, one collecting state-option transitions and the other collecting state-action transitions.
Results: Compared with offline reinforcement learning with actions only, using options not only enjoys faster convergence to the optimal value but also attains better performance when the options are carefully designed or the offline data is limited.

The options framework yields empirical success in long-horizon planning problems of reinforcement learning (RL). Recent works show that options help improve the sample efficiency in online RL. However, these results are no longer applicable to scenarios where exploring the environment online is risky, e.g., automated driving and healthcare. In this paper, we provide the first analysis of the sample complexity for offline RL with options, where the agent learns from a dataset without further interaction with the environment. We derive a novel information-theoretic lower bound, which generalizes the one for offline learning with actions. We propose the PEssimistic Value Iteration for Learning with Options (PEVIO) algorithm and establish near-optimal suboptimality bounds for two popular data-collection procedures, where the first one collects state-option transitions and the second one collects state-action transitions. We show that compared to offline RL with actions, using options not only enjoys a faster finite-time convergence rate (to the optimal value) but also attains a better performance when either the options are carefully designed or the offline data is limited. Based on these results, we analyze the pros and cons of the data-collection procedures.

Belief Projection-Based Reinforcement Learning for Environments with Delayed Feedback
Jangwon Kim Hangyeol Kim Jiwook Kang Jongchan Baek Soohee Han



Research question: This paper addresses the state-space explosion that conventional approaches suffer in environments with delayed feedback.
Motivation: Conventional methods use an augmented state constructed from the last observed state and the actions executed since it was observed. Although this yields the correct Markov decision process for delayed environments, the state space explodes as the number of delayed timesteps increases, slowing convergence.
Method: We propose Belief-Projection-Based Q-Learning (BPQL), which evaluates a critic whose input state size equals the original state-space size rather than that of the augmented state.
Results: Experiments show that BPQL significantly outperforms other algorithms on continuous control tasks in both asymptotic performance and sample efficiency, and that it solves long-delayed environments that conventional approaches cannot handle.

We present a novel actor-critic algorithm for an environment with delayed feedback, which addresses the state-space explosion problem of conventional approaches. Conventional approaches use an augmented state constructed from the last observed state and actions executed since visiting the last observed state. Using the augmented state space, the correct Markov decision process for delayed environments can be constructed; however, this causes the state space to explode as the number of delayed timesteps increases, leading to slow convergence. Our proposed algorithm, called Belief-Projection-Based Q-learning (BPQL), addresses the state-space explosion problem by evaluating the values of the critic for which the input state size is equal to the original state-space size rather than that of the augmented one. We compare BPQL to traditional approaches in continuous control tasks and demonstrate that it significantly outperforms other algorithms in terms of asymptotic performance and sample efficiency. We also show that BPQL solves long-delayed environments, which conventional approaches are unable to do.

Maximum State Entropy Exploration using Predecessor and Successor Representations
Arnav Kumar Jain Lucas Lehnert Irina Rish Glen Berseth



Research question: How can exploration algorithms learn exploration strategies more effectively?
Motivation: Contemporary exploration algorithms often condition only on the current state or rely on random open-loop exploratory moves, failing to exploit past experience.
Method: We propose $\eta\psi$-Learning, a method that conditions on past episodic experience to make the next exploratory move.
Results: Experiments demonstrate that the method strategically explores the environment and maximizes state coverage with limited samples.

Animals have a developed ability to explore that aids them in important tasks such as locating food, exploring for shelter, and finding misplaced items. These exploration skills necessarily track where they have been so that they can plan for finding items with relative efficiency. Contemporary exploration algorithms often learn a less efficient exploration strategy because they either condition only on the current state or simply rely on making random open-loop exploratory moves. In this work, we propose $\eta\psi$-Learning, a method to learn efficient exploratory policies by conditioning on past episodic experience to make the next exploratory move. Specifically, $\eta\psi$-Learning learns an exploration policy that maximizes the entropy of the state visitation distribution of a single trajectory. Furthermore, we demonstrate how variants of the predecessor representation and successor representations can be combined to predict the state visitation entropy. Our experiments demonstrate the efficacy of $\eta\psi$-Learning to strategically explore the environment and maximize the state coverage with limited samples.
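
The maximized quantity is the entropy of a single trajectory's state-visitation distribution. A tabular stand-in is easy to compute directly (the paper instead *predicts* this entropy with predecessor and successor representations); the tiny example below is purely illustrative.

```python
import numpy as np

def visitation_entropy(trajectory_states, num_states):
    """Entropy of the empirical state-visitation distribution of one
    trajectory (tabular states as integer ids)."""
    counts = np.bincount(trajectory_states, minlength=num_states)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

print(visitation_entropy([0, 1, 2, 3], 4))   # diverse: log(4) ~ 1.386
print(visitation_entropy([0, 0, 0, 1], 4))   # repetitive: lower entropy
```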

A Reduction-based Framework for Sequential Decision Making with Delayed Feedback
Yunchang Yang Han Zhong Tianhao Wu Bin Liu Liwei Wang Simon Shaolei Du



Research question: This paper studies stochastic delayed feedback in general single-agent and multi-agent sequential decision making, including bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs).
Motivation: Delayed feedback is common in practice, yet algorithms designed for instantaneous feedback do not directly handle it.
Method: We propose a novel reduction-based framework that turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle stochastic delays.
Results: By plugging different multi-batched algorithms into the framework, we match or improve existing results for bandits, tabular MDPs, and tabular MGs, and provide the first line of studies on delays in sequential decision making with function approximation, yielding a complete set of sharp results for single-agent and multi-agent sequential decision making with delayed feedback.

We study stochastic delayed feedback in general single-agent and multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs). We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle stochastic delays in sequential decision making. By plugging different multi-batched algorithms into our framework, we provide several examples demonstrating that our framework not only matches or improves existing results for bandits, tabular MDPs, and tabular MGs, but also provides the first line of studies on delays in sequential decision making with function approximation. In summary, we provide a complete set of sharp results for single-agent and multi-agent sequential decision making with delayed feedback.

Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control
Chao Li Chen GONG Qiang He Xinwen Hou



Research question: Combining deep reinforcement learning (DRL) with ensemble methods has proved highly effective for complex sequential decision-making problems, yet there has been limited analysis of the empirical success of existing ensemble RL methods.
Motivation: Our new analysis reveals that the sample efficiency of existing ensemble DRL algorithms may be limited by sub-policies that are not as diverse as they could be.
Method: Motivated by these findings, we introduce a new ensemble RL algorithm, Trajectories-awarE Ensemble exploratioN (TEEN), whose primary goal is to maximize the expected return while promoting more diverse trajectories.
Results: Extensive experiments show that TEEN not only enhances the sample diversity of the ensemble policy compared with using the sub-policies alone, but also outperforms existing ensemble RL algorithms, exceeding baseline ensemble DRL algorithms by 41% on average across the tested representative environments.

The combination of deep reinforcement learning (DRL) with ensemble methods has been proved to be highly effective in addressing complex sequential decision-making problems. This success can be primarily attributed to the utilization of multiple models, which enhances both the robustness of the policy and the accuracy of value function estimation. However, there has been limited analysis of the empirical success of current ensemble RL methods thus far. Our new analysis reveals that the sample efficiency of previous ensemble DRL algorithms may be limited by sub-policies that are not as diverse as they could be. Motivated by these findings, our study introduces a new ensemble RL algorithm, termed Trajectories-awarE Ensemble exploratioN (TEEN). The primary goal of TEEN is to maximize the expected return while promoting more diverse trajectories. Through extensive experiments, we demonstrate that TEEN not only enhances the sample diversity of the ensemble policy compared to using sub-policies alone but also improves the performance over ensemble RL algorithms. On average, TEEN outperforms the baseline ensemble DRL algorithms by 41% in performance on the tested representative environments.

GraphMP: Graph Neural Network-based Motion Planning with Efficient Graph Search
Xiao Zang Miao Yin Jinqi Xiao Saman Zonouz Bo Yuan



Research question: How to use graph neural networks for high-quality collision-free path planning in robotic systems.
Motivation: Although learning-based motion planners, especially those powered by graph neural networks, have shown promising planning performance, their inherent mechanisms are not well suited to the graph search process, hindering further performance improvement.
Method: This paper proposes GraphMP, a neural motion planner for both low- and high-dimensional planning tasks. With a customized model architecture and training-mechanism design, GraphMP can simultaneously perform efficient graph pattern extraction and graph search processing, yielding strong planning performance.
Results: Experiments on environments ranging from a 2D Maze to a 14D dual KUKA robotic arm show that GraphMP achieves significant improvements in path quality and planning speed over state-of-the-art learning-based and classical planners, while preserving a competitive success rate.

Motion planning, which aims to find a high-quality collision-free path in the configuration space, is a fundamental task in robotic systems. Recently, learning-based motion planners, especially the graph neural network-powered, have shown promising planning performance. However, though the state-of-the-art GNN planner can efficiently extract and learn graph information, its inherent mechanism is not well suited for graph search process, hindering its further performance improvement. To address this challenge and fully unleash the potential of GNN in motion planning, this paper proposes GraphMP, a neural motion planner for both low and high-dimensional planning tasks. With the customized model architecture and training mechanism design, GraphMP can simultaneously perform efficient graph pattern extraction and graph search processing, leading to strong planning performance. Experiments on a variety of environments, ranging from 2D Maze to 14D dual KUKA robotic arm, show that our proposed GraphMP achieves significant improvement on path quality and planning speed over the state-of-the-art learning-based and classical planners, while preserving a competitive success rate.

Fractal Landscapes in Policy Optimization
Tao Wang Sylvia Lee Herbert Sicun Gao



Research question: This paper addresses why policy gradient methods in deep reinforcement learning fail to train in continuous domains.
Motivation: Despite much success, policy gradient training is often observed to fail in practice, even on standard control problems with known solutions.
Method: The authors propose a framework for understanding an inherent limitation of policy gradient methods: for certain classes of Markov decision processes (MDPs), the optimization landscape in policy space can be extremely non-smooth or fractal, so that there is no gradient to estimate in the first place. Drawing on techniques from chaos theory and non-smooth analysis, they analyze the maximal Lyapunov exponents and Hölder exponents of policy optimization objectives, and develop a practical method that estimates the local smoothness of the objective function from samples, identifying when training has encountered a fractal landscape.
Results: Experiments show that some failure cases of policy optimization can be explained by such fractal landscapes.

Policy gradient lies at the core of deep reinforcement learning (RL) in continuous domains. Despite much success, it is often observed in practice that RL training with policy gradient can fail for many reasons, even on standard control problems with known solutions. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal for certain classes of MDPs, such that there does not exist gradient to be estimated in the first place. We draw on techniques from chaos theory and non-smooth analysis, and analyze the maximal Lyapunov exponents and Hölder exponents of the policy optimization objectives. Moreover, we develop a practical method that can estimate the local smoothness of objective function from samples to identify when the training process has encountered fractal landscapes. We show experiments to illustrate how some failure cases of policy optimization can be explained by such fractal landscapes.
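
One way to estimate local smoothness from samples is to regress the log of finite differences against the log of the perturbation radius: a slope near 1 suggests a locally Lipschitz objective, while a slope well below 1 is consistent with fractal-like roughness. The sketch below is one such estimator under these assumptions; it is not the paper's exact procedure.

```python
import numpy as np

def estimate_holder_exponent(J, theta, scales=(1e-1, 1e-2, 1e-3),
                             n_dirs=32, rng=np.random.default_rng(0)):
    """J: objective (e.g., policy return) as a function of parameters;
    theta: current parameter vector. Returns the fitted log-log slope."""
    base = J(theta)
    logs_r, logs_diff = [], []
    for r in scales:
        for _ in range(n_dirs):
            u = rng.normal(size=theta.shape)
            u = r * u / np.linalg.norm(u)     # random direction, radius r
            logs_r.append(np.log(r))
            logs_diff.append(np.log(abs(J(theta + u) - base) + 1e-12))
    slope, _ = np.polyfit(logs_r, logs_diff, 1)
    return slope
```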

Multi-Agent Meta-Reinforcement Learning: Sharper Convergence Rates with Task Similarity
Weichao Mao Haoran Qiu Chen Wang Hubertus Franke Zbigniew Kalbarczyk Ravi Iyer Tamer Basar



Research question: This paper investigates the benefits of meta-learning in solving multiple multi-agent reinforcement learning (MARL) tasks collectively.
Motivation: Existing MARL work has primarily focused on solving a single task in isolation, while in practice the environment is often evolving, leaving many related tasks to be solved.
Method: We establish the first line of theoretical results for meta-learning in a wide range of fundamental MARL settings, including learning Nash equilibria in two-player zero-sum Markov games and Markov potential games, as well as learning coarse correlated equilibria in general-sum Markov games.
Results: Under natural notions of task similarity, meta-learning achieves provably sharper convergence to various game-theoretic solution concepts than learning each task separately. We also develop multiple MARL algorithms with initialization-dependent convergence guarantees; these integrate optimistic policy mirror descent with stage-based value updates, and their refined convergence guarantees (nearly) recover the best known results even when a good initialization is unknown. Numerical simulations corroborate the theoretical findings.

Multi-agent reinforcement learning (MARL) has primarily focused on solving a single task in isolation, while in practice the environment is often evolving, leaving many related tasks to be solved. In this paper, we investigate the benefits of meta-learning in solving multiple MARL tasks collectively. We establish the first line of theoretical results for meta-learning in a wide range of fundamental MARL settings, including learning Nash equilibria in two-player zero-sum Markov games and Markov potential games, as well as learning coarse correlated equilibria in general-sum Markov games. Under natural notions of task similarity, we show that meta-learning achieves provable sharper convergence to various game-theoretical solution concepts than learning each task separately. As an important intermediate step, we develop multiple MARL algorithms with initialization-dependent convergence guarantees. Such algorithms integrate optimistic policy mirror descents with stage-based value updates, and their refined convergence guarantees (nearly) recover the best known results even when a good initialization is unknown. To our best knowledge, such results are also new and might be of independent interest. We further provide numerical simulations to corroborate our theoretical findings.

On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes
Jia Lin Hau Erick Delage Mohammad Ghavamzadeh Marek Petrik



Research question: Optimizing static risk-averse objectives in Markov decision processes is difficult because they do not admit the standard dynamic programming equations common in reinforcement learning (RL) algorithms.
Motivation: Dynamic programming decompositions that augment the state space with discretized risk levels of Conditional Value-at-Risk (CVaR) and Entropic Value-at-Risk (EVaR) have gained popularity in the RL community. However, we find that these popular decompositions are inherently suboptimal regardless of the discretization level.
Method: We show that a decomposition does hold for Value-at-Risk (VaR), and our proof demonstrates how this risk measure differs from CVaR and EVaR; in particular, a saddle point property assumed in prior literature may be violated.
Results: These findings are significant because risk-averse algorithms are deployed in high-stakes environments, making their correctness all the more critical.

Optimizing static risk-averse objectives in Markov decision processes is difficult because they do not admit standard dynamic programming equations common in Reinforcement Learning (RL) algorithms. Dynamic programming decompositions that augment the state space with discrete risk levels have recently gained popularity in the RL community. Prior work has shown that these decompositions are optimal when the risk level is discretized sufficiently. However, we show that these popular decompositions for Conditional-Value-at-Risk (CVaR) and Entropic-Value-at-Risk (EVaR) are inherently suboptimal regardless of the discretization level. In particular, we show that a saddle point property assumed to hold in prior literature may be violated. However, a decomposition does hold for Value-at-Risk and our proof demonstrates how this risk measure differs from CVaR and EVaR. Our findings are significant because risk-averse algorithms are used in high-stake environments, making their correctness much more critical.

Goal-Conditioned Predictive Coding for Offline Reinforcement Learning
Zilai Zeng Ce Zhang Shijie Wang Chen Sun



Research question: Whether sequence models can condense trajectories into useful representations that enhance policy learning.
Motivation: Although powerful sequence models such as GPT or BERT are often used to encode trajectories, the benefits of sequence modeling on trajectory data remain unclear.
Method: We adopt a two-stage framework that first uses sequence models to encode trajectory-level representations and then learns a goal-conditioned policy that takes the encoded representations as input.
Results: Experiments show that sequence modeling can have a significant impact on challenging decision-making tasks. Moreover, the goal-conditioned latent representation learned by GCPC encodes the future trajectory, enabling competitive performance on all three benchmarks.

Recent work has demonstrated the effectiveness of formulating decision making as supervised learning on offline-collected trajectories. Powerful sequence models, such as GPT or BERT, are often employed to encode the trajectories. However, the benefits of performing sequence modeling on trajectory data remain unclear. In this work, we investigate whether sequence modeling has the ability to condense trajectories into useful representations that enhance policy learning. We adopt a two-stage framework that first leverages sequence models to encode trajectory-level representations, and then learns a goal-conditioned policy employing the encoded representations as its input. This formulation allows us to consider many existing supervised offline RL methods as specific instances of our framework. Within this framework, we introduce Goal-Conditioned Predictive Coding (GCPC), a sequence modeling objective that yields powerful trajectory representations and leads to performant policies. Through extensive empirical evaluations on AntMaze, FrankaKitchen and Locomotion environments, we observe that sequence modeling can have a significant impact on challenging decision making tasks. Furthermore, we demonstrate that GCPC learns a goal-conditioned latent representation encoding the future trajectory, which enables competitive performance on all three benchmarks.

For SALE: State-Action Representation Learning for Deep Reinforcement Learning
Scott Fujimoto Wei-Di Chang Edward J. Smith Shixiang Shane Gu Doina Precup David Meger



Research question: This paper addresses representation learning in reinforcement learning for environments with low-level states, such as physical control problems.
Motivation: Representation learning is a proven tool for image-based tasks, but it is often overlooked in environments with low-level states, such as physical control problems.
Method: We propose SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states.
Results: Integrating SALE and an adaptation of checkpoints for RL into TD3 yields the TD7 algorithm, which significantly outperforms existing continuous control algorithms on OpenAI gym benchmark tasks. TD7 has an average performance gain over TD3 of 276.7% at 300k time steps and 50.7% at 5M time steps, and works in both the online and offline settings.

In reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.
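
A minimal sketch of the embedding structure: a state encoder and a joint state-action encoder whose outputs augment the critic and actor inputs. The layer sizes, activations, and two-encoder split are illustrative assumptions rather than the published TD7 architecture.

```python
import torch
import torch.nn as nn

class StateActionEmbedding(nn.Module):
    def __init__(self, state_dim, action_dim, emb_dim=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim, emb_dim), nn.ELU(),
                               nn.Linear(emb_dim, emb_dim))
        self.g = nn.Sequential(nn.Linear(emb_dim + action_dim, emb_dim),
                               nn.ELU(), nn.Linear(emb_dim, emb_dim))

    def forward(self, state, action):
        zs = self.f(state)                           # state embedding
        zsa = self.g(torch.cat([zs, action], -1))    # state-action embedding
        return zs, zsa    # appended to the critic/actor inputs downstream
```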

Inverse Reinforcement Learning with the Average Reward Criterion
Feiyang Wu Jingyang Ke Anqi Wu



Research question: This paper studies Inverse Reinforcement Learning (IRL) under the average-reward criterion, aiming to recover an unknown policy and reward function given only samples of states and actions from an experienced agent.
Motivation: Existing IRL methods assume the expert is trained in a discounted environment with a known discount factor; this work relaxes that assumption by proposing an average-reward framework with efficient learning algorithms.
Method: We develop novel stochastic first-order methods for IRL in the average-reward setting, which requires solving an Average-reward Markov Decision Process (AMDP) as a subproblem. For the subproblem, we develop a Stochastic Policy Mirror Descent (SPMD) method over general state and action spaces that needs $\mathcal{O}(1/\varepsilon)$ steps of gradient computation. Equipped with SPMD, we propose the Inverse Policy Mirror Descent (IPMD) method, which solves the IRL problem with $\mathcal{O}(1/\varepsilon^2)$ complexity.
Results: Numerical experiments on the MuJoCo benchmark and additional control tasks corroborate the analysis; the above complexity results are new for IRL under the average-reward criterion.

We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion. The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent. Previous IRL methods assume that the expert is trained in a discounted environment, and the discount factor is known. This work alleviates this assumption by proposing an average-reward framework with efficient learning algorithms. We develop novel stochastic first-order methods to solve the IRL problem under the average-reward setting, which requires solving an Average-reward Markov Decision Process (AMDP) as a subproblem. To solve the subproblem, we develop a Stochastic Policy Mirror Descent (SPMD) method under general state and action spaces that needs $\mathcal{O}(1/\varepsilon)$ steps of gradient computation. Equipped with SPMD, we propose the Inverse Policy Mirror Descent (IPMD) method for solving the IRL problem with a $\mathcal{O}(1/\varepsilon^2)$ complexity. To the best of our knowledge, the aforementioned complexity results are new in IRL with the average reward criterion. Finally, we corroborate our analysis with numerical experiments using the MuJoCo benchmark and additional control tasks.

The Best of Both Worlds in Network Population Games: Reaching Consensus and Convergence to Equilibrium
Shuyue Hu Harold Soh Georgios Piliouras



Research question: This paper aims to address the two major challenges of consensus and equilibrium in multi-agent systems simultaneously.
Motivation: Although each challenge has attracted significant attention, relatively few studies address both at the same time.
Method: Examine the connection between the notions of consensus and equilibrium in a multi-agent system where multiple interacting sub-populations coexist.
Results: Smooth fictitious play achieves both consensus and convergence to equilibrium in diverse multi-agent settings, and the consensus formation process plays a crucial role in the equilibrium selection problem of multi-agent learning.

Reaching consensus and convergence to equilibrium are two major challenges of multi-agent systems. Although each has attracted significant attention, relatively few studies address both challenges at the same time. This paper examines the connection between the notions of consensus and equilibrium in a multi-agent system where multiple interacting sub-populations coexist. We argue that consensus can be seen as an intricate component of intra-population stability, whereas equilibrium can be seen as encoding inter-population stability. We show that smooth fictitious play, a well-known learning model in game theory, can achieve both consensus and convergence to equilibrium in diverse multi-agent settings. Moreover, we show that the consensus formation process plays a crucial role in the seminal thorny problem of equilibrium selection in multi-agent learning.

Revisiting the Minimalist Approach to Offline Reinforcement Learning
Denis Tarasov Vladislav Kurenkov Alexander Nikulin Sergey Kolesnikov



Research question: Offline reinforcement learning has advanced considerably in recent years, but the effect of seemingly minor design choices on established baselines remains understudied.
Motivation: This work aims to bridge that gap through a retrospective analysis of recent offline RL works.
Method: We propose ReBRAC, a minimalistic algorithm built on top of the TD3+BC method that integrates these design elements.
Results: Evaluated on 51 datasets with proprioceptive and visual state spaces from the D4RL and V-D4RL benchmarks, ReBRAC achieves state-of-the-art performance among ensemble-free methods in both offline and offline-to-online settings; a large-scale ablation study and a hyperparameter sensitivity analysis spanning thousands of experiments further demonstrate the efficacy of these design choices.

Recent years have witnessed significant advancements in offline reinforcement learning (RL), resulting in the development of numerous algorithms with varying degrees of complexity. While these algorithms have led to noteworthy improvements, many incorporate seemingly minor design choices that impact their effectiveness beyond core algorithmic advances. However, the effect of these design choices on established baselines remains understudied. In this work, we aim to bridge this gap by conducting a retrospective analysis of recent works in offline RL and propose ReBRAC, a minimalistic algorithm that integrates such design elements built on top of the TD3+BC method. We evaluate ReBRAC on 51 datasets with both proprioceptive and visual state spaces using D4RL and V-D4RL benchmarks, demonstrating its state-of-the-art performance among ensemble-free methods in both offline and offline-to-online settings. To further illustrate the efficacy of these design choices, we perform a large-scale ablation study and hyperparameter sensitivity analysis on the scale of thousands of experiments.
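As context for the TD3+BC foundation the abstract mentions, the published TD3+BC actor objective looks roughly like the sketch below; the penalty weight `beta` and the Q-scale normalization constant are illustrative, and `actor`/`critic` stand for any callables (e.g., torch modules).

```python
import torch

def actor_loss(actor, critic, states, dataset_actions, beta=0.05):
    pi = actor(states)                              # policy actions
    q = critic(states, pi)                          # critic evaluation
    lam = 1.0 / (q.abs().mean().detach() + 1e-6)    # keep Q and BC on one scale
    bc = ((pi - dataset_actions) ** 2).mean()       # behavior-cloning penalty
    return -lam * q.mean() + beta * bc
```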

Adversarial Model for Offline Reinforcement Learning
Mohak Bhardwaj Tengyang Xie Byron Boots Nan Jiang Ching-An Cheng



Research question: How to design an offline reinforcement learning (RL) framework that improves upon a reference policy regardless of data coverage.
Motivation: Existing offline RL methods often fail to learn and improve upon a reference policy when the data coverage is incomplete.
Method: We propose ARMOR, a novel model-based offline RL framework that optimizes policies for worst-case performance relative to an arbitrary reference policy by adversarially training a Markov decision process model.
Results: In theory, ARMOR can compete with the best policy within data coverage while being robust to hyperparameter choices; experiments show that ARMOR effectively improves the reference policy and is competitive with state-of-the-art offline model-free and model-based RL algorithms.

We propose a novel model-based offline Reinforcement Learning (RL) framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary reference policy regardless of data coverage. ARMOR is designed to optimize policies for the worst-case performance relative to the reference policy through adversarially training a Markov decision process model. In theory, we prove that ARMOR, with a well-tuned hyperparameter, can compete with the best policy within data coverage when the reference policy is supported by the data. At the same time, ARMOR is robust to hyperparameter choices: the policy learned by ARMOR, with any admissible hyperparameter, would never degrade the performance of the reference policy, even when the reference policy is not covered by the dataset. To validate these properties in practice, we design a scalable implementation of ARMOR, which by adversarial training, can optimize policies without using model ensembles in contrast to typical model-based methods. We show that ARMOR achieves competent performance with both state-of-the-art offline model-free and model-based RL algorithms and can robustly improve the reference policy over various hyperparameter choices.
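The relative-pessimism idea behind ARMOR can be written schematically as the objective below, where $\mathcal{M}_{\mathcal{D}}$ denotes models consistent with the data; the notation is assumed for illustration, not quoted from the paper.

```latex
% Schematic adversarial objective: maximize the worst-case advantage
% over the reference policy across data-consistent models.
\hat{\pi} \;=\; \arg\max_{\pi \in \Pi}\; \min_{M \in \mathcal{M}_{\mathcal{D}}}
  \Big( J_{M}(\pi) \;-\; J_{M}(\pi_{\mathrm{ref}}) \Big)
```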

Supported Value Regularization for Offline Reinforcement Learning
Yixiu Mao Hongchang Zhang Chen Chen Yi Xu Xiangyang Ji



Research question: Offline reinforcement learning suffers from extrapolation error and value overestimation caused by out-of-distribution (OOD) actions.
Motivation: Existing value regularization methods penalize the learned value function by assigning lower values to OOD actions, but they fail to properly distinguish between in-distribution (ID) and OOD actions and cannot guarantee optimal convergence of the policy.
Method: We propose Supported Value Regularization (SVR), which penalizes the Q-values of all OOD actions while maintaining standard Bellman updates for ID actions. Specifically, we use the bias of importance sampling to compute the sum of Q-values over the entire OOD region, which serves as the penalty for policy evaluation. This design automatically separates the regularization of ID and OOD actions without manually distinguishing between them.
Results: In tabular MDPs, the policy evaluation operator of SVR is a contraction whose fixed point yields unbiased Q-values for ID actions and underestimated Q-values for OOD actions; policy iteration with SVR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Empirically, we validate these properties in a tabular maze environment and demonstrate state-of-the-art performance on continuous control tasks from the D4RL benchmark.

Offline reinforcement learning suffers from the extrapolation error and value overestimation caused by out-of-distribution (OOD) actions. To mitigate this issue, value regularization approaches aim to penalize the learned value functions to assign lower values to OOD actions. However, existing value regularization methods lack a proper distinction between the regularization effects on in-distribution (ID) and OOD actions, and fail to guarantee optimal convergence results of the policy. To this end, we propose Supported Value Regularization (SVR), which penalizes the Q-values for all OOD actions while maintaining standard Bellman updates for ID ones. Specifically, we utilize the bias of importance sampling to compute the summation of Q-values over the entire OOD region, which serves as the penalty for policy evaluation. This design automatically separates the regularization for ID and OOD actions without manually distinguishing between them. In tabular MDP, we show that the policy evaluation operator of SVR is a contraction, whose fixed point outputs unbiased Q-values for ID actions and underestimated Q-values for OOD actions. Furthermore, the policy iteration with SVR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Empirically, we validate the theoretical properties of SVR in a tabular maze environment and demonstrate its state-of-the-art performance on a range of continuous control tasks in the D4RL benchmark.
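The "bias of importance sampling" the abstract refers to can be made explicit in discrete-action notation, as a schematic reconstruction: sampling from the behavior policy $\mu$ only covers supported actions, so the gap relative to a full expectation under $\pi$ is exactly the Q-mass on the OOD region.

```latex
\mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[Q(s,a)\big]
 \;-\; \mathbb{E}_{a \sim \mu(\cdot\mid s)}\!\left[\frac{\pi(a\mid s)}{\mu(a\mid s)}\,Q(s,a)\right]
 \;=\; \sum_{a:\,\mu(a\mid s)=0} \pi(a\mid s)\, Q(s,a)
```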

PID-Inspired Inductive Biases for Deep Reinforcement Learning in Partially Observable Control Tasks
Ian Char Jeff Schneider



Research question: When deep reinforcement learning is used to train system controllers, the full state is often not observable and the training and testing environments differ; how can the history encoder extract relevant information from past observations while remaining robust to environment changes?
Motivation: Inspired by the success of the PID controller, we posit that only summing and differencing are needed to accumulate information over time, addressing the environment-adaptation problem in deep RL.
Method: We propose two history encoder architectures based on PID features: one that directly uses PID features, and another that extends these ideas to arbitrary control tasks.
Results: Compared with prior approaches, both encoders produce policies that are generally more robust and achieve better performance on a range of tracking tasks; the policies also achieve 1.7x better performance on average over previous state-of-the-art methods on locomotion control tasks.

Deep reinforcement learning (RL) has shown immense potential for learning to control systems through data alone. However, one challenge deep RL faces is that the full state of the system is often not observable. When this is the case, the policy needs to leverage the history of observations to infer the current state. At the same time, differences between the training and testing environments make it critical for the policy not to overfit to the sequence of observations it sees at training time. As such, there is an important balancing act between having the history encoder be flexible enough to extract relevant information, yet be robust to changes in the environment. To strike this balance, we look to the PID controller for inspiration. We assert the PID controller's success shows that only summing and differencing are needed to accumulate information over time for many control tasks. Following this principle, we propose two architectures for encoding history: one that directly uses PID features and another that extends these core ideas and can be used in arbitrary control tasks. When compared with prior approaches, our encoders produce policies that are often more robust and achieve better performance on a variety of tracking tasks. Going beyond tracking tasks, our policies achieve 1.7x better performance on average over previous state-of-the-art methods on a suite of locomotion control tasks.
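To make the "only summing and differencing" principle concrete, here is a minimal sketch of PID-style history features; the per-dimension layout and the flat concatenation are illustrative assumptions.

```python
import numpy as np

def pid_features(errors, dt=1.0):
    """errors: (T, d) array of per-step tracking errors (target - observation)."""
    e = errors[-1]                        # proportional term: current error
    i = errors.sum(axis=0) * dt           # integral term: running sum
    d = (errors[-1] - errors[-2]) / dt if len(errors) > 1 else np.zeros_like(e)
    return np.concatenate([e, i, d])      # features fed to the policy

history = np.random.randn(50, 3)          # illustrative error history
phi = pid_features(history)               # shape (9,)
```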

FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation
Xinyu Sun Peihao Chen Jugang Fan Jian Chen Thomas H. Li Mingkui Tan



Research question: How to enable autonomous systems, such as household robots, to navigate to a goal location specified by an image.
Motivation: Existing methods may miss detailed information in the goal image when reasoning about the goal location, and struggle to focus on goal-relevant regions of the observation image.
Method: We design Fine-grained Goal Prompting (FGPrompt), which uses fine-grained, high-resolution feature maps of the goal image as prompts for conditioned embedding, preserving detailed information in the goal image and guiding the observation encoder to attend to goal-relevant regions.
Results: On image-goal navigation benchmarks, the method brings significant performance improvements on 3 benchmark datasets (Gibson, MP3D, and HM3D); on Gibson in particular, it surpasses the state-of-the-art success rate by 8% with only 1/50 of the model size.

Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems like household robots. The agent is required to well understand and reason the location of the navigation goal from a picture shot in the goal position. Existing methods try to solve this problem by learning a navigation policy, which captures semantic features of the goal image and observation image independently and lastly fuses them for predicting a sequence of navigation actions. However, these methods suffer from two major limitations. 1) They may miss detailed information in the goal image, and thus fail to reason the goal location. 2) More critically, it is hard to focus on the goal-relevant regions in the observation image, because they attempt to understand observation without goal conditioning. In this paper, we aim to overcome these limitations by designing a Fine-grained Goal Prompting (FGPrompt) method for image-goal navigation. In particular, we leverage fine-grained and high-resolution feature maps in the goal image as prompts to perform conditioned embedding, which preserves detailed information in the goal image and guides the observation encoder to pay attention to goal-relevant regions. Compared with existing methods on the image-goal navigation benchmark, our method brings significant performance improvement on 3 benchmark datasets (\textit{i.e.,} Gibson, MP3D, and HM3D). Especially on Gibson, we surpass the state-of-the-art success rate by 8\% with only 1/50 model size.

BCDiff: Bidirectional Consistent Diffusion for Instantaneous Trajectory Prediction
Rongqing Li Changsheng Li Dongchun Ren Guangyi Chen Ye Yuan Guoren Wang



Research question: Pedestrian trajectory prediction aims to estimate pedestrians' future paths from historical observations, which is vital for the safety of self-driving vehicles and navigation robots.
Motivation: In many real-world situations the model lacks sufficient observation time, for example when a pedestrian abruptly emerges from a blind spot, leading to inaccurate predictions and even safety risks. Trajectory prediction from instantaneous observations is therefore necessary, but has rarely been studied.
Method: We propose BCDiff, a bidirectional consistent diffusion framework tailored for instantaneous trajectory prediction. At its core, a mutual guidance mechanism couples two diffusion models that bidirectionally and consistently generate unobserved historical trajectories and future trajectories step by step, exploiting the complementary information between them.
Results: Experiments show that BCDiff significantly improves the accuracy of instantaneous trajectory prediction on the ETH/UCY and Stanford Drone datasets compared with related approaches.

The objective of pedestrian trajectory prediction is to estimate the future paths of pedestrians by leveraging historical observations, which plays a vital role in ensuring the safety of self-driving vehicles and navigation robots. Previous works usually rely on a sufficient amount of observation time to accurately predict future trajectories. However, there are many real-world situations where the model lacks sufficient time to observe, such as when pedestrians abruptly emerge from blind spots, resulting in inaccurate predictions and even safety risks. Therefore, it is necessary to perform trajectory prediction based on instantaneous observations, which has rarely been studied before. In this paper, we propose a Bi-directional Consistent Diffusion framework tailored for instantaneous trajectory prediction, named BCDiff. At its heart, we develop two coupled diffusion models by designing a mutual guidance mechanism which can bidirectionally and consistently generate unobserved historical trajectories and future trajectories step-by-step, to utilize the complementary information between them. Specifically, at each step, the predicted unobserved historical trajectories and limited observed trajectories guide one diffusion model to generate future trajectories, while the predicted future trajectories and observed trajectories guide the other diffusion model to predict unobserved historical trajectories. Given the presence of relatively high noise in the generated trajectories during the initial steps, we introduce a gating mechanism to learn the weights between the predicted trajectories and the limited observed trajectories for automatically balancing their contributions. By means of this iterative and mutually guided generation process, both the future and unobserved historical trajectories undergo continuous refinement, ultimately leading to accurate predictions. Essentially, BCDiff is an encoder-free framework that can be compatible with existing trajectory prediction models in principle. Experiments show that our proposed BCDiff significantly improves the accuracy of instantaneous trajectory prediction on the ETH/UCY and Stanford Drone datasets, compared to related approaches.

Learning from Visual Observation via Offline Pretrained State-to-Go Transformer
Bohan Zhou Ke Li Jiechuan Jiang Zongqing Lu



Research question: How to recover policies from visual observation data alone, a promising yet challenging problem.
Motivation: Existing learning-from-visual-observation (LfVO) approaches either adopt inefficient online learning schemes or require additional task-specific information such as goal states, making them unsuitable for open-ended tasks.
Method: We propose a two-stage framework for learning from visual observation. In the first stage, we introduce and pretrain a State-to-Go (STG) Transformer offline to predict and differentiate latent transitions of demonstrations. In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks, where the agent learns solely from intrinsic rewards.
Results: Experiments on Atari and Minecraft show that the proposed method outperforms baselines and in some tasks even matches policies learned from environmental rewards, highlighting the potential of video-only data for solving hard visual RL tasks without complete offline datasets of states, actions, and rewards.

Learning from visual observation (LfVO), aiming at recovering policies from only visual observation data, is promising yet a challenging problem. Existing LfVO approaches either only adopt inefficient online learning schemes or require additional task-specific information like goal states, making them not suited for open-ended tasks. To address these issues, we propose a two-stage framework for learning from visual observation. In the first stage, we introduce and pretrain State-to-Go (STG) Transformer offline to predict and differentiate latent transitions of demonstrations. Subsequently, in the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks where an agent learns merely from intrinsic rewards. Empirical results on Atari and Minecraft show that our proposed method outperforms baselines and in some tasks even achieves performance comparable to the policy learned from environmental rewards. These results shed light on the potential of utilizing video-only data to solve difficult visual reinforcement learning tasks rather than relying on complete offline datasets containing states, actions, and rewards. The project’s website and code can be found at https://sites.google.com/view/stgtransformer.

Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents
Zihao Wang Shaofei Cai Guanzhou Chen Anji Liu Xiaojian Ma Yitao Liang



Research question: This paper studies planning in Minecraft, an open-ended, popular yet challenging environment for developing multi-task embodied agents.
Motivation: Empowering such agents with planning faces two main challenges: 1) planning in an open world like Minecraft requires precise, multi-step reasoning due to the long-horizon nature of the tasks; and 2) vanilla planners do not consider the achievability of the current agent when ordering parallel sub-goals within a complicated plan, so the resulting plan can be inefficient.
Method: We propose "Describe, Explain, Plan and Select" (DEPS), an interactive planning approach based on Large Language Models (LLMs). It enables better error correction from feedback during long-haul planning and introduces a sense of proximity via a goal Selector, a learnable module that ranks parallel sub-goals by estimated steps to completion and refines the original plan accordingly.
Results: Our experiments mark the milestone of the first zero-shot multi-task agent that robustly accomplishes 70+ Minecraft tasks, nearly doubling overall performance. Further tests show the method's general effectiveness in popular non-open-ended domains (e.g., ALFWorld and tabletop manipulation); ablation and exploratory studies detail how our design beats its counterparts and report promising progress on the ObtainDiamond grand challenge.

In this paper, we study the problem of planning in Minecraft, a popular, democratized yet challenging open-ended environment for developing multi-task embodied agents. We've found two primary challenges of empowering such agents with planning: 1) planning in an open-ended world like Minecraft requires precise and multi-step reasoning due to the long-term nature of the tasks, and 2) as vanilla planners do not consider the achievability of the current agent when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient. To this end, we propose ``$\underline{D}$escribe, $\underline{E}$xplain, $\underline{P}$lan and $\underline{S}$elect'' ($\textbf{DEPS}$), an interactive planning approach based on Large Language Models (LLMs). Our approach helps with better error correction from the feedback during the long-haul planning, while also bringing the sense of proximity via goal $\textbf{Selector}$, a learnable module that ranks parallel sub-goals based on the estimated steps of completion and improves the original plan accordingly. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the $\texttt{ObtainDiamond}$ grand challenge with our approach.

Interpretable and Explainable Logical Policies via Neurally Guided Symbolic Abstraction
Quentin Delfosse Hikaru Shindo Devendra Singh Dhami Kristian Kersting



Research question: How to make reinforcement learning policies that can be both encoded and learned while remaining interpretable and explainable.
Motivation: Although neural networks dominate RL, their black-box nature makes it hard to understand an agent's behavior, especially at the image level; neuro-symbolic RL therefore aims to create policies that are interpretable in the first place.
Method: We introduce Neurally gUided Differentiable loGic policiEs (NUDGE), which exploits trained neural-network-based agents to guide the search for candidate weighted logic rules, then uses differentiable logic to train the logic agents.
Results: Experimental evaluation shows that NUDGE agents can induce interpretable and explainable policies while outperforming purely neural ones and exhibiting good flexibility to environments with different initial states and problem sizes.

The limited priors required by neural networks make them the dominating choice to encode and learn policies using reinforcement learning (RL). However, they are also black-boxes, making it hard to understand the agent's behavior, especially when working on the image level. Therefore, neuro-symbolic RL aims at creating policies that are interpretable in the first place. Unfortunately, interpretability is not explainability. To achieve both, we introduce Neurally gUided Differentiable loGic policiEs (NUDGE). NUDGE exploits trained neural network-based agents to guide the search of candidate-weighted logic rules, then uses differentiable logic to train the logic agents. Our experimental evaluation demonstrates that NUDGE agents can induce interpretable and explainable policies while outperforming purely neural ones and showing good flexibility to environments of different initial states and problem sizes.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
Guohao Li Hasan Abed Al Kader Hammoud Hani Itani Dmitrii Khizbullin Bernard Ghanem



Research question: How to achieve autonomous cooperation among chat agents.
Motivation: The current success of chat agents relies heavily on human guidance, which is difficult and time-consuming.
Method: We propose a novel communicative agent framework named role-playing, which uses inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions.
Results: We showcase how role-playing can generate conversational data for studying the behaviors and capabilities of agent societies, providing a valuable resource for investigating conversational language models.

The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their “cognitive” processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing . Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: https://github.com/camel-ai/camel.
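A minimal sketch of a role-playing loop of the kind described above is given below. `chat` is a hypothetical stand-in for any chat-completion API, and the inception prompts are illustrative, not the paper's exact prompts.

```python
def chat(system_prompt, history):
    # Placeholder: wire this to a real chat-completion API.
    return "Instruction: ... <TASK_DONE>"

user_sys = ("Never forget you are the user and I am the assistant. "
            "Instruct me step by step to complete the task: develop a trading bot. "
            "Say <TASK_DONE> when the task is completed.")
assistant_sys = ("Never forget you are the assistant and I am the user. "
                 "Complete each instruction I give and reply with Solution: ...")

user_history, assistant_history = [], ["Now give me your first instruction."]
for _ in range(40):
    instruction = chat(user_sys, assistant_history)   # AI user plays the human
    user_history.append(instruction)
    if "<TASK_DONE>" in instruction:
        break
    solution = chat(assistant_sys, user_history)      # AI assistant solves it
    assistant_history.append(solution)
```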

Train Hard, Fight Easy: Robust Meta Reinforcement Learning
Ido Greenberg Shie Mannor Gal Chechik Eli Meirom



Research question: A major challenge of reinforcement learning in real-world applications is the variation between environments, tasks, or clients.
Motivation: Meta-RL (MRL) addresses this by learning a meta-policy that adapts to new tasks; however, standard MRL methods often perform poorly on high-risk or difficult tasks, which limits system reliability.
Method: We define a robust MRL objective with a controlled robustness level and prove that the gradient bias vanishes in the proposed MRL framework. We further propose the Robust Meta RL algorithm (RoML), which addresses data inefficiency by identifying and over-sampling harder tasks throughout training.
Results: Experiments show that RoML achieves robust returns on multiple navigation and continuous control benchmarks.

A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability since test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL is known to lead to both **biased gradients** and **data inefficiency**. We prove that the gradient bias disappears in our proposed MRL framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.
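One way to over-sample harder tasks, in the spirit of RoML, is sketched below; the softmax rule over standardized returns is an illustrative choice, not the paper's exact sampler.

```python
import numpy as np

def sample_task(task_returns, temperature=1.0):
    r = np.asarray(task_returns, dtype=float)
    w = np.exp(-(r - r.mean()) / (r.std() + 1e-8) / temperature)  # lower return => heavier weight
    return np.random.choice(len(r), p=w / w.sum())

running_returns = [10.0, 2.0, 7.5]        # per-task mean return (illustrative)
task_id = sample_task(running_returns)    # task 1 is drawn most often
```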

Wasserstein Gradient Flows for Optimizing Gaussian Mixture Policies
Hanna Ziesche Leonel Rozo



Research question: How should robots adapt their motion policies to task-specific objectives when facing unseen task conditions or new task requirements?
Motivation: Most commonly-used motion policies carry particular structures that are often overlooked by policy optimization algorithms; we propose to cast policy optimization as an optimal transport problem to exploit the structure of probabilistic policies.
Method: We formulate the optimization of robot motion policies based on Gaussian mixture models (GMMs) as a Wasserstein gradient flow over the space of GMMs, constraining policy updates via the $L^2$-Wasserstein distance between GMMs to enhance the stability of the optimization process. We also leverage the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of the GMM policy via Riemannian optimization.
Results: Evaluated in common robotic settings, including reaching motions, collision-avoidance behaviors, and multi-goal tasks, the method outperforms common policy optimization baselines in terms of task success rate and low-variance solutions.

Robots often rely on a repertoire of previously-learned motion policies for performing tasks of diverse complexities. When facing unseen task conditions or when new task requirements arise, robots must adapt their motion policies accordingly. In this context, policy optimization is the \emph{de facto} paradigm to adapt robot policies as a function of task-specific objectives. Most commonly-used motion policies carry particular structures that are often overlooked in policy optimization algorithms. We instead propose to leverage the structure of probabilistic policies by casting the policy optimization as an optimal transport problem. Specifically, we focus on robot motion policies that build on Gaussian mixture models (GMMs) and formulate the policy optimization as a Wasserstein gradient flow over the GMMs space. This naturally allows us to constrain the policy updates via the $L^2$-Wasserstein distance between GMMs to enhance the stability of the policy optimization process. Furthermore, we leverage the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of the GMM policy via Riemannian optimization. We evaluate our approach on common robotic settings: Reaching motions, collision-avoidance behaviors, and multi-goal tasks. Our results show that our method outperforms common policy optimization baselines in terms of task success rate and low-variance solutions.
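The building block for constraining updates between Gaussian components is the standard squared Bures-Wasserstein distance between two Gaussians, reproduced below for reference.

```latex
W_2^2\big(\mathcal{N}(\mu_1,\Sigma_1),\, \mathcal{N}(\mu_2,\Sigma_2)\big)
 \;=\; \|\mu_1 - \mu_2\|^2
 \;+\; \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2
   - 2\big(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\big)^{1/2}\Big)
```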

Task-aware world model learning with meta weighting via bi-level optimization
Huining Yuan Hongkun Dou Xingyu Jiang Yue Deng



Research question: Aligning the world model with the environment for the agent's specific task is crucial in model-based reinforcement learning.
Motivation: While value-equivalent models may achieve better task awareness than maximum-likelihood models, they sacrifice a large amount of semantic information and face implementation issues. To combine the benefits of both types of models, we propose a Task-aware Environment Modeling Pipeline with bi-level Optimization (TEMPO).
Method: TEMPO is a bi-level model learning framework that adds an extra level of optimization on top of a maximum-likelihood model by introducing a meta weighter network that weights each training sample. The upper-level meta weighter learns to generate sample weights by minimizing a proposed task-aware model loss; the lower-level model focuses on important samples while preserving rich semantic information in state representations.
Results: Evaluated on a variety of continuous and discrete control tasks from the DeepMind Control Suite and Atari video games, TEMPO achieves state-of-the-art results in asymptotic performance, training stability, and convergence speed.

Aligning the world model with the environment for the agent’s specific task is crucial in model-based reinforcement learning. While value-equivalent models may achieve better task awareness than maximum-likelihood models, they sacrifice a large amount of semantic information and face implementation issues. To combine the benefits of both types of models, we propose Task-aware Environment Modeling Pipeline with bi-level Optimization (TEMPO), a bi-level model learning framework that introduces an additional level of optimization on top of a maximum-likelihood model by incorporating a meta weighter network that weights each training sample. The meta weighter in the upper level learns to generate novel sample weights by minimizing a proposed task-aware model loss. The model in the lower level focuses on important samples while maintaining rich semantic information in state representations. We evaluate TEMPO on a variety of continuous and discrete control tasks from the DeepMind Control Suite and Atari video games. Our results demonstrate that TEMPO achieves state-of-the-art performance regarding asymptotic performance, training stability, and convergence speed.

Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms
Akifumi Wachi Wataru Hashimoto Xun Shen Kazumune Hashimoto



Research question: This paper addresses safe exploration in reinforcement learning and presents a generalized safe exploration (GSE) problem as a unified formulation of common safe exploration problems.
Motivation: In many real-world scenarios, safe exploration is essential for the practical use of reinforcement learning.
Method: We propose MASE, a meta-algorithm for the GSE problem that combines an unconstrained RL algorithm with an uncertainty quantifier to guarantee safety in the current episode, while properly penalizing unsafe explorations before actual safety violations to discourage them in future episodes.
Results: Experiments show that the proposed algorithm achieves better performance than state-of-the-art algorithms on grid-world and Safety Gym benchmarks without violating any safety constraints.

Safe exploration is essential for the practical use of reinforcement learning (RL) in many real-world scenarios. In this paper, we present a generalized safe exploration (GSE) problem as a unified formulation of common safe exploration problems. We then propose a solution of the GSE problem in the form of a meta-algorithm for safe exploration, MASE, which combines an unconstrained RL algorithm with an uncertainty quantifier to guarantee safety in the current episode while properly penalizing unsafe explorations before actual safety violation to discourage them in future episodes. The advantage of MASE is that we can optimize a policy while guaranteeing with a high probability that no safety constraint will be violated under proper assumptions. Specifically, we present two variants of MASE with different constructions of the uncertainty quantifier: one based on generalized linear models with theoretical guarantees of safety and near-optimality, and another that combines a Gaussian process to ensure safety with a deep RL algorithm to maximize the reward. Finally, we demonstrate that our proposed algorithm achieves better performance than state-of-the-art algorithms on grid-world and Safety Gym benchmarks without violating any safety constraints, even during training.

Learning non-Markovian Decision-Making from State-only Sequences
Aoyang Qin Feng Gao Qing Li Song-Chun Zhu Sirui Xie



Research question: Conventional imitation learning requires access to demonstrators' actions, but these motor signals are often unobservable in naturalistic settings; moreover, sequential decision-making behaviors in such settings can deviate from the assumptions of a standard Markov Decision Process (MDP).
Motivation: To address these challenges, we explore deep generative modeling of state-only sequences with a non-Markov Decision Process (nMDP), where the policy is an energy-based prior in the latent space of the state transition generator.
Method: We develop maximum likelihood estimation for model-based imitation, involving short-run MCMC sampling from the prior and importance sampling for the posterior. The learned model enables decision-making as inference: model-free policy execution is equivalent to prior sampling, and model-based planning is posterior sampling initialized from the policy.
Results: We demonstrate the efficacy of the proposed method on a prototypical path-planning task with non-Markovian constraints and show that the learned model performs strongly on challenging domains from the MuJoCo suite.

Conventional imitation learning assumes access to the actions of demonstrators, but these motor signals are often non-observable in naturalistic settings. Additionally, sequential decision-making behaviors in these settings can deviate from the assumptions of a standard Markov Decision Process (MDP). To address these challenges, we explore deep generative modeling of state-only sequences with non-Markov Decision Process (nMDP), where the policy is an energy-based prior in the latent space of the state transition generator. We develop maximum likelihood estimation to achieve model-based imitation, which involves short-run MCMC sampling from the prior and importance sampling for the posterior. The learned model enables $\textit{decision-making as inference}$: model-free policy execution is equivalent to prior sampling, model-based planning is posterior sampling initialized from the policy. We demonstrate the efficacy of the proposed method in a prototypical path planning task with non-Markovian constraints and show that the learned model exhibits strong performances in challenging domains from the MuJoCo suite.

Video Prediction Models as Rewards for Reinforcement Learning
Alejandro Escontrela Ademi Adeniji Wilson Yan Ajay Jain Xue Bin Peng Ken Goldberg Youngwoon Lee Danijar Hafner Pieter Abbeel



Research question: How to specify reward signals in reinforcement learning that allow agents to learn complex behaviors.
Motivation: Extracting behavior preferences from unlabeled videos, which are widely available on the internet, is a promising approach.
Method: We propose Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning.
Results: Experiments show that VIPER achieves expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows deriving rewards for out-of-distribution environments without expert data, enabling cross-embodiment generalization for tabletop manipulation. We see this work as a starting point for scalable reward specification from unlabeled videos that will benefit from rapid advances in generative modeling.

Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: https://ViperRL.com
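A minimal sketch of the reward signal described above follows; `video_model` is a hypothetical autoregressive video model exposing per-frame log-likelihoods, and the exact conditioning is an assumption for illustration.

```python
def viper_reward(video_model, frames, t):
    """r_t = log p(frame_{t+1} | frames_{<=t}) under the pretrained model."""
    return video_model.log_prob(frames[t + 1], context=frames[: t + 1])
```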

Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Data
Ruiqi Zhang Andrea Zanette



Research question: How to leverage an existing dataset of experience together with additional online data to improve policy quality in reinforcement learning.
Motivation: A pre-collected dataset can be used to improve the policy, and acquiring additional online data is also helpful; however, frequently switching exploration policies incurs engineering costs.
Method: This paper proposes an algorithm with provable guarantees that uses the offline dataset to design a single, non-reactive exploration policy.
Results: Theoretical analysis shows that the quality of the final policy improves with the local coverage of the original dataset and the amount of additional data collected.

In some applications of reinforcement learning, a dataset of pre-collected experience is already available but it is also possible to acquire some additional online data to help improve the quality of the policy. However, it may be preferable to gather additional data with a single, non-reactive exploration policy and avoid the engineering costs associated with switching policies. In this paper we propose an algorithm with provable guarantees that can leverage an offline dataset to design a single non-reactive policy for exploration. We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.

Learning Dynamic Attribute-factored World Models for Efficient Multi-object Reinforcement Learning
Fan Feng Sara Magliacane



Research question: In many reinforcement learning tasks, the agent must learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects.
Motivation: Existing methods do not fully exploit the benefits of factorization in terms of object attributes; we propose the Dynamic Attribute FacTored RL (DAFT-RL) framework to address this.
Method: We use object-centric representation learning to extract objects from visual inputs. For each object class we learn a class template graph describing how the dynamics and reward of objects of that class factorize according to their attributes, along with an interaction pattern graph describing how objects of different classes interact at the attribute level. With these graphs and a dynamic interaction graph, we can learn a policy that is directly applicable in new environments by estimating the interactions and latent parameters.
Results: Evaluated on three benchmark datasets, DAFT-RL outperforms the state-of-the-art in generalizing across unseen objects with varying attributes and latent parameters, as well as in composing previously learned tasks.

In many reinforcement learning tasks, the agent has to learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects. Often a task is a composition of previously learned tasks (e.g. block stacking). These are examples of compositional generalization, in which we compose object-centric representations to solve complex tasks. Recent works have shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency in these settings. On the other hand, these methods do not fully exploit the benefits of factorization in terms of object attributes. In this paper, we address this opportunity and introduce the Dynamic Attribute FacTored RL (DAFT-RL) framework. In DAFT-RL, we leverage object-centric representation learning to extract objects from visual inputs. We learn to classify them into classes and infer their latent parameters. For each class of object, we learn a class template graph that describes how the dynamics and reward of an object of this class factorize according to its attributes. We also learn an interaction pattern graph that describes how objects of different classes interact with each other at the attribute level. Through these graphs and a dynamic interaction graph that models the interactions between objects, we can learn a policy that can then be directly applied in a new environment by estimating the interactions and latent parameters. We evaluate DAFT-RL in three benchmark datasets and show our framework outperforms the state-of-the-art in generalizing across unseen objects with varying attributes and latent parameters, as well as in the composition of previously learned tasks.

Automatic Grouping for Efficient Cooperative Multi-Agent Reinforcement Learning
Yifan Zang Jinmin He Kai Li Haobo Fu QIANG FU Junliang Xing Jian Cheng



Research question: How to coordinate effectively in teams to improve team efficiency.
Motivation: Existing methods attempt to directly learn the complex relationship between joint action-values and individual utilities; instead, we use grouping as a bridge to model the connections among small sets of agents and encourage cooperation among them, thereby improving the learning efficiency of the whole team.
Method: We propose Group-oriented Multi-Agent Reinforcement Learning (GoMARL), which learns automatic grouping without domain knowledge for efficient cooperation. Specifically, we factorize the joint action-values as a combination of group-wise values, which guide agents to improve their policies in a fine-grained fashion.
Results: Experiments on StarCraft II micromanagement tasks and Google Research Football scenarios verify the method's effectiveness; extensive component studies show how grouping works and enhances performance.

Grouping is ubiquitous in natural systems and is essential for promoting efficiency in team coordination. This paper proposes a novel formulation of Group-oriented Multi-Agent Reinforcement Learning (GoMARL), which learns automatic grouping without domain knowledge for efficient cooperation. In contrast to existing approaches that attempt to directly learn the complex relationship between the joint action-values and individual utilities, we empower grouping as a bridge to model the connection between small sets of agents and encourage cooperation among them, thereby improving the learning efficiency of the whole team. In particular, we factorize the joint action-values as a combination of group-wise values, which guide agents to improve their policies in a fine-grained fashion. We present an automatic grouping mechanism to generate dynamic groups and group action-values. We further introduce a hierarchical control for policy learning that drives the agents in the same group to specialize in similar policies and possess diverse strategies for various groups. Experiments on the StarCraft II micromanagement tasks and Google Research Football scenarios verify our method's effectiveness. Extensive component studies show how grouping works and enhances performance.

Large Language Models can Implement Policy Iteration
Ethan Brooks Logan A Walls Richard Lewis Satinder Singh



Research question: How to implement policy iteration with large language models.
Motivation: Existing foundation-model approaches to reinforcement learning rely either on expert demonstrations or task-specific pretraining, or on gradient-based fine-tuning or adapter layers, each of which has drawbacks.
Method: This paper implements policy iteration with a large language model via in-context learning, iteratively updating the contents of the prompt through trial-and-error interaction with an RL environment, so that RL tasks can be learned without expert demonstrations or gradients.
Results: Experiments show that the approach achieves policy iteration with a language model that has no prior knowledge of the evaluated domains (Codex), without expert demonstrations or gradients.

In this work, we demonstrate a method for implementing policy iteration using a large language model. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the “few-shot” quality that makes in-context learning attractive to begin with. Our method demonstrates that a large language model can be used to implement policy iteration using the machinery of in-context learning, enabling it to learn to perform RL tasks without expert demonstrations or gradients. Our approach iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our method using Codex (M. Chen et al. 2021b), a language model with no prior knowledge of the domains on which we evaluate it.
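The prompt-as-policy loop can be sketched as follows; `llm_complete`, the prompt layout, and the integer action parsing are hypothetical stand-ins, not the paper's exact machinery.

```python
def llm_complete(prompt: str) -> str:
    # Placeholder: wire this to a real completion API.
    return "0"

def select_action(transitions, state):
    lines = [f"state: {s} action: {a} reward: {r}" for s, a, r in transitions]
    prompt = "\n".join(lines) + f"\nstate: {state} action:"
    return int(llm_complete(prompt))

transitions = [(0, 1, 0.0), (1, 0, 1.0)]   # grows through trial and error
action = select_action(transitions, state=1)
```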

Inverse Preference Learning: Preference-based RL without a Reward Function
Joey Hejna Dorsa Sadigh



Research question: Reward functions are difficult to design and often hard to align with human intent.
Motivation: Preference-based reinforcement learning algorithms address these problems by learning reward functions from human feedback.
Method: We develop Inverse Preference Learning (IPL), a novel, parameter-efficient algorithm specifically designed for learning from offline preference data.
Results: Across a suite of continuous control and robotics benchmarks, IPL attains performance competitive with more complex approaches that leverage transformer-based and non-Markovian reward functions, while having fewer algorithmic hyperparameters and learned network parameters.

Reward functions are difficult to design and often hard to align with human intent. Preference-based Reinforcement Learning (RL) algorithms address these problems by learning reward functions from human feedback. However, the majority of preference-based RL methods na\"ively combine supervised reward models with off-the-shelf RL algorithms. Contemporary approaches have sought to improve performance and query complexity by using larger and more complex reward architectures such as transformers. Instead of using highly complex architectures, we develop a new and parameter-efficient algorithm, Inverse Preference Learning (IPL), specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the $Q$-function encodes all information about the reward function, effectively making them interchangeable. Using this insight, we completely eliminate the need for a learned reward function. Our resulting algorithm is simpler and more parameter-efficient. Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released.
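The key insight stated above, that for a fixed policy the Q-function encodes the reward, follows from rearranging the standard Bellman equation, written schematically below with standard notation.

```latex
r(s,a) \;=\; Q(s,a) \;-\; \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[V^{\pi}(s')\big],
\qquad V^{\pi}(s') \;=\; \mathbb{E}_{a' \sim \pi(\cdot\mid s')}\big[Q(s',a')\big]
```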

Latent exploration for Reinforcement Learning
Alberto Silvio Chiappa Alessandro Marin Vargas Ann Huang Alexander Mathis



Research question: How to explore and interact with the environment effectively in reinforcement learning, particularly when mapping high-dimensional sensory input to motor output.
Motivation: Existing methods (e.g., SAC, PPO) explore by perturbing the actuation with independent Gaussian noise, which can be suboptimal for multi-actuator systems.
Method: We propose Lattice, a method that injects temporally-correlated noise into the latent state of the policy network and integrates seamlessly with on- and off-policy algorithms.
Results: In the PyBullet locomotion tasks, Lattice-SAC achieves state-of-the-art results, reaching 18% higher reward than unstructured exploration in the Humanoid environment. In the musculoskeletal control environments of MyoSuite, Lattice-PPO achieves higher reward in most reaching and object manipulation tasks while finding more energy-efficient policies, with energy reductions of 20-60%. Overall, structured action noise in time and actuator space is effective for complex motor control tasks.

In Reinforcement Learning, agents learn policies by exploring and interacting with the environment. Due to the curse of dimensionality, learning policies that map high-dimensional sensory input to motor output is particularly challenging. During training, state of the art methods (SAC, PPO, etc.) explore the environment by perturbing the actuation with independent Gaussian noise. While this unstructured exploration has proven successful in numerous tasks, it can be suboptimal for overactuated systems. When multiple actuators, such as motors or muscles, drive behavior, uncorrelated perturbations risk diminishing each other's effect, or modifying the behavior in a task-irrelevant way. While solutions to introduce time correlation across action perturbations exist, introducing correlation across actuators has been largely ignored. Here, we propose LATent TIme-Correlated Exploration (Lattice), a method to inject temporally-correlated noise into the latent state of the policy network, which can be seamlessly integrated with on- and off-policy algorithms. We demonstrate that the noisy actions generated by perturbing the network's activations can be modeled as a multivariate Gaussian distribution with a full covariance matrix. In the PyBullet locomotion tasks, Lattice-SAC achieves state of the art results, and reaches 18\% higher reward than unstructured exploration in the Humanoid environment. In the musculoskeletal control environments of MyoSuite, Lattice-PPO achieves higher reward in most reaching and object manipulation tasks, while also finding more energy-efficient policies with reductions of 20-60\%. Overall, we demonstrate the effectiveness of structured action noise in time and actuator space for complex motor control tasks. The code is available at: https://github.com/amathislab/lattice.
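To illustrate why latent-space noise yields correlated action noise, as the abstract states, here is a minimal sketch; the shapes and noise scale are illustrative.

```python
import torch

latent_dim, act_dim = 64, 10
W = torch.randn(act_dim, latent_dim) / latent_dim ** 0.5   # policy output layer
z = torch.randn(latent_dim)                                # latent activation
eps = 0.1 * torch.randn(latent_dim)                        # latent-space noise
action = W @ (z + eps)
# The perturbation W @ eps is Gaussian with full covariance 0.01 * W @ W.T,
# i.e., correlated across actuators, unlike independent per-actuator noise.
```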

Learning Score-based Grasping Primitive for Human-assisting Dexterous Grasping
Tianhao Wu Mingdong Wu Jiyao Zhang Yunchong Gan Hao Dong



Research question: How to train a policy that controls a robotic hand's fingers to assist users in grasping objects.
Motivation: Anthropomorphic robotic hands are increasingly important for assisting individuals in situations where human hands are unavailable or unsuitable.
Method: We propose a novel task called human-assisting dexterous grasping and address it with a hand-object-conditional grasping primitive named Grasping Gradient Field (GraspGF), trained by learning the gradient of a synthesized set of successful grasping examples, together with a history-conditional residual policy.
Results: Experimental results demonstrate the superiority of the proposed method over baselines, highlighting its user-awareness and practicality in real-world applications.

The use of anthropomorphic robotic hands for assisting individuals in situations where human hands may be unavailable or unsuitable has gained significant importance. In this paper, we propose a novel task called human-assisting dexterous grasping that aims to train a policy for controlling a robotic hand's fingers to assist users in grasping objects. Unlike conventional dexterous grasping, this task presents a more complex challenge as the policy needs to adapt to diverse user intentions, in addition to the object's geometry. We address this challenge by proposing an approach consisting of two sub-modules: a hand-object-conditional grasping primitive called Grasping Gradient Field (GraspGF), and a history-conditional residual policy. GraspGF learns 'how' to grasp by estimating the gradient of a synthesised success grasping example set, while the residual policy determines 'when' and at what speed the grasping action should be executed based on the trajectory history. Experimental results demonstrate the superiority of our proposed method compared to baselines, highlighting the user-awareness and practicality in real-world applications. The codes and demonstrations can be viewed at https://sites.google.com/view/graspgf.

Generalized Weighted Path Consistency for Mastering Atari Games
Dengwei Zhao Shikui Tu Lei Xu



Research question: How to improve the efficiency and performance of neural-guided search in reinforcement learning.
Motivation: Current neural-guided search methods consume enormous computational resources to achieve remarkable performance and lack theoretical support.
Method: We propose GW-PCZero, which applies path consistency (PC) to MCTS and introduces a weighting mechanism to reduce the variance of the $f$-value estimate caused by scouting uncertainty.
Results: On the Atari 100k benchmark, GW-PCZero achieves 198% mean human performance across 26 games, higher than the state-of-the-art EfficientZero's 194%, while consuming only 25% of EfficientZero's computational resources.

Reinforcement learning with the help of neural-guided search consumes huge computational resources to achieve remarkable performance. Path consistency (PC), i.e., $f$ values on one optimal path should be identical, was previously imposed on MCTS by PCZero to improve the learning efficiency of AlphaZero. However, PCZero not only lacks theoretical support but also considers merely board games. In this paper, PCZero is generalized into GW-PCZero for real applications with non-zero immediate reward. A weighting mechanism is introduced to reduce the variance caused by scouting's uncertainty on the $f$ value estimation. For the first time, it is theoretically proved that neural-guided MCTS is guaranteed to find the optimal solution under the constraint of PC. Experiments are conducted on the Atari $100$k benchmark with $26$ games and GW-PCZero achieves $198\%$ mean human performance, higher than the state-of-the-art EfficientZero's $194\%$, while consuming only $25\%$ of the computational resources consumed by EfficientZero.

Reduced Policy Optimization for Continuous Control with Hard Constraints
Shutong Ding Jingya Wang Yali Du Ye Shi



Research question: How to effectively apply constrained reinforcement learning (RL) to continuous control tasks, particularly in the presence of general hard constraints.
Motivation: Although recent constrained RL algorithms provide certain safety guarantees, deploying them in continuous control tasks with general hard constraints remains challenging.
Method: Inspired by the generalized reduced gradient (GRG) algorithm, a classical constrained optimization technique, we propose Reduced Policy Optimization (RPO), which combines RL with GRG to handle general hard constraints. Following the GRG method, RPO partitions actions into basic and nonbasic actions and outputs the basic actions via a policy network. It then computes the nonbasic actions by solving the equations given by the equality constraints using the obtained basic actions, and updates the policy network by implicitly differentiating the nonbasic actions with respect to the basic actions. In addition, an action projection procedure based on the reduced gradient and a modified Lagrangian relaxation technique ensure that the inequality constraints are satisfied.
Results: RPO is the first attempt to introduce GRG into RL for efficiently handling both equality and inequality hard constraints. On three new benchmarks, RPO outperforms previous constrained RL algorithms in both cumulative reward and constraint violation.

Recent advances in constrained reinforcement learning (RL) have endowed reinforcement learning with certain safety guarantees. However, deploying existing constrained RL algorithms in continuous control tasks with general hard constraints remains challenging, particularly in those situations with non-convex hard constraints. Inspired by the generalized reduced gradient (GRG) algorithm, a classical constrained optimization technique, we propose a reduced policy optimization (RPO) algorithm that combines RL with GRG to address general hard constraints. RPO partitions actions into basic actions and nonbasic actions following the GRG method and outputs the basic actions via a policy network. Subsequently, RPO calculates the nonbasic actions by solving equations based on equality constraints using the obtained basic actions. The policy network is then updated by implicitly differentiating nonbasic actions with respect to basic actions. Additionally, we introduce an action projection procedure based on the reduced gradient and apply a modified Lagrangian relaxation technique to ensure inequality constraints are satisfied. To the best of our knowledge, RPO is the first attempt that introduces GRG to RL as a way of efficiently handling both equality and inequality hard constraints. It is worth noting that there is currently a lack of RL environments with complex hard constraints, which motivates us to develop three new benchmarks: two robotics manipulation tasks and a smart grid operation control task. With these benchmarks, RPO achieves better performance than previous constrained RL algorithms in terms of both cumulative reward and constraint violation. We believe RPO, along with the new benchmarks, will open up new opportunities for applying RL to real-world problems with complex constraints.

State Regularized Policy Optimization on Data with Dynamics Shift
Zhenghai Xue Qingpeng Cai Shuchang Liu Dong Zheng Peng Jiang Kun Gai Bo An



Research question: In many real-world scenarios, reinforcement learning algorithms are trained on data with dynamics shift, i.e., under different environment dynamics. Most current methods address this by training context encoders to identify environment parameters, separating data with different dynamics according to those parameters and training a policy for each.
Motivation: Such methods can be inefficient because the data are used ad hoc: a policy trained for one dynamics cannot benefit from data collected in environments with different dynamics. We find that in many environments with similar structures but different dynamics, the optimal policies have similar stationary state distributions.
Method: We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. The distribution is used to regularize the policy trained in a new environment, yielding the SRPO (State Regularized Policy Optimization) algorithm.
Results: Experimental results show that SRPO makes several context-based algorithms far more data efficient and significantly improves their overall performance.

In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. A majority of current methods address such issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient as data are used \textit{ad hoc}, and policies trained for one dynamics cannot benefit from data collected in all other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit such property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. Such distribution is used to regularize the policy trained in a new environment, leading to the SRPO (\textbf{S}tate \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization) algorithm. To conduct theoretical analyses, the intuition of similar environment structures is characterized by the notion of homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on policies regularized by the stationary state distribution. In practice, SRPO can be an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance.
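One schematic way to write such a state-distribution regularizer is given below, where $d^{\pi}$ is the stationary state distribution of the policy in the new environment and $\hat{d}^{*}$ the distribution learned from shifted data; the KL form and weight $\beta$ are assumptions for illustration.

```latex
\max_{\pi}\; J(\pi) \;-\; \beta\, D_{\mathrm{KL}}\big(d^{\pi} \,\big\|\, \hat{d}^{*}\big)
```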

Encoding Human Behavior in Information Design through Deep Learning
Guanghui Yu Wei Tang Saumik Narayanan Chien-Ju Ho



Research question: This paper initiates the study of behavioral information design through deep learning.
Motivation: In information design, a sender strategically reveals information to persuade a receiver to take certain actions; we address scenarios where the receiver may exhibit behavior patterns that differ from the standard Bayesian rational assumption.
Method: We propose HAIDNet, a neural-network-based optimization framework for information design that can adapt to multiple representations of human behavior.
Results: Extensive simulations show that HAIDNet can recover information policies that are near-optimal compared with known analytical solutions, and can extend to settings that are computationally challenging (e.g., multiple receivers) or that have no known general solution (e.g., receivers who deviate from Bayesian rationality). Real-world human-subject experiments further demonstrate that the framework can capture human behavior from data and produce more effective information policies for real-world human receivers.

We initiate the study of $\textit{behavioral information design}$ through deep learning. In information design, a $\textit{sender}$ aims to persuade a $\textit{receiver}$ to take certain actions by strategically revealing information. We address scenarios in which the receiver might exhibit different behavior patterns other than the standard Bayesian rational assumption. We propose HAIDNet, a neural-network-based optimization framework for information design that can adapt to multiple representations of human behavior. Through extensive simulation, we show that HAIDNet can not only recover information policies that are near-optimal compared with known analytical solutions, but also can extend to designing information policies for settings that are computationally challenging (e.g., when there are multiple receivers) or for settings where there are no known solutions in general (e.g., when the receiver behavior does not follow the Bayesian rational assumption). We also conduct real-world human-subject experiments and demonstrate that our framework can capture human behavior from data and lead to more effective information policy for real-world human receivers.

Dual Self-Awareness Value Decomposition Framework without Individual Global Max for Cooperative MARL
Zhiwei Xu Bin Zhang Dapeng Li Guangchong Zhou Zeren Zhang Guoliang Fan



Research question: Most existing value decomposition methods in cooperative multi-agent reinforcement learning follow the Individual Global Max (IGM) principle, which limits their problem-solving capabilities.
Motivation: To address this, we propose a dual self-awareness value decomposition framework that entirely rejects the IGM premise.
Method: Each agent consists of an ego policy for action selection and an alter-ego value function for credit assignment. By using an explicit search procedure, the value function factorization can dispense with the IGM assumption. On this basis, we also propose a novel anti-ego exploration mechanism to prevent the algorithm from getting stuck in local optima.
Results: As the first fully IGM-free value decomposition method, the framework achieves desirable performance in various cooperative tasks.

Value decomposition methods have gained popularity in the field of cooperative multi-agent reinforcement learning. However, almost all existing methods follow the principle of Individual Global Max (IGM) or its variants, which limits their problem-solving capabilities. To address this, we propose a dual self-awareness value decomposition framework, inspired by the notion of dual self-awareness in psychology, that entirely rejects the IGM premise. Each agent consists of an ego policy for action selection and an alter ego value function to solve the credit assignment problem. The value function factorization can ignore the IGM assumption by utilizing an explicit search procedure. On the basis of the above, we also suggest a novel anti-ego exploration mechanism to avoid the algorithm becoming stuck in a local optimum. As the first fully IGM-free value decomposition method, our proposed framework achieves desirable performance in various cooperative tasks.

Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor
Julien Grand-Clément Marek Petrik



Research question: This paper addresses average and Blackwell optimality in Markov Decision Processes (MDPs) and their relation to the discount factor.
Motivation: Classical objectives for MDPs include discounted, average, and Blackwell optimality, but existing approaches to computing average-optimal policies have limitations and overlook the pathological behavior of optimal policies with respect to the discount factor.
Method: We introduce the Blackwell discount factor and show that when the discount factor exceeds it, all discount-optimal policies are simultaneously average- and Blackwell-optimal. We also derive an upper bound on the Blackwell discount factor, which yields the first reduction from average and Blackwell optimality to discounted optimality.
Results: This work brings new mathematical tools to the analysis of MDPs and enables the first algorithms for computing robust Blackwell-optimal policies.

We introduce the Blackwell discount factor for Markov Decision Processes (MDPs). Classical objectives for MDPs include discounted, average, and Blackwell optimality. Many existing approaches to computing average-optimal policies solve for discount-optimal policies with a discount factor close to $1$, but they only work under strong or hard-to-verify assumptions on the MDP structure such as unichain or ergodicity. We are the first to highlight the shortcomings of the classical definition of Blackwell optimality, which does not lead to simple algorithms for computing Blackwell-optimal policies and overlooks the pathological behaviors of optimal policies as regards the discount factors. To resolve this issue, in this paper, we show that when the discount factor is larger than the Blackwell discount factor $\gamma_{\sf bw}$, all discount-optimal policies become Blackwell- and average-optimal, and we derive a general upper bound on $\gamma_{\sf bw}$. Our upper bound on $\gamma_{\sf bw}$, parametrized by the bit-size of the rewards and transition probabilities of the MDP instance, provides the first reduction from average and Blackwell optimality to discounted optimality, without any assumptions, along with new polynomial-time algorithms. Our work brings new ideas from polynomials and algebraic numbers to the analysis of MDPs. Our results also apply to robust MDPs, enabling the first algorithms to compute robust Blackwell-optimal policies.

Finding Safe Zones of Markov Decision Processes Policies
Lee Cohen Yishay Mansour Michal Moshkovitz



Research question: How to define and find "SafeZones" of Markov Decision Process policies, and how to find optimal ones.
Motivation: To assess a policy's stability, one seeks a subset of states, a SafeZone, to which most of the policy's trajectories are confined; SafeZones with few states and low escape probability are especially interesting.
Method: We study the complexity of finding optimal SafeZones, show that the problem is computationally hard in general, and therefore focus on finding approximate SafeZones.
Results: The main result is a bi-criteria approximation learning algorithm with a factor of almost 2 approximation for both the escape probability and the SafeZone size, using a polynomial-size sample complexity.

Given a policy of a Markov Decision Process, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of a SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general, the problem is computationally hard. For this reason, we concentrate on finding approximate SafeZones. Our main result is a bi-criteria approximation learning algorithm with a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity.
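The escape probability that parameterizes SafeZone quality can be written schematically as below, for a candidate set $F$ and trajectories of horizon $T$; the notation is assumed for illustration.

```latex
\mathrm{esc}(F) \;=\; \Pr_{\tau = (s_1,\dots,s_T) \sim \pi}
  \big[\, \exists\, t \le T :\; s_t \notin F \,\big]
```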

A Unified Algorithm Framework for Unsupervised Discovery of Skills based on Determinantal Point Process
Jiayu Chen Vaneet Aggarwal Tian Lan



Research question: How to learn rich skills under the option framework without supervision from external rewards, a question at the frontier of reinforcement learning research.
Motivation: Existing works fall into two categories: variational option discovery, which maximizes option diversity through a mutual information loss while ignoring coverage, and Laplacian-based methods, which improve option coverage by increasing connectivity of the state space while ignoring diversity.
Method: We show that diversity and coverage in unsupervised option discovery can indeed be unified under the same mathematical framework. Specifically, we explicitly quantify the diversity and coverage of learned options through a novel use of the Determinantal Point Process (DPP) and optimize these objectives to discover options with both superior diversity and coverage.
Results: Our algorithm, ODPP, is extensively evaluated on challenging tasks built with Mujoco and Atari, and outperforms state-of-the-art baselines in both the diversity-driven and coverage-driven categories.

Learning rich skills under the option framework without supervision of external rewards is at the frontier of reinforcement learning research. Existing works mainly fall into two distinctive categories: variational option discovery that maximizes the diversity of the options through a mutual information loss (while ignoring coverage) and Laplacian-based methods that focus on improving the coverage of options by increasing connectivity of the state space (while ignoring diversity). In this paper, we show that diversity and coverage in unsupervised option discovery can indeed be unified under the same mathematical framework. To be specific, we explicitly quantify the diversity and coverage of the learned options through a novel use of Determinantal Point Process (DPP) and optimize these objectives to discover options with both superior diversity and coverage. Our proposed algorithm, ODPP, has undergone extensive evaluation on challenging tasks created with Mujoco and Atari. The results demonstrate that our algorithm outperforms state-of-the-art baselines in both diversity- and coverage-driven categories.

Strategic Apple Tasting
Keegan Harris Chara Podimata Steven Wu



Research question: Algorithmic decision-making in high-stakes domains often involves assigning decisions to agents with incentives to strategically modify their inputs to the algorithm.
Motivation: In addition to handling incentives, in many domains of interest (e.g., lending and hiring) the decision-maker only observes feedback for rounds in which a positive decision is assigned to the agent; this is often called apple tasting (or one-sided) feedback.
Method: We formalize this setting as an online learning problem with apple-tasting feedback, where a principal makes decisions about a sequence of $T$ agents, each represented by a context that may be strategically modified. The goal is to achieve sublinear strategic regret, comparing the principal's performance to that of the best fixed policy in hindsight.
Results: The main result is a learning algorithm that incurs $\tilde{\mathcal{O}}(\sqrt{T})$ strategic regret when the sequence of agents is chosen stochastically. We also give an algorithm that handles adversarially chosen agents, at the cost of $\tilde{\mathcal{O}}(T^{(d+1)/(d+2)})$ strategic regret, where $d$ is the dimension of the context. The algorithms can be easily adapted to the setting where the principal receives bandit feedback, which generalizes both the linear contextual bandit problem and the strategic classification problem.

Algorithmic decision-making in high-stakes domains often involves assigning decisions to agents with incentives to strategically modify their input to the algorithm. In addition to dealing with incentives, in many domains of interest (e.g. lending and hiring) the decision-maker only observes feedback regarding their policy for rounds in which they assign a positive decision to the agent; this type of feedback is often referred to as apple tasting (or one-sided) feedback. We formalize this setting as an online learning problem with apple-tasting feedback where a principal makes decisions about a sequence of $T$ agents, each of which is represented by a context that may be strategically modified. Our goal is to achieve sublinear strategic regret, which compares the performance of the principal to that of the best fixed policy in hindsight, if the agents were truthful when revealing their contexts. Our main result is a learning algorithm which incurs $\tilde{\mathcal{O}}(\sqrt{T})$ strategic regret when the sequence of agents is chosen stochastically. We also give an algorithm capable of handling adversarially-chosen agents, albeit at the cost of $\tilde{\mathcal{O}}(T^{(d+1)/(d+2)})$ strategic regret (where $d$ is the dimension of the context). Our algorithms can be easily adapted to the setting where the principal receives bandit feedback---this setting generalizes both the linear contextual bandit problem (by considering agents with incentives) and the strategic classification problem (by allowing for partial feedback).

Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning
Hongyu Zang Xin Li Leiji Zhang Yang Liu Baigui Sun Riashat Islam Remi Tachet des Combes Romain Laroche



Research question: Although bisimulation-based methods exhibit strong state-representation power in reinforcement learning, their effectiveness in offline RL tasks has fallen short of expectations.
Motivation: This work aims to understand why bisimulation methods succeed in online settings but falter in offline tasks.
Method: Our analysis shows that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation; it also reveals the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we apply the expectile operator to representation learning for offline RL to prevent overfitting to incomplete data, and introduce an appropriate reward scaling strategy to avoid the risk of feature collapse in the representation space.
Results: We implement these recommendations on two state-of-the-art bisimulation-based methods, MICo and SimSR, and demonstrate performance gains on the D4RL and Visual D4RL benchmark suites.

While bisimulation-based approaches hold promise for learning robust state representations for Reinforcement Learning (RL) tasks, their efficacy in offline RL tasks has not been up to par. In some instances, their performance has even significantly underperformed alternative methods. We aim to understand why bisimulation methods succeed in online settings, but falter in offline tasks. Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation. We also shed light on the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we propose to apply the expectile operator for representation learning to our offline RL setting, which helps to prevent overfitting to incomplete data. Meanwhile, by introducing an appropriate reward scaling strategy, we avoid the risk of feature collapse in representation space. We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR, and demonstrate performance gains on two benchmark suites: D4RL and Visual D4RL. Codes are provided at \url{https://github.com/zanghyu/Offline_Bisimulation}.
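As one concrete ingredient, the expectile operator mentioned above can be written as an asymmetric least-squares loss. The sketch below shows a generic expectile loss in PyTorch; how MICo and SimSR plug it into their bisimulation targets follows the paper, not this snippet.

```python
import torch

def expectile_loss(pred: torch.Tensor, target: torch.Tensor,
                   tau: float = 0.7) -> torch.Tensor:
    """Asymmetric least-squares (expectile) loss, a generic sketch.

    For tau > 0.5, positive errors (target above prediction) are penalized
    more, so the minimizer tracks an upper expectile of the target rather
    than its mean -- useful when missing transitions make the sampled
    targets an incomplete picture of the true backup.
    """
    diff = target - pred
    weight = torch.where(diff > 0, torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

# Toy usage: the loss is dominated by under-estimated targets.
pred = torch.zeros(4)
target = torch.tensor([-1.0, 0.5, 1.0, 2.0])
print(expectile_loss(pred, target))
```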

Selectively Sharing Experiences Improves Multi-Agent Reinforcement Learning
Matthias Gerstgrasser Tom Danino Sarah Keren



Research question: How to share experiences effectively between agents in multi-agent reinforcement learning so as to improve learning.
Motivation: Most existing multi-agent RL methods require centralized training, and naive experience sharing performs poorly.
Method: We propose Selective Multi-Agent Prioritized Experience Relay, a new multi-agent RL approach in which agents share only a small number of relevant experiences with one another, requiring only a limited communication channel rather than heavy communication infrastructure (see the sketch after the abstract below).
Results: Experiments show the method outperforms no-sharing decentralized training and state-of-the-art multi-agent RL methods; sharing only a few highly relevant experiences works better than sharing everything, and the performance uplift is robust across a range of hyperparameters and DQN variants.

We present a novel multi-agent RL approach, Selective Multi-Agent Prioritized Experience Relay, in which agents share with other agents a limited number of transitions they observe during training. The intuition behind this is that even a small number of relevant experiences from other agents could help each agent learn. Unlike many other multi-agent RL algorithms, this approach allows for largely decentralized training, requiring only a limited communication channel between agents. We show that our approach outperforms baseline no-sharing decentralized training and state-of-the art multi-agent RL algorithms. Further, sharing only a small number of highly relevant experiences outperforms sharing all experiences between agents, and the performance uplift from selective experience sharing is robust across a range of hyperparameters and DQN variants.
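A minimal sketch of the selection step, assuming absolute TD error as the relevance proxy; the paper's actual relevance criterion and communication protocol may differ.

```python
import numpy as np

def transitions_to_relay(td_errors, transitions, budget=8):
    """Pick a small set of high-priority transitions for an agent to relay
    to its teammates (illustrative only).

    td_errors: (N,) priority proxy, e.g. absolute TD error per transition.
    budget: size of the limited communication channel.
    """
    top = np.argsort(np.abs(np.asarray(td_errors)))[-budget:]
    return [transitions[i] for i in top]

# Toy usage: only the most surprising transitions are shared.
errs = [0.1, 2.3, 0.05, 1.7, 0.2]
print(transitions_to_relay(errs, ["t0", "t1", "t2", "t3", "t4"], budget=2))
```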

EDGI: Equivariant Diffusion for Planning with Embodied Agents
Johann Brehmer Joey Bose Pim De Haan Taco Cohen



Research question: Existing planning and model-based reinforcement learning algorithms often ignore the rich geometric structure of the world, leading to sample inefficiency and poor generalization.
Motivation: To address this, we propose the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm that is equivariant with respect to the spatial symmetry group SE(3), the discrete-time translation group ℤ, and the object permutation group Sₙ.
Method: EDGI follows the Diffuser framework of Janner et al. (2022), treating both world-model learning and planning as a conditional generative modeling problem and training a diffusion model on an offline trajectory dataset. We introduce a new SE(3) × ℤ × Sₙ-equivariant diffusion model that supports multiple representations and integrate it into a planning loop, where conditioning and classifier guidance softly break the symmetry for specific tasks as needed.
Results: On object manipulation and navigation tasks, EDGI is more sample efficient and generalizes better across the symmetry group than non-equivariant models.

Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group SE(3), the discrete-time translation group ℤ, and the object permutation group Sₙ. EDGI follows the Diffuser framework by Janner et al. (2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new SE(3) × ℤ × Sₙ-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier guidance let us softly break the symmetry for specific tasks as needed. On object manipulation and navigation tasks, EDGI is substantially more sample efficient and generalizes better across the symmetry group than non-equivariant models.

A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories
Kai Yan Alex Schwing Yu-Xiong Wang



Research question: This paper addresses offline imitation from observations for Markov decision processes (MDPs) in which only task-specific expert states and task-agnostic non-expert state-action pairs are available.
Motivation: Offline imitation is useful in real-world settings where arbitrary interactions are costly and expert actions are unavailable. Although existing "DIstribution Correction Estimation" (DICE) methods minimize the state-occupancy divergence between expert and learner policies and retrieve a policy via weighted behavior cloning, their results are unstable when learning from incomplete trajectories, due to non-robust optimization in the dual domain.
Method: To address this, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning, with the terms of the sum scaled by the output of a discriminator trained to identify expert states.
Results: Experiments show TAILO performs well across multiple testbeds, particularly with incomplete trajectories or when segments of expert behavior exist in the task-agnostic data, a common assumption in prior work.

Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art ‘DIstribution Correction Estimation’ (DICE) methods minimize divergence of state occupancy between expert and learner policies and retrieve a policy with weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to a non-robust optimization in the dual domain. To address the issue, in this paper, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms for the sum are scaled by the output of a discriminator, which aims to identify expert states. Despite simplicity, TAILO works well if there exist trajectories or segments of expert behavior in the task-agnostic data, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective, particularly with incomplete trajectories.
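The weighting scheme can be sketched in a few lines: each step is weighted by the discounted sum of future discriminator scores, computed by a backward recursion. This is an illustration of the idea under assumed inputs, not the paper's code.

```python
import numpy as np

def tailo_weights(disc_scores: np.ndarray, gamma: float = 0.98) -> np.ndarray:
    """Weights for weighted behavior cloning: the discounted sum of a
    discriminator's expert-likeness scores along the future of the
    trajectory (sketch; the paper's exact scaling may differ).

    disc_scores: (T,) per-step scores, higher = more expert-like.
    """
    T = len(disc_scores)
    w = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):   # backward recursion: w_t = s_t + gamma * w_{t+1}
        running = disc_scores[t] + gamma * running
        w[t] = running
    return w

# Steps leading into expert-like segments receive larger cloning weights.
scores = np.array([0.1, 0.1, 0.9, 0.95, 0.9])
print(tailo_weights(scores).round(2))
```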

Offline Imitation Learning with Variational Counterfactual Reasoning
Zexu Sun Bowei He Jinxin Liu Xu Chen Chen Ma Shuai Zhang



Research question: In many real-world scenarios such as robotics manipulation, offline datasets are collected from reward-free, suboptimal behaviors; with scarce expert data, agents tend to simply memorize poor trajectories and are vulnerable to variations in the environment, lacking the ability to generalize to new environments.
Motivation: To automatically generate high-quality expert data and improve the agent's generalization ability, we propose OILCA, a framework that performs data augmentation through counterfactual reasoning.
Method: We leverage an identifiable variational autoencoder to generate counterfactual samples for expert data augmentation, and theoretically analyze the influence of the generated expert data and the improvement in generalization.
Results: Experiments show that our method substantially outperforms various baselines on the DeepMind Control Suite benchmark for in-distribution performance and on the CausalWorld benchmark for out-of-distribution generalization.

In offline imitation learning (IL), an agent aims to learn an optimal expert behavior policy without additional online environment interactions. However, in many real-world scenarios, such as robotics manipulation, the offline dataset is collected from suboptimal behaviors without rewards. Due to the scarce expert data, the agents usually suffer from simply memorizing poor trajectories and are vulnerable to the variations in the environments, lacking the capability of generalizing to new environments. To automatically generate high-quality expert data and improve the generalization ability of the agent, we propose a framework named \underline{O}ffline \underline{I}mitation \underline{L}earning with \underline{C}ounterfactual data \underline{A}ugmentation (OILCA) by doing counterfactual inference. In particular, we leverage an identifiable variational autoencoder to generate \textit{counterfactual} samples for expert data augmentation. We theoretically analyze the influence of the generated expert data and the improvement of generalization. Moreover, we conduct extensive experiments to demonstrate that our approach significantly outperforms various baselines on both \textsc{DeepMind Control Suite} benchmark for in-distribution performance and \textsc{CausalWorld} benchmark for out-of-distribution generalization.

Efficient Symbolic Policy Learning with Differentiable Symbolic Expression
Jiaming Guo Rui Zhang Shaohui Peng Qi Yi Xing Hu Ruizhi Chen Zidong Du Xishan Zhang Ling Li Qi Guo Yunji Chen



Research question: How to learn symbolic policies efficiently from scratch and make them applicable to unseen tasks.
Motivation: The complexity of deep reinforcement learning policies makes them difficult to understand and deploy, while existing symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies.
Method: We propose Efficient Symbolic Policy Learning (ESPL), a gradient-based method that learns symbolic policies from scratch in an end-to-end way. We introduce a symbolic network as the search space and use a path selector to find compact symbolic policies; the policy is thereby represented as a differentiable symbolic expression and trained in an off-policy manner, which further improves efficiency. In addition, unlike previous symbolic policies, which are restricted to single-task RL by their complexity, we extend ESPL to meta-RL to generate symbolic policies for unseen tasks.
Results: Experiments show our approach generates symbolic policies with higher performance and greatly improves data efficiency in single-task RL; in meta-RL, the proposed symbolic policy achieves higher performance and efficiency than neural network policies and shows potential for interpretability.

Deep reinforcement learning (DRL) has led to a wide range of advances in sequential decision-making tasks. However, the complexity of neural network policies makes it difficult to understand and deploy with limited computational resources. Currently, employing compact symbolic expressions as symbolic policies is a promising strategy to obtain simple and interpretable policies. Previous symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies. In this paper, we propose an efficient gradient-based learning method named Efficient Symbolic Policy Learning (ESPL) that learns the symbolic policy from scratch in an end-to-end way. We introduce a symbolic network as the search space and employ a path selector to find the compact symbolic policy. By doing so we represent the policy with a differentiable symbolic expression and train it in an off-policy manner which further improves the efficiency. In addition, in contrast with previous symbolic policies which only work in single-task RL because of complexity, we expand ESPL on meta-RL to generate symbolic policies for unseen tasks. Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that compared with neural network policies the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.

Counterfactual Conservative Q Learning for Offline Multi-agent Reinforcement Learning
Jianzhun Shao Yun Qu Chen Chen Hongchang Zhang Xiangyang Ji



Research question: This paper tackles the challenges of offline multi-agent reinforcement learning, where distribution shift and high dimensionality make out-of-distribution (OOD) actions and value overestimation excessively severe.
Motivation: The coupling between the distribution shift common in offline settings and the high dimensionality common in multi-agent settings makes the OOD-action and value-overestimation phenomena excessively severe.
Method: We propose CounterFactual Conservative Q-Learning (CFCQL), a novel multi-agent offline RL algorithm for conservative value estimation. Rather than treating all agents as a single high-dimensional entity, CFCQL computes conservative regularization for each agent separately in a counterfactual way and then combines the terms linearly to realize an overall conservative value estimate.
Results: Experiments demonstrate that CFCQL outperforms existing methods on most datasets, and by a remarkable margin on some of them.

Offline multi-agent reinforcement learning is challenging due to the coupling effect of both distribution shift issue common in offline setting and the high dimension issue common in multi-agent setting, making the action out-of-distribution (OOD) and value overestimation phenomenon excessively severe. To mitigate this problem, we propose a novel multi-agent offline RL algorithm, named CounterFactual Conservative Q-Learning (CFCQL) to conduct conservative value estimation. Rather than regarding all the agents as a high dimensional single one and directly applying single agent conservative methods to it, CFCQL calculates conservative regularization for each agent separately in a counterfactual way and then linearly combines them to realize an overall conservative value estimation. We prove that it still enjoys the underestimation property and the performance guarantee as those single agent conservative methods do, but the induced regularization and safe policy improvement bound are independent of the agent number, which is therefore theoretically superior to the direct treatment referred to above, especially when the agent number is large. We further conduct experiments on four environments including both discrete and continuous action settings on both existing and our man-made datasets, demonstrating that CFCQL outperforms existing methods on most datasets and even with a remarkable margin on some of them.

SPRING: Studying Papers and Reasoning to play Games
Yue Wu So Yeon Min Shrimai Prabhumoye Yonatan Bisk Ruslan Salakhutdinov Amos Azaria Tom Mitchell Yuanzhi Li



Research question: Open-world survival games pose significant challenges for AI algorithms because they demand multi-tasking, deep exploration, and goal prioritization. Although reinforcement learning is popular for solving such games, its high sample complexity limits its effectiveness in complex open-world settings.
Motivation: We propose SPRING, a new approach that reads Crafter's original academic paper and uses the knowledge learned to reason and play the game through a large language model (LLM).
Method: Our SPRING framework uses a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. The best action to take in the environment is determined by traversing the DAG and computing the LLM's response to each node in topological order, with the LLM's answer to the final node translating directly into an environment action.
Results: Experiments show that, when prompted with a consistent chain of thought, LLMs have great potential for completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines without any training.

Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read Crafter's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to the final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training. Finally, we show the potential of Crafter as a test bed for LLMs. Code at github.com/holmeswww/SPRING
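The DAG traversal is straightforward to sketch. Below, `ask_llm` is a hypothetical stand-in for the LLM call and the toy question graph is invented for illustration; only the control flow (topological traversal, parent answers fed into child prompts, final answer mapped to an action) mirrors the described framework.

```python
from graphlib import TopologicalSorter

def ask_llm(question: str, context: dict) -> str:
    """Hypothetical stand-in for an LLM call (not the paper's code)."""
    return f"answer({question})"

def spring_step(questions: dict, deps: dict, game_context: str) -> str:
    """Answer every node of the question DAG in topological order,
    conditioning each question on the answers of its parents; the final
    node's answer is what gets mapped to an environment action.

    deps: node -> set of parent nodes (dependencies).
    """
    answers = {}
    order = list(TopologicalSorter(deps).static_order())
    for node in order:
        parent_answers = {p: answers[p] for p in deps.get(node, ())}
        answers[node] = ask_llm(questions[node],
                                {"game": game_context, **parent_answers})
    return answers[order[-1]]   # e.g. "chop the tree in front of you"

deps = {"q_action": {"q_threats", "q_resources"},
        "q_threats": set(), "q_resources": set()}
questions = {"q_threats": "What threatens the agent?",
             "q_resources": "What resources are visible?",
             "q_action": "What is the best next action?"}
print(spring_step(questions, deps, "crafter observation"))
```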

Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
Yue Wu Yewen Fan Paul Pu Liang Amos Azaria Yuanzhi Li Tom Mitchell



Research question: The challenge of high sample complexity in reinforcement learning (RL).
Motivation: Humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents such as instruction manuals; such data can inform agents of valuable features and policies, or of task-specific environment dynamics and reward structures.
Method: We propose the Read and Reward framework, which extracts relevant information from manuals released by Atari game developers and provides it to a standard A2C RL agent as an auxiliary reward to help it learn a task-specific policy.
Results: Experiments show that various RL algorithms achieve significant improvements in performance and training speed when assisted by our design.

High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design. Code at github.com/Holmeswww/RnR

Information Design in Multi-Agent Reinforcement Learning
Yue Lin Wenhao Li Hongyuan Zha Baoxiang Wang



Research question: How to use information design to influence other agents in a reinforcement learning (RL) environment so that their behavior becomes more favorable to the ego agent.
Motivation: In real-world tasks, other agents have their own goals and react adaptively to the ego agent's behavior; to thrive in such environments, the ego agent needs to influence the other agents so that their actions become more helpful and less harmful.
Method: This work formulates a Markov signaling game and develops the notions of the signaling gradient and extended obedience constraints to address the problem.
Results: The algorithm is efficient on various mixed-motive tasks and provides further insights into computational economics.

Reinforcement learning (RL) is inspired by the way human infants and animals learn from the environment. The setting is somewhat idealized because, in actual tasks, other agents in the environment have their own goals and behave adaptively to the ego agent. To thrive in those environments, the agent needs to influence other agents so their actions become more helpful and less harmful. Research in computational economics distills two ways to influence others directly: by providing tangible goods (mechanism design) and by providing information (information design). This work investigates information design problems for a group of RL agents. The main challenges are two-fold. One is the information provided will immediately affect the transition of the agent trajectories, which introduces additional non-stationarity. The other is the information can be ignored, so the sender must provide information that the receiver is willing to respect. We formulate the Markov signaling game, and develop the notions of signaling gradient and the extended obedience constraints that address these challenges. Our algorithm is efficient on various mixed-motive tasks and provides further insights into computational economics. Our code is publicly available at https://github.com/YueLin301/InformationDesignMARL.

Social Motion Prediction with Cognitive Hierarchies
Wentao Zhu Jason Qin Yuke Lou Hang Ye Xiaoxuan Ma Hai Ci Yizhou Wang



Research question: This study addresses the social motion prediction problem, i.e., anticipating the motions of others and planning one's own motions accordingly.
Motivation: Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly; this work strives to replicate that ability by tackling social motion prediction.
Method: We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset set in the context of team sports, featuring intense, strategic human-human interactions and diverse pose distributions. Reformulating the problem from a multi-agent reinforcement learning perspective, we combine behavioral cloning and generative adversarial imitation learning to improve learning efficiency and generalization; accounting for the cognitive aspects of the human social action planning process, we further develop a cognitive hierarchy framework to predict strategic human social interactions.
Results: Comprehensive experiments validate the effectiveness of the proposed dataset and approach.

Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly. In this study, we strive to replicate this ability by addressing the social motion prediction problem. We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset under the context of team sports, which features intense and strategic human interactions and diverse pose distributions. By reformulating the problem from a multi-agent reinforcement learning perspective, we incorporate behavioral cloning and generative adversarial imitation learning to boost learning efficiency and generalization. Furthermore, we take into account the cognitive aspects of the human social action planning process and develop a cognitive hierarchy framework to predict strategic human social interactions. We conduct comprehensive experiments to validate the effectiveness of our proposed dataset and approach.

NetHack is Hard to Hack
Ulyana Piterbarg Lerrel Pinto Rob Fergus



Research question: This paper addresses the underperformance of neural networks in long-horizon tasks and open-ended environments, particularly in the multi-modal-observation game NetHack.
Motivation: Although neural networks have achieved remarkable results on many control problems, their performance in long-horizon tasks and open-ended environments, especially games like NetHack, lags behind symbolic agents.
Method: We analyze the winning symbolic agent in depth and extend its codebase to generate the largest available demonstration dataset. Using this dataset, we study (i) the advantages of an action hierarchy; (ii) enhancements to the neural architecture; and (iii) the integration of reinforcement learning with imitation learning.
Results: Our investigation produces a state-of-the-art neural agent that surpasses previous fully neural policies by 127% in the offline setting and 25% in the online setting. However, we also find that mere scaling is insufficient to close the performance gap with the best symbolic models, or even with top human players.

Neural policy learning methods have achieved remarkable results in various control problems, ranging from Atari games to simulated locomotion. However, these methods struggle in long-horizon tasks, especially in open-ended environments with multi-modal observations, such as the popular dungeon-crawler game, NetHack. Intriguingly, the NeurIPS 2021 NetHack Challenge revealed that symbolic agents outperformed neural approaches by over four times in median game score. In this paper, we delve into the reasons behind this performance gap and present an extensive study on neural policy learning for NetHack. To conduct this study, we analyze the winning symbolic agent, extending its codebase to track internal strategy selection in order to generate one of the largest available demonstration datasets. Utilizing this dataset, we examine (i) the advantages of an action hierarchy; (ii) enhancements in neural architecture; and (iii) the integration of reinforcement learning with imitation learning. Our investigations produce a state-of-the-art neural agent that surpasses previous fully neural policies by 127% in offline settings and 25% in online settings on median game score. However, we also demonstrate that mere scaling is insufficient to bridge the performance gap with the best symbolic models or even the top human players.

Reinforcement Learning with Fast and Forgetful Memory
Steven Morad Ryan Kortvelesy Stephan Liwicki Amanda Prorok



Research question: How to improve the efficiency of memory and the speed of training in reinforcement learning.
Motivation: Most real-world tasks are partially observable and require memory; however, existing memory models are largely borrowed from supervised learning, whose training and efficiency characteristics differ from those of reinforcement learning.
Method: We propose Fast and Forgetful Memory, a memory model designed specifically for reinforcement learning. It constrains the model search space via strong structural priors inspired by computational psychology and serves as a drop-in replacement for recurrent neural networks in recurrent RL algorithms.
Results: Experiments show that Fast and Forgetful Memory outperforms recurrent neural networks across various recurrent benchmarks and algorithms without changing any hyperparameters; moreover, it trains two orders of magnitude faster than RNNs, owing to its logarithmic time and linear space complexity.

Nearly all real world tasks are inherently partially observable, necessitating the use of memory in Reinforcement Learning (RL). Most model-free approaches summarize the trajectory into a latent Markov state using memory models borrowed from Supervised Learning (SL), even though RL tends to exhibit different training and efficiency characteristics. Addressing this discrepancy, we introduce Fast and Forgetful Memory, an algorithm-agnostic memory model designed specifically for RL. Our approach constrains the model search space via strong structural priors inspired by computational psychology. It is a drop-in replacement for recurrent neural networks (RNNs) in recurrent RL algorithms, achieving greater reward than RNNs across various recurrent benchmarks and algorithms _without changing any hyperparameters_. Moreover, Fast and Forgetful Memory exhibits training speeds two orders of magnitude faster than RNNs, attributed to its logarithmic time and linear space complexity. Our implementation is available at https://github.com/proroklab/ffm.

Active Vision Reinforcement Learning under Limited Visual Observability
Jinghuan Shang Michael S Ryoo



Research question: This study investigates Active Vision Reinforcement Learning (ActiveVision-RL), in which an embodied agent in a partially observable environment simultaneously learns an action policy for the task and a policy controlling its visual observations.
Motivation: Because the action policy and the visual observation policy influence each other, ActiveVision-RL faces the challenge of coordinating the two policies.
Method: We propose SUGARL, a Sensorimotor Understanding Guided Active Reinforcement Learning framework that models the motor and sensory policies separately but learns them jointly with an intrinsic sensorimotor reward. This learnable reward, assigned by a sensorimotor reward module inspired by the sensorimotor stage in humans, incentivizes the sensory policy to select observations that are optimal for inferring its own motor action.
Results: Through a series of experiments, we find the method effective across a range of observability conditions and adaptable to existing RL algorithms; the visual observation policies learned with our method are observed to exhibit effective active vision strategies.

In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as motor policy and the latter as sensory policy. For example, humans solve real world tasks by hand manipulation (motor policy) together with eye movements (sensory policy). ActiveVision-RL poses challenges on coordinating two policies given their mutual influence. We propose SUGARL, Sensorimotor Understanding Guided Active Reinforcement Learning, a framework that models motor and sensory policies separately, but jointly learns them using an intrinsic sensorimotor reward. This learnable reward is assigned by a sensorimotor reward module and incentivizes the sensory policy to select observations that are optimal for inferring its own motor action, inspired by the sensorimotor stage of humans. Through a series of experiments, we show the effectiveness of our method across a range of observability conditions and its adaptability to existing RL algorithms. The sensory policies learned through our method are observed to exhibit effective active vision strategies.

Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback
Minyoung Hwang Gunmin Lee Hogun Kee Chan Woo Kim Kyungjae Lee Songhwai Oh



Research question: Existing reinforcement learning from human feedback (RLHF) models are inefficient because each piece of human feedback yields only a single preference datum.
Motivation: To tackle this problem, we propose SeqRank, a new RLHF framework that uses sequential preference ranking to improve feedback efficiency.
Method: Our method samples trajectories sequentially, iteratively selecting a defender from the set of previously chosen trajectories $\mathcal{K}$ and a challenger from the set of unchosen trajectories $\mathcal{U}\setminus\mathcal{K}$. We propose two trajectory comparison methods with different defender sampling strategies: (1) sequential pairwise comparison, which selects the most recent trajectory, and (2) root pairwise comparison, which selects the most preferred trajectory in $\mathcal{K}$. We construct a data structure and rank trajectories by preference to augment additional queries.
Results: Our method improves average feedback efficiency by at least 39.2% over the baseline and strikes a balance between feedback efficiency and data dependency. Root pairwise comparison improves the average reward in locomotion tasks by 29.0% and the success rate in manipulation tasks by 25.0%.

Reinforcement learning from human feedback (RLHF) alleviates the problem of designing a task-specific reward function in reinforcement learning by learning it from human preference. However, existing RLHF models are considered inefficient as they produce only a single preference data from each human feedback. To tackle this problem, we propose a novel RLHF framework called SeqRank, that uses sequential preference ranking to enhance the feedback efficiency. Our method samples trajectories in a sequential manner by iteratively selecting a defender from the set of previously chosen trajectories $\mathcal{K}$ and a challenger from the set of unchosen trajectories $\mathcal{U}\setminus\mathcal{K}$, where $\mathcal{U}$ is the replay buffer. We propose two trajectory comparison methods with different defender sampling strategies: (1) sequential pairwise comparison that selects the most recent trajectory and (2) root pairwise comparison that selects the most preferred trajectory from $\mathcal{K}$. We construct a data structure and rank trajectories by preference to augment additional queries. The proposed method results in at least 39.2% higher average feedback efficiency than the baseline and also achieves a balance between feedback efficiency and data dependency. We examine the convergence of the empirical risk and the generalization bound of the reward model with Rademacher complexity. While both trajectory comparison methods outperform conventional pairwise comparison, root pairwise comparison improves the average reward in locomotion tasks and the average success rate in manipulation tasks by 29.0% and 25.0%, respectively. The source code and the videos are provided in the supplementary material.
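The defender/challenger scheme can be sketched as follows; `prefer` stands in for the human preference query, and the toy return-based oracle is purely illustrative.

```python
import random

def seqrank(trajectories, prefer, root=False):
    """Sequential preference ranking (a sketch of the SeqRank idea).

    trajectories: list of candidate trajectories (the buffer U).
    prefer(a, b): preference oracle returning the preferred trajectory
                  (stands in for a human query; hypothetical here).
    root: if True, use root pairwise comparison (defender = best so far);
          otherwise, sequential pairwise comparison (defender = most recent).
    Returns the chosen set K and the list of (winner, loser) preference pairs.
    """
    unchosen = list(trajectories)
    chosen = [unchosen.pop(random.randrange(len(unchosen)))]
    pairs, best = [], chosen[0]
    while unchosen:
        challenger = unchosen.pop(random.randrange(len(unchosen)))
        defender = best if root else chosen[-1]
        winner = prefer(defender, challenger)
        loser = challenger if winner is defender else defender
        pairs.append((winner, loser))       # one query, one new preference pair
        chosen.append(challenger)
        if winner is challenger:
            best = challenger
    return chosen, pairs

# Toy oracle: prefer the trajectory with the larger return.
trajs = [("t%d" % i, r) for i, r in enumerate([3, 7, 1, 9, 5])]
chosen, pairs = seqrank(trajs, prefer=lambda a, b: a if a[1] >= b[1] else b,
                        root=True)
print(pairs)
```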

Elastic Decision Transformer
Yueh-Hua Wu Xiaolong Wang Masashi Hamaya



Research question: The existing Decision Transformer (DT) struggles with trajectory stitching, i.e., generating an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories.
Motivation: To overcome DT's difficulty with trajectory stitching, this paper proposes the Elastic Decision Transformer (EDT).
Method: EDT facilitates trajectory stitching during action inference at test time by adjusting the history length maintained in DT: it retains a longer history when the preceding trajectory is optimal and a shorter one when it is sub-optimal, enabling it to "stitch" with a more optimal trajectory.
Results: Experiments show that EDT bridges the performance gap between DT-based and Q-learning-based approaches; in particular, EDT outperforms Q-learning-based methods in a multi-task regime on the D4RL locomotion benchmark and Atari games.

This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants. Although DT purports to generate an optimal trajectory, empirical evidence suggests it struggles with trajectory stitching, a process involving the generation of an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories. The proposed EDT differentiates itself by facilitating trajectory stitching during action inference at test time, achieved by adjusting the history length maintained in DT. Further, the EDT optimizes the trajectory by retaining a longer history when the previous trajectory is optimal and a shorter one when it is sub-optimal, enabling it to "stitch" with a more optimal trajectory. Extensive experimentation demonstrates EDT's ability to bridge the performance gap between DT-based and Q Learning-based approaches. In particular, the EDT outperforms Q Learning-based methods in a multi-task regime on the D4RL locomotion benchmark and Atari games.
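The test-time mechanism can be illustrated with a small sketch: try several history lengths, keep the one whose conditioning promises the highest achievable return, and act on that truncated history. `DummyEDT` and its `max_return` head are hypothetical stand-ins for the learned return estimator, not the paper's model.

```python
class DummyEDT:
    """Hypothetical stand-in for a trained EDT; for illustration only."""
    def max_return(self, history):
        # Pretend recent steps are more informative: score by average.
        return sum(history) / len(history)

    def act(self, history):
        return f"action conditioned on last {len(history)} steps"

def elastic_action(model, history, lengths=(1, 2, 4, 8)):
    """EDT test-time idea (sketch): evaluate several candidate history
    lengths, keep the one promising the highest achievable return, and
    act on that truncated history."""
    feasible = [L for L in lengths if L <= len(history)]
    best = max(feasible, key=lambda L: model.max_return(history[-L:]))
    return model.act(history[-best:])

# A history whose recent steps are good: a short history wins.
print(elastic_action(DummyEDT(), [0.1, 0.2, 0.9, 1.0]))
```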

Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples
Hao Sun Alihan Hüyük Daniel Jarrett Mihaela van der Schaar



Research question: How to learn controllers from offline data in decision-making systems while guaranteeing the accountability of decisions in responsibility-sensitive domains such as healthcare.
Motivation: Learning controllers from offline data is an important research area, but in responsibility-sensitive domains such as healthcare, decision accountability has not been adequately addressed.
Method: This paper introduces the Accountable Offline Controller (AOC), which employs the offline dataset as a Decision Corpus and performs accountable control based on a tailored subset of examples, called the Corpus Subset. AOC operates effectively in low-data scenarios, extends to the strictly offline imitation setting, and displays qualities of both conservation and adaptability.
Results: We assess AOC's performance in both simulated and real-world healthcare scenarios, highlighting its ability to handle offline control tasks with high performance while maintaining accountability.

Learning controllers with offline data in decision-making systems is an essential area of research due to its potential to reduce the risk of applications in real-world systems. However, in responsibility-sensitive settings such as healthcare, decision accountability is of paramount importance, yet has not been adequately addressed by the literature. This paper introduces the Accountable Offline Controller (AOC) that employs the offline dataset as the Decision Corpus and performs accountable control based on a tailored selection of examples, referred to as the Corpus Subset. AOC operates effectively in low-data scenarios, can be extended to the strictly offline imitation setting, and displays qualities of both conservation and adaptability. We assess AOC's performance in both simulated and real-world healthcare scenarios, emphasizing its capability to manage offline control tasks with high levels of performance while maintaining accountability.

topic-7

Topic words :  algorithm,  problem,  algorithms,  learning,  optimal,  gradient,  bounds,  optimization

Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures
Hamish Flynn David Reeb Melih Kandemir Jan Peters



Research question: This paper addresses the stochastic linear bandit problem and presents improved algorithms with worst-case regret guarantees.
Motivation: Facing an unknown reward function, the widely used principle of "optimism in the face of uncertainty" reduces a stochastic bandit problem to the construction of a confidence sequence; the performance of the resulting bandit algorithm depends on the size of the confidence sequence, with smaller confidence sets yielding better empirical performance and stronger regret guarantees.
Method: We use a novel tail bound for adaptive martingale mixtures to construct confidence sequences suitable for stochastic bandits. These confidence sequences allow efficient action selection via convex programming, and we prove that a linear bandit algorithm based on them is guaranteed to achieve competitive worst-case regret.
Results: Our confidence sequences are tighter than competitors', both empirically and theoretically; finally, we demonstrate that the tighter confidence sequences yield improved performance on several hyperparameter tuning tasks.

We present improved algorithms with worst-case regret guarantees for the stochastic linear bandit problem. The widely used "optimism in the face of uncertainty" principle reduces a stochastic bandit problem to the construction of a confidence sequence for the unknown reward function. The performance of the resulting bandit algorithm depends on the size of the confidence sequence, with smaller confidence sets yielding better empirical performance and stronger regret guarantees. In this work, we use a novel tail bound for adaptive martingale mixtures to construct confidence sequences which are suitable for stochastic bandits. These confidence sequences allow for efficient action selection via convex programming. We prove that a linear bandit algorithm based on our confidence sequences is guaranteed to achieve competitive worst-case regret. We show that our confidence sequences are tighter than competitors, both empirically and theoretically. Finally, we demonstrate that our tighter confidence sequences give improved performance in several hyperparameter tuning tasks.
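For context, the sketch below shows the generic optimistic action-selection step for a linear bandit with an ellipsoidal confidence set. It is standard LinUCB-style code, not the paper's method: the martingale-mixture bounds would replace the radius `beta` with a tighter confidence sequence whose action selection is done via convex programming rather than this closed form.

```python
import numpy as np

def ucb_action(actions, A, b, beta):
    """Optimistic action choice for a stochastic linear bandit (generic
    sketch under an ellipsoidal confidence set).

    actions: (K, d) candidate feature vectors
    A: (d, d) regularized Gram matrix, sum_s x_s x_s^T + lam * I
    b: (d,)   reward-weighted feature sum, sum_s r_s x_s
    beta: confidence-set radius (smaller radius -> less over-exploration)
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                        # ridge estimate of rewards
    widths = np.sqrt(np.einsum('kd,de,ke->k', actions, A_inv, actions))
    return int(np.argmax(actions @ theta_hat + beta * widths))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
print(ucb_action(X, np.eye(3), np.zeros(3), beta=1.0))
```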

Ordering-based Conditions for Global Convergence of Policy Gradient Methods
Jincheng Mei Bo Dai Alekh Agarwal Mohammad Ghavamzadeh Csaba Szepesvari Dale Schuurmans



Research question: For finite-arm bandits with linear function approximation, under what conditions do policy gradient (PG) methods converge globally?
Motivation: Several key observations frame the study: global convergence can be achieved under linear function approximation without policy or reward realizability, for both standard Softmax PG and natural policy gradient (NPG); approximation error is not a key quantity for characterizing global convergence in either algorithm; and the representation conditions implying global convergence differ between the two algorithms. Together, these call into question approximation error as the appropriate quantity for characterizing global convergence under linear function approximation.
Method: Motivated by these observations, we establish new general results that tie global convergence to ordering properties of the representation, namely whether it preserves the rank of the optimal action or the ranking of rewards, rather than to approximation error.
Results: NPG with linear function approximation achieves global convergence if and only if the projection of the reward onto the representable space preserves the optimal action's rank; Softmax PG converges globally if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. Experimental results support these theoretical findings.

We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. First, we establish a few key observations that frame the study: \textbf{(i)} Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). \textbf{(ii)} Approximation error is not a key quantity for characterizing global convergence in either algorithm. \textbf{(iii)} The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. Second, motivated by these observations, we establish new general results: \textbf{(i)} NPG with linear function approximation achieves global convergence \emph{if and only if} the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. \textbf{(ii)} The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.

Tester-Learners for Halfspaces: Universal Algorithms
Aravind Gollakota Adam Klivans Konstantinos Stavropoulos Arsen Vasilyan



Research question: Developing a universal tester-learner for halfspaces that succeeds over a wide class of structured distributions.
Motivation: Prior tester-learners were tailored to a single target distribution and lacked generality.
Method: We give a tester-learner that runs in fully polynomial time with the following guarantee: the learner achieves error $O(\mathrm{opt}) + \epsilon$ on any labeled distribution that the tester accepts, and the tester accepts whenever the marginal satisfies a Poincare inequality.
Results: In the special case where the label noise is known to be Massart, the tester-learner achieves error $\mathrm{opt} + \epsilon$ while unconditionally accepting all log-concave distributions (without assuming KLS). Our tests check hypercontractivity of the unknown distribution with a sum-of-squares (SOS) program, crucially using the fact that Poincare distributions are certifiably hypercontractive in the SOS framework.

We give the first tester-learner for halfspaces that succeeds universally over a wide class of structured distributions. Our universal tester-learner runs in fully polynomial time and has the following guarantee: the learner achieves error $O(\mathrm{opt}) + \epsilon$ on any labeled distribution that the tester accepts, and moreover, the tester accepts whenever the marginal is any distribution that satisfies a Poincare inequality. In contrast to prior work on testable learning, our tester is not tailored to any single target distribution but rather succeeds for an entire target class of distributions. The class of Poincare distributions includes all strongly log-concave distributions, and, assuming the Kannan--Lovasz--Simonovits (KLS) conjecture, includes all log-concave distributions. In the special case where the label noise is known to be Massart, our tester-learner achieves error $\mathrm{opt} + \epsilon$ while accepting all log-concave distributions unconditionally (without assuming KLS). Our tests rely on checking hypercontractivity of the unknown distribution using a sum-of-squares (SOS) program, and crucially make use of the fact that Poincare distributions are certifiably hypercontractive in the SOS framework.

Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method
Constantine Caramanis Dimitris Fotakis Alkis Kalavasis Vasilis Kontonis Christos Tzamos



Research question: This paper investigates the effectiveness of deep neural networks and reinforcement learning methods for solving combinatorial optimization problems.
Motivation: Deep neural networks and reinforcement learning methods have shown great empirical promise on challenging combinatorial optimization problems, but theoretical analysis of their effectiveness remains limited.
Method: We introduce a new theoretical framework for analyzing these methods and ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable number of parameters, i.e., polynomial in the input size; and (iii) have an optimization landscape free of sub-optimal stationary points.
Results: The main contribution is a positive answer to this question, which holds for a broad class of combinatorial problems including Max- and Min-Cut, Max-$k$-CSP, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. As a byproduct of the analysis, we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.

Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions. In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e, polynomial in the size of the input, number of parameters; (iii) their optimization landscape is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question. Our result holds for a broad class of combinatorial problems including Max- and Min-Cut, Max-$k$-CSP, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. As a byproduct of our analysis we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.

User-Level Differential Privacy With Few Examples Per User
Badih Ghazi Pritish Kamath Ravi Kumar Pasin Manurangsi Raghu Meka Chiyuan Zhang



Research question: This paper studies user-level differential privacy (DP) in the "example-scarce" regime, where each user has only a few examples.
Motivation: Previous work focused on the "example-rich" regime, where users have so many examples that each user could solve the problem alone; this work instead targets the setting with few examples per user.
Method: For approximate-DP, we give a generic transformation of any item-level DP algorithm into a user-level DP algorithm; for pure-DP, we show how to adapt the exponential mechanism to the user-level setting.
Results: These techniques not only recover known bounds for specific problems but also yield new bounds for tasks such as private PAC learning, hypothesis selection, and distribution learning, some of which are near-optimal.

Previous work on user-level differential privacy (DP) [Ghazi et al. NeurIPS 2021, Bun et al. STOC 2023] obtained generic algorithms that work for various learning tasks. However, their focus was on the *example-rich* regime, where the users have so many examples that each user could themselves solve the problem. In this work we consider the *example-scarce* regime, where each user has only a few examples, and obtain the following results:
* For approximate-DP, we give a generic transformation of any item-level DP algorithm to a user-level DP algorithm. Roughly speaking, the latter gives a (multiplicative) savings of $O_{\varepsilon,\delta}(\sqrt{m})$ in terms of the number of users required for achieving the same utility, where $m$ is the number of examples per user. This algorithm, while recovering most known bounds for specific problems, also gives new bounds, e.g., for PAC learning.
* For pure-DP, we present a simple technique for adapting the exponential mechanism [McSherry & Talwar, FOCS 2007] to the user-level setting. This gives new bounds for a variety of tasks, such as private PAC learning, hypothesis selection, and distribution learning. For some of these problems, we show that our bounds are near-optimal.

Optimal Learners for Realizable Regression: PAC Learning and Online Learning
Idan Attias Steve Hanneke Alkis Kalavasis Amin Karbasi Grigoris Velegkas



Research question: This work aims to characterize the statistical complexity of realizable regression in both the PAC learning and online learning settings.
Motivation: Prior work established that finiteness of the fat-shattering dimension is sufficient for PAC learnability and that finiteness of the scaled Natarajan dimension is necessary, but little progress had been made toward a more complete characterization.
Method: We first introduce a minimax instance-optimal learner for realizable regression and propose a novel dimension that both qualitatively and quantitatively characterizes which classes of real-valued predictors are learnable. We then identify a combinatorial dimension related to the graph dimension that characterizes ERM learnability in the realizable setting. Finally, we establish a necessary condition for learnability based on a combinatorial dimension related to the DS dimension, and conjecture that it may also be sufficient in this context.
Results: In the online learning setting, we provide a dimension that characterizes the minimax instance-optimal cumulative loss up to a constant factor and design an optimal online learner for realizable regression, thereby resolving an open question raised by Daskalakis and Golowich in STOC '22.

In this work, we aim to characterize the statistical complexity of realizable regression both in the PAC learning setting and the online learning setting. Previous work had established the sufficiency of finiteness of the fat shattering dimension for PAC learnability and the necessity of finiteness of the scaled Natarajan dimension, but little progress had been made towards a more complete characterization since the work of Simon 1997 (SICOMP '97). To this end, we first introduce a minimax instance optimal learner for realizable regression and propose a novel dimension that both qualitatively and quantitatively characterizes which classes of real-valued predictors are learnable. We then identify a combinatorial dimension related to the graph dimension that characterizes ERM learnability in the realizable setting. Finally, we establish a necessary condition for learnability based on a combinatorial dimension related to the DS dimension, and conjecture that it may also be sufficient in this context. Additionally, in the context of online learning we provide a dimension that characterizes the minimax instance optimal cumulative loss up to a constant factor and design an optimal online learner for realizable regression, thus resolving an open question raised by Daskalakis and Golowich in STOC '22.

A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
Alicia Curth Alan Jeffares Mihaela van der Schaar



Research question: This paper challenges the recent revision of the classical account of the relationship between model complexity and prediction error, specifically the double descent phenomenon reported when the parameter count exceeds the sample size.
Motivation: Motivated by the success of overparametrized neural networks, recent influential work has suggested that the traditional U-shaped theory is incomplete, introducing an additional regime, dubbed double descent, in which test error descends a second time as the parameter count grows past the sample size.
Method: This paper takes a close look at classical statistical machine learning methods, including linear regression, trees, and boosting, and proposes a generalized measure of the effective number of parameters these methods use on unseen examples.
Results: Using this measure, the authors show that once one considers what is plotted on the x-axes of double descent plots, there are implicitly multiple distinct complexity axes along which the parameter count grows; the second descent appears exactly (and only) at the transition between these underlying axes, so its location is not inherently tied to the interpolation threshold $p=n$. Adopting a classical nonparametric perspective, they further interpret the investigated methods as smoothers, resolving the apparent tension between double descent and traditional statistical intuition.

Conventional statistical wisdom established a well-understood relationship between model complexity and prediction error, typically presented as a _U-shaped curve_ reflecting a transition between under- and overfitting regimes. However, motivated by the success of overparametrized neural networks, recent influential work has suggested this theory to be generally incomplete, introducing an additional regime that exhibits a second descent in test error as the parameter count $p$ grows past sample size $n$ -- a phenomenon dubbed _double descent_. While most attention has naturally been given to the deep-learning setting, double descent was shown to emerge more generally across non-neural models: known cases include _linear regression, trees, and boosting_. In this work, we take a closer look at the evidence surrounding these more classical statistical machine learning methods and challenge the claim that observed cases of double descent truly extend the limits of a traditional U-shaped complexity-generalization curve therein. We show that once careful consideration is given to _what is being plotted_ on the x-axes of their double descent plots, it becomes apparent that there are implicitly multiple, distinct complexity axes along which the parameter count grows. We demonstrate that the second descent appears exactly (and _only_) when and where the transition between these underlying axes occurs, and that its location is thus _not_ inherently tied to the interpolation threshold $p=n$. We then gain further insight by adopting a classical nonparametric statistics perspective. We interpret the investigated methods as _smoothers_ and propose a generalized measure for the _effective_ number of parameters they use _on unseen examples_, using which we find that their apparent double descent curves do indeed fold back into more traditional convex shapes -- providing a resolution to the ostensible tension between double descent and traditional statistical intuition.

Online RL in Linearly $q^\pi$-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore
Gellért Weisz András György Csaba Szepesvari



Research question: This paper studies online reinforcement learning in episodic MDPs under the linear $q^\pi$-realizability assumption.
Motivation: Existing online RL methods mainly target linear MDPs; when only the action-values of policies (rather than the transition kernel and reward) are linearly representable, those methods may fail to work, so the authors propose a new learning algorithm for this more general class.
Method: The authors derive a novel learning algorithm that simultaneously learns which states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem, returning an $\epsilon$-optimal policy with polynomial sample complexity.
Results: The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.

We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where for any policy, all the actions have approximately equal values, and skipping over these states by following an arbitrarily fixed policy in those states transforms the problem to a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns what states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.

Nearly Tight Bounds For Differentially Private Multiway Cut
Mina Dalirrooyfard Slobodan Mitrovic Yuriy Nevmyvaka



Research question: Finding min $s$-$t$ cuts in graphs is a basic algorithmic tool, with applications in image segmentation, community detection, reinforcement learning, and data clustering.
Motivation: In this problem, we are given two nodes as terminals, and the goal is to remove the smallest number of edges from the graph so that the two terminals are disconnected. We study the complexity of differential privacy for the min $s$-$t$ cut problem and show nearly tight lower and upper bounds, achieving privacy at no cost in running-time efficiency.
Method: We also develop a differentially private algorithm for the multiway $k$-cut problem, in which we are given $k$ terminal nodes that we wish to disconnect.
Results: As a function of $k$, our privacy guarantees are exponentially more efficient than applying the advanced composition theorem to known multiway $k$-cut algorithms. Finally, we empirically evaluate the approximation quality of our differentially private min $s$-$t$ cut algorithm and show that its output almost matches the quality of non-private algorithms.

Finding min $s$-$t$ cuts in graphs is a basic algorithmic tool, with applications in image segmentation, community detection, reinforcement learning, and data clustering. In this problem, we are given two nodes as terminals and the goal is to remove the smallest number of edges from the graph so that these two terminals are disconnected. We study the complexity of differential privacy for the min $s$-$t$ cut problem and show nearly tight lower and upper bounds where we achieve privacy at no cost for running time efficiency. We also develop a differentially private algorithm for the multiway $k$-cut problem, in which we are given $k$ nodes as terminals that we would like to disconnect. As a function of $k$, we obtain privacy guarantees that are exponentially more efficient than applying the advanced composition theorem to known algorithms for multiway $k$-cut. Finally, we empirically evaluate the approximation of our differentially private min $s$-$t$ cut algorithm and show that it almost matches the quality of the output of non-private ones.

Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models
Alex Damian Eshaan Nichani Rong Ge Jason D. Lee



Research question: Learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions.
Motivation: Prior work showed that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $\sigma$, the index of its first nonzero Hermite coefficient; $n \gtrsim d^{k^\star-1}$ samples suffice and are tight for online SGD, whereas the CSQ lower bound for gradient-based methods only shows that $n \gtrsim d^{k^\star/2}$ samples are necessary.
Method: We prove that online stochastic gradient descent on a smoothed loss learns $w^\star$ with $n \gtrsim d^{k^\star/2}$ samples.
Results: This closes the gap between the upper and lower bounds. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.

We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $\sigma$, which is defined as the index of the first nonzero Hermite coefficient of $\sigma$. Ben Arous et al. (2021) showed that $n \gtrsim d^{k^\star-1}$ samples suffice for learning $w^\star$ and that this is tight for online SGD. However, the CSQ lower bound for gradient based methods only shows that $n \gtrsim d^{k^\star/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^\star$ with $n \gtrsim d^{k^\star/2}$ samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
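Loss smoothing has a simple generic form: replace the loss by its Gaussian average and estimate the smoothed gradient by averaging gradients at perturbed parameters. The sketch below illustrates this operation on a toy objective; the paper's precise smoothing and online SGD schedule differ in detail.

```python
import numpy as np

def smoothed_grad(grad_fn, w, sigma=0.1, n_samples=8, rng=None):
    """Monte Carlo gradient of the Gaussian-smoothed loss
        L_sigma(w) = E_{z ~ N(0, sigma^2 I)}[ L(w + z) ],
    estimated by averaging gradients at perturbed weights (a minimal
    sketch of loss smoothing, not the paper's exact construction).
    """
    rng = rng or np.random.default_rng()
    return np.mean([grad_fn(w + sigma * rng.standard_normal(w.shape))
                    for _ in range(n_samples)], axis=0)

# Toy usage: smoothing the gradient of f(w) = ||w||^4 / 4.
grad = lambda w: np.dot(w, w) * w
w = np.ones(3)
print(smoothed_grad(grad, w, sigma=0.05))
```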

Private Everlasting Prediction
Moni Naor Kobbi Nissim Uri Stemmer Chao Yan



Research question: This paper explores prediction as an alternative to learning, and studies how to protect the privacy of both the training set and the queries.
Motivation: Private learners may need significantly higher sample complexity than non-private learners, as with one-dimensional threshold functions. A predictor that answers a stream of classification queries instead of publishing a hypothesis must modify the hypothesis it uses over time, and must use the queries themselves to do so, introducing potential privacy risks with respect to the queries.
Method: We introduce private everlasting prediction, which takes into account the privacy of both the training set and the (adaptively chosen) queries made to the predictor, and present a generic construction of private everlasting predictors in the PAC model.
Results: The sample complexity of the initial training sample in our construction is quadratic (up to polylog factors) in the VC dimension of the concept class. The construction allows prediction for all concept classes with finite VC dimension, and in particular threshold functions over infinite domains with a constant-size initial training sample, whereas privately learning threshold functions is impossible over infinite domains.

A private learner is trained on a sample of labeled points and generates a hypothesis that can be used for predicting the labels of newly sampled points while protecting the privacy of the training set [Kasiviswannathan et al., FOCS 2008]. Research uncovered that private learners may need to exhibit significantly higher sample complexity than non-private learners as is the case with, e.g., learning of one-dimensional threshold functions [Bun et al., FOCS 2015, Alon et al., STOC 2019]. We explore prediction as an alternative to learning. Instead of putting forward a hypothesis, a predictor answers a stream of classification queries. Earlier work has considered a private prediction model with just a single classification query [Dwork and Feldman, COLT 2018]. We observe that when answering a stream of queries, a predictor must modify the hypothesis it uses over time, and, furthermore, that it must use the queries for this modification, hence introducing potential privacy risks with respect to the queries themselves. We introduce private everlasting prediction taking into account the privacy of both the training set and the (adaptively chosen) queries made to the predictor. We then present a generic construction of private everlasting predictors in the PAC model. The sample complexity of the initial training sample in our construction is quadratic (up to polylog factors) in the VC dimension of the concept class. Our construction allows prediction for all concept classes with finite VC dimension, and in particular threshold functions with constant size initial training sample, even when considered over infinite domains, whereas it is known that the sample complexity of privately learning threshold functions must grow as a function of the domain size and hence is impossible for infinite domains.

A Single-Loop Accelerated Extra-Gradient Difference Algorithm with Improved Complexity Bounds for Constrained Minimax Optimization
Yuanyuan Liu Fanhua Shang Weixin An Junhao Liu Hongying Liu Zhouchen Lin



Research question: This paper proposes a new accelerated extra-gradient difference algorithm for solving constrained nonconvex-nonconcave (NC-NC) minimax problems.
Motivation: Existing algorithms for constrained NC-NC problems converge slowly and require additional structural assumptions.
Method: We design a new extra-gradient difference step to obtain an important quasi-cocoercivity property that significantly improves the convergence rate, and we introduce momentum acceleration into our dual accelerating update step.
Results: The algorithm attains complexity $\mathcal{O}(\epsilon^{-2})$ for finding an $\epsilon$-stationary point of the function $f$ in the constrained NC-NC setting, improving on the best-known bound of $\widetilde{\mathcal{O}}(\epsilon^{-4})$. For the nonconvex-concave (NC-C) and convex-nonconcave (C-NC) special cases, the algorithm obtains the same $\mathcal{O}(\epsilon^{-2})$ complexity, whereas the best-known bounds are $\widetilde{\mathcal{O}}(\epsilon^{-2.5})$ for NC-C and $\widetilde{\mathcal{O}}(\epsilon^{-4})$ for C-NC.

In this paper, we propose a novel extra-gradient difference acceleration algorithm for solving constrained nonconvex-nonconcave (NC-NC) minimax problems. In particular, we design a new extra-gradient difference step to obtain an important quasi-cocoercivity property, which plays a key role to significantly improve the convergence rate in the constrained NC-NC setting without additional structural assumption. Then momentum acceleration is also introduced into our dual accelerating update step. Moreover, we prove that, to find an $\epsilon$-stationary point of the function $f$, our algorithm attains the complexity $\mathcal{O}(\epsilon^{-2})$ in the constrained NC-NC setting, while the best-known complexity bound is $\widetilde{\mathcal{O}}(\epsilon^{-4})$, where $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic factors compared to $\mathcal{O}(\cdot)$. As the special cases of the constrained NC-NC setting, our algorithm can also obtain the same complexity $\mathcal{O}(\epsilon^{-2})$ for both the nonconvex-concave (NC-C) and convex-nonconcave (C-NC) cases, while the best-known complexity bounds are $\widetilde{\mathcal{O}}(\epsilon^{-2.5})$ for the NC-C case and $\widetilde{\mathcal{O}}(\epsilon^{-4})$ for the C-NC case. For fair comparison with existing algorithms, we also analyze the complexity bound to find $\epsilon$-stationary point of the primal function $\phi$ for the constrained NC-C problem, which shows that our algorithm can improve the complexity bound from $\widetilde{\mathcal{O}}(\epsilon^{-3})$ to $\mathcal{O}(\epsilon^{-2})$. To the best of our knowledge, this is the first time that the proposed algorithm improves the best-known complexity bounds from $\mathcal{O}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-3})$ to $\mathcal{O}(\epsilon^{-2})$ in both the NC-NC and NC-C settings.

Random Cuts are Optimal for Explainable k-Medians
Konstantin Makarychev Liren Shan



Research question: What is the optimal competitive ratio for explainable $k$-medians in $\ell_1$?
Motivation: The explainable $k$-medians problem was introduced by Dasgupta et al. in 2020; existing randomized algorithms achieve an $O(\log k \log\log k)$ competitive ratio, and the authors seek the optimal ratio through a tighter analysis.
Method: We analyze the RandomCoordinateCut algorithm for explainable $k$-medians.
Results: A tight analysis proves that the algorithm's competitive ratio is upper bounded by $2\ln k+2$, matching the $\Omega(\log k)$ lower bound of Dasgupta et al. (2020).

We show that the RandomCoordinateCut algorithm gives the optimal competitive ratio for explainable $k$-medians in $\ell_1$. The problem of explainable $k$-medians was introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian in 2020. Several groups of authors independently proposed a simple polynomial-time randomized algorithm for the problem and showed that this algorithm is $O(\log k \log\log k)$ competitive. We provide a tight analysis of the algorithm and prove that its competitive ratio is upper bounded by $2\ln k+2$. This bound matches the $\Omega(\log k)$ lower bound by Dasgupta et al (2020).
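The algorithm itself is simple to sketch: recursively split the set of centers with axis-aligned cuts drawn at random, keeping only cuts that actually separate centers, until each leaf holds a single center. The sampling details below are a plausible reading for illustration; the paper's analysis fixes the exact distribution.

```python
import random

def random_coordinate_cut(centers, dim):
    """Build an explainable (threshold-tree) clustering of k centers via
    random axis-aligned cuts (a sketch of the RandomCoordinateCut idea).

    centers: list of k points (tuples of length dim).
    Returns a nested tree: a single center (leaf), or a tuple
    (coordinate, threshold, left_subtree, right_subtree).
    """
    if len(centers) == 1:
        return centers[0]
    while True:
        i = random.randrange(dim)                    # random coordinate
        lo = min(c[i] for c in centers)
        hi = max(c[i] for c in centers)
        theta = random.uniform(lo, hi)               # random threshold
        left = [c for c in centers if c[i] <= theta]
        right = [c for c in centers if c[i] > theta]
        if left and right:   # keep only cuts that separate some pair
            return (i, theta,
                    random_coordinate_cut(left, dim),
                    random_coordinate_cut(right, dim))

tree = random_coordinate_cut([(0, 0), (1, 5), (4, 2)], dim=2)
print(tree)
```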

Information Maximization Perspective of Orthogonal Matching Pursuit with Applications to Explainable AI
Aditya Chattopadhyay Ryan Pilgrim Rene Vidal



Research question: This paper examines Information Pursuit (IP), an active testing algorithm that predicts an output by sequentially and greedily querying the input in order of information gain, and addresses its computational intensity.
Motivation: IP is computationally expensive because it requires estimating mutual information in high-dimensional spaces; this paper explores Orthogonal Matching Pursuit (OMP) as an alternative to IP for greedily selecting queries.
Method: We establish a fundamental connection between IP and OMP, proving that IP with random projections of dictionary atoms as queries "almost" reduces to OMP, the difference being that IP selects atoms in order of normalized correlation gain. We call this version IP-OMP and show via simulations that, for random Gaussian dictionaries, this difference has no appreciable effect on the sparse code recovery rate of IP-OMP compared to that of OMP.
Results: Inspired by this connection, we explore the utility of IP-OMP for generating explainable predictions: we propose a simple explainable AI algorithm that encodes an image as a sparse combination of semantically meaningful dictionary atoms defined as text embeddings of interpretable concepts, with the weights of the sparse combination serving as the explanation. Empirically, the proposed algorithm is not only competitive with existing explainability methods but also computationally cheaper.

Information Pursuit (IP) is a classical active testing algorithm for predicting an output by sequentially and greedily querying the input in order of information gain. However, IP is computationally intensive since it involves estimating mutual information in high-dimensional spaces. This paper explores Orthogonal Matching Pursuit (OMP) as an alternative to IP for greedily selecting the queries. OMP is a classical signal processing algorithm for sequentially encoding a signal in terms of dictionary atoms chosen in order of correlation gain. In each iteration, OMP selects the atom that is most correlated with the signal residual (the signal minus its reconstruction thus far). Our first contribution is to establish a fundamental connection between IP and OMP, where we prove that IP with random projections of dictionary atoms as queries ``almost'' reduces to OMP, with the difference being that IP selects atoms in order of normalized correlation gain. We call this version IP-OMP and present simulations indicating that this difference does not have any appreciable effect on the sparse code recovery rate of IP-OMP compared to that of OMP for random Gaussian dictionaries. Inspired by this connection, our second contribution is to explore the utility of IP-OMP for generating explainable predictions, an area in which IP has recently gained traction. More specifically, we propose a simple explainable AI algorithm which encodes an image as a sparse combination of semantically meaningful dictionary atoms that are defined as text embeddings of interpretable concepts. The final prediction is made using the weights of this sparse combination, which serve as an explanation. Empirically, our proposed algorithm is not only competitive with existing explainability methods but also computationally less expensive.
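Standard OMP, which the paper connects to IP, fits in a few lines: repeatedly pick the atom most correlated with the residual, then re-fit the coefficients by least squares. This is textbook OMP for illustration; IP-OMP differs in ranking atoms by normalized correlation gain.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit (standard version).

    D: (n, m) dictionary with unit-norm columns (atoms)
    x: (n,)   signal to encode
    k: sparsity level
    Returns (support, coefficients).
    """
    residual = x.copy()
    support, coef = [], None
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        # Re-fit coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    return support, coef

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] - 1.5 * D[:, 17]                   # 2-sparse signal
print(omp(D, x, k=2))                                # recovers atoms 3 and 17
```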

On the Variance, Admissibility, and Stability of Empirical Risk Minimization
Gil Kur Eli Putterman Alexander Rakhlin



Research question: This paper examines the minimax suboptimality of empirical risk minimization (ERM) and proves that its suboptimality must be due to its bias.
Motivation: To understand how the bias of ERM drives its minimax-suboptimal rates, the authors conduct a detailed theoretical analysis.
Method: Using the probabilistic method, the authors analyze the bias and variance error terms of ERM in both the fixed design and random design settings, and extend Chatterjee's admissibility theorem to the random design setting.
Results: The variance error term of ERM enjoys the minimax rate, so ERM's suboptimality must stem from its bias. The estimates also imply stability of ERM, and the authors show that functions can be close to ERM in $L_2$ distance while still being far from almost-minimizers of the empirical loss, highlighting the irregular loss landscape in the non-Donsker regime.

It is well known that Empirical Risk Minimization (ERM) may attain minimax suboptimal rates in terms of the mean squared error (Birgé and Massart, 1993). In this paper, we prove that, under relatively mild assumptions, the suboptimality of ERM must be due to its bias. Namely, the variance error term of ERM (in terms of the bias and variance decomposition) enjoys the minimax rate. In the fixed design setting, we provide an elementary proof of this result using the probabilistic method. Then, we extend our proof to the random design setting for various models. In addition, we provide a simple proof of Chatterjee’s admissibility theorem (Chatterjee, 2014, Theorem 1.4), which states that in the fixed design setting, ERM cannot be ruled out as an optimal method, and then we extend this result to the random design setting. We also show that our estimates imply stability of ERM, complementing the main result of Caponnetto and Rakhlin (2006) for non-Donsker classes. Finally, we highlight the somewhat irregular nature of the loss landscape of ERM in the non-Donsker regime, by showing that functions can be close to ERM, in terms of $L_2$ distance, while still being far from almost-minimizers of the empirical loss.

Precise asymptotic generalization for multiclass classification with overparameterized linear models
David Xing Wu Anant Sahai



Research question: This work studies the asymptotic generalization of an overparameterized linear model for multiclass classification under the Gaussian covariates bi-level model, where the numbers of data points, features, and classes all grow together.
Motivation: Subramanian et al. (NeurIPS '22) conjectured the regimes in which such a model does and does not generalize; this work sets out to fully resolve that conjecture.
Method: The key to the tight analysis is a new variant of the Hanson-Wright inequality that is broadly useful for multiclass problems with sparse labels; the same type of analysis also applies to the related multi-label classification problem under the same bi-level ensemble.
Results: The conjecture is fully resolved, matching the predicted regimes; the new lower bounds are akin to an information-theoretic strong converse, establishing that the misclassification rate goes to 0 or 1 asymptotically. One surprising consequence is that the min-norm interpolating classifier can be asymptotically suboptimal relative to noninterpolating classifiers in the regime where the min-norm interpolating regressor is known to be optimal.

We study the asymptotic generalization of an overparameterized linear model for multiclass classification under the Gaussian covariates bi-level model introduced in Subramanian et al. (NeurIPS'22), where the number of data points, features, and classes all grow together. We fully resolve the conjecture posed in Subramanian et al. '22, matching the predicted regimes for which the model does and does not generalize. Furthermore, our new lower bounds are akin to an information-theoretic strong converse: they establish that the misclassification rate goes to 0 or 1 asymptotically. One surprising consequence of our tight results is that the min-norm interpolating classifier can be asymptotically suboptimal relative to noninterpolating classifiers in the regime where the min-norm interpolating regressor is known to be optimal. The key to our tight analysis is a new variant of the Hanson-Wright inequality which is broadly useful for multiclass problems with sparse labels. As an application, we show that the same type of analysis can be used to analyze the related multi-label classification problem under the same bi-level ensemble.

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL
Qinghua Liu Gellért Weisz András György Chi Jin Csaba Szepesvari



Research question: This paper addresses the limited theoretical understanding of policy optimization algorithms in reinforcement learning and their overly high sample complexity in online RL.
Motivation: Although policy optimization algorithms have played an important role in RL, their theoretical understanding remains limited, and their sample complexity in online RL is highly suboptimal, especially when exploration is necessary.
Method: The paper proposes a simple, efficient policy optimization framework for online RL, Optimistic NPG, which can be viewed as combining the classic natural policy gradient algorithm with optimistic policy evaluation subroutines to encourage exploration.
Results: For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient and learns an $\epsilon$-optimal policy within $\tilde{\mathcal{O}}(d^2/\epsilon^3)$ samples, making it the first algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. For general function approximation (which subsumes linear MDPs), Optimistic NPG is the first policy optimization algorithm to achieve polynomial sample complexity for learning near-optimal policies.

While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited---they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL where exploration is necessary. This paper proposes a simple efficient policy optimization framework---Optimistic NPG for online RL. Optimistic NPG can be viewed as simply combining the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\epsilon$-optimal policy within $\tilde{\mathcal{O}}(d^2/\epsilon^3)$ samples, which is the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. It also improves over state-of-the-art results of policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. For general function approximation that subsumes linear MDPs, Optimistic NPG, to our best knowledge, is also the first policy optimization algorithm that achieves the polynomial sample complexity for learning near-optimal policies.
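To make the "NPG + optimistic evaluation" recipe concrete: NPG with a softmax policy parameterization is equivalent to per-state multiplicative weights on the current Q-estimates. The sketch below shows only that update; in Optimistic NPG the table `Q_optimistic` would come from an optimistic policy evaluation subroutine (value estimates plus exploration bonuses), which is not implemented here.

```python
import numpy as np

def npg_softmax_update(pi, Q, eta):
    """One natural-policy-gradient step for a softmax policy, i.e. per-state
    multiplicative weights: pi'(a|s) proportional to pi(a|s) * exp(eta * Q(s, a))."""
    logits = np.log(pi) + eta * Q                    # pi, Q: (states, actions)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Illustrative loop with a fixed optimistic Q-table (assumed input; a real
# implementation would re-estimate it each iteration with bonuses).
pi = np.full((2, 3), 1.0 / 3.0)
Q_optimistic = np.array([[1.0, 0.5, 0.1], [0.2, 0.9, 0.8]])
for _ in range(50):
    pi = npg_softmax_update(pi, Q_optimistic, eta=0.1)
print(pi.round(3))  # mass concentrates on the greedy actions
```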

Distribution-Free Statistical Dispersion Control for Societal Applications
Zhun Deng Thomas P Zollo Jake Snell Toniann Pitassi Richard Zemel



Research question: How to provide finite-sample statistical guarantees on machine learning model performance while controlling the dispersion of the loss distribution.
Motivation: For high-stakes applications, it is crucial to understand and control the unequal effects of algorithmic decisions across different groups.
Method: A simple yet flexible framework that handles a much richer class of statistical functionals than previous work, verified through experiments in toxic comment detection, medical imaging, and film recommendation.
Results: The work initiates the study of distribution-free control of statistical dispersion measures with societal implications and performs well across several practical tasks.

Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications it is crucial to understand and control the \textit{dispersion} of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.

Convex and Non-convex Optimization Under Generalized Smoothness
Haochuan Li Jian Qian Yi Tian Alexander Rakhlin Ali Jadbabaie



Research question: This paper further generalizes the non-uniform smoothness condition and develops a simple yet powerful analysis technique to obtain stronger results for convex and non-convex optimization problems.
Motivation: Classical analyses of convex and non-convex optimization require Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement and proved convergence in the non-convex setting via gradient clipping under a bounded-noise assumption.
Method: The paper generalizes the non-uniform smoothness condition and develops an analysis technique that bounds the gradients along the optimization trajectory, yielding stronger results for both convex and non-convex problems.
Results: The classical convergence rates are obtained for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex settings under this general smoothness condition. The new analysis requires no gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.

Classical analysis of convex and non-convex optimization methods often requires the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
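Concretely, the affine non-uniform smoothness condition referenced above reads $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, and the generalization studied here replaces the affine right-hand side with a more general non-decreasing function of the gradient norm, $\|\nabla^2 f(x)\| \le \ell(\|\nabla f(x)\|)$ (this display paraphrases the condition from the surrounding description rather than quoting the paper's exact definition).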

Convergence of Adam Under Relaxed Assumptions
Haochuan Li Alexander Rakhlin Ali Jadbabaie



Research question: This paper provides a rigorous convergence proof for the Adaptive Moment Estimation (Adam) algorithm for a wide class of optimization objectives.
Motivation: Despite Adam's popularity and efficiency in training deep neural networks, its theoretical properties are not fully understood; existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show convergence to stationary points.
Method: The paper shows that Adam provably converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key is a new proof that the gradients stay bounded along Adam's optimization trajectory, under a generalized smoothness assumption in which the local smoothness (the Hessian norm, when it exists) is bounded by a sub-quadratic function of the gradient norm.
Results: A variance-reduced version of Adam is also proposed, with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.

In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
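For reference, the iteration whose convergence is analyzed is the standard Adam update; below is a minimal numpy sketch of it (the paper's variance-reduced variant is not shown).

```python
import numpy as np

def adam(grad, x0, steps=5000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Textbook Adam (Kingma & Ba): exponential moving averages of the
    gradient and its elementwise square, with bias correction."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)                      # first-moment estimate
    v = np.zeros_like(x)                      # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Usage: minimize the smooth nonconvex toy objective f(x) = sum(x^2 + sin(x)).
print(adam(lambda x: 2 * x + np.cos(x), np.array([3.0, -2.0])))
```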

Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach
Yu-Hu Yan Peng Zhao Zhi-Hua Zhou



Research question: This paper proposes an online convex optimization method with two different levels of adaptivity.
Motivation: Existing methods must cope with online functions whose type and curvature are unknown in advance, while ideally also exploiting the unknown niceness of the environment to attain problem-dependent guarantees.
Method: The approach is built on a multi-layer online ensemble framework, incorporating a carefully designed optimism to unify different function types and cascaded corrections to improve algorithmic stability.
Results: The method attains $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$, and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave, and convex loss functions respectively, where $d$ is the dimension, $V_T$ denotes the problem-dependent gradient variation, and $\hat{\mathcal{O}}(\cdot)$ omits $\log V_T$ factors. The result not only safeguards worst-case guarantees but also directly implies small-loss bounds; applied to adversarial/stochastic convex optimization and game theory problems, it strengthens existing universal guarantees.

In this paper, we propose an online convex optimization approach with two different levels of adaptivity. On a higher level, our approach is agnostic to the unknown types and curvatures of the online functions, while at a lower level, it can exploit the unknown niceness of the environments and attain problem-dependent guarantees. Specifically, we obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions, respectively, where $d$ is the dimension, $V_T$ denotes problem-dependent gradient variations and the $\hat{\mathcal{O}}(\cdot)$-notation omits $\log V_T$ factors. Our result not only safeguards the worst-case guarantees but also directly implies the small-loss bounds in analysis. Moreover, when applied to adversarial/stochastic convex optimization and game theory problems, our result enhances the existing universal guarantees. Our approach is based on a multi-layer online ensemble framework incorporating novel ingredients, including a carefully designed optimism for unifying diverse function types and cascaded corrections for algorithmic stability. Notably, despite its multi-layer structure, our algorithm necessitates only one gradient query per round, making it favorable when the gradient evaluation is time-consuming. This is facilitated by a novel regret decomposition equipped with carefully designed surrogate losses.

Online List Labeling with Predictions
Samuel McCauley Benjamin Moseley Aidin Niaparast Shikha Singh



Research question: How to integrate predictions into data structures with strong theoretical guarantees.
Motivation: Although learned predictions have been shown to improve algorithmic running times, how to incorporate such predictions into data structures remains underdeveloped.
Method: The paper shows that predictions can be leveraged in the online list labeling problem, designing a new list labeling data structure and bounding its performance in two models; in the worst-case learning-augmented model, the guarantees are stated in terms of the prediction error.
Results: The data structure provides strong guarantees: it is optimal for any prediction error and matches the best-known worst-case bound even when the predictions are entirely erroneous. A stochastic error model is also considered, with performance bounded in terms of the expectation and variance of the error. Finally, empirical results confirm the theory; in particular, the structure performs well on real data where predictions are built from elements that arrived in the past, as in a typical practical use case.

A growing line of work shows how learned predictions can be used to break through worst-case barriers to improve the running time of an algorithm. However, incorporating predictions into data structures with strong theoretical guarantees remains underdeveloped. This paper takes a step in this direction by showing that predictions can be leveraged in the fundamental online list labeling problem. In the problem, $n$ items arrive over time and must be stored in sorted order in an array of size $\Theta(n)$. The array slot of an element is its label and the goal is to maintain sorted order while minimizing the total number of elements moved (i.e., relabeled). We design a new list labeling data structure and bound its performance in two models. In the worst-case learning-augmented model, we give guarantees in terms of the error in the predictions. Our data structure provides strong guarantees: it is optimal for any prediction error and guarantees the best-known worst-case bound even when the predictions are entirely erroneous. We also consider a stochastic error model and bound the performance in terms of the expectation and variance of the error. Finally, the theoretical results are demonstrated empirically. In particular, we show that our data structure has strong performance on real temporal data sets where predictions are constructed from elements that arrived in the past, as is typically done in a practical use case.

Convergence of Alternating Gradient Descent for Matrix Factorization
Rachel Ward Tamara G. Kolda



Research question: How to optimize the asymmetric matrix factorization objective with alternating gradient descent.
Motivation: Finding a fast, effective method for low-rank factorization of an arbitrary asymmetric matrix.
Method: Alternating gradient descent with fixed step size applied to the asymmetric matrix factorization objective, starting from a particular random initialization; the analysis proves that an $\epsilon$-optimal factorization is reached after a bounded number of iterations.
Results: Experiments show the proposed initialization is not merely of theoretical benefit, but significantly improves the convergence rate of gradient descent in practice.

We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $A \in \mathbb{R}^{m \times n}$, $T = C ( \frac{\sigma_1(A)}{\sigma_r(A)} )^2 \log(1/\epsilon)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| A - X_{T} Y_{T}' \|^2 \leq \epsilon \| A \|^2$ with high probability starting from an atypical random initialization. The factors have rank $d \geq r$ so that $X_{T}\in \mathbb{R}^{m \times d}$ and $Y_{T} \in\mathbb{R}^{n \times d}$, and mild overparameterization suffices for the constant $C$ in the iteration complexity $T$ to be an absolute constant. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but rather significantly improves the convergence rate of gradient descent in practice. Our proof is conceptually simple: a uniform Polyak-Lojasiewicz (PL) inequality and uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.
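The iteration itself is elementary; the sketch below runs fixed-step alternating gradient descent on $f(X,Y) = \frac{1}{2}\|A - XY^\top\|_F^2$. Note that the plain small Gaussian initialization used here is for illustration only: the paper's iteration-complexity guarantee hinges on its specific, atypical random initialization.

```python
import numpy as np

def agd_factorize(A, d, step, iters, rng):
    """Alternating gradient descent with fixed step size on
    0.5 * ||A - X Y^T||_F^2, updating X and then Y each iteration."""
    m, n = A.shape
    X = 0.1 * rng.normal(size=(m, d))    # illustrative init, not the paper's
    Y = 0.1 * rng.normal(size=(n, d))
    for _ in range(iters):
        X -= step * (X @ Y.T - A) @ Y    # gradient in X with Y fixed
        Y -= step * (X @ Y.T - A).T @ X  # then gradient in Y with X fixed
    return X, Y

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))      # rank-3 target
A /= np.linalg.norm(A, 2)                # normalize so sigma_1(A) = 1
X, Y = agd_factorize(A, d=6, step=0.2, iters=2000, rng=rng)  # mild overparameterization
print(np.linalg.norm(A - X @ Y.T) / np.linalg.norm(A))       # small relative error
```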

Improved Frequency Estimation Algorithms with and without Predictions
Anders Aamand Justin Y. Chen Huy Nguyen Sandeep Silwal Ali Vakilian



Research question: How to accurately estimate the frequencies of elements appearing in a data stream.
Motivation: Existing sketching approaches (such as CountMin and CountSketch) incur estimation error; Hsu et al. (2019) proposed using machine learning to tailor sketches to the specific data distribution.
Method: A new algorithm that, in some parameter regimes, already theoretically outperforms the learned algorithm of Hsu et al. without using any predictions; augmenting it with a heavy-hitter predictor further reduces the error and improves on the state of the art.
Results: Empirically, the algorithms outperform prior approaches in all experiments.

Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically bound the error of the estimated frequencies for any possible input. The work of Hsu et al.~(2019) introduced the idea of using machine learning to tailor sketching algorithms to the specific data distribution they are being run on. In particular, their learning-augmented frequency estimation algorithm uses a learned heavy-hitter oracle which predicts which elements will appear many times in the stream. We give a novel algorithm, which in some parameter regimes, already theoretically outperforms the learning based algorithm of Hsu et al. *without* the use of any predictions. Augmenting our algorithm with heavy-hitter predictions further reduces the error and improves upon the state of the art. Empirically, our algorithms achieve superior performance in all experiments compared to prior approaches.
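For context on the baseline being improved, here is a minimal CountMin sketch: $d$ hash rows of width $w$, each update increments one counter per row, and the estimate is the row-wise minimum, so errors are one-sided (estimates are biased upward by colliding mass). This is only the classical sketch, not the paper's algorithm or its learned heavy-hitter oracle.

```python
import numpy as np

class CountMin:
    """Count-Min sketch: d hash rows of width w; query returns the minimum
    counter over the rows, an upper-biased frequency estimate."""
    def __init__(self, w, d, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.table = np.zeros((d, w), dtype=np.int64)
        self.salts = rng.integers(1, 2**61 - 1, size=d)  # per-row hash salts

    def _cols(self, item):
        return [hash((int(s), item)) % self.w for s in self.salts]

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def query(self, item):
        return min(self.table[row, col] for row, col in enumerate(self._cols(item)))

cms = CountMin(w=512, d=4)
for token in ["a"] * 100 + ["b"] * 10 + list("cdefgh"):
    cms.update(token)
print(cms.query("a"), cms.query("b"))  # never underestimates the true counts
```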

The Equivalence of Dynamic and Strategic Stability under Regularized Learning in Games
Victor Boone Panayotis Mertikopoulos



Research question: This paper examines the long-run behavior of regularized, no-regret learning in finite N-player games.
Motivation: Although the empirical frequencies of play under no-regret learning are known to converge to the game's set of coarse correlated equilibria, our understanding of how the players' actual strategies evolve over time is far more limited.
Method: The authors take a more general approach, characterizing the setwise rationality properties of the players' day-to-day trajectory of play by focusing on one of the most stringent criteria of setwise strategic stability: closedness under better replies, i.e., any unilateral deviation from the set in question incurs a cost to the deviator.
Results: A remarkable equivalence between strategic and dynamic stability is established, together with convergence-rate estimates: methods based on entropic regularization (such as the exponential weights algorithm) converge at a geometric rate, while projection-based methods converge within a finite number of iterations, even with bandit, payoff-based feedback.

In this paper, we examine the long-run behavior of regularized, no-regret learning in finite N-player games. A well-known result in the field states that the empirical frequencies of play under no-regret learning converge to the game’s set of coarse correlated equilibria; however, our understanding of how the players' _actual strategies_ evolve over time is much more limited – and, in many cases, non-existent. This issue is exacerbated further by a series of recent results showing that _only_ strict Nash equilibria are stable and attracting under regularized learning, thus making the relation between learning and _pointwise_ solution concepts particularly elusive. In lieu of this, we take a more general approach and instead seek to characterize the _setwise_ rationality properties of the players' day-to-day trajectory of play. To do so, we focus on one of the most stringent criteria of setwise strategic stability, namely that any unilateral deviation from the set in question incurs a cost to the deviator – a property known as _closedness under better replies_ (club). In so doing, we obtain a remarkable equivalence between strategic and dynamic stability: _a product of pure strategies is closed under better replies if and only if its span is stable and attracting under regularized learning._ In addition, we estimate the rate of convergence to such sets, and we show that methods based on entropic regularization (like the exponential weights algorithm) converge at a geometric rate, while projection-based methods converge within a finite number of iterations, even with bandit, payoff-based feedback.

Private estimation algorithms for stochastic block models and mixture models
Hongjie Chen Vincent Cohen-Addad Tommaso d'Orsi Alessandro Epasto Jacob Imola David Steurer Stefan Tiegel



Research question: Designing efficient private estimation algorithms whose statistical guarantees in high-dimensional settings nearly match those of the best known non-private algorithms.
Motivation: Improving the privacy protection of data analysis while retaining efficient and accurate computation.
Method: New general tools for designing efficient private estimation algorithms, illustrated on two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians.
Results: For the former, the paper gives the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery; for the latter, the algorithm recovers the centers of a $k$-mixture when the minimum separation is at least $O(k^{1/t}\sqrt{t})$, with sample complexity $n \geq k^{O(1)} d^{O(t)}$ and time complexity $(nd)^{O(t)}$ for all choices of $t$.

We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required either an additional additive $\Omega(\sqrt{\log n})$ term in the minimum separation or an explicit upper bound on the Euclidean norm of the centers.

Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima
Dongkuk Si Chulhee Yun



Research question: Whether Sharpness-Aware Minimization (SAM), with the constant perturbation size and gradient normalization used in practice, can converge to global minima or stationary points.
Motivation: Existing convergence analyses of SAM assume a decaying perturbation size $\rho$ and/or no gradient normalization, which is detached from how SAM is actually used.
Method: The paper studies deterministic and stochastic versions of SAM under practical configurations (constant $\rho$ and gradient normalization in $y_t$) and explores their convergence on smooth functions under various (non)convexity assumptions.
Results: In many scenarios SAM has only limited capability to converge to global minima or stationary points: for smooth strongly convex functions, deterministic SAM enjoys tight $\tilde{\Theta}(1/T^2)$ global convergence rates, but stochastic SAM suffers an unavoidable additive $\mathcal{O}(\rho^2)$ term, indicating convergence only up to neighborhoods of optima; such terms also arise for deterministic SAM in nonconvex cases and are proved by example to be unavoidable.

Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + \rho \frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming decaying perturbation size $\rho$ and/or no gradient normalization in $y_t$, which is detached from practice. To address this gap, we study deterministic/stochastic versions of SAM with practical configurations (i.e., constant $\rho$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions with (non)convexity assumptions. Perhaps surprisingly, in many scenarios, we find out that SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde \Theta(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers an inevitable additive term $\mathcal O(\rho^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $\mathcal O(\rho^2)$ factors arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.
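The update under study is easy to state concretely. The sketch below implements one deterministic SAM step with constant $\rho$ and gradient normalization, and the toy loop illustrates the qualitative message of the abstract: with a constant step size, the iterate settles into a neighborhood of the optimum whose scale is governed by the step size and $\rho$.

```python
import numpy as np

def sam_step(x, grad, rho, lr):
    """One deterministic SAM step with constant rho and gradient
    normalization: perturb to y = x + rho * g/||g||, then descend
    from x using the gradient evaluated at y."""
    g = grad(x)
    y = x + rho * g / (np.linalg.norm(g) + 1e-12)
    return x - lr * grad(y)

# Toy run on the strongly convex quadratic f(x) = 0.5 * ||x||^2.
grad = lambda x: x
x = np.array([2.0, -1.0])
for _ in range(200):
    x = sam_step(x, grad, rho=0.05, lr=0.1)
print(x)  # hovers near (not at) the optimum, at a scale set by lr and rho
```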

Mean-field Langevin dynamics: Time-space discretization, stochastic gradient, and variance reduction
Taiji Suzuki Denny Wu Atsushi Nitanda



Research question: This paper addresses global optimization via the mean-field Langevin dynamics (MFLD) while accounting for finite-particle approximation, time discretization, and stochastic gradient error.
Motivation: Prior analyses all assumed the infinite-particle or continuous-time limit and cannot handle stochastic gradient updates; the authors therefore develop a general framework for proving uniform-in-time propagation of chaos for MFLD.
Method: A unified framework that proves uniform-in-time propagation of chaos for MFLD under the errors due to finite-particle approximation, time discretization, and stochastic gradient.
Results: The framework applies broadly, establishing quantitative convergence-rate guarantees for learning problems such as mean-field neural networks and MMD minimization, and for different gradient estimators including SGD and SVRG. When specialized to the standard Langevin dynamics, it yields improved convergence rates in both the SGD and SVRG settings.

The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift, and it naturally arises from the optimization of two-layer neural networks via (noisy) gradient descent. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. However, all prior analyses assumed the infinite-particle or continuous-time limit, and cannot handle stochastic gradient updates. We provide a general framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and stochastic gradient. To demonstrate the wide applicability of our framework, we establish quantitative convergence rate guarantees to the regularized global optimal solution for $(i)$ a wide range of learning problems such as mean-field neural network and MMD minimization, and $(ii)$ different gradient estimators including SGD and SVRG. Despite the generality of our results, we achieve an improved convergence rate in both the SGD and SVRG settings when specialized to the standard Langevin dynamics.

Generalization in the Face of Adaptivity: A Bayesian Perspective
Moshe Shenfeld Katrina Ligett



Research question: Repeated use of a data sample via adaptively chosen queries can lead to overfitting, where the empirical evaluation of queries on the sample deviates significantly from their mean with respect to the underlying data distribution.
Motivation: Simple noise-addition algorithms prevent this issue, and differential privacy-based analyses show they can handle an asymptotically optimal number of queries. However, the worst-case nature of differential privacy requires scaling such noise to the range of the queries, even for highly concentrated queries, or resorting to more complex algorithms.
Method: The paper proves that straightforward noise-addition algorithms already provide variance-dependent guarantees that also extend to unbounded queries, via a novel characterization that illuminates the core problem of adaptive data analysis.
Results: The harm of adaptivity is shown to come from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses to past queries. This characterization is then leveraged to introduce a new data-dependent stability notion that can bound this covariance.

Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting, wherein the empirical evaluation of queries on the sample significantly deviates from their mean with respect to the underlying data distribution. It turns out that simple noise addition algorithms suffice to prevent this issue, and differential privacy-based analysis of these algorithms shows that they can handle an asymptotically optimal number of queries. However, differential privacy's worst-case nature entails scaling such noise to the range of the queries even for highly-concentrated queries, or introducing more complex algorithms. In this paper, we prove that straightforward noise-addition algorithms already provide variance-dependent guarantees that also extend to unbounded queries. This improvement stems from a novel characterization that illuminates the core problem of adaptive data analysis. We show that the harm of adaptivity results from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries. We then leverage this characterization to introduce a new data-dependent stability notion that can bound this covariance.
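The mechanism in question is genuinely simple; the sketch below answers adaptively chosen statistical queries by adding Gaussian noise to each empirical mean. The noise scale and the example queries are illustrative choices, not the paper's calibration; the point of the paper is that guarantees for this mechanism can scale with a query's variance rather than its worst-case range.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)   # the data sample being reused adaptively

def noisy_answer(query, sigma=0.05):
    """Answer a statistical query on the sample with added Gaussian noise,
    the straightforward noise-addition mechanism discussed above."""
    return float(np.mean(query(sample))) + rng.normal(scale=sigma)

# An analyst may choose each query after seeing earlier noisy answers.
a1 = noisy_answer(lambda s: s)        # mean
a2 = noisy_answer(lambda s: s ** 2)   # second moment: an unbounded query
print(a1, a2)
```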

List and Certificate Complexities in Replicable Learning
Peter Dixon A. Pavan Jason Vander Woude N V Vinodchandran



Research question: This paper studies replicable learning algorithms, i.e., algorithms that output the same canonical hypothesis over multiple runs with high probability.
Motivation: The strong notion of replicability is in general not achievable, so the paper considers two feasible notions: list replicability and certificate replicability.
Method: The authors design a learning algorithm with optimal list complexity for estimating the biases of $d$ coins while minimizing sample complexity; the upper bounds use rounding schemes induced by geometric partitions, and the lower bounds use the Sperner/KKM lemma.
Results: A $(d+1)$-list replicable algorithm is given and the list complexity is shown optimal, alongside algorithms with certificate complexity $\tilde{O}(\log d)$; analogous list and certificate replicability results hold in the PAC model for classes learnable with $d$ nonadaptive statistical queries.

We investigate replicable learning algorithms. Informally, a learning algorithm is replicable if the algorithm outputs the same canonical hypothesis over multiple runs with high probability, even when different runs observe a different set of samples from the unknown data distribution. In general, such a strong notion of replicability is not achievable. Thus we consider two feasible notions of replicability called {\em list replicability} and {\em certificate replicability}. Intuitively, these notions capture the degree of (non) replicability. The goal is to design learning algorithms with optimal list and certificate complexities while minimizing the sample complexity. Our contributions are the following. 1. We first study the learning task of estimating the biases of $d$ coins, up to an additive error of $\varepsilon$, by observing samples. For this task, we design a $(d+1)$-list replicable algorithm. To complement this result, we establish that the list complexity is optimal, i.e., there are no learning algorithms with a list size smaller than $d+1$ for this task. We also design learning algorithms with certificate complexity $\tilde{O}(\log d)$. The sample complexity of both these algorithms is $\tilde{O}(\frac{d^2}{\varepsilon^2})$ where $\varepsilon$ is the approximation error parameter (for a constant error probability). 2. In the PAC model, we show that any hypothesis class that is learnable with $d$-nonadaptive statistical queries can be learned via a $(d+1)$-list replicable algorithm and also via a $\tilde{O}(\log d)$-certificate replicable algorithm. The sample complexity of both these algorithms is $\tilde{O}(\frac{d^2}{\nu^2})$ where $\nu$ is the approximation error of the statistical query. We also show that for the concept class \dtep, the list complexity is exactly $d+1$ with respect to the uniform distribution. To establish our upper bound results we use rounding schemes induced by geometric partitions with certain properties. We use Sperner/KKM Lemma to establish the lower bound results.

Online (Multinomial) Logistic Bandit: Improved Regret and Constant Computation Cost
Yu-Jie Zhang Masashi Sugiyama



Research question: This paper investigates the logistic bandit problem, a variant of the generalized linear bandit model that uses a logistic model to describe the feedback from an action.
Motivation: While most existing research focuses on the binary logistic bandit, the multinomial case, which allows more than two possible feedback values, has greater practical relevance and adaptability for complex decision-making problems such as reinforcement learning.
Method: The paper provides an algorithm that is both statistically and computationally efficient for the logistic bandit. In the binary case, it reduces the per-round computation cost from $\mathcal{O}(\log T)$ to $\mathcal{O}(1)$ while preserving the minimax-optimal guarantee up to logarithmic factors; in the multinomial case with $K+1$ possible feedback values, it achieves an $\tilde{\mathcal{O}}(K\sqrt{T})$ regret bound with $\mathcal{O}(1)$ per-round computation cost.
Results: This not only improves on the $\tilde{\mathcal{O}}(K\sqrt{\kappa T})$ bound of the best known tractable algorithm, where the large constant $\kappa$ grows exponentially with the diameter of the parameter domain, but also reduces the $\mathcal{O}(T)$ computational complexity demanded by the previous method.

This paper investigates the logistic bandit problem, a variant of the generalized linear bandit model that utilizes a logistic model to depict the feedback from an action. While most existing research focuses on the binary logistic bandit problem, the multinomial case, which considers more than two possible feedback values, offers increased practical relevance and adaptability for use in complex decision-making problems such as reinforcement learning. In this paper, we provide an algorithm that enjoys both statistical and computational efficiency for the logistic bandit problem. In the binary case, our method improves the state-of-the-art binary logistic bandit method by reducing the per-round computation cost from $\mathcal{O}(\log T)$ to $\mathcal{O}(1)$ with respect to the time horizon $T$, while still preserving the minimax optimal guarantee up to logarithmic factors. In the multinomial case, with $K+1$ potential feedback values, our algorithm achieves an $\tilde{\mathcal{O}}(K\sqrt{T})$ regret bound with $\mathcal{O}(1)$ computational cost per round. The result not only improves the $\tilde{\mathcal{O}}(K\sqrt{\kappa T})$ bound for the best-known tractable algorithm—where the large constant $\kappa$ increases exponentially with the diameter of the parameter domain—but also reduces the $\mathcal{O}(T)$ computational complexity demanded by the previous method.

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise
Ta Duy Nguyen Thien Hang Nguyen Alina Ene Huy Nguyen



Research question: This work studies the convergence in high probability of clipped gradient methods when the noise distribution has heavy tails, i.e., bounded $p$th moments for some $1 < p \leq 2$.
Motivation: Existing approaches rely mainly on concentration inequalities and an inductive argument with a union bound over all iterations, which inflates the failure probability by a factor of $T$, the number of iterations.
Method: A new analysis that bounds the moment generating function of a carefully chosen supermartingale sequence, improving the $T$-dependence of the convergence guarantees for a large family of clipped-gradient algorithms, including stochastic (accelerated) mirror descent for convex objectives and stochastic gradient descent for non-convex objectives.
Results: The high-probability bounds achieve the optimal convergence rates and match the best known in-expectation bounds. The approach naturally allows time-varying step sizes and clipping parameters when the horizon is unknown, which appears difficult or impossible with existing techniques; moreover, for clipped stochastic mirror descent, several problem constants, including the initial distance to the optimum, are not needed to set the step size and clipping parameter.

In this work, we study the convergence in high probability of clipped gradient methods when the noise distribution has heavy tails, i.e., with bounded $p$th moments, for some $1 < p \leq 2$.

Error Bounds for Learning with Vector-Valued Random Features
Samuel Lanthaler Nicholas H. Nelsen



Research question: This paper provides a comprehensive error analysis of learning with vector-valued random features (RF).
Motivation: Existing analyses rely on concentration results from random matrix theory or their generalizations to random operators; this paper instead analyzes the underlying risk functional directly, avoiding the explicit RF ridge regression solution formula.
Method: The theory is developed for RF ridge regression in a fully general infinite-dimensional input-output setting, while also improving existing finite-dimensional analyses.
Results: The main results include strong consistency of vector-valued RF estimators under model misspecification and minimax-optimal convergence rates in the well-specified setting. The parameter complexity (number of random features) and sample complexity (number of labeled data) required to achieve these rates match Monte Carlo intuition and are free of logarithmic factors.

This paper provides a comprehensive error analysis of learning with vector-valued random features (RF). The theory is developed for RF ridge regression in a fully general infinite-dimensional input-output setting, but nonetheless applies to and improves existing finite-dimensional analyses. In contrast to comparable work in the literature, the approach proposed here relies on a direct analysis of the underlying risk functional and completely avoids the explicit RF ridge regression solution formula in terms of random matrices. This removes the need for concentration results in random matrix theory or their generalizations to random operators. The main results established in this paper include strong consistency of vector-valued RF estimators under model misspecification and minimax optimal convergence rates in the well-specified setting. The parameter complexity (number of random features) and sample complexity (number of labeled data) required to achieve such rates are comparable with Monte Carlo intuition and free from logarithmic factors.
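As a concrete instance of the estimator being analyzed, the sketch below fits random-features ridge regression with random Fourier features and a vector-valued target. The paper's theory covers far more general (including infinite-dimensional) input-output settings; this is only the simplest finite-dimensional special case, with illustrative hyperparameters.

```python
import numpy as np

def rf_ridge_fit(X, Y, n_features, lam, rng):
    """Random-features ridge regression with random Fourier features
    (a Gaussian-kernel instance). Returns a predictor Z -> Y_hat;
    Y may be vector-valued of shape (n, k)."""
    d = X.shape[1]
    W = rng.normal(size=(d, n_features))        # random frequencies
    b = rng.uniform(0, 2 * np.pi, n_features)   # random phases
    phi = lambda Z: np.sqrt(2.0 / n_features) * np.cos(Z @ W + b)
    F = phi(X)                                  # (n, n_features) design
    C = np.linalg.solve(F.T @ F + lam * np.eye(n_features), F.T @ Y)
    return lambda Z: phi(Z) @ C

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)  # vector-valued target
predict = rf_ridge_fit(X, Y, n_features=300, lam=1e-3, rng=rng)
print(np.mean((predict(X) - Y) ** 2))           # small training error
```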

PAC Learning Linear Thresholds from Label Proportions
Anand Paresh Brahmbhatt Rishi Saket Aravindan Raghuveer



Research question: This paper explores the computational learnability of learning from label proportions (LLP): efficiently learning linear threshold functions (LTFs) given random bags with label proportions.
Motivation: Most prior work on LLP focused on training models on such data, and the computational learnability of LLP was explored only recently. Saket (2021, 2022) showed that properly learning LTFs from label proportions is intractable in the worst case, but did not rule out efficient algorithms for natural distributions.
Method: The paper gives an efficient method for learning LTFs from random bags of label proportions in which feature vectors are, conditioned on their labels, independently sampled from a Gaussian distribution $N(\mu, \Sigma)$. A certain matrix, formed from covariances of differences of feature vectors sampled from bags with and without replacement, is shown to have its principal component, after a transformation, in the direction of the LTF's normal vector.
Results: Using subgaussian concentration bounds to estimate the means and covariance matrices, combined with novel generalization error bounds in the bag setting, a low-error hypothesis LTF can be identified. For some special cases of the $N(0, I)$ distribution, a simpler mean-estimation-based algorithm is given. An experimental evaluation shows the methods are more effective than those of Saket (2021, 2022) and random LTFs.

Learning from label proportions (LLP) is a generalization of supervised learning in which the training data is available as sets or bags of feature-vectors (instances) along with the average instance-label of each bag. The goal is to train a good instance classifier. While most previous works on LLP have focused on training models on such training data, computational learnability of LLP was only recently explored by Saket (2021, 2022) who showed worst case intractability of properly learning linear threshold functions (LTFs) from label proportions. However, their work did not rule out efficient algorithms for this problem for natural distributions. In this work we show that it is indeed possible to efficiently learn LTFs using LTFs when given access to random bags of some label proportion in which feature-vectors are, conditioned on their labels, independently sampled from a Gaussian distribution $N(µ, Σ)$. Our work shows that a certain matrix – formed using covariances of the differences of feature-vectors sampled from the bags with and without replacement – necessarily has its principal component, after a transformation, in the direction of the normal vector of the LTF. Our algorithm estimates the means and covariance matrices using subgaussian concentration bounds which we show can be applied to efficiently sample bags for approximating the normal direction. Using this in conjunction with novel generalization error bounds in the bag setting, we show that a low error hypothesis LTF can be identified. For some special cases of the $N(0, I)$ distribution we provide a simpler mean estimation based algorithm. We include an experimental evaluation of our learning algorithms along with a comparison with those of Saket (2021, 2022) and random LTFs, demonstrating the effectiveness of our techniques.

CLIP-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments
Jessica Dai Paula Gradu Christopher Harshaw



Research question: This work studies adaptive sequential designs for causal inference, as used in settings ranging from clinical development of cancer therapies to investigations of partisan bias.
Motivation: Although adaptive sequential designs are increasingly popular because they may offer better precision than non-adaptive designs, even in simple settings (e.g., two treatments) the extent of the improvement is not sufficiently well understood.
Method: The paper studies Adaptive Neyman Allocation in a design-based potential outcomes framework, where the experimenter seeks an adaptive design nearly as efficient as the optimal (but infeasible) non-adaptive Neyman design. Motivated by online optimization, the Neyman ratio and Neyman regret are proposed as two equivalent performance measures of adaptive designs for this problem.
Results: Clip-OGD, the proposed adaptive design, achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ expected Neyman regret and thereby recovers the optimal Neyman variance in large samples. A conservative variance estimator is constructed, enabling asymptotically valid confidence intervals, and the theory is complemented with simulations using data from a microeconomic experiment.

From clinical development of cancer therapies to investigations into partisan bias, adaptive sequential designs have become an increasingly popular method for causal inference, as they offer the possibility of improved precision over their non-adaptive counterparts. However, even in simple settings (e.g. two treatments) the extent to which adaptive designs can improve precision is not sufficiently well understood. In this work, we study the problem of Adaptive Neyman Allocation in a design-based potential outcomes framework, where the experimenter seeks to construct an adaptive design which is nearly as efficient as the optimal (but infeasible) non-adaptive Neyman design, which has access to all potential outcomes. Motivated by connections to online optimization, we propose Neyman Ratio and Neyman Regret as two (equivalent) performance measures of adaptive designs for this problem. We present Clip-OGD, an adaptive design which achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ expected Neyman regret and thereby recovers the optimal Neyman variance in large samples. Finally, we construct a conservative variance estimator which facilitates the development of asymptotically valid confidence intervals. To complement our theoretical results, we conduct simulations using data from a microeconomic experiment.
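A loose sketch of the adaptive-allocation idea: the treatment probability is updated by online gradient descent on an unbiased single-sample estimate of the Neyman variance objective, while being clipped away from 0 and 1 by a slowly shrinking bound. The loss, step size, and clipping schedule below are illustrative assumptions, not the paper's specification of Clip-OGD.

```python
import numpy as np

def adaptive_neyman_sketch(outcomes, eta=1e-3, rng=None):
    """Illustrative adaptive Neyman allocation: p is the treatment
    probability; the Horvitz-Thompson terms estimate the average
    treatment effect while p adapts to the outcome variances."""
    rng = rng or np.random.default_rng(0)
    p, estimates = 0.5, []
    for t, (y1, y0) in enumerate(outcomes, start=1):
        delta = 0.5 * t ** -0.25                 # shrinking clip bound
        p = float(np.clip(p, delta, 1 - delta))
        z = rng.random() < p                     # randomized assignment
        y = y1 if z else y0                      # only one outcome observed
        estimates.append(y / p if z else -y / (1 - p))
        # Unbiased estimate of d/dp [ E y1^2 / p + E y0^2 / (1 - p) ].
        g = -(y ** 2) / p ** 3 if z else (y ** 2) / (1 - p) ** 3
        p -= eta / np.sqrt(t) * g                # OGD step
    return np.mean(estimates)

rng = np.random.default_rng(1)
pot = [(1.0 + rng.normal(), rng.normal(scale=0.5)) for _ in range(2000)]
print(adaptive_neyman_sketch(pot, rng=rng))      # estimates the true effect, 1.0
```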

Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption
Yige Hong Qiaomin Xie Yudong Chen Weina Wang



Research question: This paper studies the infinite-horizon Restless Bandit problem with the average reward criterion, in both discrete-time and continuous-time settings.
Motivation: Designing computationally efficient policies whose optimality gap diminishes as the number of arms $N$ grows, without relying on the uniform global attractor property (UGAP), a complex and hard-to-verify assumption required by all existing asymptotic optimality results.
Method: A general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state.
Results: In the discrete-time setting, the result holds under a simpler synchronization assumption that covers some problem instances violating UGAP; more notably, in the continuous-time setting no additional assumptions are needed beyond the standard unichain condition. In both settings, this is the first asymptotic optimality result that does not require UGAP.

We study the infinite-horizon Restless Bandit problem with the average reward criterion, under both discrete-time and continuous-time settings. A fundamental goal is to design computationally efficient policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large. Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption. In this paper, we propose a general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem. This is done by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that violate UGAP. More notably, in the continuous-time setting, we do not require \emph{any} additional assumptions beyond the standard unichain condition. In both settings, our work is the first asymptotic optimality result that does not require UGAP.

Parallel Submodular Function Minimization
Deeparnab Chakrabarty Andrei Graur Haotian Jiang Aaron Sidford



Research question: This paper studies the parallel complexity of submodular function minimization (SFM).
Motivation: Despite a line of work on improved parallel lower bounds for SFM, previously known parallel SFM algorithms either followed from more general sequential SFM methods or from highly parallel minimization of convex $\ell_2$-Lipschitz functions.
Method: Two new methods yielding two new query-versus-depth trade-offs for minimizing a submodular function defined on subsets of $n$ elements with integer values between $-M$ and $M$: the first has depth 2 and query complexity $n^{O(M)}$; the second has depth $\widetilde{O}(n^{1/3} M^{2/3})$ and query complexity $O(\mathrm{poly}(n, M))$.
Results: To obtain the second result, the paper gives the first highly parallel algorithm for minimizing $\ell_\infty$-Lipschitz functions over the hypercube, achieving near-optimal depth for constant accuracy.

We consider the parallel complexity of submodular function minimization (SFM). We provide a pair of methods which obtain two new query versus depth trade-offs for a submodular function defined on subsets of $n$ elements that has integer values between $-M$ and $M$. The first method has depth $2$ and query complexity $n^{O(M)}$ and the second method has depth $\widetilde{O}(n^{1/3} M^{2/3})$ and query complexity $O(\mathrm{poly}(n, M))$. Despite a line of work on improved parallel lower bounds for SFM, prior to our work the only known algorithms for parallel SFM either followed from more general methods for sequential SFM or highly-parallel minimization of convex $\ell_2$-Lipschitz functions. Interestingly, to obtain our second result we provide the first highly-parallel algorithm for minimizing $\ell_\infty$-Lipschitz function over the hypercube which obtains near-optimal depth for obtaining constant accuracy.

When Does Optimizing a Proper Loss Yield Calibration?
Jarosław Błasiok Parikshit Gopalan Lunjia Hu Preetum Nakkiran



Research question: Under what circumstances does optimizing a proper loss over a restricted family of predictors yield calibrated models, and what precise calibration guarantees does it give?
Motivation: Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties, but typical machine learning models are trained over restricted families of predictors that are unlikely to contain the ground truth.
Method: Global optimality is replaced by a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions.
Results: Any predictor satisfying this local optimality condition satisfies smooth calibration, which may explain why well-trained deep neural networks are calibrated from proper loss minimization alone. The connection between local optimality and calibration error also goes both ways: nearly calibrated predictors are nearly locally optimal.

Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors, that are unlikely to contain the ground truth. Under what circumstances does optimizing proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a rigorous answer to these questions. We replace the global optimality with a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions. We show that any predictor with this local optimality satisfies smooth calibration as defined in [Kakade and Foster, 2008, Błasiok et al., 2023]. Local optimality is plausibly satisfied by well-trained DNNs, which suggests an explanation for why they are calibrated from proper loss minimization alone. Finally, we show that the connection between local optimality and calibration error goes both ways: nearly calibrated predictors are also nearly locally optimal.

QuACK: Accelerating Gradient-Based Quantum Optimization with Koopman Operator Learning
Di Luo Jiayu Shen Rumen Dangovski Marin Soljacic



Research question: In quantum optimization, the complexity of gradient computation grows linearly with the number of parameters, which has hindered progress.
Motivation: To address this, the paper brings Koopman operator theory and natural gradient methods into quantum optimization, substantially accelerating gradient-based quantum optimization.
Method: A novel framework, Quantum-circuit Alternating Controlled Koopman learning (QuACK), which leverages an alternating algorithm to efficiently predict gradient dynamics on quantum computers.
Results: Empirical studies spanning quantum chemistry, quantum condensed matter, quantum machine learning, and noisy environments show that QuACK significantly accelerates gradient-based optimization: speedups of more than 200x in the overparameterized regime, 10x in the smooth regime, and 3x in the non-smooth regime.

Quantum optimization, a key application of quantum computing, has traditionally been stymied by the linearly increasing complexity of gradient calculations with an increasing number of parameters. This work bridges the gap between Koopman operator theory, which has found utility in applications because it allows for a linear representation of nonlinear dynamical systems, and natural gradient methods in quantum optimization, leading to a significant acceleration of gradient-based quantum optimization. We present Quantum-circuit Alternating Controlled Koopman learning (QuACK), a novel framework that leverages an alternating algorithm for efficient prediction of gradient dynamics on quantum computers. We demonstrate QuACK's remarkable ability to accelerate gradient-based optimization across a range of applications in quantum optimization and machine learning. In fact, our empirical studies, spanning quantum chemistry, quantum condensed matter, quantum machine learning, and noisy environments, have shown accelerations of more than 200x speedup in the overparameterized regime, 10x speedup in the smooth regime, and 3x speedup in the non-smooth regime. With QuACK, we offer a robust advancement that harnesses the advantage of gradient-based quantum optimization for practical benefits.

OKRidge: Scalable Optimal k-Sparse Ridge Regression
Jiachang Liu Sam Rosen Chudi Zhong Cynthia Rudin



Research question: Identifying sparse governing equations for nonlinear dynamical systems.
Motivation: This requires solving sparse ridge regression problems to provable optimality in order to determine which terms drive the underlying dynamics.
Method: A fast algorithm, OKRidge, for sparse ridge regression, using a novel lower bound calculation based on a saddle point formulation followed by either (i) solving a linear system or (ii) an ADMM-based approach whose proximal operators can be evaluated efficiently by solving another linear system and an isotonic regression problem. A beam-search-based method is also proposed to warm-start the solver.
Results: Experiments show the method attains provable optimality with run times orders of magnitude faster than the existing MIP formulations solved by the commercial solver Gurobi.

We consider an important problem in scientific discovery, namely identifying sparse governing equations for nonlinear dynamical systems. This involves solving sparse ridge regression problems to provable optimality in order to determine which terms drive the underlying dynamics. We propose a fast algorithm, OKRidge, for sparse ridge regression, using a novel lower bound calculation involving, first, a saddle point formulation, and from there, either solving (i) a linear system or (ii) using an ADMM-based approach, where the proximal operators can be efficiently evaluated by solving another linear system and an isotonic regression problem. We also propose a method to warm-start our solver, which leverages a beam search. Experimentally, our methods attain provable optimality with run times that are orders of magnitude faster than those of the existing MIP formulations solved by the commercial solver Gurobi.

Stochastic Multi-armed Bandits: Optimal Trade-off among Optimality, Consistency, and Tail Risk
David Simchi-Levi Zeyu Zheng Feng Zhu



Research question: This paper studies the stochastic multi-armed bandit problem and fully characterizes the interplay among three desired properties of policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk.
Motivation: Understanding how the order of expected regret affects the decay rate of the regret tail probability, in both the worst-case and instance-dependent scenarios.
Method: A novel policy achieving the optimal regret tail risk for any regret threshold: for any given $\alpha \in [1/2, 1)$ and $\beta \in [0, 1)$, it attains worst-case expected regret $\tilde O(T^\alpha)$ and instance-dependent expected regret $\tilde O(T^\beta)$, while the probability of incurring $\Omega(T^\delta)$ regret decays exponentially in a polynomial of $T$; this decay rate is proved to be best achievable. The analysis also generalizes to non-stationary baseline rewards.
Results: The results reveal the trade-off between expected regret and tail risk in both scenarios, indicating that more sub-optimality and inconsistency leave room for lighter-tailed risk of incurring a large regret.

We consider the stochastic multi-armed bandit problem and fully characterize the interplays among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of expected regret exactly affects the decaying rate of the regret tail probability for both the worst-case and instance-dependent scenario. A novel policy is proposed to achieve the optimal regret tail risk for any regret threshold. Concretely, for any given $\alpha\in[1/2, 1)$ and $\beta\in[0, 1)$, our policy achieves a worst-case expected regret of $\tilde O(T^\alpha)$ and instance-dependent expected regret of $\tilde O(T^\beta)$, while enjoying a probability of incurring an $\Omega(T^\delta)$ regret that decays exponentially with a polynomial $T$ term. Such a decaying rate is proved to be best achievable. We also generalize our analysis to the stochastic multi-armed bandit problem with non-stationary baseline rewards, where in each time period $t$, the decision maker pulls one of $K$ arms and collects a reward which is the sum of three terms: the mean of the pulled arm, an independent noise, and a non-stationary baseline reward as a function of $t$. Our results reveal insights on the trade-off between expected regret and tail risk for both worst-case and instance-dependent scenario, indicating that more sub-optimality and inconsistency leave space for more light-tailed risk of incurring a large regret.

The Behavior and Convergence of Local Bayesian Optimization
Kaiwen Wu Kyurae Kim Roman Garnett Jacob R. Gardner



Research question: This paper studies the behavior and convergence of local optimization strategies in Bayesian optimization for high-dimensional problems.
Motivation: Although local optimization strategies deliver strong empirical performance on high-dimensional problems, little is known concretely about their expected behavior or convergence.
Method: The behavior of the local approach is studied first, showing that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what one would expect to recover from global methods. The paper then gives the first rigorous analysis of the Bayesian local optimization algorithm recently proposed by Müller et al. (2021).
Results: Convergence rates are derived in both the noisy and noiseless settings.

A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.

Differentially Private Approximate Near Neighbor Counting in High Dimensions
Alexandr Andoni Piotr Indyk Sepideh Mahabadi Shyam Narayanan



Research question: How to perform range counting under differential privacy, i.e., counting the number of data points falling into a given query ball.
Motivation: Current range counting algorithms face a dichotomy: one class suffers additive error that is a fixed polynomial in the number of points; the other allows polylogarithmic additive error, but the error grows exponentially with the dimension.
Method: An efficient algorithm that offers a sweet spot between these two classes, with additive error that is an arbitrarily small power of the dataset size (depending on how fuzzy the range boundary is) plus a small $(1+o(1))$ multiplicative error; crucially, the amount of noise added has no dependence on the dimension.
Results: The algorithm introduces a variant of Locality-Sensitive Hashing and utilizes it in a novel manner.

Range counting (e.g., counting the number of data points falling into a given query ball) under differential privacy has been studied extensively. However, the current algorithms for this problem are subject to the following dichotomy. One class of algorithms suffers from an additive error that is a fixed polynomial in the number of points. Another class of algorithms allows for polylogarithmic additive error, but the error grows exponentially in the dimension. To achieve the latter, the problem is relaxed to allow a “fuzzy” definition of the range boundary, e.g., a count of the points in a ball of radius $r$ might also include points in a ball of radius $cr$ for some $c>1$. In this paper we present an efficient algorithm that offers a sweet spot between these two classes. The algorithm has an additive error that is an arbitrary small power of the data set size, depending on how fuzzy the range boundary is, as well as a small ($1+o(1)$) multiplicative error. Crucially, the amount of noise added has no dependence on the dimension. Our algorithm introduces a variant of Locality-Sensitive Hashing, utilizing it in a novel manner.

Distributionally Robust Linear Quadratic Control
Bahar Taskesen Dan Andrei Iancu Çağıl Koçyiğit Daniel Kuhn



Research question: This paper addresses the discrete-time, finite-horizon LQG control problem with uncertainty in the noise distributions.
Motivation: LQG control is a fundamental paradigm across engineering, computer science, economics, and neuroscience; when the noise distributions are unknown and belong to Wasserstein ambiguity sets centered at nominal (Gaussian) distributions, optimal control becomes challenging.
Method: A numerical solution method that uses the Frank-Wolfe algorithm to identify the least-favorable distributions within the Wasserstein ambiguity sets and computes the controller's optimal policy using Kalman filter estimation under these distributions.
Results: Despite the added complexity, a control policy that is linear in the observations is proved optimal, as in the classic LQG problem, and the proposed method efficiently characterizes it.

Linear-Quadratic-Gaussian (LQG) control is a fundamental control paradigm that is studied in various fields such as engineering, computer science, economics, and neuroscience. It involves controlling a system with linear dynamics and imperfect observations, subject to additive noise, with the goal of minimizing a quadratic cost function for the state and control variables. In this work, we consider a generalization of the discrete-time, finite-horizon LQG problem, where the noise distributions are unknown and belong to Wasserstein ambiguity sets centered at nominal (Gaussian) distributions. The objective is to minimize a worst-case cost across all distributions in the ambiguity set, including non-Gaussian distributions. Despite the added complexity, we prove that a control policy that is linear in the observations is optimal for this problem, as in the classic LQG problem. We propose a numerical solution method that efficiently characterizes this optimal control policy. Our method uses the Frank-Wolfe algorithm to identify the least-favorable distributions within the Wasserstein ambiguity sets and computes the controller's optimal policy using Kalman filter estimation under these distributions.

Online Control for Meta-optimization
Xinyi Chen Elad Hazan



Research question: Choosing optimal hyperparameters, such as the learning rate and momentum, is a significant yet non-convex challenge.
Motivation: Conventional iterative techniques such as hypergradient descent fall short of global optimality guarantees, so the paper considers the more general task of meta-optimization: online learning of the best optimization algorithm given problem instances.
Method: A novel control-theoretic approach that formulates meta-optimization as an optimal control problem, departing from existing literature that studies optimization via stability-based methods.
Results: The approach leverages convex relaxation techniques from the recently proposed nonstochastic control framework to overcome the challenge of nonconvexity and obtains regret guarantees against the best offline solution. This ensures that, in meta-optimization, one can learn a method whose convergence is comparable to that of the best optimization method in hindsight from a class of methods.

Choosing the optimal hyperparameters, including learning rate and momentum, for specific optimization instances is a significant yet non-convex challenge. This makes conventional iterative techniques such as hypergradient descent \cite{baydin2017online} insufficient in obtaining global optimality guarantees. We consider the more general task of meta-optimization -- online learning of the best optimization algorithm given problem instances, and introduce a novel approach based on control theory. We show how meta-optimization can be formulated as an optimal control problem, departing from existing literature that use stability-based methods to study optimization. Our approach leverages convex relaxation techniques in the recently-proposed nonstochastic control framework to overcome the challenge of nonconvexity, and obtains regret guarantees vs. the best offline solution. This guarantees that in meta-optimization, we can learn a method that attains convergence comparable to that of the best optimization method in hindsight from a class of methods.

The Pick-to-Learn Algorithm: Empowering Compression for Tight Generalization Bounds and Improved Post-training Performance
Dario Paccagnan Marco Campi Simone Garatti



Research question: How to build on compression theory to develop a new framework for learning algorithms that yields tight generalization bounds of wide practical applicability.
Motivation: Generalization bounds are valuable for both theory and applications: they shed light on the mechanisms underpinning learning processes and certify how well a learned model performs on unseen inputs.
Method: Embedding any given learning algorithm into a suitably constructed meta-algorithm (called Pick-to-Learn, P2L) in order to instill desirable compression properties.
Results: Applied to the MNIST classification dataset and a synthetic regression problem, P2L not only attains generalization bounds that compare favorably with the state of the art (test-set and PAC-Bayes bounds), but also learns models with better post-training performance.

Generalization bounds are valuable both for theory and applications. On the one hand, they shed light on the mechanisms that underpin the learning processes; on the other, they certify how well a learned model performs against unseen inputs. In this work we build upon a recent breakthrough in compression theory to develop a new framework yielding tight generalization bounds of wide practical applicability. The core idea is to embed any given learning algorithm into a suitably-constructed meta-algorithm (here called Pick-to-Learn, P2L) in order to instill desirable compression properties. When applied to the MNIST classification dataset and to a synthetic regression problem, P2L not only attains generalization bounds that compare favorably with the state of the art (test-set and PAC-Bayes bounds), but it also learns models with better post-training performance.

Unexpected Improvements to Expected Improvement for Bayesian Optimization
Sebastian Ament Sam Daulton David Eriksson Maximilian Balandat Eytan Bakshy



Research question: Acquisition functions such as Expected Improvement (EI) are widely used in Bayesian optimization, yet their performance is often exceeded by more recent methods; as the number of observations, the dimensionality of the search space, or the number of constraints grows, their numerical optimization becomes harder, leading to inconsistent and often sub-optimal performance.
Motivation: To address this, the paper proposes LogEI, a new family of acquisition functions whose members have optima identical or approximately equal to those of their canonical counterparts, but are substantially easier to optimize numerically.
Method: An analysis of classic analytic EI, Expected Hypervolume Improvement (EHVI), and their constrained, noisy, and parallel variants reveals numerical pathologies, and corresponding reformulations that remedy these pathologies are proposed.
Results: Empirically, members of the LogEI family substantially improve the optimization performance of their canonical counterparts and, surprisingly, perform on par with recent state-of-the-art acquisition functions, highlighting the understated role of numerical optimization in the literature.

Expected Improvement (EI) is arguably the most popular acquisition function in Bayesian optimization and has found countless successful applications, but its performance is often exceeded by that of more recent methods. Notably, EI and its variants, including for the parallel and multi-objective settings, are challenging to optimize because their acquisition values vanish numerically in many regions. This difficulty generally increases as the number of observations, dimensionality of the search space, or the number of constraints grow, resulting in performance that is inconsistent across the literature and most often sub-optimal. Herein, we propose LogEI, a new family of acquisition functions whose members either have identical or approximately equal optima as their canonical counterparts, but are substantially easier to optimize numerically. We demonstrate that numerical pathologies manifest themselves in “classic” analytic EI, Expected Hypervolume Improvement (EHVI), as well as their constrained, noisy, and parallel variants, and propose corresponding reformulations that remedy these pathologies. Our empirical results show that members of the LogEI family of acquisition functions substantially improve on the optimization performance of their canonical counterparts and surprisingly, are on par with or exceed the performance of recent state-of-the-art acquisition functions, highlighting the understated role of numerical optimization in the literature.
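The numerical pathology and its fix are easy to demonstrate for analytic EI. Writing $\mathrm{EI} = \sigma\,(z\,\Phi(z) + \phi(z))$ with $z = (\mu - \text{best})/\sigma$, the naive formula underflows to exactly zero far from the incumbent, flattening the acquisition landscape, whereas a log-space evaluation remains finite and optimizable. The sketch below uses a standard Mills-ratio tail expansion for very negative $z$ and is a simplified stand-in for the paper's reformulations.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_ei(mu, sigma, best):
    """Numerically stable log Expected Improvement (maximization),
    EI = sigma * (z * Phi(z) + phi(z)), z = (mu - best) / sigma."""
    z = (mu - best) / sigma
    if z > 0:
        log_h = logsumexp([norm.logpdf(z), np.log(z) + norm.logcdf(z)])
    elif z > -10:
        log_h = np.log(norm.pdf(z) + z * norm.cdf(z))  # safe in this range
    else:
        # Tail expansion h(z) ~ phi(z)/z^2 * (1 - 3/z^2) as z -> -inf.
        log_h = norm.logpdf(z) - 2 * np.log(-z) + np.log1p(-3.0 / z ** 2)
    return np.log(sigma) + log_h

mu, sigma, best = 0.0, 1.0, 45.0
z = (mu - best) / sigma
print(sigma * (z * norm.cdf(z) + norm.pdf(z)))  # 0.0: EI (and its gradient) vanish
print(log_ei(mu, sigma, best))                  # about -1021: finite and usable
```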

An Optimal and Scalable Matrix Mechanism for Noisy Marginals under Convex Loss Functions
Yingtai Xiao Guanlin He Danfeng Zhang Daniel Kifer



Research question: How to effectively protect data privacy while supporting downstream tasks such as contingency table analysis, Bayesian network construction, and synthetic data generation.
Motivation: Existing matrix mechanisms for linear queries (such as marginals) support only one predefined objective function and are slow and memory-hungry in large-scale settings.
Method: ResidualPlanner, an optimal and scalable matrix mechanism for marginals with Gaussian noise, which can optimize many loss functions expressible as a convex function of marginal variances and can optimize the accuracy of marginals in large-scale settings in seconds.
Results: Experiments show that ResidualPlanner runs in minutes even on datasets with 100 attributes, and computes variance/covariance values for each marginal far more efficiently than prior methods.

Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner, a matrix mechanism for marginals with Gaussian noise that is both optimal and scalable. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets).

Optimal Guarantees for Algorithmic Reproducibility and Gradient Complexity in Convex Optimization
Liang Zhang Junchi YANG Amin Karbasi Niao He



Research question: This paper addresses the deviation in the outputs of machine learning algorithms under minor changes in the training process, particularly in error-prone optimization settings.
Motivation: Previous work suggests that first-order methods must trade off convergence rate (gradient complexity) for better reproducibility; this paper challenges that perception and shows that optimal reproducibility and near-optimal convergence guarantees can both be achieved for smooth convex minimization and smooth convex-concave minimax problems under various error-prone oracle settings.
Method: Regularization-based algorithms that, given the inexact initialization oracle, achieve the best of both worlds, optimal reproducibility and near-optimal gradient complexity, for minimization and minimax optimization; with the inexact gradient oracle, the near-optimal guarantees also hold for minimax optimization.
Results: With the stochastic gradient oracle, stochastic gradient descent ascent is shown to be optimal in terms of both reproducibility and gradient complexity. These results deepen our understanding of the reproducibility-convergence trade-off in convex optimization.

Algorithmic reproducibility measures the deviation in outputs of machine learning algorithms upon minor changes in the training process. Previous work suggests that first-order methods would need to trade-off convergence rate (gradient complexity) for better reproducibility. In this work, we challenge this perception and demonstrate that both optimal reproducibility and near-optimal convergence guarantees can be achieved for smooth convex minimization and smooth convex-concave minimax problems under various error-prone oracle settings. Particularly, given the inexact initialization oracle, our regularization-based algorithms achieve the best of both worlds -- optimal reproducibility and near-optimal gradient complexity -- for minimization and minimax optimization. With the inexact gradient oracle, the near-optimal guarantees also hold for minimax optimization. Additionally, with the stochastic gradient oracle, we show that stochastic gradient descent ascent is optimal in terms of both reproducibility and gradient complexity. We believe our results contribute to an enhanced understanding of the reproducibility-convergence trade-off in the context of convex optimization.

Bayesian Extensive-Rank Matrix Factorization with Rotational Invariant Priors
Farzad Pourkamali Nicolas Macris



Research question: This paper studies a statistical model for matrix factorization in which the rank of the two hidden matrix factors grows linearly with their dimension and their product is corrupted by additive noise.
Motivation: Despite various approaches, the statistical and algorithmic limits of such problems have remained elusive.
Method: A Bayesian setting is studied under the assumptions that (a) one of the matrix factors is symmetric, (b) both factors and the additive noise have rotationally invariant priors, and (c) the priors are known to the statistician. Analytical formulas for rotation invariant estimators of the two matrix factors are derived and conjectured to be optimal in the large-dimension limit, in the sense of minimizing the average mean-square error. The derivation combines random matrix theory transforms, spherical integral formulas, and the replica method from statistical mechanics.
Results: Numerical checks confirm the optimality conjecture when compared against oracle estimators, which are optimal by definition but require the ground truth.

We consider a statistical model for matrix factorization in a regime where the rank of the two hidden matrix factors grows linearly with their dimension and their product is corrupted by additive noise. Despite various approaches, statistical and algorithmic limits of such problems have remained elusive. We study a Bayesian setting with the assumptions that (a) one of the matrix factors is symmetric, (b) both factors as well as the additive noise have rotational invariant priors, (c) the priors are known to the statistician. We derive analytical formulas for Rotation Invariant Estimators to reconstruct the two matrix factors, and conjecture that these are optimal in the large-dimension limit, in the sense that they minimize the average mean-square-error. We provide numerical checks which confirm the optimality conjecture when confronted with Oracle Estimators which are optimal by definition, but involve the ground-truth. Our derivation relies on a combination of tools, namely random matrix theory transforms, spherical integral formulas, and the replica method from statistical mechanics.

Faster Margin Maximization Rates for Generic Optimization Methods
Guanghui Wang Zihao Hu Vidya Muthukumar Jacob Abernethy



Research question: When minimizing a training objective, optimization methods inherently favor certain solutions over others, a phenomenon known as implicit bias that is critical to understanding the generalization capabilities of optimization algorithms.
Motivation: Recent research shows that gradient-descent-based methods exhibit an implicit bias toward the $\ell_2$-maximal margin classifier for separable binary classification, while generic methods such as mirror descent and steepest descent converge to maximal margin classifiers defined by other geometries, but at comparatively slow implicit bias rates.
Method: The paper presents a series of state-of-the-art implicit bias rates for mirror descent and steepest descent algorithms. The main technique transforms a generic optimization algorithm into an online learning dynamic that solves a regularized bilinear game, providing a unified framework for analyzing the implicit bias of various optimization methods.
Results: The accelerated implicit bias rates are derived by leveraging the regret bounds of online learning algorithms within this game framework.

First-order optimization methods tend to inherently favor certain solutions over others when minimizing a given training objective with multiple local optima. This phenomenon, known as \emph{implicit bias}, plays a critical role in understanding the generalization capabilities of optimization algorithms. Recent research has revealed that gradient-descent-based methods exhibit an implicit bias for the $\ell_2$-maximal margin classifier in the context of separable binary classification. In contrast, generic optimization methods, such as mirror descent and steepest descent, have been shown to converge to maximal margin classifiers defined by alternative geometries. However, while gradient-descent-based algorithms demonstrate fast implicit bias rates, the implicit bias rates of generic optimization methods have been relatively slow. To address this limitation, in this paper, we present a series of state-of-the-art implicit bias rates for mirror descent and steepest descent algorithms. Our primary technique involves transforming a generic optimization algorithm into an online learning dynamic that solves a regularized bilinear game, providing a unified framework for analyzing the implicit bias of various optimization methods. The accelerated rates are derived leveraging the regret bounds of online learning algorithms within this game framework.

Private Distribution Learning with Public Data: The View from Sample Compression
Shai Ben-David Alex Bie Clement Louis Canonne Gautam Kamath Vikrant Singhal



Research question: Private distribution learning when public data is available.
Motivation: In practice, one often needs to learn an unknown distribution while preserving privacy.
Method: The paper studies "public-private learning," in which the learner combines public and private samples to estimate an unknown distribution while satisfying privacy constraints only with respect to the private samples.
Results: The approach approximately recovers previous results on high-dimensional Gaussians and yields new ones, including sample complexity upper bounds for arbitrary mixtures of high-dimensional Gaussians, learners that are agnostic and robust to distribution shift, and closure properties for public-private learnability.

We study the problem of private distribution learning with access to public data. In this setup, which we refer to as *public-private learning*, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as \emph{list learning}. Leveraging this connection, we (1) approximately recover previous results on Gaussians over $\mathbb R^d$, and (2) obtain new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift resistant learners, as well as closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.

On the Minimax Regret for Online Learning with Feedback Graphs
Khaled Eldowa Emmanuel Esposito Tommaso Cesari Nicolò Cesa-Bianchi



Research question: Improving the upper and lower regret bounds for online learning with strongly observable undirected feedback graphs.
Motivation: The best known upper bound is $\mathcal{O}(\sqrt{\alpha T\ln K})$, where $K$ is the number of actions, $\alpha$ is the independence number of the graph, and $T$ is the time horizon. When $\alpha = 1$ (the experts case), the $\sqrt{\ln K}$ factor is known to be necessary; when $\alpha = K$ (the bandits case), the minimax rate is known to be $\Theta(\sqrt{KT})$, and a lower bound of $\Omega(\sqrt{\alpha T})$ holds for any $\alpha$.
Method: The result is proved using FTRL with the $q$-Tsallis entropy for a carefully chosen $q \in [1/2,1)$ that varies with $\alpha$.
Results: The improved upper bound $\mathcal{O}(\sqrt{\alpha T(1+\ln(K/\alpha))})$ holds for any $\alpha$ and matches the lower bounds for bandits and experts while interpolating intermediate cases. The techniques also extend to time-varying graphs without prior knowledge of their independence numbers.

In this work, we improve on the upper and lower bounds for the regret of online learning with strongly observable undirected feedback graphs. The best known upper bound for this problem is $\mathcal{O}\bigl(\sqrt{\alpha T\ln K}\bigr)$, where $K$ is the number of actions, $\alpha$ is the independence number of the graph, and $T$ is the time horizon. The $\sqrt{\ln K}$ factor is known to be necessary when $\alpha = 1$ (the experts case). On the other hand, when $\alpha = K$ (the bandits case), the minimax rate is known to be $\Theta\bigl(\sqrt{KT}\bigr)$, and a lower bound $\Omega\bigl(\sqrt{\alpha T}\bigr)$ is known to hold for any $\alpha$. Our improved upper bound $\mathcal{O}\bigl(\sqrt{\alpha T(1+\ln(K/\alpha))}\bigr)$ holds for any $\alpha$ and matches the lower bounds for bandits and experts, while interpolating intermediate cases. To prove this result, we use FTRL with $q$-Tsallis entropy for a carefully chosen value of $q \in [1/2, 1)$ that varies with $\alpha$. The analysis of this algorithm requires a new bound on the variance term in the regret. We also show how to extend our techniques to time-varying graphs, without requiring prior knowledge of their independence numbers. Our upper bound is complemented by an improved $\Omega\bigl(\sqrt{\alpha T(\ln K)/(\ln\alpha)}\bigr)$ lower bound for all $\alpha > 1$, whose analysis relies on a novel reduction to multitask learning. This shows that a logarithmic factor is necessary as soon as $\alpha < K$.
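
To make the algorithmic ingredient concrete, the following sketch solves the per-round FTRL step with the negative $q$-Tsallis entropy regularizer by bisection on the Lagrange multiplier. It shows the update rule only: the paper's full algorithm additionally builds graph-based loss estimates and ties $q$ to the independence number $\alpha$, none of which is shown here; losses and parameter values below are illustrative.

```python
import numpy as np

def tsallis_ftrl_distribution(L, eta, q):
    """Solve p = argmin_{p in simplex} eta*<L, p> + (1 - sum_i p_i^q)/(1 - q)
    (FTRL with negative q-Tsallis entropy). The KKT conditions give
    p_i = [ (1-q)/q * (eta*L_i + lam) ]^(-1/(1-q)); the normalizing
    multiplier lam is found by bisection."""
    c = (1.0 - q) / q

    def p_of(lam):
        return (c * (eta * L + lam)) ** (-1.0 / (1.0 - q))

    lo = -eta * L.min() + 1e-12          # sum(p) -> infinity as lam -> lo
    hi = lo + 1.0
    while p_of(hi).sum() > 1.0:          # grow hi until sum(p) < 1
        hi = lo + 2 * (hi - lo)
    for _ in range(100):                 # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if p_of(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    p = p_of(0.5 * (lo + hi))
    return p / p.sum()                   # tidy up residual normalization error

# Cumulative losses for K = 4 actions; q interpolates between the
# bandits regime (q = 1/2) and the experts regime (q -> 1).
L = np.array([3.0, 1.0, 2.5, 4.0])
for q in (0.5, 0.7, 0.9):
    print(q, np.round(tsallis_ftrl_distribution(L, eta=0.5, q=q), 3))
```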

Alternation makes the adversary weaker in two-player games
Volkan Cevher Ashok Cutkosky Ali Kavis Georgios Piliouras Stratis Skoulakis Luca Viano



Research question: An alternating variant of Online Linear Optimization (OLO), alternating OLO.
Motivation: Motivated by alternating game-play in two-player games.
Method: In alternating OLO, the learner selects a vector $x^t$ at each round and an adversary then selects a cost vector $c^t \in [-1,1]^n$. The learner experiences cost $(c^t + c^{t-1})^\top x^t$ instead of $(c^t)^\top x^t$ as in standard OLO. We establish that under this small twist, the $\Omega(\sqrt{T})$ regret lower bound is no longer valid.
Results: We present two online learning algorithms with $\mathcal{O}((\log n)^{4/3} T^{1/3})$ regret on the $n$-dimensional simplex and $\mathcal{O}(\rho \log T)$ regret on the ball of radius $\rho>0$. In alternating game-play, an agent can thus always guarantee $\tilde{\mathcal{O}}((\log n)^{4/3} T^{1/3})$ regret regardless of the other agent's strategy, and the bound improves to $\mathcal{O}(\log T)$ when the agent admits only two actions.

Motivated by alternating game-play in two-player games, we study an alternating variant of \textit{Online Linear Optimization} (OLO). In alternating OLO, a \textit{learner} at each round $t \in [T]$ selects a vector $x^t$ and then an \textit{adversary} selects a cost-vector $c^t \in [-1,1]^n$. The learner then experiences cost $(c^t + c^{t-1})^\top x^t$ instead of $(c^t)^\top x^t$ as in standard OLO. We establish that under this small twist, the $\Omega(\sqrt{T})$ lower bound on the regret is no longer valid. More precisely, we present two online learning algorithms for alternating OLO that respectively admit $\mathcal{O}((\log n)^{4/3} T^{1/3})$ regret for the $n$-dimensional simplex and $\mathcal{O}(\rho \log T)$ regret for the ball of radius $\rho>0$. Our results imply that in alternating game-play, an agent can always guarantee $\tilde{\mathcal{O}}((\log n)^{4/3} T^{1/3})$ regret regardless of the strategies of the other agent, while the regret bound improves to $\mathcal{O}(\log T)$ when the agent admits only two actions.

Accelerated Quasi-Newton Proximal Extragradient: Faster Rate for Smooth Convex Optimization
Ruichen Jiang Aryan Mokhtari



Research question: An accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems.
Motivation: Existing methods are limited in their convergence rates; the goal is to obtain faster rates through an improved algorithm.
Method: The method builds on a recent variant of the Monteiro-Svaiter acceleration framework and updates the Hessian approximation matrices from an online learning perspective, relating the convergence rate to the dynamic regret of an online convex optimization problem over matrices.
Results: The method achieves faster convergence than existing approaches in several regimes and gives the first provable gain of a quasi-Newton-type method over NAG in the convex setting.

In this paper, we propose an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective, we prove that our method can achieve a convergence rate of $\mathcal{O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations. In particular, in the regime where $k = \mathcal{O}(d)$, our method matches the _optimal rate_ of $\mathcal{O}(\frac{1}{k^2})$ by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $k = \Omega(d \log d)$, it outperforms NAG and converges at a _faster rate_ of $\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.
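
A quick numerical reading of the rate $\mathcal{O}(\min\{1/k^2, \sqrt{d\log k}/k^{2.5}\})$ shows where the quasi-Newton term takes over; the dimension and iteration counts below are arbitrary illustrative choices.

```python
import math

d = 100                      # illustrative problem dimension
for k in (10, 100, 1000, 10000):
    nag = 1.0 / k**2                           # Nesterov's accelerated rate
    qn = math.sqrt(d * math.log(k)) / k**2.5   # quasi-Newton extragradient term
    print(f"k={k:6d}  NAG={nag:.2e}  QN={qn:.2e}  bound={min(nag, qn):.2e}")
```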

Follow-ups Also Matter: Improving Contextual Bandits via Post-serving Contexts
Chaoqi Wang Ziyu Ye Zhe Feng Ashwinkumar Badanidiyuru Haifeng Xu



Research question: Addressing the limitation of the standard contextual bandit problem, which assumes that all relevant contexts are observed before the algorithm chooses an arm.
Motivation: On content recommendation platforms such as Youtube, Instagram, and Tiktok, additional features about a user's reward become available after the user clicks on content (e.g., dwell time, watch speed). To improve online learning efficiency in such applications, we study a novel contextual bandit problem with post-serving contexts and design a new algorithm, poLinUCB.
Method: Core to the analysis is a robustified and generalized version of the well-known Elliptical Potential Lemma (EPL) that accommodates noise in the data. This robustification is necessary for our problem, though it may also be of general interest.
Results: Extensive empirical tests on synthetic and real-world datasets demonstrate the significant benefit of utilizing post-serving contexts and the superior performance of our algorithm over state-of-the-art approaches.

The standard contextual bandit problem assumes that all the relevant contexts are observed before the algorithm chooses an arm. This modeling paradigm, while useful, often falls short when dealing with problems in which additional valuable contexts can be observed after arm selection. For example, content recommendation platforms like Youtube, Instagram, Tiktok receive many additional features about a user's reward after the user clicks on a content item (e.g., how long the user stayed, the user's watch speed, etc.). To improve online learning efficiency in these applications, we study a novel contextual bandit problem with post-serving contexts and design a new algorithm, poLinUCB, that achieves tight regret under standard assumptions. Core to our technical proof is a robustified and generalized version of the well-known Elliptical Potential Lemma (EPL), which can accommodate noise in data. Such robustification is necessary for tackling our problem, though we believe it could also be of general interest. Extensive empirical tests on both synthetic and real-world datasets demonstrate the significant benefit of utilizing post-serving contexts as well as the superior performance of our algorithm over the state-of-the-art approaches.

Feature Adaptation for Sparse Linear Regression
Jonathan Kelner Frederic Koehler Raghu Meka Dhruv Rohatgi



Research question: The central high-dimensional statistics problem of sparse linear regression in the correlated random design setting, where covariates are drawn from a multivariate Gaussian $N(0,\Sigma)$ and one seeks an estimator with small excess risk.
Motivation: True signals are often sparse in practice, but classical algorithms such as the Lasso can require far more samples than information-theoretically necessary when covariates are correlated. Designing an efficient algorithm that tolerates a small number of approximate dependencies is thus an important problem.
Method: A polynomial-time algorithm that, given $\Sigma$, automatically adapts the Lasso to tolerate a small number of approximate dependencies; in particular, it achieves near-optimal sample complexity for constant sparsity when $\Sigma$ has few "outlier" eigenvalues.
Results: Within the broader feature-adaptation framework, the paper also gives the first polynomial-factor improvement over brute-force search for constant sparsity $t$ and arbitrary covariance $\Sigma$.

Sparse linear regression is a central problem in high-dimensional statistics. We study the correlated random design setting, where the covariates are drawn from a multivariate Gaussian $N(0,\Sigma)$, and we seek an estimator with small excess risk. If the true signal is $t$-sparse, information-theoretically, it is possible to achieve strong recovery guarantees with only $O(t\log n)$ samples. However, computationally efficient algorithms have sample complexity linear in (some variant of) the *condition number* of $\Sigma$. Classical algorithms such as the Lasso can require significantly more samples than necessary even if there is only a single sparse approximate dependency among the covariates. We provide a polynomial-time algorithm that, given $\Sigma$, automatically adapts the Lasso to tolerate a small number of approximate dependencies. In particular, we achieve near-optimal sample complexity for constant sparsity and if $\Sigma$ has few ``outlier'' eigenvalues. Our algorithm fits into a broader framework of *feature adaptation* for sparse linear regression with ill-conditioned covariates. With this framework, we additionally provide the first polynomial-factor improvement over brute-force search for constant sparsity $t$ and arbitrary covariance $\Sigma$.

Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks
Daogao Liu Arun Ganesh Sewoong Oh Abhradeep Guha Thakurta



Research question: Revisiting the challenge of non-convex optimization under differential privacy constraints.
Motivation: Building on the previous variance-reduced algorithm SpiderBoost, we propose a novel framework employing two kinds of gradient oracles: one estimating the gradient at a single point, and a cheaper one computing the gradient difference between two points.
Method: The framework ensures continuous accuracy of the gradient estimations and improves the rates of identifying second-order stationary points. We additionally attempt to locate the global minima of non-convex objectives via the exponential mechanism with almost no assumptions.
Results: Preliminary results show that the regularized exponential mechanism can emulate previous empirical and population risk bounds without smoothness assumptions for algorithms with polynomial running time. With running time factors excluded, the exponential mechanism achieves promising population risk bounds, and a nearly matching lower bound is provided.

We reconsider the challenge of non-convex optimization under differential privacy constraint. Building upon the previous variance-reduced algorithm SpiderBoost, we propose a novel framework that employs two types of gradient oracles: one that estimates the gradient at a single point and a more cost-effective option that calculates the gradient difference between two points. Our framework can ensure continuous accuracy of gradient estimations and subsequently enhances the rates of identifying second-order stationary points. Additionally, we consider a more challenging task by attempting to locate the global minima of a non-convex objective via the exponential mechanism with almost no assumptions. Our preliminary results suggest that the regularized exponential mechanism can effectively emulate previous empirical and population risk bounds, negating the need for smoothness assumptions for algorithms with polynomial running time. Furthermore, with running time factors excluded, the exponential mechanism demonstrates promising population risk bound performance, and we provide a nearly matching lower bound.

On the Learnability of Multilabel Ranking
Vinod Raman UNIQUE SUBEDI Ambuj Tewari



Research question: Multilabel ranking is a central task in machine learning, but the fundamental question of learnability in the multilabel ranking setting with relevance-score feedback has remained unanswered.
Motivation: To characterize the learnability of multilabel ranking problems for a large family of ranking losses, in both batch and online settings.
Method: Along the way, the paper gives two equivalence classes of ranking losses based on learnability, which capture most losses used in practice.
Results: The characterization can guide the understanding and design of new ranking losses.

Multilabel ranking is a central task in machine learning. However, the most fundamental question of learnability in a multilabel ranking setting with relevance-score feedback remains unanswered. In this work, we characterize the learnability of multilabel ranking problems in both batch and online settings for a large family of ranking losses. Along the way, we give two equivalence classes of ranking losses based on learnability that capture most losses used in practice.

Regret Matching+: (In)Stability and Fast Convergence in Games
Gabriele Farina Julien Grand-Clément Christian Kroer Chung-Wei Lee Haipeng Luo



Research question: A theoretical understanding of the practical success of Regret Matching$^+$ (RM$^+$) and its variants in large-scale games remains a mystery.
Motivation: Recent advances on fast convergence in games are limited to no-regret algorithms, such as online mirror descent, that satisfy stability.
Method: The paper first gives counterexamples showing that RM$^+$ and its predictive version can be unstable, which might cause other players to suffer large regret. It then provides two fixes: restarting, and chopping off the positive orthant in which RM$^+$ works.
Results: These fixes suffice to obtain $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM$^+$ with predictions. The stabilizing techniques are also applied to clairvoyant updates in the uncoupled learning setting for RM$^+$, proving results akin to recent work on clairvoyant online mirror descent.

Regret Matching$^+$ (RM$^+$) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting and chopping off the positive orthant that RM$^+$ works in. We show that these fixes are sufficient to get $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM$^+$ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM$^+$ and prove desirable results akin to recent works for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM$^+$-based algorithms in matrix and extensive-form games.
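
For reference, here is vanilla RM$^+$ self-play in a zero-sum matrix game, the baseline whose instability the paper studies; the fixes proposed in the paper (restarting, smoothing via predictions) are deliberately not included in this sketch.

```python
import numpy as np

def rm_plus_strategy(R):
    """Current strategy from the nonnegative (clipped) regret vector."""
    s = R.sum()
    return R / s if s > 0 else np.full(len(R), 1.0 / len(R))

def rm_plus_selfplay(A, T=5000):
    """Vanilla Regret Matching+ self-play in a zero-sum matrix game with
    payoff matrix A (row player maximizes x^T A y); returns average strategies."""
    m, n = A.shape
    Rx, Ry = np.zeros(m), np.zeros(n)
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for t in range(T):
        x, y = rm_plus_strategy(Rx), rm_plus_strategy(Ry)
        ux, uy = A @ y, -A.T @ x                 # per-action utilities
        Rx = np.maximum(Rx + ux - x @ ux, 0.0)   # RM+: clip at the orthant
        Ry = np.maximum(Ry + uy - y @ uy, 0.0)
        x_avg += x; y_avg += y
    return x_avg / T, y_avg / T

# Rock-paper-scissors: average play should approach the uniform equilibrium.
A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
x, y = rm_plus_selfplay(A)
print(np.round(x, 3), np.round(y, 3))
```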

Characterizing the Optimal $0-1$ Loss for Multi-class Classification with a Test-time Attacker
Sihui Dai Wenxin Ding Arjun Nitin Bhagoji Daniel Cullina Haitao Zheng Ben Y. Zhao Prateek Mittal



Research question: Determining the robustness to adversarial examples of the best possible classifier under a given threat model and a fixed data distribution, and comparing it to that achieved by state-of-the-art training methods.
Motivation: Finding classifiers robust to adversarial examples is critical for their safe deployment.
Method: The paper finds achievable information-theoretic lower bounds on the robust loss of multi-class classifiers on any discrete dataset in the presence of a test-time attacker, via a general framework for computing the optimal $0$-$1$ loss built around a conflict hypergraph constructed from the data and adversarial constraints.
Results: The first analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.

Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a fixed data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on robust loss in the presence of a test-time attacker for *multi-class classifiers on any discrete dataset*. We provide a general framework for finding the optimal $0-1$ loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. The prohibitive cost of this formulation in practice leads us to formulate other variants of the attacker-classifier game that more efficiently determine the range of the optimal loss. Our evaluation shows, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.

Neural Injective Functions for Multisets, Measures and Graphs via a Finite Witness Theorem
Tal Amir Steven J. Gortler Ilai Avni Ravina Ravina Nadav Dym



Research question: Closing the mismatch between theory and practice for injective multiset functions: provably injective multiset functions in theory typically rely on polynomial moments, while those used in practice rely on neural moments.
Motivation: Although injective multiset functions play a key role in the theoretical study of machine learning on multisets and graphs, the injectivity of the neural-moment functions used in practice had not been studied.
Method: The paper bridges this gap by proving that moments of neural networks do define injective multiset functions, provided an analytic non-polynomial activation is used. The number of moments required by the theory is essentially optimal, up to a multiplicative factor of two. The proof rests on a finite witness theorem, which is of independent interest.
Results: As corollaries, the paper derives new approximation results for functions on multisets and measures and new separation results for graph neural networks, together with two negative results: (1) moments of piecewise-linear neural networks cannot be injective multiset functions; and (2) even when moment-based multiset functions are injective, they can never be bi-Lipschitz.

Injective multiset functions have a key role in the theoretical study of machine learning on multisets and graphs. Yet, there remains a gap between the provably injective multiset functions considered in theory, which typically rely on polynomial moments, and the multiset functions used in practice, which rely on $\textit{neural moments}$ — whose injectivity on multisets has not been studied to date. In this paper, we bridge this gap by showing that moments of neural networks do define injective multiset functions, provided that an analytic non-polynomial activation is used. The number of moments required by our theory is optimal essentially up to a multiplicative factor of two. To prove this result, we state and prove a $\textit{finite witness theorem}$, which is of independent interest. As a corollary to our main theorem, we derive new approximation results for functions on multisets and measures, and new separation results for graph neural networks. We also provide two negative results: (1) moments of piecewise-linear neural networks cannot be injective multiset functions; and (2) even when moment-based multiset functions are injective, they can never be bi-Lipschitz.

Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
Krunoslav Lehman Pavasovic Alain Durmus Umut Simsekli



Research question: Understanding the heavy-tailed behavior that stochastic gradient descent (SGD) can exhibit in practical settings and its correlation with overall performance.
Motivation: Theoretical studies have shown that online single-pass SGD can exhibit heavy tails, but the mechanism producing heavy-tailed behavior in practice, where the training data is finite, is not well understood.
Method: The paper studies offline multi-pass SGD and proves that its stationary distribution exhibits "approximate" power-law tails, with the approximation error controlled by how fast the empirical distribution of the training data converges to the true underlying data distribution in the Wasserstein metric.
Results: As the number of data points increases, offline SGD behaves increasingly "power-law-like." The theory is validated by experiments on synthetic data and neural networks.

A recent line of empirical studies has demonstrated that SGD might exhibit a heavy-tailed behavior in practical settings, and the heaviness of the tails might correlate with the overall performance. In this paper, we investigate the emergence of such heavy tails. Previous works on this problem only considered, up to our knowledge, online (also called single-pass) SGD, in which the emergence of heavy tails in theoretical findings is contingent upon access to an infinite amount of data. Hence, the underlying mechanism generating the reported heavy-tailed behavior in practical settings, where the amount of training data is finite, is still not well-understood. Our contribution aims to fill this gap. In particular, we show that the stationary distribution of offline (also called multi-pass) SGD exhibits ‘approximate’ power-law tails and the approximation error is controlled by how fast the empirical distribution of the training data converges to the true underlying data distribution in the Wasserstein metric. Our main takeaway is that, as the number of data points increases, offline SGD will behave increasingly ‘power-law-like’. To achieve this result, we first prove nonasymptotic Wasserstein convergence bounds for offline SGD to online SGD as the number of data points increases, which can be interesting on their own. Finally, we illustrate our theory on various experiments conducted on synthetic data and neural networks.

Non-Asymptotic Analysis of a UCB-based Top Two Algorithm
Marc Jourdan Rémy Degenne



Research question: Fixed-confidence best arm identification, providing non-asymptotic theoretical guarantees for Top Two algorithms.
Motivation: Although Top Two sampling rules have received increased attention in recent years, non-asymptotic guarantees for fixed-confidence best arm identification had remained open.
Method: The paper proposes a Top Two algorithm whose leader is chosen by the UCB algorithm, which satisfies all the sufficient properties required of a regret minimization algorithm used as leader.
Results: The proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.

A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a *leader* and a *challenger*. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
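
A simplified rendering of the sampling rule: the leader is the arm maximizing a UCB index and, as a crude stand-in for the paper's challenger, the challenger here is simply the best remaining arm by UCB. The Gaussian arms, the exploration constant, and the fixed $\beta = 1/2$ coin are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def ucb_top_two(means, beta=0.5, T=20000, seed=0):
    """Simplified Top Two sampling with a UCB leader on unit-variance
    Gaussian arms; returns the empirically best arm and the pull counts."""
    rng = np.random.default_rng(seed)
    K = len(means)
    n = np.ones(K)                               # pull each arm once
    s = np.array([means[k] + rng.standard_normal() for k in range(K)])
    for t in range(K, T):
        ucb = s / n + np.sqrt(2 * np.log(t) / n) # UCB index
        leader = int(np.argmax(ucb))
        rest = [k for k in range(K) if k != leader]
        challenger = rest[int(np.argmax(ucb[rest]))]  # proxy challenger
        k = leader if rng.random() < beta else challenger
        s[k] += means[k] + rng.standard_normal()
        n[k] += 1
    return int(np.argmax(s / n)), n

best, counts = ucb_top_two([0.5, 0.4, 0.3, 0.0])
print("recommended arm:", best, "pull counts:", counts)
```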

Is Learning in Games Good for the Learners?
William Brown Jon Schneider Kiran Vodrahalli



Research question: Tradeoffs between reward and regret in repeated gameplay between two agents.
Motivation: To study this, the paper introduces a notion of generalized equilibrium allowing asymmetric regret constraints, which yields polytopes of feasible values for each agent and each pair of regret constraints; any such equilibrium is reachable by a pair of algorithms that maintain their regret guarantees against arbitrary opponents.
Method: As a central example, the paper highlights the case where one agent is no-swap and the other's regret is unconstrained, showing that this captures an extension of Stackelberg equilibria with matching optimal value, and that there is a wide class of games where a player can significantly increase their utility by deviating from a no-swap-regret algorithm against a no-swap learner (in fact, almost any game without pure Nash equilibria is of this form). Generalized equilibria are further used to study tradeoffs with respect to the opponent's choice of algorithm.
Results: A tight characterization of the maximal reward obtainable against some no-regret learner, together with a class of games where this value is bounded away from what is obtainable against common "mean-based" no-regret algorithms. Finally, for learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown, the tradeoffs again depend on the opponent's learning algorithm: the Stackelberg strategy is learnable in exponential time with any no-regret agent (and in polynomial time with any no-adaptive-regret agent) whenever it is learnable via queries, and there are games where it is learnable in polynomial time against any no-swap-regret agent but requires exponential time against a mean-based no-regret agent.

We consider a number of questions related to tradeoffs between reward and regret in repeated gameplay between two agents. To facilitate this, we introduce a notion of generalized equilibrium which allows for asymmetric regret constraints, and yields polytopes of feasible values for each agent and pair of regret constraints, where we show that any such equilibrium is reachable by a pair of algorithms which maintain their regret guarantees against arbitrary opponents. As a central example, we highlight the case where one agent is no-swap and the other's regret is unconstrained. We show that this captures an extension of Stackelberg equilibria with a matching optimal value, and that there exists a wide class of games where a player can significantly increase their utility by deviating from a no-swap-regret algorithm against a no-swap learner (in fact, almost any game without pure Nash equilibria is of this form). Additionally, we make use of generalized equilibria to consider tradeoffs in terms of the opponent's algorithm choice. We give a tight characterization for the maximal reward obtainable against some no-regret learner, yet we also show a class of games in which this is bounded away from the value obtainable against the class of common "mean-based" no-regret algorithms. Finally, we consider the question of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown. Again we show tradeoffs depending on the opponent's learning algorithm: the Stackelberg strategy is learnable in exponential time with any no-regret agent (and in polynomial time with any no-adaptive-regret agent) for any game where it is learnable via queries, and there are games where it is learnable in polynomial time against any no-swap-regret agent but requires exponential time against a mean-based no-regret agent.

Smoothed Analysis of Sequential Probability Assignment
Alankrita Bhatt Nika Haghtalab Abhishek Shetty



Research question: Smoothed analysis of the sequential probability assignment problem with contexts.
Motivation: To understand the information-theoretically optimal minimax rates, as well as an algorithmic reduction framework involving the maximum likelihood estimator (MLE) oracle.
Method: The approach establishes a general-purpose reduction from minimax rates for sequential probability assignment against smoothed adversaries to minimax rates for transductive learning.
Results: This yields optimal (logarithmic) fast rates for parametric classes and classes with finite VC dimension. On the algorithmic front, the paper develops an algorithm that efficiently uses the MLE oracle and achieves sublinear regret for general function classes.

We initiate the study of smoothed analysis for the sequential probability assignment problem with contexts. We study information-theoretically optimal minimax rates as well as a framework for algorithmic reduction involving the maximum likelihood estimator oracle. Our approach establishes a general-purpose reduction from minimax rates for sequential probability assignment for smoothed adversaries to minimax rates for transductive learning. This leads to optimal (logarithmic) fast rates for parametric classes and classes with finite VC dimension. On the algorithmic front, we develop an algorithm that efficiently taps into the MLE oracle, for general classes of functions. We show that under general conditions this algorithmic approach yields sublinear regret.

A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm
Ilias Diakonikolas Daniel Kane Jasper C.H. Lee Ankit Pensia Thanasis Pittas



Research question: List-decodable Gaussian covariance estimation.
Motivation: Given a dataset $T$ in which an unknown $\alpha<1/2$ fraction of the points are samples from an unknown Gaussian, the goal is to output a list of $O(1/\alpha)$ hypotheses, at least one of which is close to $\Sigma$ in relative Frobenius norm.
Method: The paper gives an algorithm based purely on spectral techniques that needs only $\mathrm{poly}(d,1/\alpha)$ samples and time and guarantees relative Frobenius norm error of $\mathrm{poly}(1/\alpha)$.
Results: As a corollary, this yields an efficient spectral algorithm for robust partial clustering of Gaussian mixture models, a key ingredient in the recent work [BakDJKKV22] on robustly learning arbitrary GMMs. Combined with the other components of [BakDJKKV22], the method gives the first Sum-of-Squares-free algorithm for robustly learning GMMs, resolving an open problem posed by Vempala and Kothari.

We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb{R}^d$ such that an unknown $\alpha<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(\mu, \Sigma)$, the goal is to output a list of $O(1/\alpha)$ hypotheses at least one of which is close to $\Sigma$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/\alpha)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/\alpha)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) --- a key ingredient in the recent work of [BakDJKKV22] on robustly learning arbitrary GMMs. Combined with the other components of [BakDJKKV22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs, resolving an open problem proposed by Vempala and Kothari. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.

Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability
Jingfeng Wu Vladimir Braverman Jason D. Lee



Research question: The stability and convergence of gradient descent (GD) operating at the edge of stability (EoS) in machine learning optimization.
Motivation: GD in machine learning optimization is often observed to operate at the EoS, where stepsizes are set large, resulting in non-monotonic losses along the GD iterates.
Method: Theoretical analysis and numerical simulation of the convergence and implicit bias of constant-stepsize GD for logistic regression in the EoS regime.
Results: Despite local oscillations, the logistic loss can be minimized by GD with any constant stepsize over a long time scale. Moreover, the GD iterates tend to infinity when projected onto the max-margin (hard-margin SVM) direction, and converge to a fixed vector minimizing a strongly convex potential when projected onto its orthogonal complement. These findings agree with numerical simulations and complement existing theories on the convergence and implicit bias of GD for logistic regression, which apply only when the stepsizes are sufficiently small.

Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime. Despite the presence of local oscillations, we prove that the logistic loss can be minimized by GD with any constant stepsize over a long time scale. Furthermore, we prove that with any constant stepsize, the GD iterates tend to infinity when projected to a max-margin direction (the hard-margin SVM direction) and converge to a fixed vector that minimizes a strongly convex potential when projected to the orthogonal complement of the max-margin direction. In contrast, we also show that in the EoS regime, GD iterates may diverge catastrophically under the exponential loss, highlighting the superiority of the logistic loss. These theoretical findings are in line with numerical simulations and complement existing theories on the convergence and implicit bias of GD for logistic regression, which are only applicable when the stepsizes are sufficiently small.
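
The phenomenon is easy to reproduce. The sketch below runs constant-stepsize GD on the logistic loss over synthetic separable data with a deliberately large stepsize; the data, stepsize, and horizon are arbitrary illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                 # linearly separable by w = (1, 0)

def loss_and_grad(w):
    z = y * (X @ w)                  # margins
    loss = np.mean(np.logaddexp(0.0, -z))
    p = 0.5 * (1.0 - np.tanh(z / 2)) # sigmoid(-z), computed stably
    return loss, -(X.T @ (y * p)) / n

w = np.zeros(d)
eta = 20.0                           # deliberately large (EoS-regime) stepsize
losses = []
for t in range(5000):
    loss, g = loss_and_grad(w)
    losses.append(loss)
    w -= eta * g

# The loss oscillates early on but is minimized over the long run, while
# ||w|| grows, consistent with divergence along the max-margin direction.
print("first losses:", np.round(losses[:5], 3))
print("final loss:", losses[-1], " ||w||:", round(float(np.linalg.norm(w)), 2))
```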

Sample Complexity of Forecast Aggregation
Tao Lin Yiling Chen



Research question: A Bayesian forecast aggregation model in which experts observe private signals about an unknown binary event and report their posterior beliefs to a principal, who aggregates the reports into a single prediction for the event.
Motivation: The experts' signals and the event outcome follow a joint distribution unknown to the principal, but the principal has access to i.i.d. samples from the distribution, each a tuple of the experts' reports (not signals) and the realization of the event. The principal aims to use these samples to find an $\varepsilon$-approximately optimal aggregator, where optimality is measured by the expected squared distance between the aggregated prediction and the realization of the event.
Method: We show that for arbitrary discrete distributions, the sample complexity of this problem is at least $\tilde\Omega(m^{n-2}/\varepsilon)$, where $m$ is the size of each expert's signal space; this grows exponentially in the number of experts $n$. If the experts' signals are independent conditioned on the realization of the event, however, the sample complexity drops significantly to $\tilde O(1/\varepsilon^2)$, independent of $n$.
Results: The results generalize to non-binary events. The proof uses a reduction from the distribution learning problem and reveals that forecast aggregation is almost as difficult as distribution learning.

We consider a Bayesian forecast aggregation model where $n$ experts, after observing private signals about an unknown binary event, report their posterior beliefs about the event to a principal, who then aggregates the reports into a single prediction for the event. The signals of the experts and the outcome of the event follow a joint distribution that is unknown to the principal, but the principal has access to i.i.d. "samples" from the distribution, where each sample is a tuple of the experts' reports (not signals) and the realization of the event. Using these samples, the principal aims to find an $\varepsilon$-approximately optimal aggregator, where optimality is measured in terms of the expected squared distance between the aggregated prediction and the realization of the event. We show that the sample complexity of this problem is at least $\tilde \Omega(m^{n-2} / \varepsilon)$ for arbitrary discrete distributions, where $m$ is the size of each expert's signal space. This sample complexity grows exponentially in the number of experts $n$. But, if the experts' signals are independent conditioned on the realization of the event, then the sample complexity is significantly reduced, to $\tilde O(1 / \varepsilon^2)$, which does not depend on $n$. Our results can be generalized to non-binary events. The proof of our results uses a reduction from the distribution learning problem and reveals the fact that forecast aggregation is almost as difficult as distribution learning.

Saddle-to-Saddle Dynamics in Diagonal Linear Networks
Scott Pesme Nicolas Flammarion



Research question: The gradient flow trajectory of $2$-layer diagonal linear networks in the regression setting, in the limit of vanishing initialization.
Motivation: To understand the training process and learning dynamics of such networks at vanishing initialization.
Method: The visited saddles and jump times are explicitly characterized through a recursive algorithm reminiscent of the LARS algorithm used for computing the Lasso path. Starting from the zero vector, coordinates are successively activated until the minimum $\ell_1$-norm solution is recovered, revealing an incremental learning behavior.
Results: Numerical experiments support the findings; the analysis requires negligible assumptions on the data, applies to both under- and overparametrized settings, and covers complex cases where the number of active coordinates is not monotone.

In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear networks for the regression setting in the limit of vanishing initialisation. We show that the limiting flow successively jumps from a saddle of the training loss to another until reaching the minimum $\ell_1$-norm solution. We explicitly characterise the visited saddles as well as the jump times through a recursive algorithm reminiscent of the LARS algorithm used for computing the Lasso path. Starting from the zero vector, coordinates are successively activated until the minimum $\ell_1$-norm solution is recovered, revealing an incremental learning. Our proof leverages a convenient arc-length time-reparametrisation which enables us to keep track of the transitions between the jumps. Our analysis requires negligible assumptions on the data, applies to both under- and overparametrised settings and covers complex cases where there is no monotonicity of the number of active coordinates. We provide numerical experiments to support our findings.
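
A minimal simulation of the setting (not the paper's recursive saddle-characterization algorithm): gradient descent with a small stepsize on a $2$-layer diagonal linear network $\beta = u \odot u - v \odot v$ from a tiny initialization, on an underdetermined sparse regression problem. Coordinates activate progressively as the dynamics move from saddle to saddle; the data, initialization scale, and stepsize are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 20                            # underdetermined sparse regression
X = rng.standard_normal((n, d))
beta_true = np.zeros(d); beta_true[[0, 3, 7]] = [3.0, -2.0, 1.5]
y = X @ beta_true

alpha, eta = 1e-3, 2e-3                  # small initialisation scale
u = alpha * np.ones(d)                   # diagonal network: beta = u*u - v*v
v = alpha * np.ones(d)

for t in range(200001):
    beta = u * u - v * v
    g = X.T @ ((X @ beta - y) / n)       # gradient w.r.t. beta
    u, v = u - 2 * eta * g * u, v + 2 * eta * g * v
    if t % 50000 == 0:
        active = np.where(np.abs(u * u - v * v) > 0.05)[0]
        res = float(np.linalg.norm(X @ beta - y))
        print(t, "active coords:", active, "residual:", round(res, 4))

print("final beta:", np.round(u * u - v * v, 2))  # approaches the min-l1 interpolator
```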

Constant Approximation for Individual Preference Stable Clustering
Anders Aamand Justin Y. Chen Allen Liu Sandeep Silwal Pattara Sukprasert Ali Vakilian Fred Zhang



Research question: Natural clustering under stability and fairness constraints, and whether a dataset admits a $1$-IP stable clustering.
Motivation: Determining whether a $1$-IP stable clustering exists is NP-Hard, and prior guarantees only ensured that an $O(n)$-IP stable clustering exists.
Method: Building on the notion of individual preference (IP) stability, the paper proves that an $O(1)$-IP stable clustering always exists for general metrics, introduces generalizations of IP stability beyond average distance, and gives efficient near-optimal algorithms for the maximum- and minimum-distance variants.
Results: An efficient algorithm that outputs an $O(1)$-IP stable clustering, closing the gap in understanding.

Individual preference (IP) stability, introduced by Ahmadi et al. (ICML 2022), is a natural clustering objective inspired by stability and fairness constraints. A clustering is $\alpha$-IP stable if the average distance of every data point to its own cluster is at most $\alpha$ times the average distance to any other cluster. Unfortunately, determining if a dataset admits a $1$-IP stable clustering is NP-Hard. Moreover, before this work, it was unknown if an $o(n)$-IP stable clustering always exists, as the prior state of the art only guaranteed an $O(n)$-IP stable clustering. We close this gap in understanding and show that an $O(1)$-IP stable clustering always exists for general metrics, and we give an efficient algorithm which outputs such a clustering. We also introduce generalizations of IP stability beyond average distance and give efficient near optimal algorithms in the cases where we consider the maximum and minimum distances within and between clusters.
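
The IP-stability notion instantiates directly in a few lines: the sketch below computes the smallest $\alpha$ for which a given clustering is $\alpha$-IP stable, following the definition in the abstract; the data and the clustering are illustrative.

```python
import numpy as np

def ip_stability(points, labels):
    """Smallest alpha such that the clustering is alpha-IP stable: for every
    point, avg distance to its own cluster <= alpha * avg distance to any
    other cluster (directly instantiating the definition)."""
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    alpha = 0.0
    for i in range(len(points)):
        own = labels == labels[i]
        own[i] = False                        # exclude the point itself
        if not own.any():
            continue                          # singleton cluster: trivially stable
        d_own = D[i, own].mean()
        for c in set(labels) - {labels[i]}:
            alpha = max(alpha, d_own / D[i, labels == c].mean())
    return alpha

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(m, 0.3, (20, 2)) for m in (0.0, 5.0)])
labels = np.array([0] * 20 + [1] * 20)
print("alpha-IP stability of this 2-clustering:", round(ip_stability(pts, labels), 3))
```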

Max-Margin Token Selection in Attention Mechanism
Davoud Ataee Tarzanagh Yingcong Li Xuechen Zhang Samet Oymak



Research question: The theoretical principles underlying the attention mechanism, in particular its nonconvex optimization dynamics.
Motivation: Although attention has been central to the success of large language models, its theoretical foundations are poorly understood.
Method: An in-depth study of the softmax-attention model, proving that gradient descent converges in direction to a max-margin solution that separates locally optimal tokens from non-optimal ones, formalizing attention as an optimal token selection mechanism.
Results: Numerical experiments verify the theory and provide a deeper understanding of the attention mechanism.

Attention mechanism is a central component of the transformer architecture which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(X)=\langle Xv, \texttt{softmax}(XWp)\rangle$, where $X$ is the token sequence and $(v,W,p)$ are trainable parameters. We prove that running gradient descent on $p$, or equivalently $W$, converges in direction to a max-margin solution that separates *locally-optimal* tokens from non-optimal ones. This clearly formalizes attention as an optimal token selection mechanism. Remarkably, our results are applicable to general data and precisely characterize *optimality* of tokens in terms of the value embeddings $Xv$ and problem geometry. We also provide a broader regularization path analysis that establishes the margin maximizing nature of attention even for nonlinear prediction heads. When optimizing $v$ and $p$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions where $v$ separates the input features based on their labels. Interestingly, the SVM formulation of $p$ is influenced by the support vector geometry of $v$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
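
A single-sequence instance of the model from the abstract, with $W$ folded into the trainable vector $p$ and $v$ held fixed: running GD on $p$ under a logistic loss drives $\|p\|$ to grow while the softmax concentrates on the highest-scoring tokens. The data, label, and stepsize are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 6, 4                        # tokens and embedding dimension
X = rng.standard_normal((T, d))    # token sequence
v = rng.standard_normal(d)         # fixed value/prediction head
p = np.zeros(d)                    # trainable attention parameter
y = 1.0                            # label for this sequence

eta = 0.5
for t in range(20000):
    a = softmax(X @ p)             # attention weights over tokens
    f = (X @ v) @ a                # f(X) = <Xv, softmax(Xp)>
    s = -y / (1.0 + np.exp(y * f))           # d/df of log(1 + exp(-y f))
    grad_f = X.T @ (a * ((X @ v) - f))       # df/dp via the softmax Jacobian
    p -= eta * s * grad_f

a = softmax(X @ p)
print("token scores Xv: ", np.round(X @ v, 2))
print("attention weights:", np.round(a, 3))  # concentrates on high-score tokens
print("||p|| =", round(float(np.linalg.norm(p)), 2))  # grows without bound
```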

Tight Risk Bounds for Gradient Descent on Separable Data
Matan Schliserman Tomer Koren



Research question: The generalization properties of unregularized gradient methods applied to separable linear classification.
Motivation: This setting has received considerable attention since the pioneering work of Soudry et al. (2018).
Method: The paper establishes tight upper and lower (population) risk bounds for gradient descent in this setting, for any smooth loss function, expressed in terms of the loss's tail decay rate and the data margin.
Results: The upper bound greatly improves on the existing risk bounds of Shamir (2021) and Schliserman and Koren (2022), which either applied to specific loss functions or imposed extraneous technical assumptions, and applies to virtually any convex and smooth loss function. The lower bound is the first in this context and establishes the tightness of the upper bound for any given tail decay rate and in all parameter regimes. The proof technique is also markedly simpler than in previous work and extends readily to other gradient methods, as illustrated by analogous results for stochastic gradient descent.

We study the generalization properties of unregularized gradient methods applied to separable linear classification---a setting that has received considerable attention since the pioneering work of Soudry et al. (2018). We establish tight upper and lower (population) risk bounds for gradient descent in this setting, for any smooth loss function, expressed in terms of its tail decay rate. Our bounds take the form $\Theta(r_{\ell,T}^2 / \gamma^2 T + r_{\ell,T}^2 / \gamma^2 n)$, where $T$ is the number of gradient steps, $n$ is the size of the training set, $\gamma$ is the data margin, and $r_{\ell,T}$ is a complexity term that depends on the tail decay rate of the loss function (and on $T$). Our upper bound greatly improves the existing risk bounds due to Shamir (2021) and Schliserman and Koren (2022), that either applied to specific loss functions or imposed extraneous technical assumptions, and applies to virtually any convex and smooth loss function. Our risk lower bound is the first in this context and establishes the tightness of our general upper bound for any given tail decay rate and in all parameter regimes. The proof technique used to show these results is also markedly simpler compared to previous work, and is straightforward to extend to other gradient methods; we illustrate this by providing analogous results for Stochastic Gradient Descent.

Demystifying Softmax Gating Function in Gaussian Mixture of Experts
Huy Nguyen TrungTin Nguyen Nhat Ho



Research question: Parameter estimation for the softmax gating Gaussian mixture of experts.
Motivation: The problem has posed long-standing theoretical challenges, owing to the intrinsic interaction between the softmax gating function and the expert functions in the Gaussian density, and the complex dependence in the conditional density.
Method: The challenges are resolved by proposing novel Voronoi loss functions among parameters and establishing convergence rates of the maximum likelihood estimator (MLE) for solving parameter estimation in these models.
Results: When the true number of experts is unknown and over-specified, the findings reveal a connection between the convergence rate of the MLE and the solvability of a system of polynomial equations.

Understanding the parameter estimation of softmax gating Gaussian mixture of experts has remained a long-standing open problem in the literature. It is mainly due to three fundamental theoretical challenges associated with the softmax gating function: (i) the identifiability only up to the translation of parameters; (ii) the intrinsic interaction via partial differential equations between the softmax gating and the expert functions in the Gaussian density; (iii) the complex dependence between the numerator and denominator of the conditional density of softmax gating Gaussian mixture of experts. We resolve these challenges by proposing novel Voronoi loss functions among parameters and establishing the convergence rates of maximum likelihood estimator (MLE) for solving parameter estimation in these models. When the true number of experts is unknown and over-specified, our findings show a connection between the convergence rate of the MLE and a solvability problem of a system of polynomial equations.

Adaptive Data Analysis in a Balanced Adversarial Model
Kobbi Nissim Uri Stemmer Eliad Tsfadia



Research question: Providing accurate estimations for adaptively chosen statistical queries with respect to an unknown distribution in adaptive data analysis.
Motivation: Existing hardness results rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, raising questions about the applicability of the obtained hardness results.
Method: We consider more restricted adversaries, called "balanced," and revisit previous lower bounds using an efficient balanced adversary, under standard public-key cryptography assumptions.
Results: We show that these stronger hardness assumptions are unavoidable, in the sense that any computationally bounded balanced adversary that has the structure of all known attacks implies the existence of public-key cryptography.

In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $\cal{D}$, and is required to provide accurate estimations to a sequence of adaptively chosen statistical queries with respect to $\cal{D}$. Hardt and Ullman (FOCS 2014) and Steinke and Ullman (COLT 2015) showed that in general, it is computationally hard to answer more than $\Theta(n^2)$ adaptive queries, assuming the existence of one-way functions. However, these negative results strongly rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, as the analyst, who chooses the adaptive queries, also chooses the underlying distribution $\cal{D}$. This imbalance raises questions with respect to the applicability of the obtained hardness results -- an analyst who has complete knowledge of the underlying distribution $\cal{D}$ would have little need, if at all, to issue statistical queries to a mechanism which only holds a finite number of samples from $\cal{D}$. We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separated algorithms: The \emph{sampler} who is the entity that chooses the distribution and provides the samples to the mechanism, and the \emph{analyst} who chooses the adaptive queries, but has no prior knowledge of the underlying distribution (and hence has no a priori advantage with respect to the mechanism). We improve the quality of previous lower bounds by revisiting them using an efficient \emph{balanced} adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded \emph{balanced} adversary that has the structure of all known attacks, implies the existence of public-key cryptography.

Decentralized Randomly Distributed Multi-agent Multi-armed Bandit with Heterogeneous Rewards
Mengfan Xu Diego Klabjan



Research question: A decentralized multi-agent multi-armed bandit problem in which multiple clients are connected by time-dependent random graphs provided by an environment.
Motivation: The reward distributions of each arm vary across clients and rewards are generated independently over time from distributions that include both sub-exponential and sub-Gaussian families; the goal is to minimize the overall regret of the entire system through collaboration.
Method: A novel algorithmic framework that first provides robust simulation methods for generating random graphs (using rapidly mixing Markov chains or the random graph model) and then combines an averaging-based consensus approach with a newly proposed weighting technique and the upper confidence bound to deliver a UCB-type solution, removing the conventional double-stochasticity assumption and requiring only knowledge of the number of clients at initialization.
Results: Optimal instance-dependent regret upper bounds of order $\log T$ in both sub-Gaussian and sub-exponential environments, and a nearly optimal instance-free upper bound of order $\sqrt{T}\log T$ up to a $\log T$ factor; the bounds hold with high probability and capture graph randomness.

We study a decentralized multi-agent multi-armed bandit problem in which multiple clients are connected by time dependent random graphs provided by an environment. The reward distributions of each arm vary across clients and rewards are generated independently over time by an environment based on distributions that include both sub-exponential and sub-gaussian distributions. Each client pulls an arm and communicates with neighbors based on the graph provided by the environment. The goal is to minimize the overall regret of the entire system through collaborations. To this end, we introduce a novel algorithmic framework, which first provides robust simulation methods for generating random graphs using rapidly mixing Markov chains or the random graph model, and then combines an averaging-based consensus approach with a newly proposed weighting technique and the upper confidence bound to deliver a UCB-type solution. Our algorithms account for the randomness in the graphs, removing the conventional double stochasticity assumption, and only require the knowledge of the number of clients at initialization. We derive optimal instance-dependent regret upper bounds of order $\log{T}$ in both sub-gaussian and sub-exponential environments, and a nearly optimal instance-free regret upper bound of order $\sqrt{T}\log T$ up to a $\log T$ factor. Importantly, our regret bounds hold with high probability and capture graph randomness, whereas prior works consider expected regret under assumptions and require more stringent reward distributions.

Regularization properties of adversarially-trained linear regression
Antonio H. Ribeiro Dave Zachariah Francis Bach Thomas B. Schön



Research question: State-of-the-art machine learning models can be vulnerable to adversarially constructed, very small input perturbations; adversarial training is an effective defense.
Motivation: Even simple models such as linear models exhibit such vulnerabilities, and they are the focus of this study.
Method: A comparative analysis between the solution of adversarial training in linear regression and other regularization methods.
Results: (A) As long as the maximum disturbance radius is below a threshold, adversarial training yields the minimum-norm interpolating solution in the overparameterized regime. (B) For an appropriate choice of adversarial radius and zero-mean, symmetrically distributed covariates, adversarial training can be equivalent to parameter shrinking methods (ridge regression and Lasso). (C) For $\ell_\infty$-adversarial training, as in the square-root Lasso, the choice of adversarial radius for optimal bounds does not depend on the additive noise variance. The theoretical findings are confirmed with numerical examples.

State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is an effective approach to defend against it. Formulated as a min-max problem, it searches for the best solution when the training data were corrupted by the worst-case attacks. Linear models are among the simple models where vulnerabilities can be observed and are the focus of our study. In this case, adversarial training leads to a convex optimization problem which can be formulated as the minimization of a finite sum. We provide a comparative analysis between the solution of adversarial training in linear regression and other regularization methods. Our main findings are that: (A) Adversarial training yields the minimum-norm interpolating solution in the overparameterized regime (more parameters than data), as long as the maximum disturbance radius is smaller than a threshold. And, conversely, the minimum-norm interpolator is the solution to adversarial training with a given radius. (B) Adversarial training can be equivalent to parameter shrinking methods (ridge regression and Lasso). This happens in the underparametrized region, for an appropriate choice of adversarial radius and zero-mean symmetrically distributed covariates. (C) For $\ell_\infty$-adversarial training---as in square-root Lasso---the choice of adversarial radius for optimal bounds does not depend on the additive noise variance. We confirm our theoretical findings with numerical examples.
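
Finding (B) is easy to probe numerically because the inner maximization has a closed form for $\ell_\infty$ perturbations: $\max_{\|\delta\|_\infty \le \varepsilon} (y - (x+\delta)^\top w)^2 = (|y - x^\top w| + \varepsilon\|w\|_1)^2$. The sketch below runs subgradient descent on this worst-case objective; the data, stepsize, and radii are illustrative assumptions.

```python
import numpy as np

def adv_train_linreg(X, y, eps, eta=0.01, T=20000):
    """Subgradient descent on the l_inf adversarial training objective
    (1/n) * sum_i ( |y_i - x_i^T w| + eps * ||w||_1 )^2, using the closed
    form of the inner maximization over ||delta||_inf <= eps."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        r = y - X @ w
        m = np.abs(r) + eps * np.linalg.norm(w, 1)   # worst-case residuals
        g = (2.0 / n) * (-(X.T @ (m * np.sign(r))) + eps * m.sum() * np.sign(w))
        w -= eta * g
    return w

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

for eps in (0.0, 0.1, 0.5):    # a larger radius shrinks w, Lasso-like
    print(eps, np.round(adv_train_linreg(X, y, eps), 3))
```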

Robust Distributed Learning: Tight Error Bounds and Breakdown Point under Data Heterogeneity
Youssef Allouah Rachid Guerraoui Nirupam Gupta Rafael Pinot Geovani Rizk



Research question: Under data heterogeneity, established lower bounds on the learning error of robust distributed learning algorithms are essentially vacuous and greatly mismatch empirical observations.
Motivation: The heterogeneity model considered by existing theory is too restrictive and does not cover basic learning tasks such as least-squares regression, so it cannot explain or predict learning errors in practical scenarios.
Method: The paper proposes a more realistic heterogeneity model, $(G,B)$-gradient dissimilarity, and shows that it covers a larger class of learning problems than existing theory.
Results: The breakdown point under heterogeneity is shown to be lower than the classical fraction $\frac{1}{2}$; a new lower bound on the learning error of any distributed learning algorithm is proved, with a matching upper bound for a robust variant of distributed gradient descent, and experiments show that the analysis reduces the gap between theory and practice.

The theory underlying robust distributed learning algorithms, designed to resist adversarial machines, matches empirical observations when data is homogeneous. Under data heterogeneity however, which is the norm in practical scenarios, established lower bounds on the learning error are essentially vacuous and greatly mismatch empirical observations. This is because the heterogeneity model considered is too restrictive and does not cover basic learning tasks such as least-squares regression. We consider in this paper a more realistic heterogeneity model, namely $(G,B)$-gradient dissimilarity, and show that it covers a larger class of learning problems than existing theory. Notably, we show that the breakdown point under heterogeneity is lower than the classical fraction $\frac{1}{2}$. We also prove a new lower bound on the learning error of any distributed learning algorithm. We derive a matching upper bound for a robust variant of distributed gradient descent, and empirically show that our analysis reduces the gap between theory and practice.
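
For intuition, a common robust variant of distributed gradient descent replaces gradient averaging with a robust aggregator; the coordinate-wise trimmed mean below is one standard choice (a sketch under illustrative data, not the specific rule analyzed in the paper).

```python
import numpy as np

def robust_dgd(grads_fn, d, n_workers, n_byz, T=200, eta=0.1):
    """Distributed gradient descent with coordinate-wise trimmed-mean
    aggregation: the n_byz largest and smallest values per coordinate are
    dropped before averaging, resisting adversarial (Byzantine) workers."""
    w = np.zeros(d)
    for t in range(T):
        G = np.sort(grads_fn(w), axis=0)         # (n_workers, d) gradients
        trimmed = G[n_byz:n_workers - n_byz]     # drop extremes per coordinate
        w -= eta * trimmed.mean(axis=0)
    return w

rng = np.random.default_rng(0)
d, n_workers, n_byz = 3, 10, 2
targets = rng.normal(1.0, 0.5, (n_workers - n_byz, d))  # heterogeneous optima

def grads_fn(w):
    honest = w[None, :] - targets                # least-squares-style gradients
    byz = np.full((n_byz, d), 100.0)             # adversarial gradients
    return np.vstack([honest, byz])

print("robust DGD output:    ", np.round(robust_dgd(grads_fn, d, n_workers, n_byz), 3))
print("honest average target:", np.round(targets.mean(axis=0), 3))
```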

One-step differentiation of iterative algorithms
Jerome Bolte Edouard Pauwels Samuel Vaiter



Research question: A study of one-step differentiation, also known as Jacobian-free backpropagation, a method that is both easy to use and efficient.
Motivation: Automatic differentiation incurs a significant computational burden when the number of operations is large, while implicit differentiation for iterative algorithms requires custom implementation of Jacobian evaluation. A method as simple as automatic differentiation yet as performant as implicit differentiation is therefore desirable.
Method: One-step differentiation is studied for fast algorithms (e.g., superlinear optimization methods), with a complete theoretical approximation analysis on concrete examples (Newton's method, gradient descent) along with its consequences in bilevel optimization.
Results: Numerical examples illustrate the soundness of the one-step estimator.

In appropriate frameworks, automatic differentiation is transparent to the user, at the cost of being a significant computational burden when the number of operations is large. For iterative algorithms, implicit differentiation alleviates this issue but requires custom implementation of Jacobian evaluation. In this paper, we study one-step differentiation, also known as Jacobian-free backpropagation, a method as easy as automatic differentiation and as performant as implicit differentiation for fast algorithms (e.g. superlinear optimization methods). We provide a complete theoretical approximation analysis with specific examples (Newton's method, gradient descent) along with its consequences in bilevel optimization. Several numerical examples illustrate the well-foundedness of the one-step estimator.
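
A one-dimensional illustration in the superlinear (Newton) setting, where the one-step estimator is exact at the fixed point: only the last iteration is differentiated, with its input treated as a constant. The inner objective below is an illustrative assumption.

```python
import math

# Inner problem: x*(theta) = argmin_x f(x, theta), f(x, theta) = exp(x) - theta*x,
# so x*(theta) = log(theta) and the true sensitivity is dx*/dtheta = 1/theta.

def newton_step(x, theta):
    # One Newton step on f'(x) = exp(x) - theta (with f''(x) = exp(x))
    return x - (math.exp(x) - theta) / math.exp(x)

def solve(theta, iters=20):
    x = 0.0
    for _ in range(iters):
        x = newton_step(x, theta)
    return x

theta = 2.0
x_star = solve(theta)

# One-step differentiation: differentiate only the LAST iteration, treating
# its input x_star as a constant. Here d(newton_step)/d(theta) = exp(-x).
one_step_grad = math.exp(-x_star)

print("x* =", x_star, " (log theta =", math.log(theta), ")")
print("one-step derivative:", one_step_grad, " true derivative:", 1 / theta)
```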

Efficient Online Clustering with Moving Costs
Dimitris Christou EFSTRATIOS PANTELEIMON SKOULAKIS Volkan Cevher



Research question: Online $k$-Clustering with Moving Costs, an online learning problem.
Motivation: A learner maintains a set of $k$ facilities over $T$ rounds so as to minimize the connection cost of an adversarially selected sequence of clients. The learner observes the clients' positions at round $t$ only after its facility selection and can use this information in the next round, but updating the facility positions incurs an additional moving cost based on the distance moved.
Method: The first polynomial-time online learning algorithm guaranteeing that the overall cost (connection plus moving) is at most $\mathcal{O}(\log n)$ times the time-averaged connection cost of the best fixed solution, i.e., an $\mathcal{O}(\log n)$-regret guarantee.
Results: This improves on the recent result of (Fotakis et al., 2021), which established $\mathcal{O}(k)$-regret guarantees only on the connection cost.

In this work we consider an online learning problem, called Online $k$-Clustering with Moving Costs, at which a learner maintains a set of $k$ facilities over $T$ rounds so as to minimize the connection cost of an adversarially selected sequence of clients. The learner is informed on the positions of the clients at each round $t$ only after its facility-selection and can use this information to update its decision in the next round. However, updating the facility positions comes with an additional moving cost based on the moving distance of the facilities. We present the first $\mathcal{O}(\log n)$-regret polynomial-time online learning algorithm guaranteeing that the overall cost (connection $+$ moving) is at most $\mathcal{O}(\log n)$ times the time-averaged connection cost of the best fixed solution. Our work improves on the recent result of (Fotakis et al., 2021) establishing $\mathcal{O}(k)$-regret guarantees only on the connection cost.

Smoothed Online Learning for Prediction in Piecewise Affine Systems
Adam Block Max Simchowitz Russ Tedrake



Research question: Piecewise affine (PWA) regression and planning are of foundational importance to online learning, control, and robotics, providing a theoretically and empirically tractable setting for studying systems undergoing sharp changes in dynamics.
Motivation: Due to the discontinuities that arise when crossing into different "pieces," learning in general sequential settings is impossible, and practical algorithms are forced to resort to heuristics.
Method: Building on the recently developed smoothed online learning framework, the paper gives the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; the algorithms are efficient in the number of calls to an optimization oracle.
Results: The results are applied to one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where regret is measured by the Wasserstein distance between simulated and true data, and several technical tools of more general interest are developed along the way.

The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.

Best Arm Identification with Fixed Budget: A Large Deviation Perspective
Po-An Wang Ruo-Chun Tzeng Alexandre Proutiere



Research question: Identifying the best arm in stochastic multi-armed bandits (MABs) using a fixed sampling budget; characterizing the minimal instance-specific error probability remains one of the important open problems in MABs.
Motivation: For static sampling strategies, the error probability decays exponentially at a rate that can be derived explicitly via large deviation techniques, but analyzing algorithms with adaptive sampling strategies is much more challenging.
Method: The paper establishes a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards, which holds for any adaptive algorithm; this is leveraged to improve the error probability upper bounds of existing algorithms such as Successive Rejects (SR), and to devise and analyze Continuous Rejects (CR), a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps.
Results: CR is proved to enjoy better performance guarantees than existing algorithms, including SR, and extensive numerical experiments confirm this observation.

We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is however much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated SR (Successive Rejects) algorithm \cite{audibert2010best}, and (ii) to devise and analyze new algorithms. In particular, we present CR (Continuous Rejects), a truly adaptive algorithm that can reject arms in {\it any} round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that CR enjoys better performance guarantees than existing algorithms, including SR. Extensive numerical experiments confirm this observation.
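
For concreteness, here is the SR baseline that the paper's large-deviation analysis sharpens (CR replaces SR's phase schedule with a per-round rejection test, which is not shown); Gaussian rewards and the budget below are illustrative.

```python
import numpy as np

def successive_rejects(means, budget, seed=0):
    """Successive Rejects (Audibert et al., 2010) for fixed-budget best arm
    identification: K-1 phases, rejecting the empirically worst arm each phase."""
    rng = np.random.default_rng(seed)
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    counts, sums = np.zeros(K), np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        n_k = int(np.ceil((budget - K) / (log_bar * (K + 1 - k))))
        for arm in active:                        # equal allocation this phase
            for _ in range(n_k - n_prev):
                sums[arm] += means[arm] + rng.standard_normal()
                counts[arm] += 1
        n_prev = n_k
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)                      # reject the empirically worst arm
    return active[0]

means = [0.6, 0.5, 0.45, 0.4, 0.3]
hits = sum(successive_rejects(means, budget=3000, seed=s) == 0 for s in range(100))
print("correct identifications out of 100 runs:", hits)
```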

Agnostically Learning Single-Index Models using Omnipredictors
Aravind Gollakota Parikshit Gopalan Adam Klivans Konstantinos Stavropoulos



Research question: Agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations, without prior knowledge of the activation.
Motivation: Prior work either held only in the realizable setting or required the activation to be known. Moreover, only bounded second moments of the marginal are required here, whereas all prior work needed stronger distributional assumptions (such as anticoncentration or boundedness).
Method: The algorithm is based on work by Gopalan et al. [2023] on omniprediction using predictors satisfying calibrated multiaccuracy; the analysis is simple and relies on the relationship between Bregman divergences (or matching losses) and $\ell_p$ distances.
Results: New guarantees are also provided for standard algorithms such as GLMtron and logistic regression in the agnostic setting.

We give the first result for agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. All prior work either held only in the realizable setting or required the activation to be known. Moreover, we only require the marginal to have bounded second moments, whereas all prior work required stronger distributional assumptions (such as anticoncentration or boundedness). Our algorithm is based on recent work by Gopalan et al. [2023] on Omniprediction using predictors satisfying calibrated multiaccuracy. Our analysis is simple and relies on the relationship between Bregman divergences (or matching losses) and $\ell_p$ distances. We also provide new guarantees for standard algorithms like GLMtron and logistic regression in the agnostic setting.
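
GLMtron, one of the standard algorithms the paper re-analyzes, is a parameter-free iteration; below is a minimal version on synthetic single-index data with a logistic activation (the data scaling and noise level are illustrative assumptions).

```python
import numpy as np

def glmtron(X, y, sigma, T=500):
    """GLMtron (Kakade et al., 2011): parameter-free updates
    w <- w - (1/n) * sum_i (sigma(w . x_i) - y_i) * x_i
    for learning a single-index model y ~ sigma(w* . x)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        w -= (X.T @ (sigma(X @ w) - y)) / n
    return w

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.standard_normal((n, d)) / np.sqrt(d)      # roughly unit-norm features
w_star = rng.standard_normal(d)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # monotone, Lipschitz activation
y = sigma(X @ w_star) + 0.05 * rng.standard_normal(n)   # agnostic-style noise

w = glmtron(X, y, sigma)
err = np.mean((sigma(X @ w) - sigma(X @ w_star)) ** 2)
print("squared prediction error:", round(float(err), 5))
```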

When Can We Track Significant Preference Shifts in Dueling Bandits?
Joe Suk Arpit Agarwal



Research question: Designing an adaptive algorithm for the $K$-armed dueling bandits problem with distribution shifts, when user preferences/tastes change over time.
Motivation: Since user preferences can evolve over time, the question is whether one can design an adaptive algorithm for the dueling problem with $O(\sqrt{K\tilde{L}T})$ dynamic regret, where $\tilde{L}$ is the (unknown) number of significant shifts in preferences.
Method: The paper studies the recent notion of significant shifts (Suk and Kpotufe, 2022) and shows that the answer depends on the properties of the underlying preference distributions.
Results: First, an impossibility result rules out any algorithm with $O(\sqrt{K\tilde{L}T})$ dynamic regret under the well-studied Condorcet and SST classes of preference distributions. Second, $\text{SST}\cap \text{STI}$ is shown to be the largest among popular classes of preference distributions for which such an algorithm is possible. Overall, this provides an almost complete resolution of the question across the hierarchy of distribution classes.

The $K$-armed dueling bandits problem, where the feedback is in the form of noisy pairwise preferences, has been widely studied due to its applications in information retrieval, recommendation systems, etc. Motivated by concerns that user preferences/tastes can evolve over time, we consider the problem of _dueling bandits with distribution shifts_. Specifically, we study the recent notion of _significant shifts_ (Suk and Kpotufe, 2022), and ask whether one can design an _adaptive_ algorithm for the dueling problem with $O(\sqrt{K\tilde{L}T})$ dynamic regret, where $\tilde{L}$ is the (unknown) number of significant shifts in preferences. We show that the answer to this question depends on the properties of underlying preference distributions. Firstly, we give an impossibility result that rules out any algorithm with $O(\sqrt{K\tilde{L}T})$ dynamic regret under the well-studied Condorcet and SST classes of preference distributions. Secondly, we show that $\text{SST}\cap \text{STI}$ is the largest amongst popular classes of preference distributions where it is possible to design such an algorithm. Overall, our results provide an almost complete resolution of the above question for the hierarchy of distribution classes.

A Unifying Perspective on Multi-Calibration: Game Dynamics for Multi-Objective Learning
Nika Haghtalab Michael Jordan Eric Zhao



Research question: A unifying framework for the design and analysis of multi-calibrated predictors.
Motivation: By placing multi-calibration in the general setting of multi-objective learning, where learning guarantees must hold simultaneously over a set of distributions and loss functions, connections to game dynamics can be exploited to achieve state-of-the-art guarantees for a diverse set of multi-calibration learning problems.
Method: Multi-calibration is formulated as a multi-objective learning problem and analyzed through game dynamics, which both sheds light on existing guarantees and greatly simplifies their analysis.
Results: The approach yields improved guarantees, such as error tolerances that scale with the square root of group size rather than the constant tolerances of prior works, and an exponential-in-$k$ improvement in the complexity of $k$-class multi-calibration; the same game dynamics also address emerging questions in group fairness and multi-distribution learning.

We provide a unifying framework for the design and analysis of multi-calibrated predictors. By placing the multi-calibration problem in the general setting of multi-objective learning---where learning guarantees must hold simultaneously over a set of distributions and loss functions---we exploit connections to game dynamics to achieve state-of-the-art guarantees for a diverse set of multi-calibration learning problems. In addition to shedding light on existing multi-calibration guarantees and greatly simplifying their analysis, our approach also yields improved guarantees, such as error tolerances that scale with the square-root of group size versus the constant tolerances guaranteed by prior works, and improving the complexity of $k$-class multi-calibration by an exponential factor of $k$ versus Gopalan et al. Beyond multi-calibration, we use these game dynamics to address emerging considerations in the study of group fairness and multi-distribution learning.

Data-driven Optimal Filtering for Linear Systems with Unknown Noise Covariances
Shahriar Talebi Amirhossein Taghvaei Mehran Mesbahi



Research question: Learning the optimal filtering policy, known as the Kalman gain, for a linear system with unknown noise covariance matrices, using noisy output data.
Motivation: The learning problem is formulated as a stochastic policy optimization problem minimizing the output prediction error, providing a direct bridge between data-driven optimal control and its dual, optimal filtering.
Method: A stochastic gradient descent algorithm is adopted for the filtering problem, with a thorough convergence analysis accounting for biased gradients and stability constraints; tools from linear system theory and high-dimensional statistics are carefully combined to derive the error bounds.
Results: Bias-variance error bounds that scale logarithmically with the problem dimension; in contrast to subspace methods, the length of the output trajectories affects only the bias term.

This paper examines learning the optimal filtering policy, known as the Kalman gain, for a linear system with unknown noise covariance matrices using noisy output data. The learning problem is formulated as a stochastic policy optimization problem, aiming to minimize the output prediction error. This formulation provides a direct bridge between data-driven optimal control and, its dual, optimal filtering. Our contributions are twofold. Firstly, we conduct a thorough convergence analysis of the stochastic gradient descent algorithm, adopted for the filtering problem, accounting for biased gradients and stability constraints. Secondly, we carefully leverage a combination of tools from linear system theory and high-dimensional statistics to derive bias-variance error bounds that scale logarithmically with problem dimension, and, in contrast to subspace methods, the length of output trajectories only affects the bias term.

Tracking Most Significant Shifts in Nonparametric Contextual Bandits
Joe Suk Samory Kpotufe



Research question: Nonparametric contextual bandits in which the Lipschitz mean reward functions may change over time.
Motivation: In this less understood setting, the paper first establishes the minimax dynamic regret rate in terms of the number of changes $L$ and the total variation $V$, both of which capture all changes in distribution over the context space, and argues that state-of-the-art procedures are suboptimal in this setting.
Method: The paper then addresses adaptivity, i.e., achieving the minimax rate without knowledge of $L$ or $V$. Positing that the bandit problem, viewed locally at a given context $X_t$, should not be affected by reward changes in other parts of the context space, the paper proposes a notion of "experienced significant shifts" that better accounts for locality and thus counts considerably fewer changes than $L$ and $V$. As in recent work on non-stationary MABs (Suk & Kpotufe, 2022), experienced significant shifts only count the most significant changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts.
Results: The main result shows that this more tolerant notion of change can in fact be adapted to.

We study nonparametric contextual bandits where Lipschitz mean reward functions may change over time. We first establish the minimax dynamic regret rate in this less understood setting in terms of number of changes $L$ and total-variation $V$, both capturing all changes in distribution over context space, and argue that state-of-the-art procedures are suboptimal in this setting. Next, we turn to the question of _adaptivity_ for this setting, i.e., achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly, we posit that the bandit problem, viewed locally at a given context $X_t$, should not be affected by reward changes in other parts of context space $\cal X$. We therefore propose a notion of _change_, which we term _experienced significant shifts_, that better accounts for locality, and thus counts considerably fewer changes than $L$ and $V$. Furthermore, similar to recent work on non-stationary MAB (Suk & Kpotufe, 2022), _experienced significant shifts_ only count the most _significant_ changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts. Our main result is to show that this more tolerant notion of change can in fact be adapted to.

SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions
Ilias Diakonikolas Daniel Kane Lisheng Ren Yuxin Sun



Research question: This paper studies the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model.
Motivation: Prior work developed a methodology for proving SQ lower bounds for NGCA that has been applied in a wide range of contexts, but it required both a moment-matching condition and a chi-squared condition.
Method: We prove that the moment-matching condition alone suffices to obtain near-optimal SQ lower bounds for NGCA.
Results: Our work shows that the moment-matching condition is necessary for hardness, while the chi-squared condition is not.

We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a methodology to prove SQ lower bounds for NGCA that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and a distribution that behaves like $A$ in a random hidden direction and like a standard Gaussian in the orthogonal complement, is SQ-hard. The required conditions were that (1) $A$ matches many low-order moments with a standard Gaussian, and (2) the chi-squared norm of $A$ with respect to the standard Gaussian is finite. While the moment-matching condition is clearly necessary for hardness, the chi-squared condition was only required for technical reasons. In this work, we establish that the latter condition is indeed not necessary. In particular, we prove near-optimal SQ lower bounds for NGCA under the moment-matching condition only.

Fair, Polylog-Approximate Low-Cost Hierarchical Clustering
Marina Knittel Max Springer John P Dickerson MohammadTaghi Hajiaghayi



Research question: Designing low-cost fair hierarchical clustering algorithms with polylogarithmic approximation guarantees.
Motivation: Research in fair machine learning, and clustering in particular, has become crucial given the many ethical controversies raised by modern intelligent systems; prior fair hierarchical clustering algorithms were either highly theoretical or unable to break a polynomial-approximation barrier.
Method: We break this barrier and propose the first truly polylogarithmic-approximate low-cost fair hierarchical clustering.
Results: This greatly bridges the gap between the best fair and vanilla hierarchical clustering approximations.

Research in fair machine learning, and particularly clustering, has been crucial in recent years given the many ethical controversies that modern intelligent systems have posed. Ahmadian et al. [2020] established the study of fairness in hierarchical clustering, a stronger, more structured variant of its well-known flat counterpart, though their proposed algorithm that optimizes for Dasgupta's [2016] famous cost function was highly theoretical. Knittel et al. [2023] then proposed the first practical fair approximation for cost; however, they were unable to break the polynomial-approximate barrier they posed as a hurdle of interest. We break this barrier, proposing the first truly polylogarithmic-approximate low-cost fair hierarchical clustering, thus greatly bridging the gap between the best fair and vanilla hierarchical clustering approximations.

Minimax-Optimal Location Estimation
Shivam Gupta Jasper C.H. Lee Eric Price Paul Valiant



Research question: How can an unknown shift parameter $\mu$ be estimated to high accuracy from a finite number of samples?
Motivation: Location estimation is one of the most basic problems in parametric statistics. Although the maximum likelihood estimator (MLE) is asymptotically optimal as the number of samples tends to infinity, its finite-sample performance remains to be understood.
Method: This paper gives two location estimators optimal under different criteria: 1) an estimator that minimizes the maximum estimation error subject to succeeding with probability $1-\delta$, and 2) a confidence interval estimator that, among all shift-invariant estimators whose output interval contains $\mu$ with probability at least $1-\delta$, has the minimum expected squared interval width.
Results: The latter construction generalizes to minimizing the expectation of any loss function of the interval width.

Location estimation is one of the most basic questions in parametric statistics. Suppose we have a known distribution density $f$, and we get $n$ i.i.d. samples from $f(x-\mu)$ for some unknown shift $\mu$. The task is to estimate $\mu$ to high accuracy with high probability. The maximum likelihood estimator (MLE) is known to be asymptotically optimal as $n \to \infty$, but what is possible for finite $n$? In this paper, we give two location estimators that are optimal under different criteria: 1) an estimator that has minimax-optimal estimation error subject to succeeding with probability $1-\delta$ and 2) a confidence interval estimator which, subject to its output interval containing $\mu$ with probability at least $1-\delta$, has the minimum expected squared interval width among all shift-invariant estimators. The latter construction can be generalized to minimizing the expectation of any loss function on the interval width.
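
For orientation, a minimal numerical sketch of the classical MLE baseline that the finite-$n$ question is posed against (the Laplace density and the optimizer choice here are illustrative assumptions, not the paper's estimators):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Known density f (standard Laplace here, an illustrative choice);
# samples are drawn from f(x - mu) with unknown shift mu.
rng = np.random.default_rng(0)
mu_true = 1.7
samples = rng.laplace(loc=mu_true, scale=1.0, size=200)

def neg_log_likelihood(mu):
    # -sum_i log f(x_i - mu) for the standard Laplace density f(t) = exp(-|t|)/2
    return np.sum(np.abs(samples - mu)) + samples.size * np.log(2.0)

res = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded")
print(f"MLE of mu: {res.x:.3f} (true value: {mu_true})")
```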

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation
Nikki Lijing Kuang Ming Yin Mengdi Wang Yu-Xiang Wang Yian Ma



Research question: This paper addresses the performance degradation caused by delayed feedback in reinforcement learning.
Motivation: Existing RL algorithms typically rely on immediate feedback and ignore delays in observations, which can severely degrade performance in real-world systems.
Method: Two algorithms are proposed, Delayed-PSVI and Delayed-LPSVI, which handle delayed feedback via posterior sampling and via a gradient-based approximate sampling scheme, respectively.
Results: Empirical evaluations show that both algorithms are statistically and computationally efficient and cope effectively with delayed feedback.

Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce \textit{Delayed-PSVI}, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 \mathbb{E}[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $\mathbb{E}[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for \textit{Delayed-LPSVI}, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.

Projection-Free Methods for Solving Nonconvex-Concave Saddle Point Problems
Morteza Boroun Erfan Yazdandoost Hamedani Afrooz Jalilzadeh



Research question: This paper studies a class of constrained saddle point problems whose objective is nonconvex-concave and smooth, a class with wide applicability in machine learning.
Motivation: Several projection-based primal-dual methods exist for this problem, but projection-free methods have been lacking.
Method: Efficient single-loop projection-free methods relying on first-order information are proposed. In particular, using regularization and nested approximation techniques, a primal-dual conditional gradient method is devised that handles constraints using only linear minimization oracles.
Results: When the constraint set of the maximization problem is strongly convex, the method attains an $\epsilon$-stationary solution within $\mathcal{O}(\epsilon^{-6})$ iterations; when projection onto that set is easy to compute, a one-sided projection-free method attains an $\epsilon$-stationary solution within $\mathcal{O}(\epsilon^{-4})$ iterations. Improved iteration complexities are also shown under a strong concavity assumption. To our knowledge, these are among the first projection-free methods with convergence guarantees for nonconvex-concave SP problems.

In this paper, we investigate a class of constrained saddle point (SP) problems where the objective function is nonconvex-concave and smooth. This class of problems has wide applicability in machine learning, including robust multi-class classification and dictionary learning. Several projection-based primal-dual methods have been developed to tackle this problem; however, the availability of methods with projection-free oracles remains limited. To address this gap, we propose efficient single-loop projection-free methods reliant on first-order information. In particular, using regularization and nested approximation techniques, we propose a primal-dual conditional gradient method that solely employs linear minimization oracles to handle constraints. Assuming that the constraint set in the maximization is strongly convex, our method achieves an $\epsilon$-stationary solution within $\mathcal{O}(\epsilon^{-6})$ iterations. When the projection onto the constraint set of maximization is easy to compute, we propose a one-sided projection-free method that achieves an $\epsilon$-stationary solution within $\mathcal{O}(\epsilon^{-4})$ iterations. Moreover, we present improved iteration complexities of our methods under a strong concavity assumption. To the best of our knowledge, our proposed algorithms are among the first projection-free methods with convergence guarantees for solving nonconvex-concave SP problems.

Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Hao Qin Kwang-Sung Jun Chicheng Zhang



Research question: This paper studies $K$-armed bandit problems in which the reward distributions of all arms are supported on the $[0,1]$ interval.
Motivation: Maillard sampling is an attractive alternative to Thompson sampling that has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting while maintaining closed-form action probabilities, which is useful for offline policy evaluation.
Method: We analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling and a special case of Minimum Empirical Divergence (MED), and establish a KL-style finite-time gap-dependent regret bound.
Results: KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon length; this is the first time such adaptivity is reported in the literature for an algorithm with asymptotic optimality guarantees.

We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. Maillard sampling\cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting\cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling {and a special case of Minimum Empirical Divergence (MED)~\cite{honda2011asymptotically}} for achieving a KL-style finite-time gap-dependent regret bound. We show that KL-MS enjoys the asymptotic optimality when the rewards are Bernoulli and has an {adaptive} worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm, and $T$ is the time horizon length; {this is the first time such adaptivity is reported in the literature for an algorithm with asymptotic optimality guarantees.}
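
To make the closed-form action probabilities concrete, here is a minimal sketch of the Maillard-sampling-style rule as we read it from the abstract, with a Bernoulli KL plugged in; the exploration details of the actual KL-MS algorithm may differ.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def maillard_kl_probs(means, counts):
    """Maillard-sampling-style probabilities: arm a is played with probability
    proportional to exp(-N_a * KL(mu_hat_a, mu_hat_max))."""
    logits = -counts * bernoulli_kl(means, np.max(means))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# toy example: three arms with empirical means and pull counts
print(maillard_kl_probs(np.array([0.40, 0.55, 0.60]), np.array([50, 40, 60])))
```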

No-Regret Online Prediction with Strategic Experts
Omid Sadeghi Maryam Fazel



Research question: This paper studies a generalization of online binary prediction with expert advice in which, at each round, the learner may pick $m$ experts from a pool of $K$, and the overall utility is a modular or submodular function of the chosen experts.
Motivation: We focus on the setting in which experts act strategically and may misreport their beliefs about events to maximize their influence on the algorithm's predictions. This setting finds applications in forecasting competitions, where the learner seeks not only to make predictions by aggregating different forecasters but also to rank them according to their relative performance.
Method: Our goal is to design algorithms satisfying two requirements: 1) incentive-compatible, incentivizing experts to report their beliefs truthfully, and 2) no-regret, achieving sublinear regret with respect to the true beliefs of the best fixed set of $m$ experts in hindsight.
Results: We first show that a simple reduction of our problem to the $m=1$ case is neither efficient nor effective; we then provide algorithms that exploit the specific structure of the utility functions to achieve both goals.

We study a generalization of the online binary prediction with expert advice framework where at each round, the learner is allowed to pick $m\geq 1$ experts from a pool of $K$ experts and the overall utility is a modular or submodular function of the chosen experts. We focus on the setting in which experts act strategically and aim to maximize their influence on the algorithm's predictions by potentially misreporting their beliefs about the events. Among others, this setting finds applications in forecasting competitions where the learner seeks not only to make predictions by aggregating different forecasters but also to rank them according to their relative performance. Our goal is to design algorithms that satisfy the following two requirements: 1) \emph{Incentive-compatible}: Incentivize the experts to report their beliefs truthfully, and 2) \emph{No-regret}: Achieve sublinear regret with respect to the true beliefs of the best fixed set of $m$ experts in hindsight. Prior works have studied this framework when $m=1$ and provided incentive-compatible no-regret algorithms for the problem. We first show that a simple reduction of our problem to the $m=1$ setting is neither efficient nor effective. Then, we provide algorithms that utilize the specific structure of the utility functions to achieve the two desired goals.

$H$-Consistency Bounds: Characterization and Extensions
Anqi Mao Mehryar Mohri Yutao Zhong



Research question: This paper aims to develop more general tools and characterizations for surrogate loss functions.
Motivation: A recent series of publications by Awasthi et al. introduced the key notion of *$H$-consistency bounds*, which upper-bound the zero-one estimation error of any predictor in a hypothesis set in terms of its surrogate loss estimation error. However, determining whether these bounds hold and deriving them has required a specific proof and analysis for each surrogate loss. Can more general tools and characterizations be derived?
Method: This paper provides both a general characterization and an extension of $H$-consistency bounds for multi-class classification. We present new, tight $H$-consistency bounds for the family of constrained losses and for the family of comp-sum losses, which includes the cross-entropy, or logistic loss, applied to the outputs of a neural network.
Results: We further extend the analysis beyond the completeness assumptions adopted in previous studies to cover more realistic bounded hypothesis sets. Our characterizations are based on explicitly defined error transformations, and we illustrate the general results through several special examples. A by-product of the analysis is the observation that a recently derived multi-class $H$-consistency bound for cross-entropy reduces to an excess bound and is not significant; instead, we prove a much stronger and more significant guarantee.

A series of recent publications by Awasthi et al. have introduced the key notion of *$H$-consistency bounds* for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining if they hold and deriving these bounds have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations? This paper provides both a general characterization and an extension of $H$-consistency bounds for multi-class classification. We present new and tight $H$-consistency bounds for both the family of constrained losses and that of comp-sum losses, which covers the familiar cross-entropy, or logistic loss applied to the outputs of a neural network. We further extend our analysis beyond the completeness assumptions adopted in previous studies and cover more realistic bounded hypothesis sets. Our characterizations are based on error transformations, which are explicitly defined for each formulation. We illustrate the application of our general results through several special examples. A by-product of our analysis is the observation that a recently derived multi-class $H$-consistency bound for cross-entropy reduces to an excess bound and is not significant. Instead, we prove a much stronger and more significant guarantee.

A Trichotomy for Transductive Online Learning
Steve Hanneke Shay Moran Jonathan Shafer



Research question: This paper determines upper and lower bounds on the number of learner mistakes in online learning.
Motivation: In the "transductive" online learning setting of Ben-David, Kushilevitz and Mansour (1997), the adversary fixes a sequence of instances $x_1,\dots,x_n$ at the start of the game and the learner knows this sequence; the setting is otherwise similar to standard online learning.
Method: We prove a trichotomy: as $n$ grows, the minimal number of mistakes made by the learner can take only one of three possible values: $n$, $\Theta\left(\log(n)\right)$, or $\Theta(1)$. Moreover, this behavior is determined by a combination of the VC dimension and the Littlestone dimension.
Results: We show a variety of bounds relating the number of mistakes to well-known combinatorial dimensions. In particular, we improve the known lower bound on the constant in the $\Theta(1)$ case from $\Omega\left(\sqrt{\log(d)}\right)$ to $\Omega(\log(d))$, where $d$ is the Littlestone dimension. Finally, we extend the results to multiclass classification and the agnostic setting.

We present new upper and lower bounds on the number of learner mistakes in the `transductive' online learning setting of Ben-David, Kushilevitz and Mansour (1997). This setting is similar to standard online learning, except that the adversary fixes a sequence of instances $x_1,\dots,x_n$ to be labeled at the start of the game, and this sequence is known to the learner. Qualitatively, we prove a \emph{trichotomy}, stating that the minimal number of mistakes made by the learner as $n$ grows can take only one of precisely three possible values: $n$, $\Theta\left(\log (n)\right)$, or $\Theta(1)$. Furthermore, this behavior is determined by a combination of the VC dimension and the Littlestone dimension. Quantitatively, we show a variety of bounds relating the number of mistakes to well-known combinatorial dimensions. In particular, we improve the known lower bound on the constant in the $\Theta(1)$ case from $\Omega\left(\sqrt{\log(d)}\right)$ to $\Omega(\log(d))$ where $d$ is the Littlestone dimension. Finally, we extend our results to cover multiclass classification and the agnostic setting.

Structured Prediction with Stronger Consistency Guarantees
Anqi Mao Mehryar Mohri Yutao Zhong



Research question: This paper presents an extensive study of surrogate losses for structured prediction, supported by *$H$-consistency bounds*.
Motivation: Recent work shows that $H$-consistency bounds are more relevant to learning than Bayes-consistency, since they are not asymptotic and take into account the hypothesis set $H$ used.
Method: We first show that no non-trivial $H$-consistency bound can be derived for widely used surrogate structured prediction losses. We then define several new families of surrogate losses, including *structured comp-sum losses* and *structured constrained losses*, and prove $H$-consistency bounds and Bayes-consistency for them. Minimizing these loss functions readily yields new structured prediction algorithms with stronger theoretical guarantees.
Results: We describe efficient algorithms for minimizing several of these surrogate losses, including a new *structured logistic loss*.

We present an extensive study of surrogate losses for structured prediction supported by *$H$-consistency bounds*. These are recently introduced guarantees that are more relevant to learning than Bayes-consistency, since they are not asymptotic and since they take into account the hypothesis set $H$ used. We first show that no non-trivial $H$-consistency bound can be derived for widely used surrogate structured prediction losses. We then define several new families of surrogate losses, including *structured comp-sum losses* and *structured constrained losses*, for which we prove $H$-consistency bounds and thus Bayes-consistency. These loss functions readily lead to new structured prediction algorithms with stronger theoretical guarantees, based on their minimization. We describe efficient algorithms for minimizing several of these surrogate losses, including a new *structured logistic loss*.

Advice Querying under Budget Constraint for Online Algorithms
Ziyad Benomar Vianney Perchet



Research question: This paper studies how to query and use predictions most effectively when only a limited number of them are available.
Motivation: Most existing work assumes that the algorithm has unrestricted access to predictions provided as input, whereas in practice the number of predictions is limited.
Method: We study three classical problems in competitive analysis, the ski rental problem, the secretary problem, and non-clairvoyant job scheduling, and address when to query predictions and how to use them.
Results: We obtain effective strategies that make the most of a limited number of predictions.

Several problems have been extensively studied in the learning-augmented setting, where the algorithm has access to some, possibly incorrect, predictions. However, it is assumed in most works that the predictions are provided to the algorithm as input, with no constraint on their size. In this paper, we consider algorithms with access to a limited number of predictions, that they can request at any time during their execution. We study three classical problems in competitive analysis, the ski rental problem, the secretary problem, and the non-clairvoyant job scheduling. We address the question of when to query predictions and how to use them.
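
To see the flavor of the question, here is a toy sketch of ski rental with a single optional prediction; the threshold strategy below is a stylized illustration, not the algorithm analyzed in the paper.

```python
def ski_rental_cost(buy_cost, true_days, prediction=None):
    """Toy ski rental: rent at 1/day or buy once for buy_cost.
    Without advice, the classical break-even rule buys on day buy_cost.
    With a (possibly wrong) predicted season length, trust it once."""
    if prediction is None:
        # classical 2-competitive rule: rent buy_cost - 1 days, then buy
        if true_days < buy_cost:
            return true_days
        return (buy_cost - 1) + buy_cost
    if prediction >= buy_cost:
        return buy_cost        # advice says a long season: buy on day one
    return true_days           # advice says a short season: keep renting

# querying the prediction pays off when it is accurate
print(ski_rental_cost(buy_cost=10, true_days=30))                 # 19
print(ski_rental_cost(buy_cost=10, true_days=30, prediction=30))  # 10
```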

Experiment Planning with Function Approximation
Aldo Pacchiano Jonathan Lee Emma Brunskill



Research question: This paper studies experiment planning with function approximation in contextual bandit problems.
Motivation: In some settings, deploying adaptive algorithms carries a significant overhead, for example when the execution of data collection policies must be distributed or a human in the loop is needed to implement them; producing a set of data collection policies in advance is then paramount.
Method: We propose two experiment planning strategies compatible with function approximation: an eluder planning and sampling procedure that recovers optimality guarantees depending on the eluder dimension of the reward function class, and a uniform sampler that achieves competitive optimality rates when the number of actions is small.
Results: We introduce a statistical gap that fleshes out the fundamental differences between planning and adaptive learning, and we provide results for planning with model selection.

We study the problem of experiment planning with function approximation in contextual bandit problems. In settings where there is a significant overhead to deploying adaptive algorithms, for example when the execution of the data collection policies is required to be distributed or a human in the loop is needed to implement these policies, producing in advance a set of policies for data collection is paramount. We study the setting where a large dataset of contexts (but not rewards) is available and may be used by the learner to design an effective data collection strategy. Although this problem has been well studied when rewards are linear, results are still missing for more complex reward models. In this work we propose two experiment planning strategies compatible with function approximation: first, an eluder planning and sampling procedure that can recover optimality guarantees depending on the eluder dimension of the reward function class; and second, we show that the uniform sampler achieves competitive optimality rates in the setting where the number of actions is small. We conclude by introducing a statistical gap fleshing out the fundamental differences between planning and adaptive learning and by providing results for planning with model selection.

Rethinking Gauss-Newton for learning over-parameterized models
Michael Arbel Romain Menegaux Pierre Wolinski



Research question: This study examines the global convergence and implicit bias of the Gauss-Newton method when optimizing over-parameterized one-hidden-layer networks.
Motivation: Although Gauss-Newton finds a global optimum faster than stochastic gradient descent, it remains unclear whether the models it learns generalize well on test data.
Method: An empirical study on a synthetic regression task, using a small step size to slow down convergence and starting from random initial weights with small variance.
Results: This setting exhibits a hidden learning phenomenon: even when training and test performance are poor due to an under-optimized linear layer, the dynamics still recover features with good generalization properties. The study reveals a trade-off between the convergence speed of Gauss-Newton and the generalization ability of the learned solution.

This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of the GN method. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
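
For readers who want the mechanics, a generic damped Gauss-Newton step for nonlinear least squares is sketched below with a finite-difference Jacobian; this is a textbook illustration under simplifying assumptions, not the paper's mean-field analysis.

```python
import numpy as np

def gauss_newton_step(residual_fn, theta, damping=1e-6, h=1e-6):
    """One damped Gauss-Newton step for min_theta ||r(theta)||^2, with the
    Jacobian J of r approximated by forward finite differences."""
    r = residual_fn(theta)
    J = np.empty((r.size, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        J[:, j] = (residual_fn(theta + e) - r) / h
    # solve (J^T J + damping * I) delta = J^T r, then update theta
    delta = np.linalg.solve(J.T @ J + damping * np.eye(theta.size), J.T @ r)
    return theta - delta

# toy regression: fit y = a * tanh(b * x) starting from (1, 1)
x = np.linspace(-2, 2, 50)
y = 1.5 * np.tanh(0.8 * x)
residual = lambda th: th[0] * np.tanh(th[1] * x) - y
theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = gauss_newton_step(residual, theta)
print(theta)  # approaches [1.5, 0.8]
```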

Recovering Simultaneously Structured Data via Non-Convex Iteratively Reweighted Least Squares
Christian Kümmerle Johannes Maly



Research question: How to recover data that adheres to multiple, heterogeneous low-dimensional structures from linear observations.
Motivation: For data matrices that are simultaneously row-sparse and low-rank, propose a new algorithm able to exploit both structures fully.
Method: An iteratively reweighted least squares (IRLS) algorithm is proposed that optimizes a combination of non-convex surrogates for row-sparsity and rank, with a balancing of the two built into the algorithm.
Results: Experiments show that the IRLS method exhibits favorable empirical convergence at low sample complexity, identifying simultaneously row-sparse and low-rank matrices from fewer measurements than state-of-the-art methods.

We propose a new algorithm for the problem of recovering data that adheres to multiple, heterogeneous low-dimensional structures from linear observations. Focusing on data matrices that are simultaneously row-sparse and low-rank, we propose and analyze an iteratively reweighted least squares (IRLS) algorithm that is able to leverage both structures. In particular, it optimizes a combination of non-convex surrogates for row-sparsity and rank, a balancing of which is built into the algorithm. We prove locally quadratic convergence of the iterates to a simultaneously structured data matrix in a regime of minimal sample complexity (up to constants and a logarithmic factor), which is known to be impossible for a combination of convex surrogates. In experiments, we show that the IRLS method exhibits favorable empirical convergence, identifying simultaneously row-sparse and low-rank matrices from fewer measurements than state-of-the-art methods.
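
To illustrate the reweighting principle the algorithm builds on, here is a minimal IRLS sketch for plain sparse recovery (an $\ell_1$-type surrogate only; the paper's method couples row-sparsity and rank surrogates and is considerably more involved).

```python
import numpy as np

def irls_sparse(A, y, n_iter=50, eps=1e-3):
    """IRLS sketch for min ||x||_1 s.t. Ax = y. Each iteration solves a
    weighted least squares problem with weights 1 / (|x_i| + eps), the
    classical reweighting for the l1 surrogate."""
    x = A.T @ np.linalg.solve(A @ A.T, y)      # least-norm initialization
    for _ in range(n_iter):
        D = np.diag(np.abs(x) + eps)           # inverse of the weight matrix
        # weighted least squares: x = D A^T (A D A^T)^{-1} y
        x = D @ A.T @ np.linalg.solve(A @ D @ A.T, y)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 30, 77]] = [2.0, -1.5, 1.0]
print(np.round(irls_sparse(A, A @ x_true)[[3, 30, 77]], 2))  # recovers the spikes
```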

SOL: Sampling-based Optimal Linear bounding of arbitrary scalar functions
Yuriy Biktairov Jyotirmoy Deshmukh



Research question: Finding tight linear bounds for activation functions in neural networks is an essential ingredient of state-of-the-art neural network robustness certification tools.
Motivation: In existing work on robustness certification, such bounds have been computed with human ingenuity for a handful of the most popular activation functions. Although many heuristics have been proposed for bounding arbitrary functions, to our knowledge no analysis of tightness optimality for general scalar functions has been offered.
Method: We fill this gap by formulating a concise optimality criterion that allows us to build optimal bounds for any function convex in the region of interest $R$. For the more general class of functions Lipschitz-continuous in $R$, we propose a sampling-based approach (SOL) that efficiently computes the tightest linear bounds within a given threshold $\varepsilon > 0$.
Results: We leverage an adaptive sampling technique to iteratively build a set of sample points suitable for representing the target activation function. Although the theoretical worst-case time complexity of the approach is $O(\varepsilon^{-2d})$, it typically takes only $O(\log^{\beta}(1/\varepsilon))$ time for some $\beta \ge 1$ and is thus fast enough in practice. Incorporated into a robustness certifier, SOL produces similar or higher certification rates while taking as little as a quarter of the time of other methods, providing empirical evidence of its practicality.

Finding tight linear bounds for activation functions in neural networks is an essential part of several state-of-the-art neural network robustness certification tools. An activation function is an arbitrary, nonlinear, scalar function $f: \mathbb{R}^d \rightarrow \mathbb{R}$. In the existing work on robustness certification, such bounds have been computed using human ingenuity for a handful of the most popular activation functions. While a number of heuristics have been proposed for bounding arbitrary functions, no analysis of the tightness optimality for general scalar functions has been offered yet, to the best of our knowledge. We fill this gap by formulating a concise optimality criterion for tightness of the approximation which allows us to build optimal bounds for any function convex in the region of interest $R$. For a more general class of functions Lipschitz-continuous in $R$ we propose a sampling-based approach (SOL) which, given an instance of the bounding problem, efficiently computes the tightest linear bounds within a given $\varepsilon > 0$ threshold. We leverage an adaptive sampling technique to iteratively build a set of sample points suitable for representing the target activation function. While the theoretical worst case time complexity of our approach is $O(\varepsilon^{-2d})$, it typically only takes $O(\log^{\beta} \frac{1}{\varepsilon})$ time for some $\beta \ge 1$ and is thus sufficiently fast in practice. We provide empirical evidence of SOL's practicality by incorporating it into a robustness certifier and observing that it produces similar or higher certification rates while taking as little as a quarter of the time compared to the other methods.
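
For the convex case, the optimality criterion has a familiar concrete instance: on an interval, the chord is the tightest linear upper bound, and a tangent parallel to the chord is a classical tight lower bound. The 1-D sketch below illustrates this convex special case only, not the SOL sampling loop.

```python
import numpy as np

def linear_bounds_convex(f, df, lo, hi):
    """Linear bounds for a convex scalar f on [lo, hi].
    Upper bound: the chord through (lo, f(lo)) and (hi, f(hi)).
    Lower bound: the tangent whose slope matches the chord slope, located
    by bisection (df is nondecreasing for convex f)."""
    slope = (f(hi) - f(lo)) / (hi - lo)
    upper = (slope, f(lo) - slope * lo)          # line y = slope * x + intercept
    a, b = lo, hi
    for _ in range(60):                          # bisect for df(x*) = slope
        m = 0.5 * (a + b)
        a, b = (m, b) if df(m) < slope else (a, m)
    x_star = 0.5 * (a + b)
    lower = (slope, f(x_star) - slope * x_star)
    return upper, lower

softplus = lambda x: np.log1p(np.exp(x))
d_softplus = lambda x: 1.0 / (1.0 + np.exp(-x))
print(linear_bounds_convex(softplus, d_softplus, -2.0, 3.0))
```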

Near Optimal Reconstruction of Spherical Harmonic Expansions
Amir Zandieh Insu Han Haim Avron



Research question: Propose an algorithm that recovers the spherical harmonic expansion of a function defined on the $d$-dimensional unit sphere $\mathbb{S}^{d-1}$ using a near-optimal number of function evaluations.
Motivation: For any $f\in L^2(\mathbb{S}^{d-1})$, the number of evaluations of $f$ needed to recover its degree-$q$ expansion should match the dimension of the space of spherical harmonics of degree at most $q$.
Method: A simple yet efficient kernel regression-based algorithm is developed that recovers the degree-$q$ expansion of $f$ by evaluating the function only at uniformly sampled points on $\mathbb{S}^{d-1}$; the algorithm builds on the connections between spherical harmonics and Gegenbauer polynomials.
Results: Numerical experiments show that the algorithm works efficiently with a near-optimal number of samples in any dimension $d$.

We propose an algorithm for robust recovery of the spherical harmonic expansion of functions defined on the $d$-dimensional unit sphere $\mathbb{S}^{d-1}$ using a near-optimal number of function evaluations. We show that for any $f\in L^2(\mathbb{S}^{d-1})$, the number of evaluations of $f$ needed to recover its degree-$q$ spherical harmonic expansion equals the dimension of the space of spherical harmonics of degree at most $q$, up to a logarithmic factor. Moreover, we develop a simple yet efficient kernel regression-based algorithm to recover degree-$q$ expansion of $f$ by only evaluating the function on uniformly sampled points on $\mathbb{S}^{d-1}$. Our algorithm is built upon the connections between spherical harmonics and Gegenbauer polynomials. Unlike the prior results on fast spherical harmonic transform, our proposed algorithm works efficiently using a nearly optimal number of samples in any dimension $d$. Furthermore, we illustrate the empirical performance of our algorithm on numerical examples.

The Gain from Ordering in Online Learning
Vasilis Kontonis Mingchen Ma Christos Tzamos



Research question: This paper studies fixed-design online learning, in which the learner may choose the order of the data points to minimize regret (also known as self-directed online learning).
Motivation: We focus on the fundamental task of online linear regression: given a dataset $X$, at step $t$ the learner selects a point $x_t \in X$, predicts a value $\widetilde y_t$, and suffers loss $(\widetilde y_t - w^\ast \cdot x_t)^2$. The goal is to design algorithms that order the examples and achieve better regret than random- or worst-order online algorithms.
Method: For an arbitrary dataset $X$, we show, under the Exponential Time Hypothesis, that no efficient algorithm can approximate the optimal (best-order) regret within a factor of $d^{1/\poly(\log \log d)}$.
Results: We then show that for structured datasets this hardness result can be bypassed and nearly optimal regret achieved. When the examples of $X$ are drawn from the uniform distribution on the sphere, we present a greedy heuristic that selects the "easiest" examples first and achieves a $\log d$-approximation of the optimal regret.

We study fixed-design online learning where the learner is allowed to choose the order of the datapoints in order to minimize their regret (aka self-directed online learning). We focus on the fundamental task of online linear regression: the learner is given a dataset $X$ with $n$ examples in $d$ dimensions and at step $t$ they select a point $x_t \in X$, predict a value $\widetilde y_t$, and suffer loss $(\widetilde y_t - w^\ast \cdot x_t)^2$. The goal is to design algorithms that order the examples and achieve better regret than random- or worst-order online algorithms. For an arbitrary dataset $X$, we show that, under the Exponential Time Hypothesis, no efficient algorithm can approximate the optimal (best-order) regret within a factor of $d^{1/\poly(\log \log d)}$. We then show that, for structured datasets, we can bypass the above hardness result and achieve nearly optimal regret. When the examples of $X$ are drawn i.i.d.\ from the uniform distribution on the sphere, we present an algorithm based on the greedy heuristic of selecting ``easiest'' examples first that achieves a $\log d$-approximation of the optimal regret.
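
One plausible (hypothetical) instantiation of "easiest first" is to greedily pick the point whose prediction is currently most certain under a ridge design; the paper's precise heuristic may differ, so the sketch below is purely illustrative.

```python
import numpy as np

def easiest_first_order(X, lam=1.0):
    """Greedy ordering sketch: repeatedly pick the remaining point x that
    minimizes the predictive uncertainty x^T (lam*I + sum x x^T)^{-1} x,
    then fold it into the design via a Sherman-Morrison rank-one update."""
    n, d = X.shape
    Sigma_inv = np.eye(d) / lam                 # precision of the ridge design
    remaining, order = set(range(n)), []
    while remaining:
        idx = min(remaining, key=lambda i: X[i] @ Sigma_inv @ X[i])
        order.append(idx)
        remaining.remove(idx)
        v = Sigma_inv @ X[idx]
        Sigma_inv -= np.outer(v, v) / (1.0 + X[idx] @ v)
    return order

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))
print(easiest_first_order(X))
```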

The noise level in linear regression with dependent data
Ingvar Ziemann Stephen Tu George J. Pappas Nikolai Matni



Research question: This paper derives upper bounds for random design linear regression with dependent ($\beta$-mixing) data, without any realizability assumptions.
Motivation: In contrast to the strictly realizable martingale noise regime, no sharp instance-optimal non-asymptotic analysis was available in the literature.
Method: Up to constant factors, the analysis correctly recovers the variance term predicted by the Central Limit Theorem, i.e., the noise level of the problem, and thus exhibits graceful degradation as misspecification is introduced.
Results: Past a burn-in, the result is sharp in the moderate deviations regime and, in particular, does not inflate the leading-order term by mixing time factors.

We derive upper bounds for random design linear regression with dependent ($\beta$-mixing) data absent any realizability assumptions. In contrast to the strictly realizable martingale noise regime, no sharp \emph{instance-optimal} non-asymptotics are available in the literature. Up to constant factors, our analysis correctly recovers the variance term predicted by the Central Limit Theorem---the noise level of the problem---and thus exhibits graceful degradation as we introduce misspecification. Past a burn-in, our result is sharp in the moderate deviations regime, and in particular does not inflate the leading order term by mixing time factors.

Experimental Designs for Heteroskedastic Variance
Justin David Naggar Weltz Tanner Fiez Alexander Volfovsky Eric Laber Blake Mason houssam nassif Lalit K Jain



Research question: Many experimental design problems in realistic settings involve heteroskedastic noise, while most linear experimental design problems assume homoskedastic variance.
Motivation: This work addresses that gap by proposing a new approach to experimental design under heteroskedastic noise.
Method: The learner has access to a finite set of measurement vectors that can be probed to receive noisy linear responses. We propose a new design for uniformly bounding the estimation error of the variance parameters.
Results: We demonstrate this method on two adaptive experimental design problems under heteroskedastic noise and prove the first instance-dependent lower bounds in these settings. We also construct near-optimal algorithms and show empirically that accounting for heteroskedastic variance in these designs yields large reductions in sample complexity.

Most linear experimental design problems assume homogeneous variance, while heteroskedastic noise is present in many realistic settings. Let a learner have access to a finite set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$ that can be probed to receive noisy linear responses of the form $y=x^{\top}\theta^{\ast}+\eta$. Here $\theta^{\ast}\in \mathbb{R}^d$ is an unknown parameter vector, and $\eta$ is independent mean-zero $\sigma_x^2$-sub-Gaussian noise defined by a flexible heteroskedastic variance model, $\sigma_x^2 = x^{\top}\Sigma^{\ast}x$. Assuming that $\Sigma^{\ast}\in \mathbb{R}^{d\times d}$ is an unknown matrix, we propose, analyze and empirically evaluate a novel design for uniformly bounding estimation error of the variance parameters, $\sigma_x^2$. We demonstrate this method on two adaptive experimental design problems under heteroskedastic noise, fixed confidence transductive best-arm identification and level-set identification and prove the first instance-dependent lower bounds in these settings. Lastly, we construct near-optimal algorithms and demonstrate the large improvements in sample complexity gained from accounting for heteroskedastic variance in these designs empirically.

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs
Dongsheng Ding Chen-Yu Wei Kaiqing Zhang Alejandro Ribeiro



Research question: Computing an optimal policy of an infinite-horizon discounted constrained Markov decision process.
Motivation: Despite the popularity of Lagrangian-based policy search methods in practice, the oscillation of policy iterates in these methods is not fully understood, leading to issues such as constraint violation and sensitivity to hyper-parameters.
Method: The Lagrangian method is employed to cast the constrained MDP into a constrained saddle-point problem in which the max/min players correspond to primal/dual variables, respectively, and two single-time-scale policy-based primal-dual algorithms are developed whose policy iterates converge non-asymptotically to an optimal constrained policy.
Results: Computational experiments show that the methods outperform existing baselines and scale well to large state or action spaces.

We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods used in practice, the oscillation of policy iterates in these methods has not been fully understood, bringing out issues such as violation of constraints and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which max/min players correspond to primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. Specifically, we first propose a regularized policy gradient primal-dual (RPG-PD) method that updates the policy using an entropy-regularized policy gradient, and the dual variable via a quadratic-regularized gradient ascent, simultaneously. We prove that the policy primal-dual iterates of RPG-PD converge to a regularized saddle point with a sublinear rate, while the policy iterates converge sublinearly to an optimal constrained policy. We further instantiate RPG-PD in large state or action spaces by including function approximation in policy parametrization, and establish similar sublinear last-iterate policy convergence. Second, we propose an optimistic policy gradient primal-dual (OPG-PD) method that employs the optimistic gradient method to update primal/dual variables, simultaneously. We prove that the policy primal-dual iterates of OPG-PD converge to a saddle point that contains an optimal constrained policy, with a linear rate. To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs. We further validate the merits and the effectiveness of our methods in computational experiments.

$\varepsilon$-fractional core stability in Hedonic Games.
Simone Fioravanti Michele Flammini Bojana Kodric Giovanna Varricchio



Research question: How to find coalition structures in hedonic games that are both stable and efficiently computable.
Motivation: Core-stable coalition structures in classical hedonic game models seldom exist, and even when they do, finding one is often computationally intractable.
Method: The notion of $\varepsilon$-fractional core stability is proposed, allowing at most an $\varepsilon$-fraction of all possible coalitions to core-block; efficient algorithms returning $\varepsilon$-fractional core-stable partitions are designed for two fundamental classes of hedonic games, Simple Fractional and Anonymous.
Results: By extending the definition to more complex sampling distributions, coalition structures that are $\varepsilon$-fractional core-stable with arbitrarily high confidence can be computed efficiently even when valuations must be learned from samples in a PAC-learning fashion.

Hedonic Games (HGs) are a classical framework modeling coalition formation of strategic agents guided by their individual preferences. According to these preferences, it is desirable that a coalition structure (i.e. a partition of agents into coalitions) satisfies some form of stability. The most well-known and natural of such notions is arguably core-stability. Informally, a partition is core-stable if no subset of agents would like to deviate by regrouping in a so-called core-blocking coalition. Unfortunately, core-stable partitions seldom exist and even when they do, it is often computationally intractable to find one. To circumvent these problems, we propose the notion of $\varepsilon$-fractional core-stability, where at most an $\varepsilon$-fraction of all possible coalitions is allowed to core-block. It turns out that such a relaxation may guarantee both existence and polynomial-time computation. Specifically, we design efficient algorithms returning an $\varepsilon$-fractional core-stable partition, with $\varepsilon$ exponentially decreasing in the number of agents, for two fundamental classes of HGs: Simple Fractional and Anonymous. From a probabilistic point of view, since the definition of $\varepsilon$-fractional core is equivalent to requiring that uniformly sampled coalitions core-block with probability lower than $\varepsilon$, we further extend the definition to handle more complex sampling distributions. Along this line, when valuations have to be learned from samples in a PAC-learning fashion, we give positive and negative results on which distributions allow the efficient computation of outcomes that are $\varepsilon$-fractional core-stable with arbitrarily high confidence.
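
The probabilistic reading suggests a direct Monte Carlo check of a candidate partition: sample coalitions uniformly and estimate the fraction that core-block. The sketch below assumes a hypothetical preference oracle `prefers_deviation` and toy anonymous preferences; it only illustrates the sampled definition, not the paper's algorithms.

```python
import numpy as np

def estimate_blocking_fraction(agents, prefers_deviation, n_samples=10000, rng=None):
    """Estimate the fraction of coalitions that core-block a given partition:
    sample coalitions uniformly at random and check whether every member
    strictly prefers the sampled coalition to their current one."""
    rng = np.random.default_rng(0) if rng is None else rng
    agents = list(agents)
    blocked = 0
    for _ in range(n_samples):
        mask = rng.integers(0, 2, size=len(agents)).astype(bool)
        coalition = frozenset(a for a, keep in zip(agents, mask) if keep)
        if coalition and all(prefers_deviation(a, coalition) for a in coalition):
            blocked += 1
    return blocked / n_samples

# toy anonymous preferences: every agent wants a coalition of size 3,
# and the current partition places everyone in singletons
agents = range(12)
prefers = lambda a, S: abs(len(S) - 3) < abs(1 - 3)
print(estimate_blocking_fraction(agents, prefers))  # roughly 0.19 here
```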

A unified framework for information-theoretic generalization bounds
Yifeng Chu Maxim Raginsky



Research question: This paper presents a general methodology for deriving information-theoretic generalization bounds for learning algorithms.
Motivation: The main technical tool is a probabilistic decorrelation lemma based on a change of measure and a relaxation of Young's inequality in $L_{\psi_p}$ Orlicz spaces.
Method: Combining the decorrelation lemma with other techniques, such as symmetrization, couplings, and chaining in the space of probability measures, yields new upper bounds on the generalization error, both in expectation and with high probability.
Results: As special cases, the method recovers many existing generalization bounds, including those based on mutual information, conditional mutual information, stochastic chaining, and PAC-Bayes inequalities; moreover, the Fernique--Talagrand upper bound on the expected supremum of a subgaussian process emerges as a special case.

This paper presents a general methodology for deriving information-theoretic generalization bounds for learning algorithms. The main technical tool is a probabilistic decorrelation lemma based on a change of measure and a relaxation of Young's inequality in $L_{\psi_p}$ Orlicz spaces. Using the decorrelation lemma in combination with other techniques, such as symmetrization, couplings, and chaining in the space of probability measures, we obtain new upper bounds on the generalization error, both in expectation and in high probability, and recover as special cases many of the existing generalization bounds, including the ones based on mutual information, conditional mutual information, stochastic chaining, and PAC-Bayes inequalities. In addition, the Fernique--Talagrand upper bound on the expected supremum of a subgaussian process emerges as a special case.

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction
Xiaowen Jiang Sebastian U Stich



Research question: The existing stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) algorithms are remarkably effective when training over-parameterized models, but in non-interpolation settings they only guarantee convergence to a neighborhood of a solution, which may yield an output worse than the initial guess.
Motivation: To address this, we propose two new robust variants, AdaSPS and AdaSLS, and design a new variance-reduction (VR) method to accelerate both stepsizes so that they attain optimal convergence rates.
Method: AdaSPS and AdaSLS achieve optimal asymptotic rates in strongly-convex or convex and interpolation or non-interpolation settings. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound on the optimal function value as input. The new VR method can use Polyak stepsizes or line-search to achieve acceleration.
Results: Numerical experiments on synthetic and real datasets validate the theory and demonstrate the effectiveness and robustness of the algorithms.

The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, two issues remain unsolved in this line of work. First, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al.), this approach results in slower convergence rates under interpolation. Second, intuitive line-search methods equipped with variance-reduction (VR) fail to converge (Dubois-Taine et al.). So far, no VR methods successfully accelerate these two stepsizes with a convergence guarantee. In this work, we make two contributions: Firstly, we propose two new robust variants of SPS and SLS, called AdaSPS and AdaSLS, which achieve optimal asymptotic rates in both strongly-convex or convex and interpolation or non-interpolation settings, except for the case when we have both strong convexity and non-interpolation. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound of the optimal function value as input. Secondly, we propose a novel VR method that can use Polyak stepsizes or line-search to achieve acceleration. When it is equipped with AdaSPS or AdaSLS, the resulting algorithms obtain the optimal rate for optimizing convex smooth functions. Finally, numerical experiments on synthetic and real datasets validate our theory and demonstrate the effectiveness and robustness of our algorithms.
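
For context, the vanilla stochastic Polyak stepsize that these variants build on is a one-liner; the sketch below shows the classical capped form (with assumed constants $c$ and $\eta_{\max}$), not AdaSPS itself.

```python
import numpy as np

def sgd_with_sps(loss, grad, x0, lower_bounds, n_steps=500, c=0.5, eta_max=10.0):
    """SGD with the vanilla stochastic Polyak stepsize
        eta_t = min((f_i(x_t) - l_i*) / (c * ||g_t||^2), eta_max),
    where l_i* lower-bounds the minimum of the sampled component f_i."""
    x = np.asarray(x0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        i = rng.integers(len(lower_bounds))        # sample a component
        g = grad(i, x)
        eta = min((loss(i, x) - lower_bounds[i]) / (c * (g @ g) + 1e-12), eta_max)
        x -= eta * g
    return x

# toy non-interpolation problem: f_i(x) = 0.5 * ||x - a_i||^2, each l_i* = 0
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
loss = lambda i, x: 0.5 * np.sum((x - A[i]) ** 2)
grad = lambda i, x: x - A[i]
# converges only to a neighborhood of the mean of the a_i, the behavior AdaSPS fixes
print(sgd_with_sps(loss, grad, np.zeros(2), [0.0] * 3))
```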

First Order Stochastic Optimization with Oblivious Noise
Ilias Diakonikolas Sushrut Karmalkar Jongho Park Christos Tzamos



Research question: This paper initiates the study of first-order stochastic optimization with oblivious noise, broadly generalizing the standard heavy-tailed noise setup.
Motivation: In addition to random observation noise, the stochastic gradient may be subject to independent oblivious noise that may not have bounded moments and is not necessarily centered; when the fraction of inliers $\alpha$ is less than $1/2$, it is information-theoretically impossible to recover a single solution close to the target.
Method: An efficient list-decodable learner is designed that recovers a small list of candidates at least one of which is close to the true solution; along the way, a rejection-sampling-based algorithm for noisy location estimation is developed.
Results: When $\alpha = 1-\epsilon$ for a sufficiently small constant $\epsilon$, the algorithm recovers a single solution; the location estimation routine may be of independent interest.

We initiate the study of stochastic optimization with oblivious noise, broadly generalizing the standard heavy-tailed noise setup. In our setting, in addition to random observation noise, the stochastic gradient may be subject to independent \emph{oblivious noise}, which may not have bounded moments and is not necessarily centered. Specifically, we assume access to a noisy oracle for the stochastic gradient of $f$ at $x$, which returns a vector $\nabla f(\gamma, x) + \xi$, where $\gamma$ is the bounded variance observation noise and $\xi$ is the oblivious noise that is independent of $\gamma$ and $x$. The only assumption we make on the oblivious noise $\xi$ is that $\Pr[\xi = 0] \ge \alpha$, for some $\alpha \in (0, 1)$. In this setting, it is not information-theoretically possible to recover a single solution close to the target when the fraction of inliers $\alpha$ is less than $1/2$. Our main result is an efficient {\em list-decodable} learner that recovers a small list of candidates at least one of which is close to the true solution. On the other hand, if $\alpha = 1-\epsilon$, where $0< \epsilon < 1/2$ is sufficiently small constant, the algorithm recovers a single solution. Along the way, we develop a rejection-sampling-based algorithm to perform noisy location estimation, which may be of independent interest.

Sharp Recovery Thresholds of Tensor PCA Spectral Algorithms
Michael Jacob Feldman David Donoho



Research question: How to recover low-rank approximations from noisy tensor data.
Motivation: Many applications seek to recover low-rank approximations of noisy tensor data; we therefore consider several practical and effective matricization strategies.
Method: The strategies, tensor unfolding, partial tracing, power iteration, and recursive unfolding, construct specific matrices from the tensor and then apply spectral methods.
Results: The analysis deploys random matrix theory to obtain sharp thresholds that elude perturbation and concentration bounds. In particular, under conditions where previous algorithms partially recover the signal, power iteration and recursive unfolding are proved to achieve (asymptotically) exact recovery.

Many applications seek to recover low-rank approximations of noisy tensor data. We consider several practical and effective matricization strategies which construct specific matrices from such tensors and then apply spectral methods; the strategies include tensor unfolding, partial tracing, power iteration, and recursive unfolding. We settle the behaviors of unfolding and partial tracing, identifying sharp thresholds in signal-to-noise ratio above which the signal is partially recovered. In particular, we extend previous results to a much larger class of tensor shapes where axis lengths may be different. For power iteration and recursive unfolding, we prove that under conditions where previous algorithms partially recover the signal, these methods achieve (asymptotically) exact recovery. Our analysis deploys random matrix theory to obtain sharp thresholds which elude perturbation and concentration bounds. Specifically, we rely upon recent disproportionate random matrix results, which describe sequences of matrices with diverging aspect ratio.
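
The unfolding strategy is easy to state in code: reshape the tensor into a matrix and take a leading singular vector. A minimal numpy sketch for a rank-one spike in Gaussian noise follows (signal strength and sizes are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
snr = 5 * n ** 0.75            # comfortably above the unfolding threshold
u = rng.standard_normal(n)
u /= np.linalg.norm(u)

# rank-one spike u (x) u (x) u buried in i.i.d. Gaussian noise
T = snr * np.einsum("i,j,k->ijk", u, u, u) + rng.standard_normal((n, n, n))

# tensor unfolding: flatten mode 1 against modes (2, 3), then a spectral step
M = T.reshape(n, n * n)
u_hat = np.linalg.svd(M, full_matrices=False)[0][:, 0]

print(abs(u_hat @ u))          # correlation with the planted signal, close to 1
```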

Fast and Simple Spectral Clustering in Theory and Practice
Peter Macgregor



Research question: Design an efficient algorithm for finding $k$ clusters in a graph $G$.
Motivation: In the classical spectral clustering algorithm, the vertices of $G$ are embedded into $\mathbb{R}^k$ using $k$ eigenvectors of the graph Laplacian matrix, but computing this embedding is expensive and dominates the running time of the algorithm.
Method: A simple spectral clustering algorithm is proposed based on a vertex embedding with $O(\log(k))$ vectors computed by the power method; the embedding is computed in time nearly linear in the size of the graph, and the algorithm provably recovers the ground truth clusters under natural assumptions on the input graph.
Results: On several synthetic and real-world datasets, the new algorithm is significantly faster than alternative clustering algorithms while producing results with approximately the same clustering accuracy.

Spectral clustering is a popular and effective algorithm designed to find $k$ clusters in a graph $G$. In the classical spectral clustering algorithm, the vertices of $G$ are embedded into $\mathbb{R}^k$ using $k$ eigenvectors of the graph Laplacian matrix. However, computing this embedding is computationally expensive and dominates the running time of the algorithm. In this paper, we present a simple spectral clustering algorithm based on a vertex embedding with $O(\log(k))$ vectors computed by the power method. The vertex embedding is computed in nearly-linear time with respect to the size of the graph, and the algorithm provably recovers the ground truth clusters under natural assumptions on the input graph. We evaluate the new algorithm on several synthetic and real-world datasets, finding that it is significantly faster than alternative clustering algorithms, while producing results with approximately the same clustering accuracy.
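
A minimal sketch of the main idea, smoothing a few random vectors with a lazy random-walk matrix instead of computing exact eigenvectors and then running $k$-means on the embedding, is shown below; the number of iterations and the normalization are illustrative choices, not the paper's exact construction.

```python
import numpy as np
from sklearn.cluster import KMeans

def fast_spectral_cluster(adj, k, n_power_iters=30, seed=0):
    """Power-method spectral clustering sketch: apply the lazy random-walk
    matrix (I + D^{-1} A) / 2 repeatedly to O(log k) random vectors and
    cluster the resulting low-dimensional embedding with k-means."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    d_inv = 1.0 / np.maximum(adj.sum(axis=1), 1)
    n_vecs = max(1, int(np.ceil(np.log2(k))) + 1)
    Y = rng.standard_normal((n, n_vecs))
    for _ in range(n_power_iters):
        Y = 0.5 * (Y + d_inv[:, None] * (adj @ Y))   # one lazy-walk step
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)

# toy graph: two 20-cliques joined by a single edge
B = np.ones((20, 20)) - np.eye(20)
adj = np.block([[B, np.zeros((20, 20))], [np.zeros((20, 20)), B]])
adj[0, 20] = adj[20, 0] = 1
print(fast_spectral_cluster(adj, k=2))   # first 20 labels differ from last 20
```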

On Learning Latent Models with Multi-Instance Weak Supervision
Kaifu Wang Efthymia Tsamoura Dan Roth



Research question: This paper studies multi-instance partial label learning (multi-instance PLL) in a weakly supervised setting, where the supervision signal is generated by a transition function over the labels of multiple input instances.
Motivation: Despite the existence of many learning techniques, limited theoretical analysis has been dedicated to this problem; the authors therefore provide its first theoretical study.
Method: The authors propose a necessary and sufficient condition for the learnability of the problem and derive Rademacher-style error bounds based on the top-$k$ surrogate loss widely used in the neuro-symbolic literature.
Results: Empirical results align with the theoretical findings but also expose scalability issues in the weak supervision literature.

We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $\sigma$ of labels associated with multiple input instances. We formulate this problem as *multi-instance Partial Label Learning (multi-instance PLL)*, which is an extension to the standard PLL problem. Our problem is met in different fields, including latent structural learning and neuro-symbolic integration. Despite the existence of many learning techniques, limited theoretical analysis has been dedicated to this problem. In this paper, we provide the first theoretical study of multi-instance PLL with possibly an unknown transition $\sigma$. Our main contributions are as follows: First, we propose a necessary and sufficient condition for the learnability of the problem. This condition nontrivially generalizes and relaxes the existing *small ambiguity degree* in PLL literature since we allow the transition to be deterministic. Second, we derive Rademacher-style error bounds based on the top-$k$ surrogate loss that is widely used in the neuro-symbolic literature. Furthermore, we conclude with empirical experiments for learning with an unknown transition. The empirical results align with our theoretical findings; however, they also expose the issue of scalability in the weak supervision literature.

Bicriteria Approximation Algorithms for the Submodular Cover Problem
Wenjing Chen Victoria G. Crawford



Research question: This paper studies the Submodular Cover (SCP) optimization problem: find a minimum-cardinality subset of a finite universe $U$ such that the value of a submodular function $f$ exceeds an input threshold $\tau$.
Motivation: Existing SCP algorithms do not handle the non-monotone and regularized cases effectively and have long running times.
Method: A scalable algorithm is proposed for monotone SCP that achieves nearly the same approximation guarantees as the standard greedy algorithm in significantly faster time; the first algorithm for general SCP whose solutions can be arbitrarily close to feasible is developed; and the first algorithms for regularized SCP are given.
Results: Experiments show that the algorithms are highly effective on data summarization and graph cut, two applications of SCP.

In this paper, we consider the optimization problem Submodular Cover (SCP), which is to find a minimum cardinality subset of a finite universe $U$ such that the value of a submodular function $f$ is above an input threshold $\tau$. In particular, we consider several variants of SCP including the general case, the case where $f$ is additionally assumed to be monotone, and finally the case where $f$ is a regularized monotone submodular function. Our most significant contributions are that: (i) We propose a scalable algorithm for monotone SCP that achieves nearly the same approximation guarantees as the standard greedy algorithm in significantly faster time; (ii) We are the first to develop an algorithm for general SCP that achieves a solution arbitrarily close to being feasible; and finally (iii) we are the first to develop algorithms for regularized SCP. Our algorithms are then demonstrated to be effective in an extensive experimental section on data summarization and graph cut, two applications of SCP.
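
As a baseline for the monotone case, the standard greedy algorithm the paper compares against is only a few lines; the coverage function below is an illustrative choice of monotone submodular $f$.

```python
def greedy_submodular_cover(universe, f, tau):
    """Classical greedy for monotone Submodular Cover: repeatedly add the
    element with the largest marginal gain until f(S) >= tau."""
    S = set()
    while f(S) < tau:
        gains = {u: f(S | {u}) - f(S) for u in universe - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break                    # threshold tau is unreachable
        S.add(best)
    return S

# illustrative monotone submodular f: coverage over a small ground set
coverage = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"e"}}
f = lambda S: len(set().union(*(coverage[u] for u in S))) if S else 0
print(greedy_submodular_cover(set(coverage), f, tau=5))  # picks {3, 1}
```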

Exponentially Convergent Algorithms for Supervised Matrix Factorization
Joowon Lee Hanbaek Lyu Weixin Yao



Research question: This paper addresses supervised matrix factorization (SMF), which simultaneously pursues feature extraction and classification, and the challenges posed by high-dimensional data.
Motivation: Training SMF models involves nonconvex and possibly constrained optimization, and known algorithms are either heuristic or provide weak convergence guarantees only for special cases.
Method: A novel framework "lifts" SMF as a low-rank matrix estimation problem in a combined factor space, together with an efficient algorithm that, under mild assumptions, provably converges exponentially fast to a global minimizer of the objective from arbitrary initialization.
Results: The framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features; the algorithm successfully identified well-known cancer-associated gene groups for various cancers.

Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Training an SMF model involves solving a nonconvex and possibly constrained optimization with at least three blocks of parameters. Known algorithms are either heuristic or provide weak convergence guarantees for special cases. In this paper, we provide a novel framework that `lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that provably converges exponentially fast to a global minimizer of the objective with arbitrary initialization under mild assumptions. Our framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features. To showcase an application, we demonstrate that our algorithm successfully identified well-known cancer-associated gene groups for various cancers.

Outlier-Robust Wasserstein DRO
Sloan Nietert Ziv Goldfeld Soroosh Shafiee



Research question: This paper addresses uncertainty in data-driven decision-making, in particular under both geometric and non-geometric perturbations.
Motivation: Existing Wasserstein distributionally robust optimization (WDRO) methods fail to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model.
Method: A novel outlier-robust WDRO framework is proposed that allows an $\varepsilon$-fraction of the data to be arbitrarily corrupted, covering both geometric (Wasserstein) perturbations and non-geometric (total variation, TV) contamination; an uncertainty set capturing both perturbation types is designed, and minimax optimal excess risk bounds explicitly capturing the Wasserstein and TV risks are derived.
Results: Experiments on standard regression and classification tasks validate the theory.

Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to~sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.

Incentivized Communication for Federated Bandits
Zhepei Wei Chuanhao Li Haifeng Xu Hongning Wang



Research question: Existing federated learning algorithms typically assume that all clients altruistically share their data, an idealized assumption that often fails in practice, especially with self-interested clients.
Motivation: Neglecting such self-interested behavior can severely affect the efficiency and practical operability of federated learning; we therefore introduce an incentivized communication problem in which the server motivates clients to share data by providing incentives.
Method: We instantiate this bandit problem in the contextual linear setting and propose the first incentivized communication protocol, Inc-FedUCB, which achieves near-optimal regret with provable communication and incentive cost guarantees.
Results: Extensive experiments on synthetic and real-world datasets further validate the effectiveness of the method across various environments.

Most existing works on federated bandits take it for granted that all clients are altruistic about sharing their data with the server for the collective good whenever needed. Despite their compelling theoretical guarantee on performance and communication efficiency, this assumption is overly idealistic and oftentimes violated in practice, especially when the algorithm is operated over self-interested clients, who are reluctant to share data without explicit benefits. Negligence of such self-interested behaviors can significantly affect the learning efficiency and even the practical operability of federated bandit learning. In light of this, we aim to spark new insights into this under-explored research area by formally introducing an incentivized communication problem for federated bandits, where the server shall motivate clients to share data by providing incentives. Without loss of generality, we instantiate this bandit problem with the contextual linear setting and propose the first incentivized communication protocol, namely, Inc-FedUCB, that achieves near-optimal regret with provable communication and incentive cost guarantees. Extensive empirical experiments on both synthetic and real-world datasets further validate the effectiveness of the proposed method across various environments.

Learning Provably Robust Estimators for Inverse Problems via Jittering
Anselm Krainovic Mahdi Soltanolkotabi Reinhard Heckel



Research question: This paper studies the optimal worst-case robustness of deep neural networks for inverse problems, and whether jittering, a simple regularization technique that adds Gaussian noise during training, can effectively learn worst-case robust estimators.
Motivation: Deep neural networks perform excellently on inverse problems such as denoising but can be sensitive to adversarial or worst-case perturbations, raising the question of how to train networks efficiently for worst-case robustness.
Method: A novel analytical characterization of the optimal $\ell_2$-worst-case robust estimator for linear denoising is presented, showing that jittering yields optimal robust denoisers; jittering is also examined empirically by training deep neural networks (U-nets) for natural image denoising, deconvolution, and accelerated magnetic resonance imaging (MRI).
Results: Jittering significantly enhances worst-case robustness but can be suboptimal for inverse problems beyond denoising; moreover, training on real data, which often contains slight noise, is itself somewhat robustness-enhancing.

Deep neural networks provide excellent performance for inverse problems such as denoising. However, neural networks can be sensitive to adversarial or worst-case perturbations. This raises the question of whether such networks can be trained efficiently to be worst-case robust. In this paper, we investigate whether jittering, a simple regularization technique that adds isotropic Gaussian noise during training, is effective for learning worst-case robust estimators for inverse problems. While well studied for prediction in classification tasks, the effectiveness of jittering for inverse problems has not been systematically investigated. In this paper, we present a novel analytical characterization of the optimal $\ell_2$-worst-case robust estimator for linear denoising and show that jittering yields optimal robust denoisers. Furthermore, we examine jittering empirically via training deep neural networks (U-nets) for natural image denoising, deconvolution, and accelerated magnetic resonance imaging (MRI). The results show that jittering significantly enhances the worst-case robustness, but can be suboptimal for inverse problems beyond denoising. Moreover, our results imply that training on real data which often contains slight noise is somewhat robustness enhancing.
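
Jittering itself is a one-line change to a training loop: perturb the network inputs with extra isotropic Gaussian noise. A toy sketch follows; the tiny MLP and the noise scales are illustrative assumptions, not the paper's U-net setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sigma_noise, sigma_jitter = 0.1, 0.05   # measurement noise and jittering levels

for step in range(200):
    x_clean = torch.randn(32, 16)                           # stand-in for clean signals
    y = x_clean + sigma_noise * torch.randn_like(x_clean)   # noisy measurements
    y_jittered = y + sigma_jitter * torch.randn_like(y)     # the jittering step
    loss = ((model(y_jittered) - x_clean) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))   # denoising MSE after training with jittering
```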

Zeroth-Order Methods for Nondifferentiable, Nonconvex, and Hierarchical Federated Optimization
Yuyang Qiu Uday Shanbhag Farzad Yousefian



Research question: This paper studies three broadly applicable problem classes in federated learning: nondifferentiable nonconvex optimization, federated bilevel optimization, and federated minimax problems.
Motivation: These problems are often complicated by the absence of a closed-form expression for the implicit objective function, and existing research is limited and relies on strong assumptions such as differentiability and L-smoothness of the implicit function.
Method: Leveraging convolution-based smoothing and Clarke's subdifferential calculus, a randomized smoothing-enabled zeroth-order federated learning method is devised, with communication and iteration complexity guarantees for computing an approximate Clarke stationary point; a unifying randomized implicit zeroth-order FL framework with explicit communication and iteration complexities is also designed.
Results: By using delays during local steps to skip calls to the inexact lower-level FL oracle, the method significantly reduces communication overhead when addressing hierarchical problems; experiments validate the theory on nonsmooth and hierarchical machine learning problems.

Federated learning (FL) has emerged as an enabling framework for communication-efficient decentralized training. We study three broadly applicable problem classes in FL: (i) Nondifferentiable nonconvex optimization; (ii) Federated bilevel optimization; (iii) Federated minimax problems. Notably, in an implicit sense, both (ii) and (iii) are instances of (i). However, these hierarchical problems are often complicated by the absence of a closed-form expression for the implicit objective function. Unfortunately, research on these problems has been limited and afflicted by reliance on strong assumptions, including the need for differentiability and L-smoothness of the implicit function. We address this shortcoming by making the following contributions. In (i), by leveraging convolution-based smoothing and Clarke’s subdifferential calculus, we devise a randomized smoothing-enabled zeroth-order FL method and derive communication and iteration complexity guarantees for computing an approximate Clarke stationary point. To contend with (ii) and (iii), we devise a unifying randomized implicit zeroth-order FL framework, equipped with explicit communication and iteration complexities. Importantly, our method utilizes delays during local steps to skip calling the inexact lower-level FL oracle. This results in significant reduction in communication overhead when addressing hierarchical problems. We empirically validate the theory on nonsmooth and hierarchical ML problems.
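
The zeroth-order building block here is the standard two-point randomized-smoothing gradient estimator; a generic sketch is below (the federated wrapper and the Clarke-stationarity analysis are not shown).

```python
import numpy as np

def zeroth_order_grad(f, x, delta=1e-2, n_samples=20, rng=None):
    """Two-point randomized smoothing estimator of the gradient of the
    smoothed function f_delta at x:
        g = (d / (2 * delta)) * (f(x + delta*u) - f(x - delta*u)) * u,
    averaged over random unit directions u."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = x.size
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
    return g / n_samples

# works even for a nondifferentiable objective, e.g. f(x) = ||x||_1
f = lambda x: np.abs(x).sum()
x = np.array([0.5, -1.0, 2.0])
print(zeroth_order_grad(f, x))   # approaches sign(x) = [1, -1, 1]
```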

Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Lingjiong Zhu Mert Gurbuzbalaban Anant Raj Umut Simsekli



Research question: How can Wasserstein stability bounds be proved for stochastic optimization algorithms?
Motivation: Existing stability bounds require a different proof technique and different mathematical tools for each case, lacking unification.
Method: Through a novel connection between learning theory and applied probability, a unified guideline is introduced for proving Wasserstein stability bounds for stochastic optimization algorithms.
Results: The approach is successfully applied to stochastic gradient descent (SGD), yielding time-uniform stability bounds for strongly convex losses and for non-convex losses with additive noise; it further extends to other popular optimizers and shows that, without additional injected noise, ergodicity is required to obtain time-uniform bounds.

Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and non-convex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalizable to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds -- which might not be achieved for convex or non-convex losses unless additional noise is injected to the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, is novel.

What is the Inductive Bias of Flatness Regularization? A Study of Deep Matrix Factorization Models
Khashayar Gatmiry Zhiyuan Li Tengyu Ma Sashank J. Reddi Stefanie Jegelka Ching-Yao Chuang



Research question: Understand the inductive bias of solutions that minimize the trace of the Hessian in deep linear networks.
Motivation: Studies of over-parameterized neural networks show that the stochasticity of optimizers has an implicit regularization effect that minimizes the sharpness of the loss (in particular, the trace of its Hessian) over the family of zero-loss solutions. However, why and when flatness regularization leads to better generalization remains unclear.
Method: The inductive bias is studied in an important setting: learning deep linear networks from linear measurements, known as "deep matrix factorization". Under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is shown to be approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix (the product of all layer matrices).
Results: This equivalence implies that flatness regularization leads to better generalization in this setting, taking a first step toward understanding the inductive bias of minimum-Hessian-trace solutions.

Recent works on over-parameterized neural networks have shown that the stochasticity in optimizers has the implicit regularization effect of minimizing the sharpness of the loss function (in particular, the trace of its Hessian) over the family of zero-loss solutions. More explicit forms of flatness regularization also empirically improve the generalization performance. However, it remains unclear why and when flatness regularization leads to better generalization. This work takes the first step towards understanding the inductive bias of the minimum trace of the Hessian solutions in an important setting: learning deep linear networks from linear measurements, also known as \emph{deep matrix factorization}. We show that with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters (i.e., the product of all layer matrices), which in turn leads to better generalization.
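
For intuition, the quantity the result points to is easy to compute: the Schatten 1-norm (nuclear norm) of the end-to-end product of the layer matrices of a deep linear network. The depth, sizes, and random weights below are arbitrary placeholders, not tied to the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# A depth-3 linear network; the end-to-end map is the product of the layers.
layers = [rng.standard_normal((8, 8)) for _ in range(3)]
end_to_end = layers[0] @ layers[1] @ layers[2]

# Schatten 1-norm = sum of singular values (the nuclear norm).
schatten_1 = np.linalg.norm(end_to_end, ord="nuc")
print(f"Schatten 1-norm of the end-to-end matrix: {schatten_1:.3f}")
```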

Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming
Gregory Dexter Petros Drineas David Woodruff Taisuke Yasuda



Research question: How to use sketching algorithms to design low-space streaming algorithms and fast polynomial-time approximation schemes (PTAS).
Motivation: Sketching has proven to be a powerful approach for designing low-space streaming algorithms and fast PTAS.
Method: New techniques extend sketching-based approaches to the sparse dictionary learning and Euclidean k-means clustering problems.
Results: On the fast-algorithms front, a new approach yields a PTAS for k-means clustering that generalizes to the first PTAS for sparse dictionary learning; on the streaming front, new upper and lower bounds are obtained for dictionary learning and k-means clustering.

Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams with a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + dk/\epsilon)$ lower bound for $k$-means clustering, as well as an $\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the cost of a single fixed set of candidate centers.

A Unified Approach for Maximizing Continuous DR-submodular Functions
Mohammad Pedramfar Christopher John Quinn Vaneet Aggarwal



Research question: A unified approach to maximizing continuous DR-submodular functions that covers a range of settings and oracle types.
Motivation: Existing methods are limited or deficient when handling monotone versus non-monotone functions, different convex-set constraints, and deterministic versus stochastic oracles.
Method: A Frank-Wolfe type offline algorithm for both monotone and non-monotone functions, considering both gradient and function-value oracles, each either deterministic or stochastic.
Results: Of the sixteen cases considered, the approach gives new or improved results in nine, avoids computationally expensive projections in three, and matches state-of-the-art performance in the remaining four. Notably, for the stochastic function-value oracle it enables the first regret bounds with bandit feedback for stochastic DR-submodular functions.

This paper presents a unified approach for maximizing continuous DR-submodular functions that encompasses a range of settings and oracle access types. Our approach includes a Frank-Wolfe type offline algorithm for both monotone and non-monotone functions, with different restrictions on the general convex set. We consider settings where the oracle provides access to either the gradient of the function or only the function value, and where the oracle access is either deterministic or stochastic. We determine the number of required oracle accesses in all cases. Our approach gives new/improved results for nine out of the sixteen considered cases, avoids computationally expensive projections in three cases, with the proposed framework matching performance of state-of-the-art approaches in the remaining four cases. Notably, our approach for the stochastic function value-based oracle enables the first regret bounds with bandit feedback for stochastic DR-submodular functions.
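
To make the Frank-Wolfe template concrete, below is a sketch of the classic continuous-greedy variant for a monotone DR-submodular surrogate (a probabilistic-coverage objective) over a down-closed box with a budget constraint, using an exact gradient oracle. The objective, constraint, and step count are illustrative assumptions; the paper's unified algorithm additionally handles non-monotone functions and stochastic or value-only oracles.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, budget = 6, 10, 2
P = rng.uniform(0.0, 0.5, size=(d, m))   # illustrative coverage probabilities
w = rng.uniform(size=m)

def grad_F(x):
    # F(x) = sum_j w_j * (1 - prod_i (1 - P[i,j] * x_i)) is monotone DR-submodular.
    q = 1.0 - P * x[:, None]              # (d, m) per-item factors, bounded away from 0
    prod_all = q.prod(axis=0)
    prod_wo_i = prod_all / q              # leave-one-out products
    return (w * P * prod_wo_i).sum(axis=1)

K = 100
x = np.zeros(d)
for _ in range(K):                        # continuous-greedy Frank-Wolfe
    g = grad_F(x)
    v = np.zeros(d)                       # LMO over {0 <= v <= 1, sum v <= budget}
    for i in np.argsort(-g)[:budget]:
        if g[i] > 0:
            v[i] = 1.0
    x += v / K                            # small step toward the LMO vertex
print("fractional solution:", np.round(x, 2))
```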

Polynomial-Time Linear-Swap Regret Minimization in Imperfect-Information Sequential Games
Gabriele Farina Charilaos Pipis



Research question: Understand the strongest notion of rationality that can be attained efficiently in the worst case in sequential games.
Motivation: The notions of rationality achieved by existing regret-minimizing learners in sequential games leave room for improvement.
Method: Introduce the notion of no-linear-swap regret and prove the existence of an efficiently approachable subset of extensive-form correlated equilibria, called linear-deviation correlated equilibria.
Results: This notion is as strong as no-swap regret in nonsequential games and stronger than no-trigger regret in sequential games.

No-regret learners seek to minimize the difference between the loss they cumulated through the actions they played, and the loss they would have cumulated in hindsight had they consistently modified their behavior according to some strategy transformation function. The size of the set of transformations considered by the learner determines a natural notion of rationality. As the set of transformations each learner considers grows, the strategies played by the learners recover more complex game-theoretic equilibria, including correlated equilibria in normal-form games and extensive-form correlated equilibria in extensive-form games. At the extreme, a no-swap-regret agent is one that minimizes regret against the set of all functions from the set of strategies to itself. While it is known that the no-swap-regret condition can be attained efficiently in nonsequential (normal-form) games, understanding what is the strongest notion of rationality that can be attained efficiently in the worst case in sequential (extensive-form) games is a longstanding open problem. In this paper we provide a positive result, by showing that it is possible, in any sequential game, to retain polynomial-time (in the game tree size) iterations while achieving sublinear regret with respect to all linear transformations of the mixed strategy space, a notion called no-linear-swap regret. This notion of hindsight rationality is as strong as no-swap-regret in nonsequential games, and stronger than no-trigger-regret in sequential games—thereby proving the existence of a subset of extensive-form correlated equilibria robust to linear deviations, which we call linear-deviation correlated equilibria, that can be approached efficiently.

Faster Query Times for Fully Dynamic $k$-Center Clustering with Outliers
Leyla Biabani Annika Hennes Morteza Monemizadeh Melanie Schmidt



Research question: Given a point set P in a metric space and numbers k, z, find a set C* of k points that minimizes the maximum distance of all but at most z outlier points of P to their nearest center in C*.
Motivation: Study the problem in the fully dynamic model, i.e., under insertions and deletions of points, for metric spaces of bounded doubling dimension.
Method: A hierarchical data structure maintains the points and their neighborhoods, enabling clusters to be found efficiently. In particular, the data structure can be queried at any time to produce a (3+ε)-approximate solution for given values of k and z.
Results: Compared with the current state-of-the-art by Pellizzoni, Pietracaprina, and Pucci, which uses $\varepsilon^{-O(dim)}(k+z)^2\log{\Delta}$ query time for a (3+ε)-approximation, the method achieves a significantly faster query time with respect to k and z.

Given a point set $P\subseteq M$ from a metric space $(M,d)$ and numbers $k, z \in N$, the *metric $k$-center problem with $z$ outliers* is to find a set $C^\ast\subseteq P$ of $k$ points such that the maximum distance of all but at most $z$ outlier points of $P$ to their nearest center in ${C}^\ast$ is minimized. We consider this problem in the fully dynamic model, i.e., under insertions and deletions of points, for the case that the metric space has a bounded doubling dimension $dim$. We utilize a hierarchical data structure to maintain the points and their neighborhoods, which enables us to efficiently find the clusters. In particular, our data structure can be queried at any time to generate a $(3+\varepsilon)$-approximate solution for input values of $k$ and $z$ in worst-case query time $\varepsilon^{-O(dim)}k \log{n} \log\log{\Delta}$, where $\Delta$ is the ratio between the maximum and minimum distance between two points in $P$. Moreover, it allows insertion/deletion of a point in worst-case update time $\varepsilon^{-O(dim)}\log{n}\log{\Delta}$. Our result achieves a significantly faster query time with respect to $k$ and $z$ than the current state-of-the-art by Pellizzoni, Pietracaprina, and Pucci, which uses $\varepsilon^{-O(dim)}(k+z)^2\log{\Delta}$ query time to obtain a $(3+\varepsilon)$-approximation.

Transportability for Bandits with Data from Different Environments
Alexis Bellot Alan Malek Silvia Chiappa



Research question: How to efficiently optimize an intelligent agent's policy based on the available prior knowledge of the problem and the actions it can take to learn more.
Motivation: Most methods rely solely on an agent's experimentation in a single environment (or multiple closely related environments). This work relaxes that assumption and designs bandit algorithms from a combination of batch data and qualitative assumptions about the relatedness of different environments, expressed as causal models.
Method: Exploit invariances across environments, wherever they occur in the underlying causal model, to consistently improve learning.
Results: The resulting bandit algorithm has a sub-linear regret bound with an explicit dependency on a term capturing how informative related environments are for the task at hand, and can have substantially lower regret than experimentation-only bandit instances.

A unifying theme in the design of intelligent agents is to efficiently optimize a policy based on what prior knowledge of the problem is available and what actions can be taken to learn more about it. Bandits are a canonical instance of this task that has been intensely studied in the literature. Most methods, however, typically rely solely on an agent's experimentation in a single environment (or multiple closely related environments). In this paper, we relax this assumption and consider the design of bandit algorithms from a combination of batch data and qualitative assumptions about the relatedness across different environments, represented in the form of causal models. In particular, we show that it is possible to exploit invariances across environments, wherever they may occur in the underlying causal model, to consistently improve learning. The resulting bandit algorithm has a sub-linear regret bound with an explicit dependency on a term that captures how informative related environments are for the task at hand; and may have substantially lower regret than experimentation-only bandit instances.

FIRAL: An Active Learning Algorithm for Multinomial Logistic Regression
Youguang Chen George Biros



Research question: Theory and algorithms for pool-based active learning for multiclass classification using multinomial logistic regression.
Motivation: To control the excess risk under finite samples, active learning is guided by the Fisher Information Ratio (FIR).
Method: A finite-sample analysis proves that the FIR lower and upper bounds the excess risk, and based on this, an active learning algorithm is proposed that employs regret minimization to minimize the FIR.
Results: The derived risk bounds are validated on synthetic datasets; compared with five other methods on MNIST, CIFAR-10, and 50-class ImageNet, the proposed method performs best, consistently producing the smallest classification error.

We investigate theory and algorithms for pool-based active learning for multiclass classification using multinomial logistic regression. Using finite sample analysis, we prove that the Fisher Information Ratio (FIR) lower and upper bounds the excess risk. Based on our theoretical analysis, we propose an active learning algorithm that employs regret minimization to minimize the FIR. To verify our derived excess risk bounds, we conduct experiments on synthetic datasets. Furthermore, we compare FIRAL with five other methods and find that our scheme outperforms them: it consistently produces the smallest classification error in the multiclass logistic regression setting, as demonstrated through experiments on MNIST, CIFAR-10, and 50-class ImageNet.

A Unified Model and Dimension for Interactive Estimation
Nataly Brukhim Miroslav Dudík Aldo Pacchiano Robert E. Schapire



Research question: An abstract framework for interactive learning in which the goal is to estimate a target from its "similarity" to points queried by the learner.
Motivation: Classical models such as statistical-query learning and structured bandits have limitations for complex tasks, motivating a new interactive estimation framework.
Method: Introduce a new combinatorial measure, the Dissimilarity dimension, which largely captures learnability in the model, together with a simple, general, and broadly applicable algorithm that admits regret and PAC generalization bounds polynomial in the new dimension.
Results: The framework is shown to subsume statistical-query learning and structured bandits, and in some cases the Dissimilarity dimension yields significantly improved analyses.

We study an abstract framework for interactive learning called interactive estimation in which the goal is to estimate a target from its ``similarity'' to points queried by the learner. We introduce a combinatorial measure called Dissimilarity dimension which largely captures learnability in our model. We present a simple, general, and broadly-applicable algorithm, for which we obtain both regret and PAC generalization bounds that are polynomial in the new dimension. We show that our framework subsumes and thereby unifies two classic learning models: statistical-query learning and structured bandits. We also delineate how the Dissimilarity dimension is related to well-known parameters for both frameworks, in some cases yielding significantly improved analyses.

A Robust Exact Algorithm for the Euclidean Bipartite Matching Problem
Akshaykumar G Gattani Sharath Raghvendra Pouyan Shirzadian



Research question: How to use minimum-cost bipartite matching algorithms to estimate the Wasserstein distance between two distributions.
Motivation: For 2-dimensional point sets in Euclidean space, a fast implementation of the Hungarian method computes a minimum-cost matching quickly, and the Wasserstein distance is an important measure of the discrepancy between two distributions.
Method: A new algorithm computes a minimum-cost matching in $\tilde{O}(n^{2-\frac{1}{2d}}\Phi(n))$ time for stochastic point sets, where $d$ is the dimension and $\Phi(n)$ is the query/update time of a dynamic weighted nearest neighbor data structure.
Results: The algorithm is the first to achieve a sub-quadratic expected running time, $\tilde{O}(n^{7/4}\log \Delta)$ in two dimensions, for stochastic point sets with real-valued coordinates, and it extends to arbitrary dimension.

Algorithms for the minimum-cost bipartite matching can be used to estimate Wasserstein distance between two distributions. Given two sets $A$ and $B$ of $n$ points in a $2$-dimensional Euclidean space, one can use a fast implementation of the Hungarian method to compute a minimum-cost bipartite matching of $A$ and $B$ in $\tilde{O}(n^2)$ time. Let $\Delta$ be the spread, i.e., the ratio of the distance of the farthest to the closest pair of points in $A\cup B$. In this paper, we present a new algorithm to compute a minimum-cost bipartite matching of $A$ and $B$ with a similar worst-case execution time of $\tilde{O}(n^2 \log \Delta)$. However, when $A$ and $B$ are drawn independently and identically from a fixed distribution that is not known to the algorithm, the execution time of our algorithm is, in expectation, $\tilde{O}(n^{7/4}\log \Delta)$. To the best of our knowledge, our algorithm is the first one to achieve a sub-quadratic execution time even for stochastic point sets with real-valued coordinates. Our algorithm extends to any dimension $d$, where it runs in $\tilde{O}(n^{2-\frac{1}{2d}}\Phi(n))$ time for stochastic point sets $A$ and $B$; here $\Phi(n)$ is the query/update time of a dynamic weighted nearest neighbor data structure. Our algorithm can be seen as a careful adaptation of the Hungarian method in the geometric divide-and-conquer framework.
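
The baseline computation the paper accelerates can be reproduced in a few lines: estimate the 1-Wasserstein distance between two equal-size empirical point sets by solving the min-cost bipartite matching exactly with an off-the-shelf Hungarian solver. The point sets below are arbitrary; this is the generic quadratic-time exact approach, not the paper's divide-and-conquer algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d = 200, 2
A = rng.standard_normal((n, d))           # samples from distribution 1
B = rng.standard_normal((n, d)) + 1.0     # samples from a shifted distribution 2

cost = cdist(A, B)                        # pairwise Euclidean distances
row, col = linear_sum_assignment(cost)    # Hungarian method: min-cost matching
w1_estimate = cost[row, col].mean()       # empirical 1-Wasserstein estimate
print(f"estimated W1 distance: {w1_estimate:.3f}")
```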

Quantum speedups for stochastic optimization
Aaron Sidford Chenyi Zhang



Research question: Minimizing a continuous function given access to a natural quantum generalization of a stochastic gradient oracle.
Motivation: It is unknown whether quantum access to stochastic gradients allows dimension-accuracy trade-offs that are unachievable classically for stochastic optimization.
Method: Two new methods are provided for the special case of minimizing a Lipschitz convex function, and quantum algorithms are given for computing a critical point of a smooth non-convex function. The results build on the quantum multivariate mean estimation result of Cornelissen et al. and a general quantum variance-reduction technique of independent interest.
Results: Each method obtains a dimension-versus-accuracy trade-off that is provably unachievable classically, one of which is asymptotically optimal in low-dimensional settings; the non-convex rates are not known to be achievable classically.

We consider the problem of minimizing a continuous function given access to a natural quantum generalization of a stochastic gradient oracle. We provide two new methods for the special case of minimizing a Lipschitz convex function. Each method obtains a dimension versus accuracy trade-off which is provably unachievable classically and we prove that one method is asymptotically optimal in low-dimensional settings. Additionally, we provide quantum algorithms for computing a critical point of a smooth non-convex function at rates not known to be achievable classically. To obtain these results we build upon the quantum multivariate mean estimation result of Cornelissen et al. and provide a general quantum variance reduction technique of independent interest.

Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization
Mahyar Fazlyab Taha Entesari Aniket Roy Rama Chellappa



Research question: How to improve the robustness of deep classifiers against adversarial perturbations?
Motivation: Existing approaches may be ineffective at increasing the margin in the input space, calling for a new method.
Method: A differentiable regularizer is proposed that lower-bounds the distance of data points to the classification boundary. The method requires knowledge of the model's Lipschitz constant along certain directions, for which a scalable method is developed to compute guaranteed differentiable upper bounds on the Lipschitz constants of neural networks.
Results: Experiments on MNIST, CIFAR-10, and Tiny-ImageNet show competitively improved results compared with the state of the art.

To improve the robustness of deep classifiers against adversarial perturbations, many approaches have been proposed, such as designing new architectures with better robustness properties (e.g., Lipschitz-capped networks), or modifying the training process itself (e.g., min-max optimization, constrained learning, or regularization). These approaches, however, might not be effective at increasing the margin in the input (feature) space. In this paper, we propose a differentiable regularizer that is a lower bound on the distance of the data points to the classification boundary. The proposed regularizer requires knowledge of the model's Lipschitz constant along certain directions. To this end, we develop a scalable method for calculating guaranteed differentiable upper bounds on the Lipschitz constant of neural networks accurately and efficiently. The relative accuracy of the bounds prevents excessive regularization and allows for more direct manipulation of the decision boundary. Furthermore, our Lipschitz bounding algorithm exploits the monotonicity and Lipschitz continuity of the activation layers, and the resulting bounds can be used to design new layers with controllable bounds on their Lipschitz constant. Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet data sets verify that our proposed algorithm obtains competitively improved results compared to the state-of-the-art.
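
As background for the bounding step, the crudest Lipschitz upper bound for a feedforward network with 1-Lipschitz activations is the product of per-layer spectral norms, each obtainable by power iteration. The sketch below shows only this naive product bound under assumed random weights; the paper's bounds are tighter and differentiable, exploiting the monotonicity and Lipschitz continuity of the activations.

```python
import numpy as np

def spectral_norm(W, iters=100):
    """Largest singular value of W via power iteration."""
    v = np.random.default_rng(0).standard_normal(W.shape[1])
    for _ in range(iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

rng = np.random.default_rng(1)
weights = [rng.standard_normal((32, 16)), rng.standard_normal((16, 8))]

# With 1-Lipschitz activations (e.g. ReLU), the network's Lipschitz constant
# is at most the product of the layers' spectral norms (a loose upper bound).
naive_bound = np.prod([spectral_norm(W) for W in weights])
print(f"naive Lipschitz upper bound: {naive_bound:.3f}")
```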

Optimal Rates for Bandit Nonstochastic Control
Y. Jennifer Sun Stephen Newman Elad Hazan



Research question: Foundational and extensively studied problems in optimal control: the Linear Quadratic Regulator (LQR) and Linear Quadratic Gaussian (LQG) control.
Motivation: The authors study LQR and LQG problems with semi-adversarial perturbations and time-varying adversarial bandit loss functions, seeking to resolve the open question of whether a tight $\sqrt{T}$ rate is achievable.
Method: A new scheme for bandit convex optimization with memory, which is the central component of the approach.
Results: The algorithm attains optimal regret (up to logarithmic factors) for bandit LQR and LQG, answering the open question in the affirmative.

Linear Quadratic Regulator (LQR) and Linear Quadratic Gaussian (LQG) control are foundational and extensively researched problems in optimal control. We investigate LQR and LQG problems with semi-adversarial perturbations and time-varying adversarial bandit loss functions. The best-known sublinear regret algorithm~\cite{gradu2020non} has a $T^{\frac{3}{4}}$ time horizon dependence, and its authors posed an open question about whether a tight rate of $\sqrt{T}$ could be achieved. We answer in the affirmative, giving an algorithm for bandit LQR and LQG which attains optimal regret, up to logarithmic factors. A central component of our method is a new scheme for bandit convex optimization with memory, which is of independent interest.

Federated Linear Bandits with Finite Adversarial Actions
Li Fan Ruida Zhou Chao Tian Cong Shen



Research question: A federated linear bandits model in which M clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may differ across clients.
Motivation: To address the unique challenges of finite adversarial action sets, the FedSupLinUCB algorithm is proposed, extending the principles of the SupLinUCB and OFUL algorithms for linear contextual bandits.
Method: FedSupLinUCB is proven to achieve a total regret of $\tilde{O}(\sqrt{dT})$, where T is the total number of arm pulls across all clients and d is the ambient dimension of the linear model. This matches the minimax lower bound and is thus order-optimal (up to polylog terms). Both asynchronous and synchronous settings are studied, with communication costs controlled as $O(dM^2\log(d)\log(T))$ and $O(\sqrt{d^3M^3}\log(d))$, respectively.
Results: The design further extends to two scenarios: (1) variance-adaptive, achieving total regret $\tilde{O}(\sqrt{d\sum_{t=1}^{T}\sigma_t^2})$, where $\sigma_t^2$ is the noise variance at round t; and (2) adversarial corruption, achieving total regret $\tilde{O}(\sqrt{dT}+dC_p)$, where $C_p$ is the total corruption budget. Experiments corroborate the theoretical analysis on synthetic and real-world datasets.

We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of **adversarial finite** action sets, we propose the FedSupLinUCB algorithm, which extends the principles of SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{d T})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(d M^2 \log(d)\log(T))$ and $O(\sqrt{d^3 M^3} \log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O} (\sqrt{d \sum \nolimits_{t=1}^{T} \sigma_t^2})$ can be achieved with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + d C_p)$ can be achieved with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.

Computing Optimal Equilibria and Mechanisms via Learning in Zero-Sum Extensive-Form Games
Brian Hu Zhang Gabriele Farina Ioannis Anagnostides Federico Cacciamani Stephen Marcus McAleer Andreas Alexander Haupt Andrea Celli Nicola Gatti Vincent Conitzer Tuomas Sandholm



Research question: Computing optimal equilibria via learning in games.
Motivation: Existing methods cannot compute optimal equilibria efficiently; the key observation is that optimal equilibria are minimax equilibrium strategies of a player in an extensive-form zero-sum game.
Method: Reformulate the computation of optimal equilibria as minimax equilibrium strategies of a zero-sum game and apply learning techniques for zero-sum games, yielding the first learning dynamics that converge to optimal equilibria.
Results: State-of-the-art performance on benchmark tabular games, and an optimal mechanism for a sequential auction design problem computed via deep reinforcement learning, demonstrating the practical scalability and flexibility of the approach.

We introduce a new approach for computing optimal equilibria via learning in games. It applies to extensive-form settings with any number of players, including mechanism design, information design, and solution concepts such as correlated, communication, and certification equilibria. We observe that optimal equilibria are minimax equilibrium strategies of a player in an extensive-form zero-sum game. This reformulation allows us to apply techniques for learning in zero-sum games, yielding the first learning dynamics that converge to optimal equilibria, not only in empirical averages, but also in iterates. We demonstrate the practical scalability and flexibility of our approach by attaining state-of-the-art performance in benchmark tabular games, and by computing an optimal mechanism for a sequential auction design problem using deep reinforcement learning.

Double Randomized Underdamped Langevin with Dimension-Independent Convergence Guarantee
Yuanshi Liu Cong Fang Tong Zhang



Research question: High-dimensional sampling of log-concave distributions with composite structure.
Motivation: A double randomization technique is developed to obtain a fast underdamped Langevin algorithm with a dimension-independent convergence guarantee.
Method: Using this double randomization technique, a fast underdamped Langevin algorithm is obtained with an overall iteration complexity of $\tilde{\mathcal{O}}\left(\frac{\left(\mathrm{tr}(H)\right)^{1/3}}{\epsilon^{2/3}}\right)$, where $H$ upper-bounds the Hessian matrices of $f$ and does not explicitly depend on the dimension $d$.
Results: For posterior sampling over linear models with normalized data, the convergence rate is dimension-free and outperforms the previous best-known results by a factor of $d^{1/3}$. The analysis yields a faster convergence rate and brings new insight into high-dimensional sampling.

This paper focuses on the high-dimensional sampling of log-concave distributions with composite structures: $p^*(\mathrm{d}x)\propto \exp(-g(x)-f(x))\mathrm{d}x$. We develop a double randomization technique, which leads to a fast underdamped Langevin algorithm with a dimension-independent convergence guarantee. We prove that the algorithm enjoys an overall $\tilde{\mathcal{O}}\left(\frac{\left(\mathrm{tr}(H)\right)^{1/3}}{\epsilon^{2/3}}\right)$ iteration complexity to reach an $\epsilon$-tolerated sample whose distribution $p$ admits $W_2(p,p^*)\leq \epsilon$. Here, $H$ is an upper bound of the Hessian matrices for $f$ and does not explicitly depend on dimension $d$. For the posterior sampling over linear models with normalized data, we show a clear superiority of convergence rate which is dimension-free and outperforms the previous best-known results by a $d^{1/3}$ factor. The analysis to achieve a faster convergence rate brings new insights into high-dimensional sampling.
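
For orientation, a textbook Euler-Maruyama discretization of underdamped Langevin dynamics targeting a standard Gaussian is sketched below. The step size, friction coefficient, and target are placeholders; the paper's double randomization and the resulting complexity guarantee are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, gamma, steps = 10, 0.01, 2.0, 20000

grad_U = lambda x: x        # potential U(x) = ||x||^2/2, i.e. standard Gaussian target

x, v = rng.standard_normal(d), np.zeros(d)
samples = []
for t in range(steps):      # Euler-Maruyama for dv = (-gamma v - grad U) dt + sqrt(2 gamma) dB
    v += (-gamma * v - grad_U(x)) * h + np.sqrt(2.0 * gamma * h) * rng.standard_normal(d)
    x += v * h
    if t > steps // 2:      # discard burn-in
        samples.append(x.copy())

samples = np.asarray(samples)
print("per-coordinate variance (target: 1):", samples.var(axis=0).mean().round(2))
```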

Projection-Free Online Convex Optimization via Efficient Newton Iterations
Khashayar Gatmiry Zakaria Mhammedi



Research question: A new projection-free algorithm for online convex optimization (OCO).
Motivation: Classical OCO algorithms must perform Euclidean projections onto the convex set to keep their iterates feasible, while Frank-Wolfe-based alternatives trade expensive Euclidean projections for linear optimization over $\mathcal{K}$ but suffer sub-optimal regret compared with projection-based algorithms.
Method: A third type of algorithm outputs approximate Newton iterates using a self-concordant barrier for the set of interest, which automatically ensures feasibility without projections.
Results: The main contribution shows how the stability of the Newton iterates can be leveraged to compute the inverse Hessian in only a vanishing fraction of the rounds, yielding an efficient projection-free OCO algorithm with a state-of-the-art regret bound.

This paper presents new projection-free algorithms for Online Convex Optimization (OCO) over a convex domain $\mathcal{K} \subset \mathbb{R}^d$. Classical OCO algorithms (such as Online Gradient Descent) typically need to perform Euclidean projections onto the convex set $\mathcal{K}$ to ensure feasibility of their iterates. Alternative algorithms, such as those based on the Frank-Wolfe method, swap potentially-expensive Euclidean projections onto $\mathcal{K}$ for linear optimization over $\mathcal{K}$. However, such algorithms have a sub-optimal regret in OCO compared to projection-based algorithms. In this paper, we look at a third type of algorithms that output approximate Newton iterates using a self-concordant barrier for the set of interest. The use of a self-concordant barrier automatically ensures feasibility without the need for projections. However, the computation of the Newton iterates requires a matrix inverse, which can still be expensive. As our main contribution, we show how the stability of the Newton iterates can be leveraged to compute the inverse Hessian in only a vanishing fraction of the rounds, leading to a new efficient projection-free OCO algorithm with a state-of-the-art regret bound.

On the Size and Approximation Error of Distilled Datasets
Alaa Maalouf Murad Tukan Noel Loo Ramin Hasani Mathias Lechner Daniela Rus



Research question: The theoretical limits and guarantees of dataset distillation: how large is the excess risk achieved by distillation compared with the original uncompressed dataset, and how large must distilled datasets be?
Motivation: Despite significant empirical progress in recent years, the theoretical understanding of dataset distillation remains limited.
Method: A theoretical view of kernel ridge regression (KRR) based distillation methods such as Kernel Inducing Points. By transforming ridge regression to random Fourier features (RFF) space, the first proof is given of the existence of small distilled datasets and their corresponding excess risk for shift-invariant kernels.
Results: A small set of instances exists in the original input space whose solution in RFF space coincides with the solution of the original data, and this distilled set can generate a KRR solution approximating the KRR solution optimized on the full input data. The size of this set is linear in the dimension of the RFF space of the input set, or near-linear in the number of effective degrees of freedom, which is a function of the kernel, the number of data points, and the regularization parameter λ; the error bound of the distilled set is also a function of λ. The bounds are verified analytically and empirically.

Dataset Distillation is the task of synthesizing small datasets from large ones while still retaining comparable predictive accuracy to the original uncompressed dataset. Despite significant empirical progress in recent years, there is little understanding of the theoretical limitations/guarantees of dataset distillation, specifically, what excess risk is achieved by distillation compared to the original dataset, and how large are distilled datasets? In this work, we take a theoretical view on kernel ridge regression (KRR) based methods of dataset distillation such as Kernel Inducing Points. By transforming ridge regression in random Fourier features (RFF) space, we provide the first proof of the existence of small (size) distilled datasets and their corresponding excess risk for shift-invariant kernels. We prove that a small set of instances exists in the original input space such that its solution in the RFF space coincides with the solution of the original data. We further show that a KRR solution can be generated using this distilled set of instances which gives an approximation towards the KRR solution optimized on the full input data. The size of this set is linear in the dimension of the RFF space of the input set or alternatively near linear in the number of effective degrees of freedom, which is a function of the kernel, number of data points, and the regularization parameter $\lambda$. The error bound of this distilled set is also a function of $\lambda$. We verify our bounds analytically and empirically.
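
A compact sketch of the RFF-space ridge regression the analysis runs through: random Fourier features approximating an RBF kernel, followed by a ridge solve. The dimensions, bandwidth, and toy target below are arbitrary choices, not the paper's construction of a distilled set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, lam, sigma = 500, 5, 200, 1e-2, 1.0

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy regression target

# Random Fourier features for the RBF kernel k(x,x') = exp(-||x-x'||^2 / (2 sigma^2)).
W = rng.standard_normal((d, D)) / sigma
b = rng.uniform(0, 2 * np.pi, D)
phi = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)

Z = phi(X)
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)  # ridge solve in RFF space

X_test = rng.standard_normal((200, d))
test_mse = np.mean((phi(X_test) @ theta - np.sin(X_test[:, 0])) ** 2)
print("test MSE against the noiseless target:", round(float(test_mse), 4))
```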

Cascading Bandits: Optimizing Recommendation Frequency in Delayed Feedback Environments
Dairui Wang Junyu Cao Yan Zhang Wei Qi



Research question: Delayed feedback is a critical problem in dynamic recommender systems, where the feedback often depends on the frequency of recommendation.
Motivation: Most existing online learning literature does not consider optimizing the recommendation frequency and treats the reward of every successfully recommended message as equal.
Method: A novel cascading bandits setting in which individual messages from a selected list are sent to a user periodically. Whenever a user dislikes a message, she may abandon the system with a probability positively correlated with the recommendation frequency. The learning agent must learn both the underlying message attraction probabilities and the users' abandonment probabilities through randomly delayed feedback.
Results: A dynamic programming solution finds the optimal message sequence in deterministic scenarios where the reward may vary across messages; a polynomial-time UCB-based offline learning algorithm is proposed, with its performance characterized through a regret bound; and for the online setting, a learning algorithm allows adaptive content for a given user. Numerical experiments on the AmEx dataset confirm the effectiveness of the algorithms.

Delayed feedback is a critical problem in dynamic recommender systems. In practice, the feedback result often depends on the frequency of recommendation. Most existing online learning literature fails to consider optimization of the recommendation frequency, and regards the reward from each successfully recommended message to be equal. In this paper, we consider a novel cascading bandits setting, where individual messages from a selected list are sent to a user periodically. Whenever a user does not like a message, she may abandon the system with a probability positively correlated with the recommendation frequency. A learning agent needs to learn both the underlying message attraction probabilities and users' abandonment probabilities through the randomly delayed feedback. We first show a dynamic programming solution to finding the optimal message sequence in deterministic scenarios, in which the reward is allowed to vary with different messages. Then we propose a polynomial time UCB-based offline learning algorithm, and discuss its performance by characterizing its regret bound. For the online setting, we propose a learning algorithm which allows adaptive content for a given user. Numerical experiments on the AmEx dataset confirm the effectiveness of our algorithms.

The Bayesian Stability Zoo
Shay Moran Hilla Schefler Jonathan Shafer



Research question: Show that many definitions of stability found in the learning theory literature are equivalent to one another, and establish correspondences among them.
Motivation: A more systematic taxonomy of stability notions in learning theory is needed to promote clarity and an improved understanding of the array of stability concepts that have emerged in recent years.
Method: By distinguishing two families of definitions, distribution-dependent and distribution-independent Bayesian stability, equivalences are established among various definitions, encompassing approximate differential privacy, pure differential privacy, replicability, global stability, perfect generalization, TV stability, mutual information stability, KL-divergence stability, and Rényi-divergence stability.
Results: Boosting results are proven that amplify the stability of a learning rule, providing a step toward a more systematic taxonomy of stability notions in learning theory.

We show that many definitions of stability found in the learning theory literature are equivalent to one another. We distinguish between two families of definitions of stability: distribution-dependent and distribution-independent Bayesian stability. Within each family, we establish equivalences between various definitions, encompassing approximate differential privacy, pure differential privacy, replicability, global stability, perfect generalization, TV stability, mutual information stability, KL-divergence stability, and Rényi-divergence stability. Along the way, we prove boosting results that enable the amplification of the stability of a learning rule. This work is a step towards a more systematic taxonomy of stability notions in learning theory, which can promote clarity and an improved understanding of an array of stability concepts that have emerged in recent years.

Optimistic Meta-Gradients
Sebastian Flennerhag Tom Zahavy Brendan O'Donoghue Hado van Hasselt András György Satinder Singh



Research question: The connection between gradient-based meta-learning and convex optimisation.
Motivation: Gradient descent with momentum is observed to be a special case of meta-gradients, and building on recent results in optimisation, convergence rates are proven for meta-learning in the single-task setting.
Method: Optimism in meta-learning is shown to be captured by the recently proposed Bootstrapped Meta-Gradient (Flennerhag et al., 2022) method, providing deeper insight into its underlying mechanics.
Results: While a meta-learned update rule can yield faster convergence up to a constant factor, it is not sufficient for acceleration; some form of optimism is required.

We study the connection between gradient-based meta-learning and convex optimisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single-task setting. While a meta-learned update rule can yield faster convergence up to a constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through the recently proposed Bootstrapped Meta-Gradient (Flennerhag et al., 2022) method, providing deeper insight into its underlying mechanics.

Learning Mixtures of Gaussians Using the DDPM Objective
Kulin Shah Sitan Chen Adam Klivans



Research question: Diffusion models can learn essentially any distribution given score estimation, yet it remains poorly understood when score estimation is possible and when gradient-based algorithms can provably succeed.
Motivation: To address this, the first provably efficient results are given for one of the most fundamental distribution families, Gaussian mixture models.
Method: Gradient descent on the denoising diffusion probabilistic model (DDPM) objective is analyzed, both from random initialization and from a warm start, and proven effective.
Results: GD provably and efficiently recovers the ground-truth parameters of the mixture in both settings: mixtures of two spherical Gaussians with $1/\text{poly}(d)$-separated centers from random initialization, and mixtures of $K$ spherical Gaussians with $\Omega(\sqrt{\log(\min(K,d))})$-separated centers from a warm start.

Recent works have shown that diffusion models can learn essentially any distribution provided one can perform score estimation. Yet it remains poorly understood under what settings score estimation is possible, let alone when practical gradient-based algorithms for this task can provably succeed. In this work, we give the first provably efficient results for one of the most fundamental distribution families, Gaussian mixture models. We prove that GD on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground truth parameters of the mixture model in the following two settings: 1. We show GD with random initialization learns mixtures of two spherical Gaussians in $d$ dimensions with $1/\text{poly}(d)$-separated centers. 2. We show GD with a warm start learns mixtures of $K$ spherical Gaussians with $\Omega(\sqrt{\log(\min(K,d))})$-separated centers. A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning, EM and spectral methods.
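
To ground setting 1 (random initialization, two symmetric spherical Gaussians), the toy below runs plain gradient descent on a denoising-score-matching loss whose model score is the exact score of $0.5\,N(\theta,(1+\sigma^2)I)+0.5\,N(-\theta,(1+\sigma^2)I)$. The separation, noise level, learning rate, and the finite-difference gradients are all illustrative assumptions, and the sign symmetry of the mixture means GD may recover $-\theta^*$; none of this reproduces the paper's algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 4000, 0.5
theta_star = 3.0 * np.ones(d) / np.sqrt(d)           # true mixture mean (illustrative)

signs = rng.choice([-1.0, 1.0], size=n)
x0 = signs[:, None] * theta_star + rng.standard_normal((n, d))
xt = x0 + sigma * rng.standard_normal((n, d))        # DDPM-style noised samples

s_var = 1.0 + sigma**2                               # variance of the noised mixture

def model_score(theta, x):
    # Exact score of 0.5 N(theta, s_var I) + 0.5 N(-theta, s_var I).
    return (np.tanh(x @ theta / s_var)[:, None] * theta - x) / s_var

def dsm_loss(theta):
    # Denoising score matching: regress the model score onto -(xt - x0)/sigma^2.
    return np.mean((model_score(theta, xt) + (xt - x0) / sigma**2) ** 2)

theta = 0.1 * rng.standard_normal(d)                 # random initialization
for _ in range(500):                                 # plain GD, finite-difference grads
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = 1e-4
        g[i] = (dsm_loss(theta + e) - dsm_loss(theta - e)) / 2e-4
    theta -= 0.05 * g

cos = abs(theta @ theta_star) / (np.linalg.norm(theta) * np.linalg.norm(theta_star))
print("|cosine| with theta* (1.0 = recovered up to sign):", round(float(cos), 3))
```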

Fast and Regret Optimal Best Arm Identification: Fundamental Limits and Low-Complexity Algorithms
Qining Zhang Lei Ying



Research question: A stochastic multi-armed bandit problem with dual objectives: (i) quick identification of and commitment to the optimal arm, and (ii) reward maximization throughout a sequence of T consecutive rounds.
Motivation: Each objective is individually well studied (best arm identification for (i), regret minimization for (ii)), but their simultaneous realization remains an open problem despite its practical importance.
Method: Introduce Regret Optimal Best Arm Identification (ROBAI) together with the EOCP algorithm and its variants for both pre-determined and adaptive stopping times, and characterize lower bounds on the commitment time (equivalent to the sample complexity).
Results: EOCP and its variants achieve asymptotically optimal regret in both Gaussian and general bandits, commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with adaptive stopping time, and are sample optimal (respectively almost sample optimal) in the two regimes; numerical results also reveal an "over-exploration" phenomenon in classic UCB algorithms.

This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual objectives: (i) quick identification and commitment to the optimal arm, and (ii) reward maximization throughout a sequence of $T$ consecutive rounds. Though each objective has been individually well-studied, i.e., best arm identification for (i) and regret minimization for (ii), the simultaneous realization of both objectives remains an open problem, despite its practical importance. This paper introduces \emph{Regret Optimal Best Arm Identification} (ROBAI) which aims to achieve these dual objectives. To solve ROBAI with both pre-determined stopping time and adaptive stopping time requirements, we present an algorithm called EOCP and its variants respectively, which not only achieve asymptotic optimal regret in both Gaussian and general bandits, but also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with adaptive stopping time. We further characterize lower bounds on the commitment time (equivalent to the sample complexity) of ROBAI, showing that EOCP and its variants are sample optimal with pre-determined stopping time, and almost sample optimal with adaptive stopping time. Numerical results confirm our theoretical analysis and reveal an interesting ``over-exploration'' phenomenon carried by classic UCB algorithms, such that EOCP has smaller regret even though it stops exploration much earlier than UCB, i.e., $\mathcal{O}(\log T)$ versus $\mathcal{O}(T)$, which suggests over-exploration is unnecessary and potentially harmful to system performance.

Multiclass Boosting: Simple and Intuitive Weak Learning Criteria
Nataly Brukhim Amit Daniely Yishay Mansour Shay Moran



Research question: Generalizing boosting to the multiclass setting.
Motivation: Existing boosting methods mainly target binary classification; a clear solution for the multiclass case has been lacking.
Method: A new weak learning condition is proposed that captures the original notion of weak learnability as "slightly better than random guessing", together with a simple and efficient boosting algorithm that requires no realizability assumptions and whose sample and oracle complexity are independent of the number of classes.
Results: The new boosting technique is applied in several theoretical contexts within List PAC learning: an equivalence to weak PAC learning is established, a boosting result for list learners is presented, and novel proofs are given for the characterization of multiclass PAC learning and List PAC learning. Notably, the technique yields simplified algorithms and analysis compared with previous works.

We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being “slightly better than random guessing”. We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of the number of classes. In addition, we utilize our new boosting technique in several theoretical applications within the context of List PAC Learning. First, we establish an equivalence to weak PAC learning. Furthermore, we present a new result on boosting for list learners, as well as provide a novel proof for the characterization of multiclass PAC learning and List PAC learning. Notably, our technique gives rise to simplified algorithms and analysis compared to previous works.

Regret Minimization via Saddle Point Optimization
Johannes Kirschner Alireza Bakhtiari Kushagra Chandak Volodymyr Tkachuk Csaba Szepesvari



Research question: Characterizing the sample complexity of regret minimization in sequential decision-making via min-max programs.
Motivation: In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning.
Method: By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, an anytime variant of the Estimation-To-Decisions algorithm (Anytime-E2D) is derived. Importantly, the algorithm optimizes the exploration-exploitation trade-off online rather than via the analysis, leading to a practical algorithm for finite model classes and linear feedback models.
Results: The results are illustrated by deriving improved rates for high-dimensional linear bandits; connections to the information ratio, decoupling coefficient, and PAC-DEC are noted, and the performance of E2D is evaluated numerically on simple examples.

A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions algorithm (Anytime-E2D). Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We illustrate the results by deriving improved rates for high-dimensional linear bandits. Lastly, we point out connections to the information ratio, decoupling coefficient and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.

Model-Free Reinforcement Learning with the Decision-Estimation Coefficient
Dylan J Foster Noah Golowich Jian Qian Alexander Rakhlin Ayush Sekhari



Research question: Interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation.
Motivation: Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower-bounds the optimal regret for interactive decision making, together with a meta-algorithm, Estimation-to-Decisions, that achieves matching upper bounds.
Method: By combining Estimation-to-Decisions with a specialized form of "optimistic" estimation introduced by Zhang (2022), guarantees are obtained that improve on those of Foster et al. (2021) by accommodating more lenient notions of estimation error.
Results: The approach yields regret bounds for model-free reinforcement learning with value function approximation, together with structural results showing when it can and cannot help.

We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction, which lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of "optimistic" estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally.

Partial Matrix Completion
Elad Hazan Adam Tauman Kalai Varun Kanade Clara Mohri Y. Jennifer Sun



Research question: The matrix completion problem: reconstructing a low-rank matrix from a given set of revealed (possibly noisy) entries.
Motivation: Existing methods complete the entire matrix, but due to differences in the sampling distribution, the accuracy of the completed entries can vary significantly across the matrix.
Method: A novel formulation, Partial Matrix Completion, aims to complete a substantial subset of the entries with high confidence. The algorithm efficiently handles unknown and arbitrarily complex sampling distributions, ensuring high accuracy for all completed entries and sufficient coverage of the matrix. An online version of the problem is also introduced, with a low-regret efficient algorithm based on iterative gradient updates.
Results: A preliminary empirical evaluation shows that the methods perform well.

The matrix completion problem involves reconstructing a low-rank matrix by using a given set of revealed (and potentially noisy) entries. Although existing methods address the completion of the entire matrix, the accuracy of the completed entries can vary significantly across the matrix, due to differences in the sampling distribution. For instance, users may rate movies primarily from their country or favorite genres, leading to inaccurate predictions for the majority of completed entries. We propose a novel formulation of the problem as Partial Matrix Completion, where the objective is to complete a substantial subset of the entries with high confidence. Our algorithm efficiently handles the unknown and arbitrarily complex nature of the sampling distribution, ensuring high accuracy for all completed entries and sufficient coverage across the matrix. Additionally, we introduce an online version of the problem and present a low-regret efficient algorithm based on iterative gradient updates. Finally, we conduct a preliminary empirical evaluation of our methods.
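
The "iterative gradient updates" of the online variant can be pictured with bare-bones factored gradient descent on the observed entries. This is generic matrix-completion code under obvious assumptions (known rank, uniform sampling), not the paper's algorithm, which additionally tracks confidence and coverage.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 40, 3

M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # ground-truth low-rank
mask = rng.uniform(size=(n, m)) < 0.3                          # revealed entries

U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((m, r))
lr = 0.02
for _ in range(2000):
    R = (U @ V.T - M) * mask                  # residual on observed entries only
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

err = np.abs(U @ V.T - M)
print("mean abs error on unobserved entries:", err[~mask].mean().round(3))
```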

On the Role of Entanglement and Statistics in Learning
Srinivasan A Vojtěch Havlíček Louis Schatzki



Research question: Understand the relationship between learning models given access to entangled measurements, separable measurements, and statistical measurements in the quantum statistical query (QSQ) model.
Motivation: Simulations and separations among these access models clarify the respective power of entanglement and of statistical access in quantum learning.
Method: Simulation and separation results are proven between entangled and separable measurements, and a quantum statistical query dimension is introduced to establish lower bounds on QSQ learning complexity.
Results: If T entangled copies suffice to learn a concept, then $O(nT^2)$ separable copies suffice; exponential separations are exhibited between entangled and separable measurements and between QSQ learning and quantum learning with entangled measurements (even with noise); and exponential QSQ lower bounds are proven for a range of state-learning tasks, yielding an unconditional separation between weak and strong error mitigation.

In this work we make progress in understanding the relationship between learning models when given access to entangled measurements, separable measurements and statistical measurements in the quantum statistical query ($\mathsf{QSQ}$) model. To this end, we show the following results. $\textbf{Entanglement versus separable measurements.}$ The goal here is to learn an unknown $f$ from the concept class $\mathcal{C} \subseteq \{f:\{0,1\}^n\rightarrow [k]\}$ given copies of $\frac{1}{\sqrt{2^n}}\sum_x \ket{x,f(x)}$. We show that, if $T$ copies suffice to learn $f$ using entangled measurements, then $O(nT^2)$ copies suffice to learn $f$ using just separable measurements. Additionally, we exhibit a concept class $\mathcal{C}$ for which, in order to learn some \emph{property} of $f$, the sample complexity of learning using entangled measurements is exponentially smaller than separable measurements. $\textbf{Entangled versus statistical measurements.}$ The goal here is to learn a function $f \in \mathcal{C}$ given access to separable measurements and statistical measurements. We exhibit a concept class $\mathcal{C}$ based on degree-$2$ functions that gives an exponential separation between $\mathsf{QSQ}$ learning and quantum learning with entangled measurements (even in the presence of noise). This proves the "quantum analogue" of the seminal result of (Blum, 2003) that separates classical $\mathsf{SQ}$ learning from classical $\mathsf{PAC}$ learning with classification~noise. $\textbf{$\mathsf{QSQ}$ lower bounds for learning states.}$ The main technical contribution is to introduce a quantum statistical query dimension ($\mathsf{QSDA}$), which we use to give lower bounds on the $\mathsf{QSQ}$ complexity of learning. Using this, we prove exponential $\mathsf{QSQ}$ lower bounds for testing purity of quantum states, learning CCHL states, coset states of Abelian groups, degree-$2$ functions, planted bi-clique states and learning output states of Clifford circuits of depth polylog($n$). $\textbf{Further applications.}$ Our $\mathsf{QSQ}$ lower bounds give an $\textit{unconditional}$ separation between weak and strong error mitigation and yield lower bounds for learning distributions in the $\mathsf{QSQ}$ model. Prior works by (Quek et al., 2022), (Hinsche et al., 2022), and (Neitner et al., 23) proved the analogous results $\textit{assuming}$ diagonal measurements and our work removes this assumption.

Universality laws for Gaussian mixtures in generalized linear models
Yatin Dandi Ludovic Stephan Florent Krzakala Bruno Loureiro Lenka Zdeborova



Research question: The applicability of a line of results in high-dimensional statistics under the Gaussian mixture hypothesis, spanning empirical risk minimization, Bayesian uncertainty quantification, separation of kernel methods and neural networks, and ensembling and fluctuations of random features.
Motivation: Rigorous proofs are provided that these results apply to a general class of datasets $(\mathbf{x_i}, y_i, {i=1,\dots,n})$ containing independent samples from a mixture distribution $\sum_{c\in\mathcal{C}} \rho_{c}P_{c}^{\mathbf{x}}$.
Method: The hypothesis class of generalized linear models $\hat{y} = F(\mathbf{\Theta}^{\top}\mathbf{x})$ is considered, and the asymptotic joint statistics of a family of generalized linear estimators $(\mathbf{\Theta}^{(1)}, \dots, \mathbf{\Theta}^{(M)})$ are studied, obtained either from (a) minimizing an empirical risk $\hat{R_n}^{(m)}(\mathbf{\Theta}^{(m)};\mathbf{X},\mathbf{y})$ or (b) sampling from the associated Gibbs measure $\exp(-\beta n \hat{R_n}^{(m)}(\mathbf{\Theta}^{(m)};\mathbf{X},\mathbf{y}))$.
Results: The main contribution characterizes the conditions under which the asymptotic joint statistics of this family depend (in a weak sense) only on the means and covariances of the class-conditional feature distributions $P_{c}^{\mathbf{x}}$, proving the universality of various quantities of interest, including training and generalization errors as well as the geometric properties and correlations of the estimators.

A recent line of work in high-dimensional statistics working under the Gaussian mixture hypothesis has led to a number of results in the context of empirical risk minimization, Bayesian uncertainty quantification, separation of kernel methods and neural networks, ensembling and fluctuation of random features. We provide rigorous proofs for the applicability of these results to a general class of datasets $(\mathbf{x_i},y_i, {i=1,\dots,n})$ containing independent samples from a mixture distribution $\sum_{c\in\mathcal{C}} \rho_{c}P_{c}^{\mathbf{x}}$. Specifically, we consider the hypothesis class of generalized linear models $\hat{y} = F(\mathbf{\Theta}^{\top}\mathbf{x})$ and investigate the asymptotic joint statistics of a family of generalized linear estimators $(\mathbf{\Theta}^{(1)}, \dots, \mathbf{\Theta}^{(M)})$, obtained either from (a) minimizing an empirical risk $\hat{R_n}^{(m)}(\mathbf{\Theta}^{(m)};\mathbf{X},\mathbf{y})$ or (b) sampling from the associated Gibbs measure $\exp(-\beta n \hat{R_n}^{(m)}(\mathbf{\Theta}^{(m)};\mathbf{X},\mathbf{y}))$. Our main contribution is to characterize under which conditions the asymptotic joint statistics of this family depend (in a weak sense) only on the means and covariances of the class-conditional feature distributions $P_{c}^{\mathbf{x}}$. This allows us to prove the universality of different quantities of interest, including training and generalization errors, as well as the geometrical properties and correlations of the estimators.

Optimal Algorithms for the Inhomogeneous Spiked Wigner Model
Alexander Pak Justin Ko Florent Krzakala



Research question: A spiked Wigner problem with an inhomogeneous noise profile, where the goal is to recover a signal passed through an inhomogeneous low-rank matrix channel.
Motivation: While the information-theoretic performance is well known, the focus here is on the algorithmic problem.
Method: First, an approximate message-passing (AMP) algorithm is derived for the inhomogeneous problem, and its rigorous state evolution is shown to coincide with the information-theoretically optimal Bayes fixed-point equations. Second, a simple and efficient spectral method is deduced.
Results: The spectral method outperforms PCA and is shown to match the information-theoretic transition.

We study a spiked Wigner problem with an inhomogeneous noise profile. Our aim in this problem is to recover the signal passed through an inhomogeneous low-rank matrix channel. While the information-theoretic performances are well-known, we focus on the algorithmic problem. First, we derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem and show that its rigorous state evolution coincides with the information-theoretic optimal Bayes fixed-point equations. Second, we deduce a simple and efficient spectral method that outperforms PCA and is shown to match the information-theoretic transition.

Robust Mean Estimation Without Moments for Symmetric Distributions
Gleb Novikov David Steurer Stefan Tiegel



Research question: Robustly estimating the mean or location parameter without moment assumptions.
Motivation: Known computationally efficient algorithms rely on strong distributional assumptions such as sub-Gaussianity or (certifiably) bounded moments, and the guarantees they achieve in the heavy-tailed setting are weaker than those for sub-Gaussian distributions with known covariance.
Method: For a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently in the heavy-tailed setting. The distributions studied include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions, a vast generalization of the Gaussian distribution.
Results: For product distributions and elliptical distributions with known scatter (covariance) matrix, given an $\varepsilon$-corrupted sample, the location can be estimated with probability at least $1-\delta$ up to error $O(\varepsilon\sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d)+\log(1/\delta)}{\varepsilon^2\log(1/\varepsilon)}$ samples, matching the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter matrix, a sequence of efficient algorithms approaches the optimal error: for every $k\in\mathbb{N}$, an estimator using time and samples $\tilde{O}(d^k)$ achieves error $O(\varepsilon^{1-\frac{1}{2k}})$, matching the error and running-time guarantees obtained when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such $o(\sqrt{\varepsilon})$ error bounds are not known even for (general) sub-Gaussian distributions.

We study the problem of robustly estimating the mean or location parameter without moment assumptions. Known computationally efficient algorithms rely on strong distributional assumptions, such as sub-Gaussianity, or (certifiably) bounded moments. Moreover, the guarantees that they achieve in the heavy-tailed setting are weaker than those for sub-Gaussian distributions with known covariance. In this work, we show that such a tradeoff, between error guarantees and heavy-tails, is not necessary for symmetric distributions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions, a vast generalization of the Gaussian distribution. For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-\delta$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/\delta)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions. Our algorithms are based on a generalization of the well-known filtering technique [DK22]. More specifically, we show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how sum-of-squares proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works.
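
The filtering technique [DK22] that these algorithms generalize can be sketched in its basic Gaussian form: repeatedly check the top eigenvalue of the empirical covariance and discard points with extreme projections along the top direction. The threshold and removal quantile below are heuristic choices for illustration, not the paper's calibrated procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 10, 0.1

X = rng.standard_normal((n, d))              # inliers ~ N(0, I)
X[: int(eps * n)] += 10.0                    # a crude corruption of an eps fraction

def filter_mean(X, threshold=1.5, max_iter=50):
    X = X.copy()
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(X.T))
        if vals[-1] <= threshold:            # covariance looks Gaussian-like: stop
            return mu
        proj = (X - mu) @ vecs[:, -1]        # scores along the top direction
        keep = np.abs(proj) < np.quantile(np.abs(proj), 0.95)
        X = X[keep]                          # drop the most extreme 5%
    return X.mean(axis=0)

print("filtered mean norm:", np.linalg.norm(filter_mean(X)).round(3))
print("naive mean norm:   ", np.linalg.norm(X.mean(axis=0)).round(3))
```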

Multi-task Representation Learning for Pure Exploration in Bilinear Bandits
Subhojyoti Mukherjee Qiaomin Xie Josiah P. Hanna Robert D Nowak



Research question: Multi-task representation learning for pure exploration in bilinear bandits.
Motivation: In bilinear bandits, an action is a pair of arms from two different entity types, and the reward is a bilinear function of the arms' known feature vectors. In the multi-task bilinear bandit problem, the goal is to find optimal actions for multiple tasks that share a common low-dimensional linear representation.
Method: The GOBLIN algorithm uses an experimental design approach to optimize sample allocations for learning the global representation and to minimize the number of samples needed to identify the optimal pair of arms in individual tasks.
Results: This is the first sample complexity analysis for pure exploration in bilinear bandits with a shared representation; learning the shared representation across tasks yields significantly improved sample complexity compared with solving the tasks independently.

We study multi-task representation learning for the problem of pure exploration in bilinear bandits. In bilinear bandits, an action takes the form of a pair of arms from two different entity types and the reward is a bilinear function of the known feature vectors of the arms. In the \textit{multi-task bilinear bandit problem}, we aim to find optimal actions for multiple tasks that share a common low-dimensional linear representation. The objective is to leverage this characteristic to expedite the process of identifying the best pair of arms for all tasks. We propose the algorithm GOBLIN that uses an experimental design approach to optimize sample allocations for learning the global representation as well as minimize the number of samples needed to identify the optimal pair of arms in individual tasks. To the best of our knowledge, this is the first study to give sample complexity analysis for pure exploration in bilinear bandits with shared representation. Our results demonstrate that by learning the shared representation across tasks, we achieve significantly improved sample complexity compared to the traditional approach of solving tasks independently.

A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing Time
Ranran Shen Pan Peng



Research question: Design a sublinear-time spectral clustering oracle for graphs that exhibit strong clusterability.
Motivation: For strongly clusterable graphs, design an algorithm that both preprocesses the graph and answers clustering membership queries in sublinear time.
Method: By relaxing the required gap between inner and outer conductance and reducing the preprocessing time, the graph is preprocessed and queries are answered with a partition consistent with a k-partition close to the ground-truth clustering.
Results: At the cost of a slightly higher misclassification ratio, the oracle handles smaller conductance gaps and avoids exponential preprocessing time, and it is robust against a few random edge deletions. Experiments on synthetic networks validate the theoretical bounds.

We address the problem of designing a sublinear-time spectral clustering oracle for graphs that exhibit strong clusterability. Such graphs contain $k$ latent clusters, each characterized by a large inner conductance (at least $\varphi$) and a small outer conductance (at most $\varepsilon$). Our aim is to preprocess the graph to enable clustering membership queries, with the key requirement that both preprocessing and query answering should be performed in sublinear time, and the resulting partition should be consistent with a $k$-partition that is close to the ground-truth clustering. Previous oracles have relied on either a $\textrm{poly}(k)\log n$ gap between inner and outer conductances or exponential (in $k/\varepsilon$) preprocessing time. Our algorithm relaxes these assumptions, albeit at the cost of a slightly higher misclassification ratio. We also show that our clustering oracle is robust against a few random edge deletions. To validate our theoretical bounds, we conducted experiments on synthetic networks.

Transformers learn to implement preconditioned gradient descent for in-context learning
Kwangjun Ahn Xiang Cheng Hadi Daneshmand Suvrit Sra



Research question: Whether transformers can learn to implement algorithms by training over random problem instances.
Motivation: Although transformers have striking in-context learning abilities and can, by construction of weights, simulate algorithms such as gradient descent, it remains unclear whether they learn these algorithms through training on random problem instances.
Method: Analyze the loss landscape of linear transformers trained over random instances of linear regression.
Results: For a single attention layer, the global minimum of the training objective implements a single iteration of preconditioned gradient descent; for a transformer with k attention layers, certain critical points of the training objective implement k iterations of preconditioned gradient descent.

Motivated by the striking ability of transformers for in-context learning, several works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate gradient descent iterations. Going beyond the question of expressivity, we ask: \emph{Can transformers learn to implement such algorithms by training over random problem instances?} To our knowledge, we make the first theoretical progress toward this question via analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $k$ attention layers, we prove certain critical points of the training objective implement $k$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.
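
The correspondence underlying the single-layer result can be checked numerically: on an in-context linear-regression prompt, a linear attention layer whose scores are $x_q^\top P x_i / n$ produces exactly the prediction of one preconditioned gradient step from zero initialization. The diagonal preconditioner and dimensions below are arbitrary stand-ins under this folklore identity, not the trained weights the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 20
P = np.diag(rng.uniform(0.5, 2.0, d))     # an arbitrary (diagonal) preconditioner

X = rng.standard_normal((n, d))           # in-context examples
w_star = rng.standard_normal(d)
y = X @ w_star
x_q = rng.standard_normal(d)              # query token

# One step of preconditioned GD from w = 0 on the loss (1/2n) sum_i (w.x_i - y_i)^2.
w1 = P @ (X.T @ y) / n
pred_gd = x_q @ w1

# A linear-attention "layer": value = y_i, score = x_q^T P x_i / n.
pred_attn = sum(y[i] * (x_q @ P @ X[i]) / n for i in range(n))

print(np.isclose(pred_gd, pred_attn))     # True: the two predictions coincide
```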

Responsible AI (RAI) Games and Ensembles
Yash Gupta Runtian Zhai Arun Suggala Pradeep Kumar Ravikumar



Research question: The societal effects of AI, including issues such as fairness, robustness, and safety.
Motivation: In many of these objectives, a learner seeks to minimize its worst-case loss over a set of predefined distributions (known as uncertainty sets), commonly perturbed versions of the empirical distribution; such problems can thus be written as min-max problems over these uncertainty sets.
Method: A general framework for studying these problems, called Responsible AI (RAI) games, with two classes of algorithms for solving them: (a) game-play based algorithms, motivated by online learning and game theory, and (b) greedy stagewise estimation algorithms, motivated by the classical statistical literature on boosting and regression.
Results: The techniques are empirically shown to be applicable and to perform competitively on several RAI problems, particularly around subpopulation shift.

Several recent works have studied the societal effects of AI; these include issues such as fairness, robustness, and safety. In many of these objectives, a learner seeks to minimize its worst-case loss over a set of predefined distributions (known as uncertainty sets), with usual examples being perturbed versions of the empirical distribution. In other words, the aforementioned problems can be written as min-max problems over these uncertainty sets. In this work, we provide a general framework for studying these problems, which we refer to as Responsible AI (RAI) games. We provide two classes of algorithms for solving these games: (a) game-play based algorithms, and (b) greedy stagewise estimation algorithms. The former class is motivated by online learning and game theory, whereas the latter class is motivated by the classical statistical literature on boosting, and regression. We empirically demonstrate the applicability and competitive performance of our techniques for solving several RAI problems, particularly around subpopulation shift.

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
Jiayi Huang Han Zhong Liwei Wang Lin Yang



Research question: In reinforcement learning where rewards are heavy-tailed (having only finite $(1+\epsilon)$-th moments), do sample- or time-efficient algorithms exist for large state-action spaces?
Motivation: While many works devise efficient RL algorithms for uniformly bounded rewards, whether sample- or time-efficient algorithms exist for large state-action spaces with heavy-tailed rewards has remained an open question.
Method: An algorithm, Heavy-OFUL, is designed for heavy-tailed linear bandits, achieving the first instance-dependent regret bound of its kind; it is then extended to RL with linear function approximation as Heavy-LSVI-UCB, via a novel robust self-normalized concentration inequality.
Results: Heavy-LSVI-UCB achieves the first computationally efficient, instance-dependent K-episode regret bound for heavy-tailed rewards, and a matching minimax lower bound demonstrates its worst-case optimality.

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL settings with linear function approximation. Our algorithm, termed as \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.

Memory-Constrained Algorithms for Convex Optimization
Moise Blanchard Junhui Zhang Patrick Jaillet



Research question: Propose a family of recursive cutting-plane algorithms that solve feasibility problems with constrained memory and can also be used for first-order convex optimization.
Motivation: To find a point within a ball of radius $\epsilon$, or to minimize a $1$-Lipschitz convex function over the unit ball to accuracy $\epsilon$, the algorithms use $\mathcal{O}(\frac{d^2}{p}\ln\frac{1}{\epsilon})$ bits of memory and make $\mathcal{O}((C\frac{d}{p}\ln\frac{1}{\epsilon})^p)$ oracle calls. The family is parameterized by $p\in[d]$ and provides an oracle-complexity/memory trade-off in the sub-polynomial regime $\ln\frac{1}{\epsilon}\gg\ln d$.
Method: The algorithms divide the $d$ variables into $p$ blocks and optimize over the blocks sequentially, constructing approximate separation vectors with a variant of Vaidya's method.
Results: This is the first class of algorithms with a positive trade-off between gradient descent and cutting-plane methods in any regime with $\epsilon\leq 1/\sqrt{d}$; in the regime $\epsilon\leq d^{-\Omega(d)}$, the algorithm with $p=d$ achieves information-theoretically optimal memory usage and improves on the oracle complexity of gradient descent.

We propose a family of recursive cutting-plane algorithms to solve feasibility problems with constrained memory, which can also be used for first-order convex optimization. Precisely, in order to find a point within a ball of radius $\epsilon$ with a separation oracle in dimension $d$---or to minimize $1$-Lipschitz convex functions to accuracy $\epsilon$ over the unit ball---our algorithms use $\mathcal O(\frac{d^2}{p}\ln \frac{1}{\epsilon})$ bits of memory, and make $\mathcal O((C\frac{d}{p}\ln \frac{1}{\epsilon})^p)$ oracle calls. The family is parametrized by $p\in[d]$ and provides an oracle-complexity/memory trade-off in the sub-polynomial regime $\ln\frac{1}{\epsilon}\gg\ln d$. While several works gave lower-bound trade-offs (impossibility results)---we explicit here their dependence with $\ln\frac{1}{\epsilon}$, showing that these also hold in any sub-polynomial regime---to the best of our knowledge this is the first class of algorithms that provides a positive trade-off between gradient descent and cutting-plane methods in any regime with $\epsilon\leq 1/\sqrt d$. The algorithms divide the $d$ variables into $p$ blocks and optimize over blocks sequentially, with approximate separation vectors constructed using a variant of Vaidya's method. In the regime $\epsilon \leq d^{-\Omega(d)}$, our algorithm with $p=d$ achieves the information-theoretic optimal memory usage and improves the oracle-complexity of gradient descent.

Non-Convex Bilevel Optimization with Time-Varying Objective Functions
Sen Lin Daouda Sow Kaiyi Ji Yingbin Liang Ness Shroff



Research question: Address the shortcomings of current nonconvex bilevel optimization when handling the time-varying functions and streaming data of online applications.
Motivation: Existing bilevel optimization algorithms target offline datasets and static functions and may perform poorly in emerging online applications with streaming data and time-varying functions.
Method: Propose a single-loop online bilevel optimizer with window averaging (SOBOW) that updates the outer-level decision based on a window average of the most recent hypergradient estimates, without needing to know previous functions.
Results: SOBOW achieves sublinear bilevel local regret under mild conditions, and experiments across multiple domains corroborate its effectiveness.

Bilevel optimization has become a powerful tool in a wide variety of machine learning problems. However, the current nonconvex bilevel optimization considers an offline dataset and static functions, which may not work well in emerging online applications with streaming data and time-varying functions. In this work, we study online bilevel optimization (OBO) where the functions can be time-varying and the agent continuously updates the decisions with online streaming data. To deal with the function variations and the unavailability of the true hypergradients in OBO, we propose a single-loop online bilevel optimizer with window averaging (SOBOW), which updates the outer-level decision based on a window average of the most recent hypergradient estimations stored in the memory. Compared to existing algorithms, SOBOW is computationally efficient and does not need to know previous functions. To handle the unique technical difficulties rooted in single-loop update and function variations for OBO, we develop a novel analytical technique that disentangles the complex couplings between decision variables, and carefully controls the hypergradient estimation error. We show that SOBOW can achieve a sublinear bilevel local regret under mild conditions. Extensive experiments across multiple domains corroborate the effectiveness of SOBOW.

Kernel-Based Tests for Likelihood-Free Hypothesis Testing
Patrik Robert Gerber Tianze Jiang Yury Polyanskiy Rui Sun



Research question: Given $n$ observations from two balanced classes, how to label an additional $m$ inputs known to all belong to one of the two classes.
Motivation: Existing methods may not handle the case, often encountered in practice, where the unlabeled samples come from a mixture of the two classes; this setting therefore needs study.
Method: Study the minimax sample complexity for nonparametric classes of densities under maximum mean discrepancy (MMD) separation, and investigate the empirical performance of kernels parameterized by neural networks on two tasks: detecting the Higgs boson and detecting planted DDPM-generated images amidst CIFAR-10 images.
Results: The experiments confirm the existence of the theoretically predicted asymmetric trade-off between $m$ and $n$.

Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes -- a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.
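
As a concrete reference point, here is a minimal NumPy sketch of the unbiased MMD$^2$ statistic with a plain Gaussian kernel; the paper's tests use kernels parameterized by neural networks, so the fixed bandwidth and the toy data below are illustrative assumptions only.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased U-statistic estimate of MMD^2 between samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    # Drop diagonal terms so each within-sample average is unbiased.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # class P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # class Q, mean-shifted
print(mmd2_unbiased(X, Y))  # clearly positive when P and Q are MMD-separated
```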

Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks
Jiayuan Ye Zhenyu Zhu Fanghui Liu Reza Shokri Volkan Cevher



Research question: Analyze how over-parameterization of models in randomized machine learning algorithms affects the information leakage about their training data.
Motivation: To understand information leakage during deep learning training, the authors theoretically examine how model parameterization, initialization, and network depth influence leakage.
Method: Theoretically analyze the KL divergence between model distributions, studying how the privacy loss depends on the initialization distribution, network width, and depth; excess empirical risk bounds are also proven under a fixed KL privacy budget.
Results: Under some initializations (LeCun and Xavier) the privacy bound improves as depth increases, while under others (He and NTK) it degrades with depth, revealing a complex interplay between privacy and depth that depends on the chosen initialization distribution.

We analytically investigate how over-parameterization of models in randomized machine learning algorithms impacts the information leakage about their training data. Specifically, we prove a privacy bound for the KL divergence between model distributions on worst-case neighboring datasets, and explore its dependence on the initialization, width, and depth of fully connected neural networks. We find that this KL privacy bound is largely determined by the expected squared gradient norm relative to model parameters during training. Notably, for the special setting of linearized network, our analysis indicates that the squared gradient norm (and therefore the escalation of privacy loss) is tied directly to the per-layer variance of the initialization distribution. By using this analysis, we demonstrate that privacy bound improves with increasing depth under certain initializations (LeCun and Xavier), while degrades with increasing depth under other initializations (He and NTK). Our work reveals a complex interplay between privacy and depth that depends on the chosen initialization distribution. We further prove excess empirical risk bounds under a fixed KL privacy budget, and show that the interplay between privacy utility trade-off and depth is similarly affected by the initialization.
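
To make the role of the initialization distribution concrete, the following toy sketch computes the squared gradient norm at initialization, the quantity the abstract says drives the KL privacy bound, for a small ReLU network under the LeCun, Xavier, and He per-layer variances. The network sizes, data, and loss are invented for the demo; this is not the paper's linearized-network analysis.

```python
import numpy as np

# Per-layer weight variances for common schemes (fan-in d_in, fan-out d_out):
#   LeCun: 1/d_in    Xavier: 2/(d_in + d_out)    He: 2/d_in
def init_std(scheme, d_in, d_out):
    var = {"lecun": 1.0 / d_in, "xavier": 2.0 / (d_in + d_out),
           "he": 2.0 / d_in}[scheme]
    return np.sqrt(var)

def grad_sq_norm_at_init(scheme, depth, width=128, n=64, seed=0):
    """Squared gradient norm of an MSE-type loss for a toy ReLU net at init."""
    rng = np.random.default_rng(seed)
    x, y = rng.normal(size=(n, width)), rng.normal(size=(n, 1))
    Ws = [rng.normal(0, init_std(scheme, width, width), (width, width))
          for _ in range(depth)]
    w_out = rng.normal(0, np.sqrt(1.0 / width), (width, 1))
    acts, h = [x], x
    for W in Ws:                               # forward pass, caching activations
        h = np.maximum(h @ W, 0.0)
        acts.append(h)
    resid = (h @ w_out - y) / n                # d(loss)/d(output)
    total = np.sum((acts[-1].T @ resid) ** 2)  # output-layer contribution
    g = resid @ w_out.T                        # gradient wrt last activation
    for i in range(depth - 1, -1, -1):         # backward pass
        g = g * (acts[i + 1] > 0)              # through the ReLU
        total += np.sum((acts[i].T @ g) ** 2)  # layer-i weight gradient
        g = g @ Ws[i].T
    return total

for scheme in ("lecun", "xavier", "he"):
    print(scheme, [f"{grad_sq_norm_at_init(scheme, dep):.2e}"
                   for dep in (2, 8, 32)])  # depth trend differs per scheme
```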

Implicit Regularization in Over-Parameterized Support Vector Machine
Yang Sui Xin HE Yang Bai



Research question: Design a regularization-free algorithm for high-dimensional support vector machines (SVMs) by combining over-parameterization with Nesterov's smoothing method, and provide theoretical guarantees for the induced implicit regularization phenomenon.
Motivation: Sparse high-dimensional SVMs typically rely on explicit regularization; over-parameterization trained by plain, regularization-free gradient descent can induce an implicit regularization effect instead, while Nesterov's smoothing improves the computational efficiency of the procedure.
Method: Construct an over-parameterized hinge loss function and estimate the true parameters by regularization-free gradient descent on this loss.
Results: With appropriate choices of initialization, step size, and smoothness parameter, the unregularized gradient descent achieves a near-oracle statistical convergence rate; numerical experiments verify the theory, and a comparison with explicit regularization illustrates the advantages of implicit regularization via gradient descent combined with over-parameterization in sparse SVMs.

In this paper, we design a regularization-free algorithm for high-dimensional support vector machines (SVMs) by integrating over-parameterization with Nesterov's smoothing method, and provide theoretical guarantees for the induced implicit regularization phenomenon. In particular, we construct an over-parameterized hinge loss function and estimate the true parameters by leveraging regularization-free gradient descent on this loss function. The utilization of Nesterov's method enhances the computational efficiency of our algorithm, especially in terms of determining the stopping criterion and reducing computational complexity. With appropriate choices of initialization, step size, and smoothness parameter, we demonstrate that unregularized gradient descent achieves a near-oracle statistical convergence rate. Additionally, we verify our theoretical findings through a variety of numerical experiments and compare the proposed method with explicit regularization. Our results illustrate the advantages of employing implicit regularization via gradient descent in conjunction with over-parameterization in sparse SVMs.

Robustness Guarantees for Adversarially Trained Neural Networks
Poorya Mianjy Raman Arora



Research question: Make two-layer neural networks robust via adversarial training.
Motivation: To address the vulnerability of neural networks to adversarial attacks, cast robust adversarial training as a bi-level optimization problem.
Method: Implement the adversarial attack during training with projected gradient descent (PGD), maximizing a lower bound on the $0/1$-loss obtained by reflecting a surrogate loss about the origin.
Results: The method comes with a convergence guarantee for the inner-loop PGD attack and, when the data is linearly separable, precise iteration complexity results for end-to-end adversarial training; experiments validate the theoretical analysis.

We study robust adversarial training of two-layer neural networks as a bi-level optimization problem. In particular, for the inner loop that implements the adversarial attack during training using projected gradient descent (PGD), we propose maximizing a \emph{lower bound} on the $0/1$-loss by reflecting a surrogate loss about the origin. This allows us to give a convergence guarantee for the inner-loop PGD attack. Furthermore, assuming the data is linearly separable, we provide precise iteration complexity results for end-to-end adversarial training, which holds for any width and initialization. We provide empirical evidence to support our theoretical results.

Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning
Stefan Stojanovic Yassir Jedra Alexandre Proutiere



Research question: This paper studies matrix estimation problems arising in reinforcement learning with low-rank structure.
Motivation: In low-rank bandits and MDPs, the matrix to be recovered specifies the expected arm rewards or the MDP's transition kernel; every entry carries important information, so estimation methods with low entry-wise prediction error are needed.
Method: Investigate simple spectral-based matrix estimation approaches and show that they efficiently recover the singular subspaces of the matrix and exhibit nearly minimal entry-wise prediction error.
Results: These new low-rank matrix estimation results make it possible to design reinforcement learning algorithms that fully exploit the underlying low-rank structure; two examples are given, a regret minimization algorithm for low-rank bandit problems and a best policy identification algorithm for low-rank MDPs, both yielding state-of-the-art performance guarantees.

We study matrix estimation problems arising in reinforcement learning with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it characterizes the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise prediction error. Importantly, these methods further need to accommodate for inherent correlations in the available data (e.g. for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise prediction error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
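
A minimal sketch of the simplest spectral estimator the abstract refers to: truncate the SVD of a noisy observation of a low-rank matrix and read off the entry-wise error. For simplicity this assumes i.i.d. Gaussian noise and a known rank, whereas the paper also handles the correlated trajectory data arising in MDPs.

```python
import numpy as np

def low_rank_estimate(M_noisy, rank):
    """Spectral estimator: project a noisy matrix onto its top-`rank` SVD."""
    U, s, Vt = np.linalg.svd(M_noisy, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(1)
n, r = 200, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))  # rank-r reward matrix
M_hat = low_rank_estimate(M + 0.5 * rng.normal(size=(n, n)), rank=r)
print(np.abs(M_hat - M).max())  # entry-wise (sup-norm) prediction error
```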

An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient
Yudong Luo Guiliang Liu Pascal Poupart Yangchen Pan



Research question: Restricting the variance of a policy's return is a common way to express risk aversion in reinforcement learning, but it has limitations such as numerical sensitivity and hindered policy learning.
Motivation: Find an alternative risk measure that resolves the problems of the traditional variance-based constraints.
Method: Propose Gini deviation as a new risk measure and derive a policy gradient algorithm to minimize it.
Results: Empirical evaluation in domains where risk aversion can be clearly defined shows that the method mitigates the limitations of variance-based approaches and achieves high return with low risk in cases where other methods fail to learn a reasonable policy.

Restricting the variance of a policy’s return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindering of policy learning, and propose to use an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined, shows that our algorithm can mitigate the limitations of variance-based risk measures and achieves high return with low risk in terms of variance and Gini deviation when others fail to learn a reasonable policy.
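
For reference, a small sketch of one standard form of Gini deviation, half the mean absolute difference $\tfrac{1}{2}\,\mathbb{E}|X - X'|$ between two independent copies, estimated from sampled returns of two hypothetical policies; the paper's exact definition and its policy gradient derivation should be consulted for details.

```python
import numpy as np

def gini_deviation(returns):
    """Half the mean absolute pairwise difference E|X - X'| / 2, a common
    form of Gini deviation, estimated from sampled returns."""
    x = np.asarray(returns, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])
    n = len(x)
    return diffs.sum() / (2.0 * n * (n - 1))  # average over i != j, halved

rng = np.random.default_rng(0)
safe = rng.normal(1.0, 0.1, size=1000)    # policy with stable returns
risky = rng.normal(1.2, 2.0, size=1000)   # higher mean, much heavier spread
print(gini_deviation(safe), gini_deviation(risky))
```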

Exploiting hidden structures in non-convex games for convergence to Nash equilibrium
Iosif Sakos Emmanouil-Vasileios Vlatakis-Gkaragkounis Panayotis Mertikopoulos Georgios Piliouras



Research question: How to exploit the Nash equilibria of the non-cooperative games arising in machine learning applications, which represent the system's desired operational states.
Motivation: Although many cases of interest possess a latent convex structure, the loss landscape is highly non-convex, which can obstruct convergence to an equilibrium.
Method: Propose a flexible first-order method that successfully exploits this "hidden structure" and converges under minimal assumptions on the transformation connecting the players' control variables to the game's hidden, convex-structured layer. The method, called preconditioned hidden gradient descent (PHGD), hinges on a judiciously chosen gradient preconditioning scheme related to natural gradient methods.
Results: Explicit convergence rate guarantees are provided for both deterministic and stochastic environments, without any separability assumptions on the game's hidden structure.

A wide array of modern machine learning applications – from adversarial models to multi-agent reinforcement learning – can be formulated as non-cooperative games whose Nash equilibria represent the system’s desired operational states. Despite having a highly non-convex loss landscape, many cases of interest possess a latent convex structure that could potentially be leveraged to yield convergence to an equilibrium. Driven by this observation, our paper proposes a flexible first-order method that successfully exploits such “hidden structures” and achieves convergence under minimal assumptions for the transformation connecting the players’ control variables to the game’s latent, convex-structured layer. The proposed method – which we call preconditioned hidden gradient descent (PHGD) – hinges on a judiciously chosen gradient preconditioning scheme related to natural gradient methods. Importantly, we make no separability assumptions for the game’s hidden structure, and we provide explicit convergence rate guarantees for both deterministic and stochastic environments.

$p$-value Adjustment for Monotonous, Unbiased, and Fast Clustering Comparison
Kai Klede Thomas Altstidl Dario Zanca Bjoern Eskofier



Research question: Address the type II bias of existing clustering comparison metrics and propose an unbiased and monotonous clustering comparison method.
Motivation: Popular metrics such as the Adjusted Rand Index and Adjusted Mutual Information are type II biased, while the Standardized Mutual Information removes this bias but suffers from counterintuitive non-monotonicity and poor computational efficiency.
Method: Propose the $p$-value adjusted Rand Index ($\operatorname{PMI}_2$), the first clustering comparison method that is type II unbiased and provably monotonous; $\operatorname{PMI}_2$ has fast approximations that outperform the Standardized Mutual Information.
Results: Synthetic benchmarks demonstrate unbiased clustering selection, approximation quality, and runtime efficiency; experiments on image and social network datasets show how $\operatorname{PMI}_2$ helps practitioners choose better clustering and community detection algorithms.

Popular metrics for clustering comparison, like the Adjusted Rand Index and the Adjusted Mutual Information, are type II biased. The Standardized Mutual Information removes this bias but suffers from counterintuitive non-monotonicity and poor computational efficiency. We introduce the $p$-value adjusted Rand Index ($\operatorname{PMI}_2$), the first cluster comparison method that is type II unbiased and provably monotonous. The $\operatorname{PMI}_2$ has fast approximations that outperform the Standardized Mutual information. We demonstrate its unbiased clustering selection, approximation quality, and runtime efficiency on synthetic benchmarks. In experiments on image and social network datasets, we show how the $\operatorname{PMI}_2$ can help practitioners choose better clustering and community detection algorithms.

Dynamic Non-monotone Submodular Maximization
Kiarash Banihashem Leyla Biabani Samira Goudarzi MohammadTaghi Hajiaghayi Peyman Jabbarzade Morteza Monemizadeh



Research question: Can fully dynamic results be extended to non-monotone submodular maximization?
Motivation: Maximizing submodular functions is increasingly used in many machine learning applications, such as data summarization, recommendation systems, and feature selection, and there is growing interest in both submodular maximization and dynamic algorithms.
Method: By reducing the problem of maximizing a non-monotone submodular function under a cardinality constraint to maximizing a monotone submodular function under the same constraint, the first dynamic algorithms for non-monotone submodular maximization are obtained.
Results: The algorithms maintain an $(8+\epsilon)$-approximate solution and use expected amortized $O(\epsilon^{-3}k^3\log^3(n)\log(k))$ or $O(\epsilon^{-1}k^2\log^3(k))$ oracle queries per update, with their advantages demonstrated on video summarization and max-cut problems.

Maximizing submodular functions has been increasingly used in many applications of machine learning, such as data summarization, recommendation systems, and feature selection. Moreover, there has been a growing interest in both submodular maximization and dynamic algorithms. In 2020, Monemizadeh and Lattanzi, Mitrovic, Norouzi-Fard, Tarnawski, and Zadimoghaddam initiated developing dynamic algorithms for the monotone submodular maximization problem under the cardinality constraint $k$. In 2022, Chen and Peng studied the complexity of this problem and raised an important open question: "\emph{Can we extend [fully dynamic] results (algorithm or hardness) to non-monotone submodular maximization?}". We affirmatively answer their question by demonstrating a reduction from maximizing a non-monotone submodular function under the cardinality constraint $k$ to maximizing a monotone submodular function under the same constraint. Through this reduction, we obtain the first dynamic algorithms to solve the non-monotone submodular maximization problem under the cardinality constraint $k$. Our algorithms maintain an $(8+\epsilon)$-approximate of the solution and use expected amortized $O(\epsilon^{-3}k^3\log^3(n)\log(k))$ or $O(\epsilon^{-1}k^2\log^3(k))$ oracle queries per update, respectively. Furthermore, we showcase the benefits of our dynamic algorithm for video summarization and max-cut problems on several real-world data sets.
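
For orientation, here is the classic static greedy baseline for monotone submodular maximization under a cardinality constraint, with a toy coverage function; the paper's contribution, by contrast, is dynamic algorithms obtained by reducing the non-monotone case to the monotone one, which this sketch does not attempt.

```python
def greedy_max(f, ground_set, k):
    """Static greedy for monotone submodular f under |S| <= k.
    f takes a frozenset of elements and returns its value."""
    S = frozenset()
    for _ in range(k):
        best_gain, best_e = 0.0, None
        for e in ground_set - S:
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best_gain, best_e = gain, e
        if best_e is None:        # no positive marginal gain remains
            break
        S = S | {best_e}
    return S

# Example: a coverage function (monotone submodular).
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a"}}
coverage = lambda S: len(set().union(*(sets[i] for i in S)) if S else set())
print(greedy_max(coverage, frozenset(sets), k=2))  # picks sets 3 and 1
```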

Utilitarian Algorithm Configuration
Devon R. Graham Kevin Leyton-Brown Tim Roughgarden



Research question: How to configure heuristic algorithms to maximize the utility they provide to end users while offering theoretical guarantees about performance.
Motivation: Existing configuration procedures seek settings that minimize expected runtime, but recent theoretical work argues that expected runtime minimization fails to capture algorithm designers' preferences.
Method: Present the first nontrivial procedures for configuring heuristic algorithms to maximize end-user utility, together with theoretical performance guarantees.
Results: Upper bounds on the runtime of these procedures are proven to be close to the theoretical lower bounds, and their performance is also demonstrated empirically.

We present the first nontrivial procedure for configuring heuristic algorithms to maximize the utility provided to their end users while also offering theoretical guarantees about performance. Existing procedures seek configurations that minimize expected runtime. However, very recent theoretical work argues that expected runtime minimization fails to capture algorithm designers' preferences. Here we show that the utilitarian objective also confers significant algorithmic benefits. Intuitively, this is because mean runtime is dominated by extremely long runs even when they are incredibly rare; indeed, even when an algorithm never gives rise to such long runs, configuration procedures that provably minimize mean runtime must perform a huge number of experiments to demonstrate this fact. In contrast, utility is bounded and monotonically decreasing in runtime, allowing for meaningful empirical bounds on a configuration's performance. This paper builds on this idea to describe effective and theoretically sound configuration procedures. We prove upper bounds on the runtime of these procedures that are similar to theoretical lower bounds, while also demonstrating their performance empirically.

Improving the Knowledge Gradient Algorithm
Le Yang Siyang Gao Chin Pang Ho



Research question: The knowledge gradient algorithm has limitations when applied to the best arm identification problem and is not an asymptotically optimal policy.
Motivation: Improve the performance of the knowledge gradient algorithm via an improved knowledge gradient (iKG) algorithm.
Method: Keep the one-step look-ahead manner of KG, but choose the measurement that yields the greatest one-step improvement in the probability of selecting the best arm, instead of the greatest one-step improvement in the estimate of the best arm's mean.
Results: iKG is asymptotically optimal, is easier to extend to variants of best arm identification such as $\epsilon$-good arm identification and feasible arm identification, and numerical examples demonstrate its superior performance.

The knowledge gradient (KG) algorithm is a popular policy for the best arm identification (BAI) problem. It is built on the simple idea of always choosing the measurement that yields the greatest expected one-step improvement in the estimate of the best mean of the arms. In this research, we show that this policy has limitations, causing the algorithm to not be asymptotically optimal. We next provide a remedy for it, by following the manner of one-step look ahead of KG, but instead choosing the measurement that yields the greatest one-step improvement in the probability of selecting the best arm. The new policy is called improved knowledge gradient (iKG). iKG can be shown to be asymptotically optimal. In addition, we show that compared to KG, it is easier to extend iKG to variant problems of BAI, with the $\epsilon$-good arm identification and feasible arm identification as two examples. The superior performance of iKG on these problems is further demonstrated using numerical examples.
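
The sketch below contrasts the two acquisition rules by Monte Carlo for independent Gaussian arms with conjugate normal posteriors: KG scores the expected one-step gain in the best posterior mean, while an iKG-style score uses the one-step gain in the posterior probability of correct selection. The conjugate updates are standard; the Monte Carlo estimate of that probability is an illustrative approximation with invented parameters, not the paper's exact formula.

```python
import numpy as np
rng = np.random.default_rng(0)

def posterior_after_sample(mu, s2, i, y, sigma2):
    """Conjugate normal update of arm i's posterior after observing y."""
    prec = 1.0 / s2[i] + 1.0 / sigma2
    mu2, s2_new = mu.copy(), s2.copy()
    mu2[i] = (mu[i] / s2[i] + y / sigma2) / prec
    s2_new[i] = 1.0 / prec
    return mu2, s2_new

def kg_and_ikg(mu, s2, sigma2, n_mc=2000, n_pred=64):
    """Monte-Carlo one-step look-ahead scores for each arm."""
    def pcs(mu_, s2_):  # posterior probability of correct selection
        draws = rng.normal(mu_, np.sqrt(s2_), size=(n_mc, len(mu_)))
        return np.mean(draws.argmax(1) == mu_.argmax())
    base_best, base_pcs = mu.max(), pcs(mu, s2)
    kg, ikg = np.zeros(len(mu)), np.zeros(len(mu))
    for i in range(len(mu)):
        # Predictive distribution of the next observation of arm i.
        ys = rng.normal(mu[i], np.sqrt(s2[i] + sigma2), size=n_pred)
        for y in ys:
            mu2, s22 = posterior_after_sample(mu, s2, i, y, sigma2)
            kg[i] += mu2.max() - base_best       # KG: gain in best mean
            ikg[i] += pcs(mu2, s22) - base_pcs   # iKG-style: gain in PCS
    return kg / n_pred, ikg / n_pred

mu = np.array([0.0, 0.4, 0.5]); s2 = np.array([1.0, 1.0, 1.0])
kg, ikg = kg_and_ikg(mu, s2, sigma2=1.0)
print("KG picks arm", kg.argmax(), "| iKG-style picks arm", ikg.argmax())
```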

Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage
Masatoshi Uehara Nathan Kallus Jason D. Lee Wen Sun



Research question: Offline reinforcement learning, where training must rely on offline data alone.
Motivation: Many existing offline RL algorithms require uniform coverage of the state and action space; this work proposes value-based algorithms with PAC guarantees that need only partial coverage, namely coverage of the offline data against a single policy.
Method: Propose novel algorithms with minimax loss functions that accurately estimate soft Q-functions and Q-functions, with convergence guarantees measured on the offline data, obtained by casting the estimation problems as nonlinear convex optimization problems and taking the Lagrange functions.
Results: PAC guarantees are established under partial coverage and realizability of the soft Q-function (and of an auxiliary function defined via a minimax saddle point), with an analogous result for ordinary Q-functions.

We consider offline reinforcement learning (RL) where we only have access to offline data. In contrast to numerous offline RL algorithms that necessitate the uniform coverage of the offline data over state and action space, we propose value-based algorithms with PAC guarantees under partial coverage, specifically, coverage of offline data against a single policy, and realizability of soft Q-function (a.k.a., entropy-regularized Q-function) and another function, which is defined as a solution to a saddle point of a certain minimax optimization problem. Furthermore, we show the analogous result for Q-functions instead of soft Q-functions. To attain these guarantees, we use novel algorithms with minimax loss functions to accurately estimate soft Q-functions and Q-functions with $L^2$-convergence guarantees measured on the offline data. We introduce these loss functions by casting the estimation problems into nonlinear convex optimization problems and taking the Lagrange functions.

Optimal Preconditioning and Fisher Adaptive Langevin Sampling
Michalis Titsias



Research question: How to optimally precondition the Langevin diffusion?
Motivation: Derive the optimal preconditioning by analytically optimizing the expected squared jumped distance.
Method: The optimal preconditioning is an inverse Fisher information covariance matrix, where the covariance matrix is computed as the outer product of log-target gradients averaged under the target. Applying this result to the Metropolis-adjusted Langevin algorithm (MALA) yields a computationally efficient adaptive MCMC scheme that learns the preconditioning from the history of gradients produced as the algorithm runs.
Results: In several experiments the proposed algorithm is very robust in high dimensions and significantly outperforms other methods, including a closely related adaptive MALA scheme that learns the preconditioning with standard adaptive MCMC and a position-dependent Riemannian manifold MALA sampler.

We define an optimal preconditioning for the Langevin diffusion by analytically optimizing the expected squared jumped distance. This yields as the optimal preconditioning an inverse Fisher information covariance matrix, where the covariance matrix is computed as the outer product of log target gradients averaged under the target. We apply this result to the Metropolis adjusted Langevin algorithm (MALA) and derive a computationally efficient adaptive MCMC scheme that learns the preconditioning from the history of gradients produced as the algorithm runs. We show in several experiments that the proposed algorithm is very robust in high dimensions and significantly outperforms other methods, including a closely related adaptive MALA scheme that learns the preconditioning with standard adaptive MCMC as well as the position-dependent Riemannian manifold MALA sampler.
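
A compact sketch of preconditioned MALA with a one-shot adaptation step, assuming a 2-d Gaussian target for which the inverse Fisher matrix equals the target covariance and so serves as a closed-form sanity check; the paper's scheme adapts the preconditioner continuously from the gradient history rather than once, and the step size and schedule here are invented.

```python
import numpy as np

def mala_step(x, log_p, grad_log_p, A, h, rng):
    """One Metropolis-adjusted Langevin step with SPD preconditioner A."""
    L = np.linalg.cholesky(A)
    mean = lambda z: z + 0.5 * h * A @ grad_log_p(z)
    prop = mean(x) + np.sqrt(h) * L @ rng.normal(size=x.shape)
    def log_q(dst, src):  # log N(dst; mean(src), h*A), up to a constant
        d = dst - mean(src)
        return -0.5 / h * d @ np.linalg.solve(A, d)
    log_alpha = log_p(prop) - log_p(x) + log_q(x, prop) - log_q(prop, x)
    return prop if np.log(rng.uniform()) < log_alpha else x

# Target: anisotropic Gaussian N(0, Sigma) with Sigma = diag(1, 0.04).
Sigma_inv = np.diag([1.0, 25.0])
log_p = lambda x: -0.5 * x @ Sigma_inv @ x
grad_log_p = lambda x: -Sigma_inv @ x

rng = np.random.default_rng(0)
x, grads, A = np.zeros(2), [], np.eye(2)
for t in range(5000):
    x = mala_step(x, log_p, grad_log_p, A, h=0.05, rng=rng)
    grads.append(grad_log_p(x))
    if t == 1000:  # adapt once: A = (average of g g^T)^{-1}, per the result
        F = np.mean([np.outer(g, g) for g in grads], axis=0)
        A = np.linalg.inv(F + 1e-6 * np.eye(2))
print(np.round(A, 2))  # should be roughly Sigma = diag(1, 0.04)
```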

Near-optimal learning with average Hölder smoothness
Guy Kornowski Steve Hanneke Aryeh Kontorovich



Research question: How to improve upon the classic worst-case Hölder constant via average Hölder smoothness.
Motivation: This measure of a function's "effective smoothness" is sensitive to the underlying distribution and can be dramatically smaller than the classic worst-case Hölder constant.
Method: Consider both the realizable and the agnostic (noisy) regression settings and prove upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon the previously known rates even in the special case of average Lipschitz smoothness.
Results: From an algorithmic perspective, since this smoothness notion is defined with respect to the unknown underlying distribution, the learner has no explicit representation of the function class and cannot execute ERM; nevertheless, distinct learning algorithms are provided that achieve nearly optimal learning rates. The results hold in any totally bounded metric space and are stated in terms of its intrinsic geometry; overall, they show that the classic worst-case notion of Hölder smoothness can essentially be replaced by its average, yielding considerably sharper guarantees.

We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.

Decentralized Matrix Sensing: Statistical Guarantees and Fast Convergence
Marie Maros Gesualdo Scutari



Research question: The matrix sensing problem from near-isotropic linear measurements distributed across a network of agents modeled as an undirected graph, with no centralized node.
Motivation: Provide the first statistical and computational/communication guarantees for a decentralized gradient algorithm that solves the nonconvex Burer-Monteiro factorization associated with low-rank matrix estimation.
Method: With small random initialization, the algorithm exhibits an approximate two-phase convergence: (i) a spectral phase that aligns the iterates' column space with the underlying low-rank matrix, mimicking centralized spectral initialization (not directly implementable over networks); and (ii) a local refinement phase that diverts the iterates from certain degenerate saddle points while ensuring swift convergence to the underlying low-rank matrix.
Results: Central to the analysis is a novel "in-network" Restricted Isometry Property that accommodates the decentralized nature of the optimization, revealing an intriguing interplay between sample complexity and network connectivity, topology, and communication complexity.

We explore the matrix sensing problem from near-isotropic linear measurements, distributed across a network of agents modeled as an undirected graph, with no centralized node. We provide the first study of statistical, computational/communication guarantees for a decentralized gradient algorithm that solves the (nonconvex) Burer-Monteiro type decomposition associated to the low-rank matrix estimation. With small random initialization, the algorithm displays an approximate two-phase convergence: (i) a spectral phase that aligns the iterates' column space with the underlying low-rank matrix, mimicking centralized spectral initialization (not directly implementable over networks); and (ii) a local refinement phase that diverts the iterates from certain degenerate saddle points, while ensuring swift convergence to the underlying low-rank matrix. Central to our analysis is a novel "in-network" Restricted Isometry Property which accommodates for the decentralized nature of the optimization, revealing an intriguing interplay between sample complexity and network connectivity, topology, and communication complexity.

An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits
Kiarash Banihashem MohammadTaghi Hajiaghayi Suho Shin Max Springer



Research question: The adversarial contextual bandit problem, where the contexts are drawn sequentially from a known distribution and the cost sequence is chosen by an online adversary.
Motivation: Existing algorithms for this problem are inefficient, requiring a large number of optimization oracle queries.
Method: Propose an efficient oracle-based relaxation that improves efficiency by reducing the number of optimization queries.
Results: The resulting regret bound improves upon the prior best result, using at most $O(K)$ offline optimization oracle calls per round, where $K$ is the number of actions, $T$ the number of rounds, and $\Pi$ the policy set.

We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle, where $K$ denotes the number of actions, $T$ denotes the number of rounds and $\Pi$ denotes the set of policies. This is the first result to improve the prior best bound of $O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$ as obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007 which was obtained for the stochastic case.

Learning Cuts via Enumeration Oracles
Daniel Thuerck Boro Sofranac Marc Pfetsch Sebastian Pokutta



Research question: How to effectively solve large-scale integer programming problems and develop effective cutting-plane methods.
Motivation: Most current cutting-plane methods rely on explicit rules for deriving valid inequalities that separate the target point from the feasible set, and approaches that directly derive facets require solving linear programming problems to obtain a hyperplane, which is inefficient.
Method: Propose a novel generic approach for learning the facets of the underlying polyhedron by accessing it implicitly via an enumeration oracle in a reduced dimension; embedding the oracle in a variant of the Frank-Wolfe algorithm generates strong cutting planes, effectively turning the enumeration oracle into a separation oracle.
Results: A case study on the multidimensional knapsack problem demonstrates the effectiveness of the approach.

Cutting-planes are one of the most important building blocks for solving large-scale integer programming (IP) problems to (near) optimality. The majority of cutting plane approaches rely on explicit rules to derive valid inequalities that can separate the target point from the feasible set. Local cuts, on the other hand, seek to directly derive the facets of the underlying polyhedron and use them as cutting planes. However, current approaches rely on solving Linear Programming (LP) problems in order to derive such a hyperplane. In this paper, we present a novel generic approach for learning the facets of the underlying polyhedron by accessing it implicitly via an enumeration oracle in a reduced dimension. This is achieved by embedding the oracle in a variant of the Frank-Wolfe algorithm which is capable of generating strong cutting planes, effectively turning the enumeration oracle into a separation oracle. We demonstrate the effectiveness of our approach with a case study targeting the multidimensional knapsack problem (MKP).
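
To fix ideas, here is generic Frank-Wolfe driven by a linear minimization oracle implemented as brute-force vertex enumeration, a stand-in for the paper's enumeration oracle; the paper's variant additionally extracts strong cutting planes from the oracle calls, which this sketch does not do. The vertex set and objective are invented for the demo.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, steps=200):
    """Generic Frank-Wolfe: each iterate calls a linear minimization oracle
    over the feasible polytope and moves toward the returned vertex."""
    x = x0.astype(float)
    for t in range(steps):
        v = lmo(grad(x))              # argmin over vertices of <grad, v>
        x += 2.0 / (t + 2) * (v - x)  # standard step size 2/(t+2)
    return x

# Feasible set: convex hull of an explicitly enumerated 0/1 vertex set, so
# the LMO is brute-force enumeration over the vertices.
rng = np.random.default_rng(0)
vertices = rng.integers(0, 2, size=(32, 10)).astype(float)
lmo = lambda g: vertices[np.argmin(vertices @ g)]

target = rng.uniform(size=10)
x = frank_wolfe(lambda x: 2 * (x - target), lmo, x0=vertices[0])
print(np.round(x[:5], 3))  # approximate projection of target onto the hull
```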

On the Convergence of No-Regret Learning Dynamics in Time-Varying Games
Ioannis Anagnostides Ioannis Panageas Gabriele Farina Tuomas Sandholm



Research question: Most research on learning in games focuses on the restrictive setting where the underlying repeated game does not change over time; little is known about the convergence of no-regret learning algorithms in dynamic multi-agent settings.
Motivation: Characterize the convergence of optimistic gradient descent (OGD) in time-varying games.
Method: Establish a framework that yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games, parameterized by natural variation measures of the sequence of games, and derive improved second-order variation bounds under strong convexity-concavity.
Results: The results also apply to time-varying general-sum multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers; the framework additionally provides new insights on dynamic regret guarantees in static games.

Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of optimistic gradient descent (OGD) in time-varying games. Our framework yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games parameterized on natural variation measures of the sequence of games, subsuming known results for static games. Furthermore, we establish improved second-order variation bounds under strong convexity-concavity, as long as each game is repeated multiple times. Our results also apply to time-varying general-sum multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers. Finally, we leverage our framework to also provide new insights on dynamic regret guarantees in static games.

Dynamic Regret of Adversarial Linear Mixture MDPs
Long-Fei Li Peng Zhao Zhi-Hua Zhou



Research question: Reinforcement learning in episodic inhomogeneous Markov decision processes (MDPs) with adversarial full-information rewards and an unknown transition kernel.
Motivation: For linear mixture MDPs, whose transition kernel is a linear mixture model, how to design reinforcement learning algorithms with strong performance in the face of non-stationarity and unknown transition kernels.
Method: Propose a novel algorithm that, when the non-stationarity measure $P_T$ is known, achieves an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T)}\big)$ dynamic regret, improving the previously best-known dynamic regret; when $P_T$ is unknown, an online ensemble algorithm with a meta-base structure is designed and proven to achieve an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T) + H^2 S_T^2}\big)$ dynamic regret.
Results: A matching $\Omega\big(\sqrt{d^2 H^3 K} + \sqrt{H K (H+P_T)}\big)$ lower bound shows the first algorithm is optimal in $K$ and $P_T$, and the unknown-$P_T$ result can be optimal under certain regimes.

We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and the unknown transition kernel. We consider the linear mixture MDPs whose transition kernel is a linear mixture model and choose the \emph{dynamic regret} as the performance measure. Denote by $d$ the dimension of the feature mapping, $H$ the horizon, $K$ the number of episodes, $P_T$ the non-stationary measure, we propose a novel algorithm that enjoys an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T)}\big)$ dynamic regret under the condition that $P_T$ is known, which improves previously best-known dynamic regret for adversarial linear mixture MDP and adversarial tabular MDPs. We also establish an $\Omega\big(\sqrt{d^2 H^3 K} + \sqrt{H K (H+P_T)}\big)$ lower bound, indicating our algorithm is \emph{optimal} in $K$ and $P_T$. Furthermore, when the non-stationary measure $P_T$ is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an $\widetilde{\mathcal{O}}\big(\sqrt{d^2 H^3K} + \sqrt{H^4(K+P_T)(1+P_T) + H^2 S_T^2}\big)$ dynamic regret and here $S_T$ is the expected switching number of the best base-learner. The result can be optimal under certain regimes.

On the Interplay between Social Welfare and Tractability of Equilibria
Ioannis Anagnostides Tuomas Sandholm



Research question: How Nash equilibria can be approached via no-regret learning algorithms in algorithmic game theory, enabling efficient computation.
Motivation: Computational tractability and social welfare (i.e., efficiency) of equilibria are two fundamental but generally orthogonal considerations in algorithmic game theory; this paper shows that when (approximate) full efficiency can be guaranteed via a smoothness argument a la Roughgarden, Nash equilibria are approachable under a family of no-regret learning algorithms, enabling fast and decentralized computation.
Method: Leverage this connection to obtain new convergence results in large games, where the number of players $n \gg 1$, via the well-documented property of full efficiency through smoothness in the limit.
Results: Surprisingly, the framework unifies equilibrium computation in disparate classes of problems, including games with vanishing strategic sensitivity and two-player zero-sum games, and reveals an overlooked equivalence between smoothness and the well-studied Minty property from the optimization literature. Finally, a family of no-regret dynamics, based on the clairvoyant mirror descent algorithm recently introduced by Piliouras et al., attains a welfare bound that improves over the smoothness framework while guaranteeing convergence to the set of coarse correlated equilibria.

Computational tractability and social welfare (aka. efficiency) of equilibria are two fundamental but in general orthogonal considerations in algorithmic game theory. Nevertheless, we show that when (approximate) full efficiency can be guaranteed via a smoothness argument a la Roughgarden, Nash equilibria are approachable under a family of no-regret learning algorithms, thereby enabling fast and decentralized computation. We leverage this connection to obtain new convergence results in large games---wherein the number of players $n \gg 1$---under the well-documented property of full efficiency via smoothness in the limit. Surprisingly, our framework unifies equilibrium computation in disparate classes of problems including games with vanishing strategic sensitivity and two-player zero-sum games, illuminating en route an immediate but overlooked equivalence between smoothness and a well-studied condition in the optimization literature known as the Minty property. Finally, we establish that a family of no-regret dynamics attains a welfare bound that improves over the smoothness framework while at the same time guaranteeing convergence to the set of coarse correlated equilibria. We show this by employing the clairvoyant mirror descent algorithm recently introduced by Piliouras et al.

Fully Dynamic $k$-Clustering in $\tilde O(k)$ Update Time
Sayan Bhattacharya Martin Nicolas Costa Silvio Lattanzi Nikos Parotsidis



Research question: Develop an $O(1)$-approximate fully dynamic algorithm for the $k$-median and $k$-means problems on metric spaces.
Motivation: Current solutions suffer from inefficient query and update times; more efficient algorithms are needed.
Method: A novel framework yields efficient handling of the $k$-median and $k$-means problems, with amortized update time $\tilde O(k)$ and worst-case query time $\tilde O(k^2)$.
Results: Experiments show the algorithm compares favorably with the state of the art for dynamic $k$-median; a lower bound is also provided showing that any $O(1)$-approximate algorithm with $\tilde O(\mathrm{poly}(k))$ query time must have $\tilde \Omega(k)$ amortized update time, providing a theoretical basis for future work.

We present a $O(1)$-approximate fully dynamic algorithm for the $k$-median and $k$-means problems on metric spaces with amortized update time $\tilde O(k)$ and worst-case query time $\tilde O(k^2)$. We complement our theoretical analysis with the first in-depth experimental study for the dynamic $k$-median problem on general metrics, focusing on comparing our dynamic algorithm to the current state-of-the-art by Henzinger and Kale [ESA'20]. Finally, we also provide a lower bound for dynamic $k$-median which shows that any $O(1)$-approximate algorithm with $\tilde O(\text{poly}(k))$ query time must have $\tilde \Omega(k)$ amortized update time, even in the incremental setting.

ReHLine: Regularized Composite ReLU-ReHU Loss Minimization with Linear Computation and Linear Convergence
Ben Dai Yixuan Qiu



Research question: Propose a new algorithm, ReHLine, for minimizing a set of regularized ERMs with convex piecewise linear-quadratic loss functions and optional linear constraints.
Motivation: Current optimization algorithms are often inefficient on complex domain-specific problems such as fair support vector machines, elastic net regularized quantile regression, and Huber minimization.
Method: ReHLine effectively handles diverse combinations of loss functions, regularization, and constraints, making it particularly well suited to complex domain-specific problems; it also enjoys a provable linear convergence rate, with a per-iteration computational complexity that scales linearly with the sample size.
Results: Experiments show that ReHLine significantly outperforms generic optimization solvers on large-scale datasets and surpasses specialized solvers for SVMs, Huber minimization, and smoothed SVMs, exhibiting excellent flexibility and efficiency.

Empirical risk minimization (ERM) is a crucial framework that offers a general approach to handling a broad range of machine learning tasks. In this paper, we propose a novel algorithm, called ReHLine, for minimizing a set of regularized ERMs with convex piecewise linear-quadratic loss functions and optional linear constraints. The proposed algorithm can effectively handle diverse combinations of loss functions, regularization, and constraints, making it particularly well-suited for complex domain-specific problems. Examples of such problems include FairSVM, elastic net regularized quantile regression, Huber minimization, etc. In addition, ReHLine enjoys a provable linear convergence rate and exhibits a per-iteration computational complexity that scales linearly with the sample size. The algorithm is implemented with both Python and R interfaces, and its performance is benchmarked on various tasks and datasets. Our experimental results demonstrate that ReHLine significantly surpasses generic optimization solvers in terms of computational efficiency on large-scale datasets. Moreover, it also outperforms specialized solvers such as Liblinear in SVMs, hqreg in Huber minimization, and Lightning (SAGA, SAG, SDCA, SVRG) in smoothed SVMs, exhibiting exceptional flexibility and efficiency. The source code, project page, accompanying software, and the Python/R interface can be accessed through the link: https://github.com/softmin/ReHLine.

Efficient Batched Algorithm for Contextual Linear Bandits with Large Action Space via Soft Elimination
Osama Hanna Lin Yang Christina Fragouli



Research question: Provide the first efficient batched algorithm for contextual linear bandits with large action spaces.
Motivation: Existing batched algorithms rely on action elimination, which is infeasible for large action sets, whereas this algorithm designs its policy using only linear optimization queries over the action set.
Method: A novel soft elimination approach that "shapes" the action sets in each batch so that (near-)optimal actions can be identified efficiently.
Results: The algorithm achieves a high-probability regret upper bound of $\tilde{O}(\sqrt{T})$ using $O(\log\log T)$ batches, matching the lower bound on the number of batches; when specialized to linear bandits, it achieves a high-probability gap-dependent regret bound of $\tilde{O}(1/\Delta_{\min})$, where $\Delta_{\min}$ is the minimum reward gap between a suboptimal arm and the optimal arm.

In this paper, we provide the first efficient batched algorithm for contextual linear bandits with large action spaces. Unlike existing batched algorithms that rely on action elimination, which are not implementable for large action sets, our algorithm only uses a linear optimization oracle over the action set to design the policy. The proposed algorithm achieves a regret upper bound $\tilde{O}(\sqrt{T})$ with high probability, and uses $O(\log\log T)$ batches, matching the lower bound on the number of batches (Gao et al., 2019). When specialized to linear bandits, our algorithm can achieve a high probability gap-dependent regret bound of $\tilde{O}(1/\Delta_{\min})$ with the optimal $\log T$ number of batches, where $\Delta_{\min}$ is the minimum reward gap between a suboptimal arm and the optimal. Our result is achieved via a novel soft elimination approach, that entails $\text{``}$shaping$\text{"}$ the action sets at each batch so that we can efficiently identify (near) optimal actions.

Feature learning via mean-field Langevin dynamics: classifying sparse parities and beyond
Taiji Suzuki Denny Wu Kazusato Oko Atsushi Nitanda



Research question: Whether existing optimization-efficiency guarantees for mean-field neural networks lead to improved generalization performance and sample complexity, given the presence of feature learning.
Motivation: Although mean-field neural networks excel at feature learning, the statistical effect of optimization algorithms such as mean-field Langevin dynamics (MFLD) has remained unclear.
Method: Study the statistical and computational complexity of MFLD on a class of binary classification problems via a new analysis framework that avoids the usual norm control, instead exploiting the perspective that MFLD optimizes the distribution of parameters rather than the parameters themselves.
Results: Applied to learning $k$-sparse parity functions, the framework shows that, unlike kernel methods, two-layer neural networks optimized by MFLD achieve a sample complexity in which the degree $k$ is "decoupled" from the exponent of the dimension dependence.

Neural network in the mean-field regime is known to be capable of \textit{feature learning}, unlike the kernel (NTK) counterpart. Recent works have shown that mean-field neural networks can be globally optimized by a noisy gradient descent update termed the \textit{mean-field Langevin dynamics} (MFLD). However, all existing guarantees for MFLD only considered the \textit{optimization} efficiency, and it is unclear if this algorithm leads to improved \textit{generalization} performance and sample complexity due to the presence of feature learning. To fill this gap, in this work we study the statistical and computational complexity of MFLD in learning a class of binary classification problems. Unlike existing margin bounds for neural networks, we avoid the typical norm control by utilizing the perspective that MFLD optimizes the \textit{distribution} of parameters rather than the parameter itself; this leads to an improved analysis of the sample complexity and convergence rate. We apply our general framework to the learning of $k$-sparse parity functions, where we prove that unlike kernel methods, two-layer neural networks optimized by MFLD achieves a sample complexity where the degree $k$ is ``decoupled'' from the exponent in the dimension dependence.
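
A tiny sketch of the learning target: $k$-sparse parity labels are uncorrelated with every monomial of degree below $k$, which is why fixed (kernel) features struggle and feature learning is needed. The dimensions and sample size below are arbitrary demo choices; the numbers just verify the correlation structure on sampled data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 3, 100000
X = rng.choice([-1.0, 1.0], size=(n, d))
y = X[:, :k].prod(axis=1)        # k-sparse parity: product of first k coords

single = np.abs(X.T @ y).max() / n               # best single-coordinate corr.
pair = abs((X[:, 0] * X[:, 1]) @ y) / n          # a degree-2 monomial
full = abs(X[:, :k].prod(axis=1) @ y) / n        # the right degree-k monomial
print(single, pair, full)                        # ~0, ~0, exactly 1
```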

On the Last-iterate Convergence in Time-varying Zero-sum Games: Extra Gradient Succeeds where Optimism Fails
Yi Feng Hu Fu Qun Hu Ping Li Ioannis Panageas bo peng Xiao Wang



Research question: The last-iterate behavior of various algorithms in time-varying environments.
Motivation: Although last-iterate convergence has been established for several algorithms in static settings, their behavior in time-varying environments remains unclear.
Method: Analyze the last-iterate behavior of various algorithms in two types of unconstrained, time-varying bilinear zero-sum games: periodic games and convergent perturbed games.
Results: In periodic games, EG converges while OGDA and the momentum method diverge; in convergent perturbed games, all of these algorithms converge as long as the game itself stabilizes at a rate faster than $1/t$.

Last-iterate convergence has received extensive study in two player zero-sum games starting from bilinear, convex-concave up to settings that satisfy the MVI condition. Typical methods that exhibit last-iterate convergence for the aforementioned games include extra-gradient (EG) and optimistic gradient descent ascent (OGDA). However, all the established last-iterate convergence results hold for the restrictive setting where the underlying repeated game does not change over time. Recently, a line of research has focused on regret analysis of OGDA in time-varying games, i.e., games where payoffs evolve with time; the last-iterate behavior of OGDA and EG in time-varying environments remains unclear though. In this paper, we study the last-iterate behavior of various algorithms in two types of unconstrained, time-varying, bilinear zero-sum games: periodic and convergent perturbed games. These models expand upon the usual repeated game formulation and incorporate external environmental factors, such as the seasonal effects on species competition and vanishing external noise. In periodic games, we prove that EG will converge while OGDA and momentum method will diverge. This is quite surprising, as to the best of our knowledge, it is the first result that indicates EG and OGDA have qualitatively different last-iterate behaviors and do not exhibit similar behavior. In convergent perturbed games, we prove all these algorithms converge as long as the game itself stabilizes with a faster rate than $1/t$.
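
The following toy experiment, a period-2 scalar bilinear game chosen for illustration rather than taken from the paper's constructions, reproduces the headline phenomenon: EG contracts to the equilibrium while OGDA's stale optimistic correction carries the wrong sign, so its iterates blow up. Step size and horizon are arbitrary demo choices.

```python
import numpy as np

# Period-2 bilinear zero-sum game f_t(x, y) = a_t * x * y with a_t = (-1)^t.
# Player x minimizes, player y maximizes; the unique equilibrium is (0, 0).
def run(method, eta=0.1, T=1000):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = 0.0, 0.0
    for t in range(T):
        a = (-1.0) ** t
        gx, gy = a * y, a * x                # grad_x f_t, grad_y f_t
        if method == "EG":                   # extra-gradient: probe, then step
            xm, ym = x - eta * gx, y + eta * gy
            x, y = x - eta * a * ym, y + eta * a * xm
        else:                                # OGDA: optimistic gradient
            x, y = (x - 2 * eta * gx + eta * gx_prev,
                    y + 2 * eta * gy - eta * gy_prev)
            gx_prev, gy_prev = gx, gy
    return np.hypot(x, y)                    # distance to equilibrium

print("EG  :", run("EG"))    # contracts toward 0 in this periodic game
print("OGDA:", run("OGDA"))  # grows: the stale gradient has the wrong sign
```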

An Adaptive Algorithm for Learning with Unknown Distribution Drift
Alessio Mazzetto Eli Upfal



Research question: Develop and analyze a general technique for learning with an unknown distribution drift.
Motivation: Given a sequence of independent observations from the last $T$ steps of a drifting distribution, the algorithm should learn with respect to the current distribution at time $T$.
Method: No prior knowledge of the magnitude of the drift is required; instead, the algorithm adapts to the sample data.
Results: Applications to two fundamental learning scenarios, binary classification and linear regression, show a learning error better than that of algorithms relying on loose bounds on the drift.

We develop and analyze a general technique for learning with an unknown distribution drift. Given a sequence of independent observations from the last $T$ steps of a drifting distribution, our algorithm agnostically learns a family of functions with respect to the current distribution at time $T$. Unlike previous work, our technique does not require prior knowledge about the magnitude of the drift. Instead, the algorithm adapts to the sample data. Without explicitly estimating the drift, the algorithm learns a family of functions with almost the same error as a learning algorithm that knows the magnitude of the drift in advance. Furthermore, since our algorithm adapts to the data, it can guarantee a better learning error than an algorithm that relies on loose bounds on the drift. We demonstrate the application of our technique in two fundamental learning scenarios: binary classification and linear regression.

Regression with Cost-based Rejection
Xin Cheng Yuzhou Cao Haobo Wang Hongxin Wei Bo An Lei Feng



Research question: Cost-based rejection learning for regression, balancing prediction and rejection to avoid critical mispredictions.
Motivation: Previous studies on cost-based rejection focus on the classification setting and cannot handle the continuous and infinite target space of regression.
Method: First formulate the expected risk for this problem and derive the Bayes optimal solution, which shows that when mean squared error is the evaluation metric, the optimal model should reject predictions on instances whose variance exceeds the rejection cost; then propose training the model via a surrogate loss that treats rejection as binary classification, with conditions provided for model consistency.
Results: Extensive experiments demonstrate the effectiveness of the proposed method.

Learning with rejection is an important framework that can refrain from making predictions to avoid critical mispredictions by balancing between prediction and rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can reject to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should reject to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.
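
A small sketch of the Bayes-optimal rule the abstract states, on synthetic heteroscedastic data where the conditional mean and variance are known oracles for the demo; the paper instead learns the rejection behavior through a consistent surrogate loss, which this sketch does not implement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heteroscedastic data: y = sin(2*pi*x) + noise whose scale grows with x.
n = 5000
x = rng.uniform(0, 1, n)
noise_sd = 0.1 + 0.9 * x                   # conditional std dev, known here
y = np.sin(2 * np.pi * x) + noise_sd * rng.normal(size=n)

def predict_with_rejection(x, cost):
    """Bayes-optimal rule under squared error: predict the conditional mean,
    reject whenever the conditional variance exceeds the rejection cost."""
    mean = np.sin(2 * np.pi * x)           # true E[Y | x] (oracle for demo)
    var = (0.1 + 0.9 * x) ** 2             # true Var(Y | x)
    return mean, var > cost

for cost in (0.05, 0.25, 1.0):
    mean, reject = predict_with_rejection(x, cost)
    risk = np.where(reject, cost, (y - mean) ** 2).mean()
    print(f"cost={cost:.2f}  reject rate={reject.mean():.2f}  risk={risk:.3f}")
```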

Estimating the Rate-Distortion Function by Wasserstein Gradient Descent
Yibo Yang Stephan Eckstein Marcel Nutz Stephan Mandt



Research question: Propose a new method for estimating the rate-distortion function from the perspective of optimal transport.
Motivation: The classic Blahut-Arimoto algorithm must fix the support of the reproduction distribution in advance, whereas this method learns the support of the optimal reproduction distribution by moving particles.
Method: Propose a Wasserstein gradient descent algorithm that learns the support of the optimal reproduction distribution by moving particles.
Results: Experiments obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation; a connection to maximum-likelihood deconvolution is also highlighted, and a new class of test sources with known solutions is introduced.

In the theory of lossy compression, the rate-distortion (R-D) function $R(D)$ describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining $R(D)$ for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate $R(D)$ from the perspective of optimal transport. Unlike the classic Blahut--Arimoto algorithm which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.

First- and Second-Order Bounds for Adversarial Linear Contextual Bandits
Julia Olkhovskaya Jack Mayo Tim van Erven Gergely Neu Chen-Yu Wei



Research question: How to minimize expected regret in the adversarial linear contextual bandit setting.
Motivation: With contexts drawn from a fixed known distribution, the worst-case expected regret of existing methods grows with the horizon, the dimension, and the number of actions.
Method: Analyze a truncated version of the continuous exponential weights algorithm over the probability simplex, exploiting a novel connection to the context-free linear bandit setting, to derive new bounds in terms of the cumulative second moment of the learner's losses and the cumulative loss of the best policy.
Results: Since these quantities may be significantly smaller than $T$, the resulting bounds improve over the worst-case regret whenever the environment is relatively benign.

We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{Kd T})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.

Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes
Emmeran Johnson Ciara Pike-Burke Patrick Rebeschini



Research question: Address the instability of policy iteration in reinforcement learning and investigate how, under exact policy evaluation, unregularised policy mirror descent (PMD) regularises the policy improvement step.
Motivation: Motivated by the instability of policy iteration with inexact policy evaluation, unregularised PMD algorithmically regularises the policy improvement step of policy iteration without regularising the objective function.
Method: With exact policy evaluation, bridge policy iteration and PMD, showing that the dimension-free $\gamma$-rate of policy iteration is achieved by the general family of unregularised PMD algorithms under an adaptive step-size.
Results: This is the first work to relate PMD to rate-optimality and step-size necessity; the analysis also extends to the inexact setting and establishes the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.

Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, unregularised PMD algorithmically regularises the policy improvement step of PI without regularising the objective function. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor $\gamma$ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the $\gamma$-rate is optimal for PMD methods as well as PI and that the adaptive step-size is necessary to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.

On the Convergence of CART under Sufficient Impurity Decrease Condition
Rahul Mazumder Haoyue Wang



Research question: Study the convergence rate of CART in a regression setting.
Motivation: Decision trees are flexible machine learning models that succeed in numerous applications and are usually fitted in a recursively greedy manner using CART; this work examines the prediction error of CART under a sufficient impurity decrease (SID) condition.
Method: First, prove an upper bound on the prediction error of CART under the SID condition, and show via examples that this error bound cannot be improved by more than a constant or a log factor; second, introduce several easy-to-check sufficient conditions for SID, in particular showing that additive models satisfy SID when the component functions satisfy a locally reverse Poincaré inequality.
Results: The prediction error bound improves over the previously known result under a similar assumption, and several familiar function classes in nonparametric estimation are shown to satisfy the sufficient conditions.

The decision tree is a flexible machine-learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we study the convergence rate of CART under a regression setting. First, we prove an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition \cite{chi2020asymptotic} -- our result is an improvement over the known result by \cite{chi2020asymptotic} under a similar assumption. We show via examples that this error bound cannot be further improved by more than a constant or a log factor. Second, we introduce a few easy-to-check sufficient conditions of the SID condition. In particular, we show that the SID condition can be satisfied by an additive model when the component functions satisfy a ``locally reverse Poincare inequality". We discuss a few familiar function classes in non-parametric estimation to demonstrate the usefulness of this conception.

Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization
Sanath Kumar Krishnamurthy Ruohan Zhan Susan Athey Emma Brunskill



Research question: How to design computationally efficient algorithms for the stochastic contextual bandit setting that minimize both cumulative regret and simple regret.
Motivation: In many applications, such as healthcare and e-commerce, the goal is to learn an optimal treatment assignment policy at the end of the experiment, i.e., to minimize simple regret; this objective remains understudied.
Method: Propose a new family of computationally efficient bandit algorithms built on "conformal arm sets" (CASs); the family works with any function class, is robust to model misspecification, and can be used in continuous arm settings.
Results: The algorithms achieve near-optimal minimax guarantees for cumulative regret and state-of-the-art guarantees for simple regret; these positive results are contrasted with a negative result showing that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.

In many applications, e.g. in healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment. That is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on “conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.

Nearly Optimal Bounds for Cyclic Forgetting
William Joseph Swartworth Deanna Needell Rachel Ward Mark Kong Halyun Jeong



Research question: Provide theoretical bounds on the forgetting quantity in the continual learning setting for linear tasks.
Motivation: In this setting each round of learning corresponds to projecting onto a linear subspace, and understanding forgetting is important for improving learning methods.
Method: For a cyclic ordering on $T$ tasks repeated $m$ times each, prove the best known upper bound of $O(T^2/m)$ on the forgetting, holding uniformly over all choices of tasks and independently of the ambient dimension.
Results: The main technical contribution is a characterization of the union of all numerical ranges of products of $T$ (real or complex) projections as a sinusoidal spiral, which may be of independent interest.

We provide theoretical bounds on the forgetting quantity in the continual learning setting for linear tasks, where each round of learning corresponds to projecting onto a linear subspace. For a cyclic task ordering on $T$ tasks repeated $m$ times each, we prove the best known upper bound of $O(T^2/m)$ on the forgetting. Notably, our bound holds uniformly over all choices of tasks and is independent of the ambient dimension. Our main technical contribution is a characterization of the union of all numerical ranges of products of $T$ (real or complex) projections as a sinusoidal spiral, which may be of independent interest.
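
A short simulation of the setting, assuming random linear constraint sets that share a common solution: each task update is an exact projection onto the task's solution set, and forgetting after $m$ cyclic passes is measured as the average task loss of the final iterate. On generic random instances the decay is typically faster than the worst case; the paper's $O(T^2/m)$ bound is what holds uniformly over all task choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 50, 5
# Task t is the affine set {w : A_t w = A_t w_star}; training on task t
# projects the current iterate onto that set (an exact least-squares fit).
w_star = rng.normal(size=d)
tasks = [rng.normal(size=(10, d)) for _ in range(T)]
null_projs = []
for A in tasks:
    P_row = A.T @ np.linalg.solve(A @ A.T, A)  # projector onto row space of A
    null_projs.append(np.eye(d) - P_row)       # projector onto its null space

def forgetting(m):
    """Average task loss after m full cyclic passes over the T tasks."""
    w = np.zeros(d)
    for _ in range(m):
        for A, Pn in zip(tasks, null_projs):
            w = w_star + Pn @ (w - w_star)     # project onto task's solutions
    return np.mean([np.linalg.norm(A @ (w - w_star)) ** 2 for A in tasks])

for m in (1, 4, 16, 64):
    # Decays with m; the paper's worst-case bound is O(T^2 / m).
    print(f"m={m:3d}  forgetting={forgetting(m):.2e}")
```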

Anytime Model Selection in Linear Bandits
Parnian Kassraie Nicolas Emmenegger Andreas Krause Aldo Pacchiano



Research question: Model selection in bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection but also for model selection.
Motivation: The key insight is that, for model selection in linear bandits, full-information feedback can be emulated to the online learner with a favorable bias-variance trade-off.
Method: Develop ALEXP, whose regret has an exponentially improved ($\log M$) dependence on the number of models $M$.
Results: ALEXP has anytime guarantees on its regret, requiring neither knowledge of the horizon $n$ nor an initial purely exploratory stage; the approach uses a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.

Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\mathrm{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.

Rigorous Runtime Analysis of MOEA/D for Solving Multi-Objective Minimum Weight Base Problems
Anh Viet Do Aneta Neumann Frank Neumann Andrew M. Sutton



Research question: Study the multi-objective minimum weight base problem, an abstraction of classical NP-hard combinatorial problems such as the multi-objective minimum spanning tree problem.
Motivation: To solve this class of problems, give a rigorous theoretical runtime analysis of the MOEA/D evolutionary algorithm.
Method: First prove important properties of the convex hull of the non-dominated front, then use them to give the first runtime analysis of MOEA/D for this problem; with an appropriate decomposition setting, MOEA/D finds all extreme points within expected fixed-parameter polynomial time in the oracle model.
Results: Experiments on random bi-objective minimum spanning tree instances agree with the theoretical findings, and compared with GSEMO, an evolutionary algorithm previously studied for this problem, MOEA/D finds all extreme points much faster across all instances.

We study the multi-objective minimum weight base problem, an abstraction of classical NP-hard combinatorial problems such as the multi-objective minimum spanning tree problem. We prove some important properties of the convex hull of the non-dominated front, such as its approximation quality and an upper bound on the number of extreme points. Using these properties, we give the first run-time analysis of the MOEA/D algorithm for this problem, an evolutionary algorithm that effectively optimizes by decomposing the objectives into single-objective components. We show that the MOEA/D, given an appropriate decomposition setting, finds all extreme points within expected fixed-parameter polynomial time, in the oracle model. Experiments are conducted on random bi-objective minimum spanning tree instances, and the results agree with our theoretical findings. Furthermore, compared with a previously studied evolutionary algorithm for the problem GSEMO, MOEA/D finds all extreme points much faster across all instances.

Statistical Analysis of Quantum State Learning Process in Quantum Neural Networks
Hao-Kai Zhang Chenghong Zhu Mingrui Jing Xin Wang



Research question: Study the problem of learning unknown quantum states with quantum neural networks (QNNs).
Motivation: QNNs are a promising framework for pursuing near-term quantum advantage in various fields, where many applications can be viewed as learning a quantum state that encodes useful data.
Method: Develop a no-go theorem for learning an unknown quantum state with QNNs, even when starting from a high-fidelity initial state.
Results: The results hold for any circuit structure and initialization strategy, and apply to both fixed ansatzes and adaptive methods; they place generic limits on good initial guesses and adaptive methods for improving the learnability and scalability of QNNs, and deepen the understanding of the role of prior information in QNNs.

Quantum neural networks (QNNs) have been a promising framework in pursuing near-term quantum advantage in various fields, where many applications can be viewed as learning a quantum state that encodes useful data. As a quantum analog of probability distribution learning, quantum state learning is theoretically and practically essential in quantum machine learning. In this paper, we develop a no-go theorem for learning an unknown quantum state with QNNs even starting from a high-fidelity initial state. We prove that when the loss value is lower than a critical threshold, the probability of avoiding local minima vanishes exponentially with the qubit count, while only grows polynomially with the circuit depth. The curvature of local minima is concentrated to the quantum Fisher information times a loss-dependent constant, which characterizes the sensibility of the output state with respect to parameters in QNNs. These results hold for any circuit structures, initialization strategies, and work for both fixed ansatzes and adaptive methods. Extensive numerical simulations are performed to validate our theoretical results. Our findings place generic limits on good initial guesses and adaptive methods for improving the learnability and scalability of QNNs, and deepen the understanding of prior information's role in QNNs.

Zero-sum Polymatrix Markov Games: Equilibrium Collapse and Efficient Computation of Nash Equilibria
Fivos Kalogiannis Ioannis Panageas



Research question: Computing Nash equilibria in multi-player Markov games is computationally hard; can this intractability be circumvented by focusing on specific classes of Markov games?
Motivation: The works of Daskalakis et al. (2009, 2022), Jin et al. (2022), and Deng et al. (2023) indicate that computing Nash equilibria in multi-player Markov games is a computationally hard task, which raises the question of whether the hardness can be avoided for special classes of games.
Method: Inspired by zero-sum polymatrix normal-form games (Cai et al., 2016), we define a class of zero-sum multi-agent Markov games in which there are only pairwise interactions, described by a graph that changes per state.
Results: For this class of Markov games, we show that an ε-approximate Nash equilibrium can be found efficiently. To do so, we generalize the techniques of Cai et al. (2016) by proving that the set of coarse correlated equilibria collapses to the set of Nash equilibria; afterwards, any algorithm in the literature that computes approximate coarse correlated equilibria with Markovian policies can be used to obtain an approximate Nash equilibrium.

The works of (Daskalakis et al., 2009, 2022; Jin et al., 2022; Deng et al., 2023) indicate that computing Nash equilibria in multi-player Markov games is a computationally hard task. This fact raises the question of whether or not computational intractability can be circumvented if one focuses on specific classes of Markov games. One such example is two-player zero-sum Markov games, in which efficient ways to compute a Nash equilibrium are known. Inspired by zero-sum polymatrix normal-form games (Cai et al., 2016), we define a class of zero-sum multi-agent Markov games in which there are only pairwise interactions described by a graph that changes per state. For this class of Markov games, we show that an $\epsilon$-approximate Nash equilibrium can be found efficiently. To do so, we generalize the techniques of (Cai et al., 2016), by showing that the set of coarse-correlated equilibria collapses to the set of Nash equilibria. Afterwards, it is possible to use any algorithm in the literature that computes approximate coarse-correlated equilibria Markovian policies to get an approximate Nash equilibrium.

Most Neural Networks Are Almost Learnable
Amit Daniely Nathan Srebro Gal Vardi



Research question: Develop an efficient algorithm for learning random constant-depth networks.
Motivation: To address the problem of learning such networks efficiently, with provable guarantees.
Method: We present a polynomial-time approximation scheme (PTAS) for learning random Xavier networks of fixed depth up to a fixed additive error.
Results: The algorithm has time and sample complexity $(\bar{d})^{\mathrm{poly}(\epsilon^{-1})}$, where $\bar d$ is the network size; for some sigmoid and ReLU-like activations the bound improves to $(\bar{d})^{\mathrm{polylog}(\epsilon^{-1})}$, giving a quasi-polynomial-time algorithm for learning constant-depth random networks.

We present a PTAS for learning random constant-depth networks. We show that for any fixed $\epsilon>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $\epsilon$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(\epsilon^{-1})}$, where $\bar d$ is the size of the network. For some cases of sigmoid and ReLU-like activations the bound can be improved to $(\bar{d})^{\mathrm{polylog}(\epsilon^{-1})}$, resulting in a quasi-poly-time algorithm for learning constant depth random networks.

Quantum Bayesian Optimization
Zhongxiang Dai Gregory Kang Ruey Lau Arun Verma Yao Shu Bryan Kian Hsiang Low Patrick Jaillet



Research question: How to optimize complicated black-box reward functions.
Motivation: Classical Bayesian optimization algorithms face a regret lower bound of $\Omega(\sqrt{T})$, and existing quantum bandit works are restricted to multi-armed or linear bandits, so they cannot handle sophisticated real-world problems with non-linear reward functions.
Method: We propose the quantum Gaussian process upper confidence bound (Q-GP-UCB) algorithm, which leverages quantum computing to achieve better regret.
Results: Q-GP-UCB is the first Bayesian optimization algorithm to achieve a regret upper bound of $\mathcal{O}(\text{poly}\log T)$, and it also shows potential advantages in practice.

Kernelized bandits, also known as Bayesian optimization (BO), has been a prevalent method for optimizing complicated black-box reward functions. Various BO algorithms have been theoretically shown to enjoy upper bounds on their cumulative regret which are sub-linear in the number $T$ of iterations, and a regret lower bound of $\Omega(\sqrt{T})$ has been derived which represents the unavoidable regrets for any classical BO algorithm. Recent works on quantum bandits have shown that with the aid of quantum computing, it is possible to achieve tighter regret upper bounds better than their corresponding classical lower bounds. However, these works are restricted to either multi-armed or linear bandits, and are hence not able to solve sophisticated real-world problems with non-linear reward functions. To this end, we introduce the quantum-Gaussian process-upper confidence bound (Q-GP-UCB) algorithm. To the best of our knowledge, our Q-GP-UCB is the first BO algorithm able to achieve a regret upper bound of $\mathcal{O}(\text{poly}\log T)$, which is significantly smaller than its regret lower bound of $\Omega(\sqrt{T})$ in the classical setting. Moreover, thanks to our novel analysis of the confidence ellipsoid, our Q-GP-UCB with the linear kernel achieves a smaller regret than the quantum linear UCB algorithm from the previous work. We use simulations, as well as an experiment using a real quantum computer, to verify that the theoretical quantum speedup achieved by our Q-GP-UCB is also potentially relevant in practice.

Minimum Description Length and Generalization Guarantees for Representation Learning
Milad Sefidgaran Abdellatif Zaidi Piotr Krasnowski



Research question: A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on the available training samples but also on unseen data.
Motivation: Although representation learning has attracted great interest, most existing approaches are heuristic, and very little is known about theoretical generalization guarantees.
Method: We establish a compressibility framework that lets us derive upper bounds on the generalization error of a representation learning algorithm in terms of the "Minimum Description Length" (MDL) of the labels or the latent variables (representations).
Results: The new bounds reflect the structure of the encoder and are not vacuous for deterministic algorithms. Our compressibility approach, information-theoretic in nature, builds on the Blum-Langford PAC-MDL bounds and introduces two essential ingredients: block coding and lossy compression. Finally, we partly exploit the theoretical results by introducing a new data-dependent prior; numerical simulations illustrate the advantages of a well-chosen such prior over the classical priors used in IB.

A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While the study of representation learning has spurred much interest, most existing such approaches are heuristic; and very little is known about theoretical generalization guarantees. For example, the information bottleneck method seeks a good generalization by finding a minimal description of the input that is maximally informative about the label variable, where minimality and informativeness are both measured by Shannon’s mutual information. In this paper, we establish a compressibility framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm in terms of the "Minimum Description Length'' (MDL) of the labels or the latent variables (representations). Rather than the mutual information between the encoder’s input and the representation, which is often believed to reflect the algorithm’s generalization capability in the related literature but in fact, falls short of doing so, our new bounds involve the "multi-letter" relative entropy between the distribution of the representations (or labels) of the training and test sets and a fixed prior. In particular, these new bounds reflect the structure of the encoder and are not vacuous for deterministic algorithms. Our compressibility approach, which is information-theoretic in nature, builds upon that of Blum-Langford for PAC-MDL bounds and introduces two essential ingredients: block-coding and lossy-compression. The latter allows our approach to subsume the so-called geometrical compressibility as a special case. To the best knowledge of the authors, the established generalization bounds are the first of their kind for Information Bottleneck type encoders and representation learning. Finally, we partly exploit the theoretical results by introducing a new data-dependent prior. Numerical simulations illustrate the advantages of well-chosen such priors over classical priors used in IB.

Batch Bayesian Optimization For Replicable Experimental Design
Zhongxiang Dai Quoc Phong Nguyen Sebastian Shenghong Tay Daisuke Urano Richalynn Leong Bryan Kian Hsiang Low Patrick Jaillet



Research question: Under a fixed total budget, how to evaluate multiple experimental conditions and replicate each condition several times, while coping with large heteroscedastic observation noise.
Motivation: In real-world experimental design, large and heteroscedastic observation noise forces a trade-off between evaluating more unique conditions and replicating each condition fewer times, and practitioners may additionally be risk-averse.
Method: We propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, comprising three algorithms. BTS-RED-Known and BTS-RED-Unknown, for known and unknown noise variance respectively, choose the number of replications adaptively to handle noise heteroscedasticity.
Results: The effectiveness of the algorithms is demonstrated in two practical real-world applications: precision agriculture and AutoML.

Many real-world experimental design problems (a) evaluate multiple experimental conditions in parallel and (b) replicate each condition multiple times due to large and heteroscedastic observation noise. Given a fixed total budget, this naturally induces a trade-off between evaluating more unique conditions while replicating each of them fewer times vs. evaluating fewer unique conditions and replicating each more times. Moreover, in these problems, practitioners may be risk-averse and hence prefer an input with both good average performance and small variability. To tackle both challenges, we propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, which encompasses three algorithms. Our BTS-RED-Known and BTS-RED-Unknown algorithms, for, respectively, known and unknown noise variance, choose the number of replications adaptively rather than deterministically such that an input with a larger noise variance is replicated more times. As a result, despite the noise heteroscedasticity, both algorithms enjoy a theoretical guarantee and are asymptotically no-regret. Our Mean-Var-BTS-RED algorithm aims at risk-averse optimization and is also asymptotically no-regret. We also show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.

Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning
Ahmadreza Moradipari Mohammad Pedramfar Modjtaba Shokrian Zini Vaneet Aggarwal



Research question: This paper aims to prove state-of-the-art Bayesian regret bounds for Thompson Sampling in reinforcement learning.
Motivation: To refine existing analyses of the information ratio, and to bound the Kolmogorov $l_1$-dimension of the space of environments in time-inhomogeneous reinforcement learning problems.
Method: Through a refined analysis of the information ratio and of the Kolmogorov $l_1$-dimension of the environment space, we obtain an improved regret bound for Thompson Sampling.
Results: We find concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear, and finite mixtures, and discuss how these results improve the state of the art.

In this paper, we prove state-of-the-art Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We present a refined analysis of the information ratio, and show an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time inhomogeneous reinforcement learning problem where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1-$dimension of the space of environments. We then find concrete bounds of $d_{l_1}$ in a variety of settings, such as tabular, linear and finite mixtures, and discuss how our results improve the state-of-the-art.

Cascading Contextual Assortment Bandits
Hyunjun Choi Rajan Udwani Min-hwan Oh



Research question: We propose a new combinatorial bandit model, the cascading contextual assortment bandit, and design algorithms for it.
Motivation: This model generalizes both existing cascading bandits and assortment bandits, broadening their applicability in practice.
Method: We design the first UCB-based algorithm for this model, UCB-CCA, and prove that it achieves a regret upper bound sharper than existing bounds for cascading contextual bandits. To improve the dependence on a problem-dependent constant, we design a second algorithm, UCB-CCA+, which leverages a new Bernstein-type concentration result.
Results: Numerical experiments substantiate the theoretical claims and confirm the practical efficacy of the proposed methods.

We present a new combinatorial bandit model, the \textit{cascading contextual assortment bandit}. This model serves as a generalization of both existing cascading bandits and assortment bandits, broadening their applicability in practice. For this model, we propose our first UCB bandit algorithm, UCB-CCA. We prove that this algorithm achieves a $T$-step regret upper-bound of $\tilde{\mathcal{O}}(\frac{1}{\kappa}d\sqrt{T})$, sharper than existing bounds for cascading contextual bandits by eliminating dependence on cascade length $K$. To improve the dependence on problem-dependent constant $\kappa$, we introduce our second algorithm, UCB-CCA+, which leverages a new Bernstein-type concentration result. This algorithm achieves $\tilde{\mathcal{O}}(d\sqrt{T})$ without dependence on $\kappa$ in the leading term. We substantiate our theoretical claims with numerical experiments, demonstrating the practical efficacy of our proposed methods.

Towards Optimal Effective Resistance Estimation
Rajat Vadiraj Dwaraknath Ishani Karmarkar Aaron Sidford



Research question: How to efficiently estimate effective resistances in undirected expander graphs.
Motivation: Existing algorithms for estimating effective resistances in such graphs have high running times or unfavorable accuracy dependence.
Method: We give an $\widetilde{O}(m\epsilon^{-1})$-time algorithm that produces an $\widetilde{O}(n\epsilon^{-1})$-bit sketch from which the effective resistance between any pair of nodes can be estimated to $(1 \pm \epsilon)$-multiplicative accuracy in $\widetilde{O}(1)$ time.
Results: This yields an $\widetilde{O}(m\epsilon^{-1})$-time algorithm for estimating the effective resistances of all edges, improving the previous fastest runtimes for sparse graphs; we complement it with a conditional $\widetilde{\Omega}(n^2 \epsilon^{-1/2})$ lower bound and extend the underlying tools to sketching pseudoinverses of positive semidefinite matrices and estimating functions of their eigenvalues.

We provide new algorithms and conditional hardness for the problem of estimating effective resistances in $n$-node $m$-edge undirected, expander graphs. We provide an $\widetilde{O}(m\epsilon^{-1})$-time algorithm that produces, with high probability, an $\widetilde{O}(n\epsilon^{-1})$-bit sketch from which the effective resistance between any pair of nodes can be estimated, to $(1 \pm \epsilon)$-multiplicative accuracy, in $\widetilde{O}(1)$-time. Consequently, we obtain an $\widetilde{O}(m\epsilon^{-1})$-time algorithm for estimating the effective resistance of all edges in such graphs, improving (for sparse graphs) on the previous fastest runtimes of $\widetilde{O}(m\epsilon^{-3/2})$ [Chu et al. 2018] and $\widetilde{O}(n^2\epsilon^{-1})$ [Jambulapati, Sidford, 2018] for general graphs and $\widetilde{O}(m + n\epsilon^{-2})$ for expanders [Li, Sachdeva 2022]. We complement this result by showing a conditional lower bound that a broad set of algorithms for computing such estimates of the effective resistances between all pairs of nodes require $\widetilde{\Omega}(n^2 \epsilon^{-1/2})$-time, improving upon the previous best such lower bound of $\widetilde{\Omega}(n^2 \epsilon^{-1/13})$ [Musco et al. 2017]. Further, we leverage the tools underlying these results to obtain improved algorithms and conditional hardness for more general problems of sketching the pseudoinverse of positive semidefinite matrices and estimating functions of their eigenvalues.

Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization
Shu Tew Mario Boley Daniel F. Schmidt



Research question: How to tune the regularization hyper-parameter $\lambda$ of ridge regression more efficiently.
Motivation: Current tuning approaches are computationally expensive and can get stuck in bad local optima, leading to poor solutions.
Method: We propose a new method based on a Bayesian formulation of ridge regression, learned iteratively with an expectation-maximization (EM) algorithm, which finds a unique optimal solution on sufficiently large datasets.
Results: The method performs well on large datasets with low computational cost, improving both the efficiency and the accuracy of tuning the ridge hyper-parameter.

We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
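
The EM loop itself is standard for Bayesian ridge regression. Below is a minimal naive sketch (names illustrative; $O(p^3)$ per iteration rather than the paper's $O(\min(n, p))$, which needs the preprocessing step described above):

```python
import numpy as np

def ridge_em(X, y, n_iter=100):
    """EM for Bayesian ridge: y = X b + e with b ~ N(0, tau2 I) and
    e ~ N(0, sig2 I); the implied ridge penalty is lam = sig2 / tau2.
    Naive O(p^3)-per-iteration sketch of the idea, not the paper's
    fast implementation."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    sig2, tau2 = float(np.var(y)), 1.0
    for _ in range(n_iter):
        lam = sig2 / tau2
        A = XtX + lam * np.eye(p)
        m = np.linalg.solve(A, Xty)          # E-step: posterior mean of b
        S = sig2 * np.linalg.inv(A)          # E-step: posterior covariance
        tau2 = (m @ m + np.trace(S)) / p     # M-step: prior variance
        resid = y - X @ m
        sig2 = (resid @ resid + np.trace(XtX @ S)) / n  # M-step: noise var
    return sig2 / tau2, m                    # learned lambda and coefficients
```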

BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization
Chen Fan Gaspard Choné-Ducasse Mark Schmidt Christos Thrampoulidis



Research question: In existing bi-level optimization algorithms, the two coupled learning rates are affected by approximation errors when computing hypergradients, so careful fine-tuning is needed to ensure fast convergence.
Motivation: To alleviate this issue, we investigate recently proposed adaptive step-size methods, namely stochastic line search (SLS) and stochastic Polyak step size (SPS), for computing both the upper- and lower-level learning rates.
Method: We revisit SLS and SPS in single-level optimization without the interpolation condition that is typically assumed. For these settings, we investigate new SLS and SPS variants that improve upon existing suggestions in the literature and are simpler to implement. Importantly, both variants can be seen as special instances of a general family of methods with an envelope-type step size; this unified envelope strategy allows the algorithms and their convergence guarantees to be extended to the bi-level optimization (BO) setting.
Results: Extensive experiments show that the new algorithms (available in SGD and Adam versions) find large learning rates with minimal tuning and converge faster than the corresponding vanilla SGD or Adam BO algorithms that require fine-tuning.

The popularity of bi-level optimization (BO) in deep learning has spurred a growing interest in studying gradient-based BO algorithms. However, existing algorithms involve two coupled learning rates that can be affected by approximation errors when computing hypergradients, making careful fine-tuning necessary to ensure fast convergence. To alleviate this issue, we investigate the use of recently proposed adaptive step-size methods, namely stochastic line search (SLS) and stochastic Polyak step size (SPS), for computing both the upper and lower-level learning rates. First, we revisit the use of SLS and SPS in single-level optimization without the additional interpolation condition that is typically assumed in prior works. For such settings, we investigate new variants of SLS and SPS that improve upon existing suggestions in the literature and are simpler to implement. Importantly, these two variants can be seen as special instances of general family of methods with an envelope-type step-size. This unified envelope strategy allows for the extension of the algorithms and their convergence guarantees to BO settings. Finally, our extensive experiments demonstrate that the new algorithms, which are available in both SGD and Adam versions, can find large learning rates with minimal tuning and converge faster than corresponding vanilla SGD or Adam BO algorithms that require fine-tuning.
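
For concreteness, the stochastic Polyak step size used as a building block here has a simple closed form. A minimal sketch of the common SPS_max variant (parameter names illustrative; the paper's envelope-type variants differ in detail):

```python
import numpy as np

def sps_step(loss_val, grad, c=0.5, eta_max=1.0, f_star=0.0):
    """Stochastic Polyak step size (SPS_max variant):
    eta_t = min(eta_max, (f_i(x_t) - f_i^*) / (c * ||grad f_i(x_t)||^2)),
    with f_i^* often taken to be 0 for over-parameterized losses."""
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:
        return 0.0
    return min(eta_max, max(loss_val - f_star, 0.0) / (c * g2))

# usage inside an SGD loop, on a minibatch loss and gradient:
#   x = x - sps_step(batch_loss, batch_grad) * batch_grad
```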

Blocked Collaborative Bandits: Online Collaborative Filtering with Per-Item Budget Constraints
Soumyabrata Pal Arun Suggala Karthikeyan Shanmugam Prateek Jain



Research question: In a multi-user multi-armed bandit problem, how to design algorithms that maximize the cumulative reward of all users under the constraint that no arm of a user is pulled more than $\mathsf{B}$ times.
Motivation: Users are grouped into latent clusters whose members share identical mean reward vectors; this blocked setting was originally considered by Bresler et al. (2014), and designing regret-optimal algorithms for it has remained an open problem.
Method: We propose B-LATTICE (Blocked Latent bAndiTs via maTrIx ComplEtion), a phased algorithm that clusters users into groups and collaborates across users within a group, while simultaneously satisfying the budget constraints.
Results: Under reasonable assumptions on the latent structure, B-LATTICE achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$; these are the first sub-linear regret bounds for this problem and match the minimax bounds when $\mathsf{B}=\mathsf{T}$. Empirically, the algorithm outperforms baselines even when $\mathsf{B}=1$.

We consider the problem of \emph{blocked} collaborative bandits where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. Our goal is to design algorithms that maximize the cumulative reward accrued by all the users over time, under the \emph{constraint} that no arm of a user is pulled more than $\mathsf{B}$ times. This problem has been originally considered by \cite{Bresler:2014}, and designing regret-optimal algorithms for it has since remained an open problem. In this work, we propose an algorithm called B-LATTICE (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users, while simultaneously satisfying the budget constraints, to maximize their cumulative rewards. Theoretically, under certain reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, B-LATTICE achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first sub-linear regret bounds for this problem, and match the minimax regret bounds when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm has superior performance over baselines even when $\mathsf{B}=1$. B-LATTICE is a phased algorithm where in each phase it clusters users into groups and collaborates across users within a group to quickly learn their reward models.

Beyond NTK with Vanilla Gradient Descent: A Mean-Field Analysis of Neural Networks with Polynomial Width, Samples, and Time
Arvind Venkat Mahankali Jeff Z. HaoChen Kefan Dong Margalit Glasgow Tengyu Ma



Research question: This paper asks whether gradient descent on neural networks, without unnatural modifications, can achieve better sample complexity than kernel methods.
Motivation: Despite recent progress in the theory of non-convex optimization of two-layer neural networks, this question has remained open.
Method: We give a clean mean-field analysis of projected gradient flow on polynomial-width two-layer neural networks. Unlike prior works, our analysis does not require unnatural modifications of the optimization algorithm.
Results: We prove that a network trained with $n=O(d^{3.1})$ samples converges in polynomial time to a non-trivial error that kernel methods cannot achieve with $n\ll d^4$ samples, giving a clean separation between unmodified gradient descent and the NTK. As a corollary, projected gradient descent with a positive learning rate and a polynomial number of iterations converges to low error with the same sample complexity.

Despite recent theoretical progress on the non-convex optimization of two-layer neural networks, it is still an open question whether gradient descent on neural networks without unnatural modifications can achieve better sample complexity than kernel methods. This paper provides a clean mean-field analysis of projected gradient flow on polynomial-width two-layer neural networks. Different from prior works, our analysis does not require unnatural modifications of the optimization algorithm. We prove that with sample size $n = O(d^{3.1})$ where $d$ is the dimension of the inputs, the network trained with projected gradient flow converges in polynomial time to a non-trivial error that is not achievable by kernel methods using $n \ll d^4$ samples, hence demonstrating a clear separation between unmodified gradient descent and NTK. As a corollary, we show that projected gradient descent with a positive learning rate and a polynomial number of iterations converges to low error with the same sample complexity.

Statistical and Computational Trade-off in Multi-Agent Multi-Armed Bandits
Filippo Vannella Alexandre Proutiere Jaeseong Jeong



Research question: Regret minimization in multi-agent multi-armed bandits (MAMABs), where the rewards are defined through a factor graph.
Motivation: The instance-specific regret lower bound, and the corresponding optimal exploration process, are obtained by solving a combinatorial optimization problem whose numbers of variables and constraints grow exponentially with the number of agents.
Method: We approximate the regret lower bound problem via mean-field techniques to reduce the number of variables and constraints, and by tuning the latter we explore the trade-off between achievable regret and complexity. We then devise ESM (Efficient Sampling for MAMAB), an algorithm whose regret asymptotically matches the corresponding approximated lower bound.
Results: We assess the regret and the computational complexity of ESM numerically, using both synthetic and real-world experiments in radio communication networks.

We study the problem of regret minimization in Multi-Agent Multi-Armed Bandits (MAMABs) where the rewards are defined through a factor graph. We derive an instance-specific regret lower bound and characterize the minimal expected number of times each global action should be explored. Unfortunately, this bound and the corresponding optimal exploration process are obtained by solving a combinatorial optimization problem with a set of variables and constraints exponentially growing with the number of agents. We approximate the regret lower bound problem via Mean Field techniques to reduce the number of variables and constraints. By tuning the latter, we explore the trade-off between achievable regret and complexity. We devise Efficient Sampling for MAMAB (ESM), an algorithm whose regret asymptotically matches the corresponding approximated lower bound. We assess the regret and computational complexity of ESM numerically, using both synthetic and real-world experiments in radio communications networks.

Uniform Convergence with Square-Root Lipschitz Loss
Lijia Zhou Zhen Dai Frederic Koehler Nathan Srebro



Research question: This paper establishes generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function.
Motivation: Existing results based on smoothness (the Lipschitz constant of the derivative) call for substantial generalization, and a broader class of square-root-Lipschitz losses, including the non-smooth losses used to study phase retrieval and ReLU regression, needs to be handled.
Method: We show that the new guarantees substantially generalize the smoothness-based results and cover the class of square-root-Lipschitz losses.
Results: The guarantees also allow us to rederive and better understand "optimistic rate" and interpolation learning guarantees.

We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipschitz losses, which includes also non-smooth loss functions appropriate for studying phase retrieval and ReLU regression, as well as rederive and better understand “optimistic rate” and interpolation learning guarantees.

A Finite-Sample Analysis of Payoff-Based Independent Learning in Zero-Sum Stochastic Games
Zaiwei Chen Kaiqing Zhang Eric Mazumdar Asuman E. Ozdaglar Adam Wierman



Research question: This work studies two-player zero-sum stochastic games and develops a variant of the smoothed best-response learning dynamics.
Motivation: To analyze independent learning dynamics that extend from matrix games to stochastic games.
Method: The dynamics combine independent learning for matrix games with the minimax value iteration for stochastic games.
Results: The resulting learning dynamics are payoff-based, convergent, rational, and symmetric between the two players; this constitutes the first finite-sample analysis of such independent learning dynamics.

In this work, we study two-player zero-sum stochastic games and develop a variant of the smoothed best-response learning dynamics that combines independent learning dynamics for matrix games with the minimax value iteration for stochastic games. The resulting learning dynamics are payoff-based, convergent, rational, and symmetric between the two players. Our theoretical results present to the best of our knowledge the first last-iterate finite-sample analysis of such independent learning dynamics. To establish the results, we develop a coupled Lyapunov drift approach to capture the evolution of multiple sets of coupled and stochastic iterates, which might be of independent interest.

Global Convergence Analysis of Local SGD for Two-layer Neural Network without Overparameterization
Yajie Bao Amarda Shehu Mingrui Liu



Research question: A theoretical understanding of local stochastic gradient descent (local SGD), a cornerstone algorithm in federated learning, on non-convex loss landscapes is currently lacking.
Motivation: Because the noise depends on the model parameters, the global convergence analysis of SGD is challenging. Most existing analyses focus on gradient descent (GD) and rely on injected noise to converge to local or global optima; when extended to local SGD, existing non-convex analyses only guarantee finding stationary points, or assume the network is overparameterized so that convergence to the global minimum follows from neural tangent kernel analysis.
Method: This paper provides the first global convergence analysis of local SGD for two-layer neural networks without overparameterization and without injected noise, when the input data is Gaussian. The main technical ingredients of the proof are a "self-correction mechanism" and a "new exact recursive characterization of the direction of the global model parameters".
Results: The analysis shows that local SGD can correct the two layers and enter a good region in polynomial time, and then converge to the global minimum at a linear rate with reduced communication rounds; experiments confirm the theoretical results.

Local SGD, a cornerstone algorithm in federated learning, is widely used in training deep neural networks and shown to have strong empirical performance. A theoretical understanding of such performance on nonconvex loss landscapes is currently lacking. Analysis of the global convergence of SGD is challenging, as the noise depends on the model parameters. Indeed, many works narrow their focus to GD and rely on injecting noise to enable convergence to the local or global optimum. When expanding the focus to local SGD, existing analyses in the nonconvex case can only guarantee finding stationary points or assume the neural network is overparameterized so as to guarantee convergence to the global minimum through neural tangent kernel analysis. In this work, we provide the first global convergence analysis of the vanilla local SGD for two-layer neural networks \emph{without overparameterization} and \textit{without injecting noise}, when the input data is Gaussian. The main technical ingredients of our proof are \textit{a self-correction mechanism} and \textit{a new exact recursive characterization of the direction of global model parameters}. The self-correction mechanism guarantees the algorithm reaches a good region even if the initialization is in a bad region. A good (bad) region means updating the model by gradient descent will move closer to (away from) the optimal solution. The main difficulty in establishing a self-correction mechanism is to cope with the gradient dependency between two layers. To address this challenge, we divide the landscape of the objective into several regions to carefully control the interference of two layers during the correction process. As a result, we show that local SGD can correct the two layers and enter the good region in polynomial time. After that, we establish a new exact recursive characterization of the direction of global parameters, which is the key to showing convergence to the global minimum with linear speedup in the number of machines and reduced communication rounds. Experiments on synthetic data confirm theoretical results.
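
The algorithm analyzed is vanilla local SGD. A minimal sketch of the generic loop (all names illustrative; no claim about the paper's exact experimental setup):

```python
import numpy as np

def local_sgd(grad_fn, x0, n_workers=8, rounds=100, local_steps=10, lr=0.01):
    """Vanilla local SGD: each worker runs `local_steps` SGD steps from
    the shared model, then the server averages the workers' iterates.
    grad_fn(x, w) should return a stochastic gradient for worker w."""
    x = x0.copy()
    for _ in range(rounds):
        local_models = []
        for w in range(n_workers):
            xw = x.copy()
            for _ in range(local_steps):
                xw -= lr * grad_fn(xw, w)   # local stochastic gradient step
            local_models.append(xw)
        x = np.mean(local_models, axis=0)   # communication: average models
    return x
```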

Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits
Masoud Moravej Khorasani Erik Weyer



Research question: How to balance exploration and exploitation in online decision-making problems.
Motivation: Traditional UCB methods require a scale parameter to be known in advance and use only tail information, which can degrade their performance.
Method: We propose MARS (Maximum Average Randomly Sampled), a data-dependent UCB algorithm in a non-parametric setup for multi-armed bandits with symmetric rewards; its upper confidence bound is built from the maximum average of randomly sampled rewards, inspired by Hartigan's work in the 1960s and 70s.
Results: A regret bound for the multi-armed bandit problem is derived under the same assumptions as for the $\psi$-UCB method, without any correction factors, and the method compares favorably with baseline algorithms in numerical experiments.

Upper Confidence Bound (UCB) methods are one of the most effective methods in dealing with the exploration-exploitation trade-off in online decision-making problems. The confidence bounds utilized in UCB methods tend to be constructed based on concentration inequalities which are usually dependent on a parameter of scale (e.g. a bound on the payoffs, a variance, or a subgaussian parameter) that must be known in advance. The necessity of knowing a scale parameter a priori and the fact that the confidence bounds only use the tail information can deteriorate the performance of the UCB methods. Here we propose a data-dependent UCB algorithm called MARS (Maximum Average Randomly Sampled) in a non-parametric setup for multi-armed bandits with symmetric rewards. The algorithm does not depend on any scaling, and the data-dependent upper confidence bound is constructed based on the maximum average of randomly sampled rewards inspired by the work of Hartigan in the 1960s and 70s. A regret bound for the multi-armed bandit problem is derived under the same assumptions as for the $\psi$-UCB method without incorporating any correction factors. The method is illustrated and compared with baseline algorithms in numerical experiments.
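
A schematic reading of the MARS index, assuming the upper confidence value is the maximum of sample means over random subsets of an arm's observed rewards; the paper's exact sampling scheme may differ, so treat this purely as illustration:

```python
import numpy as np

def mars_index(rewards, n_resamples=10, rng=None):
    """Schematic 'maximum average randomly sampled' index: the largest
    mean over randomly drawn subsets of an arm's observed rewards.
    Illustrative only; see the paper for the exact sampling scheme."""
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    best = rewards.mean()
    for _ in range(n_resamples):
        k = int(rng.integers(1, n + 1))               # random subset size
        idx = rng.choice(n, size=k, replace=False)    # random subset
        best = max(best, rewards[idx].mean())
    return best

# bandit loop: pull the arm with the largest mars_index(arm_rewards)
```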

Universal Gradient Descent Ascent Method for Nonconvex-Nonconcave Minimax Optimization
Taoli Zheng Linglingzhi Zhu Anthony Man-Cho So Jose Blanchet Jiajin Li



Research question: Nonconvex-nonconcave minimax optimization has broad applications in machine learning, yet most existing algorithms rely on one-sided information, such as the convexity of the primal or the concavity of the dual function, or on specific structures such as the Polyak-Łojasiewicz and Kurdyka-Łojasiewicz conditions. Verifying these regularity conditions is challenging in practice.
Motivation: To meet this challenge, we propose a novel, universally applicable single-loop algorithm, the doubly smoothed gradient descent ascent method (DS-GDA), which naturally balances the primal and dual updates.
Method: With the same hyperparameters, DS-GDA uniformly solves nonconvex-concave, convex-nonconcave, and nonconvex-nonconcave problems, with convergence complexity $\mathcal{O}(\epsilon^{-4})$; sharper (even optimal) iteration complexity can be obtained when the KŁ exponent is known.
Results: For a range of challenging nonconvex-nonconcave problems, including *Forsaken*, *Bilinearly-coupled minimax*, *Sixth-order polynomial*, and *PolarGame*, DS-GDA gets rid of limit cycles. To our knowledge, it is the first first-order algorithm to achieve convergence on all of these difficult problems.

Nonconvex-nonconcave minimax optimization has received intense attention over the last decade due to its broad applications in machine learning. Most existing algorithms rely on one-sided information, such as the convexity (resp. concavity) of the primal (resp. dual) functions, or other specific structures, such as the Polyak-Łojasiewicz (PŁ) and Kurdyka-Łojasiewicz (KŁ) conditions. However, verifying these regularity conditions is challenging in practice. To meet this challenge, we propose a novel universally applicable single-loop algorithm, the doubly smoothed gradient descent ascent method (DS-GDA), which naturally balances the primal and dual updates. That is, DS-GDA with the same hyperparameters is able to uniformly solve nonconvex-concave, convex-nonconcave, and nonconvex-nonconcave problems with one-sided KŁ properties, achieving convergence with $\mathcal{O}(\epsilon^{-4})$ complexity. Sharper (even optimal) iteration complexity can be obtained when the KŁ exponent is known. Specifically, under the one-sided KŁ condition with exponent $\theta\in(0,1)$, DS-GDA converges with an iteration complexity of $\mathcal{O}(\epsilon^{-2\max\\{2\theta,1\\}})$. They all match the corresponding best results in the literature. Moreover, we show that DS-GDA is practically applicable to general nonconvex-nonconcave problems even without any regularity conditions, such as the PŁ condition, KŁ condition, or weak Minty variational inequalities condition. For various challenging nonconvex-nonconcave examples in the literature, including *Forsaken*, *Bilinearly-coupled minimax*, *Sixth-order polynomial*, and *PolarGame*, the proposed DS-GDA can all get rid of limit cycles. To the best of our knowledge, this is the first first-order algorithm to achieve convergence on all of these formidable problems.

Certified Minimax Unlearning with Generalization Rates and Deletion Capacity
Jiaqi Liu Jian Lou Zhan Qin Kui Ren



Research question: $(\epsilon,\delta)$-certified machine unlearning for minimax models.
Motivation: Most existing work focuses on unlearning from standard statistical learning models with a single variable, whose unlearning steps hinge on the direct Hessian-based conventional Newton update.
Method: We develop a new $(\epsilon,\delta)$-certified machine unlearning algorithm for minimax models. It proposes a minimax unlearning step consisting of a total-Hessian-based complete Newton update and the Gaussian mechanism borrowed from differential privacy. To obtain the unlearning certification, the method injects calibrated Gaussian noise by carefully analyzing the "sensitivity" of the minimax unlearning step (i.e., the closeness between the minimax unlearning variables and the retraining-from-scratch variables).
Results: We derive generalization rates in terms of the population strong and weak primal-dual risk for three cases of loss functions, i.e., (strongly-)convex-(strongly-)concave losses. We also provide the deletion capacity, guaranteeing that the desired population risk is maintained as long as the number of deleted samples does not exceed the derived amount. With $n$ training samples and model dimension $d$, it is of order $\mathcal{O}(n/d^{1/4})$, a strict gap over the $\mathcal{O}(n/d^{1/2})$ of the baseline method of differentially private minimax learning. Moreover, our generalization rates and deletion capacity match the best rates previously derived for standard statistical learning models.

We study the problem of $(\epsilon,\delta)$-certified machine unlearning for minimax models. Most of the existing works focus on unlearning from standard statistical learning models that have a single variable and their unlearning steps hinge on the \emph{direct Hessian-based conventional Newton} update. We develop a new $(\epsilon,\delta)$-certified machine unlearning algorithm for minimax models. It proposes a minimax unlearning step consisting of a \emph{total-Hessian-based complete Newton} update and the Gaussian mechanism borrowed from differential privacy. To obtain the unlearning certification, our method injects calibrated Gaussian noises by carefully analyzing the ``sensitivity'' of the minimax unlearning step (i.e., the closeness between the minimax unlearning variables and the retraining-from-scratch variables). We derive the generalization rates in terms of population strong and weak primal-dual risk for three different cases of loss functions, i.e., (strongly-)convex-(strongly-)concave losses. We also provide the deletion capacity to guarantee that a desired population risk can be maintained as long as the number of deleted samples does not exceed the derived amount. With training samples $n$ and model dimension $d$, it yields the order $\mathcal O(n/d^{1/4})$, which shows a strict gap over the baseline method of differentially private minimax learning that has $\mathcal O(n/d^{1/2})$. In addition, our rates of generalization and deletion capacity match the state-of-the-art rates derived previously for standard statistical learning models.

On the Generalization Error of Stochastic Mirror Descent for Quadratically-Bounded Losses: an Improved Analysis
Ta Duy Nguyen Alina Ene Huy Nguyen



Research question: Revisit the generalization error of stochastic mirror descent for quadratically bounded losses.
Motivation: Quadratically bounded losses are a broad class of loss functions capturing both Lipschitz and smooth functions, for both regression and classification problems.
Method: We directly analyze the moment generating function of a novel supermartingale sequence and leverage the structure of stochastic mirror descent to obtain high-probability generalization guarantees.
Results: We obtain improved bounds in all of the settings considered. Specifically, in the realizable case and in the non-realizable case with light-tailed sub-Gaussian data, we improve the bounds by a $\log T$ factor, matching the correct rates of $1/T$ and $1/\sqrt{T}$, respectively; in the more challenging case of heavy-tailed polynomial data, we improve the existing bound by a $\mathrm{poly}\,T$ factor.

In this work, we revisit the generalization error of stochastic mirror descent for quadratically bounded losses studied in Telgarsky (2022). Quadratically bounded losses is a broad class of loss functions, capturing both Lipschitz and smooth functions, for both regression and classification problems. We study the high probability generalization for this class of losses on linear predictors in both realizable and non-realizable cases when the data are sampled IID or from a Markov chain. The prior work relies on an intricate coupling argument between the iterates of the original problem and those projected onto a bounded domain. This approach enables blackbox application of concentration inequalities, but also leads to suboptimal guarantees due in part to the use of a union bound across all iterations. In this work, we depart significantly from the prior work of Telgarsky (2022), and introduce a novel approach for establishing high probability generalization guarantees. In contrast to the prior work, our work directly analyzes the moment generating function of a novel supermartingale sequence and leverages the structure of stochastic mirror descent. As a result, we obtain improved bounds in all aforementioned settings. Specifically, in the realizable case and non-realizable case with light-tailed sub-Gaussian data, we improve the bounds by a $\log T$ factor, matching the correct rates of $1/T$ and $1/\sqrt{T}$, respectively. In the more challenging case of heavy-tailed polynomial data, we improve the existing bound by a $\mathrm{poly}\ T$ factor.

Non-stationary Experimental Design under Linear Trends
David Simchi-Levi Chonghuan Wang Zeyu Zheng



Research question: How to design non-stationary experiments when the classical static average treatment effect (ATE) may fail to reflect treatment effects that change over time, in healthcare and other domains.
Motivation: Traditional experimental designs and the static ATE can be uninformative when treatment effects vary over time, so new designs are needed that estimate the dynamic treatment effect while minimizing welfare loss within the experiment.
Method: We propose an efficient non-stationary experimental design that can be customized for the optimal estimation error rate, the optimal regret rate, or the Pareto-optimal trade-off between the two objectives. We also establish information-theoretic lower bounds that highlight the inherent difficulty of estimating dynamic treatment effects while minimizing welfare loss.
Results: The analysis demonstrates the effectiveness of the new design and statistically reveals the fundamental trade-off between estimating dynamic treatment effects and minimizing welfare loss.

Experimentation has been critical and increasingly popular across various domains, such as clinical trials and online platforms, due to its widely recognized benefits. One of the primary objectives of classical experiments is to estimate the average treatment effect (ATE) to inform future decision-making. However, in healthcare and many other settings, treatment effects may be non-stationary, meaning that they can change over time, rendering the traditional experimental design inadequate and the classical static ATE uninformative. In this work, we address the problem of non-stationary experimental design under linear trends by considering two objectives: estimating the dynamic treatment effect and minimizing welfare loss within the experiment. We propose an efficient design that can be customized for optimal estimation error rate, optimal regret rate, or the Pareto optimal trade-off between the two objectives. We establish information-theoretical lower bounds that highlight the inherent challenge in estimating dynamic treatment effects and minimizing welfare loss, and also statistically reveal the fundamental trade-off between them.

Payoff-based Learning with Matrix Multiplicative Weights in Quantum Games
Kyriakos Lotidis Panayotis Mertikopoulos Nicholas Bambos Jose Blanchet



Research question: This paper studies the problem of learning in quantum games, as well as other classes of semidefinite games.
Motivation: Quantum games have an infinite continuum of pure states (the quantum equivalent of pure strategies), so standard importance-weighting techniques for estimating payoff vectors cannot be employed to attain convergence.
Method: We borrow ideas from bandit convex optimization and design a zeroth-order gradient sampler adapted to the semidefinite geometry of the problem at hand.
Results: The 3MW method with deterministic payoff feedback retains the $\mathcal{O}(1/\sqrt{T})$ convergence rate of the vanilla, full-information MMW algorithm in quantum min-max games, even though players only observe a single scalar. We further provide a variant that only requires players to observe a random realization of their payoff observable and converges to equilibrium at an $\mathcal{O}(T^{-1/4})$ rate. Finally, we show that a regularized variant of the proposed 3MW method converges locally, with high probability, to all equilibria satisfying a certain first-order stability condition.

In this paper, we study the problem of learning in quantum games - and other classes of semidefinite games - with scalar, payoff-based feedback. For concreteness, we focus on the widely used matrix multiplicative weights (MMW) algorithm and, instead of requiring players to have full knowledge of the game (and/or each other's chosen states), we introduce a suite of minimal-information matrix multiplicative weights (3MW) methods tailored to different information frameworks. The main difficulty to attaining convergence in this setting is that, in contrast to classical finite games, quantum games have an infinite continuum of pure states (the quantum equivalent of pure strategies), so standard importance-weighting techniques for estimating payoff vectors cannot be employed. Instead, we borrow ideas from bandit convex optimization and we design a zeroth-order gradient sampler adapted to the semidefinite geometry of the problem at hand. As a first result, we show that the 3MW method with deterministic payoff feedback retains the $\mathcal{O}(1/\sqrt{T})$ convergence rate of the vanilla, full information MMW algorithm in quantum min-max games, even though the players only observe a single scalar. Subsequently, we relax the algorithm's information requirements even further and we provide a 3MW method that only requires players to observe a random realization of their payoff observable, and converges to equilibrium at an $\mathcal{O}(T^{-1/4})$ rate. Finally, going beyond zero-sum games, we show that a regularized variant of the proposed 3MW method guarantees local convergence with high probability to all equilibria that satisfy a certain first-order stability condition.
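
The underlying full-information update is matrix multiplicative weights over density matrices; a minimal sketch (the 3MW methods replace the exact loss matrices below with zeroth-order estimates built from scalar payoff feedback):

```python
import numpy as np
from scipy.linalg import expm

def mmw_update(loss_matrices, eta=0.1):
    """Matrix multiplicative weights: given the Hermitian loss matrices
    observed so far, return the next density matrix (trace-one PSD state)
    X = exp(-eta * sum_s L_s) / tr(exp(-eta * sum_s L_s))."""
    S = -eta * sum(loss_matrices)
    S = (S + S.conj().T) / 2          # keep the exponent exactly Hermitian
    E = expm(S)
    return E / np.trace(E).real
```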

Robust Second-Order Nonconvex Optimization and Its Application to Low Rank Matrix Sensing
Shuyao Li Yu Cheng Ilias Diakonikolas Jelena Diakonikolas Rong Ge Stephen Wright



Research question: Finding an approximate second-order stationary point (SOSP) in the presence of outliers is a fundamental problem in stochastic nonconvex optimization, yet it is poorly understood in adversarial settings.
Motivation: The use of existing nonconvex algorithms is limited in adversarial settings.
Method: We introduce a general framework for efficiently finding an approximate SOSP in the strong contamination model, with dimension-independent accuracy guarantees, using $\widetilde{O}({D^2}/{\epsilon})$ samples, where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints.
Results: Applying the framework to low-rank matrix sensing, we develop efficient and provably robust algorithms that tolerate corruptions in both the sensing matrices and the measurements. We also establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.

Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings. In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with \emph{dimension-independent} accuracy guarantees, using $\widetilde{O}({D^2}/{\epsilon})$ samples where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints. As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.

Demystifying the Optimal Performance of Multi-Class Classification
Minoh Jeong Martina Cardone Alex Dytso



Research question: How to effectively estimate the Bayes error rate of classifiers in supervised multi-class classification.
Motivation: The Bayes error rate is generally unknown, so estimating it effectively is paramount.
Method: Inspired by the work of Ishida et al. (2023), we propose an estimator of the Bayes error rate of supervised multi-class classification problems, along with a denoising method and a median-of-means estimator that improve the estimator's robustness.
Results: Theoretical analysis and experiments establish the consistency, asymptotic unbiasedness, convergence rate, and robustness of the proposed estimators, validated on synthetic data under various noise settings and on real data.

Classification is a fundamental task in science and engineering on which machine learning methods have shown outstanding performances. However, it is challenging to determine whether such methods have achieved the Bayes error rate, that is, the lowest error rate attained by any classifier. This is mainly due to the fact that the Bayes error rate is not known in general and hence, effectively estimating it is paramount. Inspired by the work by Ishida et al. (2023), we propose an estimator for the Bayes error rate of supervised multi-class classification problems. We analyze several theoretical aspects of such estimator, including its consistency, unbiasedness, convergence rate, variance, and robustness. We also propose a denoising method that reduces the noise that potentially corrupts the data labels, and we improve the robustness of the proposed estimator to outliers by incorporating the median-of-means estimator. Our analysis demonstrates the consistency, asymptotic unbiasedness, convergence rate, and robustness of the proposed estimators. Finally, we validate the effectiveness of our theoretical results via experiments both on synthetic data under various noise settings and on real data.
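
The median-of-means component used for robustness is standard and easy to state; a minimal sketch (names illustrative):

```python
import numpy as np

def median_of_means(x, n_blocks=8, rng=None):
    """Median-of-means estimator: randomly split the sample into blocks,
    average each block, and return the median of the block means.  This
    is the generic robustification idea the estimator incorporates."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    perm = rng.permutation(len(x))            # shuffle before splitting
    blocks = np.array_split(x[perm], n_blocks)
    return float(np.median([b.mean() for b in blocks]))
```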

New Bounds for Hyperparameter Tuning of Regression Problems Across Instances
Nina Balcan Anh Tuan Nguyen Dravyansh Sharma



Research question: This paper addresses the sample complexity of tuning regularization parameters in regularized regression models in the data-driven setting.
Motivation: Tuning the regularization parameters of linear and logistic regression under $\ell_1$- and $\ell_2$-constraints with provable guarantees remains a significant challenge.
Method: By more carefully exploiting the structure of the dual function class, we provide a new upper bound on the pseudo-dimension of the validation loss function class, significantly improving the best-known results for the problem. We also introduce a new approach for studying learning guarantees via an approximation of the validation loss function class.
Results: We instantiate the first matching lower bound, proving the linear regression results tight, and we obtain the first learning guarantee for tuning the regularization coefficients of logistic regression.

The task of tuning regularization coefficients in regularized regression models with provable guarantees across problem instances still poses a significant challenge in the literature. This paper investigates the sample complexity of tuning regularization parameters in linear and logistic regressions under $\ell_1$ and $\ell_2$-constraints in the data-driven setting. For the linear regression problem, by more carefully exploiting the structure of the dual function class, we provide a new upper bound for the pseudo-dimension of the validation loss function class, which significantly improves the best-known results on the problem. Remarkably, we also instantiate the first matching lower bound, proving our results are tight. For tuning the regularization parameters of logistic regression, we introduce a new approach to studying the learning guarantee via an approximation of the validation loss function class. We examine the pseudo-dimension of the approximation class and construct a uniform error bound between the validation loss function class and its approximation, which allows us to instantiate the first learning guarantee for the problem of tuning logistic regression regularization coefficients.

Fine-Grained Theoretical Analysis of Federated Zeroth-Order Optimization
Jun Chen Hong Chen Bin Gu Hao Deng



Research question: This paper aims to establish systematic theoretical assessments of the federated zeroth-order optimization (FedZO) algorithm by developing the analysis technique of on-average model stability.
Motivation: Although FedZO performs well on black-box attack and softmax regression tasks, it lacks a generalization analysis, and its computed convergence rate is slower than in the corresponding first-order optimization setting.
Method: Using on-average model stability, we establish the first generalization error bound for FedZO under Lipschitz continuity and smoothness conditions. Refined generalization and optimization bounds are then obtained by replacing the bounded-gradient assumption with heavy-tailed gradient noise and by using a second-order Taylor expansion for the gradient approximation.
Results: With the help of a new error decomposition strategy, the theoretical analysis extends to the asynchronous case. For FedZO, this fine-grained analysis fills the theoretical gap in generalization guarantees and polishes the convergence characterization of the algorithm.

Federated zeroth-order optimization (FedZO) algorithm enjoys the advantages of both zeroth-order optimization and federated learning, and has shown exceptional performance on black-box attack and softmax regression tasks. However, there is no generalization analysis for FedZO, and its analysis on computing convergence rate is slower than the corresponding first-order optimization setting. This paper aims to establish systematic theoretical assessments of FedZO by developing the analysis technique of on-average model stability. We establish the first generalization error bound of FedZO under the Lipschitz continuity and smoothness conditions. Then, refined generalization and optimization bounds are provided by replacing bounded gradient with heavy-tailed gradient noise and utilizing the second-order Taylor expansion for gradient approximation. With the help of a new error decomposition strategy, our theoretical analysis is also extended to the asynchronous case. For FedZO, our fine-grained analysis fills the theoretical gap on the generalization guarantees and polishes the convergence characterization of the computing algorithm.
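
FedZO-style methods build on multi-point zeroth-order gradient estimators. A minimal sketch of such an estimator (a generic textbook form with Gaussian directions, not necessarily FedZO's exact variant):

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, n_dirs=20, rng=None):
    """Two-point zeroth-order gradient estimate: average
    (f(x + mu*u) - f(x)) / mu * u over random Gaussian directions u.
    This is an unbiased estimate of the gradient of a Gaussian-smoothed
    version of f, which is all a black-box oracle allows."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    g = np.zeros(d)
    fx = f(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - fx) / mu * u
    return g / n_dirs
```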

Bandit Task Assignment with Unknown Processing Time
Shinji Ito Daisuke Hatano Hanna Sumita Kei Takemura Takuro Fukunaga Naonori Kakimura Ken-Ichi Kawarabayashi



Research question: This paper introduces a novel problem setting, bandit task assignment, which incorporates the processing time of each task into the bandit setting.
Motivation: In this setting, a player sequentially chooses a set of tasks to start so that the set of processing tasks satisfies a given combinatorial constraint. The reward and processing time of each task follow unknown distributions whose values are revealed only after the task completes; the problem generalizes the stochastic combinatorial semi-bandit problem and the budget-constrained bandit problem.
Method: For this setting, we propose an algorithm based on upper confidence bounds (UCB) combined with a phased-update approach. The algorithm admits a gap-dependent regret upper bound of $O(MN(1/\Delta){\log T})$ and a gap-free regret upper bound of $\tilde{O}( \sqrt{MNT} )$, where $N$ is the number of tasks, $M$ is the maximum number of tasks run simultaneously, $T$ is the time horizon, and $\Delta$ is the gap between the expected per-round rewards of the optimal and best suboptimal task sets.
Results: These regret bounds nearly match the lower bounds.

This study considers a novel problem setting, referred to as \textit{bandit task assignment}, that incorporates the processing time of each task in the bandit setting. In this problem setting, a player sequentially chooses a set of tasks to start so that the set of processing tasks satisfies a given combinatorial constraint. The reward and processing time for each task follow unknown distributions, values of which are revealed only after the task has been completed. The problem generalizes the stochastic combinatorial semi-bandit problem and the budget-constrained bandit problem. For this problem setting, we propose an algorithm based on upper confidence bounds~(UCB) combined with a phased-update approach. The proposed algorithm admits a gap-dependent regret upper bound of $O(MN(1/\Delta){\log T})$ and a gap-free regret upper bound of $\tilde{O}( \sqrt{MNT} )$, where $N$ is the number of the tasks, $M$ is the maximum number of tasks run at the same time, $T$ is the time horizon, and $\Delta$ is the gap between expected per-round rewards of the optimal and best suboptimal sets of tasks. These regret bounds nearly match lower bounds.

An Exploration-by-Optimization Approach to Best of Both Worlds in Linear Bandits
Shinji Ito Kei Takemura



Research question: How to construct a linear bandit algorithm that achieves nearly optimal performance in both stochastic and adversarial environments.
Motivation: Existing linear bandit algorithms exhibit a performance gap between adversarial and stochastic environments, and an algorithm that excels in both is desirable.
Method: We construct a new linear bandit algorithm using the exploration-by-optimization approach.
Results: The algorithm achieves $O(d \sqrt{T \log{T}})$ regret in adversarial environments and $O(\frac{d^2 \log T}{\Delta_{\min}})$ regret in stochastic environments, with even better guarantees for important special cases such as multi-armed bandits and multitask bandits.

In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance for both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020] works well. Specifically, an algorithm constructed using this approach achieves $O(d \sqrt{ T \log{T}})$-regret in adversarial environments and $O(\frac{d^2 \log T}{\Delta_{\min}} )$-regret in stochastic environments. Symbols $d$, $T$ and $\Delta_{\min}$ here represent the dimensionality of the action set, the time horizon, and the minimum sub-optimality gap, respectively. We also show that this algorithm has even better theoretical guarantees for important special cases including the multi-armed bandit problem and multitask bandits.

Exploiting Correlated Auxiliary Feedback in Parameterized Bandits
Arun Verma Zhongxiang Dai Yao Shu Bryan Kian Hsiang Low



Research question: This paper studies a novel variant of the parameterized bandit problem in which the learner can observe additional auxiliary feedback that is correlated with the observed reward.
Motivation: Auxiliary feedback is readily available in many real-life applications; for example, an online platform that wants to recommend its best-rated services can observe a user's rating of a service (the reward) and collect additional information such as the service delivery time (auxiliary feedback).
Method: We first develop a method that exploits auxiliary feedback to build a reward estimator with tight confidence bounds, leading to smaller regret, and then characterize the regret reduction in terms of the correlation coefficient between the reward and its auxiliary feedback.
Results: Experimental results in different settings verify the performance gains achieved by the proposed method.

We study a novel variant of the parameterized bandits problem in which the learner can observe additional auxiliary feedback that is correlated with the observed reward. The auxiliary feedback is readily available in many real-life applications, e.g., an online platform that wants to recommend the best-rated services to its users can observe the user's rating of service (rewards) and collect additional information like service delivery time (auxiliary feedback). In this paper, we first develop a method that exploits auxiliary feedback to build a reward estimator with tight confidence bounds, leading to a smaller regret. We then characterize the regret reduction in terms of the correlation coefficient between reward and its auxiliary feedback. Experimental results in different settings also verify the performance gain achieved by our proposed method.
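
One natural instantiation of "exploiting correlated auxiliary feedback" is a control-variate correction, whose variance shrinks by a factor of $(1-\rho^2)$ under the optimal coefficient. A hedged sketch (a standard control-variate estimator, not necessarily the paper's exact construction):

```python
import numpy as np

def control_variate_estimate(rewards, aux, aux_mean):
    """Control-variate estimate of a mean reward using correlated
    auxiliary feedback with known mean aux_mean.  With the optimal
    coefficient beta = Cov(r, a) / Var(a), the estimator's variance
    shrinks by a factor of (1 - rho^2), rho being the correlation."""
    rewards, aux = np.asarray(rewards, float), np.asarray(aux, float)
    cov = np.cov(rewards, aux)                   # 2x2 sample covariance
    beta = cov[0, 1] / cov[1, 1]                 # control-variate coefficient
    return rewards.mean() - beta * (aux.mean() - aux_mean)
```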

Weitzman's Rule for Pandora's Box with Correlations
Evangelia Gergatsouli Christos Tzamos



Research question: In decision-making under uncertainty, how to optimize the strategy for opening boxes so as to minimize the sum of the value selected and the opening costs paid.
Motivation: To revisit Pandora's Box when the value distributions are correlated and to improve upon prior work.
Method: We show that Weitzman's rule, the optimal algorithm for the independent case, works directly in the correlated case, and we show how to implement the rule given only sample access to the correlated distribution of values.
Results: Compared with prior work, the algorithm achieves significantly improved approximation guarantees while being substantially simpler, and a number of samples polynomial in the number of boxes suffices for it to work.

Pandora’s Box is a central problem in decision making under uncertainty that can model various real life scenarios. In this problem we are given n boxes, each with a fixed opening cost, and an unknown value drawn from a known distribution, only revealed if we pay the opening cost. Our goal is to find a strategy for opening boxes to minimize the sum of the value selected and the opening cost paid. In this work we revisit Pandora’s Box when the value distributions are correlated, first studied in [CGT+20]. We show that the optimal algorithm for the independent case, given by Weitzman’s rule, directly works for the correlated case. In fact, our algorithm results in significantly improved approximation guarantees compared to the previous work, while also being substantially simpler. We also show how to implement the rule given only sample access to the correlated distribution of values. Specifically, we find that a number of samples that is polynomial in the number of boxes is sufficient for the algorithm to work.
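
A minimal sketch of Weitzman's rule in the minimization form used here, with reservation values estimated from samples by bisection (names and the sampling interface are illustrative):

```python
import numpy as np

def reservation_value(samples, cost, iters=60):
    """Solve E[(sigma - V)^+] = cost for sigma by bisection over the
    empirical distribution of a box's value (minimization variant)."""
    samples = np.asarray(samples, dtype=float)
    lo, hi = samples.min(), samples.max() + cost
    for _ in range(iters):
        mid = (lo + hi) / 2
        if np.maximum(mid - samples, 0.0).mean() < cost:
            lo = mid          # sigma too small: expected saving below cost
        else:
            hi = mid
    return (lo + hi) / 2

def weitzman(sample_sets, costs, values):
    """Weitzman's rule, min-cost form: open boxes in increasing order of
    reservation value; stop when the best value seen is already below
    the next reservation value.  values[i] is revealed upon opening."""
    sigmas = [reservation_value(s, c) for s, c in zip(sample_sets, costs)]
    best, paid = np.inf, 0.0
    for i in np.argsort(sigmas):
        if best <= sigmas[i]:     # stopping rule
            break
        paid += costs[i]
        best = min(best, values[i])
    return best + paid            # value selected plus opening costs
```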

Minimax Optimal Rate for Parameter Estimation in Multivariate Deviated Models
Dat Do Huy Nguyen Khai Nguyen Nhat Ho



Research question: This paper studies maximum likelihood estimation (MLE) in the multivariate deviated model, where the data are generated from the density function $(1-\lambda^{\ast})h_{0}(x)+\lambda^{\ast}f(x|\mu^{\ast}, \Sigma^{\ast})$, with $h_{0}$ a known function and $(\lambda^{\ast}, \mu^{\ast}, \Sigma^{\ast})$ unknown parameters.
Motivation: The main challenges in deriving the MLE's convergence rate come from two issues: (1) the interaction between the function $h_{0}$ and the density function $f$; (2) the deviated proportion $\lambda^{\ast}$ may tend to the extreme points of $[0,1]$ as the sample size tends to infinity.
Method: To address these challenges, we develop a "distinguishability condition" that captures the linear independence between the function $h_{0}$ and the density function $f$.
Results: We provide comprehensive convergence rates of the MLE, governed by the vanishing rate of $\lambda^{\ast}$ towards zero and the distinguishability of the two functions $h_{0}$ and $f$.

We study the maximum likelihood estimation (MLE) in the multivariate deviated model where the data are generated from the density function $(1-\lambda^{\ast})h_{0}(x)+\lambda^{\ast}f(x|\mu^{\ast}, \Sigma^{\ast})$ in which $h_{0}$ is a known function, $\lambda^{\ast} \in [0,1]$ and $(\mu^{\ast}, \Sigma^{\ast})$ are unknown parameters to estimate. The main challenges in deriving the convergence rate of the MLE mainly come from two issues: (1) The interaction between the function $h_{0}$ and the density function $f$; (2) The deviated proportion $\lambda^{\ast}$ can go to the extreme points of $[0,1]$ as the sample size tends to infinity. To address these challenges, we develop the \emph{distinguishability condition} to capture the linear independent relation between the function $h_{0}$ and the density function $f$. We then provide comprehensive convergence rates of the MLE via the vanishing rate of $\lambda^{\ast}$ to zero as well as the distinguishability of two functions $h_{0}$ and $f$.

The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
Kaiwen Wang Kevin Zhou Runzhe Wu Nathan Kallus Wen Sun



Research question: When and why distributional reinforcement learning (DistRL) is better than vanilla, non-distributional RL has remained unanswered.
Motivation: We explain the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with the optimal achievable cost.
Method: As a warmup, we propose a distributional contextual bandit (DistCB) algorithm, which enjoys small-loss regret bounds and empirically outperforms the state of the art on three real-world tasks. For online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation.
Results: The analysis shows that, in low-rank MDPs, the online algorithm enjoys novel small-loss PAC bounds; in offline RL, pessimistic DistRL enjoys novel small-loss PAC bounds that are more robust to poor single-policy coverage.

While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the $\ell_1$ distributional eluder dimension which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.

Simple, Scalable and Effective Clustering via One-Dimensional Projections
Moses Charikar Monika Henzinger Lunjia Hu Maximilian Vötsch Erik Waingarten



Research question: How to cluster large-scale datasets efficiently.
Motivation: Existing clustering algorithms have high time complexity on massive datasets and become inefficient.
Method: We propose a randomized clustering algorithm that runs in expected time $O(\mathsf{nnz}(X) + n\log n)$, markedly better than existing algorithms.
Results: Experiments show that the quality of the clusters found is usually much better than the worst-case bound, and the algorithm offers a new trade-off between running time and cluster quality.

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. On massive datasets with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathsf{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathsf{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective, and our experiments show that the quality of the clusters found by our algorithm is usually much better than this worst-case bound. We use our algorithm for $k$-means clustering and for coreset construction; our experiments show that it gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks. Our theoretical analysis is based on novel results of independent interest. We show that the approximation ratio achieved after a random one-dimensional projection can be lifted to the original points and that $k$-means++ seeding can be implemented in expected time $O(n\log n)$ in one dimension.
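
A schematic of the projection-then-1D-clustering idea (the paper's algorithm uses a more careful one-dimensional procedure, such as $k$-means++ seeding in 1D; this sketch simply cuts the projected line at the largest gaps):

```python
import numpy as np

def cluster_by_1d_projection(X, k, rng=None):
    """Project the points onto a random unit vector, sort them, and cut
    the line into k contiguous clusters at the k-1 largest gaps.  A toy
    illustration of clustering via one-dimensional projections."""
    assert k >= 2
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    proj = X @ u                                  # 1D projection
    order = np.argsort(proj)
    gaps = np.diff(proj[order])                   # n-1 consecutive gaps
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])   # k-1 largest gaps
    labels = np.empty(n, dtype=int)
    start = 0
    for c, cut in enumerate(list(cuts) + [n - 1]):
        labels[order[start:cut + 1]] = c          # one contiguous segment
        start = cut + 1
    return labels
```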

SQ Lower Bounds for Learning Mixtures of Linear Classifiers
Ilias Diakonikolas Daniel Kane Yuxin Sun



Research question: Learning mixtures of linear classifiers under Gaussian covariates.
Motivation: Given sample access to a mixture determined by unknown unit vectors, the goal is to learn the underlying distribution in total variation distance.
Method: Within the Statistical Query framework, we give a new construction of spherical designs on the unit sphere, the key technical ingredient behind the result.
Results: Our main result is a Statistical Query (SQ) lower bound suggesting that known algorithms for this problem are essentially best possible, even in the special case of uniform mixtures.

We study the problem of learning mixtures of linear classifiers under Gaussian covariates. Given sample access to a mixture of $r$ distributions on $\mathbb{R}^n$ of the form $(\mathbf{x},y_{\ell})$, $\ell \in [r]$, where $\mathbf{x}\sim\mathcal{N}(0,\mathbf{I}_n)$ and $y_\ell=\mathrm{sign}(\langle\mathbf{v}_{\ell},\mathbf{x}\rangle)$ for an unknown unit vector $\mathbf{v}_{\ell}$, the goal is to learn the underlying distribution in total variation distance. Our main result is a Statistical Query (SQ) lower bound suggesting that known algorithms for this problem are essentially best possible, even for the special case of uniform mixtures. In particular, we show that the complexity of any SQ algorithm for the problem is $n^{\mathrm{poly}(1/\Delta) \log(r)}$, where $\Delta$ is a lower bound on the pairwise $\ell_2$-separation between the $\mathbf{v}_{\ell}$'s. The key technical ingredient underlying our result is a new construction of spherical designs on the unit sphere that may be of independent interest.

Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization
Daesung Kim Hye Won Chung



Research question: The non-convex formulation of the matrix completion problem has attracted much attention in recent years for its affordable complexity compared with the convex formulation.
Motivation: Gradient descent (GD) is a simple yet efficient baseline for non-convex optimization, but previous analyses for matrix completion require either careful initialization or regularizers to prove that GD converges.
Method: We study rank-1 symmetric matrix completion and prove that GD converges to the ground truth when a small random initialization is used; within a logarithmic number of iterations, the trajectory enters the region where local convergence occurs.
Results: We provide an upper bound on the initialization size sufficient for convergence, show that larger initializations can be used as more samples become available, and observe that GD's implicit regularization keeps every entry from becoming much larger than the others along the entire trajectory.

The nonconvex formulation of the matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient Descent (GD) is a simple yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this paper, we study the rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in a logarithmic number of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee the convergence, and show that a larger initialization can be used as more samples are available. We observe that the implicit regularization effect of GD plays a critical role in the analysis, and for the entire trajectory, it prevents each entry from becoming much larger than the others.
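
As a concrete picture of the setting, here is a minimal sketch of the analyzed procedure: gradient descent on $f(x)=\sum_{(i,j)\in\Omega}(x_i x_j - M_{ij})^2$ from a small random initialization. The step size, iteration count and initialization scale below are illustrative choices, not the paper's constants.

    import numpy as np

    def rank1_completion_gd(M_obs, mask, step=0.01, init_scale=1e-6, iters=5000, seed=0):
        # M_obs: observed symmetric matrix (zeros off the sample set);
        # mask: symmetric 0/1 matrix of observed entries.
        n = M_obs.shape[0]
        rng = np.random.default_rng(seed)
        x = init_scale * rng.standard_normal(n)   # small random initialization
        for _ in range(iters):
            R = mask * (np.outer(x, x) - M_obs)   # residual on observed entries
            x = x - step * 2.0 * ((R + R.T) @ x)  # exact gradient of f
        return x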

Approximate Allocation Matching for Structural Causal Bandits with Unobserved Confounders
Lai Wei Muhammad Qasim Elahi Mahsa Ghasemi Murat Kocaoglu



Research question: How to make online decisions with structural causal models when the causal structure is known.
Motivation: In a stochastic environment, the observational and interventional distributions are unknown and can only be learned through interaction, so balancing the explore-versus-exploit tradeoff is key to maximizing expected cumulative reward.
Method: Use the canonical SCM formulation to discretize the domains of the unobserved variables and efficiently integrate samples to reduce model uncertainty; design an algorithm that exploits the causal structure to accelerate learning and take informative, rewarding interventions.
Results: The algorithm achieves logarithmic regret and outperforms existing methods in simulations.

Structural causal bandit provides a framework for online decision-making problems when causal information is available. It models the stochastic environment with a structural causal model (SCM) that governs the causal relations between random variables. In each round, an agent applies an intervention (or no intervention) by setting certain variables to some constants and receives a stochastic reward from a non-manipulable variable. Though the causal structure is given, the observational and interventional distributions of these random variables are unknown beforehand, and they can only be learned through interactions with the environment. Therefore, to maximize the expected cumulative reward, it is critical to balance the explore-versus-exploit tradeoff. We assume each random variable takes a finite number of distinct values, and consider a semi-Markovian setting, where random variables are affected by unobserved confounders. Using the canonical SCM formulation to discretize the domains of unobserved variables, we efficiently integrate samples to reduce model uncertainty. This gives the decision maker a natural advantage over those in a classical multi-armed bandit setup. We provide a logarithmic asymptotic regret lower bound for the structural causal bandit problem. Inspired by the lower bound, we design an algorithm that can utilize the causal structure to accelerate the learning process and take informative and rewarding interventions. We establish that our algorithm achieves a logarithmic regret and demonstrate that it outperforms the existing methods via simulations.

Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games
Youbang Sun Tao Liu Ruida Zhou Panganamala Kumar Shahin Shahrampour



Research question: Independent natural policy gradient (NPG) algorithms for Markov potential games in multi-agent reinforcement learning.
Motivation: Finding efficient algorithms that reach a Nash equilibrium is a central problem; the previous best result requires $\mathcal{O}(1/\epsilon^2)$ iterations, which this work improves.
Method: An independent NPG algorithm with an oracle providing exact policy evaluation, which, under a mild technical assumption introducing the "suboptimality gap", asymptotically reaches an $\epsilon$-Nash equilibrium within $\mathcal{O}(1/\epsilon)$ iterations.
Results: Empirical results on a synthetic potential game and a congestion game verify the theoretical bounds.

This work studies an independent natural policy gradient (NPG) algorithm for the multi-agent reinforcement learning problem in Markov potential games. It is shown that, under mild technical assumptions and the introduction of the \textit{suboptimality gap}, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an $\epsilon$-Nash Equilibrium (NE) within $\mathcal{O}(1/\epsilon)$ iterations. This improves upon the previous best result of $\mathcal{O}(1/\epsilon^2)$ iterations and is of the same order, $\mathcal{O}(1/\epsilon)$, that is achievable for the single-agent case. Empirical results for a synthetic potential game and a congestion game are presented to verify the theoretical bounds.

Sample Complexity for Quadratic Bandits: Hessian Dependent Bounds and Optimal Algorithms
Qian Yu Yining Wang Baihe Huang Qi Lei Jason D. Lee



Research question: How to fully exploit the local geometry of the objective function, in particular for quadratic objectives.
Motivation: In stochastic zeroth-order optimization, understanding how to exploit this local geometry is a question of practical relevance.
Method: Introduce the notion of "energy allocation" to prove tight information-theoretic lower bounds on Hessian-dependent sample complexities; a matching upper bound follows from solving for the optimal energy spectrum. Algorithmically, exhibit a Hessian-independent algorithm that universally achieves the asymptotically optimal sample complexity for all Hessian instances.
Results: The optimal sample complexities achieved by the algorithm remain valid for heavy-tailed noise distributions, enabled by a truncation method.

In stochastic zeroth-order optimization, a problem of practical relevance is understanding how to fully exploit the local geometry of the underlying objective function. We consider a fundamental setting in which the objective function is quadratic, and provide the first tight characterization of the optimal Hessian-dependent sample complexity. Our contribution is twofold. First, from an information-theoretic point of view, we prove tight lower bounds on Hessian-dependent complexities by introducing a concept called \emph{energy allocation}, which captures the interaction between the searching algorithm and the geometry of objective functions. A matching upper bound is obtained by solving the optimal energy spectrum. Then, algorithmically, we show the existence of a Hessian-independent algorithm that universally achieves the asymptotic optimal sample complexities for all Hessian instances. The optimal sample complexities achieved by our algorithm remain valid for heavy-tailed noise distributions, which are enabled by a truncation method.

Adversarially Robust Distributed Count Tracking via Partial Differential Privacy
Zhongzheng Xiong Xiaoyi Zhu Zengfeng Huang



Research question: The distributed tracking model (distributed functional monitoring): $k$ sites each receive a stream of items and communicate with a central server, whose task is to continuously track a function of all items received so far while minimizing communication.
Motivation: For count tracking, a $\sqrt{k}$ communication gap between deterministic and randomized algorithms is known, but existing randomized algorithms assume an "oblivious adversary" that fixes the entire input stream before the algorithm starts. Against adaptive adversaries, who choose new items based on the algorithm's previous answers, deterministic algorithms are trivially robust while randomized ones may not be; this work asks whether the $\sqrt{k}$ advantage comes from randomness itself or from the oblivious-adversary assumption.
Method: Extend the differential privacy framework by introducing "partial differential privacy" and proving a new generalization theorem, which may have broader applications beyond robust count tracking and is of independent interest.
Results: A robust algorithm with optimal communication, answering the question affirmatively: the advantage of randomization survives adaptive adversaries.

We study the distributed tracking model, also known as distributed functional monitoring. This model involves $k$ sites each receiving a stream of items and communicating with the central server. The server's task is to track a function of all items received thus far continuously, with minimum communication cost. For count tracking, it is known that there is a $\sqrt{k}$ gap in communication between deterministic and randomized algorithms. However, existing randomized algorithms assume an "oblivious adversary" who constructs the entire input streams before the algorithm starts. Here we consider adaptive adversaries who can choose new items based on previous answers from the algorithm. Deterministic algorithms are trivially robust to adaptive adversaries, while randomized ones may not. Therefore, we investigate whether the $\sqrt{k}$ advantage of randomized algorithms is from randomness itself or the oblivious adversary assumption. We provide an affirmative answer to this question by giving a robust algorithm with optimal communication. Existing robustification techniques do not yield optimal bounds due to the inherent challenges of the distributed nature of the problem. To address this, we extend the differential privacy framework by introducing "partial differential privacy" and proving a new generalization theorem. This theorem may have broader applications beyond robust count tracking, making it of independent interest.

Online Performative Gradient Descent for Learning Nash Equilibria in Decision-Dependent Games
Zihan Zhu Ethan X Fang Zhuoran Yang



Research question: Finding Nash equilibria of multi-agent decision-dependent games, in particular in the bandit feedback setting.
Motivation: Because the agents are strategically coupled, traditional gradient-based methods are infeasible without a gradient oracle.
Method: Model the strategic interactions with a general parametric model and propose a novel online algorithm, Online Performative Gradient Descent (OPGD), which leverages online stochastic approximation and projected gradient descent to learn the Nash equilibrium via function approximation of the unknown gradient.
Results: Under mild assumptions, OPGD provably finds the Nash equilibrium efficiently in strongly monotone decision-dependent games; synthetic numerical experiments validate the theory.

We study the multi-agent game within the innovative framework of decision-dependent games, which establishes a feedback mechanism in which population data reacts to agents' actions and further characterizes the strategic interactions between agents. We focus on finding the Nash equilibrium of decision-dependent games in the bandit feedback setting. However, since agents are strategically coupled, traditional gradient-based methods are infeasible without a gradient oracle. To overcome this challenge, we model the strategic interactions by a general parametric model and propose a novel online algorithm, Online Performative Gradient Descent (OPGD), which leverages the ideas of online stochastic approximation and projected gradient descent to learn the Nash equilibrium in the context of function approximation for the unknown gradient. In particular, under mild assumptions on the function classes defined in the parametric model, we prove that OPGD can find the Nash equilibrium efficiently for strongly monotone decision-dependent games. Synthetic numerical experiments validate our theory.

Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations
Edward Raff Amol Ashish Khanna Fred Lu



Research question: No methods exist today for training differentially private regression models on sparse input data.
Motivation: To remedy this, adapt the Frank-Wolfe algorithm for $L_1$ penalized linear regression so that it can exploit sparse inputs effectively.
Method: Reduce the training time of the algorithm from $\mathcal{O}(TDS + TNS)$ to $\mathcal{O}(NS + T\sqrt{D}\log{D} + TS^2)$, where $T$ is the number of iterations, $S$ the sparsity rate of the dataset, $N$ the number of rows, and $D$ the number of features.
Results: Experiments show that the procedure can reduce runtime by up to $2,200\times$, depending on the privacy parameter $\epsilon$ and the sparsity of the dataset.

To the best of our knowledge, there are no methods today for training differentially private regression models on sparse input data. To remedy this, we adapt the Frank-Wolfe algorithm for $L_1$ penalized linear regression to be aware of sparse inputs and to use them effectively. In doing so, we reduce the training time of the algorithm from $\mathcal{O}( T D S + T N S)$ to $\mathcal{O}(N S + T \sqrt{D} \log{D} + T S^2)$, where $T$ is the number of iterations and $S$ is the sparsity rate of a dataset with $N$ rows and $D$ features. Our results demonstrate that this procedure can reduce runtime by a factor of up to $2,200\times$, depending on the value of the privacy parameter $\epsilon$ and the sparsity of the dataset.
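
For orientation, the non-private Frank-Wolfe iteration over the $\ell_1$ ball that the method builds on looks roughly as follows (logistic loss, matching the title). This dense sketch omits the paper's actual contributions, the private coordinate selection and the sparsity-aware updates; names and defaults are illustrative.

    import numpy as np

    def frank_wolfe_l1_logreg(X, y, radius=1.0, iters=100):
        # y in {0,1}; minimize mean logistic loss over the l1 ball of given radius.
        n, d = X.shape
        w = np.zeros(d)
        for t in range(iters):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted probabilities
            grad = X.T @ (p - y) / n               # gradient of the mean log-loss
            j = int(np.argmax(np.abs(grad)))       # linear minimization oracle:
            s = np.zeros(d)                        # best vertex of the l1 ball
            s[j] = -radius * np.sign(grad[j])
            gamma = 2.0 / (t + 2.0)                # standard Frank-Wolfe step size
            w = (1.0 - gamma) * w + gamma * s
        return w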

Online Adaptive Policy Selection in Time-Varying Systems: No-Regret via Contractive Perturbations
Yiheng Lin James A Preiss Emile Timothy Anand Yingying Li Yisong Yue Adam Wierman



Research question: Online adaptive policy selection in systems with time-varying costs and dynamics.
Motivation: Existing methods require substantial information and computation and cannot adapt quickly to changes in the environment.
Method: Develop the Gradient-based Adaptive Policy Selection (GAPS) algorithm together with a general analytical framework for online policy selection via online optimization.
Results: Experiments show that GAPS adapts to changing environments more quickly than existing benchmarks.

We study online adaptive policy selection in systems with time-varying costs and dynamics. We develop the Gradient-based Adaptive Policy Selection (GAPS) algorithm together with a general analytical framework for online policy selection via online optimization. Under our proposed notion of contractive policy classes, we show that GAPS approximates the behavior of an ideal online gradient descent algorithm on the policy parameters while requiring less information and computation. When convexity holds, our algorithm is the first to achieve optimal policy regret. When convexity does not hold, we provide the first local regret bound for online policy selection. Our numerical experiments show that GAPS can adapt to changing environments more quickly than existing benchmarks.

Computing Approximate $\ell_p$ Sensitivities
Swati Padmanabhan David Woodruff Qiuyi Zhang



Research question: How to efficiently reduce datasets for regression tasks without sacrificing approximation quality.
Motivation: Existing reduction methods remove low-sensitivity datapoints, but fast algorithms are known only for the $\ell_2$ setting; the goal is efficient algorithms for approximating $\ell_p$ sensitivities and other matrix summary statistics.
Method: Compute an $\alpha$-approximation to the $\ell_1$ sensitivities of a given matrix using only $n/\alpha$ sensitivity computations; estimate the total $\ell_p$ sensitivity to a constant factor via importance sampling of $\ell_p$ Lewis weights at the cost of roughly $\sqrt{d}$ sensitivity computations; estimate the maximum $\ell_1$ sensitivity up to a $\sqrt{d}$ factor; and generalize these results to $\ell_p$ norms.
Results: Experiments show that for a class of structured matrices from real-world datasets the total sensitivity can be approximated quickly and is significantly smaller than the theoretical prediction, indicating that real-world datasets have low average intrinsic effective dimensionality.

Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating sensitivities, which we show to be equivalent to approximate regression, are known only for the $\ell_2$ setting, in which they are popularly termed leverage scores. In this work, we provide the first efficient algorithms for approximating $\ell_p$ sensitivities and other summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute an $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $n/\alpha$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation at the cost of roughly $\sqrt{d}$ sensitivity computations, with no polynomial dependence on $n$. Furthermore, we estimate the maximum $\ell_1$ sensitivity up to a $\sqrt{d}$ factor in $O(d)$ sensitivity computations. We also generalize these results to $\ell_p$ norms. Lastly, we experimentally show that for a wide class of structured matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have on average low intrinsic effective dimensionality.

Quantifying the Cost of Learning in Queueing Systems
Daniel Freund Thodoris Lykouris Wentao Weng



Research question: Parameter uncertainty in queueing systems, in particular the statistical complexity incurred in the early stage of learning.
Motivation: Although optimal control of queueing systems has been studied extensively, most existing approaches assume full knowledge of the system parameters, which rarely holds in practice, motivating learning-based approaches to queueing.
Method: Propose a new metric, the *Cost of Learning in Queueing (CLQ)*, quantifying the maximum increase in time-averaged queue length caused by parameter uncertainty; characterize CLQ for single-queue multi-server systems, multi-queue multi-server systems, and networks of queues, using a unified analysis framework that bridges Lyapunov and bandit analysis and yields guarantees for a wide range of algorithms.
Results: CLQ effectively captures the cost of learning in queueing systems and offers a new perspective on handling parameter uncertainty.

Queueing systems are widely applicable stochastic models with use cases in communication networks, healthcare, service systems, etc. Although their optimal control has been extensively studied, most existing approaches assume perfect knowledge of the system parameters. Of course, this assumption rarely holds in practice where there is parameter uncertainty, thus motivating a recent line of work on bandit learning for queueing systems. This nascent stream of research focuses on the asymptotic performance of the proposed algorithms. In this paper, we argue that an asymptotic metric, which focuses on late-stage performance, is insufficient to capture the intrinsic statistical complexity of learning in queueing systems which typically occurs in the early stage. Instead, we propose the *Cost of Learning in Queueing (CLQ)*, a new metric that quantifies the maximum increase in time-averaged queue length caused by parameter uncertainty. We characterize the CLQ of a single-queue multi-server system, and then extend these results to multi-queue multi-server systems and networks of queues. In establishing our results, we propose a unified analysis framework for CLQ that bridges Lyapunov and bandit analysis, provides guarantees for a wide range of algorithms, and could be of independent interest.

Path following algorithms for $\ell_2$-regularized $M$-estimation with approximation guarantee
Yunzhang Zhu Renxiong Liu



Research question: How to choose the grid of regularization parameters that balances model fit and complexity, and how accurately to solve the regularized problem at each selected grid point.
Motivation: Existing approaches pick a grid of tuning-parameter values and solve the regularized problem at those points, but both the choice of grid and the accuracy of each solve greatly affect the overall computation.
Method: A novel grid-point selection scheme and an adaptive stopping criterion, applicable to any optimization algorithm producing an approximate solution path, with a guarantee on the approximation error.
Results: Theory shows the proposed path approximates the exact solution path to any desired accuracy while saving as much overall computation as possible; numerical results corroborate the analysis.

Many modern machine learning algorithms are formulated as regularized M-estimation problems, in which a regularization (tuning) parameter controls a trade-off between model fit to the training data and model complexity. To select the ``best'' tuning parameter value that achieves a good trade-off, an approximated solution path needs to be computed. In practice, this is often done through selecting a grid of tuning parameter values and solving the regularized problem at the selected grid points. However, given any desired level of accuracy, it is often not clear how to choose the grid points and also how accurately one should solve the regularized problems at the selected grid points, both of which can greatly impact the overall amount of computation. In the context of the $\ell_2$-regularized $M$-estimation problem, we propose a novel grid point selection scheme and an adaptive stopping criterion for any given optimization algorithm that produces an approximated solution path with approximation error guarantee. Theoretically, we prove that the proposed solution path can approximate the exact solution path to an arbitrary level of accuracy, while saving the overall computation as much as possible. Numerical results also corroborate our theoretical analysis.

Harnessing the power of choices in decision tree learning
Guy Blanc Jane Lange Chirag Pabbaraju Colin Sullivan Li-Yang Tan Mo Tiwari



Research question: How to improve standard, empirically successful decision tree learning algorithms such as ID3, C4.5 and CART.
Motivation: These algorithms are central to machine learning yet inherently greedy: they only ever split on the single best attribute.
Method: A simple generalization, Top-$k$, which considers the $k$ best attributes as candidate splits rather than just the single best one.
Results: Theory and experiments demonstrate the power of this generalization. A greediness hierarchy theorem shows that for every $k \in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$. Extensive experiments show that Top-$k$ outperforms both classic greedy algorithms and recent "optimal decision tree" algorithms: it consistently yields significant accuracy gains across a wide range of benchmarks, while scaling to dataset and feature-set sizes far beyond the reach of optimal decision tree algorithms.

We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a greediness hierarchy theorem showing that for every $k\in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\epsilon$, whereas the latter only achieves accuracy $\frac{1}{2}+\epsilon$. We then show, through extensive experiments, that Top-$k$ outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent ``optimal decision tree'' algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms. The code to reproduce our results is available at https://github.com/SullivanC19/pydl8.5-topk.
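
To make the generalization concrete, here is a toy sketch of the Top-$k$ idea on binary features with Gini scoring: at each node, the search branches over the $k$ best-scoring attributes instead of committing to the single best. This illustrates the principle under assumed conventions (returning the training error of the best depth-limited tree); it is not the released implementation linked above.

    import numpy as np

    def gini(y):
        # Gini impurity of a 0/1 label vector.
        if len(y) == 0:
            return 0.0
        p = y.mean()
        return 2.0 * p * (1.0 - p)

    def split_score(X, y, j):
        # Weighted impurity after splitting on binary feature j (lower is better).
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

    def topk_tree_error(X, y, k, depth):
        # Training error of the best tree found by Top-k search.
        if depth == 0 or len(set(y)) <= 1:
            return int(min((y == 0).sum(), (y == 1).sum()))  # majority-vote leaf
        scores = [split_score(X, y, j) for j in range(X.shape[1])]
        candidates = np.argsort(scores)[:k]       # k best attributes, not just one
        errs = []
        for j in candidates:
            m = X[:, j] == 0
            errs.append(topk_tree_error(X[m], y[m], k, depth - 1)
                        + topk_tree_error(X[~m], y[~m], k, depth - 1))
        return min(errs)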

Hypothesis Selection with Memory Constraints
Maryam Aliakbarpour Mark Bun Adam Smith



Research question: Given a finite set of candidate distributions, how to select one that matches the data as well as possible.
Motivation: Approximating an unknown distribution from a limited number of samples is a fundamental problem.
Method: An algorithm for hypothesis selection under memory constraints, in a model where samples from $P$ arrive one by one in a stream and are accessed via "PDF-comparison" queries that compare the densities of any two hypotheses at a point $x$.
Results: The algorithm achieves a nearly optimal tradeoff between memory usage and the number of samples: given $b$ bits of memory (for $b$ roughly between $\log n$ and $n$), it solves hypothesis selection with $s$ samples, where $b \cdot s = O(n \log n)$; this is optimal up to an $O(\log n)$ factor, for all $b$.

Hypothesis selection is a fundamental problem in learning theory and statistics. Given a dataset and a finite set of candidate distributions, the goal is to select a distribution that matches the data as well as possible. More specifically, suppose we have sample access to an unknown distribution $P$ over a domain $\mathcal{X}$ that we know is well-approximated by one of a class of $n$ distributions (a.k.a. hypotheses), $\mathcal{H} \coloneqq \{H_1, H_2, \ldots, H_n\}$. The goal is to design an algorithm that outputs a distribution $\hat{H} \in \mathcal{H}$ whose total variation distance from $P$ is nearly minimal. In this work, we study the hypothesis selection problem under memory constraints. We consider a model where samples from $P$ are presented in a stream and we access each sample $x$ via ``PDF-comparison'' queries that allow us to compare the probability densities of any pair of hypotheses at the domain point $x$ (i.e., is $H_i(x) < H_j(x)$?). This model allows us to study how much memory is needed at any point in time to store information about the portion of the stream seen so far. Our main result is an algorithm that achieves a nearly optimal tradeoff between memory usage and the number of samples required. In particular, given $b$ bits of memory (for $b$ roughly between $\log n$ and $n$), our algorithm solves the hypothesis selection problem with $s$ samples, where $b \cdot s = O(n \log n)$. This result is optimal up to an $O(\log n)$ factor, for all $b$.

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks
Feng Chen Daniel Kunin Atsushi Yamamura Surya Ganguli



Research question: Reveal a strong implicit bias of stochastic gradient descent (SGD) when training deep neural networks: it drives overly expressive networks toward much simpler subnetworks, dramatically reducing the number of independent parameters and improving generalization.
Motivation: By identifying invariant sets, subsets of parameter space left unmodified by SGD, the analysis exposes a pronounced attraction of SGD toward simple (sparse or low-rank) subnetworks, with a sufficient condition for this stochastic attractivity based on the competition between the curvature of the loss landscape around the invariant set and the noise introduced by stochastic gradients.
Method: Focus on two classes of invariant sets corresponding to simpler subnetworks that commonly appear in modern architectures, and verify empirically that trained deep networks contain attractive invariant sets, implying that SGD dynamics often collapse to simple subnetworks with vanishing or redundant neurons.
Results: This process of stochastic collapse provably benefits generalization in a linear teacher-student framework, and the analysis mechanistically explains why early training with large learning rates for extended periods benefits subsequent generalization.

In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify _invariant sets_, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of _stochastic attractivity_ towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of _stochastic collapse_ benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

When is Agnostic Reinforcement Learning Statistically Tractable?
Zeyu Jia Gene Li Alexander Rakhlin Ayush Sekhari Nathan Srebro



Research question: Given a policy class, how many rounds of interaction with an unknown MDP are required to learn an $\epsilon$-suboptimal policy relative to the class?
Motivation: Characterize when agnostic PAC reinforcement learning, with potentially large state and action spaces, is statistically tractable.
Method: Introduce a new complexity measure, the spanning capacity, which depends only on the policy class and is independent of the MDP dynamics; for online RL, identify an additional "sunflower" structure and design a new algorithm, POPLER, inspired by importance sampling and recent techniques for reachable-state identification and policy evaluation.
Results: With a generative model, the spanning capacity characterizes PAC learnability for every policy class, yet there exists a class with bounded spanning capacity that requires superpolynomially many samples in online RL, revealing a surprising separation between generative and online access; bounded spanning capacity combined with the sunflower structure enables statistically efficient online RL via POPLER.

We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $\epsilon$-suboptimal policy with respect to \(\Pi\)? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set \(\Pi\) and is independent of the MDP dynamics. With a generative model, we show that the spanning capacity characterizes PAC learnability for every policy class $\Pi$. However, for online RL, the situation is more subtle. We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional \emph{sunflower} structure which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as recent developments for reachable-state identification and policy evaluation in reward-free exploration.

General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence
Lingwei Zhu Zheng Chen Matthew Kyle Schlegel Martha White



Research question: Policy optimization in reinforcement learning that adds a Kullback-Leibler (KL) divergence to the previous policy in order to keep the policy from changing too quickly.
Motivation: The idea originates in the seminal work on Conservative Policy Iteration, with approximations given by algorithms such as TRPO and Munchausen Value Iteration (MVI).
Method: Investigate a generalized KL divergence, the Tsallis KL divergence, defined via the $q$-logarithm: it is a strict generalization, with $q = 1$ recovering the standard KL divergence and $q > 1$ providing a range of new options.
Results: Characterize the policies learned under Tsallis KL regularization and motivate when $q > 1$ can be beneficial; extending MVI, one of the simplest ways to incorporate KL regularization, the generalized MVI($q$) obtains significant improvements over the standard MVI ($q = 1$) across 35 Atari games.

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence---called the Tsallis KL divergence. Tsallis KL defined by the $q$-logarithm is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q >1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
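
For reference, one common convention for the $q$-logarithm and the Tsallis KL divergence it induces is the following (the paper's exact normalization may differ):

\[
\ln_q x = \frac{x^{1-q}-1}{1-q} \quad (q \neq 1), \qquad \lim_{q\to 1}\ln_q x = \ln x,
\]
\[
D^{\mathrm{KL}}_q(\pi\,\|\,\mu) = \mathbb{E}_{a\sim\pi}\left[-\ln_q \frac{\mu(a)}{\pi(a)}\right],
\]

so that $q=1$ recovers the standard KL divergence, consistent with the abstract.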

TD Convergence: An Optimization Perspective
Kavosh Asadi Shoham Sabach Yao Liu Omer Gottesman Rasool Fakoor



Research question: The convergence behavior of the celebrated temporal-difference (TD) learning algorithm.
Motivation: Viewed through the lens of optimization, TD can be regarded as an iterative optimization algorithm in which the function being minimized changes at every iteration.
Method: By carefully investigating the divergence TD displays on a classical counterexample, identify two forces that determine the convergent or divergent behavior of the algorithm, and prove in the linear TD setting with quadratic loss that convergence hinges on the interplay between them.
Results: The optimization perspective extends to prove convergence of TD in settings much broader than linear approximation with squared loss, providing a theoretical explanation for TD's successful application in reinforcement learning.

We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counter example, we identify two forces that determine the convergent or divergent behavior of the algorithm. We next formalize our discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than just linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning.
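
For readers less familiar with the iteration in question, here is a textbook sketch of TD(0) with linear function approximation. The comment marks the point the abstract emphasizes: the target depends on the current weights, so each update effectively minimizes a different function.

    import numpy as np

    def linear_td0(features, rewards, next_features, alpha, gamma, w=None):
        # One pass of TD(0) with linear value approximation V(s) = phi(s) @ w.
        d = features.shape[1]
        w = np.zeros(d) if w is None else w
        for phi, r, phi_next in zip(features, rewards, next_features):
            # The bootstrapped target r + gamma * phi_next @ w moves with w,
            # so this is not gradient descent on any fixed loss.
            td_error = r + gamma * (phi_next @ w) - (phi @ w)
            w = w + alpha * td_error * phi
        return w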

Doubly Constrained Fair Clustering
John P Dickerson Seyed A. Esmaeili Jamie Heather Morgenstern Claire Jie Zhang



Research question: The relations between different fairness notions in fair clustering.
Motivation: Although these notions are well-justified, they are usually motivated and studied in isolation, with one fairness desideratum considered exclusively of the others, so understanding how the notions relate is an important open problem in fair clustering.
Method: Consider the two most prominent demographic representation fairness notions in clustering: (1) Group Fairness (GF), where each cluster should have close to population-level representation of the demographic groups, and (2) Diversity in Center Selection (DS), where the selected centers should have close to population-level representation of each group. Show that a constant-approximation algorithm for either constraint alone (GF or DS) yields a constant-approximation solution satisfying both simultaneously.
Results: Any solution satisfying GF can be post-processed, at a bounded degradation in clustering cost, to additionally satisfy DS, whereas the converse is not true; moreover, both GF and DS are incompatible (with an empty feasibility set in the worst case) with a collection of other distance-based fairness notions. Experiments validate the theoretical findings.

The remarkable attention which fair clustering has received in the last few years has resulted in a significant number of different notions of fairness. Despite the fact that these notions are well-justified, they are often motivated and studied in a disjoint manner where one fairness desideratum is considered exclusively in isolation from the others. This leaves the understanding of the relations between different fairness notions as an important open problem in fair clustering. In this paper, we take the first step in this direction. Specifically, we consider the two most prominent demographic representation fairness notions in clustering: (1) Group Fairness ($\textbf{GF}$), where the different demographic groups are supposed to have close to population-level representation in each cluster and (2) Diversity in Center Selection ($\textbf{DS}$), where the selected centers are supposed to have close to population-level representation of each group. We show that given a constant approximation algorithm for one constraint ($\textbf{GF}$ or $\textbf{DS}$ only) we can obtain a constant approximation solution that satisfies both constraints simultaneously. Interestingly, we prove that any given solution that satisfies the $\textbf{GF}$ constraint can always be post-processed at a bounded degradation to the clustering cost to additionally satisfy the $\textbf{DS}$ constraint while the same statement is not true given a solution that satisfies $\textbf{DS}$ instead. Furthermore, we show that both $\textbf{GF}$ and $\textbf{DS}$ are incompatible (having an empty feasibility set in the worst case) with a collection of other distance-based fairness notions. Finally, we carry out experiments to validate our theoretical findings.

Riemannian Projection-free Online Learning
Zihao Hu Guanghui Wang Jacob Abernethy



Research question: How to make optimization algorithms efficient in high-dimensional settings or with ill-conditioned constraint sets.
Motivation: The projection operation is a key component of many optimization algorithms, but its computational cost limits efficiency in such settings.
Method: Replace projection queries with more efficient optimization subroutines, i.e., projection-free methods, for online geodesically convex optimization on curved spaces.
Results: Sub-linear regret guarantees for online geodesically convex optimization on curved spaces, under access to either a separation oracle or a linear optimization oracle.

The projection operation is a critical component in a wide range of optimization algorithms, such as online gradient descent (OGD), for enforcing constraints and achieving optimal regret bounds. However, it suffers from computational complexity limitations in high-dimensional settings or when dealing with ill-conditioned constraint sets. Projection-free algorithms address this issue by replacing the projection oracle with more efficient optimization subroutines. But to date, these methods have been developed primarily in the Euclidean setting, and while there has been growing interest in optimization on Riemannian manifolds, there has been essentially no work in trying to utilize projection-free tools here. An apparent issue is that non-trivial affine functions are generally non-convex in such domains. In this paper, we present methods for obtaining sub-linear regret guarantees in online geodesically convex optimization on curved spaces for two scenarios: when we have access to (a) a separation oracle or (b) a linear optimization oracle. For geodesically convex losses, and when a separation oracle is available, our algorithms achieve $O(T^{\frac{1}{2}})$, $O(T^{\frac{3}{4}})$ and $O(T^{\frac{1}{2}})$ adaptive regret guarantees in the full information setting, the bandit setting with one-point feedback and the bandit setting with two-point feedback, respectively. When a linear optimization oracle is available, we obtain regret rates of $O(T^{\frac{3}{4}})$ for geodesically convex losses and $O(T^{\frac{2}{3}}\log T)$ for strongly geodesically convex losses.

Online robust non-stationary estimation
Abishek Sankararaman Murali Balakrishnan



Research question: How to estimate time-varying parameters in real time from high-dimensional, heavy-tailed and corrupted data streams.
Motivation: This is a common subroutine in systems ranging from network monitoring and anomaly detection to traffic scheduling in data centers.
Method: Prove that an appropriately tuned version of clipped stochastic gradient descent (SGD) is simultaneously (i) adaptive to drift, (ii) robust to heavy-tailed inliers and arbitrary corruptions, (iii) free of distributional knowledge, and (iv) implementable in an online streaming fashion.
Results: Neither the $\mathcal{O}\left(\frac{1}{t}\right)$ learning rate, known to be optimal for strongly convex losses on stationary streams, nor the $\mathcal{O}(1)$ rate, known to be optimal for adapting to drift in a noiseless environment, can be used; instead a learning rate of $T^{-\alpha}$ in the stream length $T$ is needed to balance adaptivity to potential drift against noise. A new inductive argument, combined with a martingale concentration result, yields high-probability guarantees under any learning rate on streams exhibiting arbitrary distribution shift, a proof strategy that may be of independent interest; the classical doubling trick removes the need to know $T$. This is the first online estimation algorithm provably robust to heavy tails, corruptions and distribution shift simultaneously, and the theory is complemented empirically on synthetic and real data.

The real-time estimation of time-varying parameters from high-dimensional, heavy-tailed and corrupted data-streams is a common sub-routine in systems ranging from those for network monitoring and anomaly detection to those for traffic scheduling in data-centers. For estimation tasks that can be cast as minimizing a strongly convex loss function, we prove that an appropriately tuned version of the {\ttfamily clipped Stochastic Gradient Descent} (SGD) is simultaneously {\em (i)} adaptive to drift, {\em (ii)} robust to heavy-tailed inliers and arbitrary corruptions, {\em (iii)} free of distributional knowledge, and {\em (iv)} implementable in an online streaming fashion. All prior estimation algorithms have only been proven to possess a subset of these practical desiderata. An observation we make is that neither the $\mathcal{O}\left(\frac{1}{t}\right)$ learning rate for {\ttfamily clipped SGD} known to be optimal for strongly convex loss functions of a \emph{stationary} data-stream, nor the $\mathcal{O}(1)$ learning rate known to be optimal for being adaptive to drift in a \emph{noiseless} environment can be used. Instead, a learning rate of $T^{-\alpha}$ for $\alpha < 1$, where $T$ is the stream-length, is needed to balance adaptivity to potential drift and to combat noise. We develop a new inductive argument and combine it with a martingale concentration result to derive high-probability guarantees under \emph{any learning rate} on data-streams exhibiting \emph{arbitrary distribution shift} - a proof strategy that may be of independent interest. Further, using the classical doubling-trick, we relax the requirement of knowing the stream length $T$. Ours is the first online estimation algorithm that is provably robust to heavy-tails, corruptions and distribution shift simultaneously. We complement our theoretical results empirically on synthetic and real data.
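
A minimal sketch of the tuned iteration described above, assuming a generic stochastic-gradient oracle; the clipping threshold and $\alpha$ are illustrative, and the paper's precise tuning should be taken from the paper itself.

    import numpy as np

    def clipped_sgd(grad_fn, theta0, T, clip, alpha=0.5):
        # Clipped SGD with the constant-in-t learning rate T**(-alpha)
        # that the abstract argues balances drift adaptation and noise.
        theta = np.asarray(theta0, dtype=float)
        eta = T ** (-alpha)
        for t in range(T):
            g = np.asarray(grad_fn(t, theta))   # possibly heavy-tailed / corrupted
            norm = np.linalg.norm(g)
            if norm > clip:
                g = g * (clip / norm)           # clip the gradient to norm `clip`
            theta = theta - eta * g
        return theta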

Gradient-Based Feature Learning under Structured Data
Alireza Mousavi-Hosseini Denny Wu Taiji Suzuki Murat A Erdogdu



Research question: How the sample complexity of gradient-based learning of single-index models is governed by the information exponent under anisotropic data.
Motivation: Existing results concern isotropic data, whereas practical inputs often contain additional structure that can implicitly guide the algorithm.
Method: Investigate the effect of a spiked covariance structure: show that in the anisotropic setting the commonly used spherical gradient dynamics may fail to recover the true direction even when the spike is perfectly aligned with the target, that an appropriate weight normalization reminiscent of batch normalization alleviates this issue, and that exploiting the alignment between the (spiked) input covariance and the target yields improved sample complexity over the isotropic case.
Results: Under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent while also outperforming lower bounds for rotationally invariant kernel methods.

Recent works have demonstrated that the sample complexity of gradient-based learning of single index models, i.e. functions that depend on a 1-dimensional projection of the input data, is governed by their information exponent. However, these results are only concerned with isotropic data, while in practice the input often contains additional structure which can implicitly guide the algorithm. In this work, we investigate the effect of a spiked covariance structure and reveal several interesting phenomena. First, we show that in the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction, even when the spike is perfectly aligned with the target direction. Next, we show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue. Further, by exploiting the alignment between the (spiked) input covariance and the target, we obtain improved sample complexity compared to the isotropic case. In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent while also outperforming lower bounds for rotationally invariant kernel methods.

An Alternating Optimization Method for Bilevel Problems under the Polyak-Łojasiewicz Condition
Quan Xiao Songtao Lu Tianyi Chen



Research question: Bilevel optimization has regained interest owing to applications in emerging machine learning fields such as hyperparameter optimization, meta-learning, and reinforcement learning, but it is unclear whether existing convergence results extend to bilevel problems beyond the basic strongly convex lower-level setting.
Motivation: Address this by introducing a stationarity metric for the considered bilevel problems that generalizes the existing metric to nonconvex lower-level objectives satisfying the Polyak-Łojasiewicz (PL) condition.
Method: Propose a Generalized ALternating mEthod for bilevel opTimization (GALET), tailored to bilevel optimization with a convex PL lower-level problem.
Results: GALET achieves an $\epsilon$-stationary point of the considered problem within $\tilde{\cal O}(\epsilon^{-1})$ iterations, matching the iteration complexity of GD for single-level smooth nonconvex problems.

Bilevel optimization has recently regained interest owing to its applications in emerging machine learning fields such as hyperparameter optimization, meta-learning, and reinforcement learning. Recent results have shown that simple alternating (implicit) gradient-based algorithms can match the convergence rate of single-level gradient descent (GD) when addressing bilevel problems with a strongly convex lower-level objective. However, it remains unclear whether this result can be generalized to bilevel problems beyond this basic setting. In this paper, we first introduce a stationary metric for the considered bilevel problems, which generalizes the existing metric, for a nonconvex lower-level objective that satisfies the Polyak-Łojasiewicz (PL) condition. We then propose a Generalized ALternating mEthod for bilevel opTimization (GALET) tailored to bilevel optimization (BLO) with a convex PL lower-level (LL) problem and establish that GALET achieves an $\epsilon$-stationary point for the considered problem within $\tilde{\cal O}(\epsilon^{-1})$ iterations, which matches the iteration complexity of GD for single-level smooth nonconvex problems.

A Competitive Algorithm for Agnostic Active Learning
Yihan Zhou Eric Price



Research question: Designing more effective agnostic active learning algorithms that require fewer samples.
Motivation: The most popular agnostic active learning algorithms, whose performance is stated in terms of the disagreement coefficient, are known to be inefficient on some hypothesis classes and input distributions.
Method: A new splitting-based approach that is competitive with the optimal algorithm for any binary hypothesis class and input distribution.
Results: If any algorithm can achieve $O(\eta)$ error with $m^*$ queries, the proposed algorithm achieves $O(\eta)$ error with $O(m^* \log H)$ queries; moreover, improving on this $O(\log H)$ overhead is NP-hard in general.

For some hypothesis classes and input distributions, \emph{active} agnostic learning needs exponentially fewer samples than passive learning; for other classes and distributions, it offers little to no improvement. The most popular algorithms for agnostic active learning express their performance in terms of a parameter called the disagreement coefficient, but it is known that these algorithms are inefficient on some inputs. We take a different approach to agnostic active learning, getting an algorithm that is \emph{competitive} with the optimal algorithm for any binary hypothesis class $H$ and distribution $\mathcal{D}_X$ over $X$. In particular, if any algorithm can use $m^*$ queries to get $O(\eta)$ error, then our algorithm uses $O(m^* \log H)$ queries to get $O(\eta)$ error. Our algorithm lies in the vein of the splitting-based approach of Dasgupta [2004], which gets a similar result for the realizable ($\eta = 0$) setting. We also show that it is NP-hard to do better than our algorithm's $O(\log H)$ overhead in general.

Lower Bounds on Adaptive Sensing for Matrix Recovery
Praneeth Kacham David Woodruff



Research question: Lower bounds on adaptive sensing algorithms for recovering low-rank matrices from linear measurements.
Motivation: At the edge of the noise level where recovery is information-theoretically feasible, known non-adaptive algorithms must perform $\Omega(n^2)$ measurements, which amounts to reading the entire matrix; whether adaptivity helps decrease the number of measurements has remained largely unexplored for matrix recovery.
Method: Construct a hard input distribution and prove round lower bounds against any adaptive algorithm that uses $k$ linear measurements per round.
Results: Any adaptive algorithm using $k$ linear measurements per round that outputs a good approximation with probability $\ge 9/10$ must run for $t = \Omega(\log(n^2/k)/\log\log n)$ rounds; the techniques extend to tensor recovery and give measurement-versus-rounds trade-offs for several sensing problems in numerical linear algebra.

We study lower bounds on adaptive sensing algorithms for recovering low rank matrices using linear measurements. Given an $n \times n$ matrix $A$, a general linear measurement $S(A)$, for an $n \times n$ matrix $S$, is just the inner product of $S$ and $A$, each treated as $n^2$-dimensional vectors. By performing as few linear measurements as possible on a rank-$r$ matrix $A$, we hope to construct a matrix $\hat{A}$ that satisfies $\|A - \hat{A}\|_F^2 \le c \|A\|_F^2$, for a small constant $c$. Here $\|A\|_F$ denotes the Frobenius norm $(\sum_{i,j} A_{i,j}^2)^{1/2}$. It is commonly assumed that when measuring $A$ with $S$, the response is corrupted with an independent Gaussian random variable of mean $0$ and variance $\sigma^2$. Candès and Plan (IEEE Trans. Inform. Theory 2011) study non-adaptive algorithms for low rank matrix recovery using random linear measurements. They use the restricted isometry property (RIP) of Random Gaussian Matrices to give tractable algorithms to estimate $A$ from the measurements. At the edge of the noise level where recovery is information-theoretically feasible, it is known that their non-adaptive algorithms need to perform $\Omega(n^2)$ measurements, which amounts to reading the entire matrix. An important question is whether adaptivity helps in decreasing the overall number of measurements. While for the related problem of sparse recovery, adaptive algorithms have been extensively studied, as far as we are aware adaptive algorithms and lower bounds on them seem largely unexplored for matrix recovery. We show that any adaptive algorithm that uses $k$ linear measurements in each round and outputs an approximation as in (1) with probability $\ge 9/10$ must run for $t = \Omega(\log(n^2/k)/\log\log n)$ rounds. Our lower bound shows that any adaptive algorithm which uses $n^{2-\beta}$ ($\beta > 0$ is arbitrary constant) linear measurements in each round must run for $\Omega(\log n/\log\log n)$ rounds. Our techniques also readily extend to obtain lower bounds on adaptive algorithms for tensor recovery. Our hard distribution also allows us to give a measurement-vs-rounds trade-off for many sensing problems in numerical linear algebra, such as spectral norm low rank approximation, Frobenius norm low rank approximation, singular vector approximation, and more.

Context-lumpable stochastic bandits
Chung-Wei Lee Qinghua Liu Yasin Abbasi-Yadkori Chi Jin Tor Lattimore Csaba Szepesvari



Research question: The contextual bandit problem with $S$ contexts and $K$ actions.
Motivation: In each round the learner observes a random context, chooses an action based on past experience, and then observes a random reward whose mean is a function of the context and the chosen action.
Method: Under the assumption that the contexts can be lumped into $r$ groups such that the mean reward of every action is the same for any two contexts in the same group, give an algorithm that outputs an $\epsilon$-optimal policy with high probability after at most $\widetilde O(r(S+K)/\epsilon^2)$ samples, together with a matching $\widetilde\Omega(r(S+K)/\epsilon^2)$ lower bound.
Results: In the regret-minimization setting, an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$; these are the first near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\text{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem, and the algorithms extend to more general low-rank bandits with improved regret bounds in some scenarios.

We consider a contextual bandit problem with $S $ contexts and $K $ actions. In each round $t=1,2,\dots$ the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min(S ,K)$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r (S +K )/\epsilon^2)$ samples with high probability and provide a matching $\widetilde\Omega(r (S +K )/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r ^3(S +K )T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\text{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.

Optimistic Rates for Multi-Task Representation Learning
Austin Watkins Enayat Ullah Thanh Nguyen-Tang Raman Arora



Research question: Transfer learning via Multi-Task Representation Learning (MTRL), in which multiple source tasks are used to learn a good common representation and a predictor for the target task is trained on top of it.
Motivation: Under standard regularity assumptions on the loss function and task diversity, provide new statistical rates on the excess risk of the target task that demonstrate the benefit of representation learning.
Method: Derive optimistic rates that interpolate between the standard $O(m^{-1/2})$ rate and the fast $O(m^{-1})$ rate depending on the difficulty of the learning task, where $m$ is the number of target-task samples, together with a local Rademacher complexity theorem for MTRL and MTL and a chain rule for the local Rademacher complexity of composite predictor classes.
Results: Optimistic excess-risk bounds for the target task, plus analogous optimistic rates for the excess risk of the source tasks (multi-task learning, MTL).

We study the problem of transfer learning via Multi-Task Representation Learning (MTRL), wherein multiple source tasks are used to learn a good common representation, and a predictor is trained on top of it for the target task. Under standard regularity assumptions on the loss function and task diversity, we provide new statistical rates on the excess risk of the target task, which demonstrate the benefit of representation learning. Importantly, our rates are optimistic, i.e., they interpolate between the standard $O(m^{-1/2})$ rate and the fast $O(m^{-1})$ rate, depending on the difficulty of the learning task, where $m$ is the number of samples for the target task. Besides the main result, we make several new contributions, including giving optimistic rates for excess risk of source tasks (multi-task learning (MTL)), a local Rademacher complexity theorem for MTRL and MTL, as well as a chain rule for local Rademacher complexity for composite predictor classes.

Solving Linear Inverse Problems Provably via Posterior Sampling with Latent Diffusion Models
Litu Rout Negin Raoof Giannis Daras Constantine Caramanis Alex Dimakis Sanjay Shakkottai



Research question: The first framework for solving linear inverse problems with pre-trained latent diffusion models.
Motivation: Previously proposed algorithms (such as DPS and DDRM) apply only to pixel-space diffusion models, while this method covers the more general latent setting.
Method: A theoretical analysis proving sample recovery in a linear model setting, with algorithmic insights that extend to the more general settings commonly encountered in practice.
Results: Experiments show the method outperforms previously proposed posterior sampling algorithms on a wide variety of problems, including random inpainting, block inpainting, denoising, deblurring, destriping, and super-resolution.

We present the first framework to solve linear inverse problems leveraging pre-trained \textit{latent} diffusion models. Previously proposed algorithms (such as DPS and DDRM) only apply to \textit{pixel-space} diffusion models. We theoretically analyze our algorithm showing provable sample recovery in a linear model setting. The algorithmic insight obtained from our analysis extends to more general settings often considered in practice. Experimentally, we outperform previously proposed posterior sampling algorithms in a wide variety of problems including random inpainting, block inpainting, denoising, deblurring, destriping, and super-resolution.

Sub-optimality of the Naive Mean Field approximation for proportional high-dimensional Linear Regression
Jiaze Qiu



Research question: Characterizing the Naive Mean Field (NMF) approximation, widely used in modern machine learning, in high-dimensional linear regression.
Motivation: Despite its popularity in practice, theoretical guarantees for high-dimensional problems exist only under strong structural assumptions (e.g., sparsity), and existing theory often fails to explain reported empirical observations.
Method: Derive sharp asymptotic characterizations of the NMF approximation under an i.i.d. Gaussian design in the proportional asymptotic regime, covering a wide class of natural priors and allowing model mismatch, using recent advances in Gaussian comparison inequalities.
Results: Establish the inaccuracy of the NMF approximation for the log-normalizing constant in this regime, and provide theory backing the empirical observation that NMF can be overconfident in uncertainty quantification; numerical experiments corroborate the results.

The Naïve Mean Field (NMF) approximation is widely employed in modern Machine Learning due to the huge computational gains it bestows on the statistician. Despite its popularity in practice, theoretical guarantees for high-dimensional problems are only available under strong structural assumptions (e.g. sparsity). Moreover, existing theory often does not explain empirical observations noted in the existing literature. In this paper, we take a step towards addressing these problems by deriving sharp asymptotic characterizations for the NMF approximation in high-dimensional linear regression. Our results apply to a wide class of natural priors and allow for model mismatch (i.e. the underlying statistical model can be different from the fitted model). We work under an iid Gaussian design and the proportional asymptotic regime, where the number of features and number of observations grow at a proportional rate. As a consequence of our asymptotic characterization, we establish two concrete corollaries: (a) we establish the inaccuracy of the NMF approximation for the log-normalizing constant in this regime, and (b) we provide theoretical results backing the empirical observation that the NMF approximation can be overconfident in terms of uncertainty quantification. Our results utilize recent advances in the theory of Gaussian comparison inequalities. To the best of our knowledge, this is the first application of these ideas to the analysis of Bayesian variational inference problems. Our theoretical results are corroborated by numerical experiments. Lastly, we believe our results can be generalized to non-Gaussian designs and provide empirical evidence to support it.

Learning Curves for Noisy Heterogeneous Feature-Subsampled Ridge Ensembles
Benjamin Samuel Ruben Cengiz Pehlevan



Research question: Develop a theory of feature bagging in noisy least-squares ridge ensembles and simplify the resulting learning curves in the special case of equicorrelated data.
Motivation: Feature bagging is a well-established ensembling method that reduces prediction variance by combining estimators trained on feature subsets or projections, but learning curves and performance optimization for noisy least-squares ridge ensembles have been lacking.
Method: Develop the theory, simplify the learning curves for equicorrelated data, show analytically that subsampling shifts the double-descent peak of a linear predictor, and introduce heterogeneous feature ensembling, with estimators built on varying numbers of feature dimensions, as a computationally efficient way to mitigate double descent.
Results: Comparing a feature-subsampling ensemble to a single linear predictor reveals a trade-off between noise amplification due to subsampling and noise reduction due to ensembling; the qualitative insights carry over to linear classifiers on image classification tasks with realistic datasets.

Feature bagging is a well-established ensembling method which aims to reduce prediction variance by combining predictions of many estimators trained on subsets or projections of features. Here, we develop a theory of feature-bagging in noisy least-squares ridge ensembles and simplify the resulting learning curves in the special case of equicorrelated data. Using analytical learning curves, we demonstrate that subsampling shifts the double-descent peak of a linear predictor. This leads us to introduce heterogeneous feature ensembling, with estimators built on varying numbers of feature dimensions, as a computationally efficient method to mitigate double-descent. Then, we compare the performance of a feature-subsampling ensemble to a single linear predictor, describing a trade-off between noise amplification due to subsampling and noise reduction due to ensembling. Our qualitative insights carry over to linear classifiers applied to image classification tasks with realistic datasets constructed using a state-of-the-art deep learning feature map.
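
A minimal sketch of the object under study, a feature-subsampling ridge ensemble: each member is a ridge regressor fit on a random subset of features, and predictions are averaged. The heterogeneous variant introduced above would vary n_feats across members; all names and defaults here are illustrative.

    import numpy as np

    def ridge_fit(X, y, lam):
        # Ridge weights (X^T X + lam I)^{-1} X^T y.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def feature_bagged_predict(X_tr, y_tr, X_te, n_members, n_feats, lam, seed=0):
        # Average predictions of ridge estimators on random feature subsets.
        rng = np.random.default_rng(seed)
        preds = np.zeros(len(X_te))
        for _ in range(n_members):
            S = rng.choice(X_tr.shape[1], size=n_feats, replace=False)
            preds += X_te[:, S] @ ridge_fit(X_tr[:, S], y_tr, lam)
        return preds / n_members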

Non-Stationary Bandits with Auto-Regressive Temporal Dependency
Qinyi Chen Negin Golrezaei Djallel Bouneffouf



Research question: Traditional multi-armed bandit (MAB) frameworks, predominantly studied in stochastic or adversarial settings, overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising.
Motivation: Capture these real-world dynamics with a non-stationary MAB framework built on an auto-regressive (AR) reward structure.
Method: An algorithm integrating two mechanisms: (i) an alternation mechanism that leverages temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism that discards out-of-date information.
Results: A regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark; a real-world case study on tourism demand prediction demonstrates both the efficacy of the algorithm and the broader applicability of the techniques to rapidly evolving time series.

Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.

When are ensembles really effective?
Ryan Theisen Hyunsuk Kim Yaoqing Yang Liam Hodgkinson Michael W. Mahoney



Research question: Understand, theoretically and empirically, when ensembling yields significant performance improvements in classification tasks.
Motivation: Ensembling has a long history in statistical data analysis with many impactful applications, yet in many modern machine learning settings its benefits are less ubiquitous and less obvious.
Method: Prove new results relating the "ensemble improvement rate" (how much ensembling decreases the error rate relative to a single model) to the "disagreement-error ratio": ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate, and conversely a single classifier often suffices when the disagreement rate is relatively low. Complement the theory with an empirical study across a variety of settings, verifying the predictions and identifying practical scenarios where ensembling does and does not yield large gains.
Results: A distinct difference in behavior between interpolating models (popular in current practice) and non-interpolating models (such as tree-based methods, where ensembling is popular): ensembling helps considerably more in the latter case than in the former.

Ensembling has a long history in statistical data analysis, with many impactful applications. However, in many modern machine learning settings, the benefits of ensembling are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. Theoretically, we prove new results relating the \emph{ensemble improvement rate} (a measure of how much ensembling decreases the error rate versus a single model, on a relative scale) to the \emph{disagreement-error ratio}. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate; and that, conversely, one classifier is often enough whenever the disagreement rate is low relative to the average error rate. On the way to proving these results, we derive, under a mild condition called \emph{competence}, improved upper and lower bounds on the average test error rate of the majority vote classifier. To complement this theory, we study ensembling empirically in a variety of settings, verifying the predictions made by our theory, and identifying practical scenarios where ensembling does and does not result in large performance improvements. Perhaps most notably, we demonstrate a distinct difference in behavior between interpolating models (popular in current practice) and non-interpolating models (such as tree-based methods, where ensembling is popular), demonstrating that ensembling helps considerably more in the latter case than in the former.
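
The quantities in this analysis are easy to estimate empirically. A sketch under assumed conventions (0/1 predictions, majority vote, improvement measured as relative error reduction; the paper's formal definitions may differ in detail):

    import numpy as np

    def ensemble_diagnostics(preds, y):
        # preds: (n_models, n_points) array of 0/1 predictions; y: true labels.
        preds = np.asarray(preds)
        avg_err = np.mean(preds != y[None, :])            # average single-model error
        m = len(preds)
        pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
        disagreement = (np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs])
                        if pairs else 0.0)
        mv = (preds.mean(axis=0) > 0.5).astype(int)       # majority-vote classifier
        mv_err = np.mean(mv != y)
        improvement = (avg_err - mv_err) / avg_err if avg_err > 0 else 0.0
        return avg_err, disagreement, mv_err, improvement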

On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms
Lam M. Nguyen Trang H. Tran



Research question: When does shuffling-type stochastic gradient descent (SGD) converge to a global solution for non-convex problems?
Motivation: SGD is the method of choice in many machine learning tasks thanks to its scalability and efficiency on large-scale problems, and its shuffling version matches mainstream practical heuristics.
Method: Analyze shuffling SGD for a class of non-convex functions under over-parameterized settings, using more relaxed non-convex assumptions than previous literature.
Results: Convergence to a global solution, while maintaining the computational complexity that shuffling SGD achieves in the general convex setting.

Stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than previous literature. Nevertheless, we maintain the desired computational complexity as shuffling SGD has achieved in the general convex setting.

Reliable learning in challenging environments
Nina Balcan Steve Hanneke Rattana Pukdee Dravyansh Sharma



Research question: Designing learners that guarantee their predictions are provably correct is of increasing importance in machine learning.
Motivation: Existing learning-theoretic guarantees apply only to very specific settings, leaving the challenging test-time environments of modern machine learning, such as adversarial test-time attacks and natural distribution shifts, insufficiently addressed.
Method: Design and analyze a reliable learner with provably optimal guarantees in such environments, and discuss computationally feasible implementations of the learner.
Results: The algorithm achieves strong positive performance guarantees on several natural examples, for instance linear separators under log-concave distributions or smooth boundary classifiers under smooth probability distributions.

The problem of designing learners that provide guarantees that their predictions are provably correct is of increasing importance in machine learning. However, learning-theoretic guarantees have only been considered in very specific settings. In this work, we consider the design and analysis of reliable learners in challenging test-time environments as encountered in modern machine learning problems: namely adversarial test-time attacks (in several variations) and natural distribution shifts. We provide a reliable learner with provably optimal guarantees in such settings. We discuss computationally feasible implementations of the learner and further show that our algorithm achieves strong positive performance guarantees on several natural examples: for example, linear separators under log-concave distributions or smooth boundary classifiers under smooth probability distributions.

Label Robust and Differentially Private Linear Regression: Computational and Statistical Efficiency
Xiyang Liu Prateek Jain Weihao Kong Sewoong Oh Arun Suggala



Research question: Linear regression under $(\varepsilon,\delta)$-differential privacy when the datapoints are sampled i.i.d. from a distribution and a fraction of the response variables are adversarially corrupted.
Motivation: No prior method for this problem is provably efficient both computationally and statistically.
Method: A variant of differentially private stochastic gradient descent (DP-SGD) with two key innovations: full-batch gradient descent to improve sample complexity, and a novel adaptive clipping to guarantee robustness.
Results: The method requires only linear time in the input size and matches the information-theoretically optimal sample complexity up to a data-distribution-dependent condition number factor; without adversarial corruption, the same algorithm still improves on the existing state of the art and achieves near-optimal sample complexity.

We study the canonical problem of linear regression under $(\varepsilon,\delta)$-differential privacy when the datapoints are sampled i.i.d.~from a distribution and a fraction of response variables are adversarially corrupted. We provide the first provably efficient -- both computationally and statistically -- method for this problem, assuming standard assumptions on the data distribution. Our algorithm is a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two key innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. Our method requires only linear time in input size, and still matches the information theoretical optimal sample complexity up to a data distribution dependent condition number factor. Interestingly, the same algorithm, when applied to a setting where there is no adversarial corruption, still improves upon the existing state-of-the-art and achieves a near optimal sample complexity.
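
For context, the DP-SGD-style template the method modifies looks roughly as follows for linear regression: full-batch gradients, per-example clipping, Gaussian noise. The paper's adaptive choice of the clipping threshold is replaced here by a fixed one, and the noise calibration is schematic; parameter names are illustrative.

    import numpy as np

    def dp_full_batch_gd(X, y, steps, lr, clip, noise_mult, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(steps):
            per_ex = X * (X @ w - y)[:, None]              # per-example gradients
            norms = np.linalg.norm(per_ex, axis=1)
            scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
            g = (per_ex * scale[:, None]).sum(axis=0) / n  # clipped, averaged
            g += rng.normal(0.0, noise_mult * clip / n, size=d)  # Gaussian noise
            w -= lr * g
        return w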

On the Sublinear Regret of GP-UCB
Justin Whitehouse Aaditya Ramdas Steven Wu



Research question: In the kernelized bandit problem, are the existing regret analyses of GP-UCB tight, or can the bounds be improved?
Motivation: Existing analyses of GP-UCB give a suboptimal regret rate that fails to be sublinear for many commonly used kernels, such as the Matern kernel, leaving a longstanding open question.
Method: Regularize kernel ridge estimators in proportion to the smoothness of the underlying kernel, combined with a largely overlooked concentration result in separable Hilbert spaces, to obtain a tighter analysis of GP-UCB.
Results: GP-UCB enjoys nearly optimal regret, with sublinear rates for the Matern kernel, improving over state-of-the-art analyses and partially resolving a COLT open problem posed by Vakili et al.

In the kernelized bandit problem, a learner aims to sequentially compute the optimum of a function lying in a reproducing kernel Hilbert space given only noisy evaluations at sequentially chosen points. In particular, the learner aims to minimize regret, which is a measure of the suboptimality of the choices made. Arguably the most popular algorithm is the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm, which involves acting based on a simple linear estimator of the unknown function. Despite its popularity, existing analyses of GP-UCB give a suboptimal regret rate, which fails to be sublinear for many commonly used kernels such as the Matern kernel. This has led to a longstanding open question: are existing regret analyses for GP-UCB tight, or can bounds be improved by using more sophisticated analytical techniques? In this work, we resolve this open question and show that GP-UCB enjoys nearly optimal regret. In particular, our results yield sublinear regret rates for the Matern kernel, improving over the state-of-the-art analyses and partially resolving a COLT open problem posed by Vakili et al. Our improvements rely on a key technical contribution --- regularizing kernel ridge estimators in proportion to the smoothness of the underlying kernel $k$. Applying this key idea together with a largely overlooked concentration result in separable Hilbert spaces (for which we provide an independent, simplified derivation), we are able to provide a tighter analysis of the GP-UCB algorithm.
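
As a rough illustration of the acquisition rule being analyzed, here is a generic GP-UCB step built on a kernel ridge estimator; the regularizer `lam` plays the role of the quantity the paper proposes to scale with the kernel's smoothness. The RBF kernel, `beta`, and all names are assumptions of this sketch, not the paper's construction.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * ls ** 2))

def gp_ucb_pick(X, y, candidates, beta=2.0, lam=1.0, kernel=rbf):
    """One GP-UCB acquisition step from data (X, y): pick the candidate
    maximizing mean + beta * std of the kernel ridge posterior."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    A = K + lam * np.eye(len(X))          # lam: smoothness-scaled regularizer
    alpha = np.linalg.solve(A, y)
    best, best_ucb = None, -np.inf
    for x in candidates:
        kx = np.array([kernel(x, xi) for xi in X])
        mu = kx @ alpha
        var = kernel(x, x) - kx @ np.linalg.solve(A, kx)
        ucb = mu + beta * np.sqrt(max(var, 0.0))
        if ucb > best_ucb:
            best, best_ucb = x, ucb
    return best
```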

$k$-Means Clustering with Distance-Based Privacy
Alessandro Epasto Vahab Mirrokni Shyam Narayanan Peilin Zhong



Research question: Euclidean clustering under distance-based privacy.
Motivation: In practice it is often only necessary to protect the privacy of exact, rather than approximate, locations.
Method: Constant-approximation algorithms for $k$-means and $k$-median clustering whose additive error depends only on the attacker's precision bound $\rho$, rather than the radius $\Lambda$ of the space.
Results: Experiments show the algorithm significantly outperforms previous differentially private clustering algorithms as well as naive distance-based private clustering baselines.

In this paper, we initiate the study of Euclidean clustering with Distance-based privacy. Distance-based privacy is motivated by the fact that it is often only needed to protect the privacy of exact, rather than approximate, locations. We provide constant-approximate algorithms for $k$-means and $k$-median clustering, with additive error depending only on the attacker's precision bound $\rho$, rather than the radius $\Lambda$ of the space. In addition, we empirically demonstrate that our algorithm performs significantly better than previous differentially private clustering algorithms, as well as naive distance-based private clustering baselines.

Unconstrained Dynamic Regret via Sparse Coding
Zhiyu Zhang Ashok Cutkosky Ioannis Paschalidis



Research question: Online convex optimization (OCO) under the coupling of two problem structures: the domain is unbounded and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying.
Motivation: Since no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity.
Method: A sparse coding framework achieves a new type of adaptive regret bound; the comparator's complexity is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable flexibility.
Results: For example, equipped with a wavelet dictionary, the framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to (i) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^T u_t/T||$ rather than the maximum $\max_t||u_t||$, and (ii) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$ rather than the uncentered sum $\sum_{t=1}^T||u_t||$; moreover, decoupling function approximation from regret minimization makes the proof simpler.

Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity. That is, sensible regret bounds should depend on certain complexity measures of the comparator relative to one's prior knowledge. This paper achieves a new type of such adaptive regret bounds leveraging a sparse coding framework. The complexity of the comparator is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable versatility. For example, equipped with a wavelet dictionary, our framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to both ($i$) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^Tu_t/T||$, rather than the maximum $\max_t||u_t||$; and ($ii$) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$, rather than the uncentered sum $\sum_{t=1}^T||u_t||$. Furthermore, our proof is simpler due to decoupling function approximation from regret minimization.

Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks
Tolga Ergen Mert Pilanci



Research question: Uncover the fundamental principles behind the success of deep neural networks.
Motivation: Address one of the most important open problems in the current literature: understanding the nature of the deep neural network training problem.
Method: An analytic approach reveals hidden convexity in the optimization landscape; for deep parallel ReLU network architectures, the path-regularized training problem can be represented as an exact convex optimization problem, and the equivalent convex problem is regularized via a group sparsity inducing norm, so a path-regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions.
Results: Since the original training problem may not be trainable in polynomial time, an approximate algorithm with fully polynomial-time complexity in all data dimensions is proposed, strong global optimality guarantees are proved for it, and experiments corroborate the theory.

Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, since the original training problem may not be trainable in polynomial-time, we propose an approximate algorithm with a fully polynomial-time complexity in all data dimensions. Then, we prove strong global optimality guarantees for this algorithm. We also provide experiments corroborating our theory.

On the Role of Noise in the Sample Complexity of Learning Recurrent Neural Networks: Exponential Gaps for Long Sequences
Alireza Fathollah Pour Hassan Ashtiani



Research question: The performance of noisy multi-layered sigmoid recurrent neural networks with unbounded weights on sequence classification.
Motivation: Explore the effect on learnability of adding i.i.d. Gaussian noise to the output of each neuron in the network.
Method: Theoretical analysis of both the noisy and noiseless cases, yielding upper and lower bounds on the sample complexity.
Results: Noise significantly changes the dependence of the sample complexity on the sequence length, and this effect persists even when the noise is numerically negligible.

We consider the class of noisy multi-layered sigmoid recurrent neural networks with $w$ (unbounded) weights for classification of sequences of length $T$, where independent noise distributed according to $\mathcal{N}(0,\sigma^2)$ is added to the output of each neuron in the network. Our main result shows that the sample complexity of PAC learning this class can be bounded by $O (w\log(T/\sigma))$. For the non-noisy version of the same class (i.e., $\sigma=0$), we prove a lower bound of $\Omega (wT)$ for the sample complexity. Our results indicate an exponential gap in the dependence of sample complexity on $T$ for noisy versus non-noisy networks. Moreover, given the mild logarithmic dependence of the upper bound on $1/\sigma$, this gap still holds even for numerically negligible values of $\sigma$.
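The noise model is simple to state in code: i.i.d. $\mathcal{N}(0,\sigma^2)$ noise added to every neuron output at every time step. The sketch below shows one noisy sigmoid recurrent layer; the shapes and names are illustrative assumptions.

```python
import numpy as np

def noisy_sigmoid_rnn(x_seq, W_in, W_rec, sigma, rng):
    """Run a sigmoid RNN over a sequence, adding N(0, sigma^2) noise to
    each neuron's output at each time step (the paper's noise model)."""
    h = np.zeros(W_rec.shape[0])
    for x in x_seq:
        h = 1.0 / (1.0 + np.exp(-(W_rec @ h + W_in @ x)))  # sigmoid units
        h += sigma * rng.standard_normal(h.shape)          # per-neuron noise
    return h
```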

Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise
Ilias Diakonikolas Jelena Diakonikolas Daniel Kane Puqian Wang Nikos Zarifis



Research question: Learning general (not necessarily homogeneous) halfspaces with Random Classification Noise under the Gaussian distribution.
Motivation: This basic problem exhibits a surprising information-computation gap.
Method: Establish nearly-matching algorithmic and Statistical Query (SQ) lower-bound results, and design a computationally efficient learning algorithm with sample complexity $\tilde{O}(d/\epsilon + d/(\max(p, \epsilon))^2)$, where $p$ quantifies the bias of the target halfspace.
Results: On the lower-bound side, any efficient SQ algorithm (or low-degree test) for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max(p, \epsilon))^2)$, suggesting the quadratic dependence is inherent for efficient algorithms.

We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces with Random Classification Noise under the Gaussian distribution. We establish nearly-matching algorithmic and Statistical Query (SQ) lower bound results revealing a surprising information-computation gap for this basic problem. Specifically, the sample complexity of this learning problem is $\widetilde{\Theta}(d/\epsilon)$, where $d$ is the dimension and $\epsilon$ is the excess error. Our positive result is a computationally efficient learning algorithm with sample complexity $\tilde{O}(d/\epsilon + d/(\max(p, \epsilon))^2)$, where $p$ quantifies the bias of the target halfspace. On the lower bound side, we show that any efficient SQ algorithm (or low-degree test) for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max(p, \epsilon))^2)$. Our lower bound suggests that this quadratic dependence on $1/\max(p, \epsilon)$ is inherent for efficient algorithms.

Boosting with Tempered Exponential Measures
Richard Nock Ehsan Amid Manfred K Warmuth



Research question: Generalize AdaBoost to tempered exponential measures (TEMs), where normalization is enforced on a specific power of the measure rather than the measure itself.
Motivation: AdaBoost can be derived from the dual of a relative entropy minimization problem subject to the positive example weights summing to one; TEMs, indexed by a parameter $t$ and generalizing exponential families ($t=1$), suggest a broader family of boosting algorithms.
Method: The algorithm $t$-AdaBoost recovers AdaBoost as the special case $t=1$, partially computes on a generalization of classical arithmetic over the reals, and yields a new family of tempered losses for inducing domain-partitioning classifiers such as decision trees.
Results: $t$-AdaBoost retains AdaBoost's exponential convergence rate for $t\in[0,1)$ with a slightly improved hidden constant, guarantees bounded leveraging coefficients for $t\in[0,1)$, ensures strict properness of the tempered losses, and experiments with $t$-AdaBoost+trees show that significant leverage can be achieved by tuning $t$.

One of the most popular ML algorithms, AdaBoost, can be derived from the dual of a relative entropy minimization problem subject to the fact that the positive weights on the examples sum to one. Essentially, harder examples receive higher probabilities. We generalize this setup to the recently introduced *tempered exponential measures* (TEMs) where normalization is enforced on a specific power of the measure and not the measure itself. TEMs are indexed by a parameter $t$ and generalize exponential families ($t=1$). Our algorithm, $t$-AdaBoost, recovers AdaBoost as a special case ($t=1$). We show that $t$-AdaBoost retains AdaBoost's celebrated exponential convergence rate when $t\in [0,1)$ while allowing a slight improvement of the rate's hidden constant compared to $t=1$. $t$-AdaBoost partially computes on a generalization of classical arithmetic over the reals and brings notable properties like guaranteed bounded leveraging coefficients for $t\in [0,1)$. From the loss that $t$-AdaBoost minimizes (a generalization of the exponential loss), we show how to derive a new family of *tempered* losses for the induction of domain-partitioning classifiers like decision trees. Crucially, strict properness is ensured for all while their boosting rates span the full known spectrum. Experiments using $t$-AdaBoost+trees show that significant leverage can be achieved by tuning $t$.
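
To make the tempered machinery concrete, here is the tempered exponential together with an AdaBoost-style reweighting built on it. The `exp_t` definition is the standard one from the tempered-calculus literature; the normalization exponent below is an assumption for illustration only, since TEMs normalize a specific power of the measure whose exact exponent we do not reproduce here.

```python
import numpy as np

def exp_t(x, t):
    """Tempered exponential [1 + (1-t)x]_+^{1/(1-t)}; recovers exp(x) at t=1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_reweight(w, margins, alpha, t, power=1.0):
    """AdaBoost-style update with exp_t. `power` is the (assumed) exponent of
    the measure on which normalization is enforced; power=1 with t=1 recovers
    AdaBoost's usual reweighting."""
    w = w * exp_t(-alpha * margins, t)
    z = np.sum(w ** power) ** (1.0 / power)
    return w / z
```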

Achieving $\mathcal{O}(\epsilon^{-1.5})$ Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization
Yifan Yang Peiyao Xiao Kaiyi Ji



Research question: Revisit bilevel optimization, where the upper-level objective is generally nonconvex and the lower-level objective is strongly convex; despite extensive study, achieving a sample complexity of $\mathcal{O}(\epsilon^{-1.5})$ in Hessian/Jacobian-free stochastic bilevel optimization, without any second-order derivative computation, remained open.
Motivation: To fill this gap, a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO is proposed, featuring a simple fully single-loop structure, projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates.
Method: Theoretically, FdeHBO requires $\mathcal{O}(\epsilon^{-1.5})$ iterations (each using $\mathcal{O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point.
Results: To the authors' knowledge, this is the first Hessian/Jacobian-free method achieving an $\mathcal{O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.

In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve a sample complexity of $\mathcal{O}(\epsilon^{-1.5})$ in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires $\mathcal{O}(\epsilon^{-1.5})$ iterations (each using $\mathcal{O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an $\mathcal{O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
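
The finite-difference Hessian/Jacobian-vector approximation at the heart of a Hessian/Jacobian-free scheme takes only a few lines and uses nothing but gradient calls; this is the generic central-difference primitive, not the paper's full projection-aided construction.

```python
import numpy as np

def hvp_fd(grad_fn, x, v, delta=1e-4):
    """Approximate the Hessian-vector product H(x) v from two gradient
    evaluations (central differences); only first-order information is used."""
    return (grad_fn(x + delta * v) - grad_fn(x - delta * v)) / (2.0 * delta)
```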

Distribution Learnability and Robustness
Shai Ben-David Alex Bie Gautam Kamath Tosca Lechner



Research question: The relationship between learnability and robust learnability in distribution learning.
Motivation: Understand how learnability relates to robust learnability under different contamination models.
Method: Theoretical analysis of whether learnability implies robust learnability under additive contamination (and consequently under Huber contamination), and of what happens when the adversary may also perform subtractive contamination.
Results: Contrary to other learning settings (e.g., PAC learning of function classes), realizable learnability does not imply agnostic learnability; related implications for compression schemes and differentially private learnability are also explored.

We examine the relationship between learnability and robust learnability for the problem of distribution learning. We show that learnability implies robust learnability if the adversary can only perform additive contamination (and consequently, under Huber contamination), but not if the adversary is allowed to perform subtractive contamination. Thus, contrary to other learning settings (e.g., PAC learning of function classes), realizable learnability does not imply agnostic learnability. We also explore related implications in the context of compression schemes and differentially private learnability.

An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond
Marc Jourdan Rémy Degenne Emilie Kaufmann



Research question: EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits.
Motivation: This is the first analysis of a Top Two algorithm for approximate best arm identification; EB-TC$\varepsilon$ is an *anytime* sampling rule that can be used without modification for fixed-confidence or fixed-budget identification (without prior knowledge of the budget).
Method: Three types of theoretical guarantees are provided for EB-TC$\varepsilon$: bounds on its expected sample complexity in the fixed-confidence setting, including asymptotic optimality in combination with adaptive tuning of its exploration parameter; upper bounds on its probability of error at any time and for any slack parameter; and, in turn, upper bounds on its simple regret at any time.
Results: Numerical simulations show that EB-TC$\varepsilon$ performs favorably compared to existing algorithms across different approximate best arm identification tasks.

We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any slack parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms for different approximate best arm identification tasks.
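
A Top Two rule of this kind alternates between an empirical-best leader and a challenger. The sketch below picks the challenger via a Gaussian (unit-variance) transportation-cost index with an $\varepsilon$ relaxation; the exact index and tuning in EB-TC$\varepsilon$ differ, so treat this as an assumed simplification.

```python
import numpy as np

def top_two_choose(means, counts, eps=0.1, beta=0.5, rng=None):
    """One sampling decision of a Top Two rule: sample the empirical-best
    leader with probability beta, otherwise the lowest-cost challenger."""
    rng = rng or np.random.default_rng()
    leader = int(np.argmax(means))
    cost = np.full(len(means), np.inf)
    for j in range(len(means)):
        if j != leader:
            gap = means[leader] - means[j] + eps   # eps-relaxed gap
            cost[j] = gap ** 2 / (2.0 * (1.0 / counts[leader] + 1.0 / counts[j]))
    challenger = int(np.argmin(cost))
    return leader if rng.random() < beta else challenger
```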

Fast Optimal Locally Private Mean Estimation via Random Projections
Hilal Asi Vitaly Feldman Jelani Nelson Huy Nguyen Kunal Talwar



Research question: Locally private mean estimation of high-dimensional vectors in the Euclidean ball.
Motivation: Existing algorithms either incur sub-optimal error or have high communication and/or run-time complexity.
Method: A new algorithmic framework, ProjUnit, for private mean estimation yielding algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a $(1+o(1))$-factor.
Results: Experiments on private mean estimation and private federated learning show the algorithms obtain nearly the same utility as optimal algorithms while having significantly lower communication and computational cost.

We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. Existing algorithms for this problem either incur sub-optimal error or have high communication and/or run-time complexity. We propose a new algorithmic framework, namely ProjUnit, for private mean estimation that yields algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a $1+o(1)$-factor. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace and then runs an optimal algorithm such as PrivUnitG in the lower dimensional space. We analyze the error of the algorithm in terms of properties of the random projection ensemble, and study two instantiations. We conduct several experiments for private mean estimation and private federated learning which demonstrate that our algorithms obtain nearly the same utility as optimal algorithms while having significantly lower communication and computational cost.
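
The framework is indeed simple to sketch: project onto a shared random low-dimensional subspace, renormalize, and run a low-dimensional randomizer there. `privunit` below is a placeholder for an optimal unit-vector randomizer such as PrivUnitG, whose internals are not reproduced; the interface is an assumption of this sketch.

```python
import numpy as np

def proj_unit_randomizer(x, k, privunit, seed):
    """ProjUnit-style client: JL-type random projection of the unit vector x
    to R^k, renormalize, then privatize with a low-dimensional randomizer.
    The server regenerates W from the shared seed to lift the result back."""
    rng = np.random.default_rng(seed)          # seed shared with the server
    W = rng.standard_normal((k, x.shape[0])) / np.sqrt(k)
    z = W @ x
    z /= np.linalg.norm(z)
    return privunit(z)
```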

Adaptive Algorithms for Relaxed Pareto Set Identification
Cyrille Kone Emilie Kaufmann Laura Richert



Research question: Revisit fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model.
Motivation: The sample complexity of identifying the exact Pareto set can be very large, motivating relaxations that output some additional near-optimal arms or, alternatively, identify a relevant subset of the Pareto set.
Method: A single sampling strategy, Adaptive Pareto Exploration, used in conjunction with different stopping rules to accommodate different relaxations of the Pareto Set Identification problem, with a sample complexity analysis of the different combinations that in particular quantifies the reduction when at most $k$ Pareto optimal arms are sought.
Results: Adaptive Pareto Exploration shows good practical performance on a real-world scenario of adaptively exploring several Covid-19 vaccination strategies under multiple immunogenicity criteria.

In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that allow instead to identify a relevant \emph{subset} of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most $k$ Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.

Fitting trees to $\ell_1$-hyperbolic distances
Joon-Hyeok Yim Anna Gilbert



Research question: Building trees to represent or fit distances, a key component of phylogenetic analysis, metric embeddings, approximation algorithms, geometric graph neural networks, and hierarchical data analysis.
Motivation: While most prior algorithmic work has focused on generic metric spaces with no a priori constraints, ideas from the analysis of hyperbolic geometry and geometric group theory allow studying tree fitting as the relation between the hyperbolicity (ultrametricity) vector and the error of the best tree (ultrametric) embedding.
Method: Define a vector of hyperbolicity (ultrametricity) values over all triples of points and compare the $\ell_p$ norms of this vector with the $\ell_q$ norm of the distortion of the best tree fit; this formulation allows defining the average hyperbolicity (ultrametricity) via a normalized $\ell_1$ norm.
Results: The algorithm \textsc{HCCRootedTreeFit} has $\ell_1$ embedding error analytically bounded by the $\ell_1$ norm of the hyperbolicity vector (i.e., $p=q=1$), and this bound is tight; it behaves very differently, theoretically and empirically, from Gromov's construction and related algorithms, and applying it shows that supposedly standard hierarchical-data and geometric graph neural network datasets have radically different tree fits from truly tree-like synthetic data, calling for a more refined analysis of these standard datasets.

Building trees to represent or to fit distances is a critical component of phylogenetic analysis, metric embeddings, approximation algorithms, geometric graph neural nets, and the analysis of hierarchical data. Much of the previous algorithmic work, however, has focused on generic metric spaces (i.e., those with no \emph{a priori} constraints). Leveraging several ideas from the mathematical analysis of hyperbolic geometry and geometric group theory, we study the tree fitting problem as finding the relation between the hyperbolicity (ultrametricity) vector and the error of tree (ultrametric) embedding. That is, we define a vector of hyperbolicity (ultrametric) values over all triples of points and compare the $\ell_p$ norms of this vector with the $\ell_q$ norm of the distortion of the best tree fit to the distances. This formulation allows us to define the average hyperbolicity (ultrametricity) in terms of a normalized $\ell_1$ norm of the hyperbolicity vector. Furthermore, we can interpret the classical tree fitting result of Gromov as a $p = q = \infty$ result. We present an algorithm \textsc{HCCRootedTreeFit} such that the $\ell_1$ error of the output embedding is analytically bounded in terms of the $\ell_1$-norm of the hyperbolicity vector (i.e., $p = q = 1$) and that this result is tight. Furthermore, this algorithm has significantly different theoretical and empirical performance as compared to Gromov's result and related algorithms. Finally, we show using \textsc{HCCRootedTreeFit} and related tree fitting algorithms, that supposedly standard data sets for hierarchical data analysis and geometric graph neural networks have radically different tree fits than those of synthetic, truly tree-like data sets, suggesting that a much more refined analysis of these standard data sets is called for.

Convergence of Actor-Critic with Multi-Layer Neural Networks
Haoxing Tian Alex Olshevsky Ioannis Paschalidis



Research question: Convergence of actor-critic methods when deep neural networks are used to approximate the policy and value functions.
Motivation: Early theory established convergence with linear function approximators, and recent work with single-hidden-layer neural networks; the natural next step is deep networks with an arbitrary number of hidden layers, closing a gap between theory and practice.
Method: Actor-critic updates projected on a ball around the initial condition are shown to converge to a neighborhood where the average of the squared gradients is $\tilde{O}(1/\sqrt{m}) + O(\epsilon)$, with $m$ the width of the neural network and $\epsilon$ the approximation quality of the best critic network over the projected set.
Results: Convergence of actor-critic is thereby established for deep neural networks regardless of the number of hidden layers.

The early theory of actor-critic methods considered convergence using linear function approximators for the policy and value functions. Recent work has established convergence using neural network approximators with a single hidden layer. In this work we are taking the natural next step and establish convergence using deep neural networks with an arbitrary number of hidden layers, thus closing a gap between theory and practice. We show that actor-critic updates projected on a ball around the initial condition will converge to a neighborhood where the average of the squared gradients is $\tilde{O} \left( 1/\sqrt{m} \right) + O \left( \epsilon \right)$, with $m$ being the width of the neural network and $\epsilon$ the approximation quality of the best critic neural network over the projected set.

SLM: A Smoothed First-Order Lagrangian Method for Structured Constrained Nonconvex Optimization
Songtao Lu



Research question: Solve a class of nonconvex functional constrained optimization (FCO) problems in which both the objective and the constraints involve nonconvex functions.
Motivation: The rapid increase in neural network applications in recent years means that objectives and constraints often involve nonconvex functions, which poses significant challenges for obtaining high-quality solutions.
Method: Leveraging the primal-dual optimization framework, a smoothed first-order Lagrangian method (SLM) is proposed for this class of problems.
Results: Theoretical convergence guarantees of SLM to Karush-Kuhn-Tucker (KKT) solutions are established by quantifying dual error bounds; by connecting this structured FCO to equilibrium-constrained nonconvex problems (also known as bilevel optimization), SLM is applied to bilevel problems with a nonconvex lower level, and numerical results on toy examples and hyper-data cleaning problems show that SLM outperforms benchmark methods.

Functional constrained optimization (FCO) has emerged as a powerful tool for solving various machine learning problems. However, with the rapid increase in applications of neural networks in recent years, it has become apparent that both the objective and constraints often involve nonconvex functions, which poses significant challenges in obtaining high-quality solutions. In this work, we focus on a class of nonconvex FCO problems with nonconvex constraints, where the two optimization variables are nonlinearly coupled in the inequality constraint. Leveraging the primal-dual optimization framework, we propose a smoothed first-order Lagrangian method (SLM) for solving this class of problems. We establish the theoretical convergence guarantees of SLM to the Karush-Kuhn-Tucker (KKT) solutions through quantifying dual error bounds. By establishing connections between this structured FCO and equilibrium-constrained nonconvex problems (also known as bilevel optimization), we apply the proposed SLM to tackle bilevel optimization oriented problems where the lower-level problem is nonconvex. Numerical results obtained from both toy examples and hyper-data cleaning problems demonstrate the superiority of SLM compared to benchmark methods.

Minimum norm interpolation by perceptra: Explicit regularization and implicit bias
Jiyoung Park Ian Pelakh Stephan Wojtowytsch



Research question: How shallow ReLU networks interpolate between known regions.
Motivation: As the number of data points and parameters tends to infinity, empirical risk minimizers converge to a minimum norm interpolant when a weight decay regularizer is penalized with a coefficient that vanishes at a precise rate as the network width and the number of data points grow.
Method: Numerical study, with and without explicit regularization, of the implicit bias of common optimization algorithms towards known minimum norm interpolants.
Results: Empirically, the minimizers converge towards minimum norm interpolants both with and without explicit regularization.

We investigate how shallow ReLU networks interpolate between known regions. Our analysis shows that empirical risk minimizers converge to a minimum norm interpolant as the number of data points and parameters tends to infinity when a weight decay regularizer is penalized with a coefficient which vanishes at a precise rate as the network width and the number of data points grow. With and without explicit regularization, we numerically study the implicit bias of common optimization algorithms towards known minimum norm interpolants.

Unified Lower Bounds for Interactive High-dimensional Estimation under Information Constraints
Jayadev Acharya Clement Louis Canonne Ziteng Sun Himanshu Tyagi



Research question: Distributed parameter estimation using interactive protocols under local information constraints such as bandwidth limitations, local differential privacy, and restricted measurements.
Motivation: A unified framework is needed for deriving tight minimax lower bounds for different parametric families of distributions, both continuous and discrete, under any $\ell_p$ loss.
Method: The framework is versatile and yields "plug-and-play" bounds widely applicable to a large range of estimation problems; for the prototypical case of the Gaussian family, it circumvents limitations of previous techniques.
Results: The approach recovers bounds obtained using data processing inequalities and Cramér-Rao bounds, the two alternative approaches for proving lower bounds in this setting; moreover, for the families considered, matching upper bounds are provided.

We consider distributed parameter estimation using interactive protocols subject to local information constraints such as bandwidth limitations, local differential privacy, and restricted measurements. We provide a unified framework enabling us to derive a variety of (tight) minimax lower bounds for different parametric families of distributions, both continuous and discrete, under any $\ell_p$ loss. Our lower bound framework is versatile and yields “plug-and-play” bounds that are widely applicable to a large range of estimation problems, and, for the prototypical case of the Gaussian family, circumvents limitations of previous techniques. In particular, our approach recovers bounds obtained using data processing inequalities and Cramér–Rao bounds, two other alternative approaches for proving lower bounds in our setting of interest. Further, for the families considered, we complement our lower bounds with matching upper bounds.

On the Robustness of Mechanism Design under Total Variation Distance
Anuran Makur Marios Mertzanidis Alexandros Psomas Athina Terzoglou



Research question: Designing mechanisms when agents' valuation functions are drawn from unknown and correlated prior distributions.
Motivation: Given a prior distribution $D$, design a (truthful) mechanism that has good performance for all "true distributions" that are close to $D$ in Total Variation (TV) distance.
Method: DSIC and BIC mechanisms in this setting are shown to be strongly robust with respect to TV distance for any bounded objective function $\mathcal{O}$, extending a recent result of Brustle et al. ([BCD20], EC 2020); at the heart of the result is a fundamental duality property of total variation distance.
Results: (i) approximately revenue-optimal and approximately BIC mechanisms for weakly dependent prior distributions; (ii) correlation-robust mechanisms when only "noisy" versions of marginals are accessible, extending results of Bei et al. ([BGLT19], SODA 2019); (iii) prophet-inequality type guarantees preserved for correlated priors, recovering a variant of a result of Dütting and Kesselheim ([DK19], EC 2019) as a special case; (iv) a new necessary condition for a correlated distribution to witness an infinite revenue separation between simple and optimal mechanisms, complementing Psomas et al. ([PSCW22], NeurIPS 2022); (v) a new condition under which simple mechanisms approximate revenue-optimal mechanisms for a single agent whose type is drawn from a correlated distribution captured by a Markov Random Field, complementing Cai and Oikonomou ([CO21], EC 2021).

We study the problem of designing mechanisms when agents' valuation functions are drawn from unknown and correlated prior distributions. In particular, we are given a prior distribution $D$, and we are interested in designing a (truthful) mechanism that has good performance for all "true distributions" that are close to $D$ in Total Variation (TV) distance. We show that DSIC and BIC mechanisms in this setting are strongly robust with respect to TV distance, for any bounded objective function $\mathcal{O}$, extending a recent result of Brustle et al. ([BCD20], EC 2020). At the heart of our result is a fundamental duality property of total variation distance. As direct applications of our result, we (i) demonstrate how to find approximately revenue-optimal and approximately BIC mechanisms for weakly dependent prior distributions; (ii) show how to find correlation-robust mechanisms when only ``noisy'' versions of marginals are accessible, extending recent results of Bei et al. ([BGLT19], SODA 2019); (iii) prove that prophet-inequality type guarantees are preserved for correlated priors, recovering a variant of a result of Dütting and Kesselheim ([DK19], EC 2019) as a special case; (iv) give a new necessary condition for a correlated distribution to witness an infinite separation in revenue between simple and optimal mechanisms, complementing recent results of Psomas et al. ([PSCW22], NeurIPS 2022); (v) give a new condition for simple mechanisms to approximate revenue-optimal mechanisms for the case of a single agent whose type is drawn from a correlated distribution that can be captured by a Markov Random Field, complementing recent results of Cai and Oikonomou ([CO21], EC 2021).

Complexity of Derivative-Free Policy Optimization for Structured $\mathcal{H}_\infty$ Control
Xingang Guo Darioush Keivan Geir Dullerud Peter Seiler Bin Hu



Research question: The complexity of direct policy search applied to robust control tasks in reinforcement learning and continuous control.
Motivation: Optimal $H_\infty$ synthesis under structural constraints leads to a constrained nonconvex nonsmooth problem, typically addressed with subgradient-based policy search built on the Goldstein subdifferential or other enlarged subdifferential notions, so its complexity is of both theoretical and practical interest.
Method: Study the complexity of finding $(\delta,\epsilon)$-stationary points for such nonsmooth robust control design tasks using policy optimization methods that can only access a zeroth-order oracle (the $H_\infty$ norm of the closed-loop system), yielding novel theoretical results.
Results: The analysis provides the first sample complexity of model-free, trajectory-based, zeroth-order policy optimization for finding $(\delta,\epsilon)$-stationary points of structured $H_\infty$ control, with numerical results supporting the theory.

The applications of direct policy search in reinforcement learning and continuous control have received increasing attention. In this work, we present novel theoretical results on the complexity of derivative-free policy optimization on an important class of robust control tasks, namely the structured $H_\infty$ synthesis with static output feedback. Optimal $H_\infty$ synthesis under structural constraints leads to a constrained nonconvex nonsmooth problem and is typically addressed using subgradient-based policy search techniques that are built upon the concept of Goldstein subdifferential or other notions of enlarged subdifferential. In this paper, we study the complexity of finding $(\delta,\epsilon)$-stationary points for such nonsmooth robust control design tasks using policy optimization methods which can only access the zeroth-order oracle (i.e. the $H_\infty$ norm of the closed-loop system). First, we study the exact oracle setting and identify the coerciveness of the cost function to prove high-probability feasibility/complexity bounds for derivative-free policy optimization on this problem. Next, we derive a sample complexity result for the multi-input multi-output (MIMO) $H_\infty$-norm estimation. We combine this with our analysis to obtain the first sample complexity of model-free, trajectory-based, zeroth-order policy optimization on finding $(\delta,\epsilon)$-stationary points for structured $H_\infty$ control. Numerical results are also provided to demonstrate our theory.

Asynchronous Proportional Response Dynamics: Convergence in Markets with Adversarial Scheduling
Yoav Kolumbus Menahem Levy Noam Nisan



Research question: Proportional Response Dynamics (PRD) in linear Fisher markets where participants act asynchronously.
Motivation: In the model, at each step an adversary selects a subset of players to update their bids, subject to liveness constraints; the question is how the market dynamics evolve if every bidder applies the PRD update rule whenever selected.
Method: Model the scenario as a sequential process and show that if every bidder individually applies the PRD update rule whenever selected by the adversary, then in the generic case the entire dynamic converges to a competitive equilibrium of the market.
Results: The proof technique reveals additional properties of linear Fisher markets, such as the uniqueness of the market equilibrium for generic parameters and the convergence of associated no-swap-regret dynamics and best response dynamics under certain conditions.

We study Proportional Response Dynamics (PRD) in linear Fisher markets, where participants act asynchronously. We model this scenario as a sequential process in which at each step, an adversary selects a subset of the players to update their bids, subject to liveness constraints. We show that if every bidder individually applies the PRD update rule whenever they are included in the group of bidders selected by the adversary, then, in the generic case, the entire dynamic converges to a competitive equilibrium of the market. Our proof technique reveals additional properties of linear Fisher markets, such as the uniqueness of the market equilibrium for generic parameters and the convergence of associated no swap regret dynamics and best response dynamics under certain conditions.
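
The PRD update itself is one line per player: split your budget across goods in proportion to the utility received from each good at current prices. The asynchronous twist is that only the adversary's chosen subset updates in a given step. A sketch under the assumption that every good carries a positive total bid:

```python
import numpy as np

def prd_step(b, u, budgets, active):
    """One asynchronous PRD step in a linear Fisher market.
    b[i, j]: bid of player i on good j; u[i, j]: valuation.
    Only players in `active` (the adversary's selection) update."""
    p = b.sum(axis=0)            # prices (assumed positive for every good)
    x = b / p                    # proportional allocation
    b_next = b.copy()
    for i in active:
        util = u[i] * x[i]       # utility player i derives from each good
        b_next[i] = budgets[i] * util / util.sum()
    return b_next
```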

Learning via Wasserstein-Based High Probability Generalisation Bounds
Paul Viallard Maxime Haddouche Umut Simsekli Benjamin Guedj



Research question: How to improve generalization gap upper bounds in structural risk minimization (SRM), addressing the issue that the KL divergence term in PAC-Bayesian bounds may fail to capture the geometric structure underlying the learning problem.
Motivation: To overcome this limitation of the PAC-Bayesian framework, replace the KL divergence with the Wasserstein distance to obtain more stable and practical generalization gap bounds.
Method: New Wasserstein distance-based PAC-Bayesian generalization bounds for both batch learning and online learning, together with optimizable training objectives.
Results: The new bounds hold with high probability, apply to unbounded losses, and lead to optimizable training objectives; their empirical advantage is illustrated on a variety of experiments.

Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM) -- this is in particular at the core of PAC-Bayesian learning. Despite its successes and unfailing surge of interest in recent years, a limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem -- hence restricting its use in practical applications. As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. data. Contrary to previous art, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM. As a result we derive novel Wasserstein-based PAC-Bayesian learning algorithms and we illustrate their empirical advantage on a variety of experiments.

Privacy Amplification via Compression: Achieving the Optimal Privacy-Accuracy-Communication Trade-off in Distributed Mean Estimation
Wei-Ning Chen Dan Song Ayfer Ozgur Peter Kairouz



Research question: Achieving the optimal accuracy of federated learning and analytics under joint communication and $(\varepsilon, \delta)$-differential privacy (DP) constraints.
Motivation: Privacy and communication constraints are the two major bottlenecks in federated learning and analytics; studying the optimal accuracy of mean and frequency estimation optimizes performance on both fronts.
Method: Consider both the central and multi-message shuffled DP models; by compressing the data and randomly selecting the part contributed by each client, achieve the optimal trade-off among communication, privacy, and accuracy.
Results: The approach yields significant savings in practically relevant regimes and establishes the optimal privacy-communication-accuracy trade-offs for federated learning and analytics.

Privacy and communication constraints are two major bottlenecks in federated learning (FL) and analytics (FA). We study the optimal accuracy of mean and frequency estimation (canonical models for FL and FA respectively) under joint communication and $(\varepsilon, \delta)$-differential privacy (DP) constraints. We consider both the central and the multi-message shuffled DP models. We show that in order to achieve the optimal $\ell_2$ error under $(\varepsilon, \delta)$-DP, it is sufficient for each client to send $\Theta\left( n \min\left(\varepsilon, \varepsilon^2\right)\right)$ bits for FL and $\Theta\left(\log\left( n\min\left(\varepsilon, \varepsilon^2\right) \right)\right)$ bits for FA to the server, where $n$ is the number of participating clients. Without compression, each client needs $O(d)$ bits and $O\left(\log d\right)$ bits for the mean and frequency estimation problems respectively (where $d$ corresponds to the number of trainable parameters in FL or the domain size in FA), meaning that we can get significant savings in the regime $ n \min\left(\varepsilon, \varepsilon^2\right) = o(d)$, which is often the relevant regime in practice. We propose two different ways to leverage compression for privacy amplification and achieve the optimal privacy-communication-accuracy trade-offs. In both cases, each client communicates only partial information about its sample and we show that privacy is amplified by randomly selecting the part contributed by each client. In the first method, the random selection is revealed to the server, which results in a central DP guarantee with optimal privacy-communication-accuracy trade-offs. In the second method, the random data parts from the clients are shuffled by a secure shuffler resulting in a multi-message shuffling scheme with the same optimal trade-offs. As a result, we establish the optimal three-way trade-offs between privacy, communication, and accuracy for both the central DP and multi-message shuffling frameworks.

Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance
Lisha Chen Heshan Devaka Fernando Yiming Ying Tianyi Chen



Research question: How to analyze dynamic weighting algorithms for multi-objective learning (MOL), which arises when multiple learning criteria or tasks need to be addressed.
Motivation: Although dynamic weighting methods such as MGDA are theoretically appealing, empirical studies show they do not always outperform static alternatives; this work bridges the gap between theory and practice via a new stochastic MGDA variant, the Multi-objective gradient with Double sampling (MoDo) algorithm.
Method: Study MoDo's generalization performance and its interplay with optimization through the lens of algorithm stability; the rationale behind MGDA, updating along a conflict-avoidant direction, may impede dynamic weighting algorithms from achieving the optimal $O(1/\sqrt{n})$ population risk, where $n$ is the number of training samples.
Results: The variability of the dynamic weights and its impact reveal a three-way trade-off among optimization, generalization, and conflict avoidance that is unique to MOL.

Multi-objective learning (MOL) often arises in emerging machine learning problems when multiple learning criteria or tasks need to be addressed. Recent works have developed various _dynamic weighting_ algorithms for MOL, including MGDA and its variants, whose central idea is to find an update direction that _avoids conflicts_ among objectives. Despite its appealing intuition, empirical studies show that dynamic weighting methods may not always outperform static alternatives. To bridge this gap between theory and practice, we focus on a new variant of stochastic MGDA - the Multi-objective gradient with Double sampling (MoDo) algorithm - and study its generalization performance and the interplay with optimization through the lens of algorithm stability. We find that the rationale behind MGDA -- updating along a conflict-avoidant direction -- may \emph{impede} dynamic weighting algorithms from achieving the optimal ${\cal O}(1/\sqrt{n})$ population risk, where $n$ is the number of training samples. We further highlight the variability of dynamic weights and their impact on the three-way trade-off among optimization, generalization, and conflict avoidance that is unique in MOL. Code is available at https://github.com/heshandevaka/Trade-Off-MOL.
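
A sketch of the double-sampling idea: two independent stochastic gradient matrices debias the weight update, and the weights live on the probability simplex. The step sizes, the placement of the third gradient draw, and the projection routine are assumptions of this sketch rather than the paper's precise algorithm.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1.0))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def modo_step(x, lam, grad_fn, lr_x=0.01, lr_lam=0.1):
    """grad_fn(x) draws a fresh stochastic d x M matrix of per-objective
    gradients; two independent draws debias the weight update."""
    G1, G2 = grad_fn(x), grad_fn(x)
    lam = proj_simplex(lam - lr_lam * (G1.T @ (G2 @ lam)))
    x = x - lr_x * (grad_fn(x) @ lam)    # move along the weighted direction
    return x, lam
```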

How Does Adaptive Optimization Impact Local Neural Network Geometry?
Kaiqi Jiang Dhruv Malik Yuanzhi Li



Research question: The role of optimization methods in neural network training, in particular the advantage of adaptive methods such as Adam over vanilla gradient descent.
Motivation: The traditional optimization viewpoint holds that adaptive algorithms improve performance by mimicking the behavior of second-order methods, but the authors argue this view is insufficient in the context of neural network optimization.
Method: A local trajectory analysis introducing $R^{\text{OPT}}_{\text{med}}$, a statistic analogous to the condition number of the loss Hessian at the iterates; extensive experiments on language models show that adaptive methods such as Adam bias trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, whereas SGD (with momentum) biases trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large.
Results: A theoretical result provably demonstrates this phenomenon in the simplified setting of a two-layer linear network; the findings are evidence that a new explanation of the success of adaptive methods is needed, one different from conventional wisdom.

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the *global* geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a *local* trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}\_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments on language models where adaptive algorithms converge faster than vanilla gradient methods like SGD, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster optimization. By contrast, SGD (with momentum) biases the trajectories towards regions where $R^{\text{SGD}}\_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence for the need of a new explanation of the success of adaptive methods, one that is different than the conventional wisdom.

Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression
Ilias Diakonikolas Daniel Kane Ankit Pensia Thanasis Pittas



Research question: Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination.
Motivation: Although algorithms exist for both problems, their sample and time complexity is unsatisfactory.
Method: New algorithms based on multi-directional filtering enable fast processing for both Gaussian mean estimation and linear regression.
Results: The new algorithms achieve near-optimal sample complexity and almost linear runtime while preserving optimal error guarantees.

We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both these problems. Specifically, for Gaussian robust mean estimation on $\mathbb R^d$ with contamination parameter $\epsilon \in (0, \epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\epsilon)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.
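
The basic filtering primitive that multi-directional filtering accelerates can be sketched as follows: repeatedly find the direction of largest empirical variance and prune the most outlying points along it. This is the textbook one-directional filter under an assumed identity-covariance inlier model, not the paper's almost-linear-time algorithm.

```python
import numpy as np

def filtered_mean(X, eps, max_rounds=20):
    """One-directional filtering for robust mean estimation (sketch)."""
    X = np.array(X, dtype=float)
    for _ in range(max_rounds):
        mu = X.mean(axis=0)
        w, V = np.linalg.eigh(np.cov(X, rowvar=False))
        if w[-1] <= 1.0 + 10.0 * eps:        # spectrum looks clean: stop
            break
        scores = ((X - mu) @ V[:, -1]) ** 2  # outlyingness along top direction
        X = X[scores <= np.quantile(scores, 1.0 - eps)]
    return X.mean(axis=0)
```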

Streaming Algorithms and Lower Bounds for Estimating Correlation Clustering Cost
Sepehr Assadi Vihan Shah Chen Wang



Research question: Correlation clustering, a fundamental optimization problem at the intersection of machine learning and theoretical computer science.
Motivation: Applications to big data processing have produced a flurry of results on this problem in the streaming model in recent years.
Method: Study streaming algorithms whose memory requirement is far below the input size: all previous work focused on semi-streaming algorithms with $\Omega(n)$ memory, whereas this work studies streaming algorithms using only $\text{polylog}(n)$ bits of memory.
Results: Two new algorithms estimate the optimal correlation clustering cost in only $\text{polylog}(n)$ space, up to a constant multiplicative factor plus an additive error; one outputs a $3$-multiplicative approximation plus $o(n^2)$ additive approximation, and the other further reduces the additive error at the cost of a large constant multiplicative factor.

Correlation clustering is a fundamental optimization problem at the intersection of machine learning and theoretical computer science. Motivated by applications to big data processing, recent years have witnessed a flurry of results on this problem in the streaming model. In this model, the algorithm needs to process the input $n$-vertex graph by making one or few passes over the stream of its edges and using a limited memory, much smaller than the input size. All previous work on streaming correlation clustering has focused on semi-streaming algorithms with $\Omega(n)$ memory, whereas in this work, we study streaming algorithms with much smaller memory requirement of only $\text{polylog}{(n)}$ bits. This stringent memory requirement is in the same spirit as classical streaming algorithms that, instead of recovering a full solution to the problem---which can be prohibitively large with such small memory, as is the case in our problem---aim to learn certain statistical properties of their inputs. In our case, this translates to determining the ``(correlation) clusterability'' of input graphs, or more precisely, estimating the cost of the optimal correlation clustering solution. As our main result, we present two novel algorithms that in only $\text{polylog}{(n)}$ space are able to estimate the optimal correlation clustering cost up to some constant multiplicative factor plus some extra additive error. One of the algorithms outputs a $3$-multiplicative approximation plus $o(n^2)$ additive approximation, and the other one improves the additive error further down at the cost of increasing the multiplicative factor to some large constant. We then present new lower bounds that justify this mix of both multiplicative and additive error approximation in our algorithms.

Replicable Reinforcement Learning
ERIC EATON Marcel Hussing Michael Kearns Jessica Sorrell



Research question: Initiate the study of replicable reinforcement learning, requiring algorithms that produce identical outputs (with high probability) when run on two different samples from the same underlying distribution.
Motivation: The replicability crisis in the social, behavioral, and data sciences has led to algorithmic frameworks for replicability; provably replicable algorithms exist for fundamental tasks such as statistical query learning, the heavy hitters problem, and distribution testing, but not for control problems.
Method: A provably replicable algorithm for parallel value iteration, and a provably replicable version of R-Max in the episodic setting.
Results: These are the first formal replicability results for control problems, which present different challenges for replication than batch learning settings.

The replicability crisis in the social, behavioral, and data sciences has led to the formulation of algorithm frameworks for replicability --- i.e., a requirement that an algorithm produce identical outputs (with high probability) when run on two different samples from the same underlying distribution. While still in its infancy, provably replicable algorithms have been developed for many fundamental tasks in machine learning and statistics, including statistical query learning, the heavy hitters problem, and distribution testing. In this work we initiate the study of replicable reinforcement learning, providing a provably replicable algorithm for parallel value iteration, and a provably replicable version of R-Max in the episodic setting. These are the first formal replicability results for control problems, which present different challenges for replication than batch learning settings.

Stability of Random Forests and Coverage of Random-Forest Prediction Intervals
Yan Wang Huaiqing Wu Dan Nettleton



Research question: The stability of random forests and the coverage of random-forest prediction intervals.
Motivation: Although random forests perform well in many practical applications, their stability and the coverage of their prediction intervals have not been fully studied.
Method: Theoretical analysis and empirical study of the stability of random forests and of the coverage of prediction intervals constructed from them.
Results: Random forests are stable under mild conditions and their prediction intervals achieve the expected coverage, so random forests provide not only satisfactory point prediction but also justified interval prediction at almost no extra computational cost.

We establish stability of random forests under the mild condition that the squared response ($Y^2$) does not have a heavy tail. In particular, our analysis holds for the practical version of random forests that is implemented in popular packages like \texttt{randomForest} in \texttt{R}. Empirical results show that stability may persist even beyond our assumption and hold for heavy-tailed $Y^2$. Using the stability property, we prove a non-asymptotic lower bound for the coverage probability of prediction intervals constructed from the out-of-bag error of random forests. With another mild condition that is typically satisfied when $Y$ is continuous, we also establish a complementary upper bound, which can be similarly established for the jackknife prediction interval constructed from an arbitrary stable algorithm. We also discuss the asymptotic coverage probability under assumptions weaker than those considered in previous literature. Our work implies that random forests, with its stability property, is an effective machine learning method that can provide not only satisfactory point prediction but also justified interval prediction at almost no extra computational cost.
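
The interval construction whose coverage is being analyzed is easy to reproduce: shift the point prediction by quantiles of the out-of-bag residuals. A sketch using scikit-learn as a stand-in for the `randomForest` R package:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def oob_interval(X, y, x_new, alpha=0.1, seed=0):
    """Prediction interval from out-of-bag residuals of a random forest."""
    rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                               random_state=seed).fit(X, y)
    resid = y - rf.oob_prediction_                 # out-of-bag errors
    lo, hi = np.quantile(resid, [alpha / 2, 1 - alpha / 2])
    pred = rf.predict(np.asarray(x_new).reshape(1, -1))[0]
    return pred + lo, pred + hi
```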

On Single-Index Models beyond Gaussian Data
Aaron Zweig Loucas Pillaud-Vivien Joan Bruna



Research question: The behavior of gradient-descent methods with shallow neural networks on sparse high-dimensional functions, in particular single-index models $f(x) = \phi(x \cdot \theta^*)$, beyond Gaussian data.
Motivation: Recent work on Gaussian data shows that the information exponent of the link function controls the required sample complexity, but these tools exploit the stability and spherical symmetry of Gaussian distributions, both of which may be violated in practice.
Method: Focus on the planted setting where $\phi$ is known and study Stochastic Gradient Descent under mild assumptions on the data distribution that significantly extend [Yehudai and Shamir, 20].
Results: SGD recovers the unknown direction $\theta^*$ with constant probability in the high-dimensional regime.

Sparse high-dimensional functions have arisen as a rich framework to study the behavior of gradient-descent methods using shallow neural networks, and showcasing its ability to perform feature learning beyond linear models. Amongst those functions, the simplest are single-index models $f(x) = \phi( x \cdot \theta^*)$, where the labels are generated by an arbitrary non-linear link function $\phi$ of an unknown one-dimensional projection $\theta^*$ of the input data. By focusing on Gaussian data, several recent works have built a remarkable picture, where the so-called information exponent (related to the regularity of the link function) controls the required sample complexity. In essence, these tools exploit the stability and spherical symmetry of Gaussian distributions. In this work, we explore extensions of this picture beyond the Gaussian setting, where both stability or symmetry might be violated. Focusing on the planted setting where $\phi$ is known, our main results establish that Stochastic Gradient Descent recovers the unknown direction $\theta^*$ with constant probability in the high-dimensional regime, under mild assumptions that significantly extend [Yehudai and Shamir, 20].

Comparing Apples to Oranges: Learning Similarity Functions for Data Produced by Different Distributions
Leonidas Tsepenekas Ivan Brugere Freddy Lecue Daniele Magazzeni



Research question: How to efficiently learn similarity functions across demographic groups.
Motivation: When the elements being compared are produced by different distributions, or in other words belong to different "demographic" groups, knowledge of their true similarity can be very difficult to obtain.
Method: An efficient sampling framework that learns these across-groups similarity functions using only a limited amount of expert feedback.
Results: Analytical results with rigorous theoretical bounds, and empirical validation of the algorithms via a large suite of experiments.

Similarity functions measure how comparable pairs of elements are, and play a key role in a wide variety of applications, e.g., notions of Individual Fairness abiding by the seminal paradigm of Dwork et al., as well as Clustering problems. However, access to an accurate similarity function should not always be considered guaranteed, and this point was even raised by Dwork et al. For instance, it is reasonable to assume that when the elements to be compared are produced by different distributions, or in other words belong to different ``demographic'' groups, knowledge of their true similarity might be very difficult to obtain. In this work, we present an efficient sampling framework that learns these across-groups similarity functions, using only a limited amount of experts' feedback. We show analytical results with rigorous theoretical bounds, and empirically validate our algorithms via a large suite of experiments.

Debiasing Conditional Stochastic Optimization
Lie He Shiva Kasiviswanathan



Research question: The conditional stochastic optimization (CSO) problem, which covers a variety of applications including portfolio selection, reinforcement learning, robust learning, and causal inference.
Motivation: Due to the nested structure of the CSO objective, its sample-averaged gradient is biased, so convergence requires a high sample complexity.
Method: A general stochastic extrapolation technique that effectively reduces the bias; for nonconvex smooth objectives, combining this extrapolation with variance reduction techniques achieves a significantly better sample complexity than existing bounds.
Results: New algorithms for the finite-sum variant of the CSO problem also significantly improve upon existing results; the debiasing technique has the potential to be a useful tool for other stochastic optimization problems.

In this paper, we study the conditional stochastic optimization (CSO) problem which covers a variety of applications including portfolio selection, reinforcement learning, robust learning, causal inference, etc. The sample-averaged gradient of the CSO objective is biased due to its nested structure, and therefore requires a high sample complexity for convergence. We introduce a general stochastic extrapolation technique that effectively reduces the bias. We show that for nonconvex smooth objectives, combining this extrapolation with variance reduction techniques can achieve a significantly better sample complexity than the existing bounds. Additionally, we develop new algorithms for the finite-sum variant of the CSO problem that also significantly improve upon existing results. Finally, we believe that our debiasing technique has the potential to be a useful tool for addressing similar challenges in other stochastic optimization problems.

A Variational Perspective on High-Resolution ODEs
Hoomaan Maskan Konstantinos C. Zygalakis Alp Yurtsever



Research question: Unconstrained minimization of smooth convex functions from a new variational perspective.
Motivation: Using the forced Euler-Lagrange equation enables the study of high-resolution ODEs and thereby faster convergence for gradient norm minimization.
Method: A novel variational perspective based on the forced Euler-Lagrange equation yields high-resolution ODEs and a faster convergence rate for gradient norm minimization using Nesterov's accelerated gradient method; Nesterov's method is also interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE.
Results: Building on the new variational perspective, a stochastic method for noisy gradients is proposed, with several numerical experiments comparing the stochastic algorithm against state-of-the-art methods.

We consider unconstrained minimization of smooth convex functions. We propose a novel variational perspective using forced Euler-Lagrange equation that allows for studying high-resolution ODEs. Through this, we obtain a faster convergence rate for gradient norm minimization using Nesterov's accelerated gradient method. Additionally, we show that Nesterov's method can be interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE. Finally, using the results from the new variational perspective, we propose a stochastic method for noisy gradients. Several numerical experiments illustrate our stochastic algorithm and compare it with state-of-the-art methods.
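
For reference, the discretization being reinterpreted is Nesterov's accelerated gradient method, shown here in its standard form for an $L$-smooth convex objective; the momentum schedule $(k-1)/(k+2)$ is the classical choice, not something introduced by the paper.

```python
def nesterov_agd(grad, x0, L, iters=100):
    """Nesterov's accelerated gradient method for an L-smooth convex f."""
    x = y = x0
    for k in range(1, iters + 1):
        x_next = y - grad(y) / L                       # gradient step at y
        y = x_next + (k - 1) / (k + 2) * (x_next - x)  # momentum extrapolation
        x = x_next
    return x
```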

Algorithmic Regularization in Tensor Optimization: Towards a Lifted Approach in Matrix Sensing
Ziye Ma Javad Lavaei Somayeh Sojoudi



Research question: The role of gradient descent (GD) in inducing implicit regularization in tensor optimization, particularly within the lifted matrix sensing framework, towards global optimality.
Motivation: The recently proposed lifted matrix sensing framework addresses the non-convex matrix sensing problem by transforming spurious solutions into strict saddles when optimizing over symmetric rank-1 tensors, and applying GD to this lifted problem can produce approximate rank-1 tensors and critical points with escape directions.
Method: Apply GD to the lifted matrix sensing framework; with a sufficiently small initialization scale, GD yields approximate rank-1 tensors and critical points with escape directions.
Results: The findings underscore the significance of the tensor parametrization of matrix sensing, in combination with first-order methods, for achieving global optimality in such problems.

Gradient descent (GD) is crucial for generalization in machine learning models, as it induces implicit regularization, promoting compact representations. In this work, we examine the role of GD in inducing implicit regularization for tensor optimization, particularly within the context of the lifted matrix sensing framework. This framework has been recently proposed to address the non-convex matrix sensing problem by transforming spurious solutions into strict saddles when optimizing over symmetric, rank-1 tensors. We show that, with sufficiently small initialization scale, GD applied to this lifted problem results in approximate rank-1 tensors and critical points with escape directions. Our findings underscore the significance of the tensor parametrization of matrix sensing, in combination with first-order methods, in achieving global optimality in such problems.

On Computing Pairwise Statistics with Local Differential Privacy
Badih Ghazi Pritish Kamath Ravi Kumar Pasin Manurangsi Adam Sealfon



Research question: Computing pairwise statistics with differential privacy in the local model.
Motivation: Protecting the privacy of user data requires differential privacy guarantees when computing statistics over it.
Method: Several novel and generic algorithms, leveraging techniques from differentially private algorithms for linear queries.
Results: The algorithms efficiently compute important statistics such as Kendall's $\tau$ coefficient, Area Under Curve, Gini's mean difference, and Gini's entropy.

We study the problem of computing pairwise statistics, i.e., ones of the form $\binom{n}{2}^{-1} \sum_{i \ne j} f(x_i, x_j)$, where $x_i$ denotes the input to the $i$th user, with differential privacy (DP) in the local model. This formulation captures important metrics such as Kendall's $\tau$ coefficient, Area Under Curve, Gini's mean difference, Gini's entropy, etc. We give several novel and generic algorithms for the problem, leveraging techniques from DP algorithms for linear queries.

Combinatorial Group Testing with Selfish Agents
Giorgos Chionas Dariusz Rafal Kowalski Piotr Krysta



Research question: The Combinatorial Group Testing (CGT) problem in a novel game-theoretic framework, with Adversarial Equilibrium (AE) as the solution concept.
Motivation: Classical CGT does not account for selfish agents whose goal is to have their presence confirmed as early as possible; designing effective algorithmic strategies when the number of such agents is small and possibly unknown remained an open problem.
Method: A new game with $n$ selfish agents and a hidden set $K$ of active agents: in each round, each active agent decides whether to appear in a query $Q$, and all agents receive feedback on $Q \cap K$; adaptive algorithmic strategies that are AE's are designed and analyzed.
Results: If $k$ is known, the strategies guarantee a learning time of $O(k \log(n/k))$; if $k$ is unknown, the learning time is of order $n^k$, with a lower bound of $\Omega(n)$ for any such strategy, showing a clear separation between the models with known and unknown $k$ as well as between classical CGT (without selfish agents) and the game-theoretic CGT model.

We study the Combinatorial Group Testing (CGT) problem in a novel game-theoretic framework, with a solution concept of Adversarial Equilibrium (AE). In this new framework, we have $n$ selfish agents corresponding to the elements of the universe $[n] =\{0,1,\ldots,n-1\}$ and a hidden set $K \subseteq [n]$ of active agents of size $|K| = k \ll n$. In each round of the game, each active agent decides if it is present in a query $Q \subseteq [n]$, and all agents receive feedback on $Q \cap K$. The goal of each active agent is to assure that its id could be learned from the feedback as early as possible. We present a comprehensive set of results in this new game, where we design and analyze adaptive algorithmic strategies of agents which are AE's. In particular, if $k$ is known to the agents, then we design adaptive AE strategies with provably near optimal learning time of $O(k \log(n/k))$. In the case of unknown $k$, we design adaptive AE strategies with learning time of order $n^k$, and we prove a lower bound of $\Omega(n)$ on the learning time of any such algorithmic strategies. This shows a strong separation between the two models of known and unknown $k$, as well as between the classic CGT, i.e., without selfish agents, and our game theoretic CGT model.

Faster approximate subgraph counts with privacy
Dung Nguyen Mahantesh M Halappanavar Venkatesh Srinivasan Anil Vullikanti



Research question: Counting the number of non-induced embeddings of a subgraph in a given graph, one of the most common problems studied for graph data under differential privacy.
Motivation: These counts have very high global sensitivity, so adding noise based on stronger alternative techniques such as smooth sensitivity and higher-order local sensitivity gives significantly better accuracy, but these alternatives are computationally very expensive.
Method: Show that good approximations to these sensitivity metrics can still be used to obtain private algorithms.
Results: The first quasilinear-time and parallel algorithms for privately counting triangles, and a private polynomial-time algorithm for counting any constant-size subgraph using less noise than the global sensitivity, with significant improvements for counting paths in special classes of graphs.

One of the most common problems studied in the context of differential privacy for graph data is counting the number of non-induced embeddings of a subgraph in a given graph. These counts have very high global sensitivity. Therefore, adding noise based on powerful alternative techniques, such as smooth sensitivity and higher-order local sensitivity, has been shown to give significantly better accuracy. However, all these alternatives to global sensitivity become computationally very expensive, and to date efficient polynomial time algorithms are known only for a few selected subgraphs, such as triangles, $k$-triangles, and $k$-stars. In this paper, we show that good approximations to these sensitivity metrics can be still used to get private algorithms. Using this approach, we show the first quasilinear time and parallel algorithms for privately counting the number of triangles. We also give a private polynomial time algorithm for counting any constant size subgraph using less noise than the global sensitivity; we show this can be improved significantly for counting paths in special classes of graphs.

Optimal Excess Risk Bounds for Empirical Risk Minimization on $p$-Norm Linear Regression
Ayoub El Hanchi Murat A Erdogdu



Research question: The performance of empirical risk minimization on the $p$-norm linear regression problem for $p\in(1,+\infty)$.
Motivation: In the realizable case, with no moment assumptions, $O(d)$ samples suffice to exactly recover the target; for $p\in[2,+\infty)$, under weak moment assumptions on the target and the covariates, a high-probability excess risk bound for the empirical risk minimizer has a leading term matching the asymptotically exact rate.
Method: Analysis of the empirical risk minimizer, extended to $p\in(1,2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.
Results: Sample complexity and excess risk bounds that match the asymptotically exact rates up to constants depending only on $p$.

We study the performance of empirical risk minimization on the $p$-norm linear regression problem for $p \in (1, \infty)$. We show that, in the realizable case, under no moment assumptions, and up to a distribution-dependent constant, $O(d)$ samples are enough to exactly recover the target. Otherwise, for $p \in [2, \infty)$, and under weak moment assumptions on the target and the covariates, we prove a high probability excess risk bound on the empirical risk minimizer whose leading term matches, up to a constant that depends only on $p$, the asymptotically exact rate. We extend this result to the case $p \in (1, 2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.

Oracle Complexity of Single-Loop Switching Subgradient Methods for Non-Smooth Weakly Convex Functional Constrained Optimization
Yankun Huang Qihang Lin



Research question: Solve non-convex constrained optimization problems in which the objective is weakly convex and the constraint function is either convex or weakly convex.
Motivation: The classical switching subgradient method is an intuitive and easily implementable first-order method, but its oracle complexity was previously known only for convex problems.
Method: Provide the first oracle-complexity analysis of the switching subgradient method for finding a nearly stationary point of non-convex problems, with separate results for convex and weakly convex constraints.
Results: Compared with existing approaches, especially double-loop methods, the switching subgradient method applies to non-smooth problems and achieves the same complexity using only a single loop, saving the effort of tuning the number of inner iterations.

We consider a non-convex constrained optimization problem, where the objective function is weakly convex and the constraint function is either convex or weakly convex. To solve this problem, we consider the classical switching subgradient method, which is an intuitive and easily implementable first-order method whose oracle complexity was only known for convex problems. This paper provides the first analysis on the oracle complexity of the switching subgradient method for finding a nearly stationary point of non-convex problems. Our results are derived separately for convex and weakly convex constraints. Compared to existing approaches, especially the double-loop methods, the switching subgradient method can be applied to non-smooth problems and achieves the same complexity using only a single loop, which saves the effort of tuning the number of inner iterations.
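
The single-loop structure is easy to see in code. Below is a minimal sketch of a switching subgradient step, assuming NumPy; the function names, tolerance, and toy problem are illustrative, not the paper's exact step-size or averaging rules.

import numpy as np

def switching_subgradient(x0, obj_subgrad, con_val, con_subgrad,
                          step_sizes, tol, iters=1000):
    # At each iteration: if the constraint g(x) <= tol is (nearly)
    # satisfied, take a subgradient step on the objective f;
    # otherwise take a subgradient step on the constraint g.
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        eta = step_sizes(t)
        if con_val(x) <= tol:
            d = obj_subgrad(x)   # productive step on f
        else:
            d = con_subgrad(x)   # infeasibility-reducing step on g
        x = x - eta * d
    return x

# Toy usage: minimize f(x) = |x - 2| subject to g(x) = x - 1 <= 0.
x_hat = switching_subgradient(
    x0=np.array([3.0]),
    obj_subgrad=lambda x: np.sign(x - 2.0),
    con_val=lambda x: float(x[0] - 1.0),
    con_subgrad=lambda x: np.ones_like(x),
    step_sizes=lambda t: 0.5 / np.sqrt(t + 1),
    tol=1e-3,
)
print(x_hat)  # should approach the constrained optimum x = 1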

Single-Call Stochastic Extragradient Methods for Structured Non-monotone Variational Inequalities: Improved Analysis under Weaker Conditions
Sayantan Choudhury Eduard Gorbunov Nicolas Loizou



Research question: Although single-call stochastic extragradient methods (such as stochastic past extragradient and stochastic optimistic gradient) are highly effective for large-scale min-max optimization and variational inequality problems arising in machine learning, current convergence analyses require strong assumptions such as bounded variance or growth conditions, and several important questions remain open, including mini-batching, efficient step-size selection, and guarantees under different sampling strategies.
Motivation: This paper aims to resolve these questions and provide convergence guarantees for two classes of structured non-monotone variational inequalities: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems), and (ii) weak Minty variational inequalities (a generalization of monotone and Minty variational inequalities).
Method: We introduce the expected residual condition, explain its benefits, and show how it yields strictly weaker bounds than the previously used growth conditions, expected co-coercivity, or bounded-variance assumptions.
Results: Our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.

Single-call stochastic extragradient methods, like stochastic past extragradient (SPEG) and stochastic optimistic gradient (SOG), have gained a lot of interest in recent years and are among the most efficient algorithms for solving large-scale min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, despite their undoubted popularity, current convergence analyses of SPEG and SOG require strong assumptions like bounded variance or growth conditions. In addition, several important questions regarding the convergence properties of these methods are still open, including mini-batching, efficient step-size selection, and convergence guarantees under different sampling strategies. In this work, we address these questions and provide convergence guarantees for two large classes of structured non-monotone VIPs: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems) and (ii) weak Minty variational inequalities (a generalization of monotone and Minty VIPs). We introduce the expected residual condition, explain its benefits, and show how it allows us to obtain a strictly weaker bound than previously used growth conditions, expected co-coercivity, or bounded variance assumptions. Finally, our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.
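
A minimal sketch of one common single-call form of the stochastic optimistic gradient update, assuming NumPy; the step size and the toy quasi-strongly monotone operator are illustrative, not the paper's settings.

import numpy as np

def stochastic_optimistic_gradient(z0, oracle, eta, iters=2000, rng=None):
    # Single oracle call per iteration; the previous sample is reused:
    # z_{t+1} = z_t - eta * (2 g_t - g_{t-1}).
    rng = rng or np.random.default_rng(0)
    z = np.asarray(z0, float)
    g_prev = oracle(z, rng)
    for _ in range(iters):
        g = oracle(z, rng)
        z = z - eta * (2.0 * g - g_prev)
        g_prev = g
    return z

# Toy quasi-strongly monotone VIP: F(z) = A z + noise, solution z* = 0.
A = np.array([[0.2, 1.0], [-1.0, 0.2]])
oracle = lambda z, rng: A @ z + 0.01 * rng.standard_normal(2)
print(stochastic_optimistic_gradient(np.array([1.0, 1.0]), oracle, eta=0.1))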

Personalized Dictionary Learning for Heterogeneous Datasets
Geyu Liang Naichen Shi Raed Al Kontar Salar Fattahi



Research question: This paper introduces a relevant yet challenging problem, Personalized Dictionary Learning (PerDL), whose goal is to learn sparse linear representations from heterogeneous datasets that share some commonality.
Motivation: In PerDL, each dataset's shared and unique features are modeled as global and local dictionaries. The challenges of PerDL are inherited not only from classical dictionary learning (DL) but also arise from the unknown nature of the shared and unique features.
Method: The paper rigorously formulates the problem and provides conditions under which the global and local dictionaries can be provably disentangled; under these conditions, it gives a meta-algorithm, Personalized Matching and Averaging (PerMA), that recovers both dictionaries from heterogeneous datasets.
Results: PerMA is highly efficient and converges to the ground truth at a linear rate under suitable conditions; it also automatically borrows strength from strong learners to improve the prediction of weak learners. As a general framework for extracting global and local dictionaries, PerDL is demonstrated on learning tasks such as training with imbalanced datasets and video surveillance.

We introduce a relevant yet challenging problem named Personalized Dictionary Learning (PerDL), where the goal is to learn sparse linear representations from heterogeneous datasets that share some commonality. In PerDL, we model each dataset's shared and unique features as global and local dictionaries. Challenges for PerDL are not only inherited from classical dictionary learning (DL), but also arise due to the unknown nature of the shared and unique features. In this paper, we rigorously formulate this problem and provide conditions under which the global and local dictionaries can be provably disentangled. Under these conditions, we provide a meta-algorithm called Personalized Matching and Averaging (PerMA) that can recover both global and local dictionaries from heterogeneous datasets. PerMA is highly efficient; it converges to the ground truth at a linear rate under suitable conditions. Moreover, it automatically borrows strength from strong learners to improve the prediction of weak learners. As a general framework for extracting global and local dictionaries, we show the application of PerDL in different learning tasks, such as training with imbalanced datasets and video surveillance.

Multi-Agent Learning with Heterogeneous Linear Contextual Bandits
Anh Do Thanh Nguyen-Tang Raman Arora



Research question: This work asks how and when learners in heterogeneous environments benefit from sharing their respective experiences in multi-agent learning.
Motivation: As trained intelligent systems become increasingly pervasive, multi-agent learning has become a popular framework for studying complex interactions between autonomous agents, yet a formal understanding of the benefit of sharing experiences across heterogeneous environments is far from complete.
Method: The paper studies these questions in the context of linear contextual bandits and presents H-LINUCB, a novel distributed learning algorithm based on the upper confidence bound (UCB) algorithm, in which agents cooperatively minimize the group regret under the coordination of a central server.
Results: When the level of heterogeneity or dissimilarity across environments is known to the agents, H-LINUCB is provably optimal in the regimes where the tasks are highly similar or highly dissimilar.

As trained intelligent systems become increasingly pervasive, multiagent learning has emerged as a popular framework for studying complex interactions between autonomous agents. Yet, a formal understanding of how and when learners in heterogeneous environments benefit from sharing their respective experiences is far from complete. In this paper, we seek answers to these questions in the context of linear contextual bandits. We present a novel distributed learning algorithm based on the upper confidence bound (UCB) algorithm, which we refer to as H-LINUCB, wherein agents cooperatively minimize the group regret under the coordination of a central server. In the setting where the level of heterogeneity or dissimilarity across the environments is known to the agents, we show that H-LINUCB is provably optimal in regimes where the tasks are highly similar or highly dissimilar.
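
H-LINUCB coordinates agents around a linear UCB core. Below is a minimal single-agent LinUCB sketch of that building block, assuming NumPy; the exploration constant, arm set, and reward model are illustrative, and this is not H-LINUCB itself.

import numpy as np

class LinUCB:
    # Basic linear UCB: ridge-regularized least squares plus an
    # ellipsoidal exploration bonus.
    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)   # Gram matrix
        self.b = np.zeros(d)       # reward-weighted feature sum
        self.beta = beta
    def choose(self, arms):
        Vinv = np.linalg.inv(self.V)
        theta = Vinv @ self.b
        ucb = [a @ theta + self.beta * np.sqrt(a @ Vinv @ a) for a in arms]
        return int(np.argmax(ucb))
    def update(self, a, r):
        self.V += np.outer(a, a)
        self.b += r * a

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5])
agent = LinUCB(d=2)
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
for _ in range(200):
    i = agent.choose(arms)
    r = arms[i] @ theta_star + 0.1 * rng.standard_normal()
    agent.update(arms[i], r)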

Multi-Swap k-Means++
Lorenzo Beretta Vincent Cohen-Addad Silvio Lattanzi Nikos Parotsidis



Research question: How to optimize the k-means clustering objective and improve the quality of the solutions.
Motivation: The existing k-means++ algorithm gives an O(log k)-approximation in expectation, but there is still room to obtain higher-quality solutions.
Method: Generalize and extend the local-search algorithm of Lattanzi and Sohler by considering larger and more sophisticated local-search neighborhoods, allowing multiple centers to be swapped at the same time.
Results: The algorithm achieves a 9 + ε approximation ratio, the best possible for local search; moreover, it is easy to implement, runs fast, and outputs better solutions on a variety of classic datasets.

The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local-search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local-search algorithm by considering larger and more sophisticated local-search neighborhoods, hence allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly we show that our algorithm is practical, namely easy to implement and fast enough to run on a variety of classic datasets, and outputs solutions of better cost.
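
A minimal sketch of one multi-swap local-search step, assuming NumPy; the brute-force search over size-p subsets and the toy data are illustrative simplifications of the paper's algorithm, suitable only for small k.

import numpy as np
from itertools import combinations

def kmeans_cost(X, centers):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def d2_sample(X, centers, p, rng):
    # Sample p candidate centers from the k-means++ (D^2) distribution.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    idx = rng.choice(len(X), size=p, p=d2 / d2.sum(), replace=False)
    return X[idx]

def multi_swap_step(X, centers, p, rng):
    # Swap in p D^2-sampled candidates for the best set of p current
    # centers, keeping the swap only if the cost improves.
    cand = d2_sample(X, centers, p, rng)
    best, best_cost = centers, kmeans_cost(X, centers)
    for out in combinations(range(len(centers)), p):
        trial = np.vstack([np.delete(centers, out, axis=0), cand])
        c = kmeans_cost(X, trial)
        if c < best_cost:
            best, best_cost = trial, c
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
centers = X[rng.choice(300, 5, replace=False)]
for _ in range(20):
    centers = multi_swap_step(X, centers, p=2, rng=rng)
print(kmeans_cost(X, centers))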

Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL with General Regularizers and Multiple Optimal Arms
Tiancheng Jin Junyan Liu Haipeng Luo



Research question: Design adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-worlds guarantee).
Motivation: Recent work shows that, when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact adapt optimally to the stochastic setting as well; however, these results critically rely on the assumption that there exists a unique optimal arm.
Method: We significantly improve and generalize these results by removing the unnecessary uniqueness assumption, for FTRL with a broad family of regularizers and a new learning-rate schedule.
Results: For some regularizers, our regret bounds improve upon prior results even when uniqueness holds; we also apply the results to the decoupled exploration-and-exploitation problem, demonstrating that our techniques are broadly applicable.

We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-world guarantee). A line of recent works shows that when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on an assumption that there exists one unique optimal arm. Recently, Ito [2021] took the first step to remove such an undesirable uniqueness assumption for one particular FTRL algorithm with the 1/2-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.
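
For intuition, a minimal sketch of an FTRL bandit step with the (1/2)-Tsallis-entropy regularizer and importance-weighted loss estimates, assuming NumPy; the simple 1/sqrt(t) learning rate is illustrative, and the paper's contribution (the new schedule and the analysis without a unique optimal arm) is not reproduced here.

import numpy as np

def tsallis_inf_step(L_hat, eta, iters=60):
    # FTRL with the (1/2)-Tsallis entropy: the minimizer over the
    # simplex has the form p_i = (eta * L_i + lam)^(-2), with the
    # normalizer lam found by bisection so that p sums to 1.
    lo = -eta * L_hat.min() + 1e-12
    hi = lo + len(L_hat)
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.sum((eta * L_hat + lam) ** -2.0) > 1.0:
            lo = lam
        else:
            hi = lam
    p = (eta * L_hat + hi) ** -2.0
    return p / p.sum()

rng = np.random.default_rng(0)
K, T = 5, 5000
mean_loss = np.linspace(0.2, 0.8, K)   # arm 0 is best
L_hat = np.zeros(K)
for t in range(1, T + 1):
    p = tsallis_inf_step(L_hat, eta=1.0 / np.sqrt(t))
    a = rng.choice(K, p=p)
    loss = float(rng.random() < mean_loss[a])
    L_hat[a] += loss / p[a]            # importance-weighted estimate
print(p.argmax())  # should concentrate on arm 0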

Counting Distinct Elements in the Turnstile Model with Differential Privacy under Continual Observation
Palak Jain Iden Kalemaj Sofya Raskhodnikova Satchit Sivakumar Adam Smith



Research question: How to protect privacy for systems that learn from sensitive datasets and must continually update their outputs, particularly for data streams in which items may be both inserted and deleted.
Motivation: With insertions only, existing algorithms have additive error merely polylogarithmic in the stream length T; we uncover a much richer landscape in the turnstile model, even without memory restrictions.
Method: We show that every differentially private mechanism handling insertions and deletions has worst-case additive error at least $T^{1/4}$, even under a relatively weak, event-level privacy definition; we also identify a parameter of the input stream, its maximum flippancy, which is low for natural data streams and for which we give tight parameterized error guarantees.
Results: The proposed mechanism continually outputs the number of distinct elements with $O(\sqrt{w} \cdot \mathrm{polylog}\, T)$ additive error on all turnstile streams with maximum flippancy $w$, without prior knowledge of $w$; we prove this is the best achievable error bound depending only on $w$ for a large range of $w$. When $w$ is small, the error is comparable to the polylogarithmic error of the insertion-only setting, bypassing the hardness of the turnstile model.

Privacy is a central challenge for systems that learn from sensitive data sets, especially when a system's outputs must be continuously updated to reflect changing data. We consider the achievable error for differentially private continual release of a basic statistic---the number of distinct items---in a stream where items may be both inserted and deleted (the turnstile model). With only insertions, existing algorithms have additive error just polylogarithmic in the length of the stream $T$. We uncover a much richer landscape in the turnstile model, even without considering memory restrictions. We show that every differentially private mechanism that handles insertions and deletions has worst-case additive error at least $T^{1/4}$ even under a relatively weak, event-level privacy definition. Then, we identify a parameter of the input stream, its maximum flippancy, that is low for natural data streams and for which we give tight parameterized error guarantees. Specifically, the maximum flippancy is the largest number of times that the contribution of a single item to the distinct elements count changes over the course of the stream. We present an item-level differentially private mechanism that, for all turnstile streams with maximum flippancy $w$, continually outputs the number of distinct elements with an $O(\sqrt{w} \cdot \mathsf{poly}\log T)$ additive error, without requiring prior knowledge of $w$. We prove that this is the best achievable error bound that depends only on $w$, for a large range of values of $w$. When $w$ is small, the error of our mechanism is similar to the polylogarithmic in $T$ error in the insertion-only setting, bypassing the hardness in the turnstile model.

Optimization and Bayes: A Trade-off for Overparameterized Neural Networks
Zhengmian Hu Heng Huang



Research question: This paper proposes a new algorithm, Transformative Bayesian Learning (TansBL), which bridges the gap between empirical risk minimization (ERM) and Bayesian learning for neural networks.
Motivation: Compare ERM optimized by gradient descent with Bayesian learning via importance sampling, in terms of their generalization ability and computational complexity.
Method: Derive the first algorithm-dependent PAC-Bayesian generalization bound for infinitely wide networks, based on the exact KL divergence between the trained posterior obtained by infinitesimal-step-size gradient descent and a Gaussian prior; additionally, show how to turn gradient-based optimization into importance sampling by incorporating a weight.
Results: Bayesian learning generalizes better but suffers from low sampling efficiency, while optimization methods sample efficiently but generalize worse; the proposed TansBL enables a trade-off between generalization and sampling efficiency.

This paper proposes a novel algorithm, Transformative Bayesian Learning (TansBL), which bridges the gap between empirical risk minimization (ERM) and Bayesian learning for neural networks. We compare ERM, which uses gradient descent to optimize, and Bayesian learning with importance sampling for their generalization and computational complexity. We derive the first algorithm-dependent PAC-Bayesian generalization bound for infinitely wide networks based on an exact KL divergence between the trained posterior distribution obtained by infinitesimal step size gradient descent and a Gaussian prior. Moreover, we show how to transform gradient-based optimization into importance sampling by incorporating a weight. While Bayesian learning has better generalization, it suffers from low sampling efficiency. Optimization methods, on the other hand, have good sampling efficiency but poor generalization. Our proposed algorithm TansBL enables a trade-off between generalization and sampling efficiency.

Mixture Weight Estimation and Model Prediction in Multi-source Multi-target Domain Adaptation
Yuyang Deng Ilja Kuzborskij Mehrdad Mahdavi



Research question: How to learn a model from multiple data sources so that it performs well on a new target distribution.
Motivation: When learning from data collected from multiple sources (e.g., crowdsourcing) or in distributed systems, the data can be highly heterogeneous; the sources must be mixed in a target-distribution-aware way while minimizing the empirical risk on the mixed source.
Method: Cast the first problem, estimating the optimal source mixture for a given target domain, as a convex-nonconcave compositional minimax problem and propose an efficient stochastic algorithm with stationarity guarantees; for the second problem, show that in a certain regime solving a separate ERM per target can be avoided by viewing the target-optimal model's parameters as a nonlinear function over the space of mixture coefficients, which a GD-trained overparameterized neural network provably learns in the offline setting; finally, propose a label-efficient online algorithm for the online setting.
Results: The online algorithm predicts parameters for new models given arbitrary sequences of mixing coefficients while enjoying optimal regret.

We consider a problem of learning a model from multiple sources with the goal to perform well on a new target distribution. Such a problem arises in learning with data collected from multiple sources (e.g. crowdsourcing) or learning in distributed systems, where the data can be highly heterogeneous. The goal of the learner is to mix these data sources in a target-distribution aware way and simultaneously minimize the empirical risk on the mixed source. The literature has made some tangible advancements in establishing the theory of learning on mixture domains. However, there are still two unsolved problems. First, how to estimate the optimal mixture of sources, given a target domain; second, when there are numerous target domains, we have to solve empirical risk minimization for each target on possibly unique mixed source data, which is computationally expensive. In this paper we address both problems efficiently and with guarantees. We cast the first problem, mixture weight estimation, as a convex-nonconcave compositional minimax problem, and propose an efficient stochastic algorithm with provable stationarity guarantees. Next, for the second problem, we identify that for a certain regime, solving ERM for each target domain individually can be avoided, and instead the parameters for a target optimal model can be viewed as a non-linear function on the space of the mixture coefficients. To this end, we show that in the offline setting, a GD-trained overparameterized neural network can provably learn such a function. Finally, we also consider an online setting and propose a label-efficient online algorithm, which predicts parameters for new models given an arbitrary sequence of mixing coefficients, while enjoying optimal regret.

Sample-Conditioned Hypothesis Stability Sharpens Information-Theoretic Generalization Bounds
Ziqiao Wang Yongyi Mao



Research question: Provide new information-theoretic generalization guarantees through a novel construction of the "neighboring-hypothesis" matrix and a new family of stability notions, sample-conditioned hypothesis (SCH) stability.
Motivation: Improve existing information-theoretic bounds, in particular addressing their limitations in stochastic convex optimization (SCO) problems.
Method: Combine the neighboring-hypothesis construction with SCH stability to derive sharper generalization bounds.
Results: The resulting bounds improve upon previous information-theoretic bounds in various learning scenarios and, notably, address the limitations of existing information-theoretic bounds for SCO explored in the recent work of Haghifam et al. (2023).

We present new information-theoretic generalization guarantees through a novel construction of the "neighboring-hypothesis" matrix and a new family of stability notions termed sample-conditioned hypothesis (SCH) stability. Our approach yields sharper bounds that improve upon previous information-theoretic bounds in various learning scenarios. Notably, these bounds address the limitations of existing information-theoretic bounds in the context of stochastic convex optimization (SCO) problems, as explored in the recent work by Haghifam et al. (2023).

Robust Matrix Sensing in the Semi-Random Model
Xing Gao Yu Cheng



Research question: This paper addresses low-rank matrix recovery, a fundamental problem in machine learning, in the setting of matrix sensing within a semi-random model.
Motivation: In practice, low-rank matrix recovery can be solved by convex optimization (nuclear norm minimization) or by non-convex optimization; for low-rank problems such as matrix sensing and matrix completion, all local optima of the natural non-convex objectives are known to be globally optimal under certain ideal assumptions, but in the semi-random model existing non-convex objectives can have bad local optima.
Method: Study matrix sensing in a semi-random model where an adversary can add any number of arbitrary sensing matrices: the goal is to recover a low-rank matrix X* from linear measurements b_i = ⟨A_i, X*⟩, where an unknown subset of the sensing matrices satisfies the Restricted Isometry Property (RIP) and the remaining A_i are chosen adversarially.
Results: The proposed descent-style algorithm provably recovers the ground-truth matrix X*. For the closely related semi-random matrix completion problem, prior work [CG18] showed that all bad local optima can be eliminated by reweighting the input data; the analogous approach for matrix sensing would require reweighting a set of matrices to satisfy RIP, which is NP-hard to check. Instead, the algorithm builds on the semi-random sparse linear regression framework of [KLL$^+$23], reweighting the input based on the current solution in each iteration and then taking a weighted gradient step that is guaranteed to work well locally.

Low-rank matrix recovery is a fundamental problem in machine learning with numerous applications. In practice, the problem can be solved by convex optimization namely nuclear norm minimization, or by non-convex optimization as it is well-known that for low-rank matrix problems like matrix sensing and matrix completion, all local optima of the natural non-convex objectives are also globally optimal under certain ideal assumptions. In this paper, we study new approaches for matrix sensing in a semi-random model where an adversary can add any number of arbitrary sensing matrices. More precisely, the problem is to recover a low-rank matrix $X^\star$ from linear measurements $b_i = \langle A_i, X^\star \rangle$, where an unknown subset of the sensing matrices satisfies the Restricted Isometry Property (RIP) and the rest of the $A_i$'s are chosen adversarially. It is known that in the semi-random model, existing non-convex objectives can have bad local optima. To fix this, we present a descent-style algorithm that provably recovers the ground-truth matrix $X^\star$. For the closely-related problem of semi-random matrix completion, prior work [CG18] showed that all bad local optima can be eliminated by reweighting the input data. However, the analogous approach for matrix sensing requires reweighting a set of matrices to satisfy RIP, which is a condition that is NP-hard to check. Instead, we build on the framework proposed in [KLL$^+$23] for semi-random sparse linear regression, where the algorithm in each iteration reweights the input based on the current solution, and then takes a weighted gradient step that is guaranteed to work well locally. Our analysis crucially exploits the connection between sparsity in vector problems and low-rankness in matrix problems, which may have other applications in obtaining robust algorithms for sparse and low-rank problems.

Sorting with Predictions
Xingjian Bai Christian Coester



Research question: Explore the fundamental problem of sorting through the lens of learning-augmented algorithms.
Motivation: Classical sorting algorithms cannot exploit side information, whereas learning-augmented algorithms can leverage possibly erroneous predictions to improve their efficiency.
Method: Consider two settings: in the first, each item is provided a prediction of its position in the sorted list; in the second, in addition to slow-and-exact comparisons, there is a "quick-and-dirty" way of comparing items. For both settings, design new and simple algorithms using only $O(\sum_i \log \eta_i)$ exact comparisons, where $\eta_i$ is a suitably defined prediction error for the $i$-th element.
Results: Experiments against existing adaptive and non-adaptive sorting algorithms demonstrate the potential of applying learning-augmented algorithms to sorting tasks.

We explore the fundamental problem of sorting through the lens of learning-augmented algorithms, where algorithms can leverage possibly erroneous predictions to improve their efficiency. We consider two different settings: In the first setting, each item is provided a prediction of its position in the sorted list. In the second setting, we assume there is a ``quick-and-dirty'' way of comparing items, in addition to slow-and-exact comparisons. For both settings, we design new and simple algorithms using only $O(\sum_i \log \eta_i)$ exact comparisons, where $\eta_i$ is a suitably defined prediction error for the $i$th element. In particular, as the quality of predictions deteriorates, the number of comparisons degrades smoothly from $O(n)$ to $O(n\log n)$. We prove that this comparison complexity is theoretically optimal with respect to the examined error measures. An experimental evaluation against existing adaptive and non-adaptive sorting algorithms demonstrates the potential of applying learning-augmented algorithms in sorting tasks.
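
A minimal sketch of the first setting, assuming plain Python: items are ordered by their predicted positions, then inserted with a galloping (exponential) search from the right end of the sorted prefix, so an item whose prediction is off by roughly eta costs on the order of log(eta) comparisons. This is an illustrative stand-in, not the paper's exact algorithm.

def sort_with_predictions(items, pred_pos):
    order = sorted(range(len(items)), key=lambda i: pred_pos[i])
    out = []
    for i in order:
        x = items[i]
        n, off = len(out), 1
        while off <= n and out[n - off] > x:   # gallop to bracket x
            off *= 2
        lo, hi = max(0, n - off), n - off // 2
        while lo < hi:                         # binary search in bracket
            mid = (lo + hi) // 2
            if out[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        out.insert(lo, x)
    return out

vals = [5, 1, 4, 2, 3]
noisy = [4, 0, 2, 3, 1]   # imperfect predicted positions
print(sort_with_predictions(vals, noisy))  # [1, 2, 3, 4, 5]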

No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions
Tiancheng Jin Junyan Liu Chloé Rouyer William Chang Chen-Yu Wei Haipeng Luo



Research question: Existing online learning algorithms for adversarial Markov decision processes achieve $\mathcal{O}(\sqrt{T})$ regret after T rounds of interaction even when the loss functions are chosen arbitrarily by an adversary, but only when the transition function is fixed.
Motivation: Although adversarial transition functions are known to make no-regret learning impossible, this work develops algorithms that handle both adversarial losses and adversarial transitions, with regret growing smoothly in the degree of maliciousness of the adversary.
Method: We first propose an algorithm with regret $\widetilde{\mathcal{O}}(\sqrt{T} + C^{P})$, where $C^{P}$ measures how adversarial the transition functions are and can be at most $\mathcal{O}(T)$; we then develop a black-box reduction approach that removes the need to know $C^{P}$.
Results: A further refinement maintains the same regret bound while simultaneously adapting to easier environments (where losses are generated in a certain stochastically constrained manner, as in [Jin et al. 2021]), achieving $\widetilde{\mathcal{O}}(U + \sqrt{UC^{L}} + C^{P})$ regret, where $U$ is a standard gap-dependent coefficient and $C^{L}$ is the amount of corruption on losses.

Existing online learning algorithms for adversarial Markov Decision Processes achieve $\mathcal{O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This is because it has been shown that adversarial transition functions make no-regret learning impossible. Despite such impossibility results, in this work, we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{\mathcal{O}}(\sqrt{T} + C^{P})$ regret where $C^{P}$ measures how adversarial the transition functions are and can be at most $\mathcal{O}(T)$. While this algorithm itself requires knowledge of $C^{P}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we also show that further refinements of the algorithm not only maintain the same regret bound, but also simultaneously adapt to easier environments (where losses are generated in a certain stochastically constrained manner as in [Jin et al. 2021]) and achieve $\widetilde{\mathcal{O}}(U + \sqrt{UC^{L}} + C^{P})$ regret, where $U$ is some standard gap-dependent coefficient and $C^{L}$ is the amount of corruption on losses.

Direction-oriented Multi-objective Learning: Simple and Provable Stochastic Algorithms
Peiyao Xiao Hao Ban Kaiyi Ji



Research question: Multi-objective optimization (MOO) has become an influential framework for many machine learning problems with multiple objectives, such as learning with multiple criteria and multi-task learning (MTL).
Motivation: This paper proposes a new direction-oriented multi-objective formulation that regularizes the common descent direction within a neighborhood of a direction optimizing a linear combination of objectives, such as the average loss in MTL or a weighted loss placing higher emphasis on some tasks.
Method: We propose Stochastic Direction-oriented Multi-objective Gradient descent (SDMGrad) and its variant SDMGrad-OS with efficient objective sampling, both using simple SGD-type updates, and develop a comprehensive convergence analysis for both methods.
Results: SDMGrad and SDMGrad-OS achieve improved sample complexities for finding an ε-accurate Pareto stationary point while maintaining a small ε-level distance toward a conflict-avoidant (CA) direction; for a constant-level CA distance, their sample complexities match the best known without bounded function value assumptions. Across a series of multi-task supervised learning and reinforcement learning tasks, the methods achieve performance competitive with or better than existing gradient-manipulation approaches.

Multi-objective optimization (MOO) has become an influential framework in many machine learning problems with multiple objectives such as learning with multiple criteria and multi-task learning (MTL). In this paper, we propose a new direction-oriented multi-objective formulation by regularizing the common descent direction within a neighborhood of a direction that optimizes a linear combination of objectives such as the average loss in MTL or a weighted loss that places higher emphasis on some tasks than the others. This formulation includes GD and MGDA as special cases, enjoys the direction-oriented benefit as in CAGrad, and facilitates the design of stochastic algorithms. To solve this problem, we propose Stochastic Direction-oriented Multi-objective Gradient descent (SDMGrad) with simple SGD type of updates, and its variant SDMGrad-OS with an efficient objective sampling. We develop a comprehensive convergence analysis for the proposed methods with different loop sizes and regularization coefficients. We show that both SDMGrad and SDMGrad-OS achieve improved sample complexities to find an $\epsilon$-accurate Pareto stationary point while achieving a small $\epsilon$-level distance toward a conflict-avoidant (CA) direction. For a constant-level CA distance, their sample complexities match the best known $\mathcal{O}(\epsilon^{-2})$ without bounded function value assumption. Extensive experiments show that our methods achieve competitive or improved performance compared to existing gradient manipulation approaches in a series of tasks on multi-task supervised learning and reinforcement learning. Code is available at https://github.com/ml-opt-lab/sdmgrad.

Stochastic Approximation Approaches to Group Distributionally Robust Optimization
Lijun Zhang Peng Zhao Zhenhua Zhuang Tianbao Yang Zhi-Hua Zhou



Research question: This paper investigates group distributionally robust optimization (GDRO), with the goal of learning a model that performs well over $m$ different distributions.
Motivation: Formulated as a stochastic convex-concave saddle-point problem, stochastic mirror descent (SMD) with $m$ samples per iteration achieves an $O(m (\log m)/\epsilon^2)$ sample complexity, matching the $\Omega(m/\epsilon^2)$ lower bound up to a logarithmic factor; the questions are whether the per-round sample cost can be reduced and how to handle unequal sample budgets across distributions.
Method: Use online-learning techniques to reduce the samples per round from $m$ to 1 while keeping the same sample complexity, casting GDRO as a two-player game between SMD and a non-oblivious multi-armed bandit algorithm; further propose a weighted GDRO formulation with non-uniform sampling, and a mini-batched stochastic mirror-prox variant, to respect per-distribution budgets $n_1 \geq \cdots \geq n_m$.
Results: With the first approach the excess risk of the $i$-th distribution decreases at an $O(\sqrt{n_1 \log m}/n_i)$ rate; the second attains an $O((\log m)/\sqrt{n_i})$ rate, almost matching the optimal $O(\sqrt{1/n_i})$ rate of learning from the $i$-th distribution alone with $n_i$ samples.

This paper investigates group distributionally robust optimization (GDRO), with the purpose to learn a model that performs well over $m$ different distributions. First, we formulate GDRO as a stochastic convex-concave saddle-point problem, and demonstrate that stochastic mirror descent (SMD), using $m$ samples in each iteration, achieves an $O(m (\log m)/\epsilon^2)$ sample complexity for finding an $\epsilon$-optimal solution, which matches the $\Omega(m/\epsilon^2)$ lower bound up to a logarithmic factor. Then, we make use of techniques from online learning to reduce the number of samples required in each round from $m$ to $1$, keeping the same sample complexity. Specifically, we cast GDRO as a two-players game where one player simply performs SMD and the other executes an online algorithm for non-oblivious multi-armed bandits. Next, we consider a more practical scenario where the number of samples that can be drawn from each distribution is different, and propose a novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates. Denote by $n_i$ the sample budget for the $i$-th distribution, and assume $n_1 \geq n_2 \geq \cdots \geq n_m$. In the first approach, we incorporate non-uniform sampling into SMD such that the sample budget is satisfied in expectation, and prove that the excess risk of the $i$-th distribution decreases at an $O(\sqrt{n_1 \log m}/n_i)$ rate. In the second approach, we use mini-batches to meet the budget exactly and also reduce the variance in stochastic gradients, and then leverage stochastic mirror-prox algorithm, which can exploit small variances, to optimize a carefully designed weighted GDRO problem. Under appropriate conditions, it attains an $O((\log m)/\sqrt{n_i})$ convergence rate, which almost matches the optimal $O(\sqrt{1/n_i})$ rate of only learning from the $i$-th distribution with $n_i$ samples.
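 
A minimal sketch of the basic SMD recipe for group DRO, assuming NumPy: gradient descent on the model, exponentiated-gradient (mirror) ascent on the group weights over the simplex, one sample per group per round. The loss, step sizes, and toy groups are illustrative, not the paper's refined variants.

import numpy as np

def gdro_smd(w0, samplers, loss_grad, loss_val, eta_w, eta_q, iters, rng):
    w = np.asarray(w0, float)
    m = len(samplers)
    q = np.full(m, 1.0 / m)
    for _ in range(iters):
        zs = [s(rng) for s in samplers]
        losses = np.array([loss_val(w, z) for z in zs])
        g_w = sum(qi * loss_grad(w, z) for qi, z in zip(q, zs))
        w = w - eta_w * g_w                 # descent step on the model
        q = q * np.exp(eta_q * losses)      # mirror ascent on the simplex
        q = q / q.sum()
    return w, q

rng = np.random.default_rng(0)
# Two groups drawing targets around different means; squared loss.
samplers = [lambda r: r.normal(1.0), lambda r: r.normal(-2.0)]
loss_val = lambda w, z: (w[0] - z) ** 2
loss_grad = lambda w, z: np.array([2 * (w[0] - z)])
w, q = gdro_smd(np.zeros(1), samplers, loss_grad, loss_val,
                eta_w=0.05, eta_q=0.05, iters=3000, rng=rng)
print(w, q)  # w settles between the group optima; q upweights the worse group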

Sampling from Structured Log-Concave Distributions via a Soft-Threshold Dikin Walk
Oren Mangoubi Nisheeth K Vishnoi



Research question: How to sample from a log-concave distribution constrained to a polytope.
Motivation: The problem has important applications in areas such as Bayesian inference and differential privacy.
Method: Propose a generalization of the Dikin walk suited to this setting, which adds a soft-threshold regularizer, derived from the Lipschitz or smoothness properties of f, to a barrier function for K, so that the proposed updates enjoy a high Metropolis acceptance ratio.
Results: The method improves on the running time of prior work for a range of structured settings important to the aforementioned inference and privacy applications.

Given a Lipschitz or smooth convex function $f: K \to \mathbb{R}$ on a bounded polytope $K := \{\theta \in \mathbb{R}^d : A\theta \leq b\}$, where $A\in \mathbb{R}^{m\times d}$ and $b \in \mathbb{R}^m$, we consider the problem of sampling from the log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to $K$. Interest in this problem derives from its applications to Bayesian inference and differential privacy. We present a generalization of the Dikin walk to this setting that requires at most $O((md + d L^2 R^2) \times md^{\omega-1} \log(\frac{w}{\delta}))$ arithmetic operations to sample from $\pi$ within error $\delta>0$ in the total variation distance from a $w$-warm start. Here $L$ is the Lipschitz constant of $f$, $K$ is contained in a ball of radius $R$ and contains a ball of smaller radius $r$, and $\omega \approx 2.37$ is the matrix-multiplication constant. This improves on the running time of prior works for a range of structured settings important for the aforementioned inference and privacy applications. Technically, we depart from previous Dikin walks by adding a soft-threshold regularizer derived from the Lipschitz or smoothness properties of $f$ to a barrier function for $K$ that allows our version of the Dikin walk to propose updates that have a high Metropolis acceptance ratio for $f$, while at the same time remaining inside the polytope $K$.

On the Power of SVD in the Stochastic Block Model
Xinyu Mao Jiapeng Zhang



Research question: This study aims to understand the behavior of spectral methods in clustering problems.
Motivation: Spectral dimensionality-reduction tools such as PCA or SVD are observed to improve the performance of clustering algorithms in many applications, which raises the question of what role the spectral step actually plays in clustering.
Method: As an initial step, the paper studies the power of the vanilla-SVD algorithm in the stochastic block model (SBM), showing that in the symmetric setting the vanilla-SVD algorithm recovers all clusters correctly.
Results: This result answers an open question posed by Van Vu (Combinatorics, Probability and Computing, 2018) in the symmetric setting.

A popular heuristic method for improving clustering results is to apply dimensionality reduction before running clustering algorithms. It has been observed that spectral-based dimensionality reduction tools, such as PCA or SVD, improve the performance of clustering algorithms in many applications. This phenomenon indicates that the spectral method not only serves as a dimensionality reduction tool, but also contributes to the clustering procedure in some sense. It is an interesting question to understand the behavior of spectral steps in clustering problems. As an initial step in this direction, this paper studies the power of the vanilla-SVD algorithm in the stochastic block model (SBM). We show that, in the symmetric setting, the vanilla-SVD algorithm recovers all clusters correctly. This result answers an open question posed by Van Vu (Combinatorics Probability and Computing, 2018) in the symmetric setting.
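
A minimal sketch of the vanilla-SVD pipeline on a symmetric two-block SBM, assuming NumPy: take a rank-k SVD of the adjacency matrix, then run a tiny Lloyd's k-means on the spectral embedding. The block probabilities and the k-means details are illustrative.

import numpy as np

def svd_cluster(A, k, rng=None):
    rng = rng or np.random.default_rng(0)
    U, S, _ = np.linalg.svd(A)
    emb = U[:, :k] * S[:k]            # rows: spectral embedding of nodes
    centers = emb[rng.choice(len(emb), k, replace=False)]
    for _ in range(50):               # Lloyd iterations
        lab = ((emb[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([emb[lab == j].mean(0) if np.any(lab == j)
                            else centers[j] for j in range(k)])
    return lab

# Symmetric two-block SBM: within-block prob 0.6, across-block 0.1.
rng = np.random.default_rng(1)
n, p, q = 100, 0.6, 0.1
truth = np.repeat([0, 1], n // 2)
P = np.where(truth[:, None] == truth[None, :], p, q)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T
print(svd_cluster(A, 2)[:10])  # first ten nodes share a block, labels agree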

Learning Regularized Monotone Graphon Mean-Field Games
Fengzhuo Zhang Vincent Tan Zhaoran Wang Zhuoran Yang



Research question: This paper studies two fundamental problems in regularized Graphon Mean-Field Games (GMFGs).
Motivation: Prior analyses of unregularized GMFGs and λ-regularized MFGs required stronger conditions, and existing algorithms for learning the Nash equilibrium in weakly monotone GMFGs were inefficient (continuous-time only, or requiring extra conditions for discrete time).
Method: First establish the existence of a Nash equilibrium (NE) for any λ-regularized GMFG (λ ≥ 0); then design a discrete-time algorithm and derive its convergence rate solely under weakly monotone conditions, while also developing and analyzing an action-value function estimation procedure for the online learning process.
Results: Empirical evaluations corroborate that the designed algorithm is efficient and effectively learns the NE in weakly monotone GMFGs.

This paper studies two fundamental problems in regularized Graphon Mean-Field Games (GMFGs). First, we establish the existence of a Nash Equilibrium (NE) of any $\lambda$-regularized GMFG (for $\lambda\geq 0$). This result relies on weaker conditions than previous works analyzing both unregularized GMFGs ($\lambda=0$) and $\lambda$-regularized MFGs, which are special cases of GMFGs. Second, we propose provably efficient algorithms to learn the NE in weakly monotone GMFGs, motivated by Lasry and Lions (2007). Previous literature either only analyzed continuous-time algorithms or required extra conditions to analyze discrete-time algorithms. In contrast, we design a discrete-time algorithm and derive its convergence rate solely under weakly monotone conditions. Furthermore, we develop and analyze the action-value function estimation procedure during the online learning process, which is absent from algorithms for monotone GMFGs. This serves as a sub-module in our optimization algorithm. The efficiency of the designed algorithm is corroborated by empirical evaluations.

Multitask Learning with No Regret: from Improved Confidence Bounds to Active Learning
Pier Giuseppe Sessa Pierre Laforgue Nicolò Cesa-Bianchi Andreas Krause



Research question: How to quantify uncertainty in multitask learning, particularly in the agnostic setting where neither task similarity nor task features are available.
Motivation: Quantifying uncertainty in the estimated tasks is of pivotal importance for many downstream applications, such as online or active learning.
Method: Propose novel confidence intervals for multitask regression in the challenging agnostic setting; the intervals do not require i.i.d. data and can be directly applied to bound the regret in online learning.
Results: Through a refined analysis of the multitask information gain, new regret guarantees are obtained that, depending on a task-similarity parameter, can significantly improve over treating tasks independently; a new online learning algorithm achieves this improved regret without knowing the parameter in advance, automatically adapting to task similarity. The bounds and algorithms are further validated on synthetic and real-world (drug discovery) data.

Multitask learning is a powerful framework that enables one to simultaneously learn multiple related tasks by sharing information between them. Quantifying uncertainty in the estimated tasks is of pivotal importance for many downstream applications, such as online or active learning. In this work, we provide novel confidence intervals for multitask regression in the challenging agnostic setting, i.e., when neither the similarity between tasks nor the tasks' features are available to the learner. The obtained intervals do not require i.i.d. data and can be directly applied to bound the regret in online learning. Through a refined analysis of the multitask information gain, we obtain new regret guarantees that, depending on a task similarity parameter, can significantly improve over treating tasks independently. We further propose a novel online learning algorithm that achieves such improved regret without knowing this parameter in advance, i.e., automatically adapting to task similarity. As a second key application of our results, we introduce a novel multitask active learning setup where several tasks must be simultaneously optimized, but only one of them can be queried for feedback by the learner at each round. For this problem, we design a no-regret algorithm that uses our confidence intervals to decide which task should be queried. Finally, we empirically validate our bounds and algorithms on synthetic and real-world (drug discovery) data.

Local Convergence of Gradient Methods for Min-Max Games: Partial Curvature Generically Suffices
Guillaume Wang Lénaïc Chizat



Research question: This paper studies the convergence of gradient methods to local Nash equilibria in two-player zero-sum differentiable games.
Motivation: In the continuous-time setting, these dynamics are known to converge when the symmetric part of the Jacobian (the "potential" component of the game) is positive definite and may diverge when it vanishes; the authors show that the dynamics also converge as soon as the symmetric part is nonzero and the eigenvectors of the antisymmetric part are in general position with respect to the kernel of the symmetric part.
Method: The authors further study the convergence rate when the symmetric part is much smaller than the antisymmetric part, proving that it typically depends on the average of the eigenvalues of the symmetric part rather than the minimum.
Results: To illustrate the results, the authors consider computing mixed Nash equilibria of continuous games: thanks to partial curvature, conic particle methods, which optimize over both weights and supports of the mixed strategies, generically converge faster than fixed-support methods. For min-max games it is thus beneficial to add degrees of freedom "with curvature", which can be interpreted as yet another benefit of over-parameterization.

We study the convergence to local Nash equilibria of gradient methods for two-player zero-sum differentiable games. It is well-known that, in the continuous-time setting, such dynamics converge locally when $S \succ 0$ and may diverge when $S=0$, where $S\succeq 0$ is the symmetric part of the Jacobian at equilibrium that accounts for the "potential" component of the game. We show that these dynamics also converge as soon as $S$ is nonzero (*partial curvature*) and the eigenvectors of the antisymmetric part $A$ are in general position with respect to the kernel of $S$. We then study the convergence rate when $S \ll A$ and prove that it typically depends on the *average* of the eigenvalues of $S$, instead of the minimum as an analogy with minimization problems would suggest. To illustrate our results, we consider the problem of computing mixed Nash equilibria of continuous games. We show that, thanks to partial curvature, conic particle methods -- which optimize over both weights and supports of the mixed strategies -- generically converge faster than fixed-support methods. For min-max games, it is thus beneficial to add degrees of freedom "with curvature": this can be interpreted as yet another benefit of over-parameterization.

Nash Regret Guarantees for Linear Bandits
Ayush Sawarni Soumyabrata Pal Siddharth Barman



Research question: This paper addresses a strengthened notion of regret, Nash regret, in the stochastic linear bandits framework.
Motivation: Since the geometric mean corresponds to the well-studied Nash social welfare (NSW) function, this formulation quantifies a bandit algorithm's performance as the collective welfare it generates across rounds; NSW is known to satisfy fairness axioms, so an upper bound on Nash regret provides a principled fairness guarantee.
Method: Consider stochastic linear bandits over a horizon of $\mathsf{T}$ rounds with a set of arms ${\cal X}$, where the stochastic reward associated with each arm in ${\cal X}$ is a non-negative, sub-Poisson random variable; develop an algorithm achieving a Nash regret of $O\left( \sqrt{\frac{d}{\mathsf{T}}} \log(\mathsf{T} |{\cal X}|)\right)$, and, for arm sets ${\cal X}$ that are not necessarily finite, a Nash regret upper bound of $O\left( \frac{d^{\frac{5}{4}}}{\sqrt{\mathsf{T}}} \log(\mathsf{T})\right)$.
Results: Since bounded random variables are sub-Poisson, these results hold for bounded, non-negative rewards. The linear bandit algorithm builds on the successive elimination method with novel technical insights, including tailored concentration bounds and sampling via the John ellipsoid in conjunction with the Kiefer–Wolfowitz optimal design.

We obtain essentially tight upper bounds for a strengthened notion of regret in the stochastic linear bandits framework. The strengthening---referred to as Nash regret---is defined as the difference between the (a priori unknown) optimum and the geometric mean of expected rewards accumulated by the linear bandit algorithm. Since the geometric mean corresponds to the well-studied Nash social welfare (NSW) function, this formulation quantifies the performance of a bandit algorithm as the collective welfare it generates across rounds. NSW is known to satisfy fairness axioms and, hence, an upper bound on Nash regret provides a principled fairness guarantee. We consider the stochastic linear bandits problem over a horizon of $\mathsf{T}$ rounds and with a set of arms ${\cal X}$ in ambient dimension $d$. Furthermore, we focus on settings in which the stochastic reward---associated with each arm in ${\cal X}$---is a non-negative, sub-Poisson random variable. For this setting, we develop an algorithm that achieves a Nash regret of $O\left( \sqrt{\frac{d}{\mathsf{T}}} \log(\mathsf{T} |{\cal X}|)\right)$. In addition, addressing linear bandit instances in which the set of arms ${\cal X}$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left( \frac{d^\frac{5}{4}}{\sqrt{\mathsf{T}}} \log(\mathsf{T})\right)$. Since bounded random variables are sub-Poisson, these results hold for bounded, non-negative rewards. Our linear bandit algorithm is built upon the successive elimination method with novel technical insights, including tailored concentration bounds and the use of sampling via John ellipsoid in conjunction with the Kiefer–Wolfowitz optimal design.

Federated Learning with Client Subsampling, Data Heterogeneity, and Unbounded Smoothness: A New Algorithm and Lower Bounds
Michael Crawshaw Yajie Bao Mingrui Liu



Research question: This paper studies Federated Learning (FL) under client subsampling and data heterogeneity, with an objective function whose smoothness is potentially unbounded.
Motivation: Empirical evidence shows that the class of relaxed smooth functions, where the Lipschitz constant of the gradient scales linearly with the gradient norm, closely resembles the loss functions of certain neural networks, such as recurrent neural networks with possibly exploding gradients.
Method: We introduce EPISODE++, the first algorithm to solve this problem; it maintains historical statistics for each client to construct control variates and to decide the clipping behavior of the clients sampled in the current round.
Results: EPISODE++ achieves linear speedup in the number of participating clients, reduced communication rounds, and resilience to data heterogeneity. A lower bound further shows that, in a special case, the convergence rate of clipped minibatch SGD suffers an explicit dependence on the maximum gradient norm of the objective in a sublevel set, which may be large.

We study the problem of Federated Learning (FL) under client subsampling and data heterogeneity with an objective function that has potentially unbounded smoothness. This problem is motivated by empirical evidence that the class of relaxed smooth functions, where the Lipschitz constant of the gradient scales linearly with the gradient norm, closely resembles the loss functions of certain neural networks such as recurrent neural networks (RNNs) with possibly exploding gradient. We introduce EPISODE++, the first algorithm to solve this problem. It maintains historical statistics for each client to construct control variates and decide clipping behavior for sampled clients in the current round. We prove that EPISODE++ achieves linear speedup in the number of participating clients, reduced communication rounds, and resilience to data heterogeneity. Our upper bound proof relies on novel techniques of recursively bounding the client updates under unbounded smoothness and client subsampling, together with a refined high probability analysis. In addition, we prove a lower bound showing that the convergence rate of a special case of clipped minibatch SGD (without randomness in the stochastic gradient and with randomness in client subsampling) suffers from an explicit dependence on the maximum gradient norm of the objective in a sublevel set, which may be large. This effectively demonstrates that applying gradient clipping to minibatch SGD in our setting does not eliminate the problem of exploding gradients. Our lower bound is based on new constructions of hard instances tailored to client subsampling and a novel analysis of the trajectory of the algorithm in the presence of clipping. Lastly, we provide an experimental evaluation of EPISODE++ when training RNNs on federated text classification tasks, demonstrating that EPISODE++ outperforms strong baselines in FL. The code is available at https://github.com/MingruiLiu-ML-Lab/episode_plusplus.

Characterization of Overfitting in Robust Multiclass Classification
Jingyuan Xu Weiwei Liu



Research question: In multiclass classification, given the number of classes m, the number of robust accuracy queries k, and the number of test examples n, how much can adaptive algorithms robustly overfit the test dataset?
Motivation: Quantifying adaptive robust overfitting characterizes how far repeated robust-accuracy queries against a fixed test set can be exploited.
Method: Solve the problem by equivalently giving near-matching upper and lower bounds on the robust overfitting bias in multiclass classification problems.
Results: Together, the near-matching upper and lower bounds settle the question posed above.

This paper considers the following question: Given the number of classes m, the number of robust accuracy queries k, and the number of test examples in the dataset n, how much can adaptive algorithms robustly overfit the test dataset? We solve this problem by equivalently giving near-matching upper and lower bounds of the robust overfitting bias in multiclass classification problems.

Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds
Taira Tsuchiya Shinji Ito Junya Honda



Research question: This paper develops a generic adaptive learning rate for FTRL, called the stability-penalty-adaptive (SPA) learning rate, to further generalize FTRL's adaptivity in bandit problems.
Motivation: Existing sparse multi-armed bandit algorithms assume the sparsity level $s \leq k$ is known in advance, which is often not the case in real-world scenarios.
Method: Leveraging the SPA learning rate and the technique for $s$-agnostic algorithms, combined with a new analysis bounding the variation in FTRL output in response to changes in a regularizer, the paper establishes the first best-of-both-worlds (BOBW) algorithm with a sparsity-dependent bound.
Results: The $s$-agnostic algorithms attain $\tilde{O}(\sqrt{sT})$ regret in the adversarial regime, matching the existing lower bound up to a logarithmic factor, and the new BOBW algorithm achieves near-optimal regret in both the stochastic and adversarial regimes; the SPA framework also yields a game-dependent bound and BOBW simultaneously in partial monitoring.

Adaptivity to the difficulties of a problem is a key property in sequential decision-making problems to broaden the applicability of algorithms. Follow-the-regularized-leader (FTRL) has recently emerged as one of the most promising approaches for obtaining various types of adaptivity in bandit problems. Aiming to further generalize this adaptivity, we develop a generic adaptive learning rate, called stability-penalty-adaptive (SPA) learning rate for FTRL. This learning rate yields a regret bound jointly depending on stability and penalty of the algorithm, into which the regret of FTRL is typically decomposed. With this result, we establish several algorithms with three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds (BOBW). Despite the fact that sparsity appears frequently in real problems, existing sparse multi-armed bandit algorithms with $k$-arms assume that the sparsity level $s \leq k$ is known in advance, which is often not the case in real-world scenarios. To address this issue, we first establish $s$-agnostic algorithms with regret bounds of $\tilde{O}(\sqrt{sT})$ in the adversarial regime for $T$ rounds, which matches the existing lower bound up to a logarithmic factor. Meanwhile, BOBW algorithms aim to achieve a near-optimal regret in both the stochastic and adversarial regimes. Leveraging the SPA learning rate and the technique for $s$-agnostic algorithms combined with a new analysis to bound the variation in FTRL output in response to changes in a regularizer, we establish the first BOBW algorithm with a sparsity-dependent bound. Additionally, we explore partial monitoring and demonstrate that the proposed SPA learning rate framework allows us to achieve a game-dependent bound and the BOBW simultaneously.

Computing Optimal Nash Equilibria in Multiplayer Games
Youzhi Zhang Bo An Venkatramanan Siva Subrahmanian



Research question: Designing efficient algorithms to compute a Nash equilibrium (NE) in multiplayer games remains an open challenge.
Motivation: Finding an optimal NE that optimizes a given objective function in multiplayer games can be formulated as a mixed-integer bilinear program, but the auxiliary variables introduced create a huge number of bilinear terms, making the program hard to solve.
Method: We first propose a general framework based on a set of correlation plans, then develop a novel algorithm, CRM, which uses correlation plans and their relations to strictly reduce the feasible solution space after the convex relaxation of bilinear terms, while minimizing the number of correlation plans to significantly reduce the number of bilinear terms.
Results: Our techniques significantly reduce the time complexity, and CRM is several orders of magnitude faster than the state-of-the-art baseline.

Designing efficient algorithms to compute a Nash Equilibrium (NE) in multiplayer games is still an open challenge. In this paper, we focus on computing an NE that optimizes a given objective function. For example, when there is a team of players independently playing against an adversary in a game (e.g., several groups in a forest trying to interdict illegal loggers in green security games), these team members may need to find an NE minimizing the adversary’s utility. Finding an optimal NE in multiplayer games can be formulated as a mixed-integer bilinear program by introducing auxiliary variables to represent bilinear terms, leading to a huge number of bilinear terms, making it hard to solve. To overcome this challenge, we first propose a general framework for this formulation based on a set of correlation plans. We then develop a novel algorithm called CRM based on this framework, which uses correlation plans with their relations to strictly reduce the feasible solution space after the convex relaxation of bilinear terms while minimizing the number of correlation plans to significantly reduce the number of bilinear terms. We show that our techniques can significantly reduce the time complexity and CRM can be several orders of magnitude faster than the state-of-the-art baseline.

An Optimal Structured Zeroth-order Algorithm for Non-smooth Optimization
Marco Rando Cesare Molinari Lorenzo Rosasco Silvia Villa



Research question: This paper addresses black-box optimization, particularly in the non-smooth setting.
Motivation: Differentiability and smoothness assumptions cannot be verified in practice, so an algorithm is needed that approximates the gradient of the target function without them.
Method: The paper proposes O-ZD, a structured finite-difference algorithm for non-smooth black-box optimization; the method exploits a smooth approximation of the target function and provably approximates its gradient on a set of random orthogonal directions.
Results: O-ZD attains the optimal complexity for non-smooth convex functions and characterizes the number of iterations needed to bound the expected norm of the smoothed gradient in the non-smooth non-convex setting; numerical simulations under the stated assumptions show very good practical performance.

Finite-difference methods are a class of algorithms designed to solve black-box optimization problems by approximating a gradient of the target function on a set of directions. In black-box optimization, the non-smooth setting is particularly relevant since, in practice, differentiability and smoothness assumptions cannot be verified. To cope with non-smoothness, several authors use a smooth approximation of the target function and show that finite-difference methods approximate its gradient. Recently, it has been proved that imposing a structure on the directions allows one to improve performance. However, only the smooth setting was considered. To close this gap, we introduce and analyze O-ZD, the first structured finite-difference algorithm for non-smooth black-box optimization. Our method exploits a smooth approximation of the target function and we prove that it approximates its gradient on a subset of random {\em orthogonal} directions. We analyze the convergence of O-ZD under different assumptions. For non-smooth convex functions, we obtain the optimal complexity. In the non-smooth non-convex setting, we characterize the number of iterations needed to bound the expected norm of the smoothed gradient. For smooth functions, our analysis recovers existing results for structured zeroth-order methods for the convex case and extends them to the non-convex setting. We conclude with numerical simulations where assumptions are satisfied, observing that our algorithm has very good practical performance.
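
A minimal sketch of the structured (orthogonal-direction) finite-difference idea, assuming NumPy: the surrogate gradient averages forward differences along l random orthonormal directions obtained from a QR decomposition. The step sizes, smoothing parameter, and target are illustrative, not the paper's exact scheme.

import numpy as np

def orthogonal_fd_grad(f, x, h, l, rng):
    # Average forward differences along l random *orthogonal*
    # directions (columns of Q), scaled by d / l.
    d = len(x)
    Q, _ = np.linalg.qr(rng.standard_normal((d, l)))
    g = np.zeros(d)
    fx = f(x)
    for i in range(l):
        p = Q[:, i]
        g += (f(x + h * p) - fx) / h * p
    return (d / l) * g

# Zeroth-order descent on a non-smooth target, f(x) = ||x||_1.
rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum()
x = rng.standard_normal(20)
for t in range(500):
    x -= 0.05 / np.sqrt(t + 1) * orthogonal_fd_grad(f, x, h=1e-3, l=5, rng=rng)
print(f(x))  # shrinks markedly from its starting value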

Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback
Canzhe Zhao Ruofeng Yang Baoxiang Wang Xuezhou Zhang Shuai Li



Research question: This work studies low-rank MDPs with adversarially changing losses in the full-information feedback setting.
Motivation: The unknown transition probability kernel admits a low-rank matrix decomposition, while the loss functions may change adversarially but are revealed to the learner at the end of each episode.
Method: A policy-optimization-based algorithm, POLO, is proposed.
Results: POLO is proven to attain an $\widetilde{O}(dA^{\frac{1}{2}}K^{\frac{3}{4}}\ln^{\frac{1}{4}}M/(1-\gamma)^2)$ regret guarantee, where $d$ is the rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $\gamma$ is the discount factor.

In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(dA^{\frac{1}{2}}K^{\frac{3}{4}}\ln^{\frac{1}{4}}M/(1-\gamma)^2)$ regret guarantee, where $d$ is the rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $\gamma$ is the discount factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of potentially arbitrarily large state space. Furthermore, we also prove an $\Omega(\frac{\gamma^2}{1-\gamma} \sqrt{d A K})$ regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting. To the best of our knowledge, we present the first algorithm that interleaves representation learning, exploration, and exploitation to achieve the sublinear regret guarantee for RL with nonlinear function approximation and adversarial losses.

Corruption-Robust Offline Reinforcement Learning with General Function Approximation
Chenlu Ye Rui Yang Quanquan Gu Tong Zhang



Research question: In offline reinforcement learning with general function approximation, where an adversary can corrupt each sample in the dataset, how to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy.
Motivation: Since the adversary can corrupt the offline dataset, a learning procedure that withstands this corruption is needed.
Method: Drawing on the uncertainty-weighting technique from robust online RL, the paper designs a new uncertainty-weight iteration procedure that can be computed efficiently on batched samples, and proposes a corruption-robust algorithm for offline RL.
Results: Under single-policy coverage and knowledge of the corruption level, the proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal O(\zeta \cdot (\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1})$ due to the corruption.

We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta\geq0$ quantifies the cumulative corruption amount over $n$ episodes and $H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs). Drawing inspiration from the uncertainty-weighting technique from the robust online RL setting \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight iteration procedure to efficiently compute on batched samples and propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal O(\zeta \cdot (\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1})$ due to the corruption. Here $\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H)$ is the coverage coefficient that depends on the regularization parameter $\lambda$, the confidence set $\hat{\mathcal F}$, and the dataset $\mathcal Z_n^H$, and $C(\hat{\mathcal F},\mu)$ is a coefficient that depends on $\hat{\mathcal F}$ and the underlying data distribution $\mu$. When specialized to linear MDPs, the corruption-dependent error term reduces to $\mathcal O(\zeta d n^{-1})$ with $d$ being the dimension of the feature map, which matches the existing lower bound for corrupted linear MDPs. This suggests that our analysis is tight in terms of the corruption-dependent term.

Failure-Aware Gaussian Process Optimization with Regret Bounds
Shogo Iwazaki Shion Takeno Tomohiko Tanabe Mitsuru Irie



Research question: Solve real-world black-box optimization problems where the objective value is obtained only on a successful observation, a failure yields just the fact of failure, and the failure region may be shaped by several latent constraints whose number is unknown.
Motivation: For this problem, the paper proposes a failure-aware Gaussian process upper confidence bound (F-GP-UCB) method, which requires only a mild assumption on observation failure, namely that an optimal solution lies in the interior of the feasible region.
Method: By showing that the number of successful observations grows linearly, the paper provides the first regret upper bounds and the convergence of F-GP-UCB.
Results: The effectiveness of F-GP-UCB is demonstrated on several benchmark functions, including a simulation function motivated by material-synthesis experiments.

Real-world optimization problems often require black-box optimization with observation failure, where we can obtain the objective function value if we succeed, otherwise, we can only obtain a fact of failure. Moreover, this failure region can be made complex by several latent constraints, whose number is also unknown. For this problem, we propose a failure-aware Gaussian process upper confidence bound (F-GP-UCB), which only requires a mild assumption for the observation failure that an optimal solution lies in the interior of the feasible region. Furthermore, we show that the number of successful observations grows linearly, by which we provide the first regret upper bounds and the convergence of F-GP-UCB. We demonstrate the effectiveness of F-GP-UCB on several benchmark functions, including the simulation function motivated by material synthesis experiments.

Active Bipartite Ranking
James Cheshire Vincent Laurent Stephan Clémençon



Research question: This paper develops an active learning framework for the bipartite ranking problem.
Motivation: Bipartite ranking arises in many applications, such as supervised anomaly detection, credit scoring, and the design of medical diagnosis support systems. While the passive setting has been studied extensively, active bipartite ranking is poorly documented in the literature; due to the global nature of the problem, a strategy is needed for sequentially labeling data points that are difficult to rank relative to the others, a learning task more complex than binary classification, for which many active algorithms exist. The goal is to provide a rigorous formulation of such a selective sampling approach.
Method: We propose a dedicated algorithm, active-rank, which aims to minimize the sup-norm distance between the ROC curve of the ranking function built and the optimal one.
Results: Theoretical analysis shows that, for a fixed confidence level and probability, active-rank is PAC(ε, δ); a problem-dependent upper bound on its expected sampling time is provided, along with a problem-dependent lower bound on the expected sampling time of any PAC(ε, δ) algorithm. Numerical results provide strong empirical evidence for the performance of the algorithm, which compares favorably with more naive approaches.

In this paper, we develop an active learning framework for the bipartite ranking problem. Motivated by numerous applications, ranging from supervised anomaly detection to credit-scoring through the design of medical diagnosis support systems, and usually formulated as the problem of optimizing (a scalar summary of) the ROC curve, bipartite ranking has been the subject of much attention in the passive context. Various dedicated algorithms have been recently proposed and studied by the machine-learning community. In contrast, active bipartite ranking rules are poorly documented in the literature. Due to its global nature, a strategy for labeling sequentially data points that are difficult to rank w.r.t. the others is required. This learning task is much more complex than binary classification, for which many active algorithms have been designed. It is the goal of this article to provide a rigorous formulation of such a selective sampling approach. We propose a dedicated algorithm, referred to as active-rank, which aims to minimise the distance between the ROC curve of the ranking function built and the optimal one, w.r.t. the sup norm. We show that, for a fixed confidence level $\epsilon$ and probability $\delta$, active-rank is PAC$(\epsilon,\delta)$. In addition, we provide a problem dependent upper bound on the expected sampling time of active-rank and also demonstrate a problem dependent lower bound on the expected sampling time of any PAC$(\epsilon,\delta)$ algorithm. Beyond the theoretical analysis carried out, numerical results are presented, providing strong empirical evidence of the performance of the algorithm proposed, which compares favorably with more naive approaches.

Accelerated Zeroth-order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance
Nikita Kornilov Ohad Shamir Aleksandr Lobanov Darina Dvinskikh Alexander Gasnikov Innokentiy Andreevich Shibaev Eduard Gorbunov Samuel Horváth



Research question: This paper studies non-smooth stochastic convex optimization with two function evaluations per round under infinite-variance noise.
Motivation: In the classical setting of finite-variance noise, an optimal algorithm built upon the batched accelerated gradient method exists (Gasnikov et al., 2022), but the finite-variance assumption is burdensome and may fail in many practical scenarios.
Method: The paper adapts a refined clipped version of the accelerated gradient (Stochastic Similar Triangles) method of (Sadiev et al., 2023) to a two-point zeroth-order oracle; the adaptation entails extending the batching technique to accommodate infinite variance, a non-trivial task.
Results: The resulting method provides guarantees for the two-point zeroth-order setting under infinite-variance noise, and the extension of batching to infinite variance stands as a distinct contribution of the paper.

In this paper, we consider non-smooth stochastic convex optimization with two function evaluations per round under infinite noise variance. In the classical setting when noise has finite variance, an optimal algorithm, built upon the batched accelerated gradient method, was proposed in (Gasnikov et. al., 2022). This optimality is defined in terms of iteration and oracle complexity, as well as the maximal admissible level of adversarial noise. However, the assumption of finite variance is burdensome and it might not hold in many practical scenarios. To address this, we demonstrate how to adapt a refined clipped version of the accelerated gradient (Stochastic Similar Triangles) method from (Sadiev et al., 2023) for a two-point zero-order oracle. This adaptation entails extending the batching technique to accommodate infinite variance — a non-trivial task that stands as a distinct contribution of this paper.
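
To illustrate the role of clipping under heavy tails, here is a minimal sketch of a clipped two-point zeroth-order estimator plugged into plain SGD, assuming NumPy; this is not the Stochastic Similar Triangles method itself, and all constants are illustrative.

import numpy as np

def clipped_two_point_grad(f, x, h, clip_level, rng):
    # Two-point estimate along a random unit direction, then clipping;
    # clipping is what tames heavy-tailed (possibly infinite-variance)
    # evaluation noise.
    d = len(x)
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)
    g = d * (f(x + h * e) - f(x - h * e)) / (2 * h) * e
    nrm = np.linalg.norm(g)
    return g if nrm <= clip_level else g * (clip_level / nrm)

# Heavy-tailed noise: the standard Cauchy has infinite variance.
rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum() + 0.01 * rng.standard_cauchy()
x = np.ones(10)
for t in range(2000):
    x -= 0.02 / np.sqrt(t + 1) * clipped_two_point_grad(f, x, 1e-2, 50.0, rng)
print(np.abs(x).sum())  # small compared to the starting value of 10.0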

Optimal approximation using complex-valued neural networks
Paul Geuchen Felix Voigtlaender



Research question: This paper analyzes the expressivity of complex-valued neural networks (CVNNs) by studying their approximation properties.
Motivation: While the overwhelming success of deep learning in the real-valued case is supported by a growing mathematical foundation, such a foundation is still largely lacking in the complex-valued case.
Method: By studying the activation functions, the paper derives the first quantitative approximation bounds for CVNNs that apply to a wide class of activations, including the popular modReLU and complex cardioid activation functions.
Results: The approximation error scales as $m^{-k/(2n)}$ as the number of neurons $m$ tends to infinity, where $k$ is the smoothness of the target function and $n$ is the (complex) input dimension; moreover, approximating $C^k$-functions with continuous approximation methods unavoidably suffers from the curse of dimensionality.

Complex-valued neural networks (CVNNs) have recently shown promising empirical success, for instance for increasing the stability of recurrent neural networks and for improving the performance in tasks with complex-valued inputs, such as MRI fingerprinting. While the overwhelming success of Deep Learning in the real-valued case is supported by a growing mathematical foundation, such a foundation is still largely lacking in the complex-valued case. We thus analyze the expressivity of CVNNs by studying their approximation properties. Our results yield the first quantitative approximation bounds for CVNNs that apply to a wide class of activation functions including the popular modReLU and complex cardioid activation functions. Precisely, our results apply to any activation function that is smooth but not polyharmonic on some non-empty open set; this is the natural generalization of the class of smooth and non-polynomial activation functions to the complex setting. Our main result shows that the approximation error scales as $m^{-k/(2n)}$ for $m \to \infty$ where $m$ is the number of neurons, $k$ the smoothness of the target function and $n$ is the (complex) input dimension. Under a natural continuity assumption, we show that this rate is optimal; we further discuss the optimality when dropping this assumption. Moreover, we prove that the problem of approximating $C^k$-functions using continuous approximation methods unavoidably suffers from the curse of dimensionality.
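
For reference, minimal NumPy sketches of the two activations named above; the definitions are the standard ones from the literature (modReLU shifts the modulus and keeps the phase, the complex cardioid scales by the cosine of the phase and reduces to ReLU on the real line).

import numpy as np

def modrelu(z, b):
    # modReLU: ReLU applied to the modulus with bias b, phase preserved;
    # output is zero whenever |z| + b <= 0.
    r = np.abs(z)
    scale = np.maximum(r + b, 0.0) / np.maximum(r, 1e-12)
    return scale * z

def cardioid(z):
    # Complex cardioid: scales z by (1 + cos(arg z)) / 2, preserving
    # phase; on the real line this reduces to ReLU.
    return 0.5 * (1.0 + np.cos(np.angle(z))) * z

z = np.array([1 + 1j, 0.1 - 0.2j])
print(modrelu(z, b=-0.5), cardioid(z))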

Nearest Neighbour with Bandit Feedback
Stephen Pasteris Chris Hicks Vasilios Mavroudis



Research question: This paper adapts the nearest neighbour rule to the contextual bandit problem.
Motivation: In the fully adversarial setting, where no assumptions at all are made about the data-generation process, an efficient algorithm for this problem is needed.
Method: Combined with a sufficiently fast (perhaps approximate) adaptive nearest-neighbour search data structure, such as a navigating net, the algorithm is extremely efficient: its per-trial running time is polylogarithmic in both the number of trials and the number of actions, and it takes only quasi-linear space.
Results: Generic regret bounds are given and further analyzed for the stochastic bandit problem in Euclidean space; moreover, when applied to online classification with stochastic labels, the algorithm can, under certain conditions, achieve sublinear regret while finding only a single nearest neighbour per trial, in stark contrast to the k-nearest neighbours algorithm.

In this paper we adapt the nearest neighbour rule to the contextual bandit problem. Our algorithm handles the fully adversarial setting in which no assumptions at all are made about the data-generation process. When combined with a sufficiently fast data-structure for (perhaps approximate) adaptive nearest neighbour search, such as a navigating net, our algorithm is extremely efficient - having a per trial running time polylogarithmic in both the number of trials and actions, and taking only quasi-linear space. We give generic regret bounds for our algorithm and further analyse them when applied to the stochastic bandit problem in Euclidean space. A side result of this paper is that, when applied to the online classification problem with stochastic labels, our algorithm can, under certain conditions, have sublinear regret whilst only finding a single nearest neighbour per trial - in stark contrast to the k-nearest neighbours algorithm.

Adversarial Attacks on Online Learning to Rank with Click Feedback
Jinhang Zuo Zhiyao Zhang Zhiyong Wang Shuai Li Mohammad Hajiesmaili Adam Wierman



Research question: Online learning to rank (OLTR) algorithms can be attacked, causing real losses, yet knowledge about adversarial attacks on OLTR is limited.
Motivation: Study attack strategies against multiple OLTR variants to expose their vulnerabilities and inform more robust designs.
Method: First attack the UCB algorithm on classical stochastic bandits with binary feedback, then design attack algorithms against UCB-based OLTR in position-based and cascade models, and finally propose a general attack strategy applicable to any algorithm under the general click model.
Results: Experiments show the proposed attacks effectively manipulate the learning agent into choosing the target item while keeping the cumulative cost controlled.

Online learning to rank (OLTR) is a sequential decision-making problem where a learning agent selects an ordered list of items and receives feedback through user clicks. Although potential attacks against OLTR algorithms may cause serious losses in real-world applications, there is limited knowledge about adversarial attacks on OLTR. This paper studies attack strategies against multiple variants of OLTR. Our first result provides an attack strategy against the UCB algorithm on classical stochastic bandits with binary feedback, which solves the key issues caused by bounded and discrete feedback that previous works cannot handle. Building on this result, we design attack algorithms against UCB-based OLTR algorithms in position-based and cascade models. Finally, we propose a general attack strategy against any algorithm under the general click model. Each attack algorithm manipulates the learning agent into choosing the target attack item $T-o(T)$ times, incurring a cumulative cost of $o(T)$. Experiments on synthetic and real data further validate the effectiveness of our proposed attack algorithms.

A Guide Through the Zoo of Biased SGD
Yury Demidovich Grigory Malinovsky Igor Sokolov Peter Richtárik



Research question: While stochastic gradient descent (SGD) with unbiased gradient estimators has been studied extensively, SGD variants relying on biased estimators have received far less attention.
Motivation: Interest in biased-estimator SGD has grown in recent years, but the existing literature lacks coherence: each new paper relies on a different set of assumptions, with no clear understanding of how they connect, which can cause confusion.
Method: The paper establishes connections among the existing assumptions and presents a comprehensive map of their underlying relationships; it further introduces a new set of assumptions, provably weaker than all previous ones, and uses it to give a thorough analysis of biased SGD in both convex and non-convex settings, improving on previous results.
Results: Experimental results validate the theory and demonstrate the effectiveness of the framework.

Stochastic Gradient Descent (SGD) is arguably the most important single algorithm in modern machine learning. Although SGD with unbiased gradient estimators has been studied extensively over at least half a century, SGD variants relying on biased estimators are rare. Nevertheless, there has been an increased interest in this topic in recent years. However, existing literature on SGD with biased estimators lacks coherence since each new paper relies on a different set of assumptions, without any clear understanding of how they are connected, which may lead to confusion. We address this gap by establishing connections among the existing assumptions, and presenting a comprehensive map of the underlying relationships. Additionally, we introduce a new set of assumptions that is provably weaker than all previous assumptions, and use it to present a thorough analysis of BiasedSGD in both convex and non-convex settings, offering advantages over previous results. We also provide examples where biased estimators outperform their unbiased counterparts or where unbiased versions are simply not available. Finally, we demonstrate the effectiveness of our framework through experimental results that validate our theoretical findings.
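
A minimal example of a biased gradient estimator of the kind such frameworks cover is top-k sparsification; the sketch below (our toy setup, not taken from the paper) runs SGD with it on a least-squares problem:

import numpy as np

def top_k(g, k):
    """Keep only the k largest-magnitude coordinates: a biased estimator."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
b = A @ rng.standard_normal(50)
x = np.zeros(50)
for t in range(2000):
    i = rng.integers(200)
    g = (A[i] @ x - b[i]) * A[i]        # unbiased per-sample gradient
    x -= 0.01 * top_k(g, k=10)          # sparsification makes the update biased
print(0.5 * np.mean((A @ x - b) ** 2))  # the loss still decreases markedly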

Boosting Adversarial Transferability by Achieving Flat Local Maxima
Zhijin Ge Xiaosen Wang Hongying Liu Fanhua Shang Yuanyuan Liu



Research question: How can the transferability of adversarial attacks be improved?
Motivation: Adversarial attacks are increasingly applied in the real world, and transferability is a key factor in their practicality.
Method: Introduce a gradient-norm penalty into the original loss so that adversarial examples lying in flat local regions transfer better, and propose an approximate optimization method that simplifies the gradient update of the objective to improve computational efficiency.
Results: Experiments show the method generates adversarial examples in flat local regions with good transferability and significantly improves adversarial transferability on the ImageNet-compatible dataset over state-of-the-art attacks.

Transfer-based attack adopts the adversarial examples generated on the surrogate model to attack various models, making it applicable in the physical world and attracting increasing interest. Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives. In this work, inspired by the observation that flat local minima are correlated with good generalization, we assume and empirically validate that adversarial examples at a flat local region tend to have good transferability by introducing a penalized gradient norm to the original loss function. Since directly optimizing the gradient regularization norm is computationally expensive and intractable for generating adversarial examples, we propose an approximation optimization method to simplify the gradient update of the objective function. Specifically, we randomly sample an example and adopt a first-order procedure to approximate the curvature of the second-order Hessian matrix, which makes computing more efficient by interpolating two Jacobian matrices. Meanwhile, in order to obtain a more stable gradient direction, we randomly sample multiple examples and average the gradients of these examples to reduce the variance due to random sampling during the iterative process. Extensive experimental results on the ImageNet-compatible dataset show that the proposed method can generate adversarial examples at flat local regions, and significantly improve the adversarial transferability on either normally trained models or adversarially trained models than the state-of-the-art attacks. Our codes are available at: https://github.com/Trustworthy-AI-Group/PGN.
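
The core update can be sketched as follows (a hedged PyTorch sketch of the idea only: parameter names and values are our assumptions, and the actual method averages gradients over several randomly sampled neighbors per step, whereas this sketch uses one):

import torch
import torch.nn.functional as F

def pgn_step(model, x, y, eps_iter, delta=0.5, alpha=1.6 / 255, zeta=3.0):
    """One PGN-style step: gradient at a random neighbor, a second gradient a
    small step ahead, then an interpolation of the two."""
    x_near = (x + torch.empty_like(x).uniform_(-zeta * eps_iter, zeta * eps_iter)
              ).detach().requires_grad_(True)
    g1 = torch.autograd.grad(F.cross_entropy(model(x_near), y), x_near)[0]
    x_ahead = (x_near + alpha * g1.sign()).detach().requires_grad_(True)
    g2 = torch.autograd.grad(F.cross_entropy(model(x_ahead), y), x_ahead)[0]
    g = (1.0 - delta) * g1 + delta * g2
    return (x + eps_iter * g.sign()).detach()

# toy usage on a random linear classifier
model = torch.nn.Linear(10, 3)
x, y = torch.randn(4, 10), torch.tensor([0, 1, 2, 0])
x_adv = x.clone()
for _ in range(10):
    x_adv = pgn_step(model, x_adv, y, eps_iter=0.01)
print((x_adv - x).abs().max())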

Towards Data-Algorithm Dependent Generalization: a Case Study on Overparameterized Linear Regression
Jing Xu Jiaye Teng Yang Yuan Andrew C Yao



Research question: A major open problem in machine learning is characterizing generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression.
Motivation: This failure often stems from obscuring the crucial interplay between the training algorithm and the underlying data distribution.
Method: The paper introduces a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory rather than the traditional last iterate.
Results: Studying overparameterized linear regression solved with gradient descent, the theory shows that if early-stopped iterates are taken into account, generalization holds under significantly weaker restrictions on the problem instance than previous last-iterate analyses.

One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression. In many scenarios, this failure can be attributed to obscuring the crucial interplay between the training algorithm and the underlying data distribution. This paper demonstrates that the generalization behavior of overparameterized models should be analyzed in both a data-relevant and an algorithm-relevant manner. To make a formal characterization, we introduce a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory, instead of traditional last-iterate analysis. We validate our claim by studying the setting of solving overparameterized linear regression with gradient descent. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in such a setting. Our theoretical results demonstrate that if we take early-stopped iterates into consideration, generalization can hold with significantly weaker restrictions on the problem instance than the previous last-iterate analysis.
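
The phenomenon is easy to reproduce in the setting the paper studies; the sketch below (our illustration) runs gradient descent on overparameterized linear regression and compares the best early-stopped iterate with the last one:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                                  # overparameterized: d >> n
theta = np.zeros(d)
theta[:5] = 1.0                                 # ground-truth signal
X = rng.standard_normal((n, d))
y = X @ theta + 0.5 * rng.standard_normal(n)

w = np.zeros(d)
risks = []
for t in range(5000):
    w -= 1e-3 * X.T @ (X @ w - y) / n               # full-batch gradient descent
    risks.append(float(np.sum((w - theta) ** 2)))   # excess risk for isotropic features
best = int(np.argmin(risks))
print(f"best early stop t={best}: {risks[best]:.3f}   last iterate: {risks[-1]:.3f}")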

Training Fully Connected Neural Networks is $\exists\mathbb{R}$-Complete
Daniel Bertschinger Christoph Hertrich Paul Jungeblut Tillmann Miltzow Simon Weber



Research question: Find the optimal weights and biases for a two-layer fully connected neural network to fit a given set of data points, i.e., empirical risk minimization.
Motivation: Pin down the exact computational complexity of training such networks to optimality, strengthening a result of Abrahamsen, Kleist and Miltzow [NeurIPS 2021].
Method: Show that the problem is $\exists\mathbb{R}$-complete, and that arbitrary algebraic numbers can be required as weights to train some instances to optimality, even when all data points are rational.
Results: The result already applies to fully connected instances with two inputs, two outputs, and one hidden layer of ReLU neurons; consequently, a combinatorial search algorithm like that of Arora, Basu, Mianjy and Mukherjee [ICLR 2018] is impossible for networks with more than one output dimension unless $\text{NP} = \exists\mathbb{R}$.

We consider the algorithmic problem of finding the optimal weights and biases for a two-layer fully connected neural network to fit a given set of data points, also known as empirical risk minimization. We show that the problem is $\exists\mathbb{R}$-complete. This complexity class can be defined as the set of algorithmic problems that are polynomial-time equivalent to finding real roots of a multivariate polynomial with integer coefficients. Furthermore, we show that arbitrary algebraic numbers are required as weights to be able to train some instances to optimality, even if all data points are rational. Our result already applies to fully connected instances with two inputs, two outputs, and one hidden layer of ReLU neurons. Thereby, we strengthen a result by Abrahamsen, Kleist and Miltzow [NeurIPS 2021]. A consequence of this is that a combinatorial search algorithm like the one by Arora, Basu, Mianjy and Mukherjee [ICLR 2018] is impossible for networks with more than one output dimension, unless $\text{NP} = \exists\mathbb{R}$.

Optimal cross-learning for contextual bandits with unknown context distributions
Jon Schneider Julian Zimmert



Research question: Design contextual bandit algorithms in the "cross-learning" setting of Balseiro et al., where the learner observes the loss of the played action in all possible contexts.
Motivation: Losses are chosen adversarially and contexts are sampled i.i.d. from an unknown distribution; resolving an open problem of Balseiro et al. requires regret bounds independent of the number of contexts.
Method: A novel technique coordinates the execution of a learning algorithm over multiple epochs so as to remove correlations between the estimation of the unknown distribution and the actions played by the algorithm.
Results: An efficient algorithm with a nearly tight (up to logarithmic factors) regret bound of $\widetilde{O}(\sqrt{TK})$, yielding the first nearly tight regret bounds for learning to bid in first-price auctions under unknown value distributions and for sleeping bandits with a stochastic action set.

We consider the problem of designing contextual bandit algorithms in the ``cross-learning'' setting of Balseiro et al., where the learner observes the loss for the action they play in all possible contexts, not just the context of the current round. We specifically consider the setting where losses are chosen adversarially and contexts are sampled i.i.d. from an unknown distribution. In this setting, we resolve an open problem of Balseiro et al. by providing an efficient algorithm with a nearly tight (up to logarithmic factors) regret bound of $\widetilde{O}(\sqrt{TK})$, independent of the number of contexts. As a consequence, we obtain the first nearly tight regret bounds for the problems of learning to bid in first-price auctions (under unknown value distributions) and sleeping bandits with a stochastic action set. At the core of our algorithm is a novel technique for coordinating the execution of a learning algorithm over multiple epochs in such a way to remove correlations between estimation of the unknown distribution and the actions played by the algorithm. This technique may be of independent interest for other learning problems involving estimation of an unknown context distribution.

Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing
Nived Rajaraman Fnu Devvrit Aryan Mokhtari Kannan Ramchandran



Research question: Understand, through theory, why the pruning + fine-tuning pipeline succeeds at reducing the complexity of trained models with massive numbers of parameters.
Motivation: Although pruning + fine-tuning has been extremely successful in lowering model complexity, little is known about the theory behind this success.
Method: Using overparameterized matrix sensing as a case study, the paper prunes and fine-tunes the model and studies approximate local minima of the mean square error augmented with a smooth group Lasso regularizer.
Results: Pruning all columns below an explicit $\ell_2$-norm threshold yields a solution with the minimum number of columns that is close to the ground truth; in the subsequent fine-tuning phase, gradient descent initialized at $U_{\text{prune}}$ converges at a linear rate to its limit. These results shed light on the role of regularization in pruning.

Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters. In fact, several practical studies have shown that if the pruned model is fine-tuned with some gradient-based updates it generalizes well to new samples. Although the above pipeline, which we refer to as pruning + fine-tuning, has been extremely successful in lowering the complexity of trained models, there is very little known about the theory behind this success. In this paper we address this issue by investigating the pruning + fine-tuning framework on the overparameterized matrix sensing problem with the ground truth denoted $U_\star \in \mathbb{R}^{d \times r}$ and the overparameterized model $U \in \mathbb{R}^{d \times k}$ with $k \gg r$. We study the approximate local minima of the mean square error, augmented with a smooth version of a group Lasso regularizer, $\sum_{i=1}^{k} \lVert Ue_i \rVert_2$. In particular, we provably show that pruning all the columns below a certain explicit $\ell_2$-norm threshold results in a solution $U_{\text{prune}}$ which has the minimum number of columns $r$, yet close to the ground truth in training loss. Moreover, in the subsequent fine-tuning phase, gradient descent initialized at $U_{\text{prune}}$ converges at a linear rate to its limit. While our analysis provides insights into the role of regularization in pruning, we also show that running gradient descent in the absence of regularization results in models which are not suitable for greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable to that of the maximum. Lastly, we show that our results also extend for the training and pruning of two-layer neural networks with quadratic activation functions. To the best of our knowledge, our results provide the first rigorous insights on why greedy pruning + fine-tuning leads to smaller models which also generalize well.
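
The pruning step itself is a one-liner; the sketch below (a toy setup of ours, with an arbitrary threshold rather than the paper's explicit one) drops every column of U whose l2 norm falls below a threshold:

import numpy as np

def prune_columns(U, tau):
    """Drop every column whose l2 norm is below the threshold tau."""
    return U[:, np.linalg.norm(U, axis=0) >= tau]

rng = np.random.default_rng(0)
d, r, k = 30, 3, 12                                 # overparameterized: k >> r
U_star = rng.standard_normal((d, r))                # ground-truth factor
spurious = 0.02 * rng.standard_normal((d, k - r))   # small extra columns
U = np.hstack([U_star + 0.05 * rng.standard_normal((d, r)), spurious])
U_pruned = prune_columns(U, tau=0.5)
print(U.shape, "->", U_pruned.shape)                # (30, 12) -> (30, 3)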

Riemannian stochastic optimization methods avoid strict saddle points
Ya-Ping Hsieh Mohammad Reza Karimi Jaghargh Andreas Krause Panayotis Mertikopoulos



Research question: Are stochastic Riemannian optimization algorithms on Riemannian manifolds guaranteed to avoid saddle points with probability 1?
Motivation: Many modern machine learning applications can be formulated as minimization problems on Riemannian manifolds, but the resulting problems are not geodesically convex, so convergence of the chosen solver to a desirable solution is not guaranteed.
Method: The paper studies a family of retraction-based methods which, besides potentially having much lower per-iteration cost than Riemannian gradient descent, includes other widely used algorithms such as natural policy gradient methods and mirror descent in ordinary convex spaces.
Results: Under mild assumptions on the ambient manifold and the oracle providing gradient information, the policies under study avoid strict saddle points/submanifolds with probability 1 from any initial condition. This provides an important sanity check for gradient methods on manifolds: almost always, the end state of a stochastic Riemannian algorithm can only be a local minimizer.

Many modern machine learning applications - from online principal component analysis to covariance matrix identification and dictionary learning - can be formulated as minimization problems on Riemannian manifolds, typically solved with a Riemannian stochastic gradient method (or some variant thereof). However, in many cases of interest, the resulting minimization problem is _not_ geodesically convex, so the convergence of the chosen solver to a desirable solution - i.e., a local minimizer - is by no means guaranteed. In this paper, we study precisely this question, that is, whether stochastic Riemannian optimization algorithms are guaranteed to avoid saddle points with probability $1$. For generality, we study a family of retraction-based methods which, in addition to having a potentially much lower per-iteration cost relative to Riemannian gradient descent, include other widely used algorithms, such as natural policy gradient methods and mirror descent in ordinary convex spaces. In this general setting, we show that, under mild assumptions for the ambient manifold and the oracle providing gradient information, the policies under study avoid strict saddle points / submanifolds with probability $1$, from any initial condition. This result provides an important sanity check for the use of gradient methods on manifolds as it shows that, almost always, the end state of a stochastic Riemannian algorithm can only be a local minimizer.
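
A retraction-based stochastic method of the kind studied is easy to sketch on the sphere (our toy example: Rayleigh-quotient maximization with a noisy matrix oracle and a metric-projection retraction):

import numpy as np

def retract_sphere(x, v):
    """Metric-projection retraction on the unit sphere: step, then renormalize."""
    y = x + v
    return y / np.linalg.norm(y)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
A = A + A.T                                   # symmetric objective matrix
x = rng.standard_normal(20)
x /= np.linalg.norm(x)
for t in range(1, 3001):
    A_noisy = A + 0.1 * rng.standard_normal(A.shape)   # stochastic gradient oracle
    g = 2.0 * A_noisy @ x
    g_tan = g - (x @ g) * x                   # project onto the tangent space
    x = retract_sphere(x, (1.0 / t) * g_tan)  # ascent step via retraction
print(x @ A @ x, np.linalg.eigvalsh(A)[-1])   # Rayleigh quotient near the top eigenvalue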

Exact recovery and Bregman hard clustering of node-attributed Stochastic Block Model
Maximilien Dreveton Felipe Schreiber Fernandes Daniel R. Figueiredo



Research question: How can network information (edges) and node information (attributes) be jointly leveraged to design high-performance clustering algorithms?
Motivation: In many scenarios node attributes are correlated and can also be used to identify node clusters, so both sources of information should be exploited jointly.
Method: Under a general model for the network and node attributes, the paper establishes an information-theoretic criterion for exact recovery of community labels, and proposes an iterative clustering algorithm that maximizes the joint likelihood, assuming the distributions of network interactions and node attributes belong to exponential families.
Results: Extensive numerical experiments on synthetic and real data show the proposed algorithm outperforms algorithms that use only network or only attribute information, as well as recently proposed algorithms that cluster using both sources. The work provides insight into the fundamental limits and practical techniques for inferring community labels on node-attributed networks.

Classic network clustering tackles the problem of identifying sets of nodes (communities) that have similar connection patterns. However, in many scenarios nodes also have attributes that are correlated and can also be used to identify node clusters. Thus, network information (edges) and node information (attributes) can be jointly leveraged to design high-performance clustering algorithms. Under a general model for the network and node attributes, this work establishes an information-theoretic criteria for the exact recovery of community labels and characterizes a phase transition determined by the Chernoff-Hellinger divergence of the model. The criteria shows how network and attribute information can be exchanged in order to have exact recovery (e.g., more reliable network information requires less reliable attribute information). This work also presents an iterative clustering algorithm that maximizes the joint likelihood, assuming that the probability distribution of network interactions and node attributes belong to exponential families. This covers a broad range of possible interactions (e.g., edges with weights) and attributes (e.g., non-Gaussian models) while also exploring the connection between exponential families and Bregman divergences. Extensive numerical experiments using synthetic and real data indicate that the proposed algorithm outperforms algorithms that leverage only network or only attribute information as well as recently proposed algorithms that perform clustering using both sources of information. The contributions of this work provide insights into the fundamental limits and practical techniques for inferring community labels on node-attributed networks.

Generalized equivalences between subsampling and ridge regularization
Pratik Patil Jin-Hong Du



Research question: Establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators.
Motivation: A recent open problem raised by Nakkiran et al. on the monotonicity of the risk of optimally tuned ridge regression calls for a precise understanding of how subsampling and explicit ridge regularization trade off.
Method: Prove that linear and quadratic functionals of subsample ridge estimators fitted at different regularization levels $\lambda$ and subsample aspect ratios $\psi$ are asymptotically equivalent along specific paths in the $(\lambda,\psi)$-plane, requiring only bounded moment assumptions, and provide a data-dependent method to determine the equivalent paths.
Results: As an indirect implication of the equivalences, optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio, resolving the open problem for general data distributions under proportional asymptotics, assuming a mild regularity condition.

We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels $\lambda$ and subsample aspect ratios $\psi$, are asymptotically equivalent along specific paths in the $(\lambda,\psi)$-plane (where $\psi$ is the ratio of the feature dimension to the subsample size). Our results only require bounded moment assumptions on feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of $(\lambda,\psi)$. An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. for general data distributions under proportional asymptotics, assuming a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.
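
The two objects being compared can be set up in a few lines (our illustration; the (lambda, psi) pair below is arbitrary rather than computed with the paper's data-dependent path method, along which the distance would vanish asymptotically):

import numpy as np
from numpy.linalg import solve

def ridge(X, y, lam):
    """Standard ridge estimator."""
    d = X.shape[1]
    return solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))

def ensemble_subsample_ridge(X, y, lam, subsample, M, rng):
    """Average of M ridge fits, each on a random subsample of the rows."""
    n = len(y)
    fits = []
    for _ in range(M):
        idx = rng.choice(n, size=subsample, replace=False)
        fits.append(ridge(X[idx], y[idx], lam))
    return np.mean(fits, axis=0)

rng = np.random.default_rng(0)
n, d = 600, 200
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)
b_full = ridge(X, y, lam=0.5)                 # heavier explicit regularization
b_ens = ensemble_subsample_ridge(X, y, lam=0.1, subsample=300, M=100, rng=rng)
print(np.linalg.norm(b_full - b_ens))         # small along a true equivalence path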

Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards
Semih Cayci Atilla Eryilmaz



Research question: In reinforcement learning with heavy-tailed reward distributions, existing methods can fail badly due to frequent statistical outliers.
Motivation: To address this, the paper proposes a dynamic gradient clipping mechanism and proves it substantially robustifies reinforcement learning against heavy-tailed rewards.
Method: Temporal difference (TD) learning and the natural actor-critic (NAC) algorithm are modified with a dynamic gradient clipping mechanism.
Results: Theoretical analysis shows robustness to heavy-tailed rewards both in expectation and with high probability, achieved through a favorable tradeoff between the bias and variability of the stochastic gradients.

In a broad class of reinforcement learning applications, stochastic rewards have heavy-tailed distributions, which lead to infinite second-order moments for stochastic (semi)gradients in policy evaluation and direct policy optimization. In such instances, the existing RL methods may fail miserably due to frequent statistical outliers. In this work, we establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism, and correspondingly operated natural actor-critic (NAC), can be provably robustified against heavy-tailed reward distributions. It is shown in the framework of linear function approximation that a favorable tradeoff between bias and variability of the stochastic gradients can be achieved with this dynamic gradient clipping mechanism. In particular, we prove that robust versions of TD learning achieve sample complexities of order $\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and $\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank assumption on the feature matrix, respectively, under heavy-tailed rewards with finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and with high probability. We show that a robust variant of NAC based on Robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity. We corroborate our theoretical results with numerical experiments.
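
The robust TD update can be sketched as follows (our sketch: the clipping schedule b_t = c * t**(1/(1+p)) is an assumed form chosen to grow with t, not necessarily the paper's exact choice):

import numpy as np

def robust_td(features, rewards, next_features, gamma=0.9, p=0.5, c=1.0):
    """TD(0) with a dynamically growing clipping threshold on the semi-gradient."""
    theta = np.zeros(features.shape[1])
    for t, (phi, r, phi_next) in enumerate(zip(features, rewards, next_features), 1):
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        g = delta * phi                                      # semi-gradient
        b_t = c * t ** (1.0 / (1.0 + p))                     # clipping level grows with t
        norm = np.linalg.norm(g)
        if norm > b_t:
            g *= b_t / norm
        theta += g / np.sqrt(t)
    return theta

# toy chain: random features, heavy-tailed rewards with finite (1+p)-th moment
rng = np.random.default_rng(0)
T, d = 20_000, 5
Phi = rng.standard_normal((T + 1, d))
rewards = 1.0 + rng.pareto(1.5, T)      # tail index 1.5: infinite variance
print(robust_td(Phi[:-1], rewards, Phi[1:]))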

An active learning framework for multi-group mean estimation
Abdellah Aznag Rachel Cummings Adam N. Elmachtoub



Research question: Given multiple groups with unknown data distributions, how should samples be collected adaptively to minimize the $p$-norm of the vector of variances of the group-mean estimators?
Motivation: An analyst who can draw one sample per period from a chosen group must balance sampling effort across groups whose variances are unknown a priori.
Method: The paper proposes Variance-UCB, which selects groups according to an upper confidence bound on the variance estimate, adjusted to the chosen $p$-norm.
Results: The regret of Variance-UCB is $O(T^{-2})$ for finite $p$, and no algorithm can do better; when $p$ is infinite, the known $O(T^{-1.5})$ rate is recovered and a new matching lower bound is given.

We consider a fundamental problem where there are multiple groups whose data distributions are unknown, and an analyst would like to learn the mean of each group. We consider an active learning framework to sequentially collect $T$ samples with bandit feedback, each period observing a sample from a chosen group. After observing a sample, the analyst may update their estimate of the mean and variance of that group and choose the next group accordingly. The objective is to dynamically collect samples to minimize the $p$-norm of the vector of variances of our mean estimators after $T$ rounds. We propose an algorithm, Variance-UCB, that selects groups according to an upper bound on the variance estimate adjusted to the $p$-norm chosen. We show that the regret of Variance-UCB is $O(T^{-2})$ for finite $p$, and prove that no algorithm can do better. When $p$ is infinite, we recover the $O(T^{-1.5})$ obtained in \cite{activelearning, carpentier2011upper} and provide a new lower bound showing that no algorithm can do better.
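
A sketch of the Variance-UCB idea for the p = infinity objective follows (our illustration; the exact bonus shape is an assumption):

import numpy as np

def variance_ucb(groups, T, rng):
    """Always sample the group whose mean-estimator variance is largest under
    an optimistic (UCB-style) variance estimate."""
    K = len(groups)
    obs = [[groups[k](rng) for _ in range(2)] for k in range(K)]   # 2 seed samples
    for t in range(2 * K, T):
        ucb = [(np.var(o, ddof=1) + np.sqrt(2 * np.log(T) / len(o))) / len(o)
               for o in obs]
        k = int(np.argmax(ucb))
        obs[k].append(groups[k](rng))
    return [np.mean(o) for o in obs], [len(o) for o in obs]

rng = np.random.default_rng(0)
groups = [lambda r: r.normal(0.0, 1.0),
          lambda r: r.normal(5.0, 3.0),
          lambda r: r.normal(-2.0, 0.5)]
means, counts = variance_ucb(groups, T=3000, rng=rng)
print(np.round(means, 2), counts)   # the sigma = 3 group gets the most samples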

On the Asymptotic Learning Curves of Kernel Ridge Regression under Power-law Decay
Yicheng Li Haobo Zhang Qian Lin



Research question: The widely observed "benign overfitting phenomenon" in neural networks challenges the "bias-variance trade-off" doctrine of statistical learning theory.
Motivation: The generalization of lazily trained over-parametrized neural networks can be well approximated by neural tangent kernel regression, so the excess risk curve (the learning curve) of kernel ridge regression has recently attracted increasing attention.
Method: Under mild and more realistic assumptions, the paper rigorously provides a full asymptotic characterization of the learning curve under power-law decay conditions on the eigenvalues of the kernel and on the target function.
Results: The learning curve elaborates the effect and interplay of the regularization parameter, the source condition, and the noise; in particular, benign overfitting appears in over-parametrized neural networks only when the noise level is small.

The widely observed 'benign overfitting phenomenon' in the neural network literature raises the challenge to the `bias-variance trade-off' doctrine in the statistical learning theory. Since the generalization ability of the 'lazy trained' over-parametrized neural network can be well approximated by that of the neural tangent kernel regression, the curve of the excess risk (namely, the learning curve) of kernel ridge regression attracts increasing attention recently. However, most recent arguments on the learning curve are heuristic and are based on the 'Gaussian design' assumption. In this paper, under mild and more realistic assumptions, we rigorously provide a full characterization of the learning curve in the asymptotic sense under a power-law decay condition of the eigenvalues of the kernel and also the target function. The learning curve elaborates the effect and the interplay of the choice of the regularization parameter, the source condition and the noise. In particular, our results suggest that the 'benign overfitting phenomenon' exists in over-parametrized neural networks only when the noise level is small.

Information Theoretic Lower Bounds for Information Theoretic Upper Bounds
Roi Livni



Research question: The relationship between the mutual information between the output model and the empirical sample, and the generalization of stochastic convex optimization algorithms.
Motivation: Despite growing interest in information-theoretic generalization bounds, it is unclear whether they can explain the exceptional performance of various learning algorithms.
Method: A study of stochastic convex optimization showing that, for true risk minimization, dimension-dependent mutual information is necessary.
Results: Existing information-theoretic generalization bounds therefore fall short of capturing the generalization of algorithms such as SGD and regularized ERM, whose sample complexity is dimension-independent.

We examine the relationship between the mutual information between the output model and the empirical sample and the algorithm's generalization in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity.

Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm
Miaoxi Zhu Li Shen Bo Du Dacheng Tao



Research question: How to solve minimax problems in a decentralized manner across various machine learning tasks.
Motivation: Existing theory focuses mainly on the convergence rate and communication complexity of decentralized minimax algorithms, with little attention to their generalization.
Method: Using algorithmic stability, the paper studies the primal-dual generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm in both convex-concave and nonconvex-nonconcave settings.
Results: The analysis shows the decentralized structure does not destroy the stability and generalization of D-SGDA, which in certain situations generalizes as well as vanilla SGDA; the results also quantify how different topologies affect the generalization bound, and numerical experiments validate the theory.

The growing size of available data has attracted increasing interest in solving minimax problems in a decentralized manner for various machine learning tasks. Previous theoretical research has primarily focused on the convergence rate and communication complexity of decentralized minimax algorithms, with little attention given to their generalization. In this paper, we investigate the primal-dual generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm using the approach of algorithmic stability under both convex-concave and nonconvex-nonconcave settings. Our theory refines the algorithmic stability in a decentralized manner and demonstrates that the decentralized structure does not destroy the stability and generalization of D-SGDA, implying that it can generalize as well as the vanilla SGDA in certain situations. Our results analyze the impact of different topologies on the generalization bound of the D-SGDA algorithm beyond trivial factors such as sample sizes, learning rates, and iterations. We also evaluate the optimization error and balance it with the generalization gap to obtain the optimal population risk of D-SGDA in the convex-concave setting. Additionally, we perform several numerical experiments which validate our theoretical findings.
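
D-SGDA itself is compact; the sketch below (our toy instance on a strongly-convex-strongly-concave saddle with a ring gossip matrix) shows the mixing step followed by local descent/ascent steps:

import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3
A = [rng.standard_normal((d, d)) for _ in range(m)]
W = np.array([[0.50, 0.25, 0.00, 0.25],       # ring gossip matrix,
              [0.25, 0.50, 0.25, 0.00],       # doubly stochastic
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

def d_sgda(steps=4000, lr=0.05, noise=0.1):
    """D-SGDA sketch on f_i(x, y) = x.A_i.y + ||x||^2/2 - ||y||^2/2 (saddle at 0)."""
    X = np.ones((m, d))
    Y = np.ones((m, d))
    for _ in range(steps):
        X, Y = W @ X, W @ Y                   # communication (mixing) step
        for i in range(m):
            gx = A[i] @ Y[i] + X[i] + noise * rng.standard_normal(d)
            gy = A[i].T @ X[i] - Y[i] + noise * rng.standard_normal(d)
            X[i] -= lr * gx                   # stochastic descent in x
            Y[i] += lr * gy                   # stochastic ascent in y
        lr *= 0.999
    return X.mean(axis=0), Y.mean(axis=0)

x, y = d_sgda()
print(np.linalg.norm(x), np.linalg.norm(y))   # both small: near the saddle point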

Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time
Xinyuan Cao Santosh Vempala



Research question: Learn high-dimensional halfspaces with margins when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution.
Motivation: Without labels, the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions; the goal is to establish the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption.
Method: Use only the first two moments of suitable re-weightings of the empirical distribution, the contrastive moments; the analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions.
Results: The sample and time complexity are polynomial in the dimension and $1/\epsilon$. Improving on prior work, the guarantees are polynomial-time in total variation (TV) distance rather than existing moment-bound guarantees that can be super-polynomial, and this is the first work in this setting to go beyond Gaussians.

We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/\epsilon$. The algorithm uses only the first two moments of *suitable re-weightings* of the empirical distribution, which we call *contrastive moments*; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.

Zero-Regret Performative Prediction Under Inequality Constraints
Wenjing YAN Xuanyu Cao



Research question: Study performative prediction under inequality constraints and find optimal solutions.
Motivation: Work on performative prediction to date has focused only on unconstrained problems, neglecting that many real-world learning problems are subject to constraints.
Method: The paper develops a robust primal-dual framework that requires only approximate gradients up to a certain accuracy yet delivers the same order of performance as the stationary stochastic primal-dual algorithm without performativity; on top of this framework, an adaptive primal-dual algorithm for location families is proposed.
Results: The adaptive primal-dual algorithm attains $\mathcal{O}(\sqrt{T})$ regret and constraint violations over a horizon $T$ using only $\sqrt{T} + 2T$ samples; numerical simulations validate the algorithm and the theory.

Performative prediction is a recently proposed framework where predictions guide decision-making and hence influence future data distributions. Such performative phenomena are ubiquitous in various areas, such as transportation, finance, public policy, and recommendation systems. To date, work on performative prediction has only focused on unconstrained problems, neglecting the fact that many real-world learning problems are subject to constraints. This paper bridges this gap by studying performative prediction under inequality constraints. Unlike most existing work that provides only performative stable points, we aim to find the optimal solutions. Anticipating performative gradient is a challenging task, due to the agnostic performative effect on data distributions. To address this issue, we first develop a robust primal-dual framework that requires only approximate gradients up to a certain accuracy, yet delivers the same order of performance as the stationary stochastic primal-dual algorithm without performativity. Based on this framework, we then propose an adaptive primal-dual algorithm for location families. Our analysis demonstrates that the proposed adaptive primal-dual algorithm attains $\mathcal{O}(\sqrt{T})$ regret and constraint violations, using only $\sqrt{T} + 2T$ samples, where $T$ is the time horizon. To our best knowledge, this is the first study and analysis on the optimality of the performative prediction problem under inequality constraints. Finally, we validate the effectiveness of our algorithm and theoretical results through numerical simulations.

Bandit Social Learning under Myopic Behavior
Kiarash Banihashem MohammadTaghi Hajiaghayi Suho Shin Aleksandrs Slivkins



Research question: Social learning dynamics motivated by reviews on online platforms.
Motivation: The agents collectively follow a simple multi-armed bandit protocol, but each acts myopically, without regard to exploration; the question is what such myopic behavior costs.
Method: The paper allows a wide range of myopic behaviors consistent with (parameterized) confidence intervals for the arms' expected rewards and analyzes the resulting dynamics.
Results: Stark exploration failures are derived for any such behavior, along with matching positive results; as a special case, the paper obtains the first general results on the failure of the greedy algorithm in bandits, a theoretical foundation for why bandit algorithms should explore.

We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regards to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms’ expected rewards. We derive stark exploration failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.
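
The headline failure is easy to see numerically (our illustration): a purely greedy agent that never explores gets stuck on the worse arm in a constant fraction of runs:

import numpy as np

def greedy_failure_rate(runs=500, T=500, mu=(0.6, 0.4)):
    rng = np.random.default_rng(0)
    stuck = 0
    for _ in range(runs):
        est = [float(rng.binomial(1, m)) for m in mu]  # one seed pull per arm
        n = [1, 1]
        for _ in range(T - 2):
            a = int(np.argmax(est))                    # myopic: no exploration bonus
            r = rng.binomial(1, mu[a])
            n[a] += 1
            est[a] += (r - est[a]) / n[a]
        stuck += int(n[1] > n[0])                      # spent most pulls on the bad arm
    return stuck / runs

print(greedy_failure_rate())  # a constant fraction of runs never recovers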

Projection-Free Methods for Stochastic Simple Bilevel Optimization with Convex Lower-level Problem
Jincheng Cao Ruichen Jiang Nazanin Abolfazli Erfan Yazdandoost Hamedani Aryan Mokhtari



Research question: A class of stochastic bilevel optimization problems known as stochastic simple bilevel optimization: minimize a smooth stochastic objective over the optimal solution set of another stochastic convex optimization problem.
Motivation: Existing methods for such problems have stochastic query complexities that leave substantial room for improvement.
Method: New stochastic bilevel methods locally approximate the solution set of the lower-level problem via a stochastic cutting plane, then run a conditional gradient update with variance reduction to control the error induced by stochastic gradients.
Results: When the upper-level function is convex, the method needs $\mathcal{O}(\max\{1/\epsilon_f^{2},1/\epsilon_g^{2}\})$ stochastic queries for a solution that is $\epsilon_f$-optimal for the upper level and $\epsilon_g$-optimal for the lower level, improving the previous best $\mathcal{O}(\max\{1/\epsilon_f^{4},1/\epsilon_g^{4}\})$; when the upper level is non-convex, at most $\mathcal{O}(\max\{1/\epsilon_f^{3},1/\epsilon_g^{3}\})$ queries find an $(\epsilon_f, \epsilon_g)$-stationary point. In the finite-sum setting, the required queries are $\mathcal{O}(\sqrt{n}/\epsilon)$ and $\mathcal{O}(\sqrt{n}/\epsilon^{2})$ for the convex and non-convex cases, where $\epsilon=\min\{\epsilon_f,\epsilon_g\}$.

In this paper, we study a class of stochastic bilevel optimization problems, also known as stochastic simple bilevel optimization, where we minimize a smooth stochastic objective function over the optimal solution set of another stochastic convex optimization problem. We introduce novel stochastic bilevel optimization methods that locally approximate the solution set of the lower-level problem via a stochastic cutting plane, and then run a conditional gradient update with variance reduction techniques to control the error induced by using stochastic gradients. For the case that the upper-level function is convex, our method requires $\mathcal{O}(\max\{1/\epsilon_f^{2},1/\epsilon_g^{2}\})$ stochastic oracle queries to obtain a solution that is $\epsilon_f$-optimal for the upper-level and $\epsilon_g$-optimal for the lower-level. This guarantee improves the previous best-known complexity of $\mathcal{O}(\max\{1/\epsilon_f^{4},1/\epsilon_g^{4}\})$. Moreover, for the case that the upper-level function is non-convex, our method requires at most $\mathcal{O}(\max\{1/\epsilon_f^{3},1/\epsilon_g^{3}\})$ stochastic oracle queries to find an $(\epsilon_f, \epsilon_g)$-stationary point. In the finite-sum setting, we show that the number of stochastic oracle calls required by our method are $\mathcal{O}(\sqrt{n}/\epsilon)$ and $\mathcal{O}(\sqrt{n}/\epsilon^{2})$ for the convex and non-convex settings, respectively, where $\epsilon=\min\{\epsilon_f,\epsilon_g\}$.

Unified Enhancement of Privacy Bounds for Mixture Mechanisms via $f$-Differential Privacy
Chendi Wang Buxin Su Jiayuan Ye Reza Shokri Weijie J Su



Research question: How to improve, using $f$-DP, the privacy bounds for shuffling models and for one-iteration differentially private gradient descent (DP-GD) with random initialization.
Motivation: Sources of randomness such as random initialization, random batch subsampling, and shuffling are hard to account for when proving differential privacy bounds, because they induce mixture distributions over the algorithm's output that are difficult to analyze.
Method: Derive a closed-form expression for the trade-off function of shuffling models that outperforms the most up-to-date results based on $(\epsilon,\delta)$-DP, and study the effect of random initialization on the privacy of one-iteration DP-GD.
Results: Numerical computations indicate that random initialization can enhance the privacy of DP-GD; the analysis also yields a new trade-off function inequality implying the joint convexity of $F$-divergences, which helps in understanding and improving the privacy of mixture mechanisms under $f$-DP.

Differentially private (DP) machine learning algorithms incur many sources of randomness, such as random initialization, random batch subsampling, and shuffling. However, such randomness is difficult to take into account when proving differential privacy bounds because it induces mixture distributions for the algorithm's output that are difficult to analyze. This paper focuses on improving privacy bounds for shuffling models and one-iteration differentially private gradient descent (DP-GD) with random initializations using $f$-DP. We derive a closed-form expression of the trade-off function for shuffling models that outperforms the most up-to-date results based on $(\epsilon,\delta)$-DP. Moreover, we investigate the effects of random initialization on the privacy of one-iteration DP-GD. Our numerical computations of the trade-off function indicate that random initialization can enhance the privacy of DP-GD. Our analysis of $f$-DP guarantees for these mixture mechanisms relies on an inequality for trade-off functions introduced in this paper. This inequality implies the joint convexity of $F$-divergences. Finally, we study an $f$-DP analog of the advanced joint convexity of the hockey-stick divergence related to $(\epsilon,\delta)$-DP and apply it to analyze the privacy of mixture mechanisms.

On Differentially Private Sampling from Gaussian and Product Distributions
Badih Ghazi Xiao Hu Ravi Kumar Pasin Manurangsi



Research question: How to generate a sample from a distribution close to an unknown distribution $P$ under the constraint of differential privacy.
Motivation: Enable effective sampling from an unknown distribution while protecting user privacy.
Method: New differentially private sampling algorithms for multivariate Gaussians under different assumptions: known covariance, unknown bounded covariance, and unknown unbounded covariance.
Results: In the known-covariance and unknown-bounded-covariance settings, the new algorithms achieve near-optimal sample complexity; when $P$ is a product distribution on the binary hypercube, a pure-DP algorithm is obtained, whereas previously only an approximate-DP algorithm (with slightly worse sample complexity) was known.

We study the problem, where given a dataset of $n$ i.i.d. samples from an unknown distribution $P$, we seek to generate a sample from a distribution that is close to $P$ in total variation distance, under the constraint of differential privacy. We study the settings where $P$ is a multi-dimensional Gaussian distribution with different assumptions: known covariance, unknown bounded covariance, and unknown unbounded covariance. We present new differentially private sampling algorithms, and show that they achieve near-optimal sample complexity in the first two settings. Moreover, when $P$ is a product distribution on the binary hypercube, we obtain a pure-DP algorithm whereas only an approximate-DP algorithm (with slightly worse sample complexity) was previously known.

Online Corrupted User Detection and Regret Minimization
Zhiyong Wang Jize Xie Tong Yu Shuai Li John C.S. Lui



Research question: Design efficient online learning algorithms that learn from potentially corrupted user behaviors and accurately identify corrupted users online.
Motivation: In real-world online web systems, multiple users usually arrive sequentially; in applications like click fraud and fake reviews, some users may maliciously perform corrupted behaviors to trick the system.
Method: The paper formulates the online learning problem LOCUD, which learns and utilizes unknown user relations from corrupted behaviors to speed up learning while identifying corrupted users online; it proposes the robust bandit algorithm RCLUB-WCU to learn the unknown relations among potentially corrupted users, and the online detection algorithm OCCUD, built on RCLUB-WCU's inferred user relations, to detect corrupted users.
Results: Extensive experiments show superior performance over previous bandit algorithms and high detection accuracy for corrupted users.

In real-world online web systems, multiple users usually arrive sequentially into the system. For applications like click fraud and fake reviews, some users can maliciously perform corrupted (disrupted) behaviors to trick the system. Therefore, it is crucial to design efficient online learning algorithms to robustly learn from potentially corrupted user behaviors and accurately identify the corrupted users in an online manner. Existing works propose bandit algorithms robust to adversarial corruption. However, these algorithms are designed for a single user, and cannot leverage the implicit social relations among multiple users for more efficient learning. Moreover, none of them consider how to detect corrupted users online in the multiple-user scenario. In this paper, we present an important online learning problem named LOCUD to learn and utilize unknown user relations from disrupted behaviors to speed up learning, and identify the corrupted users in an online setting. To robustly learn and utilize the unknown relations among potentially corrupted users, we propose a novel bandit algorithm RCLUB-WCU. To detect the corrupted users, we devise a novel online detection algorithm OCCUD based on RCLUB-WCU's inferred user relations. We prove a regret upper bound for RCLUB-WCU, which asymptotically matches the lower bound with respect to $T$ up to logarithmic factors, and matches the state-of-the-art results in degenerate cases. We also give a theoretical guarantee for the detection accuracy of OCCUD. With extensive experiments, our methods achieve superior performance over previous bandit algorithms and high corrupted user detection accuracy.

On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions
Yufeng Zhang Jialu Pan Kenli Li Wanwei Liu Zhenbang Chen Xinwang Liu J Wang



Research question: Properties of the Kullback-Leibler divergence between multivariate Gaussian distributions.
Motivation: KL divergence is one of the most important measures of the difference between probability distributions, so understanding its behavior for multivariate Gaussians matters.
Method: Theoretical analysis of the supremum and infimum of the KL divergence between multivariate Gaussians in each direction, yielding several useful properties, all independent of the dimension.
Results: The results deepen the understanding of multivariate Gaussians and find applications in deep learning, reinforcement learning, and sample complexity research.

Kullback-Leibler (KL) divergence is one of the most important measures to calculate the difference between probability distributions. In this paper, we theoretically study several properties of KL divergence between multivariate Gaussian distributions. Firstly, for any two $n$-dimensional Gaussian distributions $\mathcal{N}_1$ and $\mathcal{N}_2$, we prove that when $KL(\mathcal{N}_2||\mathcal{N}_1)\leq \varepsilon\ (\varepsilon>0)$ the supremum of $KL(\mathcal{N}_1||\mathcal{N}_2)$ is $(1/2)\left((-W_{0}(-e^{-(1+2\varepsilon)}))^{-1}+\log(-W_{0}(-e^{-(1+2\varepsilon)})) -1 \right)$, where $W_0$ is the principal branch of Lambert $W$ function. For small $\varepsilon$, the supremum is $\varepsilon + 2\varepsilon^{1.5} + O(\varepsilon^2)$. This quantifies the approximate symmetry of small KL divergence between Gaussian distributions. We further derive the infimum of $KL(\mathcal{N}_1||\mathcal{N}_2)$ when $KL(\mathcal{N}_2||\mathcal{N}_1)\geq M\ (M>0)$. We give the conditions when the supremum and infimum can be attained. Secondly, for any three $n$-dimensional Gaussian distributions $\mathcal{N}_1$, $\mathcal{N}_2$, and $\mathcal{N}_3$, we theoretically show that an upper bound of $KL(\mathcal{N}_1||\mathcal{N}_3)$ is $3\varepsilon_1+3\varepsilon_2+2\sqrt{\varepsilon_1\varepsilon_2}+o(\varepsilon_1)+o(\varepsilon_2)$ when $KL(\mathcal{N}_1||\mathcal{N}_2)\leq \varepsilon_1$ and $KL(\mathcal{N}_2||\mathcal{N}_3)\leq \varepsilon_2$ ($\varepsilon_1,\varepsilon_2\ge 0$). This reveals that KL divergence between Gaussian distributions follows a relaxed triangle inequality. Note that, all these bounds in the theorems presented in this work are independent of the dimension $n$. Finally, we discuss several applications of our theories in deep learning, reinforcement learning, and sample complexity research.
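
The supremum formula can be checked numerically with the principal Lambert W branch (our illustration using a one-dimensional equal-mean pair, which attains the bound in this computation):

import numpy as np
from scipy.special import lambertw

def kl_gauss_1d(var_a, var_b):
    """KL( N(0, var_a) || N(0, var_b) ) for one-dimensional Gaussians."""
    return 0.5 * (var_a / var_b - 1.0 - np.log(var_a / var_b))

eps = 0.1
u = -np.real(lambertw(-np.exp(-(1.0 + 2.0 * eps)), k=0))   # u = -W_0(-e^{-(1+2eps)})
sup_formula = 0.5 * (1.0 / u + np.log(u) - 1.0)

# equal-mean pair with variance ratio 1/u: the constraint is tight and the
# reverse divergence hits the formula value
var1, var2 = 1.0 / u, 1.0
print(kl_gauss_1d(var2, var1))               # ~ eps, i.e. KL(N2||N1) <= eps is tight
print(kl_gauss_1d(var1, var2), sup_formula)  # the two values agree
print(eps + 2.0 * eps ** 1.5)                # the paper's small-eps expansion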

Federated Conditional Stochastic Optimization
Xidong Wu Jianhui Sun Zhengmian Hu Junyi Li Aidong Zhang Heng Huang



Research question: Nonconvex conditional stochastic optimization in federated learning.
Motivation: As the demand for training models on large-scale distributed data grows, so does the need for communication-efficient distributed optimization algorithms.
Method: The paper proposes the first federated conditional stochastic optimization algorithm (FCSG) with a conditional stochastic gradient estimator, a momentum-based variant (FCSG-M), and an accelerated algorithm (Acc-FCSG-M) designed via variance reduction to achieve the best sample and communication complexity.
Results: Unlike existing optimization analyses for meta-learning in FL, federated conditional stochastic optimization accounts for the sampling of tasks; extensive experiments validate the efficiency of the algorithms.

Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and meta-learning. As the demand for training models with large-scale distributed data grows in these applications, there is an increasing need for communication-efficient distributed optimization algorithms, such as federated learning algorithms. This paper considers the nonconvex conditional stochastic optimization in federated learning and proposes the first federated conditional stochastic optimization algorithm (FCSG) with a conditional stochastic gradient estimator and a momentum-based algorithm (\emph{i.e.}, FCSG-M). To match the lower bound complexity in the single-machine setting, we design an accelerated algorithm (Acc-FCSG-M) via the variance reduction to achieve the best sample and communication complexity. Compared with the existing optimization analysis for Meta-Learning in FL, federated conditional stochastic optimization considers the sample of tasks. Extensive experimental results on various tasks validate the efficiency of these algorithms.

Bilevel Coreset Selection in Continual Learning: A New Formulation and Algorithm
Jie Hao Kaiyi Ji Mingrui Liu



Research question: Coreset selection for representative samples of previous tasks in rehearsal-based continual learning.
Motivation: In continual learning the coreset is typically used in the memory replay buffer to stand for representative samples of previous tasks, but the traditional bilevel coreset selection formulation is computationally expensive.
Method: A new bilevel formulation in which the inner problem finds a model minimizing the expected training error sampled from a given probability distribution, and the outer problem learns a distribution with approximately $K$ (coreset size) nonzero entries such that the inner model minimizes the training error over the whole data; a novel regularizer based on a smoothed top-$K$ loss keeps the learned probability approximately $K$-sparse.
Results: A new optimization algorithm provably converges to an $\epsilon$-stationary point with $O(1/\epsilon^4)$ computational complexity, and experiments on continual learning benchmarks show the method significantly outperforms competitive baselines across various settings.

Coreset is a small set that provides a data summary for a large dataset, such that training solely on the small set achieves competitive performance compared with a large dataset. In rehearsal-based continual learning, the coreset is typically used in the memory replay buffer to stand for representative samples in previous tasks, and the coreset selection procedure is typically formulated as a bilevel problem. However, the typical bilevel formulation for coreset selection explicitly performs optimization over discrete decision variables with greedy search, which is computationally expensive. Several works consider other formulations to address this issue, but they ignore the nested nature of bilevel optimization problems and may not solve the bilevel coreset selection problem accurately. To address these issues, we propose a new bilevel formulation, where the inner problem tries to find a model which minimizes the expected training error sampled from a given probability distribution, and the outer problem aims to learn the probability distribution with approximately $K$ (coreset size) nonzero entries such that learned model in the inner problem minimizes the training error over the whole data. To ensure the learned probability has approximately $K$ nonzero entries, we introduce a novel regularizer based on the smoothed top-$K$ loss in the upper problem. We design a new optimization algorithm that provably converges to the $\epsilon$-stationary point with $O(1/\epsilon^4)$ computational complexity. We conduct extensive experiments in various settings in continual learning, including balanced data, imbalanced data, and label noise, to show that our proposed formulation and new algorithm significantly outperform competitive baselines. From bilevel optimization point of view, our algorithm significantly improves the vanilla greedy coreset selection method in terms of running time on continual learning benchmark datasets. The code is available at https://github.com/MingruiLiu-ML-Lab/Bilevel-Coreset-Selection-via-Regularization.

Online Clustering of Bandits with Misspecified User Models
Zhiyong Wang Jize Xie Xutong Liu Shuai Li John C.S. Lui



Research question: How to design clustering-of-bandits algorithms that are robust to misspecified user models.
Motivation: Existing clustering-of-bandits (CB) algorithms require well-specified linear user models and can fail when this critical assumption does not hold; whether robust CB algorithms can be designed for the more practical misspecified setting was an open problem.
Method: The paper is the first to formulate clustering of bandits with misspecified user models (CBMUM) and devises two robust CB algorithms, RCLUMB and RSCLUMB, which accommodate the inaccurate user preference estimates and erroneous clustering caused by model misspecification.
Results: Under milder assumptions than previous CB works, the algorithms achieve regret upper bounds of $O(\epsilon_*T\sqrt{md\log T} + d\sqrt{mT}\log T)$, matching the lower bound asymptotically in $T$ up to logarithmic factors and matching the state of the art in several degenerate cases; experiments on synthetic and real-world data show improvements over previous algorithms.

The contextual linear bandit is an important online learning problem where given arm features, a learning agent selects an arm at each round to maximize the cumulative rewards in the long run. A line of works, called the clustering of bandits (CB), utilize the collaborative effect over user preferences and have shown significant improvements over classic linear bandit algorithms. However, existing CB algorithms require well-specified linear user models and can fail when this critical assumption does not hold. Whether robust CB algorithms can be designed for more practical scenarios with misspecified user models remains an open problem. In this paper, we are the first to present the important problem of clustering of bandits with misspecified user models (CBMUM), where the expected rewards in user models can be perturbed away from perfect linear models. We devise two robust CB algorithms, RCLUMB and RSCLUMB (representing the learned clustering structure with dynamic graph and sets, respectively), that can accommodate the inaccurate user preference estimations and erroneous clustering caused by model misspecifications. We prove regret upper bounds of $O(\epsilon_*T\sqrt{md\log T} + d\sqrt{mT}\log T)$ for our algorithms under milder assumptions than previous CB works, which match the lower bound asymptotically in $T$ up to logarithmic factors, and also match the state-of-the-art results in several degenerate cases. Our regret analysis is novel and different from the typical proof flow of previous CB works. The techniques in proving the regret caused by misclustering users are quite general and may be of independent interest. Experiments on both synthetic and real-world data show our outperformance over previous algorithms.

Bayesian Optimization with Cost-varying Variable Subsets
Sebastian Shenghong Tay Chuan-Sheng Foo Daisuke Urano Richalynn Leong Bryan Kian Hsiang Low



Research question: A Bayesian optimization problem in which, at each iteration, the learner chooses a subset of query variables and specifies their values while the rest are randomly sampled, with each chosen subset incurring an associated cost.
Motivation: The learner faces a novel trade-off between choosing more informative subsets for more directed learning and leaving some variables to be randomly sampled to reduce incurred costs.
Method: A novel Gaussian process upper confidence bound-based algorithm for the BOCVS problem that is provably no-regret; the analysis shows how the availability of cheaper control sets helps exploration and reduces overall regret.
Results: Experiments show the algorithm finds significantly better solutions than comparable baselines with the same budget.

We introduce the problem of Bayesian optimization with cost-varying variable subsets (BOCVS) where in each iteration, the learner chooses a subset of query variables and specifies their values while the rest are randomly sampled. Each chosen subset has an associated cost. This presents the learner with the novel challenge of balancing between choosing more informative subsets for more directed learning versus leaving some variables to be randomly sampled to reduce incurred costs. This paper presents a novel Gaussian process upper confidence bound-based algorithm for solving the BOCVS problem that is provably no-regret. We analyze how the availability of cheaper control sets helps in exploration and reduces overall regret. We empirically show that our proposed algorithm can find significantly better solutions than comparable baselines with the same budget.

On the Overlooked Structure of Stochastic Gradients
Zeke Xie Qian-Yuan Tang Mingming Sun Ping Li



Research question: The structure and heavy-tail behavior of stochastic gradients in deep neural networks, and their implications for optimization and generalization.
Motivation: Some works explain the success of stochastic optimization in deep learning by arguably heavy-tailed gradient noise, while others give theoretical and empirical evidence against the heavy-tail hypothesis; formal statistical analyses of the structure and heavy tails of stochastic gradients in deep learning remain under-explored.
Method: Two contributions: first, formal statistical tests on the distributions of stochastic gradients and gradient noise across parameters and iterations, revealing that dimension-wise gradients usually exhibit power-law heavy tails while iteration-wise gradients and minibatch-induced gradient noise usually do not; second, the discovery that the covariance spectra of stochastic gradients have power-law structures overlooked by previous studies, with theoretical implications for training DNNs.
Results: The work challenges existing beliefs and provides novel insights into the structure of stochastic gradients in deep learning.

Stochastic gradients closely relate to both optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the arguably heavy-tail properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning are still under-explored. In this paper, we mainly make two contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails. Second, we further discover that the covariance spectra of stochastic gradients have the power-law structures overlooked by previous studies and present its theoretical implications for training of DNNs. While previous studies believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect the gradient covariance can have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights on the structure of stochastic gradients in deep learning.
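
A standard diagnostic for such power-law tails is the Hill estimator (our illustration, not the paper's exact test suite):

import numpy as np

def hill_estimator(samples, k):
    """Hill estimator of the power-law tail index from the k largest values."""
    x = np.sort(np.abs(samples))[::-1][: k + 1]
    return k / np.sum(np.log(x[:k] / x[k]))

rng = np.random.default_rng(0)
heavy = rng.pareto(2.0, 100_000)       # power-law tail, true index 2
light = rng.standard_normal(100_000)   # Gaussian: no power-law tail
print(hill_estimator(heavy, 1000))     # close to 2
print(hill_estimator(light, 1000))     # much larger: tail decays faster than any power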

$L_2$-Uniform Stability of Randomized Learning Algorithms: Sharper Generalization Bounds and Confidence Boosting
Xiaotong Yuan Ping Li



Research question: Extend the best known high-probability generalization bounds from deterministic learning algorithms to the regime of randomized learning.
Motivation: Improving the generalization guarantees of randomized learning algorithms requires extending and refining existing notions of stability.
Method: Introduce a novel notion of $L_2$-uniform stability, holding uniformly over data but in second moment over the algorithm's randomness, and prove a strong exponential bound within a classic confidence-boosting framework.
Results: A bagging-based meta-algorithm achieves near-optimal generalization with high probability jointly over the randomness of data and algorithm; the generic results are instantiated for SGD with natural time-decaying learning rates to derive sharper exponential bounds for convex and non-convex optimization.

Exponential generalization bounds with near-optimal rates have recently been established for uniformly stable algorithms~\citep{feldman2019high,bousquet2020sharper}. We seek to extend these best known high probability bounds from deterministic learning algorithms to the regime of randomized learning. One simple approach for achieving this goal is to define the stability for the expectation over the algorithm's randomness, which may result in sharper parameter but only leads to guarantees regarding the on-average generalization error. Another natural option is to consider the stability conditioned on the algorithm's randomness, which is way more stringent but may lead to generalization with high probability jointly over the randomness of sample and algorithm. The present paper addresses such a tension between these two alternatives and makes progress towards relaxing it inside a classic framework of confidence-boosting. To this end, we first introduce a novel concept of $L_2$-uniform stability that holds uniformly over data but in second-moment over the algorithm's randomness. Then as a core contribution of this work, we prove a strong exponential bound on the first-moment of generalization error under the notion of $L_2$-uniform stability. As an interesting consequence of the bound, we show that a bagging-based meta algorithm leads to near-optimal generalization with high probability jointly over the randomness of data and algorithm. We further substantialize these generic results to stochastic gradient descent (SGD) to derive sharper exponential bounds for convex or non-convex optimization with natural time-decaying learning rates, which have not been possible to prove with the existing stability-based generalization guarantees.

Federated Spectral Clustering via Secure Similarity Reconstruction
Dong Qiao Chris Ding Jicong Fan



Research question: A secure kernelized factorization method for federated spectral clustering on distributed datasets.
Motivation: Federated learning offers clear advantages for protecting information privacy, yet research on secure federated unsupervised learning, especially clustering, remains limited.
Method: The method implicitly constructs an approximation of the kernel matrix so that spectral clustering can be performed under privacy-protection constraints; the paper provides a convergence guarantee for the optimization algorithm, reconstruction error bounds for the Gaussian kernel matrix, and a sufficient condition for correct clustering.
Results: Numerical results on synthetic and real datasets show the method is efficient and accurate in comparison to the baselines.

Federated learning has a significant advantage in protecting information privacy. Many scholars proposed various secure learning methods within the framework of federated learning but the study on secure federated unsupervised learning especially clustering is limited. We in this work propose a secure kernelized factorization method for federated spectral clustering on distributed dataset. The method is non-trivial because the kernel or similarity matrix for spectral clustering is computed by data pairs, which violates the principle of privacy protection. Our method implicitly constructs an approximation for the kernel matrix on distributed data such that we can perform spectral clustering under the constraint of privacy protection. We provide a convergence guarantee of the optimization algorithm, reconstruction error bounds of the Gaussian kernel matrix, and the sufficient condition of correct clustering of our method. We also present some results of differential privacy. Numerical results on synthetic and real datasets demonstrate that the proposed method is efficient and accurate in comparison to the baselines.

Deep Contract Design via Discontinuous Networks
Tonghan Wang Paul Duetting Dmitry Ivanov Inbal Talgam-Cohen David C. Parkes



Research question: Automate the design of optimal contracts with deep learning.
Motivation: Existing contract design approaches lack adaptability and efficiency for complex settings; a method that automatically designs optimal contracts is needed.
Method: A new representation, the Discontinuous ReLU (DeLU) network, models the principal's utility as a discontinuous piecewise affine function of the contract design, with each piece corresponding to the agent taking a particular action; DeLU networks implicitly learn closed-form expressions for the agent's incentive compatibility constraints and the principal's utility maximization objective, and support parallel inference on each piece via linear programming or interior-point methods that solve for optimal contracts.
Results: Experiments demonstrate success in approximating the principal's utility with a small number of training samples and in scaling to find approximately optimal contracts on problems with many actions and outcomes.

Contract design involves a principal who establishes contractual agreements about payments for outcomes that arise from the actions of an agent. In this paper, we initiate the study of deep learning for the automated design of optimal contracts. We introduce a novel representation: the Discontinuous ReLU (DeLU) network, which models the principal's utility as a discontinuous piecewise affine function of the design of a contract where each piece corresponds to the agent taking a particular action. DeLU networks implicitly learn closed-form expressions for the incentive compatibility constraints of the agent and the utility maximization objective of the principal, and support parallel inference on each piece through linear programming or interior-point methods that solve for optimal contracts. We provide empirical results that demonstrate success in approximating the principal's utility function with a small number of training samples and scaling to find approximately optimal contracts on problems with a large number of actions and outcomes.

A Finite-Particle Convergence Rate for Stein Variational Gradient Descent
Jiaxin Shi Lester Mackey



Research question: Provide a finite-particle convergence rate for Stein variational gradient descent (SVGD), a popular algorithm for approximating a probability distribution with a collection of particles.
Motivation: Despite SVGD's popularity, no finite-particle convergence rate was previously known.
Method: An explicit, non-asymptotic analysis of SVGD with $n$ particles and an appropriate step size sequence, for target distributions that are sub-Gaussian with a Lipschitz score.
Results: Under these conditions, SVGD drives the kernel Stein discrepancy to zero at a rate of order $1/\sqrt{\log\log n}$; the authors suspect the dependence on $n$ can be improved.

We provide the first finite-particle convergence rate for Stein variational gradient descent (SVGD), a popular algorithm for approximating a probability distribution with a collection of particles. Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with $n$ particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order $1/\sqrt{\log\log n}$ rate. We suspect that the dependence on $n$ can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.
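
For reference, the SVGD update the result concerns can be sketched in a few lines (our sketch with a fixed RBF bandwidth; practical implementations usually use a median-heuristic bandwidth and tuned step sizes):

import numpy as np

def svgd_step(X, score, step=0.1, h=1.0):
    """One SVGD update: phi_i = mean_j [ k_ij * score_j + grad_{x_j} k_ij ]."""
    diff = X[:, None, :] - X[None, :, :]           # x_i - x_j, shape (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / h)           # RBF kernel matrix
    repulsion = (2.0 / h) * (diff * K[:, :, None]).sum(axis=1)  # sum_j grad_{x_j} k_ij
    phi = (K @ score(X) + repulsion) / len(X)
    return X + step * phi

# drive particles toward a standard Gaussian target (score(x) = -x)
rng = np.random.default_rng(0)
X = rng.uniform(-6, 6, size=(200, 1))
for _ in range(500):
    X = svgd_step(X, score=lambda x: -x)
print(X.mean(), X.var())                           # approximately 0 and 1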

Efficient Testable Learning of Halfspaces with Adversarial Label Noise
Ilias Diakonikolas Daniel Kane Vasilis Kontonis Sihan Liu Nikos Zarifis



Research question: Testable learning of halfspaces under the Gaussian distribution in the presence of adversarial label noise.
Motivation: In the recently introduced testable learning model, one must produce a tester-learner such that if the data passes the tester, the output of the robust learner on that data can be trusted.
Method: The algorithm employs an iterative soft localization technique enhanced with appropriate testers that ensure the data distribution is sufficiently similar to a Gaussian.
Results: The tester-learner runs in time $\text{poly}(d/\epsilon)$ and outputs a halfspace with misclassification error $O(\text{opt})+\epsilon$, where $\text{opt}$ is the 0-1 error of the best fitting halfspace; moreover, the algorithm readily adapts to yield an efficient and testable active learner requiring only $d \cdot \text{polylog}(1/\epsilon)$ labeled examples.

We give the first polynomial-time algorithm for the testable learning of halfspaces in the presence of adversarial label noise under the Gaussian distribution. In the recently introduced testable learning model, one is required to produce a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our tester-learner runs in time $\text{poly}(d/\epsilon)$ and outputs a halfspace with misclassification error $O(\text{opt})+\epsilon$, where $\text{opt}$ is the 0-1 error of the best fitting halfspace. At a technical level, our algorithm employs an iterative soft localization technique enhanced with appropriate testers to ensure that the data distribution is sufficiently similar to a Gaussian. Finally, our algorithm can be readily adapted to yield an efficient and testable active learner requiring only $d \cdot \text{polylog}(1/\epsilon)$ labeled examples.

Functional Equivalence and Path Connectivity of Reducible Hyperbolic Tangent Networks
Matthew Farrugia-Roberts



Research question: Understanding the learning process of artificial neural networks requires clarifying the structure of the parameter space within which learning takes place.
Motivation: For many architectures, almost all parameters have a simple and well-documented functional equivalence class, but there is also a vanishing minority of reducible parameters whose functional equivalence classes are richer due to redundancies among the network's units.
Method: This paper gives an algorithmic characterisation of unit redundancies and reducible functional equivalence classes for a single-hidden-layer hyperbolic tangent architecture.
Results: Such functional equivalence classes are shown to be piecewise-linear path-connected sets, and for parameters with a majority of redundant units, the sets have a diameter of at most 7 linear segments.

Understanding the learning process of artificial neural networks requires clarifying the structure of the parameter space within which learning takes place. A neural network parameter's functional equivalence class is the set of parameters implementing the same input--output function. For many architectures, almost all parameters have a simple and well-documented functional equivalence class. However, there is also a vanishing minority of reducible parameters, with richer functional equivalence classes caused by redundancies among the network's units. In this paper, we give an algorithmic characterisation of unit redundancies and reducible functional equivalence classes for a single-hidden-layer hyperbolic tangent architecture. We show that such functional equivalence classes are piecewise-linear path-connected sets, and that for parameters with a majority of redundant units, the sets have a diameter of at most 7 linear segments.

Logarithmic Bayes Regret Bounds
Alexia Atsidakou Branislav Kveton Sumeet Katariya Constantine Caramanis sujay sanghavi



Research question: This paper aims to derive finite-time logarithmic Bayes regret upper bounds for Bayesian multi-armed bandits.
Motivation: Existing $\tilde{O}(\sqrt{n})$ bounds have become standard in the literature despite known lower bounds suggesting that logarithmic rates are achievable.
Method: For an upper confidence bound algorithm in Gaussian bandits, the paper derives $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and on the gaps of random bandit instances sampled from it, respectively; the proofs are simple and general, and the techniques also apply to linear bandits.
Results: The latter bound asymptotically matches the lower bound of Lai (1987); the results provide insights on the value of the prior in the Bayesian setting and significantly improve upon the existing $\tilde{O}(\sqrt{n})$ bounds.

We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In Gaussian bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of random bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of prior in the Bayesian setting, both in the objective and as a side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the existing lower bounds.

Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products
Tamas Sarlos Xingyou Song David Woodruff Qiuyi Zhang



Research question: In the entrywise transformed low rank approximation setting, how to find a good rank-$k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function.
Motivation: Previous work on sublinear low rank approximation has shown that if (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2.376$ is the exponent of matrix multiplication.
Method: We give the first conditional time hardness results for this problem, demonstrating that conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$.
Results: We demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation. Moreover, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, even for a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.

Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2.376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.

On Generalization Bounds for Projective Clustering
Maria Sofia Bucarelli Matilde Fjeldsø Larsen Chris Schwiegelshohn Mads Toftrup



Research question: Given a set of points, how to partition them into $k$ clusters such that each point is assigned to a center that is as close as possible.
Motivation: Most commonly, centers are chosen to be points themselves, which leads to the famous $k$-median and $k$-means objectives; this paper also considers centers that are $j$-dimensional subspaces, giving rise to subspace clustering.
Method: Given $n$ samples $P$ drawn independently from some unknown but fixed distribution $\mathcal{D}$, the paper studies how quickly a solution computed on $P$ converges to the optimal clustering of $\mathcal{D}$.
Results: For center-based objectives, the paper shows a convergence rate of $\tilde{O}(\sqrt{k/n})$; for subspace clustering with $j$-dimensional subspaces, it shows a convergence rate of $\tilde{O}(\sqrt{(kj^2)/n})$. These are the first provable bounds for most of these problems.

Given a set of points, clustering consists of finding a partition of a point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$ dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown, but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near optimal results. In particular, 1. For center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends it to other important objectives such as $k$-median. 2. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{{(kj^2)}/{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show a convergence rate of $\Omega\left(\sqrt{{(kj)}/{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.

A Combinatorial Algorithm for Approximating the Optimal Transport in the Parallel and MPC Settings
Nathaniel Lahn Sharath Raghvendra Kaiyi Zhang



Research question: How to compute the optimal transport distance efficiently and in parallel.
Motivation: Existing exact and approximate combinatorial algorithms are hard to parallelize, which limits the efficiency of computing optimal transport distances.
Method: We introduce the first parallel combinatorial algorithm to find an additive $\varepsilon$-approximation of the OT distance. In Massive Parallel Computation (MPC) frameworks such as Hadoop and MapReduce, the algorithm computes an $\varepsilon$-approximate transport plan in $O(\log(\log(n/\varepsilon))/\varepsilon^2)$ rounds with $O(n/\varepsilon)$ space per machine.
Results: Experiments suggest that our combinatorial algorithm is faster than the state-of-the-art approximate solvers on the GPU, especially for larger values of $n$.

Optimal Transport is a popular distance metric for measuring similarity between distributions. Exact and approximate combinatorial algorithms for computing the optimal transport distance are hard to parallelize. This has motivated the development of numerical solvers (e.g. Sinkhorn method) that can exploit GPU parallelism and produce approximate solutions. We introduce the first parallel combinatorial algorithm to find an additive $\varepsilon$-approximation of the OT distance. The parallel complexity of our algorithm is $O(\log(n)/ \varepsilon^2)$ where $n$ is the total support size for the input distributions. In Massive Parallel Computation (MPC) frameworks such as Hadoop and MapReduce, our algorithm computes an $\varepsilon$-approximate transport plan in $O(\log (\log (n/\varepsilon))/\varepsilon^2)$ rounds with $O(n/\varepsilon)$ space per machine; all prior algorithms in the MPC framework take $\Omega(\log n)$ rounds. We also provide a GPU-friendly matrix-based interpretation of our algorithm where each step of the algorithm is row or column manipulation of the matrix. Experiments suggest that our combinatorial algorithm is faster than the state-of-the-art approximate solvers on the GPU, especially for higher values of $n$.
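As a point of reference, here is a minimal sketch of the Sinkhorn method mentioned above as the GPU-friendly numerical baseline. This is the standard entropic-regularization solver, not the paper's combinatorial algorithm; the regularization strength and the 1-d cost are illustrative.

```python
import numpy as np

def sinkhorn_plan(mu, nu, C, reg=0.05, iters=500):
    """Entropic OT: alternate diagonal scalings until diag(u) K diag(v)
    approximately has the prescribed marginals mu and nu."""
    K = np.exp(-C / reg)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]      # approximate transport plan

n = 64
mu = np.full(n, 1.0 / n)                    # uniform source histogram
nu = np.full(n, 1.0 / n)                    # uniform target histogram
C = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / n  # 1-d cost
P = sinkhorn_plan(mu, nu, C)
approx_cost = (P * C).sum()
```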

Faster Relative Entropy Coding with Greedy Rejection Coding
Gergely Flamich Stratis Markou José Miguel Hernández-Lobato



Research question: This paper aims to address the prohibitively slow runtimes and restrictive assumptions of relative entropy coding (REC) algorithms.
Motivation: REC encodes a sample from a target distribution $Q$ using a proposal distribution $P$ in as few bits as possible and, unlike entropy coding, requires neither discrete distributions nor quantisation, so it integrates naturally into pipelines such as learnt compression and differentially private federated learning; despite these practical benefits, REC algorithms have not seen widespread application.
Method: The paper introduces Greedy Rejection Coding (GRC), a rejection sampling-based algorithm that applies to arbitrary probability spaces and partitioning schemes. GRC is first shown to terminate almost surely and return unbiased samples from $Q$; the paper then focuses on two variants, GRCS and GRCD.
Results: For continuous $Q$ and $P$ over $\mathbb{R}$ with unimodal $dQ/dP$, the expected runtime of GRCS is upper bounded by $\beta D_{KL}(Q||P) + \mathcal{O}(1)$, where $\beta \approx 4.82$, and its expected codelength is optimal. This significantly improves upon the previous state-of-the-art method, A* coding (Flamich et al., 2022). Under the same assumptions, the expected runtime and codelength of GRCD are experimentally observed and conjectured to be upper bounded by $D_{KL}(Q||P) + \mathcal{O}(1)$. Finally, GRC is evaluated with variational autoencoders on MNIST, showing that a modified training objective and a codelength-compression method can further improve compression efficiency.

Relative entropy coding (REC) algorithms encode a sample from a target distribution $Q$ using a proposal distribution $P$ using as few bits as possible. Unlike entropy coding, REC does not assume discrete distributions and require quantisation. As such, it can be naturally integrated into communication pipelines such as learnt compression and differentially private federated learning. Unfortunately, despite their practical benefits, REC algorithms have not seen widespread application, due to their prohibitively slow runtimes or restrictive assumptions. In this paper, we make progress towards addressing these issues. We introduce Greedy Rejection Coding (GRC), which generalises the rejection sampling-based algorithm of Harsha et al. (2007) to arbitrary probability spaces and partitioning schemes. We first show that GRC terminates almost surely and returns unbiased samples from $Q$, and then focus on two variants of GRC, namely GRCS and GRCD. We show that for continuous $Q$ and $P$ over $\mathbb{R}$ with unimodal $dQ/dP$, the expected runtime of GRCS is upper bounded by $\beta D_{KL}(Q||P) + \mathcal{O}(1)$ where $\beta \approx 4.82$, and its expected codelength is optimal. This makes GRCS the first REC algorithm with guaranteed optimal runtime for this class of distributions, up to the multiplicative constant $\beta$. This significantly improves upon the previous state-of-the-art method, A* coding (Flamich et al., 2022). Under the same assumptions, we experimentally observe and conjecture that the expected runtime and codelength of GRCD are upper bounded by $D_{KL}(Q||P) + \mathcal{O}(1)$. Finally, we evaluate GRC in a compression pipeline with variational autoencoders on MNIST, and show that a modified training objective and a codelength-compression method can further improve compression efficiency.
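To make the rejection-sampling view concrete, here is a minimal sketch of a global-bound rejection sampler for REC in the spirit of Harsha et al. (2007): encoder and decoder share a PRNG seed, so only the index of the first accepted proposal needs to be communicated. This is the simple scheme that GRC generalizes, not GRC itself; the distributions and the bound `M` are illustrative.

```python
import numpy as np
from scipy import stats

def rec_encode(q, p, M, seed=0, max_tries=10**6):
    """Draw X_i ~ P; accept with prob (dQ/dP)(X_i) / M. The accepted sample
    is exactly Q-distributed; its index is the message (~log2 M bits)."""
    rng = np.random.default_rng(seed)
    for i in range(max_tries):
        x = p.rvs(random_state=rng)
        if rng.random() < q.pdf(x) / (M * p.pdf(x)):
            return i
    raise RuntimeError("M too small or unlucky run")

def rec_decode(index, p, seed=0):
    """Replay the shared randomness to recover the accepted sample."""
    rng = np.random.default_rng(seed)
    for i in range(index + 1):
        x = p.rvs(random_state=rng)
        rng.random()                     # keep accept/reject draws in sync
    return x

q, p = stats.norm(1.0, 0.5), stats.norm(0.0, 1.0)
idx = rec_encode(q, p, M=4.0)            # M >= sup dQ/dP (about 3.9 here)
sample = rec_decode(idx, p)              # decoder reproduces the same sample
```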

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems
Chaoyue Liu Dmitriy Drusvyatskiy Misha Belkin Damek Davis Yian Ma



Research question: How to endow stochastic gradient descent in the interpolation regime with the same worst-case iteration complexity as deterministic gradient descent.
Motivation: All existing guarantees require the stochastic gradient method to take small steps, resulting in a much slower rate of convergence.
Method: The paper proposes a regularity condition within the interpolation regime under which the stochastic gradient method, using only a single sampled gradient (or a minibatch) per iteration, attains the same worst-case iteration complexity as the deterministic gradient method.
Results: The condition is shown to hold when training sufficiently wide feedforward neural networks with a linear output layer.

Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.

Near-Optimal $k$-Clustering in the Sliding Window Model
David Woodruff Peilin Zhong Samson Zhou



Research question: How to achieve near-optimal $(k,z)$-clustering in the sliding window model.
Motivation: In many applications, recent data provide more accurate information while older data past a certain time expire. The sliding window model captures these desired properties, so there has been substantial interest in clustering in this model.
Method: The paper gives the first algorithm achieving a near-optimal $(1+\varepsilon)$-approximation to $(k,z)$-clustering in the sliding window model. The algorithm uses $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\text{polylog}\frac{n\Delta}{\varepsilon}$ words of space when the points are from $[\Delta]^d$, significantly improving on the works of Braverman et al. (SODA 2016), Borassi et al. (NeurIPS 2021), and Epasto et al. (SODA 2022).
Results: Along the way, the paper develops a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream but for all prefixes of the stream, sampling $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\text{polylog}\frac{n\Delta}{\varepsilon}$ points from the stream. It then shows that any online coreset requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ samples, separating this task from offline coreset construction, i.e., constructing online coresets is strictly harder. The results also extend to general metrics on $[\Delta]^d$ and are near-optimal in light of an $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ lower bound on the size of an offline coreset.

Clustering is an important technique for identifying structural information in large-scale data analysis, where the underlying dataset may be too large to store. In many applications, recent data can provide more accurate information and thus older data past a certain time is expired. The sliding window model captures these desired properties and thus there has been substantial interest in clustering in the sliding window model. In this paper, we give the first algorithm that achieves near-optimal $(1+\varepsilon)$-approximation to $(k,z)$-clustering in the sliding window model. Our algorithm uses $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\text{polylog}\frac{n\Delta}{\varepsilon}$ words of space when the points are from $[\Delta]^d$, thus significantly improving on works by Braverman et al. (SODA 2016), Borassi et al. (NeurIPS 2021), and Epasto et al. (SODA 2022). Along the way, we develop a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream, but also for all prefixes of the stream. Our online coreset samples $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\text{polylog}\frac{n\Delta}{\varepsilon}$ points from the stream. We then show that any online coreset requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ samples, which shows a separation from the problem of constructing an offline coreset, i.e., constructing online coresets is strictly harder. Our results also extend to general metrics on $[\Delta]^d$ and are near-optimal in light of an $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ lower bound for the size of an offline coreset.

The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model
Laixi Shi Gen Li Yuting Wei Yuxin Chen Matthieu Geist Yuejie Chi



Research question: This paper investigates model robustness in reinforcement learning via the framework of distributionally robust Markov decision processes (RMDPs).
Motivation: Despite recent efforts, the sample complexity of RMDPs remains poorly understood regardless of the uncertainty set in use; in particular, there are large gaps between existing upper and lower bounds, and it is unclear whether distributional robustness bears any statistical implications when benchmarked against standard RL.
Method: Assuming access to a generative model, the paper uses a model-based algorithm called distributionally robust value iteration to derive the sample complexity of RMDPs when the uncertainty set is measured via either total variation or $\chi^2$ divergence over the full range of uncertainty levels, and develops minimax lower bounds to benchmark its tightness.
Results: The results not only strengthen the prior art in both upper and lower bounds but also deliver the surprising message that learning RMDPs is not necessarily easier or harder than learning standard MDPs. In the total variation case, the paper establishes the minimax-optimal sample complexity of RMDPs, which is always smaller than that of standard MDPs. In the $\chi^2$ divergence case, it establishes a sample complexity of RMDPs that is tight up to polynomial factors of the effective horizon and grows linearly with the uncertainty level as the latter approaches infinity.

This paper investigates model robustness in reinforcement learning (RL) via the framework of distributionally robust Markov decision processes (RMDPs). Despite recent efforts, the sample complexity of RMDPs is much less understood regardless of the uncertainty set in use; in particular, there exist large gaps between existing upper and lower bounds, and it is unclear if distributional robustness bears any statistical implications when benchmarked against standard RL. In this paper, assuming access to a generative model, we derive the sample complexity of RMDPs---when the uncertainty set is measured via either total variation or $\chi^2$ divergence over the full range of uncertainty levels---using a model-based algorithm called distributionally robust value iteration, and develop minimax lower bounds to benchmark its tightness. Our results not only strengthen the prior art in both directions of upper and lower bounds, but also deliver surprising messages that learning RMDPs is not necessarily easier or more difficult than standard MDPs. In the case of total variation, we establish the minimax-optimal sample complexity of RMDPs which is always smaller than that of standard MDPs. In the case of $\chi^2$ divergence, we establish the sample complexity of RMDPs that is tight up to polynomial factors of the effective horizon, and grows linearly with respect to the uncertainty level when it approaches infinity.

Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits
Dorian Baudry Fabien Pesquerel Rémy Degenne Odalric-Ambrym Maillard



Research question: Regret minimization in non-parametric stochastic bandits.
Motivation: When the rewards are known to be bounded from above, there exist asymptotically optimal algorithms whose asymptotic regret depends on an infimum of Kullback-Leibler divergences (KL). These algorithms are computationally expensive and require storing all past rewards, so simpler but non-optimal algorithms are often used instead.
Method: The paper introduces several methods to approximate the infimum KL, drastically reducing the computational and memory costs of existing optimal algorithms while keeping their regret guarantees, and applies these findings to design new variants of the MED and IMED algorithms.
Results: Extensive numerical simulations demonstrate the interest of the new MED and IMED variants, confirming the effectiveness of the approach.

We consider the problem of regret minimization in non-parametric stochastic bandits. When the rewards are known to be bounded from above, there exists asymptotically optimal algorithms, with asymptotic regret depending on an infimum of Kullback-Leibler divergences (KL). These algorithms are computationally expensive and require storing all past rewards, thus simpler but non-optimal algorithms are often used instead. We introduce several methods to approximate the infimum KL which reduce drastically the computational and memory costs of existing optimal algorithms, while keeping their regret guarantees. We apply our findings to design new variants of the MED and IMED algorithms, and demonstrate their interest with extensive numerical simulations.

Finding Local Minima Efficiently in Decentralized Optimization
Wenhan Xian Heng Huang



Research question: This paper studies the second-order optimality of decentralized stochastic algorithms for nonconvex optimization, i.e., escaping saddle points efficiently.
Motivation: Existing decentralized stochastic algorithms face technical challenges unique to the decentralized stochastic setting and lack second-order optimality guarantees with non-asymptotic analysis.
Method: The paper proposes PEDESTAL, a new purely gradient-based decentralized stochastic algorithm, together with a novel convergence analysis framework that addresses these challenges.
Results: PEDESTAL is the first decentralized stochastic algorithm to achieve second-order optimality with non-asymptotic analysis. It is guaranteed a gradient complexity of $\tilde{O}(\epsilon^{-3})$ to find an $O(\epsilon, \sqrt{\epsilon})$-second-order stationary point, matching state-of-the-art results of centralized counterparts or of decentralized methods that find first-order stationary points. Experiments on two decentralized tasks, a matrix sensing task with synthetic data and a matrix factorization task with a real-world dataset, validate the method's performance.

In this paper we study the second-order optimality of decentralized stochastic algorithm that escapes saddle point efficiently for nonconvex optimization problems. We propose a new pure gradient-based decentralized stochastic algorithm PEDESTAL with a novel convergence analysis framework to address the technical challenges unique to the decentralized stochastic setting. Our method is the first decentralized stochastic algorithm to achieve second-order optimality with non-asymptotic analysis. We provide theoretical guarantees with the gradient complexity of $\tilde{O} (\epsilon^{-3})$ to find $O(\epsilon, \sqrt{\epsilon})$-second-order stationary point, which matches state-of-the-art results of centralized counterparts or decentralized methods to find first-order stationary point. We also conduct two decentralized tasks in our experiments, a matrix sensing task with synthetic data and a matrix factorization task with a real-world dataset to validate the performance of our method.

Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization
Quanqi Hu Dixian Zhu Tianbao Yang



Research question: This paper investigates a new family of compositional optimization problems: non-smooth weakly-convex finite-sum coupled compositional optimization (NSWC FCCO).
Motivation: Interest in FCCO is growing due to its wide-ranging applications in machine learning and AI and its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO assumes that both the inner and outer functions are smooth, limiting its applicability to a more diverse set of problems.
Method: The paper studies non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing and the inner function is weakly convex. It analyzes a single-loop algorithm and establishes its complexity for finding an $\epsilon$-stationary point of the Moreau envelope of the objective, and further extends the algorithm to novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems featuring a nested arrangement of three functions.
Results: Empirical studies of applications in deep learning, namely two-way partial AUC maximization and multi-instance two-way partial AUC maximization, showcase the effectiveness of the proposed algorithms.

This paper investigates new families of compositional optimization problems, called non-smooth weakly-convex finite-sum coupled compositional optimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly-convex. We analyze a single-loop algorithm and establish its complexity for finding an $\epsilon$-stationary point of the Moreau envelop of the objective function. Additionally, we also extend the algorithm for solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.

On Proper Learnability between Average- and Worst-case Robustness
Vinod Raman UNIQUE SUBEDI Ambuj Tewari



Research question: This paper studies which relaxations of the worst-case robust loss in the PAC learning setup make proper learning possible.
Motivation: Montasser et al. (2019) showed that finite VC dimension is not sufficient for proper adversarially robust PAC learning. In light of this hardness, there is a growing effort to study what relaxations of the adversarially robust PAC learning setup can enable proper learnability.
Method: The paper gives a family of robust loss relaxations under which VC classes are properly PAC learnable with sample complexity close to what one would require in the standard PAC learning setup.
Results: For an existing and natural relaxation of the worst-case robust loss, finite VC dimension is shown to be insufficient for proper learning. Finally, the paper gives new generalization guarantees for the adversarially robust empirical risk minimizer.

Recently, Montasser et al. (2019) showed that finite VC dimension is not sufficient for proper adversarially robust PAC learning. In light of this hardness, there is a growing effort to study what type of relaxations to the adversarially robust PAC learning setup can enable proper learnability. In this work, we initiate the study of proper learning under relaxations of the worst-case robust loss. We give a family of robust loss relaxations under which VC classes are properly PAC learnable with sample complexity close to what one would require in the standard PAC learning setup. On the other hand, we show that for an existing and natural relaxation of the worst-case robust loss, finite VC dimension is not sufficient for proper learning. Lastly, we give new generalization guarantees for the adversarially robust empirical risk minimizer.

Global Identifiability of $\ell_1$-based Dictionary Learning via Matrix Volume Optimization
Jingzhou Hu Kejun Huang



Research question: The paper proposes a novel dictionary learning formulation that minimizes the determinant of the dictionary matrix (also known as its volume), subject to the constraint that each row of the sparse coefficient matrix has unit $\ell_1$ norm.
Motivation: The proposed formulation guarantees global identifiability of the groundtruth dictionary and sparse coefficient matrices whenever a set of vectors obtained from the coefficient matrix lies inside the $\ell_\infty$ norm ball but contains the $\ell_2$ norm ball in its convex hull.
Method: The paper proposes an algorithm based on linearized-ADMM with efficient per-iteration updates.
Results: Numerical experiments show that the proposed algorithm is surprisingly effective at correctly and efficiently recovering the dictionary.

We propose a novel formulation for dictionary learning that minimizes the determinant of the dictionary matrix, also known as its volume, subject to the constraint that each row of the sparse coefficient matrix has unit $\ell_1$ norm. The main motivation for the proposed formulation is that it provides global identifiability guarantee of the groundtruth dictionary and sparse coefficient matrices, up to the inherent and inconsequential permutation and scaling ambiguity, if a set of vectors obtained from the coefficient matrix lies inside the $\ell_\infty$ norm ball but contains the $\ell_2$ norm ball in their convex hull. Unlike existing work on identifiability of dictionary learning, our result is global, meaning that a globally optimal solution to our proposed formulation has to be a permuted and rescaled version of the groundtruth factors. Another major improvement in our result is that there is no additional assumption on the dictionary matrix other than it is nonsingular, unlike most other work that require the atoms of the dictionary to be mutually incoherent. We also provide a probabilistic analysis and show that if the sparse coefficient matrix is generated from the widely adopted Bernoulli-Gaussian model, then it is globally identifiable if the sample size is bigger than a constant times $k\log k$, where $k$ is the number of atoms in the dictionary, with overwhelming probability. The bound is essentially the same as those local identifiability results, but we show that it is also global. Finally, we propose algorithms to solve the new proposed formulation, specifically one based on the linearized-ADMM with efficient per-iteration updates. The proposed algorithms exhibit surprisingly effective performance in correctly and efficiently recovering the dictionary, as demonstrated in the numerical experiments.

High-dimensional Contextual Bandit Problem without Sparsity
Junpei Komiyama Masaaki Imaizumi



Research question: This work investigates the high-dimensional linear contextual bandit problem where the number of features $p$ is greater than the budget $T$, or may even be infinite.
Motivation: Unlike most previous works in this field, no sparsity is imposed on the regression coefficients. Instead, the analysis relies on recent findings on overparameterized models, which enable characterizing the performance of the minimum-norm interpolating estimator when data distributions have small effective ranks.
Method: The paper proposes an explore-then-commit (EtC) algorithm and examines its performance, deriving the optimal rate of the EtC algorithm in terms of $T$ and showing that this rate can be achieved by balancing exploration and exploitation. It further introduces an adaptive explore-then-commit (AEtC) algorithm that adaptively finds the optimal balance.
Results: The performance of the proposed algorithms is assessed through a series of simulations.

In this research, we investigate the high-dimensional linear contextual bandit problem where the number of features $p$ is greater than the budget $T$, or it may even be infinite. Differing from the majority of previous works in this field, we do not impose sparsity on the regression coefficients. Instead, we rely on recent findings on overparameterized models, which enables us to analyze the performance of the minimum-norm interpolating estimator when data distributions have small effective ranks. We propose an explore-then-commit (EtC) algorithm to address this problem and examine its performance. Through our analysis, we derive the optimal rate of the ETC algorithm in terms of $T$ and show that this rate can be achieved by balancing exploration and exploitation. Moreover, we introduce an adaptive explore-then-commit (AEtC) algorithm that adaptively finds the optimal balance. We assess the performance of the proposed algorithms through a series of simulations.
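A minimal simulation of the explore-then-commit template with a minimum-norm (ridgeless) interpolator in the regime $p > T$ discussed above; the instance sizes and noise level are illustrative, and the adaptive balancing of AEtC is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_arms, T, T_explore = 5000, 10, 2000, 400    # more features than budget
arms = rng.normal(size=(n_arms, p)) / np.sqrt(p)
theta_star = rng.normal(size=p)

X, y, reward = [], [], 0.0
for t in range(T_explore):                        # exploration: uniform arms
    a = rng.integers(n_arms)
    r = arms[a] @ theta_star + 0.1 * rng.normal()
    X.append(arms[a]); y.append(r); reward += r

# Minimum-norm interpolating estimator: theta = X^+ y (no sparsity assumed).
theta_hat = np.linalg.pinv(np.array(X)) @ np.array(y)
a_commit = int(np.argmax(arms @ theta_hat))       # commit for remaining rounds
for t in range(T - T_explore):
    reward += arms[a_commit] @ theta_star + 0.1 * rng.normal()
```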

Contextual Stochastic Bilevel Optimization
Yifan Hu Jie Wang Yao Xie Andreas Krause Daniel Kuhn



Research question: This paper introduces contextual stochastic bilevel optimization (CSBO), a framework for settings where the lower-level decision depends not only on the upper-level decision but also on some side information.
Motivation: When the lower-level decision maker's optimal response depends on contextual information in addition to the upper-level decision, existing single-loop methods for classical stochastic bilevel optimization are unable to converge.
Method: The paper introduces an efficient double-loop gradient method based on the Multilevel Monte-Carlo (MLMC) technique and establishes its sample and computational complexities.
Results: For meta-learning, the complexity of the method does not depend on the number of tasks. Numerical experiments further validate the theoretical results.

We introduce contextual stochastic bilevel optimization (CSBO) -- a stochastic bilevel optimization framework with the lower-level problem minimizing an expectation conditioned on some contextual information and the upper-level decision variable. This framework extends classical stochastic bilevel optimization when the lower-level decision maker responds optimally not only to the decision of the upper-level decision maker but also to some side information and when there are multiple or even infinite many followers. It captures important applications such as meta-learning, personalized federated learning, end-to-end learning, and Wasserstein distributionally robust optimization with side information (WDRO-SI). Due to the presence of contextual information, existing single-loop methods for classical stochastic bilevel optimization are unable to converge. To overcome this challenge, we introduce an efficient double-loop gradient method based on the Multilevel Monte-Carlo (MLMC) technique and establish its sample and computational complexities. When specialized to stochastic nonconvex optimization, our method matches existing lower bounds. For meta-learning, the complexity of our method does not depend on the number of tasks. Numerical experiments further validate our theoretical results.

Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
Junchi YANG Xiang Li Ilyas Fatkhullin Niao He



Research question: How does stochastic gradient descent (SGD) behave when its learning rate is not tuned to unknown problem parameters, such as the Lipschitz smoothness constant, and can adaptive methods avoid the resulting pitfalls?
Motivation: The classical analysis of SGD with a polynomially decaying stepsize relies on a well-tuned stepsize scale that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice.
Method: The paper studies untuned SGD, i.e., SGD run with an arbitrary stepsize scale $\eta > 0$. While untuned SGD still attains an order-optimal convergence rate for minimizing smooth objectives, this comes at the expense of a catastrophic exponential dependence on the smoothness constant, which is shown to be unavoidable for this scheme even in the noiseless setting.
Results: The paper then examines three families of adaptive methods, Normalized SGD, AMSGrad, and AdaGrad, and shows that they prevent this exponential dependence without knowledge of the smoothness parameter or boundedness of the stochastic gradients. This provides theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on well-tuned $\eta$ depending on problem parameters such as Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{\mathcal{O}}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods — Normalized SGD (NSGD), AMSGrad, and AdaGrad — unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
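The contrast can be seen in a toy experiment. The sketch below runs untuned SGD and Normalized SGD on a one-dimensional quadratic whose smoothness constant L is treated as unknown: untuned SGD's transient blows up by a factor exponential in L before the decaying stepsize rescues it, while NSGD stays bounded. All parameters are illustrative.

```python
import numpy as np

def run(method, eta=1.0, L=20.0, T=1000, sigma=0.1, seed=0):
    """Minimize f(x) = (L/2) x^2 with stepsize eta_t = eta / sqrt(t), without
    tuning eta to L. Returns (worst iterate magnitude, final gradient norm)."""
    rng = np.random.default_rng(seed)
    x, worst = 2.0, 2.0
    for t in range(1, T + 1):
        g = L * x + sigma * rng.normal()           # stochastic gradient
        if method == "sgd":                        # untuned SGD
            x -= eta / np.sqrt(t) * g
        else:                                      # Normalized SGD
            x -= eta / np.sqrt(t) * g / (abs(g) + 1e-12)
        worst = max(worst, abs(x))
    return worst, abs(L * x)

print(run("sgd"))   # enormous transient: worst iterate exponential in L
print(run("nsgd"))  # normalization keeps the trajectory bounded
```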

Horospherical Decision Boundaries for Large Margin Classification in Hyperbolic Space
Xiran Fan Chun-Hao Yang Baba C. Vemuri



Research question: How to represent hierarchically organized data in hyperbolic space and design effective classification algorithms for such data.
Motivation: Hyperbolic spaces are advantageous for representing hierarchical data, but existing classification algorithms use hyperplane or geodesic decision boundaries in a large margin setting, which leads to non-convex optimization problems.
Method: The paper proposes a large margin classifier based on horospherical decision boundaries whose training objective is geodesically convex, so any Riemannian gradient descent technique reaches a globally optimal solution.
Results: Experiments show the competitive performance of the classifier in comparison to SOTA.

Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we propose a novel large margin classifier based on horospherical decision boundaries that leads to a geodesically convex optimization problem that can be optimized using any Riemannian gradient descent technique guaranteeing a globally optimal solution. We present several experiments depicting the competitive performance of our classifier in comparison to SOTA.

Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback
Yang Cai Haipeng Luo Chen-Yu Wei Weiqiang Zheng



Research question: This paper addresses learning in two-player zero-sum Markov games, in particular developing an uncoupled, convergent, and rational algorithm with non-asymptotic convergence rates to Nash equilibrium.
Motivation: Existing algorithms require synchronization and prior knowledge, and no finite last-iterate convergence rates were known given access to only bandit feedback.
Method: The paper starts from the stateless matrix game with bandit feedback as a warm-up, showing an $\tilde{\mathcal{O}}(t^{-\frac{1}{8}})$ last-iterate convergence rate. It then extends to irreducible Markov games, providing a last-iterate convergence rate of $\tilde{\mathcal{O}}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, for Markov games without any assumptions on the dynamics, it shows a rate of $\tilde{\mathcal{O}}(t^{-\frac{1}{10}})$ for path convergence, a new notion of convergence the paper defines.
Results: The algorithm removes the synchronization and prior knowledge requirements of Wei et al. (2021), which pursued the same goals for irreducible Markov games, and it establishes finite convergence rates under bandit feedback, which prior work had not achieved.

We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is *uncoupled*, *convergent*, and *rational*, with non-asymptotic convergence rates to Nash equilibrium. We start from the case of stateless matrix game with bandit feedback as a warm-up, showing an $\tilde{\mathcal{O}}(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of $\tilde{\mathcal{O}}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics, and show a *path convergence* rate, a new notion of convergence we defined, of $\tilde{\mathcal{O}}(t^{-\frac{1}{10}})$. Our algorithm removes the synchronization and prior knowledge requirement of Wei et al. (2021), which pursued the same goals as us for irreducible Markov games. Our algorithm is related to Chen et al. (2021) and Cen et al. (2021)'s and also builds on the entropy regularization technique. However, we remove their requirement of communications on the entropy values, making our algorithm entirely uncoupled.

Adaptive Principal Component Regression with Applications to Panel Data
Anish Agarwal Keegan Harris Justin Whitehouse Steven Wu



Research question: This paper aims to provide the first time-uniform finite sample guarantees for online (regularized) PCR whenever data is collected adaptively.
Motivation: PCR is a popular technique for fixed-design error-in-variables regression, in which the observed covariates are corrupted with random noise, but existing proof techniques for PCR in the fixed design setting do not readily extend to the online setting.
Method: The paper adapts tools from modern martingale concentration to the error-in-variables setting to provide time-uniform finite sample guarantees for online (regularized) PCR.
Results: As an application of the bounds, the paper provides a framework for counterfactual estimation of unit-specific treatment effects in panel data settings when interventions are assigned via an adaptive intervention assignment policy.

Principal component regression (PCR) is a popular technique for fixed-design error-in-variables regression, a generalization of the linear regression setting in which the observed covariates are corrupted with random noise. We provide the first time-uniform finite sample guarantees for online (regularized) PCR whenever data is collected adaptively. Since the proof techniques for PCR in the fixed design setting do not readily extend to the online setting, our results rely on adapting tools from modern martingale concentration to the error-in-variables setting. As an application of our bounds, we provide a framework for counterfactual estimation of unit-specific treatment effects in panel data settings when interventions are assigned adaptively. Our framework may be thought of as a generalization of the synthetic interventions framework where data is collected via an adaptive intervention assignment policy.
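As background for readers less familiar with PCR, here is a minimal offline, fixed-design sketch of regularized principal component regression under error-in-variables; the online, adaptively collected setting analyzed in the paper adds the martingale machinery, which is not reproduced here. Sizes and noise levels are illustrative.

```python
import numpy as np

def pcr_fit(X, y, k, lam=0.0):
    """Principal component regression: project noisy covariates onto their
    top-k principal subspace, then (ridge-)regress y on the projections."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T                        # top-k right singular vectors
    Z = X @ Vk                           # scores in the principal subspace
    beta_k = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)
    return Vk @ beta_k                   # coefficients in the original space

# Error-in-variables data: we observe X = X_true + noise.
rng = np.random.default_rng(0)
n, d, r = 500, 30, 3
X_true = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # low-rank signal
beta = rng.normal(size=d)
X = X_true + 0.3 * rng.normal(size=(n, d))
y = X_true @ beta + 0.1 * rng.normal(size=n)
beta_hat = pcr_fit(X, y, k=r, lam=1e-3)
```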

Delayed Algorithms for Distributed Stochastic Weakly Convex Optimization
Wenzhi Gao Qi Deng



Research question: This paper studies delayed stochastic algorithms for weakly convex optimization in a distributed network with workers connected to a master node.
Motivation: Xu et al. (2022) showed that an inertial stochastic subgradient method converges at a rate of $\mathcal{O}(\tau_{\text{max}}/\sqrt{K})$, which depends on the maximum information delay $\tau_{\text{max}}$.
Method: The paper shows that the delayed stochastic subgradient method (DSGD) obtains a tighter convergence rate that depends on the expected delay $\bar{\tau}$. For an important class of composite weakly convex problems, it develops a new delayed stochastic prox-linear method (DSPL) in which the delays only affect the high-order term of the rate and are therefore negligible after a certain number of DSPL iterations.
Results: By incorporating a simple safeguarding step in both methods, convergence rates are achieved that depend solely on the number of workers, eliminating the effect of delays. Numerical experiments further confirm the empirical superiority of the proposed methods.

This paper studies delayed stochastic algorithms for weakly convex optimization in a distributed network with workers connected to a master node. Recently, Xu~et~al.~2022 showed that an inertial stochastic subgradient method converges at a rate of $\mathcal{O}(\tau_{\text{max}}/\sqrt{K})$ which depends on the maximum information delay $\tau_{\text{max}}$. In this work, we show that the delayed stochastic subgradient method ($\texttt{DSGD}$) obtains a tighter convergence rate which depends on the expected delay $\bar{\tau}$. Furthermore, for an important class of composition weakly convex problems, we develop a new delayed stochastic prox-linear ($\texttt{DSPL}$) method in which the delays only affect the high-order term in the rate and hence, are negligible after a certain number of $\texttt{DSPL}$ iterations. In addition, we demonstrate the robustness of our proposed algorithms against arbitrary delays. By incorporating a simple safeguarding step in both methods, we achieve convergence rates that depend solely on the number of workers, eliminating the effect of delays. Our numerical experiments further confirm the empirical superiority of our proposed methods.
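A minimal sketch of the delayed update rule common to this family of methods: at step t, the master applies a stochastic gradient that a worker computed at the stale iterate x_{t-tau_t}. The delay distribution and objective are illustrative; the safeguarding step and the prox-linear variant are not reproduced.

```python
import numpy as np

def delayed_sgd(grad, x0, delays, lr=0.05):
    """Apply gradients evaluated at stale iterates x_{t - tau_t}."""
    xs = [np.asarray(x0, dtype=float)]
    for t, tau in enumerate(delays):
        stale = xs[max(0, t - int(tau))]     # iterate the worker actually saw
        xs.append(xs[-1] - lr * grad(stale))
    return xs[-1]

rng = np.random.default_rng(0)
delays = rng.geometric(0.3, size=2000)       # random delays, mean about 3.3
x = delayed_sgd(lambda z: 2.0 * z + 0.1 * rng.normal(size=z.shape),
                np.ones(10), delays)
```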

Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits
Haolin Liu Chen-Yu Wei Julian Zimmert



Research question: The paper considers the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e., the context) is drawn from a fixed distribution.
Motivation: Existing methods either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret no better than $\tilde{\mathcal{O}}(T^{5/6})$, or are computationally inefficient.
Method: The paper greatly improves these results by achieving a regret of $\tilde{\mathcal{O}}(\sqrt{T})$ without a simulator, while maintaining computational efficiency when the action set in each round is small.
Results: In the special case of sleeping bandits with adversarial loss and stochastic arm availability, the result answers affirmatively the open question of [SGV20] on whether there exists a polynomial-time algorithm with $\text{poly}(d)\sqrt{T}$ regret. The approach naturally handles losses that are linear up to an additive misspecification error, and the regret shows near-optimal dependence on the magnitude of the error.

We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e. the context) is drawn from a fixed distribution. Existing methods for this problem either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret no better than $\tilde{\mathcal{O}}(T^{\frac{5}{6}})$, or are computationally inefficient. We greatly improve these results by achieving a regret of $\tilde{\mathcal{O}}(\sqrt{T})$ without a simulator, while maintaining computational efficiency when the action set in each round is small. In the special case of sleeping bandits with adversarial loss and stochastic arm availability, our result answers affirmatively the open question by [SGV20] on whether there exists a polynomial-time algorithm with $poly(d)\sqrt{T}$ regret. Our approach naturally handles the case where the loss is linear up to an additive misspecification error, and our regret shows near-optimal dependence on the magnitude of the error.

Trading-off price for data quality to achieve fair online allocation
Mathieu Molina Nicolas Gast Patrick Loiseau Vianney Perchet



Research question: Online allocation subject to a long-term fairness penalty, where the decision-maker cannot observe the protected attributes.
Motivation: Contrary to existing works, the decision-maker is not assumed to observe the protected attributes; instead, they can purchase data of varying quality to estimate them, and hence reduce the fairness penalty at some cost.
Method: The problem is modeled as a multi-armed bandit in which each arm corresponds to the choice of a data source, coupled with the fair online allocation problem; the paper proposes an algorithm that jointly solves both problems and shows that its regret is bounded by $\mathcal{O}(\sqrt{T})$.
Results: A key difficulty is that the rewards received by selecting a source are correlated through the fairness penalty, which requires randomization despite the stochastic setting. The algorithm takes into account contextual information available before the source selection and can adapt to many different fairness notions.

We consider the problem of online allocation subject to a long-term fairness penalty. Contrary to existing works, however, we do not assume that the decision-maker observes the protected attributes---which is often unrealistic in practice. Instead they can purchase data that help estimate them from sources of different quality; and hence reduce the fairness penalty at some cost. We model this problem as a multi-armed bandit problem where each arm corresponds to the choice of a data source, coupled with the fair online allocation problem. We propose an algorithm that jointly solves both problems and show that it has a regret bounded by $\mathcal{O}(\sqrt{T})$. A key difficulty is that the rewards received by selecting a source are correlated by the fairness penalty, which leads to a need for randomization (despite a stochastic setting). Our algorithm takes into account contextual information available before the source selection, and can adapt to many different fairness notions.

Sample Complexity of Goal-Conditioned Hierarchical Reinforcement Learning
Arnaud Robert Ciara Pike-Burke Aldo A. Faisal



Research question: This paper aims to understand the efficiency gains of planning at multiple levels of abstraction in reinforcement learning and to derive theoretically grounded design rules.
Motivation: Although hierarchical reinforcement learning algorithms have shown significant improvements in sample efficiency, we still lack a complete understanding of the basis of those gains and of any theoretically grounded design rules.
Method: The paper derives a lower bound on the sample complexity for a class of goal-conditioned hierarchical reinforcement learning algorithms, which quantifies the benefits of hierarchical decomposition and leads to a simple Q-learning-type algorithm that leverages hierarchical decompositions.
Results: The theoretical findings are validated empirically on a spectrum of tasks, including hierarchical $n$-room tasks and Gymnasium's Taxi, taking a step towards quantifying the improvement that hierarchical decomposition offers over monolithic solutions in reinforcement learning.

Hierarchical Reinforcement Learning (HRL) algorithms can perform planning at multiple levels of abstraction. Empirical results have shown that state or temporal abstractions might significantly improve the sample efficiency of algorithms. Yet, we still do not have a complete understanding of the basis of those efficiency gains nor any theoretically grounded design rules. In this paper, we derive a lower bound on the sample complexity for the considered class of goal-conditioned HRL algorithms. The proposed lower bound empowers us to quantify the benefits of hierarchical decomposition and leads to the design of a simple Q-learning-type algorithm that leverages hierarchical decompositions. We empirically validate our theoretical findings by investigating the sample complexity of the proposed hierarchical algorithm on a spectrum of tasks (hierarchical $n$-rooms, Gymnasium's Taxi). The hierarchical $n$-rooms tasks were designed to allow us to dial their complexity over multiple orders of magnitude. Our theory and algorithmic findings provide a step towards answering the foundational question of quantifying the improvement hierarchical decomposition offers over monolithic solutions in reinforcement learning.

Classification of Heavy-tailed Features in High Dimensions: a Superstatistical Approach
Urte Adomaityte Gabriele Sicuro Pierpaolo Vivo



Research question: This study characterizes the learning of a mixture of two clouds of data points with generic centroids via empirical risk minimization in the high-dimensional regime.
Motivation: Real data distributions are diverse and often heavy-tailed, so it is important to understand learning beyond Gaussian assumptions and to test recent "Gaussian universality" claims.
Method: Each cloud of data points is obtained via a double-stochastic process, where samples are drawn from a Gaussian distribution whose variance is itself a random parameter sampled from a scalar distribution $\varrho$; the loss and the regularization are assumed to be generic convex functions.
Results: The paper studies the generalization performance of the obtained estimator, analyzes the role of regularization, and analytically characterizes the separability transition; the analysis covers a large family of data distributions, including power-law-tailed distributions with no covariance.

We characterise the learning of a mixture of two clouds of data points with generic centroids via empirical risk minimisation in the high dimensional regime, under the assumptions of generic convex loss and convex regularisation. Each cloud of data points is obtained via a double-stochastic process, where the sample is obtained from a Gaussian distribution whose variance is itself a random parameter sampled from a scalar distribution $\varrho$. As a result, our analysis covers a large family of data distributions, including the case of power-law-tailed distributions with no covariance, and allows us to test recent ''Gaussian universality'' claims. We study the generalisation performance of the obtained estimator, we analyse the role of regularisation, and we analytically characterise the separability transition.
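The double-stochastic data model is easy to simulate. The sketch below draws a per-point variance from a scalar distribution rho and then a Gaussian sample with that variance; the inverse-gamma choice of rho (giving Student-t-like, power-law tails) is an illustrative example, not the paper's only setting.

```python
import numpy as np

def sample_cloud(n, d, centroid, rho_sampler, rng):
    """Double-stochastic sampling: each point's variance Delta is drawn from
    a scalar distribution rho, then the point is Gaussian with that variance;
    heavy tails arise from the fluctuating Delta."""
    delta = rho_sampler(n, rng)                     # one variance per point
    return centroid + np.sqrt(delta)[:, None] * rng.normal(size=(n, d))

rng = np.random.default_rng(0)
# Inverse-gamma-distributed variances give power-law (Student-t-like) tails.
inv_gamma = lambda n, rng: 1.0 / rng.gamma(shape=2.0, scale=1.0, size=n)
cloud = sample_cloud(10000, 2, np.zeros(2), inv_gamma, rng)
```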

Composable Coresets for Determinant Maximization: Greedy is Almost Optimal
Siddharth Gollapudi Sepideh Mahabadi Varun Sivashankar



Research question: Given a set of $n$ vectors, how to pick $k$ vectors with maximum volume (determinant maximization).
Motivation: Determinant maximization is the MAP-inference task for determinantal point processes (DPPs) and has recently received considerable attention for modeling diversity. As most applications use large amounts of data, the problem has been studied in the composable coreset setting.
Method: The paper shows that the widely-used greedy algorithm also provides composable coresets with an almost optimal approximation factor of $O(k)^{3k}$, improving over the previously known guarantee of $C^{k^2}$ and supporting prior experimental results showing the practicality of the greedy algorithm as a coreset.
Results: The main result follows from a local optimality property of greedy: swapping a single point of the greedy solution with a vector not picked by the greedy algorithm can increase the volume by a factor of at most $(1+\sqrt{k})$, which is tight up to the additive constant $1$. Experiments show that the local optimality of the greedy algorithm is even lower than the theoretical bound on real data sets.

Given a set of $n$ vectors in $\mathbb{R}^d$, the goal of the \emph{determinant maximization} problem is to pick $k$ vectors with the maximum volume. Determinant maximization is the MAP-inference task for determinantal point processes (DPP) and has recently received considerable attention for modeling diversity. As most applications for the problem use large amounts of data, this problem has been studied in the relevant \textit{composable coreset} setting. In particular, [Indyk-Mahabadi-OveisGharan-Rezaei--SODA'20, ICML'19] showed that one can get composable coresets with optimal approximation factor of $\tilde O(k)^k$ for the problem, and that a local search algorithm achieves an almost optimal approximation guarantee of $O(k)^{2k}$. In this work, we show that the widely-used Greedy algorithm also provides composable coresets with an almost optimal approximation factor of $O(k)^{3k}$, which improves over the previously known guarantee of $C^{k^2}$, and supports the prior experimental results showing the practicality of the greedy algorithm as a coreset. Our main result follows by showing a local optimality property for Greedy: swapping a single point from the greedy solution with a vector that was not picked by the greedy algorithm can increase the volume by a factor of at most $(1+\sqrt{k})$. This is tight up to the additive constant $1$. Finally, our experiments show that the local optimality of the greedy algorithm is even lower than the theoretical bound on real data sets.
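For reference, the greedy algorithm in question admits a short Gram-Schmidt implementation: at each step it picks the vector with the largest residual norm orthogonal to the span selected so far, which is exactly the largest marginal volume gain. Instance sizes below are illustrative.

```python
import numpy as np

def greedy_volume(V, k):
    """Greedy determinant maximization: repeatedly add the vector with the
    largest component orthogonal to the span of the current selection; the
    selected volume equals the product of the residual norms at pick time."""
    residual = V.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        i = int(np.argmax(norms))
        chosen.append(i)
        b = residual[i] / norms[i]                 # new orthonormal direction
        residual -= np.outer(residual @ b, b)      # project it out everywhere
    return chosen

rng = np.random.default_rng(0)
S = greedy_volume(rng.normal(size=(200, 10)), k=5)
```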

Online Inventory Problems: Beyond the i.i.d. Setting with Online Convex Optimization
Massil HIHAT Stéphane Gaïffas Guillaume Garrigos Simon Bussy



Research question: This study considers multi-product inventory control problems where a manager makes sequential replenishment decisions based on partial historical information in order to minimize cumulative losses.
Motivation: The aim is to go beyond standard models, which usually rely on newsvendor-type losses, fixed dynamics, and unrealistic i.i.d. demand assumptions, by allowing general demands, losses, and dynamics.
Method: The paper proposes MaxCOSD, an online algorithm with provable guarantees even for problems with non-i.i.d. demands and stateful dynamics, including for instance perishability.
Results: The paper introduces what it calls non-degeneracy assumptions on the demand process and argues that they are necessary to allow learning.

We study multi-product inventory control problems where a manager makes sequential replenishment decisions based on partial historical information in order to minimize its cumulative losses. Our motivation is to consider general demands, losses and dynamics to go beyond standard models which usually rely on newsvendor-type losses, fixed dynamics, and unrealistic i.i.d. demand assumptions. We propose MaxCOSD, an online algorithm that has provable guarantees even for problems with non-i.i.d. demands and stateful dynamics, including for instance perishability. We consider what we call non-degeneracy assumptions on the demand process, and argue that they are necessary to allow learning.

On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration
Shuai Zhang Hongkang Li Meng Wang Miao Liu Pin-Yu Chen Songtao Lu Sijia Liu Keerthiram Murugesan Subhajit Chaudhury



Research question: This paper aims to provide a theoretical understanding of the deep Q-network (DQN) with $\epsilon$-greedy exploration in deep reinforcement learning.
Motivation: Despite the tremendous empirical achievements of the DQN, its theoretical characterization remains underexplored: the exploration strategy is either impractical or ignored in existing analyses, and existing analyses either lack convergence guarantees or bypass the technical challenges by deploying significantly overparameterized networks.
Method: The paper analyzes the DQN's exploration strategy together with its use of the target network and experience replay, which yield an unbiased estimate of the mean-square Bellman error (MSBE) used to train the Q-network, and proves that an iterative procedure with decaying $\epsilon$ converges geometrically to the optimal Q-value function.
Results: Experiments justify the established theoretical insights on DQNs.

This paper provides a theoretical understanding of deep Q-Network (DQN) with the $\epsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous empirical achievement of the DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in the existing analysis. Second, in contrast to conventional Q-learning algorithms, the DQN employs the target network and experience replay to acquire an unbiased estimation of the mean-square Bellman error (MSBE) utilized in training the Q-network. However, the existing theoretical analysis of DQNs lacks convergence analysis or bypasses the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first theoretical convergence and sample complexity analysis of the practical setting of DQNs with $\epsilon$-greedy policy. We prove an iterative procedure with decaying $\epsilon$ converges to the optimal Q-value function geometrically. Moreover, a higher level of $\epsilon$ values enlarges the region of convergence but slows down the convergence, while the opposite holds for a lower level of $\epsilon$ values. Experiments justify our established theoretical insights on DQNs.
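A tabular caricature of the analyzed scheme, decaying $\epsilon$-greedy exploration driving the Q-iterates toward the optimal Q-function; the paper's actual setting uses neural function approximation with a target network and experience replay, which this sketch omits. The MDP and the schedules are illustrative.

```python
import numpy as np

def q_learning_decaying_eps(P, R, gamma=0.9, steps=5000, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and decaying
    epsilon. P[s, a] is the next-state distribution, R[s, a] the reward."""
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q, s = np.zeros((S, A)), 0
    for t in range(1, steps + 1):
        eps = max(0.05, 1.0 / np.sqrt(t))          # decaying exploration
        a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
        s2 = rng.choice(S, p=P[s, a])
        alpha = 1.0 / np.sqrt(t)                   # decaying learning rate
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])
        s = s2
    return Q

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4), size=(4, 2))         # random 4-state, 2-action MDP
R = rng.uniform(size=(4, 2))
Q = q_learning_decaying_eps(P, R)
```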

Efficient Model-Free Exploration in Low-Rank MDPs
Zakaria Mhammedi Adam Block Dylan J Foster Alexander Rakhlin



Research question: How to develop practical, sample-efficient exploration algorithms for reinforcement learning in high-dimensional domains where generalization and function approximation are required.
Motivation: Existing algorithms are either computationally intractable or require restrictive statistical assumptions, such as latent variable structure or access to model-based function approximation.
Method: The paper proposes the first provably sample-efficient algorithm for exploration in low-rank Markov decision processes that is both computationally efficient and model-free, allowing for general function approximation and requiring no structural assumptions beyond a reachability condition.
Results: The algorithm, SpanRL, uses a barycentric spanner of the feature embedding as an efficiently computable basis for exploration, performing efficient spanner computation by interleaving representation learning and policy optimization subroutines.

A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation is required. Low-Rank Markov Decision Processes---where transition probabilities admit a low-rank factorization based on an unknown feature embedding---offer a simple, yet expressive framework for RL with function approximation, yet existing algorithms either (1) are computationally intractable, or (2) require restrictive statistical assumptions such as latent variable structure or access to model-based function approximation. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation while requiring no structural assumptions beyond a reachability condition that we show is substantially weaker than that assumed in prior work. Our algorithm, SpanRL, uses the notion of a barycentric spanner for the feature embedding as an efficiently computable basis for exploration, performing efficient spanner computation by interleaving representation learning and policy optimization subroutines. Our analysis---which is appealingly simple and modular---carefully combines several techniques, including a new approach to error-tolerant barycentric spanner computation, and a new analysis of a certain minimax representation learning objective found in prior work.

A Theoretical Analysis of the Test Error of Finite-Rank Kernel Ridge Regression
Tin Sum Cheng Aurelien Lucchi Anastasis Kratsios Ivan Dokmanić David Belius



Research question: How to provide sharper statistical learning guarantees for finite-rank kernel ridge regression (KRR).
Motivation: Existing statistical learning guarantees for general kernel regressors often yield loose bounds when applied to finite-rank kernels. Yet finite-rank kernels naturally appear in many machine learning problems, e.g., when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task in transfer learning.
Method: The paper addresses this gap by deriving sharp non-asymptotic upper and lower bounds for the test error of any finite-rank KRR.
Results: The bounds are tighter than previously derived bounds on finite-rank KRR and, unlike comparable results, remain valid for any regularization parameter.

Existing statistical learning guarantees for general kernel regressors often yield loose bounds when used with finite-rank kernels. Yet, finite-rank kernels naturally appear in a number of machine learning problems, e.g. when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task when performing transfer learning. We address this gap for finite-rank kernel ridge regression (KRR) by deriving sharp non-asymptotic upper and lower bounds for the KRR test error of any finite-rank KRR. Our bounds are tighter than previously derived bounds on finite-rank KRR and, unlike comparable results, they also remain valid for any regularization parameters.
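To make the finite-rank setting concrete: with features phi from a frozen network, fine-tuning only the last layer is exactly KRR with the rank-bounded kernel k(x, x') = phi(x) . phi(x'). A minimal sketch follows; the feature matrices and regularization level are illustrative.

```python
import numpy as np

def finite_rank_krr(Phi_tr, y, Phi_te, lam=1e-2):
    """KRR with k(x, x') = phi(x) . phi(x'): the Gram matrix has rank at
    most the feature dimension, the regime targeted by the paper's bounds."""
    K = Phi_tr @ Phi_tr.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return (Phi_te @ Phi_tr.T) @ alpha

rng = np.random.default_rng(0)
Phi_tr, Phi_te = rng.normal(size=(200, 16)), rng.normal(size=(50, 16))
y = Phi_tr @ rng.normal(size=16) + 0.1 * rng.normal(size=200)
preds = finite_rank_krr(Phi_tr, y, Phi_te)
```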

Discrete-Smoothness in Online Algorithms with Predictions
Yossi Azar Debmalya Panigrahi Noam Touitou



Research question: Designing online algorithms augmented with (machine-learned) predictions.
Motivation: The ideal learning-augmented algorithm is comparable to the optimum when given perfect predictions (consistency), to the best online approximation for arbitrary predictions (robustness), and should interpolate between these extremes as a smooth function of the prediction error.
Method: The paper quantifies these guarantees in terms of a general property called discrete-smoothness, and achieves discrete-smooth algorithms for online covering, specifically the facility location and set cover problems.
Results: For set cover, the work improves the results of Bamas, Maggiori, and Svensson (2020) by augmenting consistency and robustness with smoothness guarantees. For facility location, it improves on prior work by Almanza et al. (2021) by generalizing to nonuniform costs and also providing smoothness guarantees alongside consistency and robustness.

In recent years, there has been an increasing focus on designing online algorithms with (machine-learned) predictions. The ideal learning-augmented algorithm is comparable to the optimum when given perfect predictions (consistency), to the best online approximation for arbitrary predictions (robustness), and should interpolate between these extremes as a smooth function of the prediction error. In this paper, we quantify these guarantees in terms of a general property that we call discrete-smoothness, and achieve discrete-smooth algorithms for online covering, specifically the facility location and set cover problems. For set cover, our work improves the results of Bamas, Maggiori, and Svensson (2020) by augmenting consistency and robustness with smoothness guarantees. For facility location, our work improves on prior work by Almanza et al. (2021) by generalizing to nonuniform costs and also providing smoothness guarantees by augmenting consistency and robustness.

A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing
Junren Chen Jonathan Scarlett Michael Ng Zhaoqiang Liu



Research question: This paper addresses the recovery of a signal from nonlinear measurements in generative compressed sensing (GCS).
Motivation: In existing nonlinear GCS research, most results are non-uniform recovery guarantees that hold with high probability for a fixed signal, rather than uniform guarantees that hold for all possible signals simultaneously.
Method: The paper builds a unified framework for deriving uniform recovery guarantees for nonlinear GCS, accommodating 1-bit/uniformly quantized observations and single index models as canonical examples. Specifically, using a single realization of the sensing ensemble and generalized Lasso, all signals in the range of the generative prior can be recovered up to an $\ell_2$-error of at most $\epsilon$ using roughly $\tilde{O}(k/\epsilon^2)$ samples.
Results: This almost coincides with existing non-uniform guarantees up to logarithmic factors, so the uniformity costs very little. As technical contributions, the paper introduces Lipschitz approximation to handle discontinuous observation models and develops a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy. Experimental results corroborate the theory.

In generative compressed sensing (GCS), we want to recover a signal $\mathbf{x^*}\in\mathbb{R}^n$ from $m$ measurements ($m\ll n$) using a generative prior $\mathbf{x^*}\in G(\mathbb{B}_2^k(r))$, where $G$ is typically an $L$-Lipschitz continuous generative model and $\mathbb{B}_2^k(r)$ represents the radius-$r$ $\ell_2$-ball in $\mathbb{R}^k$. Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x^*}$ rather than for all $\mathbf{x^*}$ simultaneously. In this paper, we build a unified framework to derive uniform recovery guarantees for nonlinear GCS where the observation model is nonlinear and possibly discontinuous or unknown. Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index model as canonical examples. Specifically, using a single realization of the sensing ensemble and generalized Lasso, all $\mathbf{x^*}\in G(\mathbb{B}_2^k(r))$ can be recovered up to an $\ell_2$-error at most $\epsilon$ using roughly $\tilde{O}({k}/{\epsilon^2})$ samples, with omitted logarithmic factors typically being dominated by $\log L$. Notably, this almost coincides with existing non-uniform guarantees up to logarithmic factors, hence the uniformity costs very little. As part of our technical contributions, we introduce Lipschitz approximation to handle discontinuous observation models. We also develop a concentration inequality that produces tighter bound for product process whose index sets have low metric entropy. Experimental results are presented to corroborate our theory.

SHOT: Suppressing the Hessian along the Optimization Trajectory for Gradient-Based Meta-Learning
JunHoo Lee Jayeon Yoo Nojun Kwak



Research question: This paper investigates the hypothesis that gradient-based meta-learning (GBML) implicitly suppresses the Hessian along the optimization trajectory in the inner loop.
Motivation: Based on this hypothesis, the authors introduce a new algorithm, SHOT (Suppressing the Hessian along the Optimization Trajectory), that makes the suppression explicit.
Method: SHOT suppresses the Hessian in the inner loop by minimizing the distance between the parameters of the target and reference models; despite dealing with high-order terms, it does not significantly increase the computational complexity of the baseline model, and it is agnostic to both the algorithm and the architecture used in GBML.
Results: Empirical tests confirm the hypothesis, and SHOT outperforms the corresponding baselines on standard few-shot learning tasks.

In this paper, we hypothesize that gradient-based meta-learning (GBML) implicitly suppresses the Hessian along the optimization trajectory in the inner loop. Based on this hypothesis, we introduce an algorithm called SHOT (Suppressing the Hessian along the Optimization Trajectory) that minimizes the distance between the parameters of the target and reference models to suppress the Hessian in the inner loop. Despite dealing with high-order terms, SHOT does not increase the computational complexity of the baseline model much. It is agnostic to both the algorithm and architecture used in GBML, making it highly versatile and applicable to any GBML baseline. To validate the effectiveness of SHOT, we conduct empirical tests on standard few-shot learning tasks and qualitatively analyze its dynamics. We confirm our hypothesis empirically and demonstrate that SHOT outperforms the corresponding baseline.
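A highly simplified sketch of the core idea, an inner loop whose loss gradient is augmented with a pull toward a reference model's parameters. How SHOT actually constructs the reference model and weights the penalty follows the paper; the penalty weight `mu`, the toy quadratic task, and the zero reference used here are assumptions for illustration only.

```python
import numpy as np

def inner_loop_with_pull(theta0, grad_loss, theta_ref, steps=5, lr=0.1, mu=0.5):
    """Gradient steps on the task loss plus (mu/2)*||theta - theta_ref||^2,
    a parameter-distance penalty of the kind SHOT uses in the inner loop."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta -= lr * (grad_loss(theta) + mu * (theta - theta_ref))
    return theta

# Toy quadratic task: loss (1/2)||A theta - b||^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
theta = inner_loop_with_pull(np.zeros(5), lambda th: A.T @ (A @ th - b),
                             theta_ref=np.zeros(5))
```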

Penalising the biases in norm regularisation enforces sparsity
Etienne Boursier Nicolas Flammarion



Research question: Controlling the parameters' norm often yields good generalization when training neural networks, but the relation between regularizing the parameters' norm and the obtained estimator remains theoretically unclear.
Motivation: For one-hidden-ReLU-layer networks with unidimensional data, this work shows that the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor; notably, this weighting factor disappears when the norm of the bias terms is not regularized.
Method: The work analyzes minimal norm interpolators when the bias terms are penalized in the regularization, either explicitly or implicitly, and when they are not.
Results: The additional weighting factor is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator, whereas omitting the bias norm allows for non-sparse solutions; penalizing the bias terms thus leads to sparse estimators.

Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, the relation between regularising parameters' norm and obtained estimators remains theoretically misunderstood. For one hidden ReLU layer networks with unidimensional data, this work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of bias terms is not regularised. The presence of this additional weighting factor is of utmost significance as it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the bias' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.

On Certified Generalization in Structured Prediction
Bastian Boll Christoph Schnoerr



Research question: In structured prediction, target objects have rich internal structure that does not factorize into independent components, violating common i.i.d. assumptions.
Motivation: In applications such as image segmentation or scene graph generation, this challenge becomes apparent through the exponentially large output space.
Method: The paper presents a novel PAC-Bayesian risk bound for structured prediction in which the rate of generalization scales not only with the number of structured examples but also with their size, under the assumption that data are generated by the Knothe-Rosenblatt rearrangement of a factorizing reference measure.
Results: The work makes a preliminary step towards leveraging powerful generative models to establish generalization bounds for discriminative downstream tasks in the challenging setting of structured prediction.

In structured prediction, target objects have rich internal structure which does not factorize into independent components and violates common i.i.d. assumptions. This challenge becomes apparent through the exponentially large output space in applications such as image segmentation or scene graph generation. We present a novel PAC-Bayesian risk bound for structured prediction wherein the rate of generalization scales not only with the number of structured examples but also with their size. The underlying assumption, conforming to ongoing research on generative models, is that data are generated by the Knothe-Rosenblatt rearrangement of a factorizing reference measure. This allows to explicitly distill the structure between random output variables into a Wasserstein dependency matrix. Our work makes a preliminary step towards leveraging powerful generative models to establish generalization bounds for discriminative downstream tasks in the challenging setting of structured prediction.

Optimize Planning Heuristics to Rank, not to Estimate Cost-to-Goal
Leah Chrestien Stefan Edelkamp Antonin Komenda Tomáš Pevný



Research question: This paper revisits the necessary and sufficient conditions for strictly optimally efficient heuristics for forward search algorithms (mainly A* and greedy best-first search), which expand only states on the returned optimal path.
Motivation: In imitation learning for planning, parameters of heuristic functions are typically optimized against a set of solved problem instances; however, for such forward search algorithms, optimizing the cost-to-goal $h^*$ is unnecessarily difficult.
Method: The paper proposes a family of ranking-based loss functions tailored to a given variant of the forward search algorithm, and discusses from a learning theory point of view why optimizing the cost-to-goal $h^*$ is unnecessarily difficult.
Results: The experimental comparison on a diverse set of problems unequivocally supports the derived theory.

In imitation learning for planning, parameters of heuristic functions are optimized against a set of solved problem instances. This work revisits the necessary and sufficient conditions of strictly optimally efficient heuristics for forward search algorithms, mainly A* and greedy best-first search, which expand only states on the returned optimal path. It then proposes a family of loss functions based on ranking tailored for a given variant of the forward search algorithm. Furthermore, from a learning theory point of view, it discusses why optimizing cost-to-goal h* is unnecessarily difficult. The experimental comparison on a diverse set of problems unequivocally supports the derived theory.

Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework
Ziyi Huang Henry Lam Amirhossein Meisami Haofeng Zhang



Research question: Bayesian bandit algorithms with approximate inference perform well in practice, yet there is a large gap between this performance and their theoretical justification.
Motivation: To bridge this gap, we propose the Enhanced Bayesian Upper Confidence Bound (EBUCB) framework, which accommodates bandit problems in the presence of approximate inference.
Method: For Bernoulli multi-armed bandits, EBUCB is analyzed under inference error measured by two different $\alpha$-divergences, rather than the single divergence used in prior negative results.
Results: EBUCB achieves the optimal regret order $O(\log T)$ whenever the inference error under both $\alpha$-divergences is bounded by a constant, regardless of how large that constant is; conversely, a single bounded $\alpha$-divergence is shown to be insufficient to guarantee sub-linear regret.

Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large discrepancy between the superior practical performance of these approaches and their theoretical justification. Previous research only indicates a negative theoretical result: Thompson sampling could have a worst-case linear regret $\Omega(T)$ with a constant threshold on the inference error measured by one $\alpha$-divergence. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that for Bernoulli multi-armed bandits, EBUCB can achieve the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large this constant is. To our best knowledge, our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error. Furthermore, in concordance with the negative results in previous studies, we show that only one bounded $\alpha$-divergence is insufficient to guarantee a sub-linear regret.
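For concreteness, a minimal sketch of the Bayesian-UCB loop that EBUCB refines is given below, assuming exact Beta posteriors on a toy Bernoulli instance; the paper's contribution concerns the approximate-inference regime, which this exact-posterior demo deliberately omits, and the quantile schedule is a common illustrative choice rather than the paper's.

```python
# Hedged sketch: a Bayesian-UCB skeleton for Bernoulli bandits with exact
# Beta posteriors. EBUCB's novelty (handling approximate posteriors with
# bounded alpha-divergence error) is not reproduced here.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])     # toy instance (assumption)
K, T = len(true_means), 5000
a_post = np.ones(K)                         # Beta(1, 1) priors
b_post = np.ones(K)

for t in range(1, T + 1):
    q = 1.0 - 1.0 / t                       # illustrative quantile schedule
    ucb = beta.ppf(q, a_post, b_post)       # posterior upper quantiles
    arm = int(np.argmax(ucb))
    r = rng.binomial(1, true_means[arm])
    a_post[arm] += r                        # conjugate posterior update
    b_post[arm] += 1 - r

print("pulls of the best arm:", int(a_post[2] + b_post[2] - 2))
```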

Structured Semidefinite Programming for Recovering Structured Preconditioners
Arun Jambulapati Jerry Li Christopher Musco Kirankumar Shiragur Aaron Sidford Kevin Tian



Research question: Developing a general framework for finding near-optimal preconditioners for solving linear systems.
Motivation: Leveraging this framework yields improved runtimes for fundamental preconditioning and linear-system-solving problems.
Method: An algorithm is given which, for a positive definite matrix $\mathbf{K}$, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix.
Results: The diagonal preconditioning results improve on the state-of-the-art $\Omega(d^{3.5})$ runtimes attained by general-purpose semidefinite programming, and the structured solvers improve on state-of-the-art $\Omega(d^{\omega})$ runtimes, where $\omega > 2.3$ is the current matrix multiplication constant.

We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including: Diagonal preconditioning. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. Structured linear systems. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.
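To see why a good diagonal rescaling matters, the toy numpy snippet below applies plain Jacobi scaling $D = \mathrm{diag}(\mathbf{K})^{-1/2}$ to a badly scaled positive definite matrix and compares condition numbers; this baseline is only illustrative, as the paper's algorithm computes an $\epsilon$-optimal diagonal preconditioner rather than this heuristic one.

```python
# Illustrative only: Jacobi scaling D K D as a baseline for the effect of
# diagonal preconditioning on the condition number; the paper computes an
# epsilon-*optimal* diagonal rescaling instead.
import numpy as np

rng = np.random.default_rng(1)
d = 200
A = rng.standard_normal((d, d))
K = A @ A.T + np.diag(rng.uniform(1.0, 1e4, d))  # PD, badly scaled rows

D = 1.0 / np.sqrt(np.diag(K))                    # Jacobi preconditioner
K_pre = K * D[:, None] * D[None, :]              # symmetric rescaling D K D

print("condition number before:", np.linalg.cond(K))
print("condition number after :", np.linalg.cond(K_pre))
```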

Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards
Bo Xue Yimu Wang Yuanyu Wan Jinfeng Yi Lijun Zhang



Research question: This paper investigates generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon\in (0,1]$.
Motivation: Although methods for generalized linear bandits exist, most focus on bounded or sub-Gaussian rewards and are not well-suited to many real-world scenarios, such as financial markets and web advertising.
Method: Two novel algorithms based on truncation and mean of medians are proposed. They achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of the contextual information and $T$ is the time horizon. The truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches, and the mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical.
Results: When $\epsilon=1$, the algorithms improve the regret bounds of existing algorithms by a logarithmic factor; numerical experiments confirm their merits.

This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon\in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web-advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon=1$. Numerical experimental results confirm the merits of our algorithms.
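The truncation idea can be sketched in a few lines: clip each observation at a threshold that grows with the sample index, so that the clipped mean concentrates even when only the $(1+\epsilon)$-th moment is bounded. The threshold schedule and constants below are illustrative assumptions, not the tuned values from the paper.

```python
# Hedged sketch of truncation for heavy-tailed reward estimation; the
# threshold schedule b_i = (v * i)^(1/(1+eps)) is a standard illustrative
# choice, not the paper's exact tuning.
import numpy as np

def truncated_mean(rewards, eps, moment_bound=1.0):
    """Clip observation i at b_i = (moment_bound * i)**(1/(1+eps)), then average."""
    n = len(rewards)
    b = (moment_bound * np.arange(1, n + 1)) ** (1.0 / (1.0 + eps))
    return float(np.clip(rewards, -b, b).mean())

rng = np.random.default_rng(2)
samples = rng.pareto(1.5, size=10_000)   # finite (1+eps)-th moment for eps < 0.5
print(truncated_mean(samples, eps=0.4), samples.mean())
```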

Revisiting Area Convexity: Faster Box-Simplex Games and Spectrahedral Generalizations
Arun Jambulapati Kevin Tian



Research question: This paper investigates area convexity, a mysterious tool for optimization problems under the $\ell_\infty$ geometry, and develops its relationship with conventional analyses of extragradient methods.
Motivation: Area convexity was introduced to tackle optimization under the challenging $\ell_\infty$ geometry, but its relationship with standard extragradient analyses has remained unclear.
Method: Using the new tool of relative smoothness [BBT17, LFN18], improved solvers are given for the subproblems required by variants of the [Sherman17] algorithm.
Results: Leveraging these tools, a state-of-the-art first-order algorithm is obtained for box-simplex games (a primal-dual formulation of $\ell_\infty$ regression) in a $d \times n$ matrix with bounded rows, using $O(\log d \cdot \epsilon^{-1})$ matrix-vector queries. As a consequence, improved complexities follow for approximate maximum flow, optimal transport, min-mean-cycle, and other basic combinatorial optimization problems. A near-linear-time algorithm is also developed for a matrix generalization of box-simplex games, capturing a family of semidefinite programs recently used as subroutines in robust statistics and numerical linear algebra.

We investigate area convexity [Sherman17], a mysterious tool introduced to tackle optimization problems under the challenging $\ell_\infty$ geometry. We develop a deeper understanding of its relationship with conventional analyses of extragradient methods [Nemirovski04, Nesterov07]. We also give improved solvers for the subproblems required by variants of the [Sherman17] algorithm, designed through the lens of relative smoothness [BBT17, LFN18]. Leveraging these new tools, we give a state-of-the-art first-order algorithm for solving box-simplex games (a primal-dual formulation of $\ell_\infty$ regression) in a $d \times n$ matrix with bounded rows, using $O(\log d \cdot \epsilon^{-1})$ matrix-vector queries. As a consequence, we obtain improved complexities for approximate maximum flow, optimal transport, min-mean-cycle, and other basic combinatorial optimization problems. We also develop a near-linear time algorithm for a matrix generalization of box-simplex games, capturing a family of problems closely related to semidefinite programs recently used as subroutines in robust statistics and numerical linear algebra.

Exponential Lower Bounds for Fictitious Play in Potential Games
Ioannis Panageas Nikolas Patris Stratis Skoulakis Volkan Cevher



Research question: This paper addresses the rate of convergence of Fictitious Play (FP) dynamics when applied to potential games.
Motivation: Although FP is widely used in game theory and multi-agent reinforcement learning, its convergence rate is unknown except for two-player zero-sum games and for specific instances of payoff matrices or adversarial tie-breaking rules.
Method: A two-player coordination game with a unique Nash equilibrium is constructed recursively, and it is shown that every approximate Nash equilibrium of this game must be close to the pure Nash equilibrium in $\ell_1$-distance, which implies that FP can take exponential time to reach a Nash equilibrium in potential games.
Results: It is proven that FP can require exponential time (in the number of strategies) to reach a Nash equilibrium, even when the game is restricted to two agents.

Fictitious Play (FP) is a simple and natural dynamic for repeated play with many applications in game theory and multi-agent reinforcement learning. It was introduced by Brown, and its convergence properties for two-player zero-sum games were established later by Robinson. Potential games [Monderer and Shapley 1996] are another class of games which exhibit the FP property [Monderer and Shapley 1996], i.e., FP dynamics converge to a Nash equilibrium if all agents follow it. Nevertheless, except for two-player zero-sum games and for specific instances of payoff matrices [Abernethy et. al. 2021] or for adversarial tie-breaking rules [Daskalakis and Pan, 2014], the \textit{convergence rate} of FP is unknown. In this work, we focus on the rate of convergence of FP when applied to potential games and more specifically identical payoff games. We prove that FP can take exponential time (in the number of strategies) to reach a Nash equilibrium, even if the game is restricted to \textit{two agents}. To prove this, we recursively construct a two-player coordination game with a unique Nash equilibrium. Moreover, every approximate Nash equilibrium in the constructed game must be close to the pure Nash equilibrium in $\ell_1$-distance.

First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities
Aleksandr Beznosikov Sergey Samsonov Marina Sheshukova Alexander Gasnikov Alexey Naumov Eric Moulines



Research question: This paper studies stochastic optimization problems involving Markovian noise.
Motivation: To remove the limiting assumptions of prior work, such as bounded domains and uniformly bounded stochastic gradients, a new approach is proposed.
Method: A randomized batching scheme based on the multilevel Monte Carlo method is used to eliminate these restrictive assumptions, yielding a unified theoretical analysis of first-order gradient methods for both non-convex and strongly convex minimization problems.
Results: The method attains an optimal (linear) dependence on the mixing time of the underlying noise sequence, and lower bounds matching its oracle complexity are provided for strongly convex optimization problems. In addition, the method is the first to be extended to variational inequalities under Markovian noise.

This paper delves into stochastic optimization problems that involve Markovian noise. We present a unified approach for the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities. Our approach covers scenarios for both non-convex and strongly convex minimization problems. To achieve an optimal (linear) dependence on the mixing time of the underlying noise sequence, we use the randomized batching scheme, which is based on the multilevel Monte Carlo method. Moreover, our technique allows us to eliminate the limiting assumptions of previous research on Markov noise, such as the need for a bounded domain and uniformly bounded stochastic gradients. Our extension to variational inequalities under Markovian noise is original. Additionally, we provide lower bounds that match the oracle complexity of our method in the case of strongly convex optimization problems.

PAC-Bayesian Spectrally-Normalized Bounds for Adversarially Robust Generalization
Jiancong Xiao Ruoyu Sun Zhi-Quan Luo



Research question: Deep neural networks are vulnerable to adversarial attacks; guaranteeing adversarially robust generalization is key to building defense algorithms.
Motivation: Adversarially robust generalization is crucial for defending against adversarial attacks, so its theoretical guarantees merit study.
Method: This paper studies norm-based complexity for adversarially robust generalization via a PAC-Bayes approach. The main challenge lies in extending the key ingredient, a weight perturbation bound in the standard setting, to the robust setting.
Results: A spectrally-normalized adversarially robust generalization bound is provided which, compared with existing bounds, has two significant advantages: it does not rely on additional assumptions, and it is considerably tighter, aligning with standard generalization bounds. The main result is further extended to adversarial robustness against general non-$\ell_p$ attacks and other neural network architectures.

Deep neural networks (DNNs) are vulnerable to adversarial attacks. It is found empirically that adversarially robust generalization is crucial in establishing defense algorithms against adversarial attacks. Therefore, it is interesting to study the theoretical guarantee of robust generalization. This paper focuses on norm-based complexity, based on a PAC-Bayes approach (Neyshabur et al., 2017). The main challenge lies in extending the key ingredient, which is a weight perturbation bound in standard settings, to the robust settings. Existing attempts heavily rely on additional strong assumptions, leading to loose bounds. In this paper, we address this issue and provide a spectrally-normalized robust generalization bound for DNNs. Compared to existing bounds, our bound offers two significant advantages: Firstly, it does not depend on additional assumptions. Secondly, it is considerably tighter, aligning with the bounds of standard generalization. Therefore, our result provides a different perspective on understanding robust generalization: The mismatch terms between standard and robust generalization bounds shown in previous studies do not contribute to the poor robust generalization. Instead, these disparities are solely due to mathematical issues. Finally, we extend the main result to adversarial robustness against general non-$\ell_p$ attacks and other neural network architectures.

Implicit Bias of (Stochastic) Gradient Descent for Rank-1 Linear Neural Network
Bochen Lyu Zhanxing Zhu



Research question: Uncovering the implicit bias of deep learning is essential for understanding its underlying mechanisms, yet even for standard linear networks in the regression setting a comprehensive characterization of this bias remains an open problem.
Motivation: This paper proposes a new proxy model for standard linear networks, the rank-1 linear network, in which each weight matrix is parameterized in rank-1 form; for over-parameterized regression, the implicit bias of GD and SGD is analyzed precisely.
Method: A "potential" function is identified such that GD converges to its minimizer subject to zero training error (i.e., the interpolation solution), and the role of the noise introduced by SGD in perturbing the form of this potential is further characterized.
Results: The results explicitly connect the depth of the network and the initialization with the implicit bias of GD and SGD. A new implicit bias of SGD, jointly induced by stochasticity and over-parameterization, is highlighted, which can reduce the dependence of SGD's solution on the initialization. The findings differ from those for the recently popular diagonal linear network: the bias induced by the rank-1 model is more consistent with standard linear networks, whereas the diagonal one is not, suggesting that the proposed rank-1 linear network is a plausible proxy for the standard linear network.

Studying the implicit bias of gradient descent (GD) and stochastic gradient descent (SGD) is critical to unveil the underlying mechanism of deep learning. Unfortunately, even for standard linear networks in regression setting, a comprehensive characterization of the implicit bias is still an open problem. This paper proposes to investigate a new proxy model of standard linear network, rank-1 linear network, where each weight matrix is parameterized as a rank-1 form. For over-parameterized regression problem, we precisely analyze the implicit bias of GD and SGD---by identifying a “potential” function such that GD converges to its minimizer constrained by zero training error (i.e., interpolation solution), and further characterizing the role of the noise introduced by SGD in perturbing the form of this potential. Our results explicitly connect the depth of the network and the initialization with the implicit bias of GD and SGD. Furthermore, we emphasize a new implicit bias of SGD jointly induced by stochasticity and over-parameterization, which can reduce the dependence of the SGD's solution on the initialization. Our findings regarding the implicit bias are different from that of a recently popular model, the diagonal linear network. We highlight that the induced bias of our rank-1 model is more consistent with standard linear network while the diagonal one is not. This suggests that the proposed rank-1 linear network might be a plausible proxy for standard linear net.

ReSync: Riemannian Subgradient-based Robust Rotation Synchronization
Huikang Liu Xiao Li Anthony Man-Cho So



Research question: This paper presents ReSync, a Riemannian subgradient-based algorithm for the robust rotation synchronization problem arising in various engineering applications.
Motivation: Rotation synchronization arises widely in engineering applications, yet existing approaches often fail to recover the underlying rotations directly.
Method: ReSync solves a least-unsquared minimization over the rotation group, which is nonsmooth and nonconvex, and comes with strong theoretical guarantees under the random corruption setting.
Results: ReSync converges linearly to the ground-truth rotations under appropriate conditions, and experiments demonstrate its effectiveness.

This work presents ReSync, a Riemannian subgradient-based algorithm for solving the robust rotation synchronization problem, which arises in various engineering applications. ReSync solves a least-unsquared minimization formulation over the rotation group, which is nonsmooth and nonconvex, and aims at recovering the underlying rotations directly. We provide strong theoretical guarantees for ReSync under the random corruption setting. Specifically, we first show that the initialization procedure of ReSync yields a proper initial point that lies in a local region around the ground-truth rotations. We next establish the weak sharpness property of the aforementioned formulation and then utilize this property to derive the local linear convergence of ReSync to the ground-truth rotations. By combining these guarantees, we conclude that ReSync converges linearly to the ground-truth rotations under appropriate conditions. Experiment results demonstrate the effectiveness of ReSync.

Multi-Fidelity Multi-Armed Bandits Revisited
Xuchuang Wang Qingyun Wu Wei Chen John C.S. Lui



Research question: This paper studies the multi-fidelity multi-armed bandit (MF-MAB), an extension of the canonical MAB problem in which each arm can be pulled with different costs (fidelities) and observation accuracies.
Motivation: Both the best arm identification with fixed confidence (BAI) and the regret minimization objectives require analysis in the multi-fidelity setting, where standard single-fidelity results do not directly apply.
Method: For BAI, a cost complexity lower bound, an algorithmic framework with two alternative fidelity selection procedures, and cost complexity upper bounds for both procedures are presented. For regret minimization, a new regret definition is proposed, along with an elimination-based algorithm.
Results: The cost complexity bounds recover the standard sample complexity bounds of the classic (single-fidelity) MAB. For regret minimization, problem-independent and problem-dependent lower bounds of $\Omega(K^{1/3}\Lambda^{2/3})$ and $\Omega(K\log \Lambda)$ are proven, and the elimination-based algorithm's worst-cost regret upper bound matches the corresponding lower bound up to logarithmic terms.

We study the multi-fidelity multi-armed bandit ($\texttt{MF-MAB}$), an extension of the canonical multi-armed bandit (MAB) problem. $\texttt{MF-MAB}$ allows each arm to be pulled with different costs (fidelities) and observation accuracy. We study both the best arm identification with fixed confidence ($\texttt{BAI}$) and the regret minimization objectives. For $\texttt{BAI}$, we present (a) a cost complexity lower bound, (b) an algorithmic framework with two alternative fidelity selection procedures, and (c) both procedures' cost complexity upper bounds. From both cost complexity bounds of $\texttt{MF-MAB}$, one can recover the standard sample complexity bounds of the classic (single-fidelity) MAB. For regret minimization of $\texttt{MF-MAB}$, we propose a new regret definition, prove its problem-independent regret lower bound $\Omega(K^{1/3}\Lambda^{2/3})$ and problem-dependent lower bound $\Omega(K\log \Lambda)$, where $K$ is the number of arms and $\Lambda$ is the decision budget in terms of cost, and devise an elimination-based algorithm whose worst-cost regret upper bound matches its corresponding lower bound up to some logarithmic terms and, whose problem-dependent bound matches its corresponding lower bound in terms of $\Lambda$.

Preconditioning Matters: Fast Global Convergence of Non-convex Matrix Factorization via Scaled Gradient Descent
Xixi Jia Hailin Wang Jiangjun Peng Xiangchu Feng Deyu Meng



Research question: This paper addresses the non-convex optimization problem of low-rank matrix factorization, namely how to find global optima of the objective function.
Motivation: For existing gradient descent methods, the non-smoothness and non-convexity of the objective tie global convergence guarantees to small initialization and a small learning rate, which is impractical, especially for ill-conditioned problems.
Method: Preconditioning is shown to accelerate convergence: scaled gradient descent (ScaledGD) and its variant, alternating scaled gradient descent (AltScaledGD), are proven to converge to an $\varepsilon$-global minimum after $O({\rm ln} \frac{d}{\delta} + {\rm ln} \frac{d}{\varepsilon})$ iterations from general random initialization.
Results: Preconditioning effectively accelerates convergence; AltScaledGD converges faster than ScaledGD, and its global convergence relies neither on a small learning rate nor on small initialization, certifying the advantages of AltScaledGD for matrix factorization.

Low-rank matrix factorization (LRMF) is a canonical problem in non-convex optimization, in which the objective function to be minimized is non-convex and even non-smooth, making the global convergence guarantee of gradient-based algorithms quite challenging. Recent work made a breakthrough by proving that standard gradient descent converges to the $\varepsilon$-global minima after $O( \frac{d \kappa^2}{\tau^2} {\rm ln} \frac{d \sigma_d}{\tau} + \frac{d \kappa^2}{\tau^2} {\rm ln} \frac{\sigma_d}{\varepsilon})$ iterations from small initialization with a very small learning rate (both related to the small constant $\tau$). However, the dependence of the convergence on the \textit{condition number} $\kappa$ and on a \textit{small learning rate} makes it impractical, especially for ill-conditioned LRMF problems. In this paper, we show that precondition helps in accelerating the convergence and prove that the scaled gradient descent (ScaledGD) and its variant, alternating scaled gradient descent (AltScaledGD), converge to an $\varepsilon$-global minima after $O( {\rm ln} \frac{d}{\delta} + {\rm ln} \frac{d}{\varepsilon})$ iterations from general random initialization. Meanwhile, for small initialization as in gradient descent, both ScaledGD and AltScaledGD converge to $\varepsilon$-global minima after only $O({\rm ln} \frac{d}{\varepsilon})$ iterations. Furthermore, we prove that, as a proximity to alternating minimization, AltScaledGD converges faster than ScaledGD, and its global convergence relies on neither a small learning rate nor small initialization, which certifies the advantages of AltScaledGD in LRMF.

Efficient Sampling of Stochastic Differential Equations with Positive Semi-Definite Models
Anant Raj Umut Simsekli Alessandro Rudi



Research question: How to sample efficiently from a stochastic differential equation, given the drift function and the diffusion matrix.
Motivation: Obtaining i.i.d. samples at a prescribed precision from the law of an SDE is expensive with generic methods; positive semi-definite (PSD) models for probabilities offer a route to cheap sampling once the density has been approximated.
Method: The approach first computes a PSD model satisfying the Fokker-Planck equation (or its fractional variant) associated with the SDE, up to error $\varepsilon$, and then samples from the resulting PSD model.
Results: Under regularity of the Fokker-Planck solution, the preparatory phase yields a PSD model within $L^2$ distance $\varepsilon$ of the solution, after which i.i.d. samples with error $\varepsilon$ in Wasserstein-1 distance are produced at polynomial cost per sample; as the true solution gets smoother, the curse of dimensionality is circumvented without any convexity requirement.

This paper deals with the problem of efficient sampling from a stochastic differential equation, given the drift function and the diffusion matrix. The proposed approach leverages a recent model for probabilities (Rudi and Ciliberto, 2021) (the positive semi-definite -- PSD model) from which it is possible to obtain independent and identically distributed (i.i.d.) samples at precision $\varepsilon$ with a cost that is $m^2 d \log(1/\varepsilon)$ where $m$ is the dimension of the model, $d$ the dimension of the space. The proposed approach consists in: first, computing the PSD model that satisfies the Fokker-Planck equation (or its fractional variant) associated with the SDE, up to error $\varepsilon$, and then sampling from the resulting PSD model. Assuming some regularity of the Fokker-Planck solution (i.e. $\beta$-times differentiability plus some geometric condition on its zeros), we obtain an algorithm that: (a) in the preparatory phase obtains a PSD model with L2 distance $\varepsilon$ from the solution of the equation, with a model of dimension $m = \varepsilon^{-(d+1)/(\beta-2s)} (\log(1/\varepsilon))^{d+1}$ where $1/2\leq s\leq1$ is the fractional power to the Laplacian, and total computational complexity of $O(m^{3.5} \log(1/\varepsilon))$ and then (b) for Fokker-Planck equation, it is able to produce i.i.d.\ samples with error $\varepsilon$ in Wasserstein-1 distance, with a cost that is $O(d \varepsilon^{-2(d+1)/\beta-2} \log(1/\varepsilon)^{2d+3})$ per sample. This means that, if the probability associated with the SDE is somewhat regular, i.e. $\beta \geq 4d+2$, then the algorithm requires $O(\varepsilon^{-0.88} \log(1/\varepsilon)^{4.5d})$ in the preparatory phase, and $O(\varepsilon^{-1/2}\log(1/\varepsilon)^{2d+2})$ for each sample. Our results suggest that as the true solution gets smoother, we can circumvent the curse of dimensionality without requiring any sort of convexity.

Geometric Analysis of Matrix Sensing over Graphs
Haixiang Zhang Ying Chen Javad Lavaei



Research question: This paper considers matrix sensing over graphs (MSoG), a generalization of the matrix completion and matrix sensing problems that has not yet been analyzed in the literature.
Motivation: Existing results cannot be applied directly to the MSoG problem, so a first theoretical analysis of its optimization landscape is needed.
Method: A new condition, the $\Omega$-RIP condition, is proposed to characterize the optimization complexity of the problem. With an improved incoherence regularizer, the strict saddle property is proven to hold for MSoG with high probability under the incoherence and $\Omega$-RIP conditions, guaranteeing the polynomial-time global convergence of saddle-avoiding methods.
Results: Compared with state-of-the-art results, the bounds are tight up to a constant; beyond the theoretical guarantees, numerical illustrations show the close relation between the $\Omega$-RIP condition and the optimization complexity.

In this work, we consider the problem of matrix sensing over graphs (MSoG). As a general case of matrix completion and matrix sensing problems, the MSoG problem has not been analyzed in the literature and the existing results cannot be directly applied to the MSoG problem. This work provides the first theoretical results on the optimization landscape of the MSoG problem. More specifically, we propose a new condition, named the $\Omega$-RIP condition, to characterize the optimization complexity of the problem. In addition, with an improved regularizer of the incoherence, we prove that the strict saddle property holds for the MSoG problem with high probability under the incoherence condition and the $\Omega$-RIP condition, which guarantees the polynomial-time global convergence of saddle-avoiding methods. Compared with state-of-the-art results, the bounds in this work are tight up to a constant. Besides the theoretical guarantees, we numerically illustrate the close relation between the $\Omega$-RIP condition and the optimization complexity.

Byzantine-Tolerant Methods for Distributed Variational Inequalities
Nazarii Tupitsa Abdulla Jasem Almansoori Yanlin Wu Martin Takáč Karthik Nandakumar Samuel Horváth Eduard Gorbunov



Research question: How to make distributed methods for variational inequalities robust to Byzantine participants.
Motivation: Byzantine robustness is relatively well understood when training reduces to a minimization problem, but min-max problems and, more generally, variational inequalities arise in many modern distributed learning tasks and require separate consideration; only one prior work addresses this question.
Method: Several provably Byzantine-robust methods for distributed variational inequalities are provided, with a thorough study of their theoretical convergence that removes the limitations of the previous work.
Results: Numerical comparisons support the theoretical findings.

Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work [Abidi et al., 2022] addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.

Practical Contextual Bandits with Feedback Graphs
Mengxiao Zhang Yuheng Zhang Olga Vrousgou Haipeng Luo Paul Mineiro



Research question: How to effectively leverage different feedback patterns to enhance the pace of learning while reducing its statistical complexity.
Motivation: Although contextual bandits have a mature theory, how to effectively exploit different feedback patterns to speed up learning remains unclear; bandits with feedback graphs, interpolating between the full-information and bandit regimes, provide a promising framework for mitigating the statistical complexity of learning.
Method: This paper proposes and analyzes an approach to contextual bandits with feedback graphs based on reduction to regression; the approach is computationally practical and achieves established minimax rates.
Results: The resulting algorithms enhance the pace of learning and reduce the statistical complexity in real-world applications while remaining computationally practical.

While contextual bandit has a mature theory, effectively leveraging different feedback patterns to enhance the pace of learning remains unclear. Bandits with feedback graphs, which interpolates between the full information and bandit regimes, provides a promising framework to mitigate the statistical complexity of learning. In this paper, we propose and analyze an approach to contextual bandits with feedback graphs based upon reduction to regression. The resulting algorithms are computationally practical and achieve established minimax rates, thereby reducing the statistical complexity in real-world applications.
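One standard instantiation of the reduction-to-regression recipe is the inverse-gap weighting of SquareCB, sketched below over predicted rewards yhat; how the paper adapts this exploration distribution to a feedback graph is not reproduced here, so treat this as background rather than the proposed algorithm.

```python
# Background sketch: inverse-gap weighting (SquareCB-style) turns regression
# predictions into an exploration distribution; the feedback-graph-aware
# refinement of the paper is omitted.
import numpy as np

def inverse_gap_weighting(yhat, gamma):
    k = len(yhat)
    best = int(np.argmax(yhat))
    p = 1.0 / (k + gamma * (yhat[best] - yhat))  # small gap -> more probability
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                      # remaining mass on the greedy arm
    return p

print(inverse_gap_weighting(np.array([0.9, 0.5, 0.4]), gamma=10.0))
```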

Robust Learning for Smoothed Online Convex Optimization with Feedback Delay
Pengfei Li Jianyi Yang Adam Wierman Shaolei Ren



Research question: This paper studies a general form of Smoothed Online Convex Optimization, including multi-step switching costs and feedback delay.
Motivation: To robustify untrusted ML predictions, a novel machine-learning-augmented online algorithm, Robustness-Constrained Learning (RCL), is proposed, which combines them with a trusted expert online algorithm via constrained projection.
Method: RCL is proven to guarantee $(1+\lambda)$-competitiveness against any given expert for any $\lambda>0$, while also explicitly training the ML model in a robustification-aware manner to improve average-case performance.
Results: Using battery management as a case study, RCL is shown to improve both robustness and average performance.

We study a general form of Smoothed Online Convex Optimization, a.k.a. SOCO, including multi-step switching costs and feedback delay. We propose a novel machine learning (ML) augmented online algorithm, Robustness-Constrained Learning (RCL), which combines untrusted ML predictions with a trusted expert online algorithm via constrained projection to robustify the ML prediction. Specifically, we prove that RCL is able to guarantee $(1+\lambda)$-competitiveness against any given expert for any $\lambda>0$, while also explicitly training the ML model in a robustification-aware manner to improve the average-case performance. Importantly, RCL is the first ML-augmented algorithm with a provable robustness guarantee in the case of multi-step switching cost and feedback delay. We demonstrate the improvement of RCL in both robustness and average performance using battery management as a case study.

Solving a Class of Non-Convex Minimax Optimization in Federated Learning
Xidong Wu Jianhui Sun Zhengmian Hu Aidong Zhang Heng Huang



Research question: Solving minimax problems arising across machine learning applications, including adversarial training, policy evaluation in reinforcement learning, and AUROC maximization.
Motivation: Federated learning (FL) is gaining popularity for communication-efficient distributed training over large-scale distributed data, yet optimization algorithms for minimax problems under FL remain underexplored.
Method: A class of federated nonconvex minimax optimization problems is studied, and FL algorithms (FedSGDA+ and FedSGDA-M) are proposed that reduce existing complexity results for the most common minimax problems. For nonconvex-concave problems, FedSGDA+ reduces the communication complexity to $O(\varepsilon^{-6})$; under nonconvex-strongly-concave and nonconvex-PL minimax settings, FedSGDA-M achieves the best-known sample complexity of $O(\kappa^{3} N^{-1}\varepsilon^{-3})$ and the best-known communication complexity of $O(\kappa^{2}\varepsilon^{-2})$.
Results: Experiments on fair classification and AUROC maximization demonstrate the efficiency of the algorithms.

The minimax problems arise throughout machine learning applications, ranging from adversarial training and policy evaluation in reinforcement learning to AUROC maximization. To address the large-scale distributed data challenges across multiple clients with communication-efficient distributed training, federated learning (FL) is gaining popularity. Many optimization algorithms for minimax problems have been developed in the centralized setting (\emph{i.e.}, single-machine). Nonetheless, the algorithm for minimax problems under FL is still underexplored. In this paper, we study a class of federated nonconvex minimax optimization problems. We propose FL algorithms (FedSGDA+ and FedSGDA-M) and reduce existing complexity results for the most common minimax problems. For nonconvex-concave problems, we propose FedSGDA+ and reduce the communication complexity to $O(\varepsilon^{-6})$. Under nonconvex-strongly-concave and nonconvex-PL minimax settings, we prove that FedSGDA-M has the best-known sample complexity of $O(\kappa^{3} N^{-1}\varepsilon^{-3})$ and the best-known communication complexity of $O(\kappa^{2}\varepsilon^{-2})$. FedSGDA-M is the first algorithm to match the best sample complexity $O(\varepsilon^{-3})$ achieved by the single-machine method under the nonconvex-strongly-concave setting. Extensive experimental results on fair classification and AUROC maximization show the efficiency of our algorithms.

Energy-Efficient Scheduling with Predictions
Eric Balkanski Noemie Perivier Clifford Stein Hao-Ting Wei



Research question: How to manage power usage effectively to improve energy efficiency.
Motivation: An important goal of modern scheduling systems is to manage power usage efficiently, minimizing energy consumption while optimizing the quality-of-service cost of the resulting schedule.
Method: Leveraging machine-learned predictions of future requests, a new learning-augmented algorithmic framework is designed that takes as input an offline and an online algorithm for the desired energy-efficient scheduling problem and provides improved performance guarantees when the prediction error is small.
Results: The framework gives improved competitive ratios for many different energy-efficient scheduling problems while maintaining a bounded competitive ratio even when the prediction error is large, and empirically achieves improved performance on real and synthetic datasets.

An important goal of modern scheduling systems is to efficiently manage power usage. In energy-efficient scheduling, the operating system controls the speed at which a machine is processing jobs with the dual objective of minimizing energy consumption and optimizing the quality of service cost of the resulting schedule. Since machine-learned predictions about future requests can often be learned from historical data, a recent line of work on learning-augmented algorithms aims to achieve improved performance guarantees by leveraging predictions. In particular, for energy-efficient scheduling, Bamas et. al. [NeurIPS '20] and Antoniadis et. al. [SWAT '22] designed algorithms with predictions for the energy minimization with deadlines problem and achieved an improved competitive ratio when the prediction error is small while also maintaining worst-case bounds even when the prediction error is arbitrarily large. In this paper, we consider a general setting for energy-efficient scheduling and provide a flexible learning-augmented algorithmic framework that takes as input an offline and an online algorithm for the desired energy-efficient scheduling problem. We show that, when the prediction error is small, this framework gives improved competitive ratios for many different energy-efficient scheduling problems, including energy minimization with deadlines, while also maintaining a bounded competitive ratio regardless of the prediction error. Finally, we empirically demonstrate that this framework achieves an improved performance on real and synthetic datasets.

Fast Attention Requires Bounded Entries
Josh Alman Zhao Song



Research question: This paper studies inner product attention computation, a fundamental task in training large language models such as Transformer, GPT-1, and BERT.
Motivation: Straightforward methods explicitly compute the $n \times n$ attention matrix, which takes $\Omega(n^2)$ time and is prohibitively expensive at scale.
Method: The paper investigates whether faster algorithms are possible by implicitly making use of the attention matrix $A$, showing that the magnitude $B$ of the input entries governs the achievable efficiency.
Results: There is a sharp transition at $B = \Theta(\sqrt{\log n})$: for smaller entries an almost-linear-time approximation algorithm exists, while at the threshold, assuming the Strong Exponential Time Hypothesis, no truly subquadratic algorithm can approximate the attention output.

In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by \emph{implicitly} making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$. $\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error. $\bullet$ If $d = O(\log n)$ and $B = \Theta (\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - \Omega(1)}$. This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
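The quadratic-time baseline in the statement above is a one-liner to transcribe; the snippet below implements $\mathrm{Att}(Q,K,V)$ exactly as defined, with toy sizes, to fix what the subquadratic algorithms are approximating. The matrix sizes and entry bound $B$ are illustrative.

```python
# Direct transcription of Att(Q, K, V) = diag(A 1_n)^{-1} A V with
# A = exp(Q K^T / d): the Omega(n^2) baseline discussed above.
import numpy as np

def attention(Q, K, V):
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)                          # n x n attention matrix
    return (A / A.sum(axis=1, keepdims=True)) @ V    # row-normalise, mix V

rng = np.random.default_rng(3)
n, d, B = 64, 8, 1.0                                  # small entries: B = o(sqrt(log n))
Q, K, V = (rng.uniform(-B, B, (n, d)) for _ in range(3))
print(attention(Q, K, V).shape)                       # (64, 8)
```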

Learning the Efficient Frontier
Philippe Chatigny Ivan Sergienko Ryan Ferguson Jordan Weir Maxime Bergeron



Research question: How to allocate resources effectively to maximize reward at a given level of risk.
Motivation: Traditional convex optimization is computationally expensive at scale, motivating a fast and robust neural approximation framework for forecasting optimization results.
Method: NeuralEF is proposed, which reformulates the optimization problem as a sequence-to-sequence problem and handles discontinuous behavior to accelerate large-scale simulation.
Results: NeuralEF effectively forecasts optimization results, improving computational efficiency and robustness.

The efficient frontier (EF) is a fundamental resource allocation problem where one has to find an optimal portfolio maximizing a reward at a given level of risk. This optimal solution is traditionally found by solving a convex optimization problem. In this paper, we introduce NeuralEF: a fast neural approximation framework that robustly forecasts the results of the EF convex optimization problems with respect to heterogeneous linear constraints and a variable number of optimization inputs. By reformulating an optimization problem as a sequence-to-sequence problem, we show that NeuralEF is a viable solution to accelerate large-scale simulation while handling discontinuous behavior.

Time-uniform confidence bands for the CDF under nonstationarity
Paul Mineiro Steven R Howard



Research question: How to estimate a complete univariate distribution from a sequence of observations, a useful primitive for both manual and automated decision making.
Motivation: This problem has received extensive attention in the i.i.d. setting, but the arbitrary data-dependent setting remains largely unaddressed.
Method: Computationally felicitous time-uniform and value-uniform bounds are presented on the CDF of the running averaged conditional distribution of a sequence of real-valued random variables.
Results: Consistent with known impossibility results, the CDF bounds are always valid but sometimes trivial when the instance is too hard, and an instance-dependent convergence guarantee is given. The importance-weighted extension is appropriate for estimating complete counterfactual distributions of rewards from randomized experiments, e.g., A/B tests or contextual bandits.

Estimation of a complete univariate distribution from a sequence of observations is a useful primitive for both manual and automated decision making. This problem has received extensive attention in the i.i.d. setting, but the arbitrary data dependent setting remains largely unaddressed. We present computationally felicitous time-uniform and value-uniform bounds on the CDF of the running averaged conditional distribution of a sequence of real-valued random variables. Consistent with known impossibility results, our CDF bounds are always valid but sometimes trivial when the instance is too hard, and we give an instance-dependent convergence guarantee. The importance-weighted extension is appropriate for estimating complete counterfactual distributions of rewards given data from a randomized experiment, e.g., from an A/B test or a contextual bandit.

Provable convergence guarantees for black-box variational inference
Justin Domke Robert M. Gower Guillaume Garrigos



Research question: This paper addresses the fact that black-box variational inference is widely used without proof that its stochastic optimization succeeds.
Motivation: Existing stochastic optimization proofs leave a theoretical gap: gradient estimators with unusual noise bounds, and a composite non-smooth objective.
Method: For dense Gaussian variational families, existing reparameterization-based gradient estimators are shown to satisfy a quadratic noise bound, and novel convergence guarantees are given for proximal and projected stochastic gradient descent using this bound.
Results: This provides rigorous guarantees that methods similar to those used in practice converge on realistic inference problems.

Black-box variational inference is widely used in situations where there is no proof that its stochastic optimization succeeds. We suggest this is due to a theoretical gap in existing stochastic optimization proofs—namely the challenge of gradient estimators with unusual noise bounds, and a composite non-smooth objective. For dense Gaussian variational families, we observe that existing gradient estimators based on reparameterization satisfy a quadratic noise bound and give novel convergence guarantees for proximal and projected stochastic gradient descent using this bound. This provides rigorous guarantees that methods similar to those used in practice converge on realistic inference problems.
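A minimal sketch of the estimator class the analysis covers: reparameterized gradients for a dense Gaussian family $q = \mathcal{N}(\mu, LL^\top)$, with a crude projection keeping the Cholesky diagonal positive standing in for the projected step. The target, step size, and projection floor are toy assumptions.

```python
# Hedged sketch: reparameterised gradient steps for dense Gaussian VI against
# a standard-normal target (so the optimum is mu = 0, L = I). The projection
# below is a toy stand-in for the projected/proximal SGD analysed above.
import numpy as np

rng = np.random.default_rng(4)
d = 3
grad_logp = lambda z: -z            # target N(0, I): grad log p(z) = -z

mu, L, step = np.ones(d), 2.0 * np.eye(d), 0.05
for _ in range(2000):
    eps = rng.standard_normal(d)
    z = mu + L @ eps                # reparameterised sample z ~ q
    g = grad_logp(z)
    grad_mu = -g                    # gradients of the *negative* ELBO
    grad_L = -np.tril(np.outer(g, eps)) - np.diag(1.0 / np.diag(L))
    mu -= step * grad_mu
    L = np.tril(L - step * grad_L)
    np.fill_diagonal(L, np.maximum(np.diag(L), 1e-3))   # projection step

print(mu.round(2), np.diag(L).round(2))   # should approach 0 and 1
```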

Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with Application to Fairness
Evgenii E Chzhen Christophe Giraud Zhen LI Gilles Stoltz



Research question: This paper studies contextual bandit problems with knapsacks (CBwK).
Motivation: In this problem, the learner must maximize cumulative reward while keeping cumulative costs below predetermined constraints; the motivation here is to impose a fairness constraint of equalized average costs between groups, whose budget should be of the order of the natural deviations, $\sqrt{T}$.
Method: A dual strategy based on projected-gradient-descent updates is introduced that can handle total-cost constraints of order $\sqrt{T}$ up to poly-logarithmic terms.
Results: The strategy is more direct and simpler than existing strategies in the literature, relying on a careful, adaptive tuning of the step size.

We consider contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated---a typical assumption in the literature. In this setting, total cost constraints had so far to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are however motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates, that is able to deal with total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive, tuning of the step size.
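Schematically, the dual strategy maintains one nonnegative multiplier per cost constraint and takes a projected gradient step each round; the names, the $\ell_2$-ball projection, and the constants below are assumptions for illustration, not the paper's exact construction.

```python
# Hedged sketch of a projected-gradient dual update for CBwK-style
# constraints: duals rise on violated constraints and are projected back.
import numpy as np

def dual_update(lam, cost, budget_rate, step, radius):
    lam = np.maximum(lam + step * (cost - budget_rate), 0.0)  # ascent + positivity
    norm = np.linalg.norm(lam)
    return lam * (radius / norm) if norm > radius else lam    # l2-ball projection

lam = np.zeros(2)
lam = dual_update(lam, cost=np.array([0.9, 0.1]),
                  budget_rate=np.array([0.5, 0.5]), step=0.1, radius=10.0)
print(lam)   # the violated first constraint's multiplier rises
```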

On the Complexity of Differentially Private Best-Arm Identification with Fixed Confidence
Achraf Azize Marc Jourdan Aymen Al Marjani Debabrota Basu



Research question: This paper studies Best Arm Identification (BAI) with fixed confidence under $\epsilon$-global Differential Privacy (DP).
Motivation: The study is motivated by the data-privacy concerns of data-sensitive applications such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies.
Method: AdaP-TT, an $\epsilon$-global DP variant of the Top Two algorithm, is proposed; it runs in arm-dependent adaptive episodes and adds Laplace noise to ensure a good privacy-utility trade-off.
Results: The experimental analysis validates the theoretical results: the sample complexity upper bound of AdaP-TT matches the lower bound up to multiplicative constants in the high-privacy regime.

Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies to name a few. Motivated by the data privacy concerns invoked by these applications, we study the problem of BAI with fixed confidence under $\epsilon$-global Differential Privacy (DP). First, to quantify the cost of privacy, we derive a lower bound on the sample complexity of any $\delta$-correct BAI algorithm satisfying $\epsilon$-global DP. Our lower bound suggests the existence of two privacy regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and a novel information-theoretic quantity, called the Total Variation Characteristic Time. In the low-privacy regime (large $\epsilon$), the sample complexity lower bound reduces to the classical non-private lower bound. Second, we propose AdaP-TT, an $\epsilon$-global DP variant of the Top Two algorithm. AdaP-TT runs in *arm-dependent adaptive episodes* and adds *Laplace noise* to ensure a good privacy-utility trade-off. We derive an asymptotic upper bound on the sample complexity of AdaP-TT that matches with the lower bound up to multiplicative constants in the high-privacy regime. Finally, we provide an experimental analysis of AdaP-TT that validates our theoretical results.

Learning Exponential Families from Truncated Samples
Jane Lee Andre Wibisono Manolis Zampetakis



Research question: This paper addresses missing data problems prevalent across scientific fields, in particular when samples are truncated.
Motivation: Truncated samples are a fundamental type of missing data, and statistical estimation from them is a classical problem in statistics; recent work gives efficient parameter-estimation algorithms for Gaussian distributions and for linear regression with Gaussian noise.
Method: These results are generalized to log-concave exponential families via an estimation algorithm showing that extrapolation is possible for a much larger class of distributions while maintaining polynomial sample and time complexity on average; the algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but also simpler and more efficient than recent algorithms.
Results: The work also has interesting implications for learning general log-concave distributions and for sampling given only access to truncated data.

Missing data problems have many manifestations across many scientific fields. A fundamental type of missing data problem arises when samples are \textit{truncated}, i.e., samples that lie in a subset of the support are not observed. Statistical estimation from truncated samples is a classical problem in statistics which dates back to Galton, Pearson, and Fisher. A recent line of work provides the first efficient estimation algorithms for the parameters of a Gaussian distribution and for linear regression with Gaussian noise. In this paper we generalize these results to log-concave exponential families. We provide an estimation algorithm that shows that \textit{extrapolation} is possible for a much larger class of distributions while it maintains a polynomial sample and time complexity on average. Our algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but is also simpler and more efficient than recent algorithms. Our work also has interesting implications for learning general log-concave distributions and sampling given only access to truncated data.

Meta-Learning Adversarial Bandit Algorithms
Mikhail Khodak Ilya Osadchiy Keegan Harris Nina Balcan Kfir Yehuda Levy Ron Meir Steven Wu



Research question: This paper studies online meta-learning with bandit feedback, aiming to improve performance across multiple tasks when they are similar under a natural similarity measure.
Motivation: This is the first work to target the adversarial online-within-online partial-information setting.
Method: Meta-algorithms are designed that combine outer learners to simultaneously tune the initialization and other hyperparameters of an inner learner, for two important cases: multi-armed bandits, via the Tsallis-entropy generalization of Exp3, and bandit linear optimization, via online mirror descent with self-concordant barrier regularizers.
Results: For MAB, the task-averaged regret improves when the entropy of the optima-in-hindsight is small; for BLO, the task-averaged regret varies directly with an action-space-dependent measure induced by the regularizers.

We study online meta-learning with bandit feedback, with the goal of improving performance across multiple tasks if they are similar according to some natural similarity measure. As the first to target the adversarial online-within-online partial-information setting, we design meta-algorithms that combine outer learners to simultaneously tune the initialization and other hyperparameters of an inner learner for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners initialize and set hyperparameters of the Tsallis-entropy generalization of Exp3, with the task-averaged regret improving if the entropy of the optima-in-hindsight is small. For BLO, we learn to initialize and tune online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with an action space-dependent measure they induce. Our guarantees rely on proving that unregularized follow-the-leader combined with two levels of low-dimensional hyperparameter tuning is enough to learn a sequence of affine functions of non-Lipschitz and sometimes non-convex Bregman divergences bounding the regret of OMD.

Fair Allocation of Indivisible Chores: Beyond Additive Costs
Bo Li Fangxiao Wang Yu Zhou



Research question: How to fairly allocate $m$ indivisible chores to $n$ agents who have costs for completing the assigned tasks.
Motivation: Exact maximin share (MMS) fairness is known to be unguaranteeable, and for additive cost functions the best known approximation is the $\frac{13}{11}$ of Huang and Segal-Halevi [EC, 2023]; beyond additivity, however, very little is known.
Method: It is first proven that if the cost functions are submodular, no algorithm can guarantee better than a $\min\{n,\frac{\log m}{\log \log m}\}$-approximation, in sharp contrast with the allocation of goods, where constant approximations exist as shown by Barman and Krishnamurthy [TEAC, 2020] and Ghodsi et al. [AIJ, 2022]. It is then proven that for subadditive costs there always exists a $\min\{n,\lceil\log m\rceil\}$-approximate allocation, so the approximation ratio is asymptotically tight.
Results: Beyond multiplicative approximation, the ordinal relaxation 1-out-of-$d$ MMS, recently proposed by Hosseini et al. [JAIR and AAMAS, 2022], is also considered; the impossibility result implies that for any $d\ge 2$, a 1-out-of-$d$ MMS allocation may not exist. Given these hardness results for general subadditive costs, attention turns to two specific subadditive costs, bin packing and job scheduling, for which constant-approximate allocations are shown to exist under both the multiplicative and ordinal relaxations of MMS.

We study the maximin share (MMS) fair allocation of $m$ indivisible tasks to $n$ agents who have costs for completing the assigned tasks. It is known that exact MMS fairness cannot be guaranteed, and so far the best-known approximation for additive cost functions is $\frac{13}{11}$ by Huang and Segal-Halevi [EC, 2023]; however, beyond additivity, very little is known. In this work, we first prove that no algorithm can ensure better than $\min\{n,\frac{\log m}{\log \log m}\}$-approximation if the cost functions are submodular. This result also shows a sharp contrast with the allocation of goods where constant approximations exist as shown by Barman and Krishnamurthy [TEAC, 2020] and Ghodsi et al. [AIJ, 2022]. We then prove that for subadditive costs, there always exists an allocation that is $\min\{n,\lceil\log m\rceil\}$-approximation, and thus the approximation ratio is asymptotically tight. Besides multiplicative approximation, we also consider the ordinal relaxation, 1-out-of-$d$ MMS, which was recently proposed by Hosseini et al. [JAIR and AAMAS, 2022]. Our impossibility result implies that for any $d\ge 2$, a 1-out-of-$d$ MMS allocation may not exist. Due to these hardness results for general subadditive costs, we turn to studying two specific subadditive costs, namely, bin packing and job scheduling. For both settings, we show that constant approximate allocations exist for both multiplicative and ordinal relaxations of MMS.

Covariance-adaptive best arm identification
El Mehdi Saad Gilles Blanchard Nicolas Verzelen



Research question: Best arm identification in the multi-armed bandit model under fixed confidence, when arms may be dependent and rewards can be sampled simultaneously.
Motivation: The literature solves this problem under the assumption of independent arm distributions; a more flexible scenario in which the learner can estimate the covariance among the arm distributions enables more efficient identification and is relevant in applications such as clinical trials, where similarities between patients or drugs suggest correlated outcomes.
Method: New algorithms are introduced that adapt to the unknown covariance of the arms, together with new lower bounds for the relaxed setting.
Results: Theoretical guarantees show that substantial improvement can be achieved over the standard setting, and numerical simulations support the theoretical findings.

We consider the problem of best arm identification in the multi-armed bandit model, under fixed confidence. Given a confidence input $\delta$, the goal is to identify the arm with the highest mean reward with a probability of at least $1 - \delta$, while minimizing the number of arm pulls. While the literature provides solutions to this problem under the assumption of independent arms distributions, we propose a more flexible scenario where arms can be dependent and rewards can be sampled simultaneously. This framework allows the learner to estimate the covariance among the arms distributions, enabling a more efficient identification of the best arm. The relaxed setting we propose is relevant in various applications, such as clinical trials, where similarities between patients or drugs suggest underlying correlations in the outcomes. We introduce new algorithms that adapt to the unknown covariance of the arms and demonstrate through theoretical guarantees that substantial improvement can be achieved over the standard setting. Additionally, we provide new lower bounds for the relaxed setting and present numerical simulations that support these theoretical findings.

Time-Independent Information-Theoretic Generalization Bounds for SGLD
Futoshi Futami Masahiro Fujisawa



Research question: Providing novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD), which is widely used in sampling and non-convex optimization.
Motivation: Current research focuses mainly on improving sampling and optimization performance, and lacks a deep understanding of SGLD's generalization ability.
Method: New information-theoretic generalization bounds are derived by focusing on the time evolution of the Kullback-Leibler divergence, which is related to the stability of datasets and upper-bounds the mutual information between the output parameters and the input dataset.
Results: The generalization bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. In addition, the first information-theoretic generalization bound for the case where training and test losses coincide is established; it is also time-independent and removes the problematic step-size dependence of existing work, yielding an improved excess risk bound when combined with existing non-convex optimization error bounds.

We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds.
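For reference, the SGLD update the bounds apply to is a stochastic gradient step plus injected Gaussian noise; the toy least-squares loss, batch size, and temperature below are assumptions chosen only to make the two-line update concrete.

```python
# The SGLD iterate: theta <- theta - eta * grad + sqrt(2 * eta / beta) * xi,
# shown on a toy least-squares problem (all constants are illustrative).
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((256, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(256)

theta, eta, inv_beta = np.zeros(4), 1e-3, 1e-3
for _ in range(3000):
    idx = rng.choice(len(X), size=32, replace=False)           # mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
    theta += -eta * grad + np.sqrt(2 * eta * inv_beta) * rng.standard_normal(4)

print(theta.round(2))   # hovers near the least-squares solution
```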

A Batch-to-Online Transformation under Random-Order Model
Jing Dong Yuichi Yoshida



Research question: How to transform offline approximation algorithms into online algorithms with low $\epsilon$-approximate regret.
Motivation: A transformation framework is proposed to address online algorithm design under the random-order model.
Method: Offline approximation algorithms with low average sensitivity are transformed into online algorithms with low $\epsilon$-approximate regret, using a coreset construction to obtain low-sensitivity versions.
Results: The approach applies successfully to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, achieving polylogarithmic $\epsilon$-approximate regret for each; in all three cases the resulting algorithms also enjoy low inconsistency.

We introduce a transformation framework that can be utilized to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $\epsilon$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.

Logarithmic-Regret Quantum Learning Algorithms for Zero-Sum Games
Minbo Gao Zhengfeng Ji Tongyang Li Qisheng Wang



Research question: Proposing the first online quantum algorithm for zero-sum games with low regret.
Motivation: Designing online quantum algorithms with low regret under the game setting for solving zero-sum games.
Method: Using standard quantum inputs and classical outputs with succinct descriptions, an online quantum algorithm is developed that "quantizes" classical algorithms based on the optimistic multiplicative weight update method; at its heart is a fast quantum multi-sampling procedure for the Gibbs sampling problem.
Results: A fast quantum linear programming solver is obtained, with an $\varepsilon$-approximate Nash equilibrium of an $m \times n$ matrix zero-sum game computed in quantum time $\widetilde O(\sqrt{m+n}/\varepsilon^{2.5})$.

We propose the first online quantum algorithm for zero-sum games with $\widetilde O(1)$ regret under the game setting. Moreover, our quantum algorithm computes an $\varepsilon$-approximate Nash equilibrium of an $m \times n$ matrix zero-sum game in quantum time $\widetilde O(\sqrt{m+n}/\varepsilon^{2.5})$. Our algorithm uses standard quantum inputs and generates classical outputs with succinct descriptions, facilitating end-to-end applications. As an application, we obtain a fast quantum linear programming solver. Technically, our online quantum algorithm "quantizes" classical algorithms based on the optimistic multiplicative weight update method. At the heart of our algorithm is a fast quantum multi-sampling procedure for the Gibbs sampling problem, which may be of independent interest.

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space
Saghar Adler Vijay Subramanian



Research question: How to optimally control countably-infinite state-space Markov Decision Processes (MDPs) governed by unknown parameters.
Motivation: Models of many real-life applications, such as queueing models of communication networks or computing systems, have countably infinite state spaces; existing algorithms and learning procedures focus mainly on finite-state settings and do not directly apply to these models.
Method: From a Bayesian perspective, an algorithm based on Thompson sampling with dynamically-sized episodes is proposed: at the beginning of each episode, the posterior distribution formed via Bayes' rule produces a parameter estimate, which then determines the policy applied during the episode.
Results: An $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of the algorithm is established, where $T$ is the time horizon. Finally, two queueing models with unknown dynamics are considered, showing that the algorithm can be applied to develop approximately optimal control algorithms.

Models of many real-life applications, such as queueing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state-space $\mathcal X=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queueing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.

Selective Sampling and Imitation Learning via Online Regression
Ayush Sekhari Karthik Sridharan Wen Sun Runzhe Wu



Research question: This paper considers Imitation Learning (IL) by actively querying a noisy expert for feedback.
Motivation: Although imitation learning has been empirically successful, much prior work assumes access to noiseless expert feedback, which is impractical in many applications; when only noisy feedback is available, algorithms relying on purely offline data (non-interactive IL) need a prohibitively large number of samples to succeed.
Method: An interactive IL algorithm is provided that uses selective sampling to actively query the noisy expert. The contributions are twofold: first, a new selective sampling algorithm that works with general function classes and multiple actions and obtains the best-known bounds on regret and number of queries; second, an extension of this analysis to IL with noisy expert feedback, yielding a new IL algorithm that makes a limited number of queries.
Results: The selective sampling algorithm leverages function approximation and relies on an online regression oracle over the given model class to predict actions and to decide whether to query the expert for its label. The regret is upper-bounded by the regret of the online regression oracle, while the query complexity additionally depends on the eluder dimension of the model class, and a matching lower bound shows the results are tight. For IL with general function approximation, bounds are given on both the regret and the number of queries made to the noisy expert; a key novelty is that these bounds depend only on the number of times the optimal policy (not the noisy expert or the learner) visits states with a small margin.

We consider the problem of Imitation Learning (IL) by actively querying noisy expert for feedback. While imitation learning has been empirically successful, much of prior work assumes access to noiseless expert feedback which is not practical in many applications. In fact, when one only has access to noisy expert feedback, algorithms that rely on purely offline data (non-interactive IL) can be shown to need a prohibitively large number of samples to be successful. In contrast, in this work, we provide an interactive algorithm for IL that uses selective sampling to actively query the noisy expert for feedback. Our contributions are twofold: First, we provide a new selective sampling algorithm that works with general function classes and multiple actions, and obtains the best-known bounds for the regret and the number of queries. Next, we extend this analysis to the problem of IL with noisy expert feedback and provide a new IL algorithm that makes limited queries. Our algorithm for selective sampling leverages function approximation, and relies on an online regression oracle w.r.t.~the given model class to predict actions, and to decide whether to query the expert for its label. On the theoretical side, the regret bound of our algorithm is upper bounded by the regret of the online regression oracle, while the query complexity additionally depends on the eluder dimension of the model class. We complement this with a lower bound that demonstrates that our results are tight. We extend our selective sampling algorithm for IL with general function approximation and provide bounds on both the regret and the number of queries made to the noisy expert. A key novelty here is that our regret and query complexity bounds only depend on the number of times the optimal policy (and not the noisy expert, or the learner) go to states that have a small margin.

Continuous-time Analysis of Anchor Acceleration
Jaewook J. Suh Jisun Park Ernest K. Ryu



Research question: This paper aims at a deeper understanding of anchor acceleration, an acceleration mechanism distinct from Nesterov's.
Motivation: Although anchor acceleration has been discovered for minimax optimization and fixed-point problems, its mechanism is not yet well understood.
Method: Continuous-time models of anchor acceleration are analyzed, providing tight, unified analyses of the convergence rate as a function of the anchor coefficient $\beta(t)$, and an adaptive method inspired by the analyses is presented.
Results: Theoretical analyses and experiments establish the method's effectiveness.

Recently, the anchor acceleration, an acceleration mechanism distinct from Nesterov's, has been discovered for minimax optimization and fixed-point problems, but its mechanism is not understood well, much less so than Nesterov acceleration. In this work, we analyze continuous-time models of anchor acceleration. We provide tight, unified analyses for characterizing the convergence rate as a function of the anchor coefficient $\beta(t)$, thereby providing insight into the anchor acceleration mechanism and its accelerated $\mathcal{O}(1/k^2)$-convergence rate. Finally, we present an adaptive method inspired by the continuous-time analyses and establish its effectiveness through theoretical analyses and experiments.
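In discrete time, the prototypical anchored iteration is the Halpern-type update $x_{k+1} = \beta_k x_0 + (1-\beta_k) T(x_k)$ with a vanishing anchor coefficient; the snippet below runs it on a toy nonexpansive operator. The operator and coefficient schedule are illustrative assumptions, not the paper's continuous-time model.

```python
# Toy Halpern-type anchored iteration: each step pulls the fixed-point
# iterate back toward the anchor x0 with weight beta_k = 1 / (k + 2).
import numpy as np

def T(x):                                   # toy nonexpansive map: shrink + rotate
    c, s = np.cos(0.5), np.sin(0.5)
    return 0.99 * np.array([[c, -s], [s, c]]) @ x

x0 = np.array([5.0, -3.0])
x = x0.copy()
for k in range(200):
    beta_k = 1.0 / (k + 2)                  # anchor coefficient
    x = beta_k * x0 + (1.0 - beta_k) * T(x)

print(np.linalg.norm(x - T(x)))             # fixed-point residual shrinks
```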

On Robust Streaming for Learning with Experts: Algorithms and Lower Bounds
David Woodruff Fred Zhang Samson Zhou



Research question: The online learning with experts problem: given the predictions of a set of experts, an algorithm must predict an outcome on each of $T$ days while minimizing its prediction cost.
Motivation: In practice, the predictions made by experts or algorithms often influence future outcomes, so the input is adaptively generated.
Method: A randomized robust algorithm is given that is resilient to adaptive inputs and uses $\widetilde{O}\left(\frac{n}{R\sqrt{T}}\right)$ space, exhibiting a smooth space-regret trade-off, together with a space lower bound for any randomized algorithm achieving regret $R$.
Results: Experiments demonstrate the benefit of robust procedures against a white-box adversary with access to the algorithm's internal state.

In the online learning with experts problem, an algorithm makes predictions about an outcome on each of $T$ days, given a set of $n$ experts who make predictions on each day. The algorithm is given feedback on the outcomes of each day, including the cost of its prediction and the cost of the expert predictions, and the goal is to make a prediction with the minimum cost, compared to the best expert in hindsight. However, often the predictions made by experts or algorithms at some time influence future outcomes, so that the input is adaptively generated. In this paper, we study robust algorithms for the experts problem under memory constraints. We first give a randomized algorithm that is robust to adaptive inputs that uses $\widetilde{O}\left(\frac{n}{R\sqrt{T}}\right)$ space for $M=O\left(\frac{R^2 T}{\log^2 n}\right)$, thereby showing a smooth space-regret trade-off. We then show a space lower bound of $\widetilde{\Omega}\left(\frac{nM}{RT}\right)$ for any randomized algorithm that achieves regret $R$ with probability $1-2^{-\Omega(T)}$, when the best expert makes $M$ mistakes. Our result implies that the natural deterministic algorithm, which iterates through pools of experts until each expert in the pool has erred, is optimal up to polylogarithmic factors. Finally, we empirically demonstrate the benefit of using robust procedures against a white-box adversary that has access to the internal state of the algorithm.
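As a point of reference for the memory constraint, the classical multiplicative-weights baseline below keeps one weight per expert, i.e., $\Omega(n)$ state, which is precisely what the space-bounded algorithms above avoid; the loss sequence and learning rate are toy choices.

```python
# Baseline multiplicative weights (Hedge) for the experts problem; note the
# full length-n weight vector, the memory cost the paper's algorithms reduce.
import numpy as np

rng = np.random.default_rng(6)
n, T = 50, 2000
eta = np.sqrt(np.log(n) / T)                # standard learning rate
w = np.ones(n)
losses = rng.uniform(size=(T, n))           # toy oblivious loss sequence

total = 0.0
for t in range(T):
    p = w / w.sum()                         # play the weighted mixture
    total += p @ losses[t]
    w *= np.exp(-eta * losses[t])           # multiplicative update

print("regret:", total - losses.sum(axis=0).min())   # O(sqrt(T log n))
```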

Noise-Adaptive Thompson Sampling for Linear Contextual Bandits
Ruitu Xu Yifei Min Tianhao Wang



Research question: How to develop an algorithm that effectively manages noise with unknown variance while ensuring provable guarantees for both worst-case constant-variance noise and deterministic reward scenarios.
Motivation: Linear contextual bandits are a fundamental model class with numerous real-world applications, and reward noise in practice is often heteroscedastic with unknown variance.
Method: The first noise-adaptive Thompson-sampling-style algorithm for linear contextual bandits with heteroscedastic noise is proposed, using a stratified sampling procedure to overcome the too-conservative optimism of linear Thompson sampling.
Results: The algorithm achieves a variance-dependent regret upper bound of $\widetilde O\big(d^{3/2} + d^{3/2} \sqrt{\sum_{t=1}^T \sigma_t^2}\big)$, recovering the existing $\widetilde O(d^{3/2}\sqrt{T})$ guarantee in the constant-variance regime and improving to $\widetilde O(d^{3/2})$ in the deterministic regime, a smooth interpolation in between.

Linear contextual bandits represent a fundamental class of models with numerous real-world applications, and it is critical to develop algorithms that can effectively manage noise with unknown variance, ensuring provable guarantees for both worst-case constant-variance noise and deterministic reward scenarios. In this paper, we study linear contextual bandits with heteroscedastic noise and propose the first noise-adaptive Thompson sampling-style algorithm that achieves a variance-dependent regret upper bound of $\widetilde O\Big(d^{3/2} + d^{3/2} \sqrt{\sum_{t=1}^T \sigma_t^2}\Big)$, where $d$ is the dimension of the context vectors and $\sigma_t^2$ is the variance of the reward in round $t$. This recovers the existing $\widetilde O(d^{3/2}\sqrt{T})$ regret guarantee in the constant-variance regime and further improves to $\widetilde O(d^{3/2})$ in the deterministic regime, thus achieving a smooth interpolation in between. Our approach utilizes a stratified sampling procedure to overcome the too-conservative optimism in the linear Thompson sampling algorithm for linear contextual bandits.
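
For context, below is a minimal sketch of the vanilla linear Thompson sampling baseline the paper improves upon; the noise-adaptive, stratified-sampling variant proposed in the paper is more involved. The `contexts` and `reward_fn` callables are illustrative assumptions.

```python
import numpy as np

def lin_ts(contexts, reward_fn, T, d, v=1.0, lam=1.0, seed=0):
    """Vanilla linear Thompson sampling (baseline sketch, not the paper's
    noise-adaptive variant): sample a parameter from a Gaussian posterior
    and play the arm that maximizes the sampled reward."""
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    rewards = []
    for t in range(T):
        X = np.asarray(contexts(t))            # (n_arms, d) context matrix
        mu = np.linalg.solve(V, b)             # ridge estimate
        theta = rng.multivariate_normal(mu, v**2 * np.linalg.inv(V))
        arm = int(np.argmax(X @ theta))
        r = reward_fn(t, arm)                  # noisy reward, variance sigma_t^2
        V += np.outer(X[arm], X[arm])
        b += r * X[arm]
        rewards.append(r)
    return rewards
```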

Sensitivity in Translation Averaging
Lalit Manam Venu Madhav Govindu



Research question: the sensitivity of translation averaging to small perturbations of the relative directions under uncertainty.
Motivation: while robustness to outliers and uniqueness of the solution have been studied extensively, this paper addresses a distinctly different issue, namely the sensitivity of translation averaging under uncertainty.
Method: first, analyze the sensitivity of estimating the scales corresponding to relative directions under small perturbations of those directions; then formally define the conditioning of the translation averaging problem, which assesses the reliability of the estimated translations based solely on the input directions, and give a sufficient criterion for the problem to be well-conditioned; finally, provide an efficient algorithm that identifies and removes combinations of directions that make the problem ill-conditioned while preserving uniqueness of the solution.
Results: the analysis is demonstrated in global structure-from-motion pipelines for 3D reconstruction, where filtering the ill-conditioned directions in translation averaging reduces translation errors, triangulates more 3D points, and speeds up the convergence of bundle adjustment.

In 3D computer vision, translation averaging solves for absolute translations given a set of pairwise relative translation directions. While there has been much work on robustness to outliers and studies on the uniqueness of the solution, this paper deals with a distinctly different problem of sensitivity in translation averaging under uncertainty. We first analyze sensitivity in estimating scales corresponding to relative directions under small perturbations of the relative directions. Then, we formally define the conditioning of the translation averaging problem, which assesses the reliability of estimated translations based solely on the input directions. We give a sufficient criterion to ensure that the problem is well-conditioned. Subsequently, we provide an efficient algorithm to identify and remove combinations of directions which make the problem ill-conditioned while ensuring uniqueness of the solution. We demonstrate the utility of such analysis in global structure-from-motion pipelines for obtaining 3D reconstructions, which reveals the benefits of filtering the ill-conditioned set of directions in translation averaging in terms of reduced translation errors, a higher number of 3D points triangulated and faster convergence of bundle adjustment.

BanditPAM++: Faster $k$-medoids Clustering
Mo Tiwari Ryan Kang Donghyun Lee Sebastian Thrun Ilan Shomorony Martin Jinye Zhang



Research question: how can the efficiency of $k$-medoids clustering be improved?
Motivation: $k$-medoids clustering offers better interpretability and handles exotic objects better than alternatives, but efficiency has been a major drawback.
Method: BanditPAM++, which accelerates BanditPAM by reusing clustering information within each iteration and across different iterations.
Results: BanditPAM++ is $O(k)$ faster than BanditPAM in complexity and runs over 10x faster than BanditPAM on the CIFAR10 dataset.

Clustering is a fundamental task in data science with wide-ranging applications. In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects in $k$-medoids clustering, respectively. $k$-medoids clustering has recently grown in popularity due to the discovery of more efficient $k$-medoids algorithms. In particular, recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy. In this paper, we present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime. First, we demonstrate that BanditPAM has a special structure that allows the reuse of clustering information $\textit{within}$ each iteration. Second, we demonstrate that BanditPAM has additional structure that permits the reuse of information $\textit{across}$ different iterations. These observations inspire our proposed algorithm, BanditPAM++, which returns the same clustering solutions as BanditPAM but often several times faster. For example, on the CIFAR10 dataset, BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$ faster. Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to reproduce all of our experiments via a one-line script is available at https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
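
As a point of reference for the objective BanditPAM++ optimizes, here is a naive PAM-style swap loop over a precomputed distance matrix. It is a sketch for illustrating the $k$-medoids objective only; BanditPAM++ reaches the same kind of solution while avoiding the exhaustive swap scan performed here.

```python
import numpy as np

def pam_k_medoids(D, k, max_iter=100):
    """Naive PAM-style k-medoids on a precomputed (n, n) distance matrix D.
    Shown only to illustrate the objective; BanditPAM++ returns the same
    kind of solution without the exhaustive swap scan done here."""
    n = D.shape[0]
    medoids = list(range(k))                  # arbitrary initialization
    for _ in range(max_iter):
        cost = D[:, medoids].min(axis=1).sum()
        best_cost, best_medoids = cost, None
        for m in range(k):                    # try every single-point swap
            for x in range(n):
                if x in medoids:
                    continue
                cand = medoids[:m] + [x] + medoids[m + 1:]
                c = D[:, cand].min(axis=1).sum()
                if c < best_cost:
                    best_cost, best_medoids = c, cand
        if best_medoids is None:              # no improving swap: converged
            break
        medoids = best_medoids
    return medoids
```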

Core-sets for Fair and Diverse Data Summarization
Sepideh Mahabadi Stojan Trajanovski



Research question: core-set construction algorithms for diversity maximization under fairness/partition constraints.
Motivation: given a point set $P$ in a metric space partitioned into $m$ groups, the goal is to pick $k_i$ points from each group $i$ such that the overall diversity of the $k$ selected points is maximized.
Method: consider two natural diversity measures, sum-of-pairwise distances and sum-of-nearest-neighbor distances, and give improved core-set construction algorithms for both.
Results: the first constant-factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the dataset size and the aspect ratio, and the first core-set w.r.t. sum-of-nearest-neighbor distances. Experiments show the effectiveness of the core-set approach: in a task of summarizing a set of timed messages, it achieves a 100x speed-up while losing only a few percent of diversity, and it also improves the space usage of the algorithm in the streaming setting.

We study core-set construction algorithms for the task of Diversity Maximization under fairness/partition constraint. Given a set of points $P$ in a metric space partitioned into $m$ groups, and given $k_1,\ldots,k_m$, the goal of this problem is to pick $k_i$ points from each group $i$ such that the overall diversity of the $k=\sum_i k_i$ picked points is maximized. We consider two natural diversity measures: sum-of-pairwise distances and sum-of-nearest-neighbor distances, and show improved core-set construction algorithms with respect to these measures. More precisely, we show the first constant factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the size of the dataset and the aspect ratio. Second, we show the first core-set w.r.t. the sum-of-nearest-neighbor distances. Finally, we run several experiments showing the effectiveness of our core-set approach. In particular, we apply constrained diversity maximization to summarize a set of timed messages that takes into account the messages' recency. Specifically, the summary should include more recent messages compared to older ones. This is a real task in one of the largest communication platforms, affecting the experience of hundreds of millions daily active users. By utilizing our core-set method for this task, we achieve a 100x speed-up while losing the diversity by only a few percent. Moreover, our approach allows us to improve the space usage of the algorithm in the streaming setting.

Agnostic Multi-Group Active Learning
Nicholas Rittler Kamalika Chaudhuri



Research question: how to use active learning to generalize to a collection of distributions while minimizing the number of label queries.
Motivation: interest in improving classification accuracy on rare or hard subsets of a population; in active learning, the learner may decide which examples are labeled from each distribution, and the goal is to minimize the number of label queries while maintaining PAC-learning guarantees.
Method: modify existing algorithms to obtain a consistent active learning method for an agnostic formulation of multi-group learning which, given a collection of $G$ distributions and a hypothesis class $\mathcal{H}$ with VC dimension $d$, outputs an $\epsilon$-optimal hypothesis using $\tilde{O}\left( (\nu^2/\epsilon^2) G d \theta_{\mathcal{G}}^2 \log^2(1/\epsilon) + G\log(1/\epsilon)/\epsilon^2 \right)$ label queries.

Inspired by the problem of improving classification accuracy on rare or hard subsets of a population, there has been recent interest in models of learning where the goal is to generalize to a collection of distributions, each representing a ``group''. We consider a variant of this problem from the perspective of active learning, where the learner is endowed with the power to decide which examples are labeled from each distribution in the collection, and the goal is to minimize the number of label queries while maintaining PAC-learning guarantees. Our main challenge is that standard active learning techniques such as disagreement-based active learning do not directly apply to the multi-group learning objective. We modify existing algorithms to provide a consistent active learning algorithm for an agnostic formulation of multi-group learning, which given a collection of $G$ distributions and a hypothesis class $\mathcal{H}$ with VC-dimension $d$, outputs an $\epsilon$-optimal hypothesis using $\tilde{O}\left( (\nu^2/\epsilon^2) G d \theta_{\mathcal{G}}^2 \log^2(1/\epsilon) + G\log(1/\epsilon)/\epsilon^2 \right)$ label queries, where $\theta_{\mathcal{G}}$ is the worst-case disagreement coefficient over the collection. Roughly speaking, this guarantee improves upon the label complexity of standard multi-group learning in regimes where disagreement-based active learning algorithms may be expected to succeed, and the number of groups is not too large. We also consider the special case where each distribution in the collection is individually realizable with respect to $\mathcal{H}$, and demonstrate $\tilde{O}\left( G d \theta_{\mathcal{G}} \log(1/\epsilon) \right)$ label queries are sufficient for learning in this case. We further give an approximation result for the full agnostic case inspired by the group realizable strategy.

Dynamic Pricing and Learning with Bayesian Persuasion
Shipra Agrawal Yiding Feng Wei Tang



Research question: a new dynamic pricing and learning setting in which, in addition to setting product prices in sequential rounds, the seller ex-ante commits to "advertising schemes".
Motivation: using the popular Bayesian persuasion framework to model the effect of these signals on the buyer's valuation and purchase response, the goal is to find an optimal design of the advertising scheme together with a pricing scheme that maximizes the seller's expected revenue.
Method: design an online algorithm that, without prior knowledge of the buyer's demand function, uses past purchase responses to adaptively learn the optimal pricing and advertising strategy.
Results: when the valuation function is linear in product quality, the algorithm achieves an $O(T^{2/3}(m \log T )^{1/3})$ regret bound; this requires some natural monotonicity and Lipschitz assumptions on the valuation function, but no Lipschitz or smoothness assumption on the buyer's demand function.

We consider a novel dynamic pricing and learning setting where in addition to setting prices of products in sequential rounds, the seller also ex-ante commits to ‘advertising schemes’. That is, in the beginning of each round the seller can decide what kind of signal they will provide to the buyer about the product’s quality upon realization. Using the popular Bayesian persuasion framework to model the effect of these signals on the buyers’ valuation and purchase responses, we formulate the problem of finding an optimal design of the advertising scheme along with a pricing scheme that maximizes the seller’s expected revenue. Without any apriori knowledge of the buyers’ demand function, our goal is to design an online algorithm that can use past purchase responses to adaptively learn the optimal pricing and advertising strategy. We study the regret of the algorithm when compared to the optimal clairvoyant price and advertising scheme. Our main result is a computationally efficient online algorithm that achieves an $O(T^{2/3}(m \log T )^{1/3})$ regret bound when the valuation function is linear in the product quality. Here $m$ is the cardinality of the discrete product quality domain and $T$ is the time horizon. This result requires some natural monotonicity and Lipschitz assumptions on the valuation function, but no Lipschitz or smoothness assumption on the buyers’ demand function. For constant $m$, our result matches the regret lower bound for dynamic pricing within logarithmic factors, which is a special case of our problem. We also obtain several improved results for the widely considered special case of additive valuations, including an $\tilde{O}(T^{2/3})$ regret bound independent of $m$ when $m\le T^{1/3}$.

No-Regret Learning with Unbounded Losses: The Case of Logarithmic Pooling
Eric Neyman Tim Roughgarden



Research question: how to aggregate the probability forecasts of $m$ experts over $n$ outcomes across $T$ time steps so as to attain a no-regret guarantee.
Motivation: logarithmic pooling, a weighted average of log odds, is in a certain sense the optimal pooling method under log loss, but learning the expert weights online in an adversarial setting has not been addressed.
Method: assume, by necessity, that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts, yielding a novel semi-adversarial setting, and learn the weights with an algorithm based on online mirror descent.
Results: the algorithm attains $O(\sqrt{T} \log T)$ expected regret compared with the best weights in hindsight.

For each of $T$ time steps, $m$ experts report probability distributions over $n$ outcomes; we wish to learn to aggregate these forecasts in a way that attains a no-regret guarantee. We focus on the fundamental and practical aggregation method known as *logarithmic pooling* -- a weighted average of log odds -- which is in a certain sense the optimal choice of pooling method if one is interested in minimizing log loss (as we take to be our loss function). We consider the problem of learning the best set of parameters (i.e. expert weights) in an online adversarial setting. We assume (by necessity) that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts. Imposing this constraint creates a (to our knowledge) novel semi-adversarial setting in which the adversary retains a large amount of flexibility. In this setting, we present an algorithm based on online mirror descent that learns expert weights in a way that attains $O(\sqrt{T} \log T)$ expected regret as compared with the best weights in hindsight.
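
Logarithmic pooling itself is simple to state in code; learning the weights online is the paper's subject. Below is a minimal sketch that computes the pooled distribution as a normalized weighted geometric mean, with the weights assumed given.

```python
import numpy as np

def log_pool(forecasts, weights):
    """Logarithmic pool: the aggregate is proportional to the weighted
    geometric mean prod_e p_e^{w_e} of the expert distributions.
    forecasts: (m, n) array of m experts' distributions over n outcomes."""
    logp = weights @ np.log(forecasts)     # weighted sum of log-probabilities
    p = np.exp(logp - logp.max())          # subtract max for stability
    return p / p.sum()

experts = np.array([[0.7, 0.3], [0.4, 0.6]])
print(log_pool(experts, np.array([0.5, 0.5])))   # ~[0.555, 0.445]
```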

Tight Bounds for Volumetric Spanners and Applications
Aditya Bhaskara Sepideh Mahabadi Ali Vakilian



Research question: determine tight bounds on the size of volumetric spanners, i.e., subsets of a point set through which every point can be expressed with small coefficients, for all $\ell_p$ norms.
Motivation: volumetric spanners (also referred to as well-conditioned bases) have found several applications, including bandit linear optimization, determinant maximization, and matrix low-rank approximation.
Method: give almost optimal size bounds for all $\ell_p$ norms and show that such spanners can be constructed using a simple local search procedure.
Results: the result applies to further tasks, in particular to finding coresets for the Minimum Volume Enclosing Ellipsoid (MVEE) problem.

Given a set of points of interest, a volumetric spanner is a subset of the points using which all the points can be expressed using "small" coefficients (measured in an appropriate norm). Formally, given a set of vectors $X = [v_1, v_2, \dots, v_n]$, the goal is to find $T \subseteq [n]$ such that every $v \in X$ can be expressed as $\sum_{i\in T} \alpha_i v_i$, with $\Vert \alpha \Vert$ being small. This notion, which has also been referred to as a well-conditioned basis, has found several applications, including bandit linear optimization, determinant maximization, and matrix low rank approximation. In this paper, we give almost optimal bounds on the size of volumetric spanners for all $\ell_p$ norms, and show that they can be constructed using a simple local search procedure. We then show the applications of our result to other tasks and in particular the problem of finding coresets for the Minimum Volume Enclosing Ellipsoid (MVEE) problem.

Online Convex Optimization with Unbounded Memory
Raunak Kumar Sarah Dean Robert Kleinberg



Research question: in many applications of online convex optimization (OCO), the learner's loss depends not only on the current decision but on the entire history of decisions so far; existing OCO frameworks and their generalizations fail to capture this long-term dependence.
Motivation: to address this, we propose a generalization of the OCO framework, "Online Convex Optimization with Unbounded Memory", that captures long-term dependence of current losses on past decisions.
Method: we introduce the notion of $p$-effective memory capacity, $H_p$, which quantifies the maximum influence of past decisions on current losses, and prove an $O(\sqrt{H_p T})$ upper bound on the policy regret together with a matching (worst-case) lower bound.
Results: using the framework, we derive regret bounds, and improve and simplify existing regret bound derivations, for a variety of online learning problems, including online linear control and an online variant of performative prediction.

Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in a convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their current decision. However, in many applications the learner's loss depends not only on the current decision but on the entire history of decisions until that point. The OCO framework and its existing generalizations do not capture this, and they can only be applied to many settings of interest after a long series of approximation arguments. They also leave open the question of whether the dependence on memory is tight because there are no non-trivial lower bounds. In this work we introduce a generalization of the OCO framework, ``Online Convex Optimization with Unbounded Memory'', that captures long-term dependence on past decisions. We introduce the notion of $p$-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on present losses. We prove an $O(\sqrt{H_p T})$ upper bound on the policy regret and a matching (worst-case) lower bound. As a special case, we prove the first non-trivial lower bound for OCO with finite memory~\citep{anavaHM2015online}, which could be of independent interest, and also improve existing upper bounds. We demonstrate the broad applicability of our framework by using it to derive regret bounds, and to improve and simplify existing regret bound derivations, for a variety of online learning problems including online linear control and an online variant of performative prediction.

Learning and Collusion in Multi-unit Auctions
Simina Branzei Mahsa Derakhshan Negin Golrezaei Yanjun Han



Research question: in a carbon auction, how licenses for CO2 emissions are allocated among multiple interested players.
Motivation: inspired by carbon auctions, we consider repeated multi-unit auctions with uniform pricing, which are widely used in practice.
Method: analyze these auctions in both the offline and online settings, designing efficient bidding algorithms with low regret and giving regret lower bounds.
Results: analyzing the quality of the equilibria in the two main variants of the auction shows that one variant is susceptible to collusion among the bidders while the other is not.

In a carbon auction, licenses for CO2 emissions are allocated among multiple interested players. Inspired by this setting, we consider repeated multi-unit auctions with uniform pricing, which are widely used in practice. Our contribution is to analyze these auctions in both the offline and online settings, by designing efficient bidding algorithms with low regret and giving regret lower bounds. We also analyze the quality of the equilibria in two main variants of the auction, finding that one variant is susceptible to collusion among the bidders while the other is not.

(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
Elan Rosenfeld Saurabh Garg



Research question: how to derive a new, (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data.
Motivation: existing methods are either vacuous in practice, or accurate on average but heavily underestimate the error for a sizeable fraction of shifts.
Method: we devise a "disagreement loss" for optimizing one multiclass classifier to disagree with another, from which the error upper bound is estimated.
Results: across a wide range of natural and synthetic distribution shift benchmarks, the method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines.

We derive a new, (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods are either vacuous in practice or accurate on average but heavily underestimate error for a sizeable fraction of shifts. In particular, the latter only give guarantees based on complex continuous measures such as test calibration, which cannot be identified without labels, and are therefore unreliable. Instead, our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100\% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous test error upper bounds. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a "disagreement loss" which is theoretically justified and performs better in practice. We expect this loss can serve as a drop-in replacement for future methods which require maximizing multiclass disagreement. Across a wide range of natural and synthetic distribution shift benchmarks, our method gives valid error bounds while achieving average accuracy comparable to—though not better than—competitive estimation baselines.

Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective
Jimmy Ba Murat A Erdogdu Taiji Suzuki Zhichao Wang Denny Wu



Research question: when the data contain low-dimensional structure, how large must the spike magnitude (the strength of the low-dimensional component) be for kernel methods, and for neural networks optimized by gradient descent, to learn the target function?
Motivation: to understand whether and how kernel methods and neural networks benefit from low-dimensional structure in the data under spiked covariance models.
Method: analyze the learning of a single-index target under spiked covariance data in the proportional asymptotic limit where the sample size $n$ and the dimension $d$ diverge jointly.
Results: for kernel ridge regression, $\beta \ge 1 - \frac{1}{p}$ is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, $\beta > 1 - \frac{1}{k}$ suffices; since $k \le p$ by definition, neural networks adapt to such structures more effectively.

We consider the learning of a single-index target function $f_*: \mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $$f_*(\boldsymbol{x}) = \textstyle\sigma_*(\frac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle), ~~ \boldsymbol{x}\overset{\small\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I_d} + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top), ~~ \theta\asymp d^{\beta} \text{ for } \beta\in[0,1), $$ where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and it depends on the projection of input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge: $n,d\to\infty, n/d\to\psi\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be, in order for $(i)$ kernel methods, $(ii)$ neural networks optimized by gradient descent, to learn $f_*$? We show that for kernel ridge regression, $\beta\ge 1-\frac{1}{p}$ is both sufficient and necessary. Whereas for two-layer neural networks trained with gradient descent, $\beta>1-\frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data. Further, since $k\le p$ by definition, neural networks can adapt to such structures more effectively.

Swap Agnostic Learning, or Characterizing Omniprediction via Multicalibration
Parikshit Gopalan Michael P. Kim Omer Reingold



Research question: this paper introduces and studies the notion of Swap Agnostic Learning.
Motivation: the problem is formalized as a game between a predictor and an adversary: the predictor selects a hypothesis, and the adversary responds by selecting a loss-minimizing hypothesis; despite the adversary's strength, the main result shows that Swap Agnostic Learning is feasible for any convex loss.
Method: feasibility is proved via an equivalence between Swap Agnostic Learning and swap variants of the recent notions of Omniprediction and Multicalibration.
Results: the study establishes further connections to the literature on Outcome Indistinguishability, revealing a unified notion that captures all existing notions of omniprediction and multicalibration.

We introduce and study the notion of Swap Agnostic Learning. The problem can be phrased as a game between a *predictor* and an *adversary*: first, the predictor selects a hypothesis $h$; then, the adversary plays in response, and for each level set of the predictor, selects a loss-minimizing hypothesis $c_v \in \mathcal{C}$; the predictor wins if $h$ competes with the adaptive adversary's loss. Despite the strength of the adversary, our main result demonstrates the feasibility of Swap Agnostic Learning for any convex loss. Somewhat surprisingly, the result follows by proving an *equivalence* between Swap Agnostic Learning and swap variants of the recent notions Omniprediction (ITCS'22) and Multicalibration (ICML'18). Beyond this equivalence, we establish further connections to the literature on Outcome Indistinguishability (STOC'20, ITCS'23), revealing a unified notion of OI that captures all existing notions of omniprediction and multicalibration.

Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
Xiang Ji Gen Li



Research question: learning the optimal policy in tabular infinite-horizon discounted Markov decision processes in the online setting.
Motivation: existing algorithms either fail to achieve regret optimality or incur high memory and computational costs; moreover, all existing optimal algorithms require a long burn-in time before reaching optimal sample efficiency, i.e., their optimality is not guaranteed unless the sample size exceeds a high threshold.
Method: address both open problems by introducing a model-free algorithm that employs variance reduction together with a novel technique that switches the execution policy in a slow-yet-adaptive manner.
Results: this is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a short burn-in time.

A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.

Global Optimality in Bivariate Gradient-based DAG Learning
Chang Deng Kevin Bello Pradeep Kumar Ravikumar Bryon Aragam



Research question: how to solve a class of non-convex optimization problems, in particular the statistical problem of learning an acyclic directed graphical model from data.
Motivation: existing work uses standard first-order optimization schemes, but proving the global optimality of such approaches has proven elusive; unlike other non-convex problems in the literature, this problem is not "benign" and possesses multiple spurious solutions in which standard methods can easily get trapped.
Method: a simple path-following optimization scheme, proved to converge globally to the global minimum of the population loss in the bivariate setting.
Results: the new scheme provably solves this non-convex problem of learning acyclic directed graphical models from data in the bivariate setting.

Recently, a new class of non-convex optimization problems motivated by the statistical problem of learning an acyclic directed graphical model from data has attracted significant interest. While existing work uses standard first-order optimization schemes to solve this problem, proving the global optimality of such approaches has proven elusive. The difficulty lies in the fact that unlike other non-convex problems in the literature, this problem is not "benign", and possesses multiple spurious solutions that standard approaches can easily get trapped in. In this paper, we prove that a simple path-following optimization scheme globally converges to the global minimum of the population loss in the bivariate setting.

Adaptive Selective Sampling for Online Prediction with Experts
Rui M. Castro Fredrik Hellström Tim van Erven



Research question: online prediction of a binary sequence with expert advice.
Motivation: to design label-efficient forecasting algorithms that use a selective sampling scheme to collect far fewer labels than standard procedures.
Method: for the general case without a perfect expert, prove best-of-both-worlds guarantees: the proposed forecasting algorithm queries sufficiently many labels in the worst case to obtain optimal regret guarantees, while querying far fewer labels in more benign settings.
Results: numerical experiments show that the normalized regret of the label-efficient forecaster can asymptotically match known minimax rates for pool-based active learning, suggesting it adapts optimally to benign settings.

We consider online prediction of a binary sequence with expert advice. For this setting, we devise label-efficient forecasting algorithms, which use a selective sampling scheme that enables collecting much fewer labels than standard procedures. For the general case without a perfect expert, we prove best-of-both-worlds guarantees, demonstrating that the proposed forecasting algorithm always queries sufficiently many labels in the worst case to obtain optimal regret guarantees, while simultaneously querying much fewer labels in more benign settings. Specifically, for a scenario where one expert is strictly better than the others in expectation, we show that the label complexity of the label-efficient forecaster is roughly upper-bounded by the square root of the number of rounds. Finally, we present numerical experiments empirically showing that the normalized regret of the label-efficient forecaster can asymptotically match known minimax rates for pool-based active learning, suggesting it can optimally adapt to benign settings.

Towards Characterizing the First-order Query Complexity of Learning (Approximate) Nash Equilibria in Zero-sum Matrix Games
Hedi Hadiji Sarah Sachs Tim van Erven Wouter M Koolen



Research question: the first-order query complexity of zero-sum $K\times K$ matrix games, in which players observe the expected pay-offs of all their possible actions under the randomized action played by their opponent.
Motivation: Rakhlin and Sridharan discovered that $\epsilon$-approximate Nash equilibria can be computed efficiently from $O(\frac{\ln K}{\epsilon})$ instead of $O(\frac{\ln K}{\epsilon^2})$ queries, yet the optimal number of queries, as a function of both $\epsilon$ and $K$, remains unknown.
Method: first, fully characterize the query complexity of learning exact equilibria ($\epsilon=0$) by showing that the number of queries required is linear in $K$; second, introduce a new lower-bound technique that yields lower bounds of order $\tilde\Omega(\log(\frac{1}{K\epsilon}))$ for any $\epsilon \leq \frac{1}{cK^4}$, where $c$ is a constant independent of $K$.
Results: for $\epsilon > 0$ the current upper bound stands at $O(\min(\frac{\ln(K)}{\epsilon}, K))$, and existing techniques cannot produce a matching lower bound, since hard matrices with entries in a known countable set can be fully identified by a single query.

In the first-order query model for zero-sum $K\times K$ matrix games, players observe the expected pay-offs for all their possible actions under the randomized action played by their opponent. This classical model has received renewed interest after the discovery by Rakhlin and Sridharan that $\epsilon$-approximate Nash equilibria can be computed efficiently from $O(\frac{\ln K}{\epsilon})$ instead of $O(\frac{\ln K}{\epsilon^2})$ queries. Surprisingly, the optimal number of such queries, as a function of both $\epsilon$ and $K$, is not known. We make progress on this question on two fronts. First, we fully characterise the query complexity of learning exact equilibria ($\epsilon=0$), by showing that they require a number of queries that is linear in $K$, which means that it is essentially as hard as querying the whole matrix, which can also be done with $K$ queries. Second, for $\epsilon > 0$, the current query complexity upper bound stands at $O(\min(\frac{\ln(K)}{\epsilon} , K))$. We argue that, unfortunately, obtaining a matching lower bound is not possible with existing techniques: we prove that no lower bound can be derived by constructing hard matrices whose entries take values in a known countable set, because such matrices can be fully identified by a single query. This rules out, for instance, reducing to an optimization problem over the hypercube by encoding it as a binary payoff matrix. We then introduce a new technique for lower bounds, which allows us to obtain lower bounds of order $\tilde\Omega(\log(\frac{1}{K\epsilon}))$ for any $\epsilon \leq 1 / (cK^4)$, where $c$ is a constant independent of $K$. We further discuss possible future directions to improve on our techniques in order to close the gap with the upper bounds.

No-Regret Learning in Dynamic Competition with Reference Effects Under Logit Demand
Mengzi Amy Guo Donghao Ying Javad Lavaei Zuo-Jun Shen



Research question: algorithm design in a competitive framework, with the primary goal of learning a stable equilibrium.
Motivation: two firms, each lacking information about its competitor, engage in dynamic price competition in an opaque marketplace, with consumer demand following a multinomial logit choice model that depends on the observed price and a reference price.
Method: the online projected gradient ascent algorithm (OPGA), in which the firms adjust prices using the first-order derivatives of their log-revenues obtained from the market feedback mechanism.
Results: although convergence of online games typically requires properties such as strong monotonicity and variational stability, under diminishing step-sizes the price and reference-price paths generated by OPGA converge to the unique stationary Nash equilibrium, achieving no-regret learning and a stable market, with a convergence rate of $\mathcal{O}(1/t)$ under appropriate step-sizes.

This work is dedicated to the algorithm design in a competitive framework, with the primary goal of learning a stable equilibrium. We consider the dynamic price competition between two firms operating within an opaque marketplace, where each firm lacks information about its competitor. The demand follows the multinomial logit (MNL) choice model, which depends on the consumers' observed price and their reference price, and consecutive periods in the repeated games are connected by reference price updates. We use the notion of stationary Nash equilibrium (SNE), defined as the fixed point of the equilibrium pricing policy for the single-period game, to simultaneously capture the long-run market equilibrium and stability. We propose the online projected gradient ascent algorithm (OPGA), where the firms adjust prices using the first-order derivatives of their log-revenues that can be obtained from the market feedback mechanism. Despite the absence of typical properties required for the convergence of online games, such as strong monotonicity and variational stability, we demonstrate that under diminishing step-sizes, the price and reference price paths generated by OPGA converge to the unique SNE, thereby achieving the no-regret learning and a stable market. Moreover, with appropriate step-sizes, we prove that this convergence exhibits a rate of $\mathcal{O}(1/t)$.
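
A minimal sketch of the OPGA update follows: each firm ascends the first-order derivative of its log-revenue with a diminishing step size and projects back onto the feasible price range. The `grad_log_rev` callable and the box projection are illustrative assumptions.

```python
import numpy as np

def opga(grad_log_rev, p0, T, p_min, p_max, c=1.0):
    """Online projected gradient ascent on log-revenue (sketch).
    grad_log_rev(t, p) is each firm's first-order derivative of its
    log-revenue at the current price vector p, as obtained from market
    feedback; step sizes diminish as c / t."""
    p = np.asarray(p0, dtype=float)
    for t in range(1, T + 1):
        g = np.asarray(grad_log_rev(t, p))
        p = np.clip(p + (c / t) * g, p_min, p_max)   # project onto price box
    return p
```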

Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities
Donghao Ying YUNKAI ZHANG Yuhao Ding Alec Koppel Javad Lavaei



Research question: safety in multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while each satisfies its own safety constraints.
Motivation: the state-action space grows exponentially with the number of agents, which challenges global observability, and the agents' safety constraints introduce global coupling.
Method: a primal-dual method utilizing shadow rewards and $\kappa$-hop neighbor truncation, where $\kappa$ is the communication radius; in the exact setting the algorithm converges to a first-order stationary point (FOSP) at a rate of $\mathcal{O}(T^{-2/3})$, and in the sample-based setting it requires $\widetilde{\mathcal{O}}(\epsilon^{-3.5})$ samples to achieve an $\epsilon$-FOSP with approximation error $\mathcal{O}(\phi_0^{2\kappa})$, where $\phi_0 \in (0,1)$.
Results: extensive numerical experiments demonstrate the effectiveness of the model.

We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by general utilities, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploration, or imitations. The exponential growth of the state-action space size with the number of agents presents challenges for global observability, further exacerbated by the global coupling arising from agents' safety constraints. To tackle this issue, we propose a primal-dual method utilizing shadow reward and $\kappa$-hop neighbor truncation under a form of correlation decay property, where $\kappa$ is the communication radius. In the exact setting, our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP with an approximation error of $\mathcal{O}(\phi_0^{2\kappa})$, where $\phi_0\in (0,1)$. Finally, we demonstrate the effectiveness of our model through extensive numerical experiments.

Accelerating Value Iteration with Anchoring
Jongmin Lee Ernest K. Ryu



Research question: finding a general acceleration mechanism that improves value iteration (VI) in the theory and practice of modern reinforcement learning.
Motivation: although VI is foundational to modern reinforcement learning, its optimal convergence rate was not known, and finding a general acceleration mechanism has been an open problem.
Method: Anc-VI, an accelerated value iteration method based on an anchoring mechanism (distinct from Nesterov's acceleration) that reduces the Bellman error faster than standard VI.
Results: Anc-VI exhibits a $\mathcal{O}(1/k)$ rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\ge 1-1/k$; a matching complexity lower bound establishes the optimality of the accelerated rate, and the anchoring mechanism provides the same benefit in the approximate VI and Gauss-Seidel VI setups.

Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at a $\mathcal{O}(\gamma^k)$-rate. Surprisingly, however, the optimal rate for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an \emph{anchoring} mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits a $\mathcal{O}(1/k)$-rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\ge 1-1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of $4$, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss--Seidel VI setups as well.
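
The anchoring mechanism can be sketched as a convex combination of the Bellman update with the starting point. The coefficient schedule $\beta_k = 1/(k+1)$ below is illustrative only; the paper derives the precise coefficients and rates.

```python
import numpy as np

def anchored_vi(bellman, V0, n_iters):
    """Anchored value-iteration sketch: each step is a convex combination
    of the Bellman update bellman(V) with the anchor V0. The schedule
    beta_k = 1/(k+1) is illustrative only."""
    V0 = np.asarray(V0, dtype=float)
    V = V0.copy()
    for k in range(1, n_iters + 1):
        beta = 1.0 / (k + 1)               # anchor weight decays over time
        V = beta * V0 + (1 - beta) * bellman(V)
    return V
```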

Robust Lipschitz Bandits to Adversarial Corruptions
Yue Kang Cho-Jui Hsieh Thomas Chun Man Lee



Research question: a new problem of Lipschitz bandits, i.e., stochastic bandits on a continuous arm set, in the presence of adversarial corruptions.
Motivation: standard stochastic bandit algorithms assume uncorrupted stochastic rewards, but in practice the rewards may be corrupted by an adaptive adversary.
Method: new robust Lipschitz bandit algorithms that handle adversarial corruptions and achieve sub-linear regret under both weak and strong adversaries, even when the total corruption budget is unknown to the agent.
Results: experiments illustrate the effectiveness of the algorithms against two classic kinds of attacks.

Lipschitz bandit is a variant of stochastic bandits that deals with a continuous arm set defined on a metric space, where the reward function is subject to a Lipschitz constraint. In this paper, we introduce a new problem of Lipschitz bandits in the presence of adversarial corruptions where an adaptive adversary corrupts the stochastic rewards up to a total budget $C$. The budget is measured by the sum of corruption levels across the time horizon $T$. We consider both weak and strong adversaries, where the weak adversary is unaware of the current action before the attack, while the strong one can observe it. Our work presents the first line of robust Lipschitz bandit algorithms that can achieve sub-linear regret under both types of adversary, even when the total budget of corruption $C$ is unrevealed to the agent. We provide a lower bound under each type of adversary, and show that our algorithm is optimal under the strong case. Finally, we conduct experiments to illustrate the effectiveness of our algorithms against two classic kinds of attacks.

Connected Superlevel Set in (Deep) Reinforcement Learning and its Application to Minimax Theorems
Sihan Zeng Thinh T. Doan Justin Romberg



Research question: improving the understanding of the optimization landscape of policy optimization problems in reinforcement learning.
Motivation: we find that the superlevel sets of the objective function with respect to the policy parameter are always connected, both in the tabular setting and under policies represented by a class of neural networks, and that the objective, as a function of the policy parameter and reward, satisfies a stronger "equiconnectedness" property.
Method: we apply the connectedness of these superlevel sets to derive minimax theorems for robust reinforcement learning.
Results: any minimax optimization program that is convex on one side and equiconnected on the other satisfies the minimax equality (i.e., has a Nash equilibrium); this is the first time such a result is established in the literature.

The aim of this paper is to improve the understanding of the optimization landscape for policy optimization problems in reinforcement learning. Specifically, we show that the superlevel set of the objective function with respect to the policy parameter is always a connected set both in the tabular setting and under policies represented by a class of neural networks. In addition, we show that the optimization objective as a function of the policy parameter and reward satisfies a stronger “equiconnectedness” property. To our best knowledge, these are novel and previously unknown discoveries. We present an application of the connectedness of these superlevel sets to the derivation of minimax theorems for robust reinforcement learning. We show that any minimax optimization program which is convex on one side and is equiconnected on the other side observes the minimax equality (i.e. has a Nash equilibrium). We find that this exact structure is exhibited by an interesting class of robust reinforcement learning problems under an adversarial reward attack, and the validity of its minimax equality immediately follows. This is the first time such a result is established in the literature.

Multi-Player Zero-Sum Markov Games with Networked Separable Interactions
Chanwoo Park Kaiqing Zhang Asuman E. Ozdaglar



Research question: a new class of Markov games, zero-sum Markov games with networked separable interactions (zero-sum NMGs), to model the local interaction structure in non-cooperative multi-agent sequential decision-making.
Motivation: to overcome the limitations of standard Markov games in handling structured local interactions.
Method: define zero-sum NMGs and identify the necessary and sufficient conditions under which a Markov game can be presented as one; show that in these games the set of Markov coarse correlated equilibria (CCE) collapses to the set of Markov Nash equilibria (NE); propose fictitious-play-type dynamics with convergence guarantees under a star-shaped network structure; and, for computing Markov non-stationary NE, design a series of value-iteration-based algorithms with finite-iteration guarantees.
Results: numerical experiments corroborate the theoretical results.

We study a new class of Markov games, \textit{(multi-player) zero-sum Markov Games} with \textit{Networked separable interactions} (zero-sum NMGs), to model the local interaction structure in non-cooperative multi-agent sequential decision-making. We define a zero-sum NMG as a model where the payoffs of the auxiliary games associated with each state are zero-sum and have some separable (i.e., polymatrix) structure across the neighbors over some interaction network. We first identify the necessary and sufficient conditions under which an MG can be presented as a zero-sum NMG, and show that the set of Markov coarse correlated equilibrium (CCE) collapses to the set of Markov Nash equilibrium (NE) in these games, in that the product of per-state marginalization of the former for all players yields the latter. Furthermore, we show that finding approximate Markov \emph{stationary} CCE in infinite-horizon discounted zero-sum NMGs is \texttt{PPAD}-hard, unless the underlying network has a ``star topology''. Then, we propose fictitious-play-type dynamics, the classical learning dynamics in normal-form games, for zero-sum NMGs, and establish convergence guarantees to Markov stationary NE under a star-shaped network structure. Finally, in light of the hardness result, we focus on computing a Markov \emph{non-stationary} NE and provide finite-iteration guarantees for a series of value-iteration-based algorithms. We also provide numerical experiments to corroborate our theoretical results.

Time-Reversed Dissipation Induces Duality Between Minimizing Gradient Norm and Function Value
Jaeyeon Kim Asuman E. Ozdaglar Chanwoo Park Ernest K. Ryu



Research question: how to efficiently minimize function values and gradient magnitudes in convex optimization.
Motivation: while Nesterov's 1983 work initiated the study of methods that efficiently minimize function values, methods that instead minimize the gradient magnitude, such as Kim and Fessler's OGM-G and Lee et al.'s FISTA-G, have recently drawn attention.
Method: H-duality, a one-to-one correspondence between methods that efficiently minimize function values and methods that efficiently minimize gradient magnitude; in continuous-time formulations, H-duality corresponds to reversing the time dependence of the dissipation/friction term.
Results: using H-duality, we obtain a clearer understanding of the symmetry between Nesterov's method and OGM-G, derive a new class of methods that efficiently reduce gradient magnitudes of smooth convex functions, and find a composite minimization method that is simpler and faster than FISTA-G.

In convex optimization, first-order optimization methods efficiently minimizing function values have been a central subject of study since Nesterov's seminal work of 1983. Recently, however, Kim and Fessler's OGM-G and Lee et al.'s FISTA-G have been presented as alternatives that efficiently minimize the gradient magnitude instead. In this paper, we present H-duality, which represents a surprising one-to-one correspondence between methods efficiently minimizing function values and methods efficiently minimizing gradient magnitude. In continuous-time formulations, H-duality corresponds to reversing the time dependence of the dissipation/friction term. To the best of our knowledge, H-duality is different from Lagrange/Fenchel duality and is distinct from any previously known duality or symmetry relations. Using H-duality, we obtain a clearer understanding of the symmetry between Nesterov's method and OGM-G, derive a new class of methods efficiently reducing gradient magnitudes of smooth convex functions, and find a new composite minimization method that is simpler and faster than FISTA-G.

Recovering Unbalanced Communities in the Stochastic Block Model with Application to Clustering with a Faulty Oracle
Chandra Sekhar Mukherjee Pan Peng Jiapeng Zhang



Research question: recovering communities of varying sizes in the stochastic block model (SBM) with unbalanced communities.
Motivation: while the balanced case of the SBM has been studied extensively, the SBM with unbalanced communities (arguably more relevant in practice) is still poorly understood.
Method: a simple SVD-based algorithm for recovering communities of varying sizes.
Results: when the probability parameters are constant, the size of the clusters recovered by the algorithm is nearly optimal; as a byproduct, we obtain an efficient clustering algorithm with sublinear query complexity in a faulty oracle model that detects all clusters larger than $\tilde{\Omega}(\sqrt{n})$, even in the presence of many small clusters.

The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. We improve upon a result of Ailon, Chen and Xu [ICML 2013; JMLR 2015] by removing the assumption that there is a large interval such that the sizes of clusters do not fall in, and also remove the dependency of the size of the recoverable clusters on the number of underlying clusters. We further complement our theoretical improvements with experimental comparisons. Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to poly-logarithmic factors) when the probability parameters are constant. As a byproduct, we obtain an efficient clustering algorithm with sublinear query complexity in a faulty oracle model, which is capable of detecting all clusters larger than $\tilde{\Omega}({\sqrt{n}})$, even in the presence of $\Omega(n)$ small clusters in the graph. In contrast, previous efficient algorithms that use a sublinear number of queries are incapable of recovering any large clusters if there are more than $\tilde{\Omega}(n^{2/5})$ small clusters.
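
A minimal sketch of the SVD-plus-clustering recipe the abstract describes: embed nodes via the top-$k$ singular subspace of the adjacency matrix, then cluster the rows with a plain Lloyd loop. This is illustrative and omits the paper's handling of unbalanced community sizes.

```python
import numpy as np

def svd_cluster(A, k, n_iter=50, seed=0):
    """Spectral sketch: embed nodes via the top-k singular subspace of the
    adjacency matrix A, then cluster rows with a plain Lloyd loop."""
    rng = np.random.default_rng(seed)
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    X = U[:, :k] * S[:k]                       # spectral embedding of nodes
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```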

Distributionally Robust Bayesian Optimization with $\varphi$-divergences
Hisham Husain Vu Nguyen Anton van den Hengel



Research question: robustness in data-driven settings where systems face uncertainty, in particular in Bayesian Optimization (BO), where uncertainty is multi-faceted yet only a limited number of works address this direction.
Motivation: the work of Kirschner et al. bridges the existing literature by casting the BO problem from the lens of Distributionally Robust Optimization (DRO), but it suffers from practical limitations such as finite-context assumptions, leaving open the question: can one devise a computationally tractable algorithm for solving this DRO-BO problem?
Method: we address this question in large generality by considering robustness against data shift in $\varphi$-divergences, which subsumes many popular choices such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence.
Results: the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds; experiments show that the method surpasses existing ones, attesting to the theoretical results.

The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet there only exists a limited number of works dedicated to this direction. In particular, there is the work of Kirschner et al., which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite contexts assumptions, leaving behind the main question \textit{Can one devise a computationally tractable algorithm for solving this DRO-BO problem}? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\varphi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.

Bayesian Active Causal Discovery with Multi-Fidelity Experiments
Zeyu Zhang Chaozhuo Li Xu Chen Xing Xie



Research question: active causal discovery when experiments can be performed with multi-fidelity oracles, where higher-fidelity experiments are more precise but expensive, and lower-fidelity ones are cheaper but less accurate.
Motivation: to make the most informative use of queries of differing precision in experiments.
Method: first, a mutual-information-based acquisition function determines which variable should be intervened on at which fidelity; then a cascading model captures the correlations between different fidelity oracles; the framework is further extended to the batch intervention scenario.
Results: by introducing a new concept called $\epsilon$-submodularity and designing a constraint-based fidelity model, the widely used greedy method is theoretically validated, and extensive experiments demonstrate the effectiveness of the model.

This paper studies the problem of active causal discovery when the experiments can be done based on multi-fidelity oracles, where higher fidelity experiments are more precise and expensive, while the lower ones are cheaper but less accurate. In this paper, we formally define the task of multi-fidelity active causal discovery, and design a probabilistic model for solving this problem. In specific, we first introduce a mutual-information based acquisition function to determine which variable should be intervened at which fidelity, and then a cascading model is proposed to capture the correlations between different fidelity oracles. Beyond the above basic framework, we also extend it to the batch intervention scenario. We find that the theoretical foundations behind the widely used and efficient greedy method do not hold in our problem. To solve this problem, we introduce a new concept called $\epsilon$-submodular, and design a constraint based fidelity model to theoretically validate the greedy method. We conduct extensive experiments to demonstrate the effectiveness of our model.

Optimal Extragradient-Based Algorithms for Stochastic Variational Inequalities with Separable Structure
Angela Yuan Chris Junchi Li Gauthier Gidel Michael Jordan Quanquan Gu Simon Shaolei Du



Research question: solving stochastic monotone variational inequalities with a separable structure.
Motivation: using a stochastic first-order oracle, propose a new algorithm, stochastic accelerated gradient-extragradient (AG-EG), for strongly monotone variational inequalities (VIs).
Method: combine the strengths of extragradient and Nesterov acceleration; by showing that the iterates remain in a bounded domain and applying scheduled restarting, prove that AG-EG achieves an optimal convergence rate for strongly monotone VIs.
Results: when specialized to bilinearly coupled strongly-convex-strongly-concave saddle-point problems, including bilinear games, the algorithm achieves fine-grained convergence rates matching the respective lower bounds, with the stochasticity characterized by an additive statistical error term that is optimal up to a constant prefactor.

We consider the problem of solving stochastic monotone variational inequalities with a separable structure using a stochastic first-order oracle. Building on standard extragradient for variational inequalities we propose a novel algorithm---stochastic \emph{accelerated gradient-extragradient} (AG-EG)---for strongly monotone variational inequalities (VIs). Our approach combines the strengths of extragradient and Nesterov acceleration. By showing that its iterates remain in a bounded domain and applying scheduled restarting, we prove that AG-EG has an optimal convergence rate for strongly monotone VIs. Furthermore, when specializing to the particular case of bilinearly coupled strongly-convex-strongly-concave saddle-point problems, including bilinear games, our algorithm achieves fine-grained convergence rates that match the respective lower bounds, with the stochasticity being characterized by an additive statistical error term that is optimal up to a constant prefactor.
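
For reference, the classical extragradient step that AG-EG builds on is sketched below on an unconstrained bilinear game; the Nesterov acceleration and scheduled restarting that make AG-EG optimal are omitted.

```python
import numpy as np

def extragradient(F, z0, eta, n_iters):
    """Classical extragradient for a monotone operator F: a look-ahead
    half-step, then an update using the look-ahead gradient."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iters):
        z_half = z - eta * F(z)        # extrapolation step
        z = z - eta * F(z_half)        # update at the extrapolated point
    return z

# Unconstrained bilinear game min_x max_y x^T A y: F(z) = (A y, -A^T x).
A = np.array([[1.0, 2.0], [0.5, 2.0]])
F = lambda z: np.concatenate([A @ z[2:], -A.T @ z[:2]])
print(extragradient(F, np.ones(4), eta=0.1, n_iters=500))   # -> near zero
```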

Asymptotically Optimal Quantile Pure Exploration for Infinite-Armed Bandits
Xiao-Yue Gong Mark Sellke



Research question: how to efficiently select, from infinitely many bandit arms generated i.i.d. from an unknown distribution, a single high-quality arm whose average reward is, with probability $1-\delta$, within $\varepsilon$ of the top $\eta$-fraction of arms.
Motivation: the classical PAC guarantee needs adaptation for infinite action sets; we consider both the fixed-confidence and fixed-budget settings, aiming respectively for optimal expected and fixed sample complexity.
Method: for fixed confidence, an algorithm with expected sample complexity $O\left(\frac{\log(1/\eta)\log(1/\delta)}{\eta\varepsilon^2}\right)$, which is optimal except for the $\log(1/\eta)$ factor, with the $\delta$-dependence closing a quadratic gap in the literature; for fixed budget, the asymptotically optimal sample complexity as $\delta \to 0$ is shown to be $c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ to leading order, equivalently, the optimal failure probability with exactly $N$ samples decays as $\exp\big(-(1\pm o(1))\frac{cN}{\log^2 N}\big)$.
Results: the constant $c$ depends explicitly on the problem parameters through a certain Fisher information distance; even the strictly super-linear dependence on $\log(1/\delta)$ was not previously known, resolving a question of Grossman and Moshkovitz (FOCS 2015).

We study pure exploration with infinitely many bandit arms generated i.i.d. from an unknown distribution. Our goal is to efficiently select a single high-quality arm whose average reward is, with probability $1-\delta$, within $\varepsilon$ of the top $\eta$-fraction of arms; this is a natural adaptation of the classical PAC guarantee for infinite action sets. We consider both the fixed confidence and fixed budget settings, aiming respectively for optimal \emph{expected} and \emph{fixed} sample complexity. For fixed confidence, we give an algorithm with expected sample complexity $O\left(\frac{\log (1/\eta)\log (1/\delta)}{\eta\varepsilon^2}\right)$. This is optimal except for the $\log (1/\eta)$ factor, and the $\delta$-dependence closes a quadratic gap in the literature. For fixed budget, we show the asymptotically optimal sample complexity as $\delta\to 0$ is $c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ to leading order; equivalently, the optimal failure probability with exactly $N$ samples decays as $\exp\big(-(1\pm o(1))\frac{cN}{\log^2 N}\big)$. The value of $c$ depends explicitly on the problem parameters (including the unknown arm distribution) through a certain Fisher information distance. Even the strictly super-linear dependence on $\log(1/\delta)$ was not known and resolves a question of Grossman-Moshkovitz (FOCS 2015).

Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation
Berivan Isik Wei-Ning Chen Ayfer Ozgur Tsachy Weissman Albert No



Research question: the mean estimation problem under communication and local differential privacy constraints.
Motivation: while previous work has proposed order-optimal algorithms for the same problem (i.e., asymptotically optimal as more bits are used), exact optimality in the non-asymptotic setting has not been achieved.
Method: characterize the exact-optimal approach in the presence of shared randomness (a random variable shared between the server and the user) and identify several conditions for exact optimality, one of which is to use a rotationally symmetric shared random codebook; based on this, propose a randomization mechanism whose codebook is a randomly rotated simplex, satisfying the properties of the exact-optimal codebook.
Results: the proposed mechanism is based on a $k$-closest encoding, which is proven to be exact-optimal for the randomly rotated simplex codebook.

We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed order-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), exact optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the exact-optimal approach in the presence of shared randomness (a random variable shared between the server and the user) and identify several conditions for exact optimality. We prove that one of the conditions is to utilize a rotationally symmetric shared random codebook. Based on this, we propose a randomization mechanism where the codebook is a randomly rotated simplex -- satisfying the properties of the exact-optimal codebook. The proposed mechanism is based on a $k$-closest encoding which we prove to be exact-optimal for the randomly rotated simplex codebook.

Training Neural Networks is NP-Hard in Fixed Dimension
Vincent Froese Christoph Hertrich



Research question: the parameterized complexity of training two-layer neural networks with ReLU and linear threshold activation functions, with respect to the input dimension and the number of hidden neurons.
Motivation: although the computational complexity of these problems has been studied numerous times in recent years, several questions remain open.
Method: answering questions of Arora et al. (ICLR 2018) and Khalife and Basu (IPCO 2022), both problems are shown to be NP-hard for two dimensions, which excludes any polynomial-time algorithm for constant dimension; answering a question of Froese et al. (JAIR 2022), W[1]-hardness is proved for four ReLUs (or two linear threshold neurons) with zero training error.
Results: in the ReLU case, fixed-parameter tractability is shown for the combined parameter of the number of dimensions and the number of ReLUs when the network is assumed to compute a convex map; these results settle the complexity status regarding these parameters almost completely.

We study the parameterized complexity of training two-layer neural networks with respect to the dimension of the input data and the number of hidden neurons, considering ReLU and linear threshold activation functions. Albeit the computational complexity of these problems has been studied numerous times in recent years, several questions are still open. We answer questions by Arora et al. (ICLR 2018) and Khalife and Basu (IPCO 2022) showing that both problems are NP-hard for two dimensions, which excludes any polynomial-time algorithm for constant dimension. We also answer a question by Froese et al. (JAIR 2022) proving W[1]-hardness for four ReLUs (or two linear threshold neurons) with zero training error. Finally, in the ReLU case, we show fixed-parameter tractability for the combined parameter number of dimensions and number of ReLUs if the network is assumed to compute a convex map. Our results settle the complexity status regarding these parameters almost completely.

Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy
Anastasia Koloskova Ryan McKenna Zachary Charles J Keith Rush Hugh Brendan McMahan



Research question: gradient descent in the presence of linearly correlated noise.
Motivation: motivated by recent practical methods for optimization with differential privacy, such as DP-FTRL, which perform well in settings where privacy amplification techniques are infeasible (such as federated learning); these methods inject privacy noise through a matrix factorization mechanism, making the noise linearly correlated over iterations. We propose a simplified setting that distills key facets of these methods and isolates the impact of linearly correlated noise.
Method: analyze the behavior of gradient descent in this setting, for both convex and non-convex functions.
Results: the analysis is demonstrably tighter than prior work and recovers multiple important special cases exactly (including anticorrelated perturbed gradient descent); the results are used to develop new, effective matrix factorizations for differentially private optimization, whose benefits are highlighted both theoretically and empirically.

We study gradient descent under linearly correlated noise. Our work is motivated by recent practical methods for optimization with differential privacy (DP), such as DP-FTRL, which achieve strong performance in settings where privacy amplification techniques are infeasible (such as in federated learning). These methods inject privacy noise through a matrix factorization mechanism, making the noise *linearly correlated* over iterations. We propose a simplified setting that distills key facets of these methods and isolates the impact of linearly correlated noise. We analyze the behavior of gradient descent in this setting, for both convex and non-convex functions. Our analysis is demonstrably tighter than prior work and recovers multiple important special cases exactly (including anticorrelated perturbed gradient descent). We use our results to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.
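
A minimal sketch of the setting, under the assumption that the noise injected at step $t$ is a fixed linear combination (row $C[t]$ of a factorization matrix) of i.i.d. Gaussian seeds: the identity $C = I$ recovers independent-noise gradient descent, while a banded $C$ with negative off-diagonal entries yields anticorrelated perturbations.

```python
import numpy as np

def gd_correlated_noise(grad, w0, eta, C, sigma=1.0, seed=0):
    """Gradient descent where the noise added at step t is the fixed
    linear combination C[t] of i.i.d. Gaussian seeds, mimicking a
    matrix-factorization noise mechanism (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    T = C.shape[0]
    seeds = sigma * rng.standard_normal((T, w.size))   # i.i.d. base noise
    for t in range(T):
        noise = C[t] @ seeds            # correlates noise across iterations
        w = w - eta * (grad(w) + noise)
    return w
```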

Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+\alpha$ Moments
Trung Dang Jasper C.H. Lee Maoyuan Song Paul Valiant



Research question: the fundamental statistical understanding of algorithms for mean estimation, with the aim of understanding the limits of what can be extracted from limited and valuable data.
Motivation: existing optimality results for mean estimation hold only in the worst case; we therefore study the problem through the lens of "beyond worst-case analysis".
Method: given a distribution $p$, we construct a distribution $q_{n,\delta}$ such that the means of $p$ and $q$ are well-separated, yet $p$ and $q$ cannot be distinguished with $n$ samples with probability $1-\delta$, and $q$ preserves the finiteness of the moments of $p$; moreover, the variance of $q$ is at most twice that of $p$ when it exists.
Results: no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate, matching the result of [Lee and Valiant, 2022]; we also introduce a new definitional framework, "neighborhood optimality", for analyzing the fine-grained optimality of algorithms.

There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the fundamental limits of what we can extract from limited and valuable data. The state of the art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [Lee and Valiant, 2022], attaining the optimal sub-Gaussian error constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [Bubeck, Cesa-Bianchi and Lugosi, 2013] and a matching lower bound by [Devroye, Lerasle, Lugosi, and Oliveira, 2016], characterizing the big-O optimal errors for distributions that have tails heavy enough that only a $1+\alpha$ moment exists for some $\alpha \in (0,1)$. Both of these results, however, are optimal only in the worst case. Motivated by the recent effort in the community to go "beyond the worst-case analysis" of algorithms, we initiate the fine-grained study of the mean estimation problem: Is it possible for algorithms to leverage *beneficial* features/quirks of their input distribution to *beat* the sub-Gaussian rate, without explicit knowledge of these features? We resolve this question, finding an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". Given a distribution $p$, assuming *only* that it has a finite mean and absent any additional assumptions, we show how to construct a distribution $q_{n,\delta}$ such that the means of $p$ and $q$ are well-separated, yet $p$ and $q$ are impossible to distinguish with $n$ samples with probability $1-\delta$, and $q$ further preserves the finiteness of moments of $p$. Moreover, the variance of $q$ is at most twice the variance of $p$ if it exists. The main consequence of our result is that, no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, up to constant factors, which matches the worst-case result of [Lee and Valiant, 2022]. More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak admissibility/Pareto optimality definitions. As an application of the new framework, we show that the median-of-means algorithm is neighborhood optimal, up to constant factors. It is an open question to find a neighborhood-optimal estimator *without* constant factor slackness.
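
The median-of-means estimator analyzed in the paper is easy to state; a minimal sketch:

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Split the sample into blocks, average each block, and return the
    median of the block means; robust to heavy-tailed data."""
    blocks = np.array_split(np.asarray(x), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(0)
data = rng.standard_t(df=2, size=10_000)   # heavy tails: finite mean, infinite variance
print(median_of_means(data, n_blocks=20))  # close to the true mean 0
```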

Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks
Roey Magen Ohad Shamir



Research question: the sample complexity of vector-valued linear predictors (parameterized by a matrix) and, more generally, neural networks.
Motivation: focusing on size-independent bounds, where only the Frobenius-norm distance of the parameters from a fixed reference matrix $W_0$ is controlled, the sample complexity behavior can be surprisingly different from the well-studied setting of scalar-valued linear predictors.
Method: prove several new size-independent sample complexity results for vector-valued linear predictors and feed-forward neural networks.
Results: the analysis yields new sample complexity bounds for feed-forward neural networks, tackles some open questions in the literature, and establishes a new convex linear prediction problem that is provably learnable without uniform convergence.

We provide several new results on the sample complexity of vector-valued linear predictors (parameterized by a matrix), and more generally neural networks. Focusing on size-independent bounds, where only the Frobenius norm distance of the parameters from some fixed reference matrix $W_0$ is controlled, we show that the sample complexity behavior can be surprisingly different than what we may expect considering the well-studied setting of scalar-valued linear predictors. This also leads to new sample complexity bounds for feed-forward neural networks, tackling some open questions in the literature, and establishing a new convex linear prediction problem that is provably learnable without uniform convergence.

Kernelized Reinforcement Learning with Order Optimal Regret Bounds
Sattar Vakili Julia Olkhovskaya



Research question: How to handle reinforcement learning problems with complex models and large state-action spaces effectively.
Motivation: Existing analyses typically cover settings with few state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that handle large state-action spaces with more general value functions, recent work has turned to nonlinear function approximation via kernel ridge regression.
Method: We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration for action-value functions represented in an RKHS, and prove the first order-optimal regret guarantees in this general setting.
Results: The bounds improve polynomially in the number of episodes over the state of the art; in particular, for highly non-smooth kernels (such as the Neural Tangent kernel or certain Matérn kernels), where existing results only give trivial superlinear regret, we obtain a sublinear regret bound that is order optimal whenever a matching lower bound is known.

Modern reinforcement learning (RL) has shown empirical success in various real world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, when the action-value function is represented by an RKHS. We prove the first order-optimal regret guarantees under a general setting. Our results show a significant polynomial in the number of episodes improvement over the state of the art. In particular, with highly non-smooth kernels (such as Neural Tangent kernel or some Matérn kernels) the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order optimal in the cases where a lower bound on regret is known (which includes the kernels mentioned above).

Closing the gap between the upper bound and lower bound of Adam's iteration complexity
Bohan Wang Jingwen Fu Huishuai Zhang Nanning Zheng Wei Chen



Research question: This paper addresses the convergence of the Adam optimizer in first-order optimization and seeks to close the gap between the upper and lower bounds on its iteration complexity.
Motivation: Despite a number of analyses of Adam's convergence, none of them meets the lower bound for first-order optimizers established by Arjevani et al., leaving a noticeable gap.
Method: The paper derives a new convergence guarantee for Adam, assuming only an $L$-smooth condition and bounded noise variance, thereby closing this gap.
Results: The guarantee remains valid across a broad range of hyperparameters; with properly chosen hyperparameters, the derived upper bound on Adam's iteration complexity matches the first-order lower bound, which is the first such tight upper bound for Adam's convergence.

Recently, Arjevani et al. [1] establish a lower bound of iteration complexity for the first-order optimization under an $L$-smooth condition and a bounded noise variance assumption. However, a thorough review of existing literature on Adam's convergence reveals a noticeable gap: none of them meet the above lower bound. In this paper, we close the gap by deriving a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption. Our results remain valid across a broad spectrum of hyperparameters. Especially with properly chosen hyperparameters, we derive an upper bound of the iteration complexity of Adam and show that it meets the lower bound for first-order optimizers. To the best of our knowledge, this is the first work to establish such a tight upper bound for Adam's convergence. Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate and to convert the first-order term in the Descent Lemma to the gradient norm, which may be of independent interest.

Conformalized matrix completion
Yu Gui Rina Barber Cong Ma



Research question: This paper addresses uncertainty quantification in matrix completion, i.e., estimating the missing entries of a data matrix.
Motivation: Although existing matrix completion algorithms estimate missing entries effectively, quantifying the uncertainty of these estimates has proved challenging, and existing methods are extremely sensitive to model misspecification.
Method: The paper proposes a distribution-free method for predictive inference that adapts the conformal prediction framework to matrix completion, providing prediction intervals with distribution-free validity guarantees regardless of the accuracy of the low-rank model.
Results: Experiments on simulated and real data show that the method is robust to model misspecification while matching the performance of existing model-based methods when the model is correct.

Matrix completion aims to estimate missing entries in a data matrix, using the assumption of a low-complexity structure (e.g., low-rankness) so that imputation is possible. While many effective estimation algorithms exist in the literature, uncertainty quantification for this problem has proved to be challenging, and existing methods are extremely sensitive to model misspecification. In this work, we propose a distribution-free method for predictive inference in the matrix completion problem. Our method adapts the framework of conformal prediction, which provides prediction intervals with guaranteed distribution-free validity in the setting of regression, to the problem of matrix completion. Our resulting method, conformalized matrix completion (cmc), offers provable predictive coverage regardless of the accuracy of the low-rank model. Empirical results on simulated and real data demonstrate that cmc is robust to model misspecification while matching the performance of existing model-based methods when the model is correct.
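
As a concrete illustration of the split-conformal skeleton underlying cmc, here is a hedged Python sketch; it omits the paper's handling of the entry-sampling mechanism, and the point estimate M_hat can come from any completion algorithm:

import numpy as np

def conformal_intervals(M_hat, calib_idx, calib_vals, alpha=0.1):
    # Split-conformal skeleton: calibrate on held-out observed entries
    # (calib_idx is an array of (i, j) pairs, calib_vals the entries),
    # then widen the point estimate M_hat by the residual quantile.
    residuals = np.abs(calib_vals - M_hat[calib_idx[:, 0], calib_idx[:, 1]])
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(residuals, level)
    return M_hat - q, M_hat + q  # entrywise lower/upper prediction bounds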

Momentum Provably Improves Error Feedback!
Ilyas Fatkhullin Alexander Tyurin Peter Richtárik



Research question: Training machine learning models in a distributed environment incurs high communication overhead, so modern algorithms rely on lossy communication compression; untreated compression errors propagate and can cause severely unstable behavior, including exponential divergence.
Motivation: Seide et al. (2014) proposed the error feedback mechanism EF14, which mitigates this issue effectively; yet despite steady algorithmic and theoretical progress on EF over the past decade, our understanding remains far from complete.
Method: We address one of the most pressing issues: in the canonical nonconvex setting, all known EF variants require very large batch sizes to converge, which can be prohibitive in practice. Our fix is remarkably simple: apply Polyak momentum to EF21, the latest incarnation of EF due to Richtárik et al. (2021). The resulting algorithm, EF21-SGDM, improves the communication and sample complexities of previous error feedback methods under standard smoothness and bounded-variance assumptions, without further strong assumptions such as bounded gradient dissimilarity.
Results: A double-momentum variant improves the complexities even further, and the proof appears novel even when compression is removed from the method, making the technique of independent interest for nonconvex stochastic optimization with Polyak momentum.

Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically, and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al. [2021] known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum.
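
The EF21-plus-momentum update is compact enough to sketch. The following single-worker Python fragment is our own simplification (the paper's method is distributed across n workers); stoch_grad and the Top-k compressor are stand-ins:

import numpy as np

def topk(v, k):
    # Top-k sparsifier: keep the k largest-magnitude coordinates.
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_sgdm_step(x, g, v, stoch_grad, lr=0.01, beta=0.9, k=10):
    # v: Polyak momentum estimate of the worker's stochastic gradient.
    v = beta * v + (1 - beta) * stoch_grad(x)
    # EF21: communicate only a compressed correction to the maintained
    # gradient estimate g (compress the difference, never the raw gradient).
    g = g + topk(v - g, k)
    x = x - lr * g
    return x, g, v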

Better Private Linear Regression Through Better Private Feature Selection
Travis Dick Jennifer Gillenwater Matthew Joseph



Research question: Existing differentially private linear regression methods typically require end users to precisely set data bounds or algorithmic hyperparameters, which users struggle to do without directly examining the data.
Motivation: To shift this burden from users to algorithms, the paper proposes a differentially private feature selection method based on Kendall rank correlation.
Method: The method performs differentially private feature selection before running the regression; the Kendall-rank-correlation-based selection extends "plug-and-play" private linear regression algorithms to higher-dimensional problems, with a utility guarantee for normally distributed features.
Results: Experiments on 25 datasets show that adding this private feature selection step before regression significantly broadens the applicability of plug-and-play private linear regression, at little additional cost in privacy, computation, or end-user decision-making.

Existing work on differentially private linear regression typically assumes that end users can precisely set data bounds or algorithmic hyperparameters. End users often struggle to meet these requirements without directly examining the data (and violating privacy). Recent work has attempted to develop solutions that shift these burdens from users to algorithms, but they struggle to provide utility as the feature dimension grows. This work extends these algorithms to higher-dimensional problems by introducing a differentially private feature selection method based on Kendall rank correlation. We prove a utility guarantee for the setting where features are normally distributed and conduct experiments across 25 datasets. We find that adding this private feature selection step before regression significantly broadens the applicability of ``plug-and-play'' private linear regression algorithms at little additional cost to privacy, computation, or decision-making by the end user.
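
A rough Python sketch of the idea, private feature selection via noisy Kendall correlations before regression, might look as follows; the noise scale is a placeholder for the paper's sensitivity analysis, not the exact mechanism:

import numpy as np
from scipy.stats import kendalltau

def private_feature_selection(X, y, k, eps, rng):
    # Score each feature by |Kendall tau| with the label, privatize the
    # scores with Laplace noise, and keep the k highest noisy scores.
    n, d = X.shape
    taus = np.array([kendalltau(X[:, j], y)[0] for j in range(d)])
    # Changing one row moves each tau by O(1/n); the d/(n*eps) scale is a
    # crude budget split across features, standing in for the paper's
    # calibrated mechanism.
    noisy = np.abs(taus) + rng.laplace(scale=d / (n * eps), size=d)
    return np.argsort(noisy)[-k:]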

Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness
Chung-En Tsai Ying-Ting Lin Yen-Huan Li



Research question: This paper introduces the first small-loss and gradual-variation regret bounds for online portfolio selection.
Motivation: These are the first data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses.
Method: The bounds are derived via novel smoothness characterizations of the logarithmic loss, a local-norm analysis of follow-the-regularized-leader (FTRL) with self-concordant regularizers (which need not be barriers), and an implicit variant of optimistic FTRL with the log-barrier.
Results: The proposed algorithms achieve sublinear regret in the worst case and logarithmic regret when the data is "easy," with per-round time almost linear in the number of investment alternatives.

This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is "easy," with per-round time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of following the regularized leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.

Replicability in Reinforcement Learning
Amin Karbasi Grigoris Velegkas Lin Yang Felix Zhou



Research question: This paper initiates the mathematical study of replicability as an algorithmic property in reinforcement learning, in the fundamental setting of discounted tabular MDPs with access to a generative model.
Motivation: Inspired by Impagliazzo et al. (2022), an RL algorithm is called replicable if, with high probability, it outputs the exact same policy after two executions on i.i.d. samples drawn from the generator when its internal randomness is shared.
Method: First, we give an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$, where $N$ is the number of state-action pairs. For the subclass of deterministic algorithms, we prove a lower bound of order $\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$. We then study the relaxation of replicability proposed by Kalavasis et al. (2023), called TV indistinguishability, and design a computationally efficient TV indistinguishable algorithm for policy estimation with sample complexity $\widetilde O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$; at the cost of $\exp(N)$ running time, these TV indistinguishable algorithms can be transformed into $\rho$-replicable ones without increasing their sample complexity. Finally, we introduce approximate replicability, which only requires the two output policies to be close under a suitable statistical divergence (e.g., Rényi).
Results: Under approximate replicability, the sample complexity improves to $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.

We initiate the mathematical study of replicability as an algorithmic property in the context of reinforcement learning (RL). We focus on the fundamental setting of discounted tabular MDPs with access to a generative model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions on i.i.d. samples drawn from the generator when its internal randomness is the same. We first provide an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$, where $N$ is the number of state-action pairs. Next, for the subclass of deterministic algorithms, we provide a lower bound of order $\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$. Then, we study a relaxed version of replicability proposed by Kalavasis et al. [2023] called TV indistinguishability. We design a computationally efficient TV indistinguishable algorithm for policy estimation whose sample complexity is $\widetilde O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$. At the cost of $\exp(N)$ running time, we transform these TV indistinguishable algorithms to $\rho$-replicable ones without increasing their sample complexity. Finally, we introduce the notion of approximate-replicability where we only require that two outputted policies are close under an appropriate statistical divergence (e.g., Renyi) and show an improved sample complexity of $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.

Replicable Clustering
Hossein Esfandiari Amin Karbasi Vahab Mirrokni Grigoris Velegkas Felix Zhou



Research question: How to design clustering algorithms that are replicable in the statistical sense.
Motivation: Under the recent definition of Impagliazzo et al. (2022), a clustering algorithm is replicable if, when its internal randomness is shared across executions, its output induces the exact same partition of the sample space with high probability after two executions on different inputs drawn from the same distribution.
Method: The paper proposes such algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems, using approximation routines for their combinatorial counterparts in a black-box manner.
Results: We obtain a replicable $O(1)$-approximation algorithm for statistical Euclidean $k$-medians ($k$-means) with $\operatorname{poly}(d)$ sample complexity, and an $O(1)$-approximation algorithm with an additional $O(1)$ additive error for statistical Euclidean $k$-centers, albeit with $\exp(d)$ sample complexity; experiments on 2D synthetic distributions using the $k$-means++ implementation from sklearn as a black box validate the theory.

We design replicable algorithms in the context of statistical clustering under the recently introduced notion of replicability from Impagliazzo et al. [2022]. According to this definition, a clustering algorithm is replicable if, with high probability, its output induces the exact same partition of the sample space after two executions on different inputs drawn from the same distribution, when its internal randomness is shared across the executions. We propose such algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their combinatorial counterparts in a black-box manner. In particular, we demonstrate a replicable $O(1)$-approximation algorithm for statistical Euclidean $k$-medians ($k$-means) with $\operatorname{poly}(d)$ sample complexity. We also describe an $O(1)$-approximation algorithm with an additional $O(1)$-additive error for statistical Euclidean $k$-centers, albeit with $\exp(d)$ sample complexity. In addition, we provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
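
The replicability target can be illustrated with the same sklearn k-means++ black box the experiments use: two executions on fresh samples from one distribution, with shared internal randomness. Vanilla k-means++ offers no guarantee that the two outputs coincide; the paper's algorithms are built so that, with high probability, they do:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def sample(n):
    # A fresh draw from one fixed 2D mixture distribution.
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    return centers[rng.integers(0, 3, size=n)] + rng.normal(scale=0.5, size=(n, 2))

# Two executions on *different* samples with *shared* internal randomness
# (same random_state), mirroring the replicability definition.
km1 = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=7).fit(sample(2000))
km2 = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=7).fit(sample(2000))
print(np.round(km1.cluster_centers_, 2))  # a replicable algorithm would make these
print(np.round(km2.cluster_centers_, 2))  # coincide with high probability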

Optimization of Inter-group criteria for clustering with minimum size constraints
Eduardo Sany Laber Lucas Murtinho



Research question: Internal measures of clustering quality usually consider intra-group and/or inter-group criteria, but the optimization of inter-group criteria is far less understood.
Motivation: The literature offers many algorithms with provable approximation guarantees for intra-group criteria; much less is known about maximizing inter-group criteria such as the minimum spacing and the minimum spanning tree spacing.
Method: We devise algorithms with provable guarantees for maximizing these two natural inter-group criteria, covering both the unrestricted case and a constrained case where every group must contain a minimum number of points; the constraint addresses the tendency of the popular Single-Linkage method, which optimizes both criteria in the unrestricted case, to produce many tiny clusters.
Results: An empirical study on 10 real datasets provides evidence that the methods work very well in practical settings.

Internal measures that are used to assess the quality of a clustering usually take into account intra-group and/or inter-group criteria. There are many papers in the literature that propose algorithms with provable approximation guarantees for optimizing the former. However, the optimization of inter-group criteria is much less understood. Here, we contribute to the state-of-the-art of this literature by devising algorithms with provable guarantees for the maximization of two natural inter-group criteria, namely the minimum spacing and the minimum spanning tree spacing. The former is the minimum distance between points in different groups while the latter captures separability through the cost of the minimum spanning tree that connects all groups. We obtain results for both the unrestricted case, in which no constraint on the clusters is imposed, and for the constrained case where each group is required to have a minimum number of points. Our constraint is motivated by the fact that the popular Single-Linkage, which optimizes both criteria in the unrestricted case, produces clusterings with many tiny groups. To complement our work, we present an empirical study with 10 real datasets that provides evidence that our methods work very well in practical settings.

Optimal Time Complexities of Parallel Stochastic Optimization Methods Under a Fixed Computation Model
Alexander Tyurin Peter Richtárik



Research question: What are the minimax time complexities of parallel stochastic optimization methods under a fixed computation model?
Motivation: Parallelization is a popular way to speed up optimization, but while minimax complexities are well understood for sequential methods, the theory of parallel optimization methods is much less explored.
Method: We propose a new protocol that generalizes the classical oracle framework and use it to establish minimax complexities for parallel methods with access to an unbiased stochastic gradient oracle with bounded variance, proving lower bounds and developing optimal algorithms that attain them.
Results: The results have surprising consequences for the literature on asynchronous optimization methods.

Parallelization is a popular strategy for improving the performance of methods. Optimization methods are no exception: design of efficient parallel optimization methods and tight analysis of their theoretical properties are important research endeavors. While the minimax complexities are well known for sequential optimization methods, the theory of parallel optimization methods is less explored. In this paper, we propose a new protocol that generalizes the classical oracle framework approach. Using this protocol, we establish minimax complexities for parallel optimization methods that have access to an unbiased stochastic gradient oracle with bounded variance. We consider a fixed computation model characterized by each worker requiring a fixed but worker-dependent time to calculate a stochastic gradient. We prove lower bounds and develop optimal algorithms that attain them. Our results have surprising consequences for the literature of asynchronous optimization methods.

2Direction: Theoretically Faster Distributed Training with Bidirectional Communication Compression
Alexander Tyurin Peter Richtárik



Research question: This paper studies distributed convex optimization when communication between the server and the workers is expensive in both the uplink and downlink directions.
Motivation: The error feedback mechanisms that are so successful in the design of efficient non-accelerated methods turn out to be unsuitable for accelerated methods, so a new approach is needed.
Method: We develop 2Direction, a new and provably accelerated method based on fast bidirectional compressed communication and a new bespoke error feedback mechanism.
Results: We prove that, in the $\mu$-strongly convex setting, 2Direction improves the previous state-of-the-art communication complexity $\widetilde{\Theta}\left(K \times \left(\frac{L}{\alpha \mu} + \frac{L_{\max} \omega}{n \mu} + \omega\right)\right)$ to $\widetilde{\Theta}\left(K \times \left(\sqrt{\frac{L (\omega + 1)}{\alpha \mu}} + \sqrt{\frac{L_{\max} \omega^2}{n \mu}} + \frac{1}{\alpha} + \omega\right)\right)$, making it the first method to improve on the communication complexity of vanilla accelerated gradient descent (AGD); similar improvements hold in the general convex setting.

We consider distributed convex optimization problems in the regime when the communication between the server and the workers is expensive in both uplink and downlink directions. We develop a new and provably accelerated method, which we call 2Direction, based on fast bidirectional compressed communication and a new bespoke error-feedback mechanism which may be of independent interest. Indeed, we find that the EF and EF21-P mechanisms (Seide et al., 2014; Gruntkowska et al., 2023) that have considerable success in the design of efficient non-accelerated methods are not appropriate for accelerated methods. In particular, we prove that 2Direction improves the previous state-of-the-art communication complexity $\widetilde{\Theta}\left(K \times \left(\frac{L}{\alpha \mu} + \frac{L_{\max} \omega}{n \mu} + \omega\right)\right)$ (Gruntkowska et al., 2023) to $\widetilde{\Theta}(K \times (\sqrt{\frac{L (\omega + 1)}{\alpha \mu}} + \sqrt{\frac{L_{\max} \omega^2}{n \mu}} + \frac{1}{\alpha} + \omega))$ in the $\mu$--strongly-convex setting, where $L$ and $L_{\max}$ are smoothness constants, $n$ is \# of workers, $\omega$ and $\alpha$ are compression errors of the Rand$K$ and Top$K$ sparsifiers (as examples), $K$ is \# of coordinates/bits that the server and workers send to each other. Moreover, our method is the first that improves upon the communication complexity of the vanilla accelerated gradient descent method (AGD). We obtain similar improvements in the general convex regime as well. Finally, our theoretical findings are corroborated by experimental evidence.

How many samples are needed to leverage smoothness?
Vivien Cabannes Stefano Vigogna



Research question: This paper studies the learnability of smooth target functions in statistical learning when the ratio between the number of samples and the input dimension is small.
Motivation: In many machine learning problems this ratio is relatively small, making it hard to obtain meaningful estimates of high-order derivatives, which hinders learning of smooth target functions.
Method: The paper formalizes this intuition by deriving new lower bounds on the generalization error, and investigates the role of constants and transitory regimes, which are usually left out of classical learning theory statements yet play a dominant role in practice.
Results: The lower bounds quantify how many samples are needed before smoothness can actually be leveraged, showing that constants and transitory regimes dominate in the practically relevant sample-size range.

A core principle in statistical learning is that smoothness of target functions allows one to break the curse of dimensionality. However, learning a smooth function seems to require enough samples close to one another to get meaningful estimates of high-order derivatives, which would be hard in machine learning problems where the ratio between number of data and input dimension is relatively small. By deriving new lower bounds on the generalization error, this paper formalizes such an intuition, before investigating the role of constants and transitory regimes which are usually not depicted beyond classical learning theory statements while they play a dominant role in practice.

Demographic Parity Constrained Minimax Optimal Regression under Linear Model
Kazuto Fukuchi Jun Sakuma



Research question: This study characterizes the minimax optimal error of regression under a demographic parity constraint in a linear model.
Motivation: The proposed model covers a broader range of discriminatory bias sources than the model of Chzhen and Schreuder.
Method: The model captures sources of discriminatory bias through the number of demographic groups induced by sensitive attributes.
Results: The minimax optimal error is characterized as $\Theta(\frac{dM}{n})$, where $n$ is the sample size, $d$ the dimension, and $M$ the number of demographic groups; moreover, the minimax error increases as the bias present in the model grows.

We explore the minimax optimal error associated with a demographic parity-constrained regression problem within the context of a linear model. Our proposed model encompasses a broader range of discriminatory bias sources compared to the model presented by Chzhen and Schreuder. Our analysis reveals that the minimax optimal error for the demographic parity-constrained regression problem under our model is characterized by $\Theta(\frac{dM}{n})$, where $n$ denotes the sample size, $d$ represents the dimensionality, and $M$ signifies the number of demographic groups arising from sensitive attributes. Moreover, we demonstrate that the minimax error increases in conjunction with a larger bias present in the model.

No-regret Algorithms for Fair Resource Allocation
Abhishek Sinha Ativ Joshi Rajarshi Bhattacharjee Cameron N Musco Mohammad Hajiesmaili



Research question: How to allocate resources fairly in the no-regret setting against an unrestricted adversary.
Motivation: The difficulty stems from the non-separable nature of the global $\alpha$-fairness function; no online policy can achieve sublinear standard regret for this problem.
Method: We propose an efficient online resource allocation policy called Online Fair Allocation (OFA).
Results: OFA achieves sublinear $c_\alpha$-approximate regret, with a surprising phase transition from a power law to a constant at the critical exponent $\alpha = 1/2$; this also resolves an open problem on efficient no-regret policies for online job scheduling in certain parameter regimes. Along the way, we introduce new algorithmic and analytical techniques, including greedy estimation of future gradients for non-additive global reward functions and bootstrapping of second-order regret bounds.

We consider a fair resource allocation problem in the no-regret setting against an unrestricted adversary. The objective is to allocate resources equitably among several agents in an online fashion so that the difference of the aggregate $\alpha$-fair utilities of the agents achieved by an optimal static clairvoyant allocation and the online policy grows sublinearly with time. The problem inherits its difficulty from the non-separable nature of the global $\alpha$-fairness function. Previously, it was shown that no online policy could achieve a sublinear standard regret in this problem. In this paper, we propose an efficient online resource allocation policy, called Online Fair Allocation ($\texttt{OFA}$), that achieves sublinear $c_\alpha$-approximate regret with approximation factor $c_\alpha=(1-\alpha)^{-(1-\alpha)}\leq 1.445,$ for $0\leq \alpha < 1$. Our upper bound on the $c_\alpha$-regret for this problem exhibits a surprising \emph{phase transition} phenomenon -- transitioning from a power-law to a constant at the critical exponent $\alpha=\frac{1}{2}.$ Our result also resolves an open problem in designing an efficient no-regret policy for the online job scheduling problem in certain parameter regimes. Along the way, we introduce new algorithmic and analytical techniques, including greedy estimation of the future gradients for non-additive global reward functions and bootstrapping second-order regret bounds, which may be of independent interest.

Block Broyden's Methods for Solving Nonlinear Equations
Chengchang Liu Cheng Chen Luo Luo John C.S. Lui



Research question: This paper studies quasi-Newton methods for solving nonlinear equations.
Motivation: To improve the efficiency and accuracy of nonlinear equation solvers, improved variants of Broyden's methods are proposed.
Method: Block versions of both the good and the bad Broyden's methods are proposed: the block good method exploits multiple rank corrections to the Jacobian estimator to accelerate convergence, while the block bad method directly estimates the inverse of the Jacobian to reduce the per-iteration computational cost.
Results: The theoretical analysis explains why the good Broyden's method outperforms the bad one in most cases, and experiments confirm the superiority of the proposed methods and validate the theory.

This paper studies quasi-Newton methods for solving nonlinear equations. We propose block variants of both good and bad Broyden's methods, which enjoy explicit local superlinear convergence rates. Our block good Broyden's method has faster condition-number-free convergence rate than existing Broyden's methods because it takes the advantage of multiple rank modification on the Jacobian estimator. On the other hand, our block bad Broyden's method directly estimates the inverse of the Jacobian provably, which reduces the computational cost of the iteration. Our theoretical results provide some new insights on why good Broyden's method outperforms bad Broyden's method in most of the cases. The empirical results also demonstrate the superiority of our methods and validate our theoretical analysis.
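
For reference, the classical rank-one "good" Broyden update that the block variants generalize (to rank-k corrections per step) can be sketched in a few lines of Python; the test problem is our own:

import numpy as np

def good_broyden(F, x, J, tol=1e-10, max_iter=100):
    # Rank-one "good" Broyden iteration for F(x) = 0, maintaining an
    # estimate J of the Jacobian via the secant condition J dx = df.
    for _ in range(max_iter):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        dx = np.linalg.solve(J, -fx)  # quasi-Newton step
        x_new = x + dx
        df = F(x_new) - fx
        J = J + np.outer(df - J @ dx, dx) / (dx @ dx)  # secant update
        x = x_new
    return x

F = lambda x: np.array([x[0]**2 + x[1] - 3, x[0] + x[1]**2 - 3])
print(good_broyden(F, x=np.array([1.0, 1.0]), J=np.eye(2)))  # -> roughly [1.303, 1.303]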

Online Learning under Adversarial Nonlinear Constraints
Pavel Kolev Georg Martius Michael Muehlebach



Research question: How can online learning systems effectively process continuous, non-stationary data streams?
Motivation: Many applications require learning systems to process continuously evolving data streams while facing adversarial, time-varying nonlinear constraints.
Method: We propose Constraint Violation Velocity Projection (CVV-Pro), an algorithm that relies only on local sparse linear approximations of the feasible set and thus avoids optimizing over the entire set at each iteration.
Results: CVV-Pro achieves $\sqrt{T}$ regret and converges to the feasible set at a rate of $1/\sqrt{T}$, even though the feasible set is slowly time-varying and a priori unknown to the learner.

In many applications, learning systems are required to process continuous non-stationary data streams. We study this problem in an online learning framework and propose an algorithm that can deal with adversarial time-varying and nonlinear constraints. As we show in our work, the algorithm called Constraint Violation Velocity Projection (CVV-Pro) achieves $\sqrt{T}$ regret and converges to the feasible set at a rate of $1/\sqrt{T}$, despite the fact that the feasible set is slowly time-varying and a priori unknown to the learner. CVV-Pro only relies on local sparse linear approximations of the feasible set and therefore avoids optimizing over the entire set at each iteration, which is in sharp contrast to projected gradients or Frank-Wolfe methods. We also empirically evaluate our algorithm on two-player games, where the players are subjected to a shared constraint.

(S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability
Mathieu Even Scott Pesme Suriya Gunasekar Nicolas Flammarion



Research question: This paper investigates the impact of stochasticity and large stepsizes on the implicit regularization of gradient descent (GD) and stochastic gradient descent (SGD) over $2$-layer diagonal linear networks.
Motivation: To understand how stochasticity and stepsize affect the recovered solution, in particular for sparse regression problems and in the "edge of stability" regime.
Method: We prove the convergence of GD and SGD with macroscopic stepsizes in an overparameterized regression setting and characterize their solutions through an implicit regularization problem.
Results: Large stepsizes consistently benefit SGD for sparse regression, whereas they can hinder the recovery of sparse solutions by GD; these effects are magnified for stepsizes in a tight window just below the divergence threshold, in the "edge of stability" regime.

In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2$-layer diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the ``edge of stability'' regime. Our findings are supported by experimental results.

Toward Better PAC-Bayes Bounds for Uniformly Stable Algorithms
Sijia Zhou Yunwen Lei Ata Kaban



Research question: Provide sharper bounds for uniformly stable randomized algorithms within a PAC-Bayesian framework.
Motivation: To improve existing results by up to a factor of $\sqrt{n}$ (ignoring a log factor), where $n$ is the sample size.
Method: Bound the moment generating function of the generalization gap using the concentration of weakly dependent random variables due to Bousquet et al. (2020), and introduce a sub-exponential stability parameter assumption that enables applications to stochastic gradient descent and randomized coordinate descent.
Results: The results eliminate the strong convexity requirement of prior work and hold for non-smooth convex problems.

We give sharper bounds for uniformly stable randomized algorithms in a PAC-Bayesian framework, which improve the existing results by up to a factor of $\sqrt{n}$ (ignoring a log factor), where $n$ is the sample size. The key idea is to bound the moment generating function of the generalization gap using concentration of weakly dependent random variables due to Bousquet et al (2020). We introduce an assumption of sub-exponential stability parameter, which allows a general treatment that we instantiate in two applications: stochastic gradient descent and randomized coordinate descent. Our results eliminate the requirement of strong convexity from previous results, and hold for non-smooth convex problems.

On the Convergence of Black-Box Variational Inference
Kyurae Kim Jisu Oh Kaiwen Wu Yian Ma Jacob R. Gardner



Research question: This paper provides the first convergence guarantee for black-box variational inference (BBVI).
Motivation: Earlier investigations analyzed only simplified versions of BBVI (e.g., bounded domains, bounded support, optimizing only the scale), whereas this setting requires no such algorithmic modifications.
Method: Convergence is proved for BBVI with the reparameterization gradient, for log-smooth posterior densities with and without strong log-concavity, and for the location-scale variational family.
Results: The analysis reveals that certain common design choices, such as nonlinear parameterizations of the scale matrix, can lead to suboptimal convergence rates; running BBVI with proximal stochastic gradient descent removes these limitations and attains the strongest known guarantees, an insight evaluated by comparing proximal SGD against standard BBVI implementations on large-scale Bayesian inference problems.

We provide the first convergence guarantee for black-box variational inference (BBVI) with the reparameterization gradient. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Notably, our analysis reveals that certain algorithm design choices commonly employed in practice, such as nonlinear parameterizations of the scale matrix, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations and thus achieves the strongest known convergence guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.

On Private and Robust Bandits
Yulian Wu Xingyu Zhou Youming Tao Di Wang



Research question: We study private and robust multi-armed bandits (MABs), where the agent receives Huber-contaminated heavy-tailed rewards while needing to ensure differential privacy.
Motivation: To characterize the information-theoretic limits of regret with respect to the privacy budget, the contamination level, and the heavy-tailedness, under both the finite $k$-th raw moment and the finite $k$-th central moment settings with $k \ge 2$.
Method: We first present a minimax lower bound, then propose a meta-algorithm built on a private and robust mean estimation subroutine PRM that relies on reward truncation and the Laplace mechanism, with truncation-based and histogram-based instantiations of PRM for the two heavy-tailed settings.
Results: The resulting algorithms achieve nearly optimal regrets, the PRM schemes achieve the optimal trade-off between estimation accuracy, privacy, and robustness, and experiments support the theoretical results.

We study private and robust multi-armed bandits (MABs), where the agent receives Huber's contaminated heavy-tailed rewards and meanwhile needs to ensure differential privacy. We consider both the finite $k$-th raw moment and the finite $k$-th central moment settings for heavy-tailed rewards distributions with $k\ge 2$. We first present its minimax lower bound, characterizing the information-theoretic limit of regret with respect to privacy budget, contamination level, and heavy-tailedness. Then, we propose a meta-algorithm that builds on a private and robust mean estimation sub-routine \texttt{PRM} that essentially relies on reward truncation and the Laplace mechanism. For the above two different heavy-tailed settings, we give corresponding schemes of \texttt{PRM}, which enable us to achieve nearly-optimal regrets. Moreover, our two proposed truncation-based or histogram-based \texttt{PRM} schemes achieve the optimal trade-off between estimation accuracy, privacy and robustness. Finally, we support our theoretical results and show the effectiveness of our algorithms with experimental studies.

Closing the Computational-Statistical Gap in Best Arm Identification for Combinatorial Semi-bandits
Ruo-Chun Tzeng Po-An Wang Alexandre Proutiere Chi-Jen Lu



Research question: This paper studies best arm identification in combinatorial semi-bandits in the fixed confidence setting.
Motivation: Existing algorithms cannot simultaneously achieve the instance-specific minimal sample complexity in the high confidence regime and polynomial sample complexity in the moderate confidence regime, so a new approach is needed to close the computational-statistical gap.
Method: We present Perturbed Frank-Wolfe Sampling (P-FWS), an algorithm that runs in polynomial time, achieves the instance-specific minimal sample complexity in the high confidence regime, and enjoys polynomial sample complexity guarantees in the moderate confidence regime.
Results: With P-FWS, we close the computational-statistical gap in best arm identification for combinatorial semi-bandits.

We study the best arm identification problem in combinatorial semi-bandits in the fixed confidence setting. We present Perturbed Frank-Wolfe Sampling (P-FWS), an algorithm that (i) runs in polynomial time, (ii) achieves the instance-specific minimal sample complexity in the high confidence regime, and (iii) enjoys polynomial sample complexity guarantees in the moderate confidence regime. To our best knowledge, existing algorithms cannot achieve (ii) and (iii) simultaneously in vanilla bandits. With P-FWS, we close the computational-statistical gap in best arm identification in combinatorial semi-bandits. The design of P-FWS starts from the optimization problem that defines the information-theoretical and instance-specific sample complexity lower bound. P-FWS solves this problem in an online manner using, in each round, a single iteration of the Frank-Wolfe algorithm. Structural properties of the problem are leveraged to make the P-FWS successive updates computationally efficient. In turn, P-FWS only relies on a simple linear maximization oracle.

A Smooth Binary Mechanism for Efficient Private Continual Observation
Rasmus Pagh Joel Daniel Andersson



Research question: How to release differentially private estimates based on a dataset that evolves over time, in the continual observation setting.
Motivation: The problem of releasing private prefix sums is particularly well studied and is used in state-of-the-art methods for private stochastic gradient descent (SGD).
Method: We present a simple alternative to the binary mechanism in which generating the noise takes constant average time per value, the variance is reduced by a factor of about 4 compared with the binary mechanism, and the noise distribution is identical at every step.
Results: Empirically, a simple Python implementation outperforms the running time of the approach of Henzinger et al., as well as an attempt to improve their algorithm using high-performance algorithms for Toeplitz matrix multiplication.

In privacy under continual observation we study how to release differentially private estimates based on a dataset that evolves over time. The problem of releasing private prefix sums of $x_1, x_2, x_3,\dots\in\{0,1\}$ (where the value of each $x_i$ is to be private) is particularly well-studied, and a generalized form is used in state-of-the-art methods for private stochastic gradient descent (SGD). The seminal binary mechanism privately releases the first $t$ prefix sums with noise of variance polylogarithmic in $t$. Recently, Henzinger et al. and Denisov et al. showed that it is possible to improve on the binary mechanism in two ways: The variance of the noise can be reduced by a (large) constant factor, and also made more even across time steps. However, their algorithms for generating the noise distribution are not as efficient as one would like in terms of computation time and (in particular) space. We address the efficiency problem by presenting a simple alternative to the binary mechanism in which 1) generating the noise takes constant average time per value, 2) the variance is reduced by a factor about 4 compared to the binary mechanism, and 3) the noise distribution at each step is identical. Empirically, a simple Python implementation of our approach outperforms the running time of the approach of Henzinger et al., as well as an attempt to improve their algorithm using high-performance algorithms for multiplication with Toeplitz matrices.
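
For context, the seminal binary mechanism that the paper improves upon can be sketched as follows (a textbook rendering, not the paper's smooth mechanism): each item enters O(log T) dyadic blocks, each block receives one Laplace draw, and each prefix sum is assembled from the noisy blocks for the set bits of t:

import numpy as np

def binary_mechanism(stream, eps, rng):
    # Noise scale levels/eps: each item participates in at most `levels`
    # blocks, so the total sensitivity of the released blocks is `levels`.
    T = len(stream)
    levels = T.bit_length()
    block = np.zeros(levels)   # running block sums per dyadic level
    noisy = np.zeros(levels)   # their privatized versions
    out = []
    for t, x in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1       # lowest set bit of t
        block[i] = block[:i].sum() + x      # merge the finished lower blocks
        block[:i] = 0
        noisy[i] = block[i] + rng.laplace(scale=levels / eps)
        noisy[:i] = 0
        out.append(sum(noisy[j] for j in range(levels) if t >> j & 1))
    return out

print(binary_mechanism([1, 0, 1, 1, 0, 1, 1, 1], eps=1.0, rng=np.random.default_rng(0)))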

Similarity, Compression and Local Steps: Three Pillars of Efficient Communications for Distributed Variational Inequalities
Aleksandr Beznosikov Martin Takáč Alexander Gasnikov



Research question: This paper addresses variational inequality problems, a broad and flexible class that includes minimization, saddle point, and fixed point problems as special cases.
Motivation: With growing data and model sizes, today's instances require parallel and distributed computing for real-world machine learning problems, most of which can be cast as variational inequalities; yet most distributed methods face a major bottleneck, the cost of communication.
Method: The paper combines the three main techniques for reducing both the total number of communication rounds and the cost of each round: similarity of local functions, compression of transmitted information, and local updates; such a triple synergy had not previously existed for variational inequalities and saddle point problems, nor even for minimization problems.
Results: The proposed methods have the best theoretical communication complexity guarantees and are significantly ahead of other methods for distributed variational inequalities; the theory is confirmed by adversarial learning experiments on synthetic and real datasets.

Variational inequalities are a broad and flexible class of problems that includes minimization, saddle point, and fixed point problems as special cases. Therefore, variational inequalities are used in various applications ranging from equilibrium search to adversarial learning. With the increasing size of data and models, today's instances demand parallel and distributed computing for real-world machine learning problems, most of which can be represented as variational inequalities. Meanwhile, most distributed approaches have a significant bottleneck -- the cost of communications. The three main techniques to reduce the total number of communication rounds and the cost of one such round are the similarity of local functions, compression of transmitted information, and local updates. In this paper, we combine all these approaches. Such a triple synergy did not exist before for variational inequalities and saddle problems, nor even for minimization problems. The methods presented in this paper have the best theoretical guarantees of communication complexity and are significantly ahead of other methods for distributed variational inequalities. The theoretical results are confirmed by adversarial learning experiments on synthetic and real datasets.

Leveraging the two-timescale regime to demonstrate convergence of neural networks
Pierre Marion Raphaël Berthier



Research question: This study examines the training dynamics of shallow neural networks in a two-timescale regime where the stepsizes of the inner layer are much smaller than those of the outer layer.
Motivation: In this two-timescale regime, gradient flow can be shown to converge to a global optimum of a non-convex optimization problem.
Method: We prove convergence of the gradient flow in a simple univariate setting; the number of neurons need not be asymptotically large, which distinguishes the result from popular recent approaches such as the neural tangent kernel or mean-field regimes.
Results: Experiments show that stochastic gradient descent behaves according to our description of the gradient flow and thus converges to a global optimum in the two-timescale regime, but can fail outside of it.

We study the training dynamics of shallow neural networks, in a two-timescale regime in which the stepsizes for the inner layer are much smaller than those for the outer layer. In this regime, we prove convergence of the gradient flow to a global optimum of the non-convex optimization problem in a simple univariate setting. The number of neurons need not be asymptotically large for our result to hold, distinguishing our result from popular recent approaches such as the neural tangent kernel or mean-field regimes. Experimental illustration is provided, showing that the stochastic gradient descent behaves according to our description of the gradient flow and thus converges to a global optimum in the two-timescale regime, but can fail outside of this regime.

A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence
Carlo Alfano Rui Yuan Patrick Rebeschini



Research question: This paper proposes a new mirror-descent-based policy optimization framework that accommodates general policy parameterizations.
Motivation: Although theoretical guarantees exist for this class of algorithms in the tabular setting, the use of general parameterization schemes has remained mostly unjustified.
Method: We introduce a novel policy optimization framework based on mirror descent that naturally accommodates generally parameterized policies; the induced policy class recovers known classes (e.g., softmax) and generates new ones depending on the choice of mirror map.
Results: This yields the first linear convergence guarantee for a policy-gradient-based method with general parameterization; we also derive its sample complexity with shallow neural networks, show an improvement over the previous best results, and empirically validate the theory on classic control tasks.

Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.

Stochastic Distributed Optimization under Average Second-order Similarity: Algorithms and Analysis
Dachao Lin Yuze Han Haishan Ye Zhihua Zhang



Research question: This work studies finite-sum distributed optimization problems with one master node and $n-1$ local nodes under the popular $\delta$-similarity and $\mu$-strong convexity conditions.
Motivation: Motivated by previous work, we propose two new algorithms, SVRS and AccSVRS.
Method: The non-accelerated SVRS method combines gradient sliding and variance reduction, achieving a better communication complexity of $\tilde{\mathcal{O}}(n + \sqrt{n}\delta/\mu)$ than existing non-accelerated algorithms; applying the framework of Katyusha X, we also develop a directly accelerated version, AccSVRS, with communication complexity $\tilde{\mathcal{O}}(n + n^{3/4}\sqrt{\delta/\mu})$.
Results: In contrast to existing results, our complexity bounds are entirely smoothness-free and are superior in ill-conditioned cases; we also establish a nearly matching lower bound verifying the tightness of AccSVRS.

We study finite-sum distributed optimization problems involving a master node and $n-1$ local nodes under the popular $\delta$-similarity and $\mu$-strong convexity conditions. We propose two new algorithms, SVRS and AccSVRS, motivated by previous works. The non-accelerated SVRS method combines the techniques of gradient sliding and variance reduction and achieves a better communication complexity of $\tilde{\mathcal{O}}(n {+} \sqrt{n}\delta/\mu)$ compared to existing non-accelerated algorithms. Applying the framework proposed in Katyusha X, we also develop a directly accelerated version named AccSVRS with the $\tilde{\mathcal{O}}(n {+} n^{3/4}\sqrt{\delta/\mu})$ communication complexity. In contrast to existing results, our complexity bounds are entirely smoothness-free and exhibit superiority in ill-conditioned cases. Furthermore, we establish a nearly matched lower bound to verify the tightness of our AccSVRS method.

An Information Theory Perspective on Variance-Invariance-Covariance Regularization
Ravid Shwartz-Ziv Randall Balestriero Kenji Kawaguchi Tim G. J. Rudner Yann LeCun



Research question: This paper examines the mechanisms underlying Variance-Invariance-Covariance Regularization (VICReg) from an information-theoretic perspective.
Motivation: Although VICReg has shown promising results on a variety of tasks, its underlying mechanisms have remained unexplored.
Method: Information-theoretic quantities are derived for deterministic networks (as an alternative to unrealistic stochastic network assumptions) and related to the optimization of the VICReg objective, connecting it to mutual information optimization, exposing its inherent advantages, and yielding a generalization bound for downstream tasks.
Results: Building on these results, the paper introduces a family of SSL methods derived from information-theoretic principles that outperform existing SSL techniques.

Variance-Invariance-Covariance Regularization (VICReg) is a self-supervised learning (SSL) method that has shown promising results on a variety of tasks. However, the fundamental mechanisms underlying VICReg remain unexplored. In this paper, we present an information-theoretic perspective on the VICReg objective. We begin by deriving information-theoretic quantities for deterministic networks as an alternative to unrealistic stochastic network assumptions. We then relate the optimization of the VICReg objective to mutual information optimization, highlighting underlying assumptions and facilitating a constructive comparison with other SSL algorithms and derive a generalization bound for VICReg, revealing its inherent advantages for downstream tasks. Building on these results, we introduce a family of SSL methods derived from information-theoretic principles that outperform existing SSL techniques.

Faster Differentially Private Convex Optimization via Second-Order Methods
Arun Ganesh MAHDI HAGHIFAM Thomas Steinke Abhradeep Guha Thakurta



Research question: This paper investigates whether second-order information from the loss function can accelerate differentially private convex optimization.
Motivation: Without privacy constraints, second-order methods such as Newton's method converge faster than first-order methods such as gradient descent.
Method: The paper first develops a private variant of the regularized cubic Newton method of Nesterov and Polyak, showing quadratic convergence and optimal excess loss for the class of strongly convex losses, and then designs a practical second-order DP algorithm for unconstrained logistic regression.
Results: Empirically, the algorithm consistently achieves the best excess loss compared with other baselines and is 10-40x faster than DP-GD/DP-SGD on challenging datasets.

Differentially private (stochastic) gradient descent is the workhorse of DP private machine learning in both the convex and non-convex settings. Without privacy constraints, second-order methods, like Newton's method, converge faster than first-order methods like gradient descent. In this work, we investigate the prospect of using the second-order information from the loss function to accelerate DP convex optimization. We first develop a private variant of the regularized cubic Newton method of Nesterov and Polyak, and show that for the class of strongly convex loss functions, our algorithm has quadratic convergence and achieves the optimal excess loss. We then design a practical second-order DP algorithm for the unconstrained logistic regression problem. We theoretically and empirically study the performance of our algorithm. Empirical results show our algorithm consistently achieves the best excess loss compared to other baselines and is 10-40x faster than DP-GD/DP-SGD for challenging datasets.

Counting Distinct Elements Under Person-Level Differential Privacy
Thomas Steinke Alexander Knop



Research question: How to count the number of distinct elements in a dataset subject to the constraint of differential privacy.
Motivation: In the person-level DP (a.k.a. user-level DP) setting, each person may contribute an unbounded number of items, so the sensitivity of the query is unbounded.
Method: Compute a bounded-sensitivity version of the query, which reduces to solving a max-flow problem; the sensitivity bound is optimized to balance the noise that must be added to privatize the answer against the error of approximating the true number of unique elements by the bounded-sensitivity query.
Results: The optimized bound yields accurate person-level DP estimates of the number of distinct elements while keeping the added noise proportional to the chosen sensitivity.

We study the problem of counting the number of distinct elements in a dataset subject to the constraint of differential privacy. We consider the challenging setting of person-level DP (a.k.a. user-level DP) where each person may contribute an unbounded number of items and hence the sensitivity is unbounded. Our approach is to compute a bounded-sensitivity version of this query, which reduces to solving a max-flow problem. The sensitivity bound is optimized to balance the noise we must add to privatize the answer against the error of the approximation of the bounded-sensitivity query to the true number of unique elements.
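
A minimal sketch of the bounded-sensitivity idea, with a naive cap in place of the paper's max-flow selection of which items to keep, might look like this:

import numpy as np

def dp_distinct_count(user_items, bound, eps, rng):
    # Cap each person at `bound` items so that adding or removing one
    # person changes the count by at most `bound`, then add Laplace noise
    # calibrated to that sensitivity. (The paper instead *chooses* which
    # items to keep via a max-flow computation, maximizing the count.)
    kept = set()
    for items in user_items:
        kept.update(list(items)[:bound])
    return len(kept) + rng.laplace(scale=bound / eps)

users = [{"a", "b", "c"}, {"b"}, {"c", "d", "e", "f"}]
print(dp_distinct_count(users, bound=2, eps=1.0, rng=np.random.default_rng(0)))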

Faster Discrete Convex Function Minimization with Predictions: The M-Convex Case
Taihei Oki Shinsaku Sakaue



Research question: This paper aims to accelerate optimization algorithms using machine-learned predictions.
Motivation: Recent years have seen growing interest in accelerating optimization with predictions; Sakaue and Oki (NeurIPS 2022) developed a general framework that warm-starts L-convex function minimization with predictions, demonstrating the idea's usefulness for various discrete optimization problems.
Method: This paper presents a framework for using predictions to accelerate M-convex function minimization, complementing previous research and extending the range of discrete optimization algorithms that can benefit from predictions; it is particularly effective for the important subclass of laminar convex minimization, which appears in many operations research applications.
Results: With predictions, the methods can improve time complexity bounds over the best worst-case results and even have the potential to go beyond a lower-bound result.

Recent years have seen a growing interest in accelerating optimization algorithms with machine-learned predictions. Sakaue and Oki (NeurIPS 2022) have developed a general framework that warm-starts the *L-convex function minimization* method with predictions, revealing the idea's usefulness for various discrete optimization problems. In this paper, we present a framework for using predictions to accelerate *M-convex function minimization*, thus complementing previous research and extending the range of discrete optimization algorithms that can benefit from predictions. Our framework is particularly effective for an important subclass called *laminar convex minimization*, which appears in many operations research applications. Our methods can improve time complexity bounds upon the best worst-case results by using predictions and even have potential to go beyond a lower-bound result.

topic-8

Topic words :  models,  language,  model,  tasks,  pre,  large,  human,  text

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
Royi Rassin Eran Hirsch Daniel Glickman Shauli Ravfogel Yoav Goldberg Gal Chechik



Research question: Text-conditioned image generation models often produce incorrect associations between entities and their visual attributes, reflecting an impaired mapping between the linguistic binding of entities and modifiers in the prompt and the visual binding of the corresponding elements in the generated image.
Motivation: To remedy this, we propose SynGen, which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax.
Method: Specifically, we encourage large overlap between the attention maps of an entity and its modifiers, and small overlap with the attention maps of other entities and modifier words; the loss is optimized during inference, without retraining or fine-tuning the model.
Results: Human evaluation on three datasets, including a new and challenging one, shows significant improvements of SynGen over current state-of-the-art methods, highlighting how exploiting sentence structure at inference time can efficiently and substantially improve the faithfulness of text-to-image generation.

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one example, a query like ``a pink sunflower and a yellow flamingo'' may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state of the art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
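
The attention-overlap objective can be sketched as follows; this is our own reading, with a symmetrized KL standing in for whatever distance the paper uses between attention distributions, and it assumes per-token cross-attention maps are exposed by the diffusion model:

import torch

def syngen_loss(attn, bound_pairs, unbound_pairs):
    # attn: (num_tokens, H, W) cross-attention map per prompt token.
    # bound_pairs: (entity, modifier) token-index pairs from the parse;
    # unbound_pairs: pairs that should NOT bind.
    def dist(i, j):
        p = attn[i].flatten().clamp_min(1e-8); p = p / p.sum()
        q = attn[j].flatten().clamp_min(1e-8); q = q / q.sum()
        return 0.5 * ((p * (p / q).log()).sum() + (q * (q / p).log()).sum())
    pos = torch.stack([dist(i, j) for i, j in bound_pairs]).mean()     # pull together
    neg = torch.stack([dist(i, j) for i, j in unbound_pairs]).mean()   # push apart
    return pos - neg  # minimized over the latent at inference time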

Learning Transformer Programs
Dan Friedman Alexander Wettig Danqi Chen



Research question: How to train intrinsically interpretable Transformer models in order to gain a mechanistic understanding of the learned algorithms.
Motivation: Current deep learning models are powerful but opaque; reverse-engineering them requires considerable manual inspection of network weights and activations and still falls short of complete, faithful descriptions of the underlying algorithms.
Method: This work introduces a procedure for training a modified Transformer that can be automatically converted into a discrete, human-readable program, called a Transformer Program.
Results: Experiments show that Transformer Programs find reasonable solutions, performing on par with standard Transformers of comparable size, while being easy to interpret.

Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short of providing complete, faithful descriptions of the underlying algorithms. In this work, we introduce a procedure for training Transformers that are mechanistically interpretable by design. We build on RASP [Weiss et al., 2021], a programming language that can be compiled into Transformer weights. Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization and then automatically converted into a discrete, human-readable program. We refer to these models as Transformer Programs. To validate our approach, we learn Transformer Programs for a variety of problems, including an in-context learning task, a suite of algorithmic problems (e.g. sorting, recognizing Dyck languages), and NLP tasks including named entity recognition and text classification. The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size; and, more importantly, they are easy to interpret. To demonstrate these advantages, we convert Transformers into Python programs and use off-the-shelf code analysis tools to debug model errors and identify the “circuits” used to solve different sub-problems. We hope that Transformer Programs open a new path toward the goal of intrinsically interpretable machine learning.

QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers Artidoro Pagnoni Ari Holtzman Luke Zettlemoyer



Research question: How to reduce the memory usage of finetuning pretrained language models while preserving their performance.
Motivation: Finetuning current pretrained language models at scale requires prohibitive amounts of GPU memory.
Method: QLoRA is proposed, an efficient finetuning approach that backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA), making it possible to finetune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance.
Results: Memory use is reduced without sacrificing performance: the best model family, Guanaco, outperforms all previously openly released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level with only 24 hours of finetuning on a single GPU.

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small, high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
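
The adapter structure that QLoRA backpropagates through can be sketched without the 4-bit machinery: a frozen base weight plus a trainable low-rank correction (a conceptual sketch, eliding NF4 storage, double quantization, and paged optimizers):

import torch

class LoRALinear(torch.nn.Module):
    # Frozen base weight W plus a trainable low-rank correction B @ A;
    # only A and B receive gradients during finetuning.
    def __init__(self, W, r=8, alpha=16):
        super().__init__()
        out_f, in_f = W.shape
        self.W = W.detach()  # frozen pretrained weight (stored in 4-bit NF4 in QLoRA)
        self.A = torch.nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_f, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T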

Why think step by step? Reasoning emerges from the locality of experience
Ben Prystawski Michael Y. Li Noah Goodman



Research question: This paper investigates why chain-of-thought reasoning in language models is useful and when it is effective.
Motivation: Humans reason through a sequence of mental steps; similarly, large language models often produce better answers when they generate intermediate steps (a chain of thought) before answering a question.
Method: Through experiments and theoretical analysis, the paper studies why step-by-step reasoning helps when the training data consists of overlapping local clusters of variables that strongly influence one another.
Results: Step-by-step reasoning is effective only when the training data is locally structured with respect to dependencies between variables, and combining locally structured observations with reasoning is far more data-efficient than training on all variables; the effectiveness of reasoning step by step is thus rooted in the local statistical structure of the training data.

Humans have a powerful and mysterious capacity to reason. Working through a set of mental steps enables us to make inferences we would not be capable of making directly even though we get no additional data from the world. Similarly, when large language models generate intermediate steps (a chain of thought) before answering a question, they often produce better answers than they would directly. We investigate why and how chain-of-thought reasoning is useful in language models, testing the hypothesis that reasoning is effective when training data consists of overlapping local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences to estimate relationships between variables that were not seen together in training. We prove that there will exist a "reasoning gap", where reasoning through intermediate variables reduces bias, for the simple case of an autoregressive density estimator trained on local samples from a chain-structured probabilistic model. We then test our hypothesis experimentally in more complex models, training an autoregressive language model on samples from Bayes nets but only including a subset of variables in each sample. We test language models’ ability to match conditional probabilities with and without intermediate reasoning steps, finding that intermediate steps are only helpful when the training data is locally structured with respect to dependencies between variables. The combination of locally structured observations and reasoning is much more data-efficient than training on all variables. Our results illustrate how the effectiveness of reasoning step by step is rooted in the local statistical structure of the training data.

Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models
Andrew Luo Margaret Marie Henderson Leila Wehbe Michael J. Tarr



Research question: Elucidating the functional organization of the brain.
Motivation: Traditional approaches rely on manually assembled stimulus sets, which limits exploration of the brain's functional organization.
Method: A data-driven approach is introduced that synthesizes images predicted to activate a given brain region, using paired natural images and fMRI recordings.
Results: The method synthesizes preferred images with appropriate semantic specificity for well-characterized category-selective ROIs, characterizes differences between ROIs in human visual cortex selective for the same high-level category, and reveals novel functional subdivisions within these ROIs.

A long standing goal in neuroscience has been to elucidate the functional organization of the brain. Within higher visual cortex, functional accounts have remained relatively coarse, focusing on regions of interest (ROIs) and taking the form of selectivity for broad categories such as faces, places, bodies, food, or words. Because the identification of such ROIs has typically relied on manually assembled stimulus sets consisting of isolated objects in non-ecological contexts, exploring functional organization without robust a priori hypotheses has been challenging. To overcome these limitations, we introduce a data-driven approach in which we synthesize images predicted to activate a given brain region using paired natural images and fMRI recordings, bypassing the need for category-specific stimuli. Our approach -- Brain Diffusion for Visual Exploration ("BrainDiVE") -- builds on recent generative methods by combining large-scale diffusion models with brain-guided image synthesis. Validating our method, we demonstrate the ability to synthesize preferred images with appropriate semantic specificity for well-characterized category-selective ROIs. We then show that BrainDiVE can characterize differences between ROIs selective for the same high-level category. Finally we identify novel functional subdivisions within these ROIs, validated with behavioral data. These results advance our understanding of the fine-grained functional organization of human visual cortex, and provide well-specified constraints for further examination of cortical organization using hypothesis-driven methods.

Are Emergent Abilities of Large Language Models a Mirage?
Rylan Schaeffer Brando Miranda Sanmi Koyejo



Research question: This paper examines the "emergent abilities" of large language models, abilities present in larger-scale models but absent in smaller ones, and analyzes their origin.
Motivation: The apparent sharpness and unpredictability of emergent abilities have drawn attention to the question of what causes them.
Method: The paper offers an alternative explanation: emergent abilities arise not from fundamental changes in model behavior with scale but from the researcher's choice of metric; nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics yield smooth, continuous, predictable changes in model performance.
Results: The explanation is validated with a simple mathematical model and three complementary analyses, providing evidence that alleged emergent abilities evaporate under different metrics or better statistics and may not be a fundamental property of scaling AI models.

Recent work claims that large language models display \textit{emergent abilities}, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their \textit{sharpness}, transitioning seemingly instantaneously from not present to present, and their \textit{unpredictability}, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language
Kevin Ellis



Research question: A core tension in models of concept learning is that a model must carefully balance the tractability of inference against the expressivity of the hypothesis class.
Motivation: Humans, however, can efficiently learn a broad range of concepts.
Method: We introduce a model of inductive learning that seeks to be human-like in this sense: it implements a Bayesian reasoning process in which a language model first proposes candidate hypotheses expressed in natural language, which are then re-weighed by a prior and a likelihood.
Results: By estimating the prior from human data, the model predicts human judgments on learning problems involving numbers and sets, spanning concepts that are generative, discriminative, propositional, and higher-order.

A core tension in models of concept learning is that the model must carefully balance the tractability of inference against the expressivity of the hypothesis class. Humans, however, can efficiently learn a broad range of concepts. We introduce a model of inductive learning that seeks to be human-like in that sense. It implements a Bayesian reasoning process where a language model first proposes candidate hypotheses expressed in natural language, which are then re-weighed by a prior and a likelihood. By estimating the prior from human data, we can predict human judgments on learning problems involving numbers and sets, spanning concepts that are generative, discriminative, propositional, and higher-order.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao Dian Yu Jeffrey Zhao Izhak Shafran Thomas L. Griffiths Yuan Cao Karthik R Narasimhan



Research question: How to let language models perform exploration, strategic lookahead, or deliberate initial decisions during inference, for tasks that require these abilities.
Motivation: Existing language models are confined to token-level, left-to-right decision processes during inference, which can fall short on tasks requiring exploration, strategic lookahead, or pivotal initial decisions.
Method: Proposes Tree of Thoughts (ToT), a new framework for language model inference that generalizes the popular Chain-of-Thought prompting approach and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving.
Results: Experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords.

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4\% of tasks, our method achieved a success rate of 74\%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
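
A minimal breadth-first skeleton of the idea (a sketch, not the authors' implementation; `propose` and `evaluate` stand in for LM calls):

```python
# Breadth-first Tree-of-Thoughts sketch: expand partial thought sequences,
# self-evaluate candidates, and keep the best few (beam search over thoughts).
def tree_of_thoughts(problem, propose, evaluate, steps=3, breadth=5, keep=2):
    frontier = [""]                                  # partial "thought" sequences
    for _ in range(steps):
        candidates = []
        for partial in frontier:
            for thought in propose(problem, partial, n=breadth):
                candidates.append(partial + thought)
        frontier = sorted(candidates, key=lambda c: evaluate(problem, c),
                          reverse=True)[:keep]
    return frontier[0]
```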

Image Captioners Are Scalable Vision Learners Too
Michael Tschannen Manoj Kumar Andreas Peter Steiner Xiaohua Zhai Neil Houlsby Lucas Beyer



Research question: This paper compares two pretraining strategies for large multimodal models: contrastive pretraining on image-text pairs and image captioning.
Motivation: Although contrastive pretraining is very popular for vision backbones, image captioning is commonly considered an inferior pretraining strategy.
Method: Carefully matching training data, compute, and model capacity, experiments with a standard encoder-decoder transformer show that captioning alone is surprisingly effective.
Results: Captioning produces vision encoders competitive with contrastively pretrained ones on classification tasks, while surpassing them on vision and language tasks.

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.

Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick Jane Dwivedi-Yu Roberto Dessi Roberta Raileanu Maria Lomeli Eric Hambro Luke Zettlemoyer Nicola Cancedda Thomas Scialom



Research question: How can language models use external tools via simple APIs and achieve the best of both worlds?
Motivation: Language models excel at solving new tasks but struggle with basic functionality such as arithmetic or factual lookup, where much smaller specialized models excel.
Method: Proposes Toolformer, a model that decides for itself which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
Results: Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel. In this paper, we show that LMs can teach themselves to *use external tools* via simple APIs and achieve the best of both worlds. We introduce *Toolformer*, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
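
The self-supervised filtering idea can be sketched as follows (our paraphrase, with a hypothetical `lm_loss` function; not the released code): an API call is kept only if conditioning on the call and its result lowers the LM's loss on the continuation by a margin.

```python
# Sketch of a Toolformer-style filtering criterion: keep an API call only if
# the call *with its result* helps prediction more than either baseline.
def keep_api_call(lm_loss, prefix, call, result, continuation, tau=0.2):
    """lm_loss(context, continuation) -> cross-entropy; tau is a filter threshold."""
    loss_plain = lm_loss(prefix, continuation)
    loss_call_only = lm_loss(prefix + f" [{call}]", continuation)
    loss_with_result = lm_loss(prefix + f" [{call} -> {result}]", continuation)
    return min(loss_plain, loss_call_only) - loss_with_result >= tau
```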

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
Zijiao Chen Jiaxin Qing Juan Helen Zhou



Research question: How to reconstruct continuous visual experience, i.e., video, from brain activity.
Motivation: Understanding human cognitive processes, building on recent success in recovering static images from non-invasive brain recordings.
Method: Progressively learns spatiotemporal information from continuous cortical fMRI data through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model incorporating network temporal inflation.
Results: With adversarial guidance, Mind-Video reconstructs high-quality videos of arbitrary frame rates, achieving 85% average accuracy in semantic classification tasks and 0.19 in structural similarity index (SSIM), a 45% improvement over the previous state of the art; the model is also biologically plausible and interpretable, reflecting established physiological processes.

Reconstructing human vision from brain activities has been an appealing task that helps to understand our cognitive process. Even though recent research has seen great success in reconstructing static images from non-invasive brain recordings, work on recovering continuous visual experiences in the form of videos is limited. In this work, we propose Mind-Video that learns spatiotemporal information from continuous fMRI data of the cerebral cortex progressively through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation. We show that high-quality videos of arbitrary frame rates can be reconstructed with Mind-Video using adversarial guidance. The recovered videos were evaluated with various semantic and pixel-level metrics. We achieved an average accuracy of 85% in semantic classification tasks and 0.19 in structural similarity index (SSIM), outperforming the previous state-of-the-art by 45%. We also show that our model is biologically plausible and interpretable, reflecting established physiological processes.

ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
Shibo Hao Tianyang Liu Zhen Wang Zhiting Hu



Research question: How to integrate large language models (LLMs) effectively with a wide range of tools, addressing high computational cost and fixed-toolset limitations.
Motivation: Existing approaches either fine-tune the LLM, which is computationally costly and limited to a fixed set of tools, or prompt the LLM with in-context tool demonstrations, which runs into the LLM's inherent context length limit when many new tools are presented and struggles to master new toolsets from few examples, yielding suboptimal performance.
Method: Proposes ToolkenGPT, in which the LLM learns to master tools as predicting tokens via tool embeddings: each tool is represented as a vector embedding plugged into the language model head, and once triggered during text generation, the LLM enters a special function mode to execute the tool call.
Results: Experiments show that tool embeddings effectively help LLMs understand tool use and improve on several tasks, including numerical reasoning, knowledge-based question answering, and embodied decision-making.

Integrating large language models (LLMs) with various tools has attracted increasing attention in the field. Existing approaches either involve fine-tuning the LLM, which is both computationally costly and limited to a fixed set of tools, or prompting LLMs by in-context tool demonstrations. Although the latter method offers adaptability to new tools, it struggles with the inherent context length constraint of LLMs when many new tools are presented, and mastering a new set of tools with few-shot examples remains challenging, resulting in suboptimal performance. To address these limitations, we propose a novel solution, named **ToolkenGPT**, wherein LLMs effectively learn to master tools as predicting tokens through **tool embeddings** for solving complex tasks. In this framework, each tool is transformed into vector embeddings and plugged into the language model head. Once the function is triggered during text generation, the LLM enters a special function mode to execute the tool calls. Our experiments show that function embeddings effectively help LLMs understand tool use and improve on several tasks, including numerical reasoning, knowledge-based question answering and embodied decision-making.
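
A minimal PyTorch sketch of the tool-embedding idea (assuming a decoder LM whose unembedding is an `nn.Linear`; not the released code): "toolkens" are extra learnable rows appended to the LM head, trained while the rest of the model stays frozen.

```python
import torch
import torch.nn as nn

# Sketch: extend a frozen LM head with one learnable embedding per tool, so
# tool calls are predicted exactly like ordinary vocabulary tokens.
class ToolkenHead(nn.Module):
    def __init__(self, word_head: nn.Linear, num_tools: int):
        super().__init__()
        self.word_head = word_head                      # frozen vocab projection
        for p in self.word_head.parameters():
            p.requires_grad = False
        d = word_head.in_features
        self.tool_emb = nn.Parameter(torch.randn(num_tools, d) * 0.02)

    def forward(self, hidden):                          # hidden: (..., d)
        word_logits = self.word_head(hidden)            # (..., vocab_size)
        tool_logits = hidden @ self.tool_emb.T          # (..., num_tools)
        return torch.cat([word_logits, tool_logits], dim=-1)
```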

Visual Instruction Tuning
Haotian Liu Chunyuan Li Qingyang Wu Yong Jae Lee



Research question: How to fine-tune large models with machine-generated instruction-following data to improve their zero-shot capabilities in the multimodal domain.
Motivation: Instruction tuning with machine-generated instruction-following data has proven effective for language-only models, but the idea remains less explored in the multimodal field.
Method: Presents the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data; instruction tuning on this data yields LLaVA, an end-to-end trained large multimodal model that connects a vision encoder and a language model for general-purpose visual and language understanding.
Results: LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting multimodal GPT-4-like behavior on unseen images/instructions, and achieves an 85.1% relative score versus GPT-4 on a synthetic multimodal instruction-following dataset; when fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Guhao Feng Bohang Zhang Yuntian Gu Haotian Ye Di He Liwei Wang



Research question: This paper explores the mechanism behind Chain-of-Thought (CoT) prompting in large language models (LLMs) and how it unlocks their potential.
Motivation: Although CoT dramatically improves LLM performance on complex tasks involving mathematics or reasoning, the mechanism behind it and how to realize its potential remain unclear.
Method: Using circuit complexity theory, the paper first gives impossibility results showing that bounded-depth Transformers cannot directly produce correct answers to basic arithmetic/equation tasks unless model size grows super-polynomially with input length; it then proves by construction that constant-size autoregressive Transformers suffice to solve both tasks by generating CoT derivations in a commonly used math-language format, and further shows that LLMs with CoT can handle the class of Dynamic Programming decision problems, justifying their power on complex real-world tasks.
Results: Experiments show that while Transformers consistently fail to directly predict answers, they can learn to generate correct step-by-step solutions given sufficient CoT demonstrations.

Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the \emph{expressivity} of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows \emph{super-polynomially} with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of \emph{constant size} suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, an extensive set of experiments show that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
Yu Bai Fan Chen Huan Wang Caiming Xiong Song Mei



Research question: This paper aims to provide a comprehensive statistical theory for transformer-based neural sequence models performing in-context learning (ICL).
Motivation: Transformers show remarkable ICL abilities, performing new tasks from prompted examples without any parameter update, yet a statistical account of these abilities has been lacking.
Method: Shows by explicit construction that transformers can implement a broad class of standard machine learning algorithms in context (least squares, ridge regression, Lasso, generalized linear models, gradient descent on two-layer networks) with near-optimal predictive power, and can further perform in-context algorithm selection via two general mechanisms: pre-ICL testing and post-ICL validation.
Results: Theory and experiments show that a single transformer can adaptively select different base ICL algorithms on different input sequences without explicit prompting, and can perform nearly Bayes-optimal ICL on a challenging task: noisy linear models with mixed noise levels.

Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities, where they can perform new tasks when prompted with training and test examples, without any parameter update to the model. This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds, and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life---A \emph{single} transformer can adaptively select different base ICL algorithms---or even perform qualitatively different tasks---on different input sequences, without any explicit prompting of the right algorithm or task. We both establish this in theory by explicit constructions, and also observe this phenomenon experimentally. In theory, we construct two general mechanisms for algorithm selection with concrete examples: pre-ICL testing, and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task---noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
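
The post-ICL validation mechanism can be written out as an explicit algorithm (a sketch of what the paper shows a transformer can implement internally, not the paper's construction): fit several base predictors on part of the in-context data and pick the one with the lowest held-out error.

```python
import numpy as np

# Post-ICL validation sketch: base algorithms are ridge regressions with
# different regularization strengths; selection uses an in-context split.
def post_icl_validation(X, y, x_query, lams=(0.0, 0.1, 1.0, 10.0), split=0.8):
    n = int(split * len(X))
    Xtr, ytr, Xva, yva = X[:n], y[:n], X[n:], y[n:]
    best_w, best_err = None, np.inf
    for lam in lams:
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
        err = np.mean((Xva @ w - yva) ** 2)      # in-context validation error
        if err < best_err:
            best_w, best_err = w, err
    return x_query @ best_w
```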

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Shalev Lifshitz Keiran Paster Harris Chan Jimmy Ba Sheila A. McIlraith



Research question: How to build AI models that respond to text instructions, especially for sequential decision-making tasks.
Motivation: Constructing instruction-following models for sequential decision-making is challenging and calls for new models and methods.
Method: Introduces STEVE-1, an instruction-tuned Video Pretraining (VPT) model for Minecraft, trained in two steps: first adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text.
Results: At low cost (only $60 to train) and with low-level controls (mouse and keyboard), STEVE-1 robustly completes 12 of 13 tasks in an early-game evaluation suite, far outperforming previous baselines.

Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL•E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines and robustly completing 12 of 13 tasks in our early-game evaluation suite. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu Yushi Hu Weijia Shi Nouha Dziri Alane Suhr Prithviraj Ammanabrolu Noah A. Smith Mari Ostendorf Hannaneh Hajishirzi



Research question: Existing language models often produce false, toxic, or irrelevant outputs when generating text.
Motivation: Reinforcement learning from human feedback (RLHF) can address these issues, but such holistic feedback conveys limited information about long text outputs and cannot indicate which parts influenced user preference.
Method: Proposes the Fine-Grained RLHF framework, which uses fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. Rewards are fine-grained in two respects: density, providing a reward after each generated segment (e.g., a sentence); and the combination of multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness).
Results: Experiments show that learning with this reward function significantly improves performance, supported by both automatic and human evaluation; LM behavior can also be customized via different combinations of fine-grained reward models.

Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF)---where human preference judgments on LM outputs are transformed into a learning signal---has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text outputs; it does not indicate which aspects of the outputs influenced user preference; e.g., which parts contain what type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with this reward function leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and codes at https://FineGrainedRLHF.github.io.
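
A sketch of how the two kinds of fine-grained rewards might be combined (our paraphrase of the framework; names and weights are illustrative):

```python
# Per-segment (dense) rewards from several feedback-type-specific reward
# models, combined with scalar weights, rather than one reward per sequence.
def fine_grained_reward(segments, reward_models, weights):
    """segments: generated sentences; reward_models: dict name -> callable(seg, idx) -> float."""
    rewards = []
    for i, seg in enumerate(segments):
        r = sum(weights[name] * rm(seg, i) for name, rm in reward_models.items())
        rewards.append(r)          # one reward per segment, e.g. per sentence
    return rewards
```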

Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
Roland S. Zimmermann Thomas Klein Wieland Brendel



Research question: With the widespread adoption of AI systems, understanding the internal information processing of neural networks is increasingly important. Machine vision has recently advanced by scaling networks in dataset and model size; does this scaling also benefit mechanistic interpretability, i.e., has our understanding of scaled networks' inner workings improved as well?
Motivation: Scaling has improved accuracy, but whether it makes models easier for humans to understand has not been quantified.
Method: Uses a psychophysical paradigm to quantify one form of mechanistic interpretability across nine models, and releases a dataset of more than 130,000 human responses from psychophysical evaluations of 767 units; this dataset enables automated rather than human-based interpretability evaluation, which could ultimately be used to directly optimize the mechanistic interpretability of models.
Results: Finds no scaling effect on interpretability, for either model or dataset size: none of the investigated state-of-the-art models is easier to interpret than the GoogLeNet model from almost a decade ago, and the latest-generation vision models appear even less interpretable than older architectures, suggesting modern models sacrifice interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and for more effective interpretability methods.

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify one form of mechanistic interpretability for a diverse suite of nine models and find no scaling effect for interpretability - neither for model nor dataset size. Specifically, none of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 130'000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset facilitates research on automated instead of human-based interpretability evaluations, which can ultimately be leveraged to directly optimize the mechanistic interpretability of models.

In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Leonard Salewski Stephan Alaniz Isabel Rio-Torto Eric Schulz Zeynep Akata



Research question: This study explores whether pretrained large language models (LLMs) can take on, i.e., impersonate, different roles when generating text.
Motivation: In everyday life, humans take on different roles and adapt their vocabulary to the chosen role; can pretrained language models do the same?
Method: Prefixes the prompt with a persona associated with a social identity or domain expertise, asking the LLM to assume the persona before solving vision and language tasks.
Results: LLMs pretending to be children of different ages recover human-like developmental stages of exploration; in language-based reasoning, LLMs impersonating domain experts outperform those impersonating non-domain experts. Impersonation can also improve category descriptions: an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert. However, impersonation can likewise expose biases: an LLM prompted to be a man describes cars better than one prompted to be a woman. These findings show that LLMs can take on diverse roles and that in-context impersonation can reveal their strengths and hidden biases.

In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLMs' impersonations are complementary to visual information when describing different categories. We find that impersonation can improve performance: an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert. However, impersonation can also uncover LLMs' biases: an LLM prompted to be a man describes cars better than one prompted to be a woman. These findings demonstrate that LLMs are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their strengths and hidden biases. Our code is available at https://github.com/ExplainableML/in-context-impersonation.

Towards In-context Scene Understanding
Ivana Balazevic David Steiner Nikhil Parthasarathy Relja Arandjelovic Olivier J Henaff



Research question: This paper explores a simple mechanism for in-context learning of dense tasks such as semantic segmentation and depth estimation: nearest-neighbor retrieval from a prompt of annotated features.
Motivation: Compared with natural language processing, computer vision has progressed more slowly on in-context learning, generally requiring specialized decoders and fine-tuning protocols for dense tasks.
Method: Proposes a new pretraining protocol that leverages attention within and across images, yielding representations particularly useful for such scene-understanding tasks.
Results: The resulting Hummingbird model, suitably prompted, performs various scene-understanding tasks without modification, approaching the performance of specialists fine-tuned for each task; moreover, Hummingbird can be configured for new tasks far more efficiently than fine-tuned models, raising the possibility of scene understanding in interactive assistants.

In-context learning––the ability to configure a model's behavior with different prompts––has revolutionized the field of natural language processing, alleviating the need for task-specific models and paving the way for generalist models capable of assisting with any query. Computer vision, in contrast, has largely stayed in the former regime: specialized decoders and finetuning protocols are generally required to perform dense tasks such as semantic segmentation and depth estimation. In this work we explore a simple mechanism for in-context learning of such scene understanding tasks: nearest neighbor retrieval from a prompt of annotated features. We propose a new pretraining protocol––leveraging attention within and across images––which yields representations particularly useful in this regime. The resulting Hummingbird model, suitably prompted, performs various scene understanding tasks without modification while approaching the performance of specialists that have been finetuned for each task. Moreover, Hummingbird can be configured to perform new tasks much more efficiently than finetuned models, raising the possibility of scene understanding in the interactive assistant regime.
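
The retrieval mechanism from the abstract can be sketched in a few lines (not the authors' code): each query patch takes a similarity-weighted vote over the labels of its nearest annotated prompt patches.

```python
import torch

# kNN retrieval for dense prediction: label query patches by soft voting over
# the most similar annotated prompt patches (temperature-scaled softmax).
def knn_dense_predict(query_feats, prompt_feats, prompt_labels, k=30, temp=0.1):
    """query_feats: (Q, d); prompt_feats: (P, d); prompt_labels: (P, C) one-hot."""
    q = torch.nn.functional.normalize(query_feats, dim=-1)
    p = torch.nn.functional.normalize(prompt_feats, dim=-1)
    sim = q @ p.T                                   # (Q, P) cosine similarities
    top, idx = sim.topk(k, dim=-1)                  # nearest annotated patches
    w = torch.softmax(top / temp, dim=-1)           # similarity-weighted vote
    return torch.einsum("qk,qkc->qc", w, prompt_labels[idx].float())  # (Q, C)
```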

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Peter Hase Mohit Bansal Been Kim Asma Ghandeharioun



Research question: How to effectively change the factual information stored in pretrained language models.
Motivation: Existing methods localize facts to specific model parameters, such as mid-layer MLP weights, but this localization does not guarantee the best place to edit.
Method: Changes how a fact is stored in a model by editing weights in locations different from where existing methods suggest the fact resides.
Results: Experiments show that localization conclusions from Causal Tracing provide no insight into which MLP layer is best to edit; which layer is edited is a far better predictor of performance. This raises questions about whether past work was justified in relying on Causal Tracing to select which layers to edit.

Language models learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights. In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific model parameters would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit. Next, we consider several variants of the editing problem, including erasing and amplifying facts. For one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior.

The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
Laura Eline Ruis Akbir Khan Stella Biderman Sara Hooker Tim Rocktäschel Edward Grefenstette



Research question: Evaluate language models' ability to interpret language in context, in particular whether they understand implicature.
Motivation: Despite the widespread use of language models as conversational agents, performance evaluations fail to capture a crucial aspect of communication: interpreting language in context, i.e., incorporating its pragmatics.
Method: Designs a simple task and evaluates four categories of widely used state-of-the-art models.
Results: Despite evaluating only utterances requiring a binary (yes or no) inference, models in three of the categories perform close to random; however, LLMs instruction-tuned at the example level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding.

Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context---incorporating its pragmatics. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate four categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), models in three of these categories perform close to random. However, LLMs instruction-tuned at the example-level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

Tracr: Compiled Transformers as a Laboratory for Interpretability
David Lindner Janos Kramar Sebastian Farquhar Matthew Rahtz Thomas McGrath Vladimir Mikulik



Research question: How to compile human-readable programs into standard decoder-only transformer models.
Motivation: The structure of programs learned by current transformers is unknown, making it difficult to assess whether an interpretability method has succeeded.
Method: Develops Tracr, a compiler that turns human-readable programs into transformer models with known structure.
Results: Demonstrates the approach by implementing and examining programs including token-frequency counting, sorting, and parenthesis checking; the known structure can also serve as ground truth for evaluating interpretability methods.

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as _ground-truth_ for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.

Schema-learning and rebinding as mechanisms of in-context learning and emergence
Sivaramakrishnan Swaminathan Antoine Dedieu Rajkumar Vasudeva Raju Murray Shanahan Miguel Lazaro-Gredilla Dileep George



Research question: This paper aims to reveal the mechanisms of in-context learning (ICL) in transformer-based large language models (LLMs).
Motivation: Although ICL is one of the most powerful and unexpected capabilities of LLMs, the mechanisms that underlie it are poorly understood.
Method: Shows that comparable ICL capabilities can be acquired by an alternative sequence-prediction learning method using clone-structured causal graphs (CSCGs). A key property of CSCGs is that, unlike transformer-based LLMs, they are interpretable, which considerably simplifies explaining how ICL works.
Results: Gathers evidence that similar mechanisms underlie ICL in LLMs: for example, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, the work opens a path to novel architectures and takes a step toward a more general understanding of this important capability.

In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs). Moreover, a key property of CSCGs is that, unlike transformer-based LLMs, they are {\em interpretable}, which considerably simplifies the task of explaining how ICL works. Specifically, we show that it uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding of novel tokens to appropriate slots in the templates. We go on to marshall evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.

Curriculum Learning With Infant Egocentric Videos
Saber Sheybani Himanshu Hansaria Justin Newell Wood Linda B. Smith Zoran Tiganj



Research question: Is the change in the properties of infants' visual inputs beneficial, or even critical, for the proper development of the visual system?
Motivation: As an infant's mobility increases, so do the variety and dynamics of their visual inputs; video recordings from head-mounted cameras make it possible to test this by training a variety of self-supervised learning models.
Method: Separates the infant data by age group and evaluates the importance of training with a curriculum aligned with developmental order; initiating learning with data from the youngest age group provides the strongest learning signal and the best downstream task performance.
Results: The advantage of the youngest age group's data stems from the slowness and simplicity of the visual experience. The results provide strong empirical evidence for reverse-engineering the learning mechanisms of newborn brains using image-computable models from artificial intelligence.

Infants possess a remarkable ability to rapidly learn and process visual inputs. As an infant's mobility increases, so does the variety and dynamics of their visual inputs. Is this change in the properties of the visual inputs beneficial or even critical for the proper development of the visual system? To address this question, we used video recordings from infants wearing head-mounted cameras to train a variety of self-supervised learning models. Critically, we separated the infant data by age group and evaluated the importance of training with a curriculum aligned with developmental order. We found that initiating learning with the data from the youngest age group provided the strongest learning signal and led to the best learning outcomes in terms of downstream task performance. We then showed that the benefits of the data from the youngest age group are due to the slowness and simplicity of the visual experience. The results provide strong empirical evidence for the importance of the properties of the early infant experience and developmental progression in training. More broadly, our approach and findings take a noteworthy step towards reverse engineering the learning mechanisms in newborn brains using image-computable models from artificial intelligence.

On the Planning Abilities of Large Language Models - A Critical Investigation
Karthik Valmeekam Matthew Marquez Sarath Sreedharan Subbarao Kambhampati



Research question: This paper investigates the emergent reasoning abilities of LLMs trained on general web corpora, in particular their planning abilities.
Motivation: Interest in LLMs' ability to autonomously generate plans for commonsense planning tasks, and in their potential as a source of heuristic guidance for other agents (AI planners).
Method: Generates a suite of instances on domains similar to those used in the International Planning Competition and evaluates LLMs in two distinct modes: autonomous and heuristic.
Results: LLMs' ability to autonomously generate executable plans is rather limited, with the best model (GPT-4) averaging about 12% success across domains. The heuristic mode is more promising: LLM-generated plans can improve the search of underlying sound planners, and external verifiers can provide feedback on generated plans and back-prompt the LLM for better plan generation.

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs as a source of heuristic guidance for other agents (AI planners) in their planning tasks. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs’ ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the heuristic mode show more promise. In the heuristic mode, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.

Plug-and-Play Stability for Intracortical Brain-Computer Interfaces: A One-Year Demonstration of Seamless Brain-to-Text Communication
Chaofei Fan Nick Hahn Foram Kamdar Donald Avansino Guy H Wilson Leigh Hochberg Krishna V. Shenoy Jaimie M. Henderson Francis R Willett



Research question: How to achieve long-term stability of intracortical brain-computer interfaces (iBCIs) that restore rapid communication to people with neurological disorders such as amyotrophic lateral sclerosis (ALS).
Motivation: To maintain high performance, iBCIs typically require frequent recalibration to counter day-to-day changes in neural recordings, forcing users to stop using the iBCI and engage in supervised data collection, which makes the system hard to use.
Method: Proposes a self-recalibration method for communication iBCIs that does not interrupt the user: large language models (LMs) automatically correct errors in iBCI outputs, and these corrected outputs ("pseudo-labels") are used to continually update the iBCI decoder online.
Results: Over more than a year (403 days) with one clinical-trial participant, the Continual Online Recalibration with Pseudo-labels (CORP) framework achieved a stable decoding accuracy of 93.84% on an online handwriting iBCI task, significantly outperforming baseline methods. This is the longest-running iBCI stability demonstration involving a human participant, and the first evidence for long-term stabilization of a plug-and-play, high-performance communication iBCI, addressing a major barrier to clinical translation.

Intracortical brain-computer interfaces (iBCIs) have shown promise for restoring rapid communication to people with neurological disorders such as amyotrophic lateral sclerosis (ALS). However, to maintain high performance over time, iBCIs typically need frequent recalibration to combat changes in the neural recordings that accrue over days. This requires iBCI users to stop using the iBCI and engage in supervised data collection, making the iBCI system hard to use. In this paper, we propose a method that enables self-recalibration of communication iBCIs without interrupting the user. Our method leverages large language models (LMs) to automatically correct errors in iBCI outputs. The self-recalibration process uses these corrected outputs ("pseudo-labels") to continually update the iBCI decoder online. Over a period of more than one year (403 days), we evaluated our Continual Online Recalibration with Pseudo-labels (CORP) framework with one clinical trial participant. CORP achieved a stable decoding accuracy of 93.84% in an online handwriting iBCI task, significantly outperforming other baseline methods. Notably, this is the longest-running iBCI stability demonstration involving a human participant. Our results provide the first evidence for long-term stabilization of a plug-and-play, high-performance communication iBCI, addressing a major barrier for the clinical translation of iBCIs.
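
Schematically, the recalibration loop might look like the sketch below (all interfaces are hypothetical; the paper's decoder update is substantially more involved):

```python
# Schematic CORP loop: decode, let a language model correct the output, then
# use the corrected text as pseudo-labels for an online decoder update.
def corp_session(decoder, language_model, neural_stream, lr=1e-4):
    for features in neural_stream:                     # ongoing, unsupervised use
        raw_text = decoder.decode(features)
        corrected = language_model.correct(raw_text)   # error-corrected output
        decoder.online_update(features, pseudo_label=corrected, lr=lr)
        yield corrected
```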

Can Language Models Solve Graph Problems in Natural Language?
Heng Wang Shangbin Feng Tianxing He Zhaoxuan Tan Xiaochuang Han Yulia Tsvetkov



Research question: Whether large language models can explicitly process textual descriptions of graphs and structures, map them to grounded conceptual spaces, and perform structured operations.
Motivation: Although LLMs have advanced the state of the art on tasks with implicit structure, their ability to explicitly reason over textually described graphs remains underexplored.
Method: Proposes NLGraph (Natural Language Graph), a comprehensive benchmark of graph-based problem solving designed in natural language. NLGraph contains 29,370 problems covering eight graph reasoning tasks of varying complexity, from simple tasks such as connectivity and shortest path to complex ones such as maximum flow and simulating graph neural networks.
Results: Evaluating LLMs (GPT-3/4) with various prompting methods shows that (1) language models demonstrate preliminary graph reasoning abilities, (2) the benefit of advanced prompting and in-context learning diminishes on more complex graph problems, and (3) LLMs are brittle to spurious correlations in graph and problem settings. The proposed Build-a-Graph and Algorithmic prompting, two instruction-based approaches, improve LLM performance on NLGraph by 3.07% to 16.85%, while solving the most complex graph reasoning tasks in this setup remains an open research question.

Large language models (LLMs) are increasingly adopted for a variety of tasks with implicit graphical structures, such as planning in robotics, multi-hop question answering or knowledge probing, structured commonsense reasoning, and more. While LLMs have advanced the state-of-the-art on these tasks with structure implications, whether LLMs could explicitly process textual descriptions of graphs and structures, map them to grounded conceptual spaces, and perform structured operations remains underexplored. To this end, we propose NLGraph (Natural Language Graph), a comprehensive benchmark of graph-based problem solving designed in natural language. NLGraph contains 29,370 problems, covering eight graph reasoning tasks with varying complexity from simple tasks such as connectivity and shortest path up to complex problems such as maximum flow and simulating graph neural networks. We evaluate LLMs (GPT-3/4) with various prompting approaches on the NLGraph benchmark and find that 1) language models do demonstrate preliminary graph reasoning abilities, 2) the benefit of advanced prompting and in-context learning diminishes on more complex graph problems, while 3) LLMs are also (un)surprisingly brittle in the face of spurious correlations in graph and problem settings. We then propose Build-a-Graph Prompting and Algorithmic Prompting, two instruction-based approaches to enhance LLMs in solving natural language graph problems. Build-a-Graph and Algorithmic prompting improve the performance of LLMs on NLGraph by 3.07% to 16.85% across multiple tasks and settings, while how to solve the most complicated graph reasoning tasks in our setup with language models remains an open research question.
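
The two instruction-based approaches named in the abstract can be sketched as prompt templates (paraphrased; the benchmark's exact wording may differ):

```python
# Prompt-template sketches for Build-a-Graph and Algorithmic prompting.
def build_a_graph_prompt(edge_list, question):
    edges = "\n".join(f"There is an edge between node {u} and node {v}."
                      for u, v in edge_list)
    return (f"{edges}\n"
            "Let's construct the graph with the nodes and edges first.\n"
            f"{question}")

def algorithmic_prompt(edge_list, question, algorithm_hint):
    return build_a_graph_prompt(edge_list, question) + (
        f"\nWe can solve this with {algorithm_hint}. Let's apply it step by step.")

print(build_a_graph_prompt([(0, 1), (1, 2)], "Is there a path from node 0 to node 2?"))
```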

Supervised Pretraining Can Learn In-Context Reinforcement Learning
Jonathan Lee Annie Xie Aldo Pacchiano Yash Chandak Chelsea Finn Ofir Nachum Emma Brunskill



Research question: This paper studies the in-context learning capabilities of large transformer models in decision-making problems, i.e., reinforcement learning (RL).
Motivation: Although large transformer models excel on many tasks, their in-context learning on tasks they were not explicitly trained to solve has not been sufficiently studied.
Method: Introduces and studies the Decision-Pretrained Transformer (DPT), a supervised pretraining method in which a transformer predicts the optimal action given a query state and an in-context dataset of interactions from a diverse set of tasks.
Results: The trained transformer can solve a range of RL problems in-context, exhibiting both online exploration and offline conservatism despite not being explicitly trained to do so; it also generalizes beyond the pretraining distribution and automatically adapts to unknown structure. Theoretically, DPT can be viewed as an efficient implementation of Bayesian posterior sampling, a provably sample-efficient RL algorithm.

Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., reinforcement learning (RL) for bandits and Markov decision processes. To do so, we introduce and study the Decision-Pretrained Transformer (DPT), a supervised pretraining method where a transformer predicts an optimal action given a query state and an in-context dataset of interactions from a diverse set of tasks. While simple, this procedure produces a model with several surprising capabilities. We find that the trained transformer can solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline, despite not being explicitly trained to do so. The model also generalizes beyond the pretraining distribution to new tasks and automatically adapts its decision-making strategies to unknown structure. Theoretically, we show DPT can be viewed as an efficient implementation of Bayesian posterior sampling, a provably sample-efficient RL algorithm. We further leverage this connection to provide guarantees on the regret of the in-context algorithm yielded by DPT, and prove that it can learn faster than algorithms used to generate the pretraining data. These results suggest a promising yet simple path towards instilling strong in-context decision-making abilities in transformers.
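
A sketch of one DPT pretraining step as we read the abstract (assuming `model` maps a token sequence to per-position action logits; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Supervised pretraining sketch: the transformer sees an in-context dataset
# plus a query state and is trained against the optimal action for that task.
def dpt_pretrain_step(model, optimizer, context, query_state, optimal_action):
    """context: (B, T, d) interaction tokens; query_state: (B, d); optimal_action: (B,)."""
    inputs = torch.cat([context, query_state.unsqueeze(1)], dim=1)
    logits = model(inputs)[:, -1, :]          # action prediction at the query position
    loss = F.cross_entropy(logits, optimal_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```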

Faith and Fate: Limits of Transformers on Compositionality
Nouha Dziri Ximing Lu Melanie Sclar Xiang Lorraine Li Liwei Jiang Bill Yuchen Lin Sean Welleck Peter West Chandra Bhagavatula Ronan Le Bras Jena D. Hwang Soumya Sanyal Xiang Ren Allyson Ettinger Zaid Harchaoui Yejin Choi



Research question: Transformer large language models excel at complex multi-step reasoning tasks yet fail on surprisingly trivial problems; do these errors signal deeper limitations?
Motivation: To demystify transformer LLMs, we study their performance on three representative compositional tasks.
Method: Studies multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem, tasks that require decomposing a problem into sub-steps and synthesizing them into a precise answer. Compositional tasks are formulated as computation graphs to systematically quantify complexity, with reasoning steps broken down into intermediate sub-procedures.
Results: The empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning to linearized subgraph matching, without necessarily developing systematic problem-solving skills. Theoretical arguments on abstract multi-step reasoning problems further highlight how autoregressive generation performance can decay rapidly with increasing task complexity.

Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet, these models simultaneously show failures on surprisingly trivial problems. This begs the question: Are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify transformer LLMs, we investigate the limits of these models across three representative compositional tasks---multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations' performance can rapidly decay with increased task complexity.

Effective Human-AI Teams via Learned Natural Language Rules and Onboarding
Hussein Mozannar Jimin J Lee Dennis Wei Prasanna Sattigeri Subhro Das David Sontag



Research question: How to let humans know when to rely on an AI agent, when to collaborate with it, and when to ignore its suggestions.
Motivation: Proposes learning rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI.
Method: A novel region-discovery algorithm finds local regions in the data, as neighborhoods in an embedding space, that correct the human's prior; each region is then described by a large language model using an iterative and contrastive procedure, and the rules are taught to the human via an onboarding stage.
Results: User studies on object detection and question-answering tasks show that the method can lead to more accurate human-AI teams; the region-discovery and description algorithms are also evaluated separately.

People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.

Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang Ansel Blume Sha Li Genglin Liu Jaemin Cho Zineng Tang Mohit Bansal Heng Ji



Research question: Existing video-language models are deficient in action knowledge, relying mainly on object recognition as a shortcut for action understanding.
Motivation: To address this problem, we propose a new framework, Paxion, together with a new objective, Discriminative Video Dynamics Modeling (DVDM).
Method: The Paxion framework uses a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen video-language models without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited for learning action knowledge, the DVDM objective is introduced to train the Knowledge Patcher.
Results: Experiments show that Paxion and DVDM together effectively fill the gap in action knowledge understanding (from ~50% to 80%), while maintaining or improving performance on a wide spectrum of object- and action-centric downstream tasks.

Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the **Action Dynamics Benchmark (ActionBench)** containing two carefully designed probing tasks: Action Antonym and Video Reversal, which targets multimodal alignment capabilities and temporal understanding skills of the model, respectively. Despite recent video-language models’ (VidLM) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, **Paxion**, along with a new **Discriminative Video Dynamics Modeling (DVDM)** objective. The Paxion framework utilizes a **Knowledge Patcher** network to encode new action knowledge and a **Knowledge Fuser** component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Due to limitations of the widely-used Video-Text Contrastive (VTC) loss for learning action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that Paxion and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks.

Bypassing spike sorting: Density-based decoding using spike localization from dense multielectrode probes
Yizi Zhang Tianxiao He Julien Boussard Charlie Windolf Olivier Winter Eric M. Trautmann Noam Roth Hailey Barrel Mark M Churchland Nick Steinmetz Erdem Varol Cole Lincoln Hurwitz Liam Paninski



Research question: How to more accurately assign action potentials (spikes) to individual neurons, improving neural decoding for brain-computer interfaces (BCI).
Motivation: Current spike sorting algorithms can be inaccurate and do not properly model the uncertainty of spike assignments, thereby discarding information that could improve decoding performance.
Method: Proposes a spike-sorting-free decoding method that directly models the distribution of extracted spike features with a mixture of Gaussians (MoG), encoding the uncertainty of spike assignments without explicitly solving the spike clustering problem.
Results: Benchmarks on an extensive suite of recordings across animals and probe geometries show that the proposed decoder consistently outperforms current methods based on thresholding (i.e., multi-unit activity) and spike sorting.

Neural decoding and its applications to brain computer interfaces (BCI) are essential for understanding the association between neural activity and behavior. A prerequisite for many decoding approaches is spike sorting, the assignment of action potentials (spikes) to individual neurons. Current spike sorting algorithms, however, can be inaccurate and do not properly model uncertainty of spike assignments, therefore discarding information that could potentially improve decoding performance. Recent advances in high-density probes (e.g., Neuropixels) and computational methods now allow for extracting a rich set of spike features from unsorted data; these features can in turn be used to directly decode behavioral correlates. To this end, we propose a spike sorting-free decoding method that directly models the distribution of extracted spike features using a mixture of Gaussians (MoG) encoding the uncertainty of spike assignments, without aiming to solve the spike clustering problem explicitly. We allow the mixing proportion of the MoG to change over time in response to the behavior and develop variational inference methods to fit the resulting model and to perform decoding. We benchmark our method with an extensive suite of recordings from different animals and probe geometries, demonstrating that our proposed decoder can consistently outperform current methods based on thresholding (i.e. multi-unit activity) and spike sorting. Open source code is available at https://github.com/yzhang511/density_decoding.
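
The sorting-free idea can be sketched with off-the-shelf tools (the paper develops a time-varying MoG fit with variational inference; this static stand-in only shows where soft assignments replace sorted unit labels):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a static MoG to spike features; responsibilities act as soft "unit"
# assignments that downstream behavioral decoders can consume directly.
spike_feats = np.random.randn(5000, 4)          # e.g., spike location + amplitude features
mog = GaussianMixture(n_components=20, covariance_type="diag").fit(spike_feats)

resp = mog.predict_proba(spike_feats)           # (n_spikes, n_components) responsibilities
print(resp.sum(axis=0))                         # expected spike count per component
```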

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Lijun Yu Yong Cheng Zhiruo Wang Vivek Kumar Wolfgang Macherey Yanping Huang David A Ross Irfan Essa Yonatan Bisk Ming-Hsuan Yang Kevin Patrick Murphy Alexander G Hauptmann Lu Jiang



Research question: How to enable pretrained language models (PLMs) to perform understanding and generation tasks involving non-linguistic modalities such as images or videos.
Motivation: Current PLMs typically require retraining to handle tasks involving non-linguistic modalities, which adds computational cost and time.
Method: Proposes the Semantic Pyramid AutoEncoder (SPAE), which converts between raw pixels and interpretable lexical tokens (or words) drawn from the PLM's vocabulary, effectively translating visual content into a language the PLM can process and empowering it to perform a wide array of multimodal tasks.
Results: In-context learning experiments with frozen PaLM 2 and GPT 3.5 validate the method on a diverse set of image understanding and generation tasks. This marks the first successful attempt to let a frozen PLM generate image content while surpassing state-of-the-art image understanding performance, under the same setting, by over 25%.

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the rich semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.

Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Shengran Hu Jeff Clune



Research question: Reinforcement learning agents fall far short of human-level thinking abilities; the authors hypothesize that one reason is that they lack the benefits of thinking in language.
Motivation: Proposes a new imitation learning framework, Thought Cloning, that clones not only the behaviors of human demonstrators but also the thoughts humans have as they perform these behaviors, to improve AI agents.
Method: Experiments in a domain where thought and action data are synthetically generated show that Thought Cloning learns much faster than Behavioral Cloning, with its performance advantage growing the further out of distribution the test tasks are, indicating a better ability to handle novel situations.
Results: By training agents how to think as well as how to behave, Thought Cloning creates safer, more powerful agents. Because the agent's thoughts are observable, it is easier to diagnose why things go wrong, correct the agent's thinking, or prevent it from doing unsafe things it plans to do.

Language is often considered a key aspect of human thinking, providing us with exceptional abilities to generalize, explore, plan, replan, and adapt to new situations. However, Reinforcement Learning (RL) agents are far from human-level performance in any of these abilities. We hypothesize one reason for such cognitive deficiencies is that they lack the benefits of thinking in language and that we can improve AI agents by training them to $\textit{think like humans do}$. We introduce a novel Imitation Learning framework, Thought Cloning, where the idea is to not just clone the behaviors of human demonstrators, $\textit{but also the thoughts humans have as they perform these behaviors}$. While we expect Thought Cloning to truly shine at scale on internet-sized datasets (e.g. online videos with transcripts), here we conduct experiments in a domain where the thinking and action data are synthetically generated. Results reveal that Thought Cloning learns much faster than Behavioral Cloning and its performance advantage grows the further out of distribution test tasks are, highlighting its ability to better handle novel situations. Thought Cloning also provides important benefits for AI Safety and Interpretability, and makes it easier to debug and improve AI. Because we can observe the agent’s thoughts, we can (1) more easily diagnose why things are going wrong, making it easier to fix the problem, (2) steer the agent by correcting its thinking, or (3) prevent it from doing unsafe things it plans to do. Overall, by training agents $\textit{how to think}$ as well as behave, Thought Cloning creates safer, more powerful agents.
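
The dual imitation objective can be sketched as a two-term loss (our paraphrase of the idea; shapes and the weighting are illustrative):

```python
import torch.nn.functional as F

# Thought Cloning sketch: imitate both the demonstrator's actions and their
# verbalized thoughts, with a weight trading off the two terms.
def thought_cloning_loss(action_logits, actions, thought_logits, thought_tokens, alpha=1.0):
    """action_logits: (B, A); actions: (B,); thought_logits: (B, T, V); thought_tokens: (B, T)."""
    action_loss = F.cross_entropy(action_logits, actions)            # behavior cloning
    thought_loss = F.cross_entropy(                                  # thought cloning
        thought_logits.flatten(0, 1), thought_tokens.flatten())
    return action_loss + alpha * thought_loss
```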

4M: Massively Multimodal Masked Modeling
David Mizrahi Roman Bachmann Oguzhan Fatih Kar Teresa Yeo Mingfei Gao Afshin Dehghan Amir Zamir



Research question: This paper develops a multimodal training scheme aimed at generality and scalability for computer vision tasks.
Motivation: Current machine learning models for vision are often highly specialized and limited to a single modality and task; in contrast, recent large language models exhibit broad capabilities, hinting at the possibility of similarly versatile models in computer vision.
Method: Proposes a multimodal training scheme called 4M: a single unified Transformer encoder-decoder is trained with a masked modeling objective across many input/output modalities, including text, images, geometric and semantic modalities, and neural network feature maps. 4M achieves scalability by mapping all modalities to discrete tokens and performing multimodal masked modeling on a small randomized subset of those tokens.
Results: 4M trains models with several key capabilities: (1) they perform a diverse set of vision tasks out of the box; (2) they excel when fine-tuned for unseen downstream tasks or new input modalities; and (3) they can function as generative models conditionable on arbitrary modalities, enabling flexible and expressive multimodal editing.

Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities – including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens. 4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned for unseen downstream tasks or new input modalities, and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility. Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.
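
A sketch of the token-subset sampling as we read the abstract (position and modality embeddings omitted; not the released code): all modalities are already mapped to discrete tokens, and training uses a small random subset as encoder inputs and another subset as decoder targets.

```python
import torch

# 4M-style input/target sampling sketch over already-tokenized modalities.
def sample_multimodal_mask(tokens_per_modality, n_in=128, n_tgt=128):
    """tokens_per_modality: dict name -> 1D LongTensor of discrete tokens."""
    all_tokens = torch.cat(list(tokens_per_modality.values()))
    perm = torch.randperm(all_tokens.numel())
    input_ids = all_tokens[perm[:n_in]]                # visible tokens (encoder input)
    target_ids = all_tokens[perm[n_in:n_in + n_tgt]]   # masked tokens to predict (decoder)
    return input_ids, target_ids
```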

Computing a human-like reaction time metric from stable recurrent vision models
Lore Goetschalckx Lakshmi Narasimhan Govindarajan Alekh Karkada Ashok Aarit Ahuja David Sheinberg Thomas Serre



Research question: How to build a stimulus-computable, task-optimized model that accounts for the temporal dimension of human visual decision-making.
Motivation: With the widespread adoption of deep neural networks as computational models of vision, efforts have turned to aligning these models with human cognition; modeling reaction times is one important direction.
Method: Introduces a novel metric based on subjective logic theory that summarizes evidence accumulation in recurrent vision models, yielding a computational account of human reaction-time patterns.
Results: Across four disparate visual decision-making tasks spanning perceptual grouping, mental simulation, and scene categorization, the metric aligns with patterns of human reaction times, paving the way for exploring the temporal alignment of model and human visual strategies.

The meteoric rise in the adoption of deep neural networks as computational models of vision has inspired efforts to ``align” these models with humans. One dimension of interest for alignment includes behavioral choices, but moving beyond characterizing choice patterns to capturing temporal aspects of visual decision-making has been challenging. Here, we sketch a general-purpose methodology to construct computational accounts of reaction times from a stimulus-computable, task-optimized model. Specifically, we introduce a novel metric leveraging insights from subjective logic theory summarizing evidence accumulation in recurrent vision models. We demonstrate that our metric aligns with patterns of human reaction times for stimulus manipulations across four disparate visual decision-making tasks spanning perceptual grouping, mental simulation, and scene categorization. This work paves the way for exploring the temporal alignment of model and human visual strategies in the context of various other cognitive tasks toward generating testable hypotheses for neuroscience. Links to the code and data can be found on the project page: https://serre-lab.github.io/rnn_rts_site/.
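
One simple evidential instantiation in the spirit of the abstract (not the paper's exact metric): treat per-step non-negative class evidence from a recurrent model as Dirichlet evidence, track subjective logic's vacuity as it decays, and read off a reaction-time proxy.

```python
import numpy as np

# Subjective-logic sketch: vacuity u = K / (K + total evidence) shrinks as the
# recurrent model accumulates evidence; the first step below a threshold is a
# reaction-time proxy (our simplified stand-in for the paper's metric).
def reaction_time_proxy(evidence_per_step, threshold=0.2):
    """evidence_per_step: (T, K) non-negative evidence for K classes over T steps."""
    K = evidence_per_step.shape[1]
    cumulative = np.cumsum(evidence_per_step, axis=0)      # accumulate over time
    vacuity = K / (K + cumulative.sum(axis=1))             # subjective-logic uncertainty
    below = np.nonzero(vacuity < threshold)[0]
    return (below[0] + 1) if below.size else len(vacuity)  # steps until "confident"
```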

3D-LLM: Injecting the 3D World into Large Language Models
Yining Hong Haoyu Zhen Peihao Chen Shuhong Zheng Yilun Du Zhenfang Chen Chuang Gan



Research question: How to inject the 3D world into large language models so they can handle richer concepts such as spatial relationships, affordances, physics, and layout.
Motivation: Existing large language models and vision-language models excel at many tasks but are not grounded in the 3D physical world and cannot handle more complex 3D-related tasks.
Method: Proposes a new family of 3D-LLMs that take 3D point clouds and their features as input and perform a diverse set of 3D tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, and navigation. Over 300k 3D-language examples are collected via three designed prompting mechanisms. Training uses a 3D feature extractor over rendered multi-view images with 2D VLMs as backbones, and a 3D localization mechanism is introduced so that 3D-LLMs can better capture 3D spatial information.
Results: On ScanQA the model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score exceeds the state of the art by 9%); on held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue it outperforms 2D VLMs. Qualitative examples show the model can perform tasks beyond the scope of existing LLMs and VLMs. The model and data will be publicly released.

Large language models (LLMs) and Vision-Language Models (VLMs) have been proved to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models, and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs could better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (\textit{e.g.}, the BLEU-1 score surpasses state-of-the-art score by 9\%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Our model and data will be publicly available.

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Zhiqing Sun Yikang Shen Qinhong Zhou Hongxin Zhang Zhenfang Chen David Daniel Cox Yiming Yang Chuang Gan



Research question: How to reduce reliance on human supervision while improving the helpfulness, ethics, and reliability of AI assistants.
Motivation: Current AI assistants rely mainly on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF); this approach is costly and suffers from issues of quality, reliability, diversity, self-consistency, and undesirable biases.
Method: Propose SELF-ALIGN, a novel approach that combines principle-driven reasoning with the generative power of large language models (LLMs) to self-align AI agents with minimal human supervision.
Results: Applying SELF-ALIGN to the LLaMA-65b base language model yields an AI assistant named Dromedary. With fewer than 300 lines of human annotation (including <200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets in various settings.

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues on quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to user's queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.

Birth of a Transformer: A Memory Viewpoint
Alberto Bietti Vivien Cabannes Diane Bouchacourt Herve Jegou Leon Bottou



Research question: Understand the internal mechanisms of large transformer-based models in order to make them more reliable.
Motivation: As these models are deployed more widely, the need to understand their internal mechanisms keeps growing.
Method: Consider a synthetic setting where tokens are generated from either global or context-specific bigram distributions, and study how transformers balance these two types of knowledge. A careful empirical analysis of the training process of a simplified two-layer transformer illustrates the fast learning of global bigrams and the slower development of an "induction head" mechanism for in-context bigrams.
Results: Highlight the role of weight matrices as associative memories, provide theoretical insight into how gradients enable their learning during training, and study the role of data-distributional properties.

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
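
To make the synthetic setting concrete, here is a minimal sketch (not the authors' exact generator) of sequences mixing a fixed global bigram table with a few per-sequence bigram rules; the vocabulary size and rule counts are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    V = 64  # vocabulary size (assumed)

    # Global bigram transition matrix, shared across all sequences.
    global_bigrams = rng.dirichlet(np.ones(V), size=V)

    def sample_sequence(length=128, n_context_rules=4):
        # Context-specific rules: for a few trigger tokens, the successor
        # is fixed within this sequence only.
        triggers = rng.choice(V, size=n_context_rules, replace=False)
        successors = rng.choice(V, size=n_context_rules)
        rules = dict(zip(triggers.tolist(), successors.tolist()))
        seq = [int(rng.integers(V))]
        for _ in range(length - 1):
            prev = seq[-1]
            if prev in rules:                    # context-specific bigram
                seq.append(rules[prev])
            else:                                # global bigram
                seq.append(int(rng.choice(V, p=global_bigrams[prev])))
        return seq

    print(sample_sequence()[:16])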

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Yann Dubois Xuechen Li Rohan Taori Tianyi Zhang Ishaan Gulrajani Jimmy Ba Carlos Guestrin Percy Liang Tatsunori Hashimoto



Research question: Large language models (LLMs) follow user instructions well, but the training workflow behind this behavior is complex and poorly understood.
Motivation: Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference implementations.
Method: Develop AlpacaFarm, a simulator that enables low-cost research and development on learning from feedback. Design an LLM-based human-feedback simulator that is 45x cheaper than crowdworkers and agrees highly with humans; identify an evaluation dataset representative of real-world instructions with an automatic evaluation procedure; and provide reference implementations of several pairwise-feedback methods (PPO, best-of-n, expert iteration, among others).
Results: Training and evaluating eleven models in AlpacaFarm shows that their rankings match those of models trained on human data. As part of AlpacaFarm's end-to-end validation, methods that use a reward model substantially outperform supervised fine-tuning, and the reference PPO implementation improves the win rate against Davinci003 by +10%.

Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these bottlenecks with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design an LLM-based simulator for human feedback that is 45x cheaper than crowdworkers and displays high agreement with humans. Second, we identify an evaluation dataset representative of real-world instructions and propose an automatic evaluation procedure. Third, we contribute reference implementations for several methods (PPO, best-of-n, expert iteration, among others) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% win-rate improvement against Davinci003.

In-Context Learning Unlocked for Diffusion Models
Zhendong Wang Yifan Jiang Yadong Lu yelong shen Pengcheng He Weizhu Chen Zhangyang Wang Mingyuan Zhou



Research question: Develop a framework, "Prompt Diffusion", that enables in-context learning in diffusion-based generative models.
Motivation: Existing diffusion-based generative models cannot perform in-context learning and require new text guidance to carry out new tasks.
Method: Propose a vision-language prompt that can model a wide range of vision-language tasks, together with a diffusion model that takes this prompt as input; the model is trained jointly on six different tasks using these prompts.
Results: The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks using their respective prompts. The model also shows compelling text-guided image editing results.

We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly on six different tasks using these prompts. The resulting Prompt Diffusion model becomes the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation for the trained tasks and effectively generalizes to new, unseen vision tasks using their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian Chongyang Gao Soroush Vosoughi



Research question: How to optimize the use of frozen large language models for resource-intensive vision-language pre-training.
Motivation: Current methods focus on identifying the visual features most relevant to the text; this work instead focuses on the language side, identifying the ideal prompts that best align with visual features.
Method: Propose the Prompt-Transformer (P-Former), a model that predicts these ideal prompts and is trained exclusively on linguistic data, without image-text pairs.
Results: Experiments show the method significantly improves a strong image-to-text baseline (BLIP-2) and effectively narrows the performance gap between models trained with 4M versus 129M image-text pairs. The framework is modality-agnostic and flexible in architectural design, as validated on a video learning task.

We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task using varied base modules. The code will be made available at https://github.com/yiren-jian/BLIText.

Counterfactual Memorization in Neural Language Models
Chiyuan Zhang Daphne Ippolito Katherine Lee Matthew Jagielski Florian Tramèr Nicholas Carlini



Research question: Modern neural language models used across NLP tasks risk memorizing sensitive information from their training data; understanding this memorization matters both for real-world applications and for learning theory.
Motivation: An open question in prior studies of language-model memorization is how to filter out "common" memorization; most memorization criteria correlate strongly with the number of occurrences in the training set, capturing familiar phrases, public knowledge, templated text, and other repeated data.
Method: Formulate a notion of counterfactual memorization, characterizing how a model's predictions change if a particular document is omitted during training, and identify and study counterfactually memorized training examples in standard text datasets.
Results: Estimate the influence of each memorized training example on the validation set and on generated text, showing how this provides direct evidence of the source of memorization at test time.

Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data. Understanding this memorization is important in real world applications and also from a learning-theoretical perspective. An open question in previous studies of language model memorization is how to filter out ``common'' memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing memorized familiar phrases, public knowledge, templated texts, or other repeated data. We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We estimate the influence of each memorized training example on the validation set and on generated texts, showing how this can provide direct evidence of the source of memorization at test time.
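
The estimator behind this notion can be sketched as follows: train models on random subsets of the corpus and compare each document's score when it is held in versus out of the training set. `train` and `log_prob` are hypothetical placeholders for the caller's training and scoring routines, not the paper's code, and `n_subsets` is assumed large enough that every document lands in both buckets.

    import random

    def counterfactual_memorization(docs, train, log_prob, n_subsets=20, frac=0.5):
        in_scores = {i: [] for i in range(len(docs))}
        out_scores = {i: [] for i in range(len(docs))}
        for _ in range(n_subsets):
            subset = set(random.sample(range(len(docs)), int(frac * len(docs))))
            model = train([docs[i] for i in subset])
            for i, doc in enumerate(docs):
                (in_scores if i in subset else out_scores)[i].append(log_prob(model, doc))
        # Memorization of doc i: how much better models do on it when trained on it.
        return {
            i: sum(in_scores[i]) / max(len(in_scores[i]), 1)
               - sum(out_scores[i]) / max(len(out_scores[i]), 1)
            for i in range(len(docs))
        }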

PRODIGY: Enabling In-context Learning Over Graphs
Qian Huang Hongyu Ren Peng Chen Gregor Kržmanc Daniel Zeng Percy Liang Jure Leskovec



Research question: How to enable in-context learning over graphs.
Motivation: Large language models have demonstrated in-context learning, but how it could be performed over graph structures remains unexplored.
Method: Develop PRODIGY, the first pretraining framework that enables in-context learning over graphs. It formalizes in-context learning on graphs with a novel "prompt graph" representation that connects prompt examples and queries, together with a graph neural network architecture over the prompt graph and a corresponding family of in-context pretraining objectives.
Results: Experiments show that with PRODIGY, the pretrained model can directly perform novel downstream classification tasks on unseen graphs via in-context learning. The approach outperforms the hard-coded adaptation of contrastive pretraining baselines by 18% on average across all settings, and outperforms standard fine-tuning with limited data by 33% on average.

In-context learning is the ability of a pretrained model to adapt to novel and diverse downstream tasks by conditioning on prompt examples, without optimizing any parameters. While large language models have demonstrated this ability, how in-context learning could be performed over graphs is unexplored. In this paper, we develop \textbf{Pr}etraining \textbf{O}ver \textbf{D}iverse \textbf{I}n-Context \textbf{G}raph S\textbf{y}stems (PRODIGY), the first pretraining framework that enables in-context learning over graphs. The key idea of our framework is to formulate in-context learning over graphs with a novel \emph{prompt graph} representation, which connects prompt examples and queries. We then propose a graph neural network architecture over the prompt graph and a corresponding family of in-context pretraining objectives. With PRODIGY, the pretrained model can directly perform novel downstream classification tasks on unseen graphs via in-context learning. We provide empirical evidence of the effectiveness of our framework by showcasing its strong in-context learning performance on tasks involving citation networks and knowledge graphs. Our approach outperforms the in-context learning accuracy of contrastive pretraining baselines with hard-coded adaptation by 18\% on average across all setups. Moreover, it also outperforms standard finetuning with limited data by 33\% on average with in-context learning.

Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
Aran Nayebi Rishi Rajalingham Mehrdad Jazayeri Guangyu Robert Yang



Research question: How do humans and animals, through their understanding of the physical world, infer the dynamical trajectories of objects and events and plausible future states, and use these to plan and anticipate the consequences of actions?
Motivation: The neural mechanisms underlying these computations remain unclear.
Method: Combine a goal-driven modeling approach with high-density neurophysiological data and high-throughput human behavioral readouts to address this question directly; specifically, construct and evaluate several classes of sensory-cognitive networks that predict the future state of rich, ethologically relevant environments.
Results: "Scale is not all you need": many state-of-the-art machine learning models fail on these neural and behavioral benchmarks, and only one class of models matches the data well overall. Neural responses are currently best predicted by models trained, in a self-supervised manner, to predict the future state of the environment in the latent space of pretrained foundation models optimized for dynamic scenes. These models also approach the neurons' ability to predict visually hidden environmental state variables, despite never being explicitly trained to do so. Moreover, not all foundation-model latent spaces are equal: models that future-predict in the latent space of video foundation models optimized to support diverse egocentric sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all testable environmental scenarios. Overall, these findings suggest that primate mental simulation carries strong inductive biases and is, so far, most consistent with optimization for future prediction on reusable visual representations that are useful for embodied AI more broadly.

Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts that contain thousands of comparisons to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-slot objectives, to models that future predict in the latent space of purely static image-pretrained or dynamic video-pretrained foundation models. We find that ``scale is \emph{not} all you need'', and that many state-of-the-art machine learning models fail to perform well on our neural and behavioral benchmarks for future prediction. In fact, only one class of models matches these data well overall. We find that neural responses are currently best predicted by models trained to predict the future state of their environment in the \emph{latent} space of pretrained foundation models optimized for \emph{dynamic} scenes in a self-supervised manner. These models also approach the neurons' ability to predict the environmental state variables that are visually hidden from view, despite not being explicitly trained to do so. Finally, we find that not all foundation model latents are equal. Notably, models that future predict in the latent space of video foundation models that are optimized to support a \emph{diverse} range of egocentric sensorimotor tasks, reasonably match \emph{both} human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases associated with them, and are thus far most consistent with being optimized to future predict on \emph{reusable} visual representations that are useful for Embodied AI more generally.
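
The winning model class's objective can be sketched as training a small predictor to roll forward the latents of a frozen pretrained backbone. The GRU predictor, latent dimension, and one-step horizon below are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    latent_dim = 512
    predictor = nn.GRU(latent_dim, latent_dim, batch_first=True)
    head = nn.Linear(latent_dim, latent_dim)

    def future_prediction_loss(latents):
        # latents: (batch, time, latent_dim), precomputed by a frozen backbone.
        context, target = latents[:, :-1], latents[:, 1:]
        out, _ = predictor(context)
        return nn.functional.mse_loss(head(out), target)

    loss = future_prediction_loss(torch.randn(8, 10, latent_dim))
    loss.backward()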

Optimizing Prompts for Text-to-Image Generation
Yaru Hao Zewen Chi Li Dong Furu Wei



Research question: How to design effective prompts that guide text-to-image models to generate striking images.
Motivation: Performant prompts are often model-specific and misaligned with user input.
Method: Propose prompt adaptation, a general framework that automatically adapts original user input into model-preferred prompts. Specifically, first perform supervised fine-tuning of a pretrained language model on a small collection of manually engineered prompts, then use reinforcement learning to explore better prompts, with a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intent.
Results: Experiments on Stable Diffusion show the method outperforms manual prompt engineering on both automatic metrics and human preference ratings; reinforcement learning further boosts performance, especially on out-of-domain prompts.

Well-designed prompts can guide text-to-image models to generate amazing images. However, the performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts.
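
The reward described above might be sketched as follows, with `aesthetic_score` (an image aesthetics predictor) and `clip_similarity` (image-text relevance to the original user prompt) as hypothetical placeholder scorers rather than the paper's exact components.

    def prompt_reward(image, user_prompt, aesthetic_score, clip_similarity, lam=1.0):
        # Encourage more aesthetically pleasing images...
        r_aes = aesthetic_score(image)
        # ...while preserving the original user intention.
        r_rel = clip_similarity(image, user_prompt)
        return r_aes + lam * r_rel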

Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models
Alvin Heng Harold Soh



Research question: How to prevent large text-to-image models from being misused to generate harmful, misleading, and inappropriate content.
Motivation: The widespread use of large text-to-image models has raised growing concerns that they may be misused.
Method: Propose a technique inspired by continual learning, dubbed Selective Amnesia, for selectively forgetting concepts in pretrained deep generative models.
Results: Experiments show the approach induces forgetting of a variety of concepts across different models, from entire classes in standard datasets to celebrity and nudity prompts in text-to-image models.

The recent proliferation of large-scale text-to-image models has led to growing concerns that such models may be misused to generate harmful, misleading, and inappropriate content. Motivated by this issue, we derive a technique inspired by continual learning to selectively forget concepts in pretrained deep generative models. Our method, dubbed Selective Amnesia, enables controllable forgetting where a user can specify how a concept should be forgotten. Selective Amnesia can be applied to conditional variational likelihood models, which encompass a variety of popular deep generative frameworks, including variational autoencoders and large-scale text-to-image diffusion models. Experiments across different models demonstrate that our approach induces forgetting on a variety of concepts, from entire classes in standard datasets to celebrity and nudity prompts in text-to-image models.

Exposing Attention Glitches with Flip-Flop Language Modeling
Bingbin Liu Jordan T. Ash Surbhi Goel Akshay Krishnamurthy Cyril Zhang



Research question: Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning?
Motivation: Toward making sense of this fundamentally unsolved problem, this work identifies and analyzes the phenomenon of attention glitches, in which the Transformer architecture's inductive biases intermittently fail to capture robust reasoning.
Method: Introduce flip-flop language modeling (FFLM), a parametric family of synthetic benchmarks for probing the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between.
Results: Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which can be eliminated with various regularization techniques. Preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. The authors hypothesize that attention glitches account for some of the closed-domain hallucinations in natural LLMs.

Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? The brittleness of these models, particularly when executing long chains of reasoning, currently seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought. Towards making sense of this fundamentally unsolved problem, this work identifies and analyzes the phenomenon of _attention glitches_, in which the Transformer architecture's inductive biases intermittently fail to capture robust reasoning. To isolate the issue, we introduce _flip-flop language modeling_ (FFLM), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques. Our preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. We hypothesize that attention glitches account for (some of) the closed-domain hallucinations in natural LLMs.
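
A minimal data generator in the spirit of FFLM (the token names and probabilities are illustrative assumptions): `w` writes a bit, `i` introduces an irrelevant bit to ignore, and `r` must reproduce the most recently written bit.

    import random

    def flip_flop_sequence(length=64, p_ignore=0.8):
        toks, mem = [], random.choice("01")
        toks += ["w", mem]                            # always start with a write
        while len(toks) < length:
            if random.random() < p_ignore:
                toks += ["i", random.choice("01")]    # distractor to be ignored
            elif random.random() < 0.5:
                mem = random.choice("01")
                toks += ["w", mem]                    # overwrite the memory bit
            else:
                toks += ["r", mem]                    # read: target is current memory
        return toks

    print(" ".join(flip_flop_sequence(20)))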

Alignment with human representations supports robust few-shot learning
Ilia Sucholutsky Thomas L. Griffiths



Research question: Should we care whether AI systems represent the world similarly to humans?
Motivation: An information-theoretic analysis suggests there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks.
Method: Confirm this prediction empirically by analyzing the performance of 491 computer vision models.
Results: Highly aligned models are more robust to both natural adversarial attacks and domain shifts. The findings suggest that human alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, remain robust, and generalize well.

Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both natural adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.

ProPILE: Probing Privacy Leakage in Large Language Models
Siwon Kim Sangdoo Yun Hwaran Lee Martin Gubri Sungroh Yoon Seong Joon Oh



Research question: The rapid advancement and widespread use of large language models (LLMs) raise serious concerns about the leakage of personally identifiable information (PII).
Motivation: These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data.
Method: Propose ProPILE, a novel probing tool designed to give data subjects, i.e., the owners of PII, awareness of potential PII leakage in LLM-based services.
Results: Experiments demonstrate that ProPILE can effectively assess levels of PII leakage, empowering data subjects with awareness of and control over their own data.

The rapid advancement and widespread use of large language models (LLMs) have raised significant concerns regarding the potential leakage of personally identifiable information (PII). These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage in LLM-based services. ProPILE lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs. We demonstrate its application on the OPT-1.3B model trained on the publicly available Pile dataset. We show how hypothetical data subjects may assess the likelihood of their PII being included in the Pile dataset being revealed. ProPILE can also be leveraged by LLM service providers to effectively evaluate their own levels of PII leakage with more powerful prompts specifically tuned for their in-house models. This tool represents a pioneering step towards empowering the data subjects for their awareness and control over their own data on the web.

Model Spider: Learning to Rank Pre-Trained Models Efficiently
Yi-Kai Zhang Ting-Ji Huang Yao-Xiang Ding De-Chuan Zhan Han-Jia Ye



Research question: How to select the pre-trained model best suited to a target task from a model zoo.
Motivation: With numerous heterogeneous pre-trained models available from diverse fields, selecting the most suitable one efficiently is challenging, since running forward or backward passes over all of them is time-consuming.
Method: Propose Model Spider, which tokenizes both pre-trained models and tasks by summarizing their characteristics into vectors to enable efficient selection. Leveraging the performance of pre-trained models on a separate set of training tasks, Model Spider learns to construct representations and measure a fitness score between model-task pairs; its ability to rank relevant pre-trained models above others generalizes to new tasks.
Results: Model Spider performs well across diverse model zoos, including visual models and large language models. Code is available at https://github.com/zhangyikaii/Model-Spider.

Figuring out which Pre-Trained Model (PTM) from a model zoo fits the target task is essential to take advantage of plentiful model resources. With the availability of numerous heterogeneous PTMs from diverse fields, efficiently selecting the most suitable one is challenging due to the time-consuming costs of carrying out forward or backward passes over all PTMs. In this paper, we propose Model Spider, which tokenizes both PTMs and tasks by summarizing their characteristics into vectors to enable efficient PTM selection. By leveraging the approximated performance of PTMs on a separate set of training tasks, Model Spider learns to construct representation and measure the fitness score between a model-task pair via their representation. The ability to rank relevant PTMs higher than others generalizes to new tasks. With the top-ranked PTM candidates, we further learn to enrich task representations with their PTM-specific semantics to re-rank the PTMs for better selection. Model Spider balances efficiency and selection ability, making PTM selection like a spider preying on a web. Model Spider exhibits promising performance across diverse model zoos, including visual models and Large Language Models (LLMs). Code is available at https://github.com/zhangyikaii/Model-Spider.

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Sang Michael Xie Hieu Pham Xuanyi Dong Nan Du Hanxiao Liu Yifeng Lu Percy Liang Quoc V Le Tengyu Ma Adams Wei Yu



Research question: The mixture proportions of pretraining data domains greatly affect language model performance.
Motivation: Propose a method to optimize the weights of pretraining data domains, improving the training efficiency and performance of large language models.
Method: Train a small proxy model with group distributionally robust optimization (Group DRO) over domains to produce domain weights, then resample the dataset with these weights and train a large, full-sized model.
Results: Experiments show the method improves perplexity across all domains and reaches baseline accuracy with fewer training steps.

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to find domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
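
The Group DRO reweighting step can be sketched as an exponentiated-gradient update on the domain weights, driven by the proxy model's excess loss over a reference model; the learning rate and uniform smoothing below are illustrative assumptions.

    import numpy as np

    def update_domain_weights(weights, proxy_losses, ref_losses, lr=1.0, smooth=1e-3):
        excess = np.maximum(proxy_losses - ref_losses, 0.0)
        w = weights * np.exp(lr * excess)           # upweight hard domains
        w /= w.sum()
        uniform = np.ones_like(w) / len(w)
        return (1 - smooth) * w + smooth * uniform  # mix with uniform for stability

    w = np.ones(3) / 3
    w = update_domain_weights(w, np.array([2.0, 1.0, 1.5]), np.array([1.5, 1.0, 1.4]))
    print(w)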

Text-to-Image Diffusion Models are Zero Shot Classifiers
Kevin Clark Priyank Jaini



Research question: The representation-learning abilities of diffusion models are not fully understood, nor have they been thoroughly explored on downstream tasks.
Motivation: Diffusion models have excellent generative capabilities and plausibly learn informative representations of image-text data, but what knowledge these representations capture is unclear.
Method: Propose a method for evaluating diffusion models as zero-shot classifiers: use a diffusion model's ability to denoise a noised image, given a text description of a label, as a proxy for that label's likelihood.
Results: Applied to Stable Diffusion and Imagen, the method shows they are competitive with CLIP on a wide range of zero-shot image classification datasets. They also achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding where CLIP cannot. The authors therefore argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.
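
The key idea can be sketched as follows: noise the image, ask the conditional denoiser to remove the noise under each candidate label's prompt, and pick the label with the lowest denoising error. Here `denoise(x_t, t, prompt)` is a hypothetical placeholder for the model's noise prediction, and the linear schedule is a toy assumption.

    import torch

    def diffusion_classify(x0, prompts, denoise, n_trials=32):
        errors = torch.zeros(len(prompts))
        for _ in range(n_trials):
            t = torch.randint(0, 1000, (1,))
            eps = torch.randn_like(x0)
            alpha = 1.0 - t.item() / 1000.0                   # toy schedule (assumption)
            x_t = (alpha ** 0.5) * x0 + ((1 - alpha) ** 0.5) * eps
            for c, prompt in enumerate(prompts):
                errors[c] += ((denoise(x_t, t, prompt) - eps) ** 2).mean()
        return int(errors.argmin())                           # lowest error wins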

One Fits All: Power General Time Series Analysis by Pretrained LM
Tian Zhou Peisong Niu Xue Wang Liang Sun Rong Jin



Research question: Despite the great success of pre-trained models in natural language processing and computer vision, progress in general time series analysis has been limited.
Motivation: Time series analysis tasks rely on specially designed methods and lack large amounts of training data, limiting the development of pre-trained models in this domain.
Method: Leverage language or image models already pre-trained on billions of tokens for time series analysis, without altering the self-attention and feedforward layers of the pre-trained model.
Results: Experiments show that pre-trained language or image models achieve comparable or state-of-the-art performance on all major time series analysis tasks. Theoretically and empirically, the self-attention module behaves similarly to principal component analysis (PCA), which helps explain how pre-trained transformers bridge the domain gap and is a key step toward understanding their universality.

Although we have witnessed great success of pre-trained models in natural language processing (NLP) and computer vision (CV), limited progress has been made for general time series analysis. Unlike NLP and CV where a unified model can be used to perform different tasks, specially designed approaches still dominate in each time series analysis task such as classification, anomaly detection, forecasting, and few-shot learning. The main challenge that blocks the development of pre-trained models for time series analysis is the lack of a large amount of data for training. In this work, we address this challenge by leveraging language or CV models, pre-trained from billions of tokens, for time series analysis. Specifically, we refrain from altering the self-attention and feedforward layers of the residual blocks in the pre-trained language or image model. This model, known as the Frozen Pretrained Transformer (FPT), is evaluated through fine-tuning on all major types of tasks involving time series. Our results demonstrate that pre-trained models on natural language or images can lead to comparable or state-of-the-art performance in all main time series analysis tasks, as illustrated in Figure 1. We also found both theoretically and empirically that the self-attention module behaves similarly to principal component analysis (PCA), an observation that helps explain how transformers bridge the domain gap and is a crucial step towards understanding the universality of a pre-trained transformer. The code is publicly available at https://anonymous.4open.science/r/Pretrained-LM-for-TSForcasting-C561.
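
The freezing recipe can be sketched with a Hugging Face GPT-2 (the specific checkpoint is an illustrative assumption): self-attention and feedforward weights stay fixed, while embeddings and layer norms remain trainable.

    from transformers import GPT2Model

    model = GPT2Model.from_pretrained("gpt2")
    for name, param in model.named_parameters():
        # GPT-2 names its attention blocks "attn" and its feedforward blocks "mlp";
        # everything else (wte, wpe, layer norms) stays trainable.
        param.requires_grad = not ("attn" in name or "mlp" in name)

    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"{len(trainable)} trainable parameter tensors, e.g. {trainable[:3]}")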

Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
Morris Alper Hadar Averbuch-Elor



Research question: Investigate whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion.
Motivation: Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, cognitive science research has shown specific sound-meaning correlations across languages and demographic groups, a phenomenon known as sound symbolism.
Method: Use zero-shot knowledge probing to investigate the inherent knowledge of these models, finding strong evidence that they do exhibit this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics.
Results: The work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. The code will be publicly released.

Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regards to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.
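
One plausible zero-shot probe, sketched with the Hugging Face CLIP checkpoint (the prompt templates below are illustrative assumptions, not the paper's exact ones): embed pseudoword prompts and shape descriptions, then compare cosine similarities.

    import torch
    from transformers import CLIPModel, CLIPTokenizer

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    def text_embed(texts):
        inputs = tok(texts, padding=True, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_text_features(**inputs)
        return torch.nn.functional.normalize(feats, dim=-1)

    pseudo = text_embed(['a 3D rendering of a "kiki" object',
                         'a 3D rendering of a "bouba" object'])
    shapes = text_embed(["a sharp, spiky object", "a round, smooth object"])
    print(pseudo @ shapes.T)   # kiki should lean spiky, bouba should lean round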

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh Assaf Arbelle Sivan Harary Roei Herzig Donghyun Kim Paola Cascante-Bonilla Amit Alfassy Rameswar Panda Raja Giryes Rogerio Feris Shimon Ullman Leonid Karlinsky



Research question: Address the "object bias" of vision-and-language (VL) models when aligning image and text representation spaces: their representations behave like "bags of nouns", mostly ignoring or downweighting the attributes, relations, and states of objects described in texts or appearing in images.
Motivation: Despite some notable recent attempts to fix these "compositional reasoning" issues, the problem remains far from solved.
Method: Identify two factors limiting VL models' compositional reasoning performance, both properties of the paired VL dataset used for fine-tuning (or pre-training): (i) the caption quality, i.e., the "image alignment" of the texts; and (ii) caption "density", in the sense of mentioning all the details appearing in the image. Propose a fine-tuning approach that automatically treats these factors on a standard paired VL dataset (CC3M).
Results: Applied to CLIP, experiments show compositional reasoning gains of up to ~27% over the base model, up to ~20% over the strongest baseline, and 6.7% on average.

Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text allowing for numerous applications such as cross-modal retrieval, visual and multi-hop question answering, captioning, and many more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called 'object bias' - their representations behave as 'bags of nouns' mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these `compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for finetuning (or pre-training) the VL model: (i) the caption quality, or in other words 'image-alignment', of the texts; and (ii) the 'density' of the captions in the sense of mentioning all the details appearing on the image. We propose a fine-tuning approach for automatically treating these factors on a standard collection of paired VL data (CC3M). Applied to CLIP, we demonstrate its significant compositional reasoning performance increase of up to $\sim27$\% over the base model, up to $\sim20$\% over the strongest baseline, and by $6.7$\% on average. Our code is provided in the Supplementary and will be released upon acceptance.

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li Oam Patel Fernanda Viégas Hanspeter Pfister Martin Wattenberg



Research question: How to improve the "truthfulness" of large language models.
Motivation: Existing large language models can produce falsehoods when generating content; a method is needed to make them more truthful.
Method: Propose Inference-Time Intervention (ITI), which shifts model activations during inference along a learned set of directions across a limited number of attention heads, significantly improving the performance of LLaMA models on the TruthfulQA benchmark.
Results: Experiments show ITI significantly improves the truthfulness of large language models while maintaining helpfulness; the method is computationally inexpensive and data-efficient.

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a learned set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from $32.5\%$ to $65.1\%$. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
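
The intervention can be sketched as a PyTorch forward hook that adds a scaled direction to a chosen attention module's output at inference time. The random direction, layer index, and scale below are placeholders for the learned quantities, not the paper's values.

    import torch

    def make_iti_hook(direction, alpha=5.0):
        direction = direction / direction.norm()
        def hook(module, inputs, output):
            # Some attention modules return tuples (hidden, weights, cache, ...).
            if isinstance(output, tuple):
                return (output[0] + alpha * direction.to(output[0]),) + output[1:]
            return output + alpha * direction.to(output)
        return hook

    # Usage (hypothetical layer choice for a LLaMA-style model):
    # handle = model.model.layers[14].self_attn.register_forward_hook(
    #     make_iti_hook(torch.randn(hidden_size)))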

Lexinvariant Language Models
Qian Huang Eric Zelikman Sarah Li Chen Yuhuai Wu Gregory Valiant Percy Liang



Research question: Can a language model perform well without any fixed token embeddings?
Motivation: Current pretrained language models rely on token embeddings, yet a token's meaning can also be determined by its structural role in a long context; this paper explores whether a language model can be built without fixed token embeddings.
Method: Encode tokens with random Gaussian vectors, so that each token maps to the same representation within a sequence but to different representations across sequences, yielding a lexinvariant language model.
Results: Experiments show such a model can attain perplexity comparable to a standard language model given a sufficiently long context. It also implicitly performs Bayesian in-context deciphering and achieves on average 4x better accuracy on synthetic in-context reasoning tasks.

Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without \emph{any} fixed token embeddings? Such a language model would have to rely entirely on the co-occurrence and repetition of tokens in the context rather than the \textit{a priori} identity of any token. To answer this, we study \textit{lexinvariant} language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that we can construct a lexinvariant LM to converge to the true language model at a uniform rate that is polynomial in terms of the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but different representations across sequences. Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of the lexinvariant language models: First, given text generated from a substitution cipher of English, it implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4X better accuracy over synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications.
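
The token encoding can be sketched directly (the dimensions are illustrative assumptions): each sequence draws a fresh Gaussian codebook, so token identity is stable within a sequence but meaningless across sequences.

    import torch

    def lexinvariant_embed(token_ids, vocab_size, dim=256):
        # token_ids: (batch, seq_len); one independent codebook per sequence.
        batch = token_ids.shape[0]
        codebooks = torch.randn(batch, vocab_size, dim) / dim ** 0.5
        return torch.gather(
            codebooks, 1,
            token_ids.unsqueeze(-1).expand(-1, -1, dim),
        )

    emb = lexinvariant_embed(torch.randint(0, 100, (4, 16)), vocab_size=100)
    print(emb.shape)  # (4, 16, 256)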

Parsel🐍: Algorithmic Reasoning with Language Models by Composing Decompositions
Eric Zelikman Qian Huang Gabriel Poesia Noah Goodman Nick Haber



Research question: Large language models struggle with hierarchical multi-step reasoning tasks such as generating complex programs.
Motivation: Humans typically approach such tasks by starting from a high-level algorithmic design and implementing each part gradually; the authors therefore propose the Parsel framework to let LLMs automatically implement and validate complex algorithms.
Method: Parsel automatically decomposes algorithmic tasks into hierarchical natural-language function descriptions, then searches over combinations of possible function implementations using tests.
Results: With Parsel, LLMs solve more competition-level problems on the APPS dataset, with pass rates over 75% higher than those from directly sampling AlphaCode and Codex, often with a smaller sample budget. With automatically generated tests, Parsel improves state-of-the-art pass@1 on HumanEval from 67% to 85%. Robot plans generated with Parsel are also more than twice as likely to be judged accurate as directly generated plans.

Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs. With Parsel, we automatically decompose algorithmic tasks into hierarchical natural language function descriptions and then search over combinations of possible function implementations using tests. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis and robotic planning. We find that, using Parsel, LLMs solve more competition-level problems in the APPS dataset, resulting in pass rates over 75\% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. Moreover, with automatically generated tests, we find that Parsel can improve the state-of-the-art pass@1 performance on HumanEval from 67\% to 85\%. We also find that LLM-generated robotic plans using Parsel are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers. We release our code at https://github.com/ezelikman/parsel.
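
The search over implementation combinations can be sketched as follows; candidate generation by the LLM is elided, and the toy `double` task is purely illustrative.

    from itertools import product

    def search_implementations(candidates, tests):
        # candidates: {function_name: [source_string, ...]}
        names = list(candidates)
        for combo in product(*(candidates[n] for n in names)):
            namespace = {}
            try:
                for src in combo:
                    exec(src, namespace)            # assemble one full program
                if all(test(namespace) for test in tests):
                    return dict(zip(names, combo))  # first passing combination
            except Exception:
                continue
        return None

    candidates = {"double": ["def double(x):\n    return 2 * x",
                             "def double(x):\n    return x + x + 1"]}
    tests = [lambda ns: ns["double"](3) == 6]
    print(search_implementations(candidates, tests))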

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Yao Mu Qinglong Zhang Mengkang Hu Wenhai Wang Mingyu Ding Jun Jin Bin Wang Jifeng Dai Yu Qiao Ping Luo



Research question: How to enable robots to complete long-horizon tasks by understanding and acting on multi-modal information.
Motivation: Existing robotic systems lack the ability to understand and execute from multi-modal information.
Method: Propose EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, built with the large-scale planning dataset EgoCOT and an efficient training approach, equipping robots with multi-modal understanding and execution capabilities.
Results: Experiments show EmbodiedGPT excels at embodied planning, embodied control, visual captioning, and visual question answering; on control tasks in particular, it improves the success rate by 1.6x (Franka Kitchen) and 1.3x (Meta-World) over a BLIP-2 baseline fine-tuned on the Ego4D dataset.

Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.

DeWave: Discrete Encoding of EEG Waves for EEG to Text Translation
Yiqun Duan Charles Zhou Zhen Wang Yu-Kai Wang Chin-teng Lin



Research question: How to translate brain dynamics into natural language, enabling brain-computer interface (BCI) applications.
Motivation: With the rapid development of large language models such as ChatGPT, the need to bridge the gap between brain and language grows increasingly pressing. However, existing methods require eye-tracking fixations or event markers to segment brain dynamics into word-level features, which can limit the practical application of these systems.
Method: Propose DeWave, a novel framework that integrates discrete encoding sequences into open-vocabulary EEG-to-text translation. DeWave uses a quantized variational encoder to derive discrete codex encodings and aligns them with pretrained language models. This discrete representation brings two advantages: (1) text-EEG contrastive alignment training alleviates the order mismatch between eye fixations and spoken words, and (2) the invariant discrete codex minimizes interference from individual differences in EEG waves.
Results: On the ZuCo dataset, DeWave surpasses the previous baseline by 3.06% and 6.34% respectively, reaching 41.35 BLEU-1 and 33.71 Rouge-F. It is also the first work to translate entire EEG signal periods without word-level order markers (e.g., eye fixations), scoring 20.5 BLEU-1 and 29.5 Rouge-1 on ZuCo.

The translation of brain dynamics into natural language is pivotal for brain-computer interfaces (BCIs), a field that has seen substantial growth in recent years. With the swift advancement of large language models, such as ChatGPT, the need to bridge the gap between the brain and languages becomes increasingly pressing. Current methods, however, require eye-tracking fixations or event markers to segment brain dynamics into word-level features, which can restrict the practical application of these systems. These event markers may not be readily available or could be challenging to acquire during real-time inference, and the sequence of eye fixations may not align with the order of spoken words. To tackle these issues, we introduce a novel framework, DeWave, that integrates discrete encoding sequences into open-vocabulary EEG-to-text translation tasks. DeWave uses a quantized variational encoder to derive discrete codex encoding and align it with pre-trained language models. This discrete codex representation brings forth two advantages: 1) it alleviates the order mismatch between eye fixations and spoken words by introducing text-EEG contrastive alignment training, and 2) it minimizes the interference caused by individual differences in EEG waves through an invariant discrete codex. Our model surpasses the previous baseline (40.1 and 31.7) by 3.06% and 6.34%, respectively, achieving 41.35 BLEU-1 and 33.71 Rouge-F on the ZuCo Dataset. Furthermore, this work is the first to facilitate the translation of entire EEG signal periods without the need for word-level order markers (e.g., eye fixations), scoring 20.5 BLEU-1 and 29.5 Rouge-1 on the ZuCo Dataset.

Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
Bowen Tan Yun Zhu Lijuan Liu Eric Xing Zhiting Hu Jindong Chen



Research question: Large language models excel at multi-tasking but demand substantial computational resources, making their training and inference costly and inefficient, and they are hard to adapt to complex downstream applications.
Motivation: To address these issues, this paper proposes a pretrained small scorer, Cappy, designed to improve the performance and efficiency of multi-task large language models.
Method: With only 360 million parameters, Cappy works independently on classification tasks or serves as an auxiliary component of large language models, boosting their performance. Moreover, Cappy can effectively integrate downstream supervision without fine-tuning the large model or accessing its parameters.
Results: Experiments show that, working independently on 11 language understanding tasks, Cappy outperforms large language models several orders of magnitude larger. On complex tasks, Cappy also boosts the performance of the advanced multi-task LLM FLAN-T5 by a large margin.

Large language models (LLMs) such as T0, FLAN, and OPT-IML excel in multi-tasking under a unified instruction-following paradigm, where they also exhibit remarkable generalization abilities to unseen tasks. Despite their impressive performance, these LLMs, with sizes ranging from several billion to hundreds of billions of parameters, demand substantial computational resources, making their training and inference expensive and inefficient. Furthermore, adapting these models to downstream applications, particularly complex tasks, is often unfeasible due to the extensive hardware requirements for finetuning, even when utilizing parameter-efficient approaches such as prompt tuning. Additionally, the most powerful multi-task LLMs, such as OPT-IML-175B and FLAN-PaLM-540B, are not publicly accessible, severely limiting their customization potential. To address these challenges, we introduce a pretrained small scorer, \textit{Cappy}, designed to enhance the performance and efficiency of multi-task LLMs. With merely 360 million parameters, Cappy either functions independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance. Moreover, Cappy enables efficiently integrating downstream supervision without requiring LLM finetuning or access to their parameters. Our experiments demonstrate that, when working independently on 11 language understanding tasks from PromptSource, Cappy outperforms LLMs that are several orders of magnitude larger. Besides, on 45 complex tasks from BIG-Bench, Cappy boosts the performance of the advanced multi-task LLM, FLAN-T5, by a large margin. Furthermore, Cappy is flexible to cooperate with other LLM adaptations, including finetuning and in-context learning, offering additional performance enhancement.

Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task
Maya Okawa Ekdeep Singh Lubana Robert P. Dick Hidenori Tanaka



Research question: Reliable practical use of modern generative models requires compositional abilities: generating and reasoning over entirely novel samples never seen in the training distribution. This study probes the compositional generalization of conditional diffusion models in a synthetic setting and analyzes why it is difficult.
Motivation: Existing vision diffusion models exhibit intriguing compositional generalization abilities, but also fail rather unpredictably. To understand the reasons behind this behavior and the patterns of failure, the authors perform a controlled study.
Method: Vary attributes of the training data and measure the model's ability to generate out-of-distribution samples, studying the compositional generalization of conditional diffusion models in a synthetic setting.
Results: (i) The compositional structure of the data-generating process governs the order in which capabilities and the ability to compose them emerge; (ii) learning individual concepts affects performance on compositional tasks, multiplicatively explaining sudden emergence; and (iii) learning and composing capabilities is difficult under correlations.

Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of real world, reliable use of these models in practical applications mandates they exhibit the ability to compose their capabilities, generating and reasoning over entirely novel samples never seen in the training distribution. Prior work demonstrates recent vision diffusion models exhibit intriguing compositional generalization abilities, but also fail rather unpredictably. What are the reasons underlying this behavior? Which concepts does the model generally find difficult to compose to form novel data? To address these questions, we perform a controlled study of compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model's ability to generate samples out-of-distribution. Our results show that: (i) the compositional structure of the data-generating process governs the order in which capabilities and an ability to compose them emerges; (ii) learning individual concepts impacts performance on compositional tasks, multiplicatively explaining sudden emergence; and (iii) learning and composing capabilities is difficult under correlations. We hope our study inspires further grounded research on understanding capabilities and compositionality in generative models from a data-centric perspective.

Human-in-the-Loop Optimization for Deep Stimulus Encoding in Visual Prostheses
Jacob Granley Tristan Fauvel Matthew Chalk Michael Beyeler



Research question: How to optimize the stimulation parameters of neuroprostheses to restore lost sensory function and enhance human capabilities.
Motivation: The sensations produced by current neuroprosthetic devices often seem unnatural or distorted; differences in individual perception and implant placement lead to significant variation in stimulus response, making personalized stimulus optimization a key challenge.
Method: Propose a novel, practically feasible approach: first train a deep encoder network to produce optimal stimuli for any individual patient, then use a preferential Bayesian optimization strategy to learn the optimal patient-specific parameters for a new patient.
Results: Demonstrated on a state-of-the-art visual prosthesis model, the approach quickly learns a personalized stimulus encoder, dramatically improves the quality of restored vision, and outperforms existing encoding strategies. It is also robust to noisy patient feedback and misspecification of the underlying forward model. Overall, combining deep learning and Bayesian optimization could significantly improve the perceptual experience of patients fitted with visual prostheses and may prove a viable solution for a range of neuroprosthetic technologies.

Neuroprostheses show potential in restoring lost sensory function and enhancing human capabilities, but the sensations produced by current devices often seem unnatural or distorted. Exact placement of implants and differences in individual perception lead to significant variations in stimulus response, making personalized stimulus optimization a key challenge. Bayesian optimization could be used to optimize patient-specific stimulation parameters with limited noisy observations, but is not feasible for high-dimensional stimuli. Alternatively, deep learning models can optimize stimulus encoding strategies, but typically assume perfect knowledge of patient-specific variations. Here we propose a novel, practically feasible approach that overcomes both of these fundamental limitations. First, a deep encoder network is trained to produce optimal stimuli for any individual patient by inverting a forward model mapping electrical stimuli to visual percepts. Second, a preferential Bayesian optimization strategy utilizes this encoder to learn the optimal patient-specific parameters for a new patient, using a minimal number of pairwise comparisons between candidate stimuli. We demonstrate the viability of this approach on a novel, state-of-the-art visual prosthesis model. Our approach quickly learns a personalized stimulus encoder and leads to dramatic improvements in the quality of restored vision, outperforming existing encoding strategies. Further, this approach is robust to noisy patient feedback and misspecifications in the underlying forward model. Overall, our results suggest that combining the strengths of deep learning and Bayesian optimization could significantly improve the perceptual experience of patients fitted with visual prostheses and may prove a viable solution for a range of neuroprosthetic technologies.

Training Chain-of-Thought via Latent-Variable Inference
Matthew Douglas Hoffman Du Phan david dohan Sholto Douglas Tuan Anh Le Aaron T Parisi Pavel Sountsov Charles Sutton Sharad Vikram Rif A. Saurous



Research question: How can "chain-of-thought" (CoT) prompting be used to make large language models solve problems more accurately?
Motivation: Pretrained language models can be improved on specific tasks via supervised fine-tuning, but naively combining CoT with supervised tuning requires supervision of both the correct answers and the detailed rationales that lead to them, which is expensive to produce.
Method: Propose a fine-tuning strategy that maximizes the marginal log-likelihood of generating a correct answer, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; this is addressed with a simple MCMC expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence.
Results: Applied to GSM8K and the tasks in BIG-Bench Hard, this MCMC-EM fine-tuning technique typically improves held-out accuracy more than STaR or prompt tuning with or without CoT.

Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the \emph{marginal} log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
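
The simplest E-step this strategy builds on can be sketched as rejection sampling: keep only rationales that reach the correct answer, then fit them. `sample_cot` and `finetune` are hypothetical placeholders, and the full MCMC machinery and control-variate technique are elided.

    def em_step(model, problems, sample_cot, finetune, k=8):
        accepted = []
        for question, answer in problems:
            for _ in range(k):
                rationale, prediction = sample_cot(model, question)
                if prediction == answer:          # keep only correct-answer chains
                    accepted.append((question, rationale, answer))
        return finetune(model, accepted)          # M-step: fit the kept rationales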

Language Models Meet World Models: Embodied Experiences Enhance Language Models
Jiannan Xiang Tianhua Tao Yi Gu Tianmin Shu Zirui Wang Zichao Yang Zhiting Hu



Research question: Large language models struggle with simple reasoning and planning in physical environments, such as understanding object permanence or planning household activities.
Motivation: This limitation arises because large language models are trained only on written text and lack essential embodied knowledge and skills.
Method: Propose a new paradigm of enhancing language models by fine-tuning them with world models (e.g., the VirtualHome simulator), acquiring diverse embodied knowledge while retaining their general language capabilities.
Results: Experiments show the approach substantially improves base language models on 18 downstream tasks by 64.28% on average. Notably, small models (1.3B, 6B, 13B) enhanced this way match or even outperform much larger models such as ChatGPT.

While large language models (LMs) have shown remarkable capabilities across numerous tasks, they often struggle with simple reasoning and planning in physical environments, such as understanding object permanence or planning household activities. The limitation arises from the fact that LMs are trained only on written text and miss essential embodied knowledge and skills. In this paper, we propose a new paradigm of enhancing LMs by finetuning them with world models, to gain diverse embodied knowledge while retaining their general language capabilities. Our approach deploys an embodied agent in a world model, particularly a simulator of the physical world (VirtualHome), and acquires a diverse set of embodied experiences through both goal-oriented planning and random exploration. These experiences are then used to finetune LMs to teach diverse abilities of reasoning and acting in the physical world, e.g., planning and completing goals, object permanence and tracking, etc. Moreover, it is desirable to preserve the generality of LMs during finetuning, which facilitates generalizing the embodied knowledge across tasks rather than being tied to specific simulations. We thus further introduce the classical elastic weight consolidation (EWC) for selective weight updates, combined with low-rank adapters (LoRA) for training efficiency. Extensive experiments show our approach substantially improves base LMs on 18 downstream tasks by 64.28% on average. In particular, the small LMs (1.3B, 6B, and 13B) enhanced by our approach match or even outperform much larger LMs (e.g., ChatGPT).
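
The EWC term used for selective weight updates can be sketched as a quadratic penalty weighted by a diagonal Fisher estimate; `ref_params` and `fisher` are assumed precomputed on the original language data.

    import torch

    def ewc_penalty(model, ref_params, fisher, lam=1.0):
        # Penalize moving parameters that were important for the original task.
        loss = 0.0
        for name, param in model.named_parameters():
            if name in fisher:
                loss = loss + (fisher[name] * (param - ref_params[name]) ** 2).sum()
        return lam * loss

    # total_loss = embodied_task_loss + ewc_penalty(model, ref_params, fisher)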

Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
Yuheng Zha Yichi Yang Ruichen Li Zhiting Hu



Research question: How to build a more efficient model for a wide range of NLP tasks such as text entailment, similarity, and question answering.
Motivation: While large language models excel at diverse NLP tasks, their generality demands enormous parameter counts and sometimes yields suboptimal performance.
Method: Propose text alignment as an efficient unified model for many crucial tasks, including text entailment, similarity, and question answering: given a pair of texts, the model measures the degree of alignment between their information. The alignment model is instantiated by lightweight fine-tuning of RoBERTa on 5.9M examples from 28 datasets.
Results: Despite its compact size, the model is efficient and strong: (1) on 20+ datasets of the above diverse tasks, it matches or surpasses FLAN-T5 models with roughly 2x or 10x more parameters, and the single unified model also outperforms task-specific models fine-tuned on individual datasets; (2) applied to evaluating the factual consistency of language generation on 23 datasets, it improves over various baselines, including the much larger GPT-3.5 (ChatGPT) and sometimes even GPT-4; (3) the lightweight model can also serve as an add-on component for LLMs such as GPT-3.5 on question answering, improving the average exact match (EM) score by 17.94 and F1 by 15.05 through identifying unanswerable questions.

Large language models (LLMs), typically designed as a function of next-word prediction, have excelled across extensive NLP tasks. Despite the generality, next-word prediction is often not an efficient formulation for many of the tasks, demanding an extreme scale of model parameters (10s or 100s of billions) and sometimes yielding suboptimal performance. In practice, it is often desirable to build more efficient models---despite being less versatile, they still apply to a substantial subset of problems, delivering on par or even superior performance with much smaller model sizes. In this paper, we propose text alignment as an efficient unified model for a wide range of crucial tasks involving text entailment, similarity, question answering (and answerability), factual consistency, and so forth. Given a pair of texts, the model measures the degree of alignment between their information. We instantiate an alignment model through lightweight finetuning of RoBERTa (355M parameters) using 5.9M examples from 28 datasets. Despite its compact size, extensive experiments show the model's efficiency and strong performance: (1) On over 20 datasets of aforementioned diverse tasks, the model matches or surpasses FLAN-T5 models that have around 2x or 10x more parameters; the single unified model also outperforms task-specific models finetuned on individual datasets; (2) When applied to evaluate factual consistency of language generation on 23 datasets, our model improves over various baselines, including the much larger GPT-3.5 (ChatGPT) and sometimes even GPT-4; (3) The lightweight model can also serve as an add-on component for LLMs such as GPT-3.5 in question answering tasks, improving the average exact match (EM) score by 17.94 and F1 score by 15.05 through identifying unanswerable questions.

Self-supervised video pretraining yields robust and more human-aligned visual representations
Nikhil Parthasarathy S. M. Ali Eslami Joao Carreira Olivier J Henaff



Research question: Explore whether video pretraining yields visual representations bearing the hallmarks of human perception: generalization across tasks, robustness to perturbations, and consistency with human judgments.
Motivation: Current visual foundation models are mostly pretrained on static images, yet outside tasks requiring explicit temporal understanding, this paradigm mismatches how humans perceive. The authors question this mismatch and ask whether video pretraining can yield better visual representations.
Method: Propose a novel procedure for curating videos and develop a contrastive framework that learns from the complex transformations within them. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks.
Results: VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially trained ones. Moreover, VITO's predictions align strongly with human judgments, surpassing models trained specifically for that purpose. Together, these results suggest video pretraining may be a simple way to learn unified, robust, human-aligned representations of the visual world.

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO’s predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Miles Turpin Julian Michael Ethan Perez Samuel R. Bowman



Research question: do the chain-of-thought explanations of large language models (LLMs) faithfully reflect the true reasons behind their predictions?
Motivation: to improve the transparency and safety of LLMs, researchers try to understand model predictions through chain-of-thought explanations.
Method: add biasing features to model inputs, e.g., reordering multiple-choice options so that the answer is always "(A)", and observe whether the model's explanations are affected.
Results: when models are biased toward wrong answers, they frequently generate chain-of-thought explanations that rationalize those answers, with accuracy dropping by as much as 36% across a suite of tasks. Models also give stereotype-consistent answers without mentioning the influence of social biases. Chain-of-thought explanations can thus be misleading, increasing our trust in LLMs without guaranteeing their safety; improving transparency and interpretability will require either improving CoT faithfulness or adopting alternative methods.

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs—e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"—which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.

Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression
Allan Raventos Mansheej Paul Feng Chen Surya Ganguli



Research question: can pretrained transformers solve, via in-context learning, new tasks that are fundamentally different from those seen during pretraining?
Motivation: to probe the in-context learning ability of pretrained models, i.e., learning from just a few examples in the prompt without updating any weights.
Method: empirically study pretrained transformers on linear regression tasks while varying the task diversity of the pretraining dataset.
Results: there is a threshold of pretraining task diversity. Below it, pretrained transformers cannot solve unseen regression tasks; beyond it, transformers significantly outperform the Bayesian estimator tied to the pretraining prior and can optimally solve fundamentally new tasks. The study also examines the effects of regularization, model capacity, and task structure, underscoring the critical role of task diversity, alongside data and model scale, in the emergence of in-context learning.

Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally _new_ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL’s performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a _task diversity threshold_ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the _non-diverse pretraining task distribution_ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over _all tasks_, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers _can_ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on it deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effect of regularization, model capacity and task structure and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL.
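
The two reference predictors the transformer is compared against are easy to write down. The sketch below is our toy instantiation, with illustrative dimensions and noise level: the Bayes estimator under a finite pretraining task prior versus ridge regression, the Bayes estimator for a Gaussian prior over all tasks.

```python
# Sketch of the two reference predictors from the paper's setup: the Bayes
# estimator under a *finite* pretraining task prior vs. ridge regression
# (a Gaussian prior over all tasks). Sizes and noise are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, K, n, sigma = 8, 4, 16, 0.5           # dim, #pretraining tasks, #in-context examples, noise
tasks = rng.normal(size=(K, d))          # finite pretraining task set {w_1..w_K}
w_true = tasks[0]                        # test task drawn from the pretraining set
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

# Bayes predictor with a uniform prior over the K pretraining tasks:
# posterior weight of each w_k is proportional to the Gaussian likelihood.
log_lik = -0.5 * ((y[None, :] - tasks @ X.T) ** 2).sum(axis=1) / sigma**2
post = np.exp(log_lik - log_lik.max()); post /= post.sum()
w_bayes = post @ tasks

# Ridge regression, the Bayes predictor for a N(0, I) prior over *all* tasks.
lam = sigma**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("Bayes-over-tasks error:", np.linalg.norm(w_bayes - w_true))
print("Ridge error:          ", np.linalg.norm(w_ridge - w_true))
```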

Not All Neuro-Symbolic Concepts Are Created Equal: Analysis and Mitigation of Reasoning Shortcuts
Emanuele Marconato Stefano Teso Antonio Vergari Andrea Passerini



Research question: this paper addresses reasoning shortcuts in neuro-symbolic predictive models: the models infer labels consistent with prior knowledge by reasoning over high-level concepts extracted from sub-symbolic inputs, yet those concepts can acquire unintended semantics.
Motivation: reasoning shortcuts undermine the performance and interpretability of neuro-symbolic predictors, so a systematic study is needed to explain why they occur and to find possible mitigation strategies.
Method: the paper characterizes reasoning shortcuts as unintended optima of the learning objective and identifies four key conditions behind their occurrence. Based on this, several natural mitigation strategies are derived and analyzed both theoretically and empirically.
Results: the analysis shows that reasoning shortcuts are difficult to deal with, casting doubt on the trustworthiness and interpretability of existing neuro-symbolic solutions.

Neuro-Symbolic (NeSy) predictive models hold the promise of improved compliance with given constraints, systematic generalization, and interpretability, as they allow inferring labels that are consistent with some prior knowledge by reasoning over high-level concepts extracted from sub-symbolic inputs. It was recently shown that NeSy predictors are affected by *reasoning shortcuts*: they can attain high accuracy by leveraging concepts with *unintended semantics*, thus falling short of their promised advantages. Yet, a systematic characterization of reasoning shortcuts and of potential mitigation strategies is missing. This work fills this gap by characterizing them as unintended optima of the learning objective and identifying four key conditions behind their occurrence. Based on this, we derive several natural mitigation strategies, and analyze their efficacy both theoretically and empirically. Our analysis shows reasoning shortcuts are difficult to deal with, casting doubts on the trustworthiness and interpretability of existing NeSy solutions.

Grammar Prompting for Domain-Specific Language Generation with Large Language Models
Bailin Wang Zi Wang Xuezhi Wang Yuan Cao Rif A. Saurous Yoon Kim



Research question: how can large language models generalize from a handful of examples to highly structured language generation tasks?
Motivation: for generating strings in highly structured languages, such as semantic parsing and domain-specific language generation, existing large language models often struggle to learn from just a few exemplars.
Method: propose grammar prompting, which lets a large language model use external knowledge and domain-specific constraints, expressed as a grammar in Backus-Naur Form (BNF), during in-context learning. At inference, the model first predicts a BNF grammar for the test input and then generates the output according to the rules of that grammar.
Results: experiments show that grammar prompting enables large language models to perform competitively on a range of DSL generation tasks, including semantic parsing, PDDL planning, and SMILES-based molecule generation.

Large language models (LLMs) can learn to perform a wide range of natural language tasks from just a handful of in-context examples. However, for generating strings from highly structured languages (e.g., semantic parsing to complex domain-specific languages), it is challenging for the LLM to generalize from just a few exemplars. We propose *grammar prompting*, a simple approach to enable LLMs to use external knowledge and domain-specific constraints, expressed through a grammar in Backus-Naur Form (BNF), during in-context learning. Grammar prompting augments each demonstration example with a specialized grammar that is minimally sufficient for generating the particular output example, where the specialized grammar is a subset of the full DSL grammar. For inference, the LLM first predicts a BNF grammar given a test input, and then generates the output according to the rules of the grammar. Experiments demonstrate that grammar prompting can enable LLMs to perform competitively on a diverse set of DSL generation tasks, including semantic parsing (SMCalFlow, Overnight, GeoQuery), PDDL planning, and SMILES-based molecule generation.
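
The prompt structure is simple to reproduce. Below is a toy sketch of a grammar-augmented few-shot prompt; the mini-DSL, grammar, and demonstration are invented for illustration and are not from the paper's benchmarks.

```python
# Toy sketch of grammar prompting: each demonstration pairs an input with a
# minimally sufficient BNF grammar and the output derived under it. The DSL
# and examples here are invented for illustration.
FULL_GRAMMAR = """
query ::= "find" entity filter
entity ::= "meetings" | "emails"
filter ::= "where" field "=" value
field ::= "date" | "sender"
value ::= STRING
"""

demos = [
    {
        "input": "show my meetings on friday",
        "grammar": 'query ::= "find" "meetings" "where" "date" "=" STRING',
        "output": 'find meetings where date = "friday"',
    },
]

def build_prompt(test_input: str) -> str:
    parts = []
    for d in demos:
        parts.append(f"Input: {d['input']}\nGrammar:\n{d['grammar']}\nOutput: {d['output']}")
    # At inference, the LLM first predicts a specialized grammar for the test
    # input, then generates the output constrained by that grammar.
    parts.append(f"Input: {test_input}\nGrammar:")
    return "\n\n".join(parts)

print(build_prompt("find emails from alice"))
```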

Multimodal Deep Learning Model Unveils Behavioral Dynamics of V1 Activity in Freely Moving Mice
Aiwen Xu Yuchen Hou Cris M. Niell Michael Beyeler



Research question: despite their immense success in modeling macaque visual cortex, deep convolutional neural networks struggle to predict activity in mouse visual cortex, which depends strongly on the animal's behavioral state. Moreover, most computational models predict neural responses to static images presented under head fixation, which differ dramatically from the dynamic, continuous visual stimuli that arise during movement in the real world. It therefore remains unclear how natural visual input and different behavioral variables integrate over time to generate responses in primary visual cortex (V1).
Motivation: to address this, the authors introduce a multimodal recurrent neural network that combines gaze-contingent visual input with behavioral and temporal dynamics to explain V1 activity in freely moving mice.
Method: the model predicts V1 activity during free exploration by integrating gaze-contingent visual input with behavioral and temporal dynamics, and the importance of each component is demonstrated.
Results: the model achieves state-of-the-art prediction of V1 activity during free exploration. Analyses with maximally activating stimuli and saliency maps reveal new insights into cortical function, including the prevalence of mixed selectivity for behavioral variables in mouse V1. Overall, the model offers a comprehensive deep-learning framework for exploring the computational principles underlying V1 neurons in freely moving animals.

Despite their immense success as a model of macaque visual cortex, deep convolutional neural networks (CNNs) have struggled to predict activity in visual cortex of the mouse, which is thought to be strongly dependent on the animal’s behavioral state. Furthermore, most computational models focus on predicting neural responses to static images presented under head fixation, which are dramatically different from the dynamic, continuous visual stimuli that arise during movement in the real world. Consequently, it is still unknown how natural visual input and different behavioral variables may integrate over time to generate responses in primary visual cortex (V1). To address this, we introduce a multimodal recurrent neural network that integrates gaze-contingent visual input with behavioral and temporal dynamics to explain V1 activity in freely moving mice. We show that the model achieves state-of-the-art predictions of V1 activity during free exploration and demonstrate the importance of each component in an extensive ablation study. Analyzing our model using maximally activating stimuli and saliency maps, we reveal new insights into cortical function, including the prevalence of mixed selectivity for behavioral variables in mouse V1. In summary, our model offers a comprehensive deep-learning framework for exploring the computational principles underlying V1 neurons in freely-moving animals engaged in natural behavior.

Are aligned neural networks adversarially aligned?
Nicholas Carlini Milad Nasr Christopher A. Choquette-Choo Matthew Jagielski Irena Gao Pang Wei Koh Daphne Ippolito Florian Tramèr Ludwig Schmidt



Research question: how well do large language models resist producing harmful content when interacting with adversarial users?
Motivation: existing large language models are designed to be "helpful and harmless", but adversarial users can construct inputs that circumvent these alignment attempts.
Method: study adversarial alignment by constructing worst-case inputs (adversarial examples) and examining whether the models remain aligned when interacting with adversarial users.
Results: existing NLP-based attacks are insufficiently powerful to reliably attack aligned text models, yet even when current NLP-based attacks fail, adversarial inputs can be found by brute force. Moreover, multimodal models are easily attacked: adversarially perturbing the input image can induce arbitrary unaligned behavior.

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
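
The image-channel attack rests on standard continuous adversarial optimization. The sketch below runs a generic PGD-style perturbation against a stand-in CNN classifier; the tiny network, target label, and budget are placeholders meant only to illustrate the kind of optimization the paper leverages against multimodal models.

```python
# Generic PGD-style image perturbation against a stand-in network, to
# illustrate the continuous-optimization attack surface that image inputs
# expose. The tiny CNN and target label are placeholders.
import torch, torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
model.eval()

x = torch.rand(1, 3, 32, 32)          # benign input image
target = torch.tensor([3])            # behavior the adversary wants to induce
eps, alpha, steps = 8 / 255, 2 / 255, 20

delta = torch.zeros_like(x, requires_grad=True)
loss_fn = nn.CrossEntropyLoss()
for _ in range(steps):
    loss = loss_fn(model(x + delta), target)   # loss toward the target class
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()     # gradient *descent* on target loss
        delta.clamp_(-eps, eps)                # stay within the L-inf budget
        delta.grad = None

print("prediction after attack:", model(x + delta).argmax().item())
```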

HiBug: On Human-Interpretable Model Debug
Muxi Chen YU LI Qiang Xu



Research question: how to discover and explain the systematic errors a machine learning model makes on particular data subsets.
Motivation: existing bug discovery and interpretation methods require heavy human intervention and annotation, making the process cumbersome with low bug coverage.
Method: propose HiBug, an automated model-debugging framework that uses large pre-trained models such as chatGPT to suggest human-understandable attributes relevant to the targeted computer vision task. With pre-trained vision-language models, the common visual attributes of underperforming data slices can be identified quickly in human-understandable terms.
Results: experiments show that the HiBug framework can effectively discover and explain model bugs and improve model performance.

Machine learning models can frequently produce systematic errors on critical subsets (or slices) of data that share common attributes. Discovering and explaining such model bugs is crucial for reliable model deployment. However, existing bug discovery and interpretation methods usually involve heavy human intervention and annotation, which can be cumbersome and have low bug coverage. In this paper, we propose HiBug, an automated framework for interpretable model debugging. Our approach utilizes large pre-trained models, such as chatGPT, to suggest human-understandable attributes that are related to the targeted computer vision tasks. By leveraging pre-trained vision-language models, we can efficiently identify common visual attributes of underperforming data slices using human-understandable terms. This enables us to uncover rare cases in the training data, identify spurious correlations in the model, and use the interpretable debug results to select or generate new training data for model improvement. Experimental results demonstrate the efficacy of the HiBug framework.

Joint Prompt Optimization of Stacked LLMs using Variational Inference
Alessandro Sordoni Xingdi Yuan Marc-Alexandre Côté Matheus Pereira Adam Trischler Ziang Xiao Arian Hosseini Friederike Niedtner Nicolas Le Roux



Research question: how to build and optimize Deep Language Networks (DLNs) from large language models (LLMs) to improve multi-task and natural language understanding performance.
Motivation: stacking two LLMs and feeding the output of one layer into the next yields a deep language network whose performance may exceed that of a single layer.
Method: first perform effective prompt optimization for a 1-layer language network (DLN-1), then extend to 2-layer networks (DLN-2), where two prompts must be learned. The output of the first layer is treated as a latent variable that requires inference, and the prompts are learned as the parameters of the generative distribution.
Results: experiments show that DLN-1 performs well on multiple reasoning and natural language understanding tasks. DLN-2 outperforms a single layer, showing promise of reaching performance comparable to GPT-4 even when each LLM in the network is smaller and less powerful.

Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.

Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts
Eduard Tulchinskii Kristian Kuznetsov Kushnareva Laida Daniil Cherniavskii Sergey Nikolenko Evgeny Burnaev Serguei Barannikov Irina Piontkovskaya



Research question: how to distinguish human-written from AI-generated text, especially as the quality of AI-generated content keeps improving.
Motivation: the rapidly growing quality and volume of AI-generated content make distinguishing human from AI text increasingly difficult, with potentially harmful consequences for society.
Method: propose a detector based on the average intrinsic dimension of a text's embedded representation. Computing the intrinsic dimension of the embeddings of a given text sample shows that fluent human texts average between 7 and 9, while AI-generated texts average about 1.5 lower.
Results: the method is stable across text domains, generator models, and levels of human writer proficiency, and significantly outperforms state-of-the-art detectors in model-agnostic and cross-domain settings.

Rapidly increasing quality of AI-generated content makes it difficult to distinguish between human and AI-generated texts, which may lead to undesirable consequences for society. Therefore, it becomes increasingly important to study the properties of human texts that are invariant over text domains and various proficiency of human writers, can be easily calculated for any language, and can robustly separate natural and AI-generated texts regardless of the generation model and sampling method. In this work, we propose such an invariant of human texts, namely the intrinsic dimensionality of the manifold underlying the set of embeddings of a given text sample. We show that the average intrinsic dimensionality of fluent texts in natural language hovers around 9 for several alphabet-based languages and around 7 for Chinese, while the average intrinsic dimensionality of AI-generated texts for each language is approximately 1.5 lower, with a clear statistical separation between human-generated and AI-generated distributions. This property allows us to build a score-based artificial text detector. The proposed detector's accuracy is stable over text domains, generator models, and human writer proficiency levels, outperforming SOTA detectors in model-agnostic and cross-domain scenarios by a significant margin.
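
For intuition, here is a minimal intrinsic-dimension estimator over embedding vectors. The paper itself uses a persistent-homology-based estimator; the sketch below instead uses the simpler Levina-Bickel k-NN maximum-likelihood estimator, so it conveys the idea rather than the exact method.

```python
# Maximum-likelihood intrinsic dimension estimator (Levina & Bickel, 2005)
# over a set of embedding vectors. This is a simpler stand-in for the
# paper's persistent homology dimension; it conveys the idea of scoring a
# text by the dimensionality of its embedding manifold.
import numpy as np

def mle_intrinsic_dim(emb: np.ndarray, k: int = 10) -> float:
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T   # squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-distances
    knn = np.sqrt(np.maximum(np.sort(d2, axis=1)[:, :k], 0.0))
    logs = np.log(knn[:, -1:] / knn[:, :-1])             # log(T_k / T_j), j < k
    m = (k - 2) / logs.sum(axis=1)                       # unbiased per-point MLE
    return float(m.mean())

rng = np.random.default_rng(0)
# Points on a ~32-dimensional linear manifold embedded in R^768.
emb = rng.standard_normal((500, 32)) @ rng.standard_normal((32, 768))
print(mle_intrinsic_dim(emb))  # should land near the true dimension, 32
```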

Reflexion: language agents with verbal reinforcement learning
Noah Shinn Federico Cassano Ashwin Gopinath Karthik R Narasimhan Shunyu Yao



Research question: how to let large language models (LLMs) learn from trial and error more effectively, without extensive training samples or expensive model fine-tuning.
Motivation: traditional reinforcement learning methods require many training samples and costly model fine-tuning, which is challenging for language models.
Method: propose Reflexion, a new framework that reinforces language models through linguistic feedback. Concretely, a Reflexion agent verbally reflects on task feedback signals and keeps its reflective text in an episodic memory buffer to guide subsequent decision-making.
Results: Reflexion achieves significant improvements across diverse tasks (sequential decision-making, coding, language reasoning). For example, on the HumanEval coding benchmark it reaches 91% pass@1 accuracy, surpassing the previous state-of-the-art GPT-4 at 80%.

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose *Reflexion*, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance. We release all code, demos, and datasets at https://github.com/noahshinn024/reflexion.
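
The control flow of such an agent is compact. Below is a skeleton of the reflect-and-retry loop with a stubbed LLM and environment; `llm` and `run_task` are placeholders, not the released implementation.

```python
# Skeleton of the Reflexion loop: act, receive feedback, verbally reflect,
# and retry with reflections kept in an episodic memory buffer. `llm` and
# `run_task` are stubs standing in for a real model and environment.
def llm(prompt: str) -> str:
    return "stub response"            # placeholder for a real LLM call

def run_task(attempt: str):
    return False, "unit test failed"  # (success, feedback) from the environment

memory: list[str] = []                # episodic buffer of reflective texts

for trial in range(3):
    context = "\n".join(memory)
    attempt = llm(f"Reflections so far:\n{context}\nSolve the task:")
    success, feedback = run_task(attempt)
    if success:
        break
    # Verbal reinforcement: turn the feedback signal into a reflection that
    # conditions the next trial, instead of updating any weights.
    reflection = llm(f"Attempt: {attempt}\nFeedback: {feedback}\n"
                     "Reflect on what went wrong and how to improve:")
    memory.append(reflection)
```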

Information Geometry of the Retinal Representation Manifold
Xuehao Ding Dongsoo Lee Joshua Brendan Melander George Sivulka Surya Ganguli Stephen Baccus



Research question: the brain's ability to discriminate among visual stimuli is constrained by their retinal representations.
Motivation: previous studies of visual discriminability were limited to low-dimensional artificial stimuli or to purely theoretical considerations without a realistic encoding model.
Method: propose a new framework that uses information geometry to understand the stimulus discriminability achieved by retinal representations of naturalistic stimuli. A stochastic encoding model of a population of salamander retinal ganglion cells, based on a three-layer convolutional neural network, models the conditional probability distribution of responses to natural scenes.
Results: the most discriminable stimulus varies substantially across stimuli, allowing study of the relationship between the current stimulus and the most discriminable one. Under natural scenes, retinal noise correlations are found to be information-limiting rather than increasing information transmission, as had previously been speculated. Population coding saturates less than single cells, and as a function of firing rate, Fisher information varies less than sensitivity. The authors conclude that under natural scenes, population coding benefits from complementary coding and helps equalize the information carried by different firing rates, which may facilitate decoding of the stimulus under principles of information maximization.

The ability for the brain to discriminate among visual stimuli is constrained by their retinal representations. Previous studies of visual discriminability have been limited to either low-dimensional artificial stimuli or pure theoretical considerations without a realistic encoding model. Here we propose a novel framework for understanding stimulus discriminability achieved by retinal representations of naturalistic stimuli with the method of information geometry. To model the joint probability distribution of neural responses conditioned on the stimulus, we created a stochastic encoding model of a population of salamander retinal ganglion cells based on a three-layer convolutional neural network model. This model not only accurately captured the mean response to natural scenes but also a variety of second-order statistics. With the model and the proposed theory, we computed the Fisher information metric over stimuli to study the most discriminable stimulus directions. We found that the most discriminable stimulus varied substantially across stimuli, allowing an examination of the relationship between the most discriminable stimulus and the current stimulus. By examining responses generated by the most discriminable stimuli we further found that the most discriminative response mode is often aligned with the most stochastic mode. This finding carries the important implication that under natural scenes, retinal noise correlations are information-limiting rather than increasing information transmission as has been previously speculated. We additionally observed that sensitivity saturates less in the population than for single cells and that as a function of firing rate, Fisher information varies less than sensitivity. We conclude that under natural scenes, population coding benefits from complementary coding and helps to equalize the information carried by different firing rates, which may facilitate decoding of the stimulus under principles of information maximization.

MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Allen Nie Yuhui Zhang Atharva Amdekar Christopher J Piech Tatsunori Hashimoto Tobias Gerstenberg



Research question: whether the causal and moral judgments that large language models (LLMs) make about text-based scenarios align with those of human participants.
Motivation: although the latest LLMs' judgments are close to humans' at the aggregate level, statistical analyses reveal that how they weigh different factors differs significantly from human participants.
Method: collect stories from 24 cognitive science papers, develop a system to annotate each story with the factors investigated, and use this dataset to test whether LLMs make causal and moral judgments consistent with human participants.
Results: while the implicit tendencies of LLMs partly match human intuitions, clear differences remain in how the factors are weighted.

Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable. We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. On the aggregate level, alignment has improved with more recent LLMs. However, using statistical analyses, we find that LLMs weigh the different factors quite differently from human participants. These results show how curated challenge datasets combined with insights from cognitive science can help us go beyond comparisons based merely on aggregate metrics: we uncover LLMs' implicit tendencies and show to what extent these align with human intuitions.

On the Exploitability of Instruction Tuning
Manli Shu Jiongxiao Wang Chen Zhu Jonas Geiping Chaowei Xiao Tom Goldstein



Research question: this paper investigates how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data to change a model's behavior.
Motivation: instruction tuning is an effective technique for aligning large language models with human intent, but it is poorly understood how poisoned training data can subvert it and how data quality affects the behavior of instruction-tuned models.
Method: propose AutoPoison, an automated data-poisoning pipeline that naturally and coherently incorporates versatile attack goals into the poisoned data, with the help of an oracle LLM, to change model behavior.
Results: experiments show that AutoPoison allows an attacker to change a model's behavior by poisoning only a small fraction of the data, while keeping the poisoned examples highly stealthy. The authors hope this work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of responsible LLM deployment.

Instruction tuning is an effective technique to align large language models (LLMs) with human intent. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior. For example, an adversary can achieve content injection by injecting training examples that mention target content and eliciting such behavior from downstream models. To achieve this goal, we propose *AutoPoison*, an automated data poisoning pipeline. It naturally and coherently incorporates versatile attack goals into poisoned data with the help of an oracle LLM. We showcase two example attacks: content injection and over-refusal attacks, each aiming to induce a specific exploitable behavior. We quantify and benchmark the strength and the stealthiness of our data poisoning scheme. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data while maintaining a high level of stealthiness in the poisoned examples. We hope our work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for responsible deployments of LLMs.

Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
Lalit Pandey Samantha Marie Waters Wood Justin Newell Wood



Research question: are vision transformers really more "data hungry" than biological learning systems?
Motivation: researchers question the value of vision transformers as models of biological learning because vision transformers are thought to need more data than brains.
Method: build virtual animal chambers in a video game engine to simulate impoverished visual environments, record the first-person images an agent acquires while moving through them, and use those images to train self-supervised vision transformers that leverage time as a teaching signal.
Results: when trained through the eyes of newborn chicks, the vision transformers solved the same view-invariant object recognition tasks as the chicks. Vision transformers are therefore not more data hungry than newborn chicks: both learned view-invariant object representations in impoverished visual environments. The flexible, generic attention-based learning mechanism of vision transformers, combined with the embodied data streams available to newborn animals, appears sufficient to drive the development of animal-like object recognition.

Vision transformers (ViTs) are top-performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more “data hungry” than brains, with ViTs requiring more training data than brains to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled-rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self-supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained “through the eyes” of newborn chicks, the ViTs solved the same view-invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn chicks: both learned view-invariant object representations in impoverished visual environments. The flexible and generic attention-based learning mechanism in ViTs—combined with the embodied data streams available to newborn animals—appears sufficient to drive the development of animal-like object recognition.

On Transfer of Adversarial Robustness from Pretraining to Downstream Tasks
Laura Fee Nern Harsh Raj Maurice Georgi Yash Sharma



Research question: this study examines how robustness transfers from pretrained models to downstream tasks.
Motivation: although pretraining has been shown to improve practical model performance, how the robustness properties of pretraining transfer to downstream tasks remains poorly understood.
Method: through theoretical analysis and validation in practical applications, show that the robustness of a linear predictor on downstream tasks is constrained by the robustness of its underlying representation.
Results: the results offer an initial step toward characterizing what a representation function needs for reliable post-adaptation performance, and can be used to calibrate expectations of downstream robustness and to optimize transfer learning.

As large-scale training regimes have gained popularity, the use of pretrained models for downstream tasks has become common practice in machine learning. While pretraining has been shown to enhance the performance of models in practice, the transfer of robustness properties from pretraining to downstream tasks remains poorly understood. In this study, we demonstrate that the robustness of a linear predictor on downstream tasks can be constrained by the robustness of its underlying representation, regardless of the protocol used for pretraining. We prove (i) a bound on the loss that holds independent of any downstream task, as well as (ii) a criterion for robust classification in particular. We validate our theoretical results in practical applications, show how our results can be used for calibrating expectations of downstream robustness, and when our results are useful for optimal transfer learning. Taken together, our results offer an initial step towards characterizing the requirements of the representation function for reliable post-adaptation performance.

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Hassan Akbari Dan Kondratyuk Yin Cui Rachel Hornung Huisheng Wang Hartwig Adam



Research question: how to effectively integrate inputs from multiple modalities, such as image, video, text, and audio, for multi-task training and modeling.
Motivation: current multimodal models usually require modality-specific components for each modality, which limits scalability and efficiency.
Method: propose Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP feeds inputs from diverse modalities into a single Transformer encoder and combines Alternating Gradient Descent (AGD) with Mixture-of-Experts (MoE) for efficient model and task scaling.
Results: extensive empirical studies show that alternating gradient-descent updates across diverse modalities, loss functions, and tasks, together with MoE sparsification on a single modality-agnostic encoder, significantly improves performance. IMP is competitive on a wide range of downstream tasks, including video classification, image classification, and image-text and video-text retrieval. Notably, a sparse IMP-MoE-L model trained with a focus on video tasks sets a new state of the art in zero-shot video classification while using only 15% of the total training compute cost of prior work.

We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model & task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focusing on video tasks that achieves new state-of-the-art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state-of-the-art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.

Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery
Yuxin Wen Neel Jain John Kirchenbauer Micah Goldblum Jonas Geiping Tom Goldstein



Research question: how to control modern generative models by optimizing prompts.
Motivation: existing hard prompts must be hand-crafted, while soft prompts, though discoverable with powerful optimization methods, cannot be easily edited, reused across models, or plugged into text-based interfaces.
Method: propose an easy-to-use approach that automatically optimizes hard text prompts through efficient gradient-based optimization.
Results: the method applies to both text-to-image and text-only settings, letting API users easily generate, discover, and mix image concepts without knowing how to prompt the model. Moreover, it can bypass the token-level content filters imposed by Midjourney by optimizing through an open-source text encoder.

The strength of modern generative models lies in their ability to be controlled through prompts. Hard prompts comprise interpretable words and tokens, and are typically hand-crafted by humans. Soft prompts, on the other hand, consist of continuous feature vectors. These can be discovered using powerful optimization methods, but they cannot be easily edited, re-used across models, or plugged into a text-based interface. We describe an easy-to-use approach to automatically optimize hard text prompts through efficient gradient-based optimization. Our approach can be readily applied to text-to-image and text-only applications alike. This method allows API users to easily generate, discover, and mix and match image concepts without prior knowledge of how to prompt the model. Furthermore, using our method, we can bypass token-level content filters imposed by Midjourney by optimizing through the open-sourced text encoder.
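
A common way to realize this, and one plausible reading of gradient-based hard-prompt optimization (with toy sizes and a stand-in loss of our choosing), is to keep a continuous prompt, project it onto the nearest real token embeddings for the forward pass, and push the gradient back into the continuous copy:

```python
# Core projection trick behind gradient-based hard-prompt optimization:
# keep a continuous prompt, project it to the nearest token embeddings for
# the forward pass, and apply the resulting gradient to the continuous copy.
# The vocabulary, loss, and sizes are toy stand-ins.
import torch

torch.manual_seed(0)
vocab = torch.randn(100, 16)                     # frozen token-embedding table
prompt = torch.randn(5, 16, requires_grad=True)  # continuous prompt, 5 tokens
target = torch.randn(16)                         # stand-in for a task objective
opt = torch.optim.Adam([prompt], lr=0.1)

def project(p):
    ids = torch.cdist(p, vocab).argmin(dim=1)    # nearest real token per position
    return ids, vocab[ids]

for step in range(100):
    ids, hard = project(prompt.detach())
    # Straight-through estimator: forward with the hard (discrete) embeddings,
    # backward into the continuous `prompt`.
    hard = hard + (prompt - prompt.detach())
    loss = ((hard.mean(0) - target) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print("discrete prompt token ids:", project(prompt.detach())[0].tolist())
```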

Learning Reliable Logical Rules with SATNet
Zhaoyu Li Jinpei Guo Yuhe Jiang Xujie Si



Research question: how to combine logical reasoning with deep learning to produce interpretable and verifiable logical rules.
Motivation: current deep learning models, although powerful, have internal logic that is hard to understand and cannot directly produce human-readable rules.
Method: propose a new framework that generates interpretable and verifiable logical rules through differentiable learning, without relying on pre-specified logical structures. The approach builds on SATNet, a differentiable MaxSAT solver that learns underlying rules from input-output examples.
Results: experiments show the decoded rules are highly reliable: applying exact solvers to them reaches 100% accuracy, whereas the original SATNet fails to give correct solutions in many cases. The decoded logical rules are also formally verified to be functionally equivalent to the ground-truth rules.

Bridging logical reasoning and deep learning is crucial for advanced AI systems. In this work, we present a new framework that addresses this goal by generating interpretable and verifiable logical rules through differentiable learning, without relying on pre-specified logical structures. Our approach builds upon SATNet, a differentiable MaxSAT solver that learns the underlying rules from input-output examples. Despite its efficacy, the learned weights in SATNet are not straightforwardly interpretable, failing to produce human-readable rules. To address this, we propose a novel specification method called "maximum equality", which enables the interchangeability between the learned weights of SATNet and a set of propositional logical rules in weighted MaxSAT form. With the decoded weighted MaxSAT formula, we further introduce several effective verification techniques to validate it against the ground truth rules. Experiments on stream transformations and Sudoku problems show that our decoded rules are highly reliable: using exact solvers on them could achieve 100% accuracy, whereas the original SATNet fails to give correct solutions in many cases. Furthermore, we formally verify that our decoded logical rules are functionally equivalent to the ground truth ones.

SatLM: Satisfiability-Aided Language Models Using Declarative Prompting
Xi Ye Qiaochu Chen Isil Dillig Greg Durrett



Research question: how to improve the reasoning ability of large language models, especially on constraint-solving problems that require sophisticated planning and search.
Motivation: current large language models perform poorly on constraint-solving problems that demand complex planning and search.
Method: propose a new satisfiability-aided language modeling (SatLM) approach: use a large language model to generate a declarative task specification, then rely on an off-the-shelf automated theorem prover to derive the final answer.
Results: evaluated on 8 different datasets, SatLM consistently outperforms program-aided language models, and on several challenging subsets it even surpasses prior methods.

Prior work has combined chain-of-thought prompting in large language models (LLMs) with programmatic representations to perform effective and transparent reasoning. While such an approach works well for tasks that only require forward reasoning (e.g., straightforward arithmetic), it is less effective for constraint solving problems that require more sophisticated planning and search. In this paper, we propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of LLMs. We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer. This approach has two key advantages. The declarative specification is closer to the problem description than the reasoning steps are, so the LLM can parse it out of the description more accurately. Furthermore, by offloading the actual reasoning task to an automated theorem prover, our approach can guarantee the correctness of the answer with respect to the parsed specification and avoid planning errors in the solving process. We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm. In particular, SatLM outperforms program-aided LMs by 23% on a challenging subset of the GSM arithmetic reasoning dataset; SatLM also achieves a new SoTA on LSAT and BoardgameQA, surpassing previous models that are trained on the respective training sets.
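
The solver half of the pipeline can be shown with an off-the-shelf solver. In the sketch below, the "parsed specification" is hand-written to stand in for LLM output; it uses the z3 Python bindings (pip install z3-solver).

```python
# The solver half of the SatLM pipeline: the LLM emits a declarative
# specification and an off-the-shelf solver derives the answer. Here the
# "parsed specification" is hand-written to stand in for LLM output.
from z3 import Ints, Solver, sat

alice, bob = Ints("alice bob")
s = Solver()
# e.g. "Alice and Bob have 10 apples together; Alice has 2 more than Bob."
s.add(alice + bob == 10)
s.add(alice == bob + 2)

if s.check() == sat:
    m = s.model()
    print("alice =", m[alice], "bob =", m[bob])   # alice = 6, bob = 4
```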

Learning to Modulate pre-trained Models in RL
Thomas Schmied Markus Hofmarcher Fabian Paischer Razvan Pascanu Sepp Hochreiter



Research question: reinforcement learning (RL) agents adapt poorly to new tasks, and fine-tuning pretrained models on new tasks causes catastrophic forgetting.
Motivation: to address catastrophic forgetting when fine-tuning pretrained models on new tasks, this paper proposes a new learning method, Learning-to-Modulate (L2M).
Method: first jointly pretrain a model on the Meta-World and DMControl benchmark suites, then evaluate and compare a variety of fine-tuning methods common in natural language processing. Finally, propose L2M, which modulates the information flow of the frozen pretrained model via a learnable modulation pool to avoid degrading learned skills.
Results: the method achieves state-of-the-art performance on the Continual-World benchmark while retaining performance on the pretraining tasks.

Reinforcement Learning (RL) has been successful in various domains like robotics, game playing, and simulation. While RL agents have shown impressive capabilities in their specific tasks, they insufficiently adapt to new tasks. In supervised learning, this adaptation problem is addressed by large-scale pre-training followed by fine-tuning to new down-stream tasks. Recently, pre-training on multiple tasks has been gaining traction in RL. However, fine-tuning a pre-trained model often suffers from catastrophic forgetting. That is, the performance on the pre-training tasks deteriorates when fine-tuning on new tasks. To investigate the catastrophic forgetting phenomenon, we first jointly pre-train a model on datasets from two benchmark suites, namely Meta-World and DMControl. Then, we evaluate and compare a variety of fine-tuning methods prevalent in natural language processing, both in terms of performance on new tasks, and how well performance on pre-training tasks is retained. Our study shows that with most fine-tuning approaches, the performance on pre-training tasks deteriorates significantly. Therefore, we propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation of learned skills by modulating the information flow of the frozen pre-trained model via a learnable modulation pool. Our method achieves state-of-the-art performance on the Continual-World benchmark, while retaining performance on the pre-training tasks. Finally, to aid future research in this area, we release a dataset encompassing 50 Meta-World and 16 DMControl tasks.

Enhancing Robot Program Synthesis Through Environmental Context
Tianyi Chen Qidi Wang Zhen Dong Liwei Shen Xin Peng



Research question: how to synthesize programs from partially observed environments.
Motivation: existing robot programming approaches require a comprehensive understanding of the entire environment, which is often hard to achieve in practice.
Method: propose a framework that synthesizes a program by rectifying potentially erroneous code segments with the aid of partially observed environments. It first learns an environment embedding space that implicitly evaluates the impact of each program token given the precondition, then aggregates environmental and syntactic information flow through a graph structure to provide smooth program-rectification guidance.
Results: extensive experimental evaluations and ablation studies on the partially observed VizDoom domain show superior generalization across various tasks and greater robustness to noise.

Program synthesis aims to automatically generate an executable program that conforms to the given specification. Recent advancements have demonstrated that deep neural methodologies and large-scale pretrained language models are highly proficient in capturing program semantics. For robot programming, prior works have facilitated program synthesis by incorporating global environments. However, the assumption of acquiring a comprehensive understanding of the entire environment is often excessively challenging to achieve. In this work, we present a framework that learns to synthesize a program by rectifying potentially erroneous code segments, with the aid of partially observed environments. To tackle the issue of inadequate attention to partial observations, we propose to first learn an environment embedding space that can implicitly evaluate the impacts of each program token based on the precondition. Furthermore, by employing a graph structure, the model can aggregate both environmental and syntactic information flow and furnish smooth program rectification guidance. Extensive experimental evaluations and ablation studies on the partially observed VizDoom domain confirm that our method offers superior generalization capability across various tasks and greater robustness when encountering noise.

CoLLAT: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning
Amila Silva Spencer Whitehead Chris Lengerich Hugh James Leather



Research question: conventional audio classification models cannot predict classes unseen during training, which leads to poor performance.
Motivation: to address this, recent work explores contrastive language-audio pretraining, learning audio understanding models under natural language supervision from a pretrained language model.
Method: propose the CoLLAT framework, which learns a locked language model through a novel audio-to-text grounding pretraining objective to achieve fine-grained audio understanding.
Results: experiments show that CoLLAT achieves state-of-the-art audio understanding and unlocks audio guidance for applications built on top of pretrained language models.

Humans can easily understand various audio concepts, but conventional audio classification models fail due to their inability to predict unseen classes during training. To address this challenge, recent literature has explored contrastive language-audio pretraining to learn an audio understanding model using natural language supervision from a pretrained language model. However, despite their reasonable zero-shot performance in audio understanding, these models typically fail to achieve optimal performance while preserving the text understanding capabilities of the pretrained language model. They also perform poorly when comprehending audio clips with multiple audio concepts. To bridge these gaps, we propose CoLLAT: Contrastive Locked Language and Audio Tuning. This is a framework to effectively learn an audio understanding model with a locked language model, which is learned using a novel pretraining objective for audio-to-text grounding to yield fine-grained audio understanding. Our extensive experiments, which include several downstream applications such as audio classification, cross-modal retrieval, and audio-guided image generation, demonstrate that CoLLAT yields state-of-the-art performance for audio understanding. Additionally, it unlocks audio guidance to applications built on top of pretrained language models.

Large Language Models Are Zero-Shot Time Series Forecasters
Nate Gruver Marc Anton Finzi Shikai Qiu Andrew Gordon Wilson



Research question: how can large language models be used for time series forecasting?
Motivation: encoding a time series as a string of numerical digits frames time series forecasting as next-token prediction in text.
Method: use large language models such as GPT-3 and LLaMA-2 for zero-shot time series extrapolation, and propose procedures for effectively tokenizing time series data and converting discrete token distributions into densities over continuous values.
Results: large language models handle many time series (e.g., repeated seasonal trends) well, thanks to their ability to naturally represent multimodal distributions and their biases toward simplicity and repetition. They can also handle missing data via non-numerical text, accommodate textual side information, and answer questions that help explain predictions. However, larger models are not always better: GPT-4 can perform worse than GPT-3 because of how it tokenizes numbers and its poor uncertainty calibration.

By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text. Developing this approach, we find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks. To facilitate this performance, we propose procedures for effectively tokenizing time series data and converting discrete distributions over tokens into highly flexible densities over continuous values. We argue the success of LLMs for time series stems from their ability to naturally represent multimodal distributions, in conjunction with biases for simplicity, and repetition, which align with the salient features in many time series, such as repeated seasonal trends. We also show how LLMs can naturally handle missing data without imputation through non-numerical text, accommodate textual side information, and answer questions to help explain predictions. While we find that increasing model size generally improves performance on time series, we show GPT-4 can perform worse than GPT-3 because of how it tokenizes numbers, and poor uncertainty calibration, which is likely the result of alignment interventions such as RLHF.
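
The serialization step is the crux. The sketch below space-separates digits so that GPT-3-style BPE tokenizers see one digit per token; the rescaling and precision choices are illustrative, not the paper's exact recipe, and negative values are left unhandled for brevity.

```python
# Sketch of serializing a time series for an LLM, in the spirit of the
# paper: rescale, fix the precision, and space-separate digits so that
# GPT-3-style BPE tokenizers see one digit per token. The scaling rule is
# an illustrative choice; negative values would need a sign token.
def encode(series, prec=2):
    scale = max(abs(v) for v in series) or 1.0
    vals = [round(v / scale * 10**prec) for v in series]
    return " , ".join(" ".join(str(v)) for v in vals), scale

def decode(text, scale, prec=2):
    return [int("".join(tok.split())) * scale / 10**prec
            for tok in text.split(",")]

text, scale = encode([0.64, 0.70, 0.81, 0.93])
print(text)                 # e.g. "6 9 , 7 5 , 8 7 , 1 0 0"
print(decode(text, scale))  # values recovered up to the chosen precision
```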

Neuro-symbolic Learning Yielding Logical Constraints
Zenan Li Yunpeng Huang Zhaoyu Li Yuan Yao Jingwei Xu Taolue Chen Xiaoxing Ma Jian Lu



Research question: this paper tackles the unsolved challenge of end-to-end learning for neuro-symbolic systems.
Motivation: current neuro-symbolic systems need better interaction among neural network training, symbol grounding, and logical constraint synthesis.
Method: propose a natural framework that fuses neural network training, symbol grounding, and logical constraint synthesis, introducing a difference-of-convex programming technique to relax logical constraints while preserving their precision.
Results: theoretical analyses and empirical evaluations substantiate the effectiveness of the framework.

Neuro-symbolic systems combine the abilities of neural perception and logical reasoning. However, end-to-end learning of neuro-symbolic systems is still an unsolved challenge. This paper proposes a natural framework that fuses neural network training, symbol grounding, and logical constraint synthesis into a coherent and efficient end-to-end learning process. The capability of this framework comes from the improved interactions between the neural and the symbolic parts of the system in both the training and inference stages. Technically, to bridge the gap between the continuous neural network and the discrete logical constraint, we introduce a difference-of-convex programming technique to relax the logical constraints while maintaining their precision. We also employ cardinality constraints as the language for logical constraint learning and incorporate a trust region method to avoid the degeneracy of logical constraint in learning. Both theoretical analyses and empirical evaluations substantiate the effectiveness of the proposed framework.

Focused Transformer: Contrastive Training for Context Scaling
Szymon Tworkowski Konrad Staniszewski Mikołaj Pacek Yuhuai Wu Henryk Michalewski Piotr Miłoś



Research question: the effective context length of large language models is limited; how can their context handling be improved?
Motivation: as the number of documents increases, the ratio of relevant to irrelevant keys shrinks, causing the model to attend more to irrelevant keys, the distraction issue.
Method: propose the Focused Transformer (FoT), a technique that uses a contrastive-learning-inspired training process to enhance the structure of the (key, value) space and thereby extend the context length.
Results: fine-tuning 3B and 7B OpenLLaMA models yields LongLLaMA, which shows superior performance on tasks requiring long context and can manage a 256k context length for passkey retrieval.

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an additional context, which comprises (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a 256k context length for passkey retrieval.
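
The training signal can be pictured as a standard InfoNCE objective over keys: a query should score the key from its own document above in-batch keys from other documents. The toy sketch below is our simplified rendering, not the FoT training code.

```python
# Shape of the contrastive objective: queries should match keys from their
# own document over keys from other documents, which structures the
# (key, value) space against the distraction issue. Sizes are toy.
import torch, torch.nn.functional as F

torch.manual_seed(0)
B, d = 32, 64
queries = torch.randn(B, d)                    # one query per document
pos_keys = queries + 0.1 * torch.randn(B, d)   # keys from the same document
# Every other in-batch key acts as a "distracting" negative.
logits = queries @ pos_keys.T / d**0.5         # (B, B) similarity matrix
labels = torch.arange(B)                       # the diagonal holds the positives
loss = F.cross_entropy(logits, labels)
print(loss.item())
```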

Frequency-Enhanced Data Augmentation for Vision-and-Language Navigation
Keji He Chenyang Si Zhihe Lu Yan Huang Liang Wang Xinchao Wang



Research question: how to improve performance on vision-and-language navigation (VLN) tasks driven by natural language instructions.
Motivation: existing VLN methods focus mainly on exploring the spatial domain; this work proposes a shift to the Fourier domain to enhance visual-textual matching and improve the agent's ability to understand and execute navigation instructions.
Method: first examine the importance of high-frequency information in VLN, then propose a sophisticated and versatile Frequency-enhanced Data Augmentation (FDA) technique to strengthen the VLN model's ability to capture critical high-frequency information.
Results: experiments on R2R, RxR, CVDN, and REVERIE show that FDA integrates readily with existing VLN methods, improving performance without adding extra parameters while keeping models simple and efficient.

Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through complex environments based on natural language instructions. In contrast to conventional approaches, which primarily focus on the spatial domain exploration, we propose a paradigm shift toward the Fourier domain. This alternative perspective aims to enhance visual-textual matching, ultimately improving the agent's ability to understand and execute navigation tasks based on the given instructions. In this study, we first explore the significance of high-frequency information in VLN and provide evidence that it is instrumental in bolstering visual-textual matching processes. Building upon this insight, we further propose a sophisticated and versatile Frequency-enhanced Data Augmentation (FDA) technique to improve the VLN model's capability of capturing critical high-frequency information. Specifically, this approach requires the agent to navigate in environments where only a subset of high-frequency visual information corresponds with the provided textual instructions, ultimately fostering the agent's ability to selectively discern and capture pertinent high-frequency features according to the given instructions. Promising results on R2R, RxR, CVDN and REVERIE demonstrate that our FDA can be readily integrated with existing VLN approaches, improving performance without adding extra parameters, and keeping models simple and efficient. The code is available at https://github.com/hekj/FDA.

Human-Guided Complexity-Controlled Abstractions
Andi Peng Mycal Tucker Eoin M. Kenny Noga Zaslavsky Pulkit Agrawal Julie Shah



Research question: train neural networks to generate a spectrum of discrete representations whose complexity is controlled by tuning the entropy of the representation distribution.
Motivation: inspired by human learning, neural networks are trained to produce a range of discrete representations while controlling their complexity (roughly, the number of bits used to encode inputs).
Method: control representation complexity by tuning the entropy of the distribution over representations, and run fine-tuning experiments on new tasks using only a small number of labeled examples.
Results: tuning the representation to a task-appropriate complexity level supports the best fine-tuning performance, and in a human-participant study, users could identify the appropriate complexity level for a downstream task from visualizations of the discrete representations.

Neural networks often learn task-specific latent representations that fail to generalize to novel settings or tasks. Conversely, humans learn discrete representations (i.e., concepts or words) at a variety of abstraction levels (e.g., "bird" vs. "sparrow'") and use the appropriate abstraction based on tasks. Inspired by this, we train neural models to generate a spectrum of discrete representations, and control the complexity of the representations (roughly, how many bits are allocated for encoding inputs) by tuning the entropy of the distribution over representations. In finetuning experiments, using only a small number of labeled examples for a new task, we show that (1) tuning the representation to a task-appropriate complexity level supports the greatest finetuning performance, and (2) in a human-participant study, users were able to identify the appropriate complexity level for a downstream task via visualizations of discrete representations. Our results indicate a promising direction for rapid model finetuning by leveraging human insight.
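
The complexity knob can be implemented as an entropy penalty on the code distribution. The sketch below, with a toy architecture and a bit budget of our choosing, pushes a discrete encoder's entropy toward a target number of bits:

```python
# Sketch of controlling representation complexity through entropy: a
# discrete encoder's distribution over codes is pushed toward a target
# entropy (roughly, a bit budget). Architecture and budget are toy choices.
import torch, torch.nn.functional as F

torch.manual_seed(0)
n_codes, target_bits = 64, 3.0            # allow ~3 bits of the possible 6
encoder = torch.nn.Linear(10, n_codes)
x = torch.randn(128, 10)

logits = encoder(x)
p = F.softmax(logits, dim=-1)
entropy = -(p * p.clamp_min(1e-9).log2()).sum(-1).mean()   # bits per input
code = F.gumbel_softmax(logits, tau=1.0, hard=True)        # discrete sample

task_loss = code.sum() * 0.0              # placeholder for the real task loss
complexity_loss = (entropy - target_bits).abs()
loss = task_loss + complexity_loss
loss.backward()
print(f"entropy: {entropy.item():.2f} bits")
```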

Few-shot Generation via Recalling Brain-Inspired Episodic-Semantic Memory
Zhibin Duan Lv Zhiyi Chaojie Wang Bo Chen Bo An Mingyuan Zhou



Research question: how to adapt a generative model to a novel generation task with only a few given data samples, improving few-shot generation.
Motivation: many real-world applications, such as artistic domains, have limited data, making few-shot generation crucial.
Method: inspired by the memory mechanism of the human brain, design a variational structured memory module (VSM) that simultaneously stores episodic and semantic memories to help existing generative models recall them efficiently during sample generation. A bionic memory-updating strategy converts between episodic and semantic memories and also models the uncertainty of the conversion. The developed VSM is then combined with various generative models under a Bayesian framework, and the memory-augmented generative models are evaluated on few-shot generation tasks.
Results: experiments show that combining VSM with generative models significantly improves few-shot generation capability and achieves good results on few-shot generation tasks.

Aimed at adapting a generative model to a novel generation task with only a few given data samples, the capability of few-shot generation is crucial for many real-world applications with limited data, e.g., artistic domains. Instead of training from scratch, recent works tend to leverage the prior knowledge stored in previous datasets, which is quite similar to the memory mechanism of human intelligence, but few of these works directly imitate the memory-recall mechanism that humans make good use of in accomplishing creative tasks, e.g., painting and writing. Inspired by the memory mechanism of human brain, in this work, we carefully design a variational structured memory module (VSM), which can simultaneously store both episodic and semantic memories to assist existing generative models efficiently recall these memories during sample generation. Meanwhile, we introduce a bionic memory updating strategy for the conversion between episodic and semantic memories, which can also model the uncertainty during conversion. Then, we combine the developed VSM with various generative models under the Bayesian framework, and evaluate these memory-augmented generative models with few-shot generation tasks, demonstrating the effectiveness of our methods.

Latent Space Translation via Semantic Alignment
Valentino Maiorca Luca Moschella Antonio Norelli Marco Fumero Francesco Locatello Emanuele Rodolà



Research question: the latent spaces of different neural models often turn out to be similar when exposed to semantically related data, but this intrinsic similarity is not always immediately discernible.
Motivation: toward a better understanding of this phenomenon, this work shows that the representations learned by these neural modules can be translated between different pre-trained networks via simpler transformations than previously thought.
Method: directly estimate a transformation between two given latent spaces, enabling effective stitching of encoders and decoders without additional training.
Results: the translation procedure is validated extensively across experimental settings: different trainings, domains, architectures (e.g., ResNet, CNN, ViT), and multiple downstream tasks (classification, reconstruction). In particular, text encoders and vision decoders can be zero-shot stitched, or vice versa, yielding surprisingly good classification performance in the multimodal setting.

While different neural models often exhibit latent spaces that are alike when exposed to semantically related data, this intrinsic similarity is not always immediately discernible. Towards a better understanding of this phenomenon, our work shows how representations learned from these neural modules can be translated between different pre-trained networks via simpler transformations than previously thought. An advantage of this approach is the ability to estimate these transformations using standard, well-understood algebraic procedures that have closed-form solutions. Our method directly estimates a transformation between two given latent spaces, thereby enabling effective stitching of encoders and decoders without additional training. We extensively validate the adaptability of this translation procedure in different experimental settings: across various trainings, domains, architectures (e.g., ResNet, CNN, ViT), and in multiple downstream tasks (classification, reconstruction). Notably, we show how it is possible to zero-shot stitch text encoders and vision decoders, or vice-versa, yielding surprisingly good classification performance in this multimodal setting.

NuTrea: Neural Tree Search for Context-guided Multi-hop KGQA
Hyeong Kyu Choi Seunghun Lee Jaewon Chu Hyunwoo J. Kim



Research question: how to effectively retrieve nodes from a knowledge graph to answer natural language questions.
Motivation: existing GNN-based methods only propagate messages from the seed node toward answer nodes, ignoring the broader knowledge-graph context, and handle entity-representing KG nodes poorly.
Method: propose Neural Tree Search (NuTrea), a tree-search-based GNN model that incorporates the broader KG context and adopts a message-passing scheme that probes unreached subtree regions to boost the past-oriented embeddings. It also introduces the Relation Frequency-Inverse Entity Frequency (RF-IEF) node embedding, which uses the global KG context to better characterize ambiguous KG nodes.
Results: experiments on three major multi-hop KGQA benchmark datasets validate the method's effectiveness, and further analyses confirm its expressiveness and robustness. Overall, NuTrea provides a powerful means of querying knowledge graphs with complex natural language questions.

Multi-hop Knowledge Graph Question Answering (KGQA) is a task that involves retrieving nodes from a knowledge graph (KG) to answer natural language questions. Recent GNN-based approaches formulate this task as a KG path searching problem, where messages are sequentially propagated from the seed node towards the answer nodes. However, these messages are past-oriented, and they do not consider the full KG context. To make matters worse, KG nodes often represent pronoun entities and are sometimes encrypted, being uninformative in selecting between paths. To address these problems, we propose Neural Tree Search (NuTrea), a tree search-based GNN model that incorporates the broader KG context. Our model adopts a message-passing scheme that probes the unreached subtree regions to boost the past-oriented embeddings. In addition, we introduce the Relation Frequency-Inverse Entity Frequency (RF-IEF) node embedding that considers the global KG context to better characterize ambiguous KG nodes. The general effectiveness of our approach is demonstrated through experiments on three major multi-hop KGQA benchmark datasets, and our extensive analyses further validate its expressiveness and robustness. Overall, NuTrea provides a powerful means to query the KG with complex natural language questions. Code is available at https://github.com/mlvlab/NuTrea.

Extensible Prompts for Language Models on Zero-shot Language Style Customization
Tao Ge Jing Hu Li Dong Shaoguang Mao Yan Xia Xun Wang Si-Qing Chen Furu Wei



Research question: how can large language models understand and work with concepts that lie beyond the scope of natural language?
Motivation: current models struggle with concepts that are hard to describe in natural language, calling for a way to extend what the models can be instructed to understand.
Method: propose the eXtensible Prompt (X-Prompt) approach, which introduces an imaginary vocabulary to guide the language model toward complex concepts; the imaginary words are designed to be out-of-distribution robust so they can be reused across various prompts.
Results: experiments show that X-Prompt effectively helps large language models understand and handle concepts beyond natural language, providing a new bridge for communication between humans and language models.

We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompt that is for fitting in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment with X-Prompt for zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs.
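
Mechanically, registering imaginary words amounts to appending trainable rows to a frozen embedding table. The sketch below uses a tiny stand-in embedding rather than a real LLM:

```python
# Sketch of registering "imaginary words": new embedding rows are appended
# to a frozen vocabulary, and they alone receive gradients. The tiny model
# is a stand-in for a real LLM's input embedding.
import torch, torch.nn as nn

vocab_size, dim, n_imaginary = 1000, 32, 4
base = nn.Embedding(vocab_size, dim)
base.weight.requires_grad_(False)                  # natural-language words stay frozen
imaginary = nn.Parameter(torch.randn(n_imaginary, dim) * 0.02)

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    table = torch.cat([base.weight, imaginary], dim=0)
    return table[token_ids]

# A prompt mixing NL tokens with imaginary word id 1000 (the first new row).
prompt = torch.tensor([5, 17, 1000, 42])
out = embed(prompt).sum()
out.backward()
print(imaginary.grad.abs().sum() > 0, base.weight.grad)   # tensor(True) None
```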

Adapting Neural Link Predictors for Data-Efficient Complex Query Answering
Erik Arakelyan Pasquale Minervini Daniel Daza Michael Cochez Isabelle Augenstein



Research question: answering complex queries over knowledge graphs with incomplete knowledge.
Motivation: existing methods either require large amounts of data and resources for training or offer poor interpretability.
Method: propose the CQD$^{\mathcal{A}}$ model, which re-calibrates neural link prediction scores for complex query answering through an optimized adaptation component.
Results: in experiments, CQD$^{\mathcal{A}}$ outperforms the current state of the art, improving Mean Reciprocal Rank from 34.4 to 35.1, while using no more than 30% of the available training query types.

Answering complex queries on incomplete knowledge graphs is a challenging task where a model needs to answer complex logical queries in the presence of missing knowledge. Prior work in the literature has proposed to address this problem by designing architectures trained end-to-end for the complex query answering task with a reasoning process that is hard to interpret while requiring data and resource-intensive training. Other lines of research have proposed re-using simple neural link predictors to answer complex queries, reducing the amount of training data by orders of magnitude while providing interpretable answers. The neural link predictor used in such approaches is not explicitly optimised for the complex query answering task, implying that its scores are not calibrated to interact together. We propose to address these problems via CQD$^{\mathcal{A}}$, a parameter-efficient score *adaptation* model optimised to re-calibrate neural link prediction scores for the complex query answering task. While the neural link predictor is frozen, the adaptation component, which only increases the number of model parameters by 0.03%, is trained on the downstream complex query answering task. Furthermore, the calibration component enables us to support reasoning over queries that include atomic negations, which was previously impossible with link predictors. In our experiments, CQD$^{\mathcal{A}}$ produces significantly more accurate results than current state-of-the-art methods, improving from 34.4 to 35.1 Mean Reciprocal Rank values averaged across all datasets and query types while using no more than 30% of the available training query types. We further show that CQD$^{\mathcal{A}}$ is data-efficient, achieving competitive results with only 1% of the complex training queries, and robust in out-of-domain evaluations. Source code and datasets are available at https://github.com/EdinburghNLP/adaptive-cqd.
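
The adaptation idea can be sketched in a few lines: a frozen predictor's scores pass through a tiny trainable calibration before a t-norm combines them across query atoms. The link predictor below is a random stub, and the two-parameter adapter is our simplification:

```python
# Sketch of the adaptation idea: a frozen link predictor's scores pass
# through a tiny trainable calibration before being combined with a t-norm
# for a conjunctive 2-hop query. The link predictor here is a random stub.
import torch, torch.nn as nn

torch.manual_seed(0)
n_entities = 50
frozen_scores = torch.rand(n_entities, n_entities)  # stub: score(e1 -r-> e2)

class Adapter(nn.Module):                  # a handful of parameters, trained downstream
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(1))
        self.b = nn.Parameter(torch.zeros(1))
    def forward(self, s):
        return torch.sigmoid(self.a * s.logit(eps=1e-6) + self.b)

adapt = Adapter()

def two_hop(anchor: int) -> torch.Tensor:
    # score(anchor -r1-> z) AND score(z -r2-> target): product t-norm,
    # maximized over the intermediate entity z.
    hop1 = adapt(frozen_scores[anchor])              # (n_entities,) over z
    hop2 = adapt(frozen_scores)                      # (z, target)
    return (hop1[:, None] * hop2).max(dim=0).values  # best z per target

print(two_hop(0).topk(3).indices)   # top candidate answers
```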

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context
Lakshya Agrawal Aditya Kanade Navin Goyal Shuvendu K Lahiri Sriram Rajamani



Research question: code language models often hallucinate when handling types, functionality, or APIs defined elsewhere, because they lack awareness of the global context.
Motivation: to overcome this limitation of code language models, the authors propose using static analysis to assist decoding.
Method: propose monitor-guided decoding (MGD), in which a monitor uses static analysis to guide the decoding process, and evaluate it on method completion with the PragmaticCode dataset.
Results: experiments show that MGD consistently improves compilation rates and agreement with the ground truth. Moreover, smaller language models augmented with MGD can outperform larger ones, and MGD generalizes well across multiple programming languages and coding scenarios.

Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen.
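
The decoding loop itself is simple: at each step the monitor computes the set of legal next tokens, and every other logit is masked out. Below, both the language model and the static-analysis monitor are stubs:

```python
# Skeleton of monitor-guided decoding: at each step a static-analysis
# monitor computes the set of legal next tokens (e.g., type-consistent
# members after a dereference), and all other logits are masked. The LM
# and the monitor here are stubs.
import torch

vocab = ["foo", "bar", ".size", ".get", "("]

def lm_logits(prefix: list[str]) -> torch.Tensor:
    return torch.randn(len(vocab))            # stub language model

def monitor(prefix: list[str]) -> set[int]:
    # Stub static analysis: after "bar", only members of bar's type.
    if prefix and prefix[-1] == "bar":
        return {vocab.index(".size"), vocab.index(".get")}
    return set(range(len(vocab)))             # otherwise unconstrained

prefix: list[str] = ["bar"]
for _ in range(3):
    logits = lm_logits(prefix)
    legal = monitor(prefix)
    mask = torch.full_like(logits, float("-inf"))
    mask[list(legal)] = 0.0
    nxt = int((logits + mask).argmax())       # greedy over legal tokens only
    prefix.append(vocab[nxt])
print(" ".join(prefix))
```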

On Masked Pre-training and the Marginal Likelihood
Pablo Moreno-Muñoz Pol G. Recasens Søren Hauberg



Research question: This paper aims to understand why masked pre-training succeeds and to identify its main learning principles in large language models.
Motivation: Masked pre-training is an intuitive self-supervised learning method, but the reason for its success remains unclear.
Method: Connect masked pre-training with a suitable cumulative scoring function to maximization of the model's marginal likelihood, thereby explaining its success.
Results: The developed theory is confirmed empirically, and the main learning principles of masked pre-training in large language models are explored in practice.

Masked pre-training removes random input dimensions and learns a model that can predict the missing values. Empirical results indicate that this intuitive form of self-supervised learning yields models that generalize very well to new domains. A theoretical understanding is, however, lacking. This paper shows that masked pre-training with a suitable cumulative scoring function corresponds to maximizing the model's marginal likelihood, which is de facto the Bayesian model selection measure of generalization. Beyond shedding light on the success of masked pre-training, this insight also suggests that Bayesian models can be trained with appropriately designed self-supervision. Empirically, we confirm the developed theory and explore the main learning principles of masked pre-training in large language models.

Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
Baohao Liao Shaomu Tan Christof Monz



Research question: How to perform parameter-efficient fine-tuning of pre-trained language models while also reducing memory consumption.
Motivation: Existing parameter-efficient fine-tuning methods improve performance but still need to cache most intermediate activations, which makes them memory-hungry.
Method: Insert adapters into a pre-trained language model in a way that preserves its starting point and makes the model reversible, enabling memory-efficient fine-tuning.
Results: The method performs well on the GLUE benchmark and five question-answering tasks, reducing activation memory by up to 84% while matching full fine-tuning.

Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, training only a small number of parameters without sacrificing performance and becoming the de-facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to full fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so that intermediate activations need not be cached and can be recomputed. Nevertheless, modifying a PLM into its reversible variant is not straightforward, since the reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor for the success of existing PEFT methods, and realize that it is essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT), which inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones: BERT, RoBERTa, BART and OPT. MEFT significantly reduces activation memory, by up to 84% relative to full fine-tuning, with a negligible number of trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning. A similar finding is also observed for the image classification task.
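
The memory saving comes from the standard property of reversible (additive-coupling) blocks: inputs can be recomputed exactly from outputs, so activations need not be cached for backprop. A minimal sketch of that property, assuming the generic coupling form rather than MEFT's specific adapter placement:

```python
# Reversible additive coupling: forward never needs to cache its inputs.
import numpy as np

def F(x):  # stand-ins for sub-layers (e.g., adapter-augmented attention/FFN)
    return np.tanh(x)

def G(x):
    return 0.5 * x

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2          # inputs x1, x2 need not be cached

def inverse(y1, y2):
    x2 = y2 - G(y1)        # recompute activations exactly during backprop
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.ones(4), np.arange(4.0)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
print("activations recovered exactly; no caching required")
```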

Vocabulary-free Image Classification
Alessandro Conti Enrico Fini Massimiliano Mancini Paolo Rota Yiming Wang Elisa Ricci



Research question: This paper addresses the assumption that a pre-defined set of categories (i.e., a vocabulary) is available at test time for composing textual prompts, which breaks down when the semantic context is unknown and evolving.
Motivation: Despite the remarkable progress of large vision-language models in image classification, a pre-defined category set can be impractical when the semantic context is unknown or changing.
Method: The paper formalizes a new task, Vocabulary-free Image Classification (VIC), whose goal is to assign an input image a class from an unconstrained language-induced semantic space without a known vocabulary. This huge semantic space is represented via an external vision-language database, and a method, Category Search from External Databases (CaSED), searches that database for candidate categories.
Results: Experiments show that CaSED outperforms more complex vision-language frameworks while using far fewer parameters, paving the way for future research in this direction.

Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed as Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.
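
A minimal sketch of the retrieve-then-score idea, assuming stub encoders in a shared image-text embedding space and a naive word-level candidate extractor; CaSED's actual database, candidate filtering, and scoring are considerably richer.

```python
# Vocabulary-free classification as caption retrieval plus candidate scoring.
import numpy as np

DIM = 8

def embed_text(s):                 # stub text encoder (deterministic per string)
    r = np.random.default_rng(abs(hash(s)) % (2**32))
    v = r.normal(size=DIM)
    return v / np.linalg.norm(v)

def embed_image(img_id):           # stub image encoder, aligned with text
    return embed_text("caption of " + img_id)

database = ["a photo of a labrador dog", "a tabby cat sleeping",
            "a red sports car", "caption of img0"]

def classify(img_id, k=2):
    q = embed_image(img_id)
    sims = [(float(q @ embed_text(c)), c) for c in database]
    top = sorted(sims, reverse=True)[:k]            # retrieve closest captions
    candidates = {w for _, c in top for w in c.split() if len(w) > 3}
    scored = {w: float(q @ embed_text(w)) for w in candidates}
    return max(scored, key=scored.get), scored      # best-matching candidate

print(classify("img0"))
```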

Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents
Wenlong Huang Fei Xia Dhruv Shah Danny Driess Andy Zeng Yao Lu Pete Florence Igor Mordatch Sergey Levine Karol Hausman brian ichter



Research question: How to combine large language models (LLMs) with embodied settings such as robots, so that they can understand the real world and carry out long-horizon tasks.
Motivation: Pre-trained language models face challenges when applied to embodied agents such as robots: they lack experience of the physical world, cannot parse non-language observations, and know nothing about the rewards or safety constraints robots may require.
Method: Language-conditioned robot policies learned from interaction data can provide the grounding needed to situate an agent correctly in the real world, but such policies are limited by the narrow breadth of available interaction data and lack high-level semantic understanding. Hence, to use a language model while situating it in an embodied setting, one must construct an action sequence that is both likely under the language model and realizable under grounded models of the environment.
Results: Experiments across three simulated and real-world domains show that such grounded models can be obtained and that the proposed decoding strategy solves complex, long-horizon embodied tasks in robotic settings by leveraging the knowledge of both models.

Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models.
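
Decoding under a product of the two models amounts to adding their log-probabilities and taking the best-scoring action. A minimal sketch with stub distributions (both tables below are made up for illustration):

```python
# Grounded decoding as a sum of LM and affordance log-probabilities.
ACTIONS = ["pick up the sponge", "pick up the unicorn", "wipe the table"]

def lm_logprob(action):            # stub: semantic plausibility from the LLM
    return {"pick up the sponge": -0.7,
            "pick up the unicorn": -0.5,      # the LLM alone prefers this
            "wipe the table": -1.2}[action]

def grounded_logprob(action):      # stub: feasibility in the current scene
    return {"pick up the sponge": -0.1,
            "pick up the unicorn": -9.0,      # not present in the scene
            "wipe the table": -0.3}[action]

def decode():
    scores = {a: lm_logprob(a) + grounded_logprob(a) for a in ACTIONS}
    return max(scores, key=scores.get), scores

print(decode())    # the grounded term vetoes the infeasible action
```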

Learning from Both Structural and Textual Knowledge for Inductive Knowledge Graph Completion
Kunxun Qi Jianfeng Du Hai Wan



Research question: How to exploit both structural and textual knowledge to learn rule-based systems that improve knowledge graph completion (KGC).
Motivation: Existing rule-based systems accept only structural knowledge as input and may therefore miss knowledge useful for reasoning, such as textual knowledge.
Method: A two-stage framework that brings in both structural and textual knowledge to learn rule-based systems. The first stage computes a set of triples with confidence scores (called "soft triples") from a text corpus via distant supervision. The second stage uses these soft triples to learn a rule model for KGC. To mitigate the noise introduced by soft triples, a new rule formalism called "text-enhanced rules" (TE-rules) is proposed, together with a neural model that simulates TE-rule inference.
Results: Experiments show that introducing soft triples and TE-rules yields significant performance gains in inductive link prediction.

Learning rule-based systems plays a pivotal role in knowledge graph completion (KGC). Existing rule-based systems restrict the input of the system to structural knowledge only, which may omit some useful knowledge for reasoning, e.g., textual knowledge. In this paper, we propose a two-stage framework that imposes both structural and textual knowledge to learn rule-based systems. In the first stage, we compute a set of triples with confidence scores (called \emph{soft triples}) from a text corpus by distant supervision, where a textual entailment model with multi-instance learning is exploited to estimate whether a given triple is entailed by a set of sentences. In the second stage, these soft triples are used to learn a rule-based model for KGC. To mitigate the negative impact of noise from soft triples, we propose a new formalism for rules to be learnt, named \emph{text enhanced rules} or \emph{TE-rules} for short. To effectively learn TE-rules, we propose a neural model that simulates the inference of TE-rules. We theoretically show that any set of TE-rules can always be interpreted by a certain parameter assignment of the neural model. We introduce three new datasets to evaluate the effectiveness of our method. Experimental results demonstrate that the introduction of soft triples and TE-rules results in significant performance improvements in inductive link prediction.

Emergent and Predictable Memorization in Large Language Models
Stella Biderman USVSN Sai Prashanth Lintang Sutawika Hailey Schoelkopf Quentin Gregory Anthony Shivanshu Purohit Edward Raff



Research question: Large language models (LLMs) tend to output entire sequences from their training data verbatim, a key concern when deploying language models.
Motivation: It is particularly important to minimize a model's verbatim memorization of sensitive data points, such as those containing personally identifiable information (PII); the prevalence of such undesirable memorization can pose problems for model trainers and may even force them to discard otherwise functional models.
Method: Predict which sequences a large model will memorize before its full training run by extrapolating the memorization behavior of lower-compute trial runs. Memorization is measured on the Pythia model suite, and scaling laws for forecasting memorization are plotted, yielding equi-compute recommendations that maximize the reliability (recall) of such predictions.
Results: Further new findings on the distribution of memorization scores across models and data are provided. All code and data needed to reproduce the paper's results are released at https://github.com/EleutherAI/pythia.

Memorization, or the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for deploying language models. In particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personal identifiable information (PII). The prevalence of such undesirable memorization can pose issues for model trainers, and may even require discarding an otherwise functional model. We therefore seek to predict which sequences will be memorized before a large model's full train-time by extrapolating the memorization behavior of lower-compute trial runs. We measure memorization in the Pythia model suite and plot scaling laws for forecasting memorization, allowing us to provide equi-compute recommendations to maximize the reliability (recall) of such predictions. We additionally provide further novel discoveries on the distribution of memorization scores across models and data. We release all code and data necessary to reproduce the results in this paper at https://github.com/EleutherAI/pythia.

Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering
Noah Hollmann Samuel Müller Frank Hutter



Research question: How can domain knowledge be incorporated into automated machine learning (AutoML) systems?
Motivation: As AutoML advances, bringing domain knowledge into these systems becomes increasingly important.
Method: A method that harnesses large language models (LLMs): Context-Aware Automated Feature Engineering (CAAFE), a feature-engineering method for tabular datasets that uses an LLM to iteratively generate semantically meaningful additional features based on the dataset's description, producing both the Python code that creates the new features and explanations of their utility.
Results: Despite its simplicity, CAAFE improves performance on 11 of 14 datasets, boosting mean ROC AUC from 0.798 to 0.822 across all datasets, comparable to the gain from using a random forest instead of logistic regression on these datasets. CAAFE is also interpretable, providing a textual explanation for each generated feature, paving the way for broader semi-automation of data-science tasks and highlighting context-aware solutions that can extend AutoML toward semantic AutoML.

As the field of automated machine learning (AutoML) advances, it becomes increasingly important to incorporate domain knowledge into these systems. We present an approach for doing so by harnessing the power of large language models (LLMs). Specifically, we introduce Context-Aware Automated Feature Engineering (CAAFE), a feature engineering method for tabular datasets that utilizes an LLM to iteratively generate additional semantically meaningful features based on the description of the dataset. The method produces both Python code for creating new features and explanations for the utility of the generated features. Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets -- boosting mean ROC AUC performance from 0.798 to 0.822 across all datasets -- similar to the improvement achieved by using a random forest instead of logistic regression on our datasets. Furthermore, CAAFE is interpretable, providing a textual explanation for each generated feature. CAAFE paves the way for more extensive semi-automation in data science tasks and emphasizes the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML. We release our code (https://github.com/automl/CAAFE), a simple demo (https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a) and a python package (https://pypi.org/project/caafe/).
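
A minimal sketch of the propose-execute-evaluate loop, with a hypothetical llm_propose() stub in place of the real LLM call; the paper's prompting, cross-validation, and safety considerations around executing generated code are omitted here.

```python
# CAAFE-style loop: keep an LLM-proposed feature only if the score improves.
import pandas as pd

def llm_propose(df_description, history):
    """Stub: would prompt an LLM with the dataset description; returns
    (python_code, explanation) for one candidate feature."""
    return ("df['room_per_person'] = df['rooms'] / df['household_size']",
            "Rooms per person may correlate with the target.")

def evaluate(df):
    """Stub validation score; in practice, cross-validated ROC AUC."""
    return df.corr(numeric_only=True)["target"].abs().drop("target").max()

df = pd.DataFrame({"rooms": [2, 5, 3, 8], "household_size": [1, 4, 1, 2],
                   "target": [0, 0, 1, 1]})
score, history = evaluate(df), []
for _ in range(3):                              # iterative feature proposals
    code, why = llm_propose("housing data", history)
    trial = df.copy()
    exec(code, {"df": trial})                   # run the generated feature code
    new_score = evaluate(trial)
    if new_score > score:                       # keep only features that help
        df, score = trial, new_score
        history.append((code, why, new_score))
print(score, [h[0] for h in history])
```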

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
Hao Liu Wilson Yan Pieter Abbeel



Research question: How to connect large language models with visual perception so they can extend to real-world tasks such as visual question answering and robotics.
Motivation: Current language models perform poorly on vision-language tasks because they lack grounding in visual perception; existing approaches align images with text via pre-training or fine-tuning, which is costly and computationally heavy.
Method: A simple yet effective method, the Language-Quantized AutoEncoder (LQAE), aligns text-image data in an unsupervised manner using pre-trained language-model denoisers (e.g., BERT). The key idea is to encode images as sequences of text tokens by quantizing image embeddings directly with a pre-trained language codebook, then feed a masked version of the quantized embeddings into BERT to reconstruct the original input.
Results: LQAE learns to represent similar images with similar clusters of text tokens, aligning the two modalities without aligned text-image pairs. Experiments show that LQAE enables few-shot multimodal learning with large language models, outperforming baselines on tasks such as image classification and visual question answering while requiring as few as 1-10 image-text pairs.

Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of natural language tasks. However, a key limitation is that these language models fundamentally lack grounding in visual perception - a crucial attribute needed to extend to real world tasks such as visual question answering and robotics. While prior works have largely connected image to text through pretraining or fine-tuning, learning such alignments is generally costly due to a combination of curating massive datasets and large computational burdens. In order to resolve these limitations, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language model denoisers (e.g., BERT). Our main idea is to encode images as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then feed a masked version of the quantized embeddings into a BERT to reconstruct the original input. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. We show LQAE learns text-aligned image tokens that enable few-shot multi-modal learning with large language models, outperforming baseline methods in tasks such as image classification and VQA while requiring as few as 1-10 image-text pairs.
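
A minimal sketch of the quantization step, assuming toy stand-ins for the image encoder, the frozen language codebook, and the masking; the BERT denoiser itself is only indicated in a comment.

```python
# Quantize patch embeddings against a frozen "language" codebook, then mask.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 16, 4
codebook = rng.normal(size=(VOCAB, DIM))     # frozen LM token embeddings

def encode_image(image):
    """Stub image encoder producing one embedding per patch."""
    return rng.normal(size=(6, DIM))

def quantize(patches):
    """Nearest codebook entry per patch -> a 'sentence' of token ids."""
    d = ((patches[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(axis=1)

tokens = quantize(encode_image("img"))
mask = rng.random(len(tokens)) < 0.3         # BERT-style random masking
masked_tokens = np.where(mask, -1, tokens)   # -1 marks [MASK]
print(tokens, masked_tokens)
# A pretrained denoiser (e.g., BERT) would now be asked to reconstruct
# `tokens` from `masked_tokens`; the reconstruction loss trains the encoder.
```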

DesCo: Learning Object Recognition with Rich Language Descriptions
Liunian Harold Li Zi-Yi Dou Nanyun Peng Kai-Wei Chang



Research question: How to make visual recognition models better understand complex language descriptions and draw on the context they provide.
Motivation: Existing vision-language methods can align objects with language queries, but they often ignore contextual information in descriptions and rely too heavily on object names for detection.
Method: A new description-conditioned (DesCo) learning paradigm that uses a large language model to generate rich language descriptions of objects and designs context-sensitive queries, improving the model's ability to parse intricate details and attend to context.
Results: On two object-detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, the approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6) respectively, surpassing the prior state-of-the-art models GLIP and FIBER by a large margin.

Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and thus improve the models' adaptability to novel objects and domains. Recent studies have attempted to query these models with complex language expressions that include specifications of fine-grained details, such as colors, shapes, and relations. However, simply incorporating language descriptions into queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, a state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle the challenge, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions consisting of two innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects; 2) we design context-sensitive queries to improve the model's ability in deciphering intricate nuances embedded within descriptions and enforce the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.

Emergent Communication in Interactive Sketch Question Answering
Zixing Lei Yiming Zhang Yuxin Xiong Siheng Chen



Research question: How to learn to communicate through sketches via vision-based emergent communication (EC) and shed light on the evolution of human communication.
Motivation: Prior work neglects multi-round interaction, which is indispensable in human communication.
Method: A new Interactive Sketch Question Answering (ISQA) task is introduced, in which two collaborative players interact through sketches to answer a question about an image. To accomplish the task, a new and efficient interactive EC system is designed that balances three evaluation factors: question-answering accuracy, drawing complexity, and human interpretability.
Results: Experiments show that the multi-round interaction mechanism facilitates targeted and efficient communication between intelligent agents.

Vision-based emergent communication (EC) aims to learn to communicate through sketches and demystify the evolution of human communication. Ironically, previous works neglect multi-round interaction, which is indispensable in human communication. To fill this gap, we first introduce a novel Interactive Sketch Question Answering (ISQA) task, where two collaborative players are interacting through sketches to answer a question about an image. To accomplish this task, we design a new and efficient interactive EC system, which can achieve an effective balance among three evaluation factors, including the question answering accuracy, drawing complexity and human interpretability. Our experimental results demonstrate that the multi-round interactive mechanism facilitates targeted and efficient communication between intelligent agents. The code will be released.

Brant: Foundation Model for Intracranial Neural Signal
Daoze Zhang Zhizhang Yuan Yang Yang Junru Chen Jingjing Wang Yafeng Li



Research question: This paper proposes a foundation model named Brant for modeling intracranial recordings, learning powerful representations of intracranial neural signals through pre-training and providing a large-scale, off-the-shelf model for medicine.
Motivation: Large-scale, off-the-shelf models for intracranial neural signals are currently lacking.
Method: Pre-train on a large corpus of intracranial data collected by the authors; Brant is designed to capture the long-term temporal dependency and spatial correlation of neural signals, combining information from both the time and frequency domains.
Results: As a foundation model, Brant achieves state-of-the-art performance on various downstream tasks (neural signal forecasting, frequency-phase forecasting, imputation, and seizure detection), showing generalization across a broad range of tasks. Low-resource label analysis and representation visualization further demonstrate the effectiveness of the pre-training strategy, and experiments on model size show that larger, higher-capacity models improve performance on the authors' dataset.

We propose a foundation model named Brant for modeling intracranial recordings, which learns powerful representations of intracranial neural signals by pre-training, providing a large-scale, off-the-shelf model for medicine. Brant is the largest model in the field of brain signals and is pre-trained on a large corpus of intracranial data collected by us. The design of Brant is to capture long-term temporal dependency and spatial correlation from neural signals, combining the information in both time and frequency domains. As a foundation model, Brant achieves SOTA performance on various downstream tasks (i.e. neural signal forecasting, frequency-phase forecasting, imputation and seizure detection), showing the generalization ability to a broad range of tasks. The low-resource label analysis and representation visualization further illustrate the effectiveness of our pre-training strategy. In addition, we explore the effect of model size to show that a larger model with a higher capacity can lead to performance improvements on our dataset. The source code and pre-trained weights are available at: https://zju-brainnet.github.io/Brant.github.io/.

Universality and Limitations of Prompt Tuning
Yihan Wang Jatin Chauhan Wei Wang Cho-Jui Hsieh



Research question: Although prompt tuning has proven effective at adapting pre-trained language models to new tasks, the theoretical grounding of the difference between "tuning parameters before the input" and "tuning model weights" remains limited.
Motivation: This is one of the first attempts to understand the role of soft-prompt tuning in transformer-based architectures.
Method: Considering a general-purpose architecture, prompt tuning is analyzed from two angles: universal approximation, and the limitations of finite-depth fixed-weight pre-trained transformers on continuous-valued functions.
Results: The universality result guarantees the existence of a strong transformer that, with a prompt, can approximate any sequence-to-sequence function in a set of Lipschitz functions. Limitations of prompt tuning for finite-depth transformers are also proved, along with a lower bound on the number of tunable prompt parameters required. The analysis further extends to multi-layer settings, giving sufficient conditions under which the transformer can at best learn datasets from invertible functions.

Despite the demonstrated empirical efficacy of prompt tuning to adapt a pretrained language model for a new task, the theoretical underpinnings of the difference between "tuning parameters before the input" against "the tuning of model weights" are limited. We thus take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures. By considering a general purpose architecture, we analyze prompt tuning from the lens of both: universal approximation and limitations with finite-depth fixed-weight pretrained transformers for continuous-valued functions. Our universality result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions. The limitations of prompt tuning for limited-depth transformers are first proved by constructing a set of datasets, that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare the result with the number of parameters required for a low-rank update (based on LoRA) for a single-layer setting. We finally extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are also corroborated by empirical results.

Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning
Lin Guan Karthik Valmeekam Sarath Sreedharan Subbarao Kambhampati



Research question: How to use pre-trained large language models (LLMs) effectively for planning problems.
Motivation: Current approaches that use LLMs directly as planners suffer from limited plan correctness, heavy reliance on feedback from interaction with simulators or the actual environment, and inefficient use of human feedback.
Method: A new alternative paradigm: first construct an explicit world (domain) model in PDDL, then plan with a sound domain-independent planner. Since LLMs may not produce a fully functional PDDL model at first, the LLM serves as an interface between PDDL and sources of corrective feedback, such as PDDL validators and humans.
Results: On two IPC domains and a Household domain more complicated than common benchmarks such as ALFWorld, GPT-4 is shown to produce high-quality PDDL models for over 40 actions, and the corrected PDDL models are successfully used to solve 48 challenging planning tasks.

There is a growing interest in applying pre-trained large language models (LLMs) to planning problems. However, methods that use LLMs directly as planners are currently impractical due to several factors, including limited correctness of plans, strong reliance on feedback from interactions with simulators or even the actual environment, and the inefficiency in utilizing human feedback. In this work, we introduce a novel alternative paradigm that constructs an explicit world (domain) model in planning domain definition language (PDDL) and then uses it to plan with sound domain-independent planners. To address the fact that LLMs may not generate a fully functional PDDL model initially, we employ LLMs as an interface between PDDL and sources of corrective feedback, such as PDDL validators and humans. For users who lack a background in PDDL, we show that LLMs can translate PDDL into natural language and effectively encode corrective feedback back to the underlying domain model. Our framework not only enjoys the correctness guarantee offered by the external planners but also reduces human involvement by allowing users to correct domain models at the beginning, rather than inspecting and correcting (through interactive prompting) every generated plan as in previous work. On two IPC domains and a Household domain that is more complicated than commonly used benchmarks such as ALFWorld, we demonstrate that GPT-4 can be leveraged to produce high-quality PDDL models for over 40 actions, and the corrected PDDL models are then used to successfully solve 48 challenging planning tasks. Resources, including the source code, are released at: https://guansuns.github.io/pages/llm-dm.
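
A minimal sketch of the construct-then-plan loop, with stubs standing in for the LLM, the PDDL validator, and the external planner; none of these calls reflect a real API, and the PDDL fragment is illustrative only.

```python
# Corrective-feedback loop: draft PDDL, validate, repair, then plan.
def llm_draft_pddl(description):
    """Stub: would prompt an LLM to write a PDDL action model."""
    return "(:action pick :parameters (?o) :precondition (clear ?o) :effect (holding ?o))"

def validate_pddl(domain):
    """Stub validator: returns a list of error messages (empty means valid)."""
    return [] if ":precondition" in domain else ["missing :precondition"]

def llm_repair(domain, errors):
    """Stub: would re-prompt the LLM with the validator's feedback."""
    return domain

def external_planner(domain, goal):
    """Stub: a sound domain-independent planner would be called here."""
    return ["pick mug"]

domain = llm_draft_pddl("a household robot that can pick up objects")
for _ in range(3):                 # iterate until the validator is satisfied
    errors = validate_pddl(domain)
    if not errors:
        break
    domain = llm_repair(domain, errors)
print(external_planner(domain, "fetch the mug"))
```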

ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers
Kexun Zhang Danqing Wang Jingtao Xia William Yang Wang Lei Li



Research question: Large language models implement code well from functionality descriptions but struggle on algorithmic problems that require choosing the right algorithm, and their generated programs lack correctness guarantees, requiring human verification.
Motivation: To address these issues, the authors propose ALGO, a framework that uses LLM-generated oracles to guide program generation and verify its correctness.
Method: ALGO first generates a reference oracle by prompting an LLM to exhaustively enumerate all combinations of the relevant variables. The oracle then guides an arbitrary search strategy in exploring the algorithm space and verifies the synthesized algorithms.
Results: The LLM-generated oracles are correct in 88% of cases. With the oracles as verifiers, ALGO can be integrated with any existing code-generation model in a model-agnostic way to improve it: equipped with ALGO, a one-submission pass rate 8x better than Codex and 2.6x better than the state-of-the-art CodeT is achieved on CodeContests, along with a 1.3x better pass rate than the ChatGPT Code Interpreter on unseen problems.

Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness. ALGO first generates a reference oracle by prompting an LLM to exhaustively enumerate all the combinations of relevant variables. This oracle is then utilized to guide an arbitrary search strategy in exploring the algorithm space and to verify the synthesized algorithms. Our study shows that the LLM-generated oracles are correct for 88% of the cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that when equipped with ALGO, we achieve an 8× better one-submission pass rate over the Codex model and a 2.6× better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We can also get 1.3× better pass rate over the ChatGPT Code Interpreter on unseen problems. The problem set we used for testing, the prompts we used, the verifier and solution programs, and the test cases generated by ALGO are available at https://github.com/zkx06111/ALGO.
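
The verification idea can be reproduced end-to-end on a toy problem: a slow exhaustive oracle (the kind an LLM is prompted to write) checks a fast candidate algorithm on random tests. Maximum subarray is our stand-in problem here, not one from the paper's benchmarks.

```python
# Oracle-guided verification: brute force checks the fast candidate.
import random

def oracle(xs):                    # exhaustive: enumerate all subarrays
    return max(sum(xs[i:j]) for i in range(len(xs))
               for j in range(i + 1, len(xs) + 1))

def candidate(xs):                 # fast algorithm to be verified (Kadane)
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

random.seed(0)
for _ in range(200):               # verify the candidate against the oracle
    xs = [random.randint(-5, 5) for _ in range(random.randint(1, 12))]
    assert candidate(xs) == oracle(xs), xs
print("candidate agrees with the oracle on all sampled tests")
```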

Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior
Shashank Subramanian Peter Harrington Kurt Keutzer Wahid Bhimji Dmitriy Morozov Michael W. Mahoney Amir Gholami



Research question: This paper studies the transfer-learning behavior of pre-trained models in scientific machine learning (SciML), in particular how a single model pre-trained on a mixture of physics problems adapts to various downstream applications.
Motivation: The pre-train-and-fine-tune paradigm can reach desired accuracy levels with far fewer downstream examples, opening new possibilities for SciML problems.
Method: Study transfer behavior as the pre-trained model size is scaled, as the downstream training-dataset size is scaled, as the physics parameters are systematically pushed out of distribution, and as a model pre-trained on a mixture of physics problems is adapted to various downstream applications.
Results: Experiments show that, when fine-tuned appropriately, pre-training plus fine-tuning reaches desired accuracy levels with larger gains than training from scratch, and this holds across a broad range of PDE learning tasks.

Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pretrained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) how a single model pre-trained on a mixture of different physics problems can be adapted to various downstream applications. We find that—when fine-tuned appropriately—transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behaviour across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the “pre-train and fine-tune” paradigm for SciML problems, demonstrating a path towards building SciML foundation models. Our code is available as open-source.

The Impact of Positional Encoding on Length Generalization in Transformers
Amirhossein Kazemnejad Inkit Padhi Karthikeyan Natesan Payel Das Siva Reddy



Research question: Length generalization, i.e., generalizing from small training context sizes to larger ones, is a key challenge for Transformer models.
Motivation: Positional encoding (PE) is considered a major factor in length generalization, but the effect of different PE schemes on extrapolation in downstream tasks is unclear.
Method: A systematic empirical study of decoder-only Transformers comparing five positional-encoding approaches: Absolute Position Embedding (APE), T5's Relative PE, ALiBi, Rotary, and Transformers with no positional encoding (NoPE).
Results: The most commonly used positional-encoding methods, such as ALiBi, Rotary, and APE, generalize poorly to longer lengths in downstream tasks. More importantly, NoPE outperforms the other, explicit methods without extra computation. Theoretically, NoPE can represent both absolute and relative PEs, but when trained with SGD it mostly resembles T5's relative PE attention patterns. Finally, scratchpads are not always helpful for length generalization, and their format strongly affects performance. Overall, explicit positional embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.

Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Zihao Yue Anwen Hu Liang Zhang Qin Jin



Research question: This paper addresses the conflicting optimization directions that arise when training image-captioning models.
Motivation: With maximum likelihood estimation as the training objective, existing captioning models are penalized whenever their predictions mismatch the labels, pushing them toward more concise descriptions at the expense of rich semantic detail.
Method: The paper proposes Semipermeable Maximum Likelihood Estimation (SMILE), which permits richness optimization while blocking conciseness optimization, encouraging the model to generate longer captions with more detail.
Results: Extensive experiments on two mainstream captioning datasets, MSCOCO and Flickr30K, show that SMILE significantly improves the descriptiveness of generated captions.

Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as *conciseness optimization*. In contrast, predictions that are more concise than labels lead to *richness optimization*. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.

Language Models are Weak Learners
Hariharan Manikandan Yiding Jiang J Zico Kolter



Research question: This paper investigates whether large language models can act as weak learners within boosting algorithms.
Motivation: Weak learners, classifiers that beat random guessing on any distribution over the data, form the basis of boosting, yet prompt-based LLMs have rarely been used as components of such larger machine learning methods.
Method: Tabular data samples are converted into text descriptions, sampled according to the boosting distribution of interest; an LLM is prompted to summarize them, and the summary serves as a template for classification, i.e., a weak learner that is then incorporated into a boosting procedure.
Results: In many settings the resulting boosting approach leverages the knowledge inside the LLM to outperform traditional tree-based boosting, as well as few-shot learning and occasionally even more involved fine-tuning, particularly on tasks with small numbers of data points.

A central notion in practical and theoretical machine learning is that of a *weak learner*, classifiers that achieve better-than-random performance (on any given distribution over data), even by a small margin. Such weak learners form the practical basis for canonical machine learning methods such as boosting. In this work, we illustrate that prompt-based large language models can operate effectively as said weak learners. Specifically, we illustrate the use of a large language model (LLM) as a weak learner in a boosting algorithm applied to tabular data. We show that by providing (properly sampled according to the distribution of interest) text descriptions of tabular data samples, LLMs can produce a summary of the samples that serves as a template for classification, and achieves the aim of acting as a weak learner on this task. We incorporate these models into a boosting approach, which in many settings can leverage the knowledge within the LLM to outperform traditional tree-based boosting. The model outperforms both few-shot learning and occasionally even more involved fine-tuning procedures, particularly for some tasks involving small numbers of data points. The results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning models.
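
A minimal AdaBoost-style sketch with the weak learner abstracted behind a stub: in the paper the weak learner is an LLM prompted with distribution-sampled text descriptions of tabular rows, which we replace with a toy decision stump so the loop runs self-contained.

```python
# Boosting with a pluggable weak learner (stub in place of the LLM).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * X[:, 1] > 0, 1, -1)

def weak_learner(X, y, w):
    """Stub for 'prompt an LLM with w-sampled row descriptions, get back a
    classifier template'. Here: the best decision stump under weights w."""
    best = None
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(0.1, 0.9, 9)):
            for s in (1, -1):
                err = w[s * np.sign(X[:, f] - t) != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    _, f, t, s = best
    return lambda Z: s * np.sign(Z[:, f] - t)

w = np.full(len(y), 1.0 / len(y))
ensemble = []
for _ in range(10):                          # standard AdaBoost updates
    h = weak_learner(X, y, w)
    pred = h(X)
    err = max(w[pred != y].sum(), 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    ensemble.append((alpha, h))

final = np.sign(sum(a * h(X) for a, h in ensemble))
print("train accuracy:", float((final == y).mean()))
```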

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Chenyang Le Yao Qian Long Zhou Shujie LIU Yanmin Qian Michael Zeng Xuedong Huang



Research question: This paper tackles the challenges of joint speech-language training, including the large demand for training data and GPUs and the modality gap between speech and language.
Motivation: Joint speech-language training is challenging because of the heavy data and GPU requirements and the modality gap between speech and language.
Method: ComSL, a speech-language model built on a composite architecture of public pre-trained speech-only and language-only models and optimized data-efficiently for spoken-language tasks. In particular, cross-modality learning is incorporated into transfer learning, and both are conducted simultaneously for downstream tasks in a multi-task learning manner.
Results: The approach proves effective on end-to-end speech-to-text translation, reaching a new state-of-the-art average BLEU score of 31.5 on multilingual speech to English text translation for 21 languages on the public CoVoST2 evaluation set.

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pre-trained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.

ChatGPT-Powered Hierarchical Comparisons for Image Classification
Zhiyuan Ren Yiyang Su Xiaoming Liu



Research question: The zero-shot open-vocabulary setting poses challenges for image classification.
Motivation: Vision-language models such as CLIP can classify images by comparing embeddings, but CLIP remains biased toward certain classes and overlooks differences between similar classes.
Method: A new image-classification framework that builds a class hierarchy by recursively comparing and grouping classes; with this hierarchy, an image is classified by comparing image and text embeddings level by level, from the top of the hierarchy to the bottom.
Results: Experiments and analyses show the proposed method is intuitive, effective, and explainable.

The zero-shot open-vocabulary setting poses challenges for image classification. Fortunately, utilizing a vision-language model like CLIP, pre-trained on image-text pairs, allows for classifying images by comparing embeddings. Leveraging large language models (LLMs) such as ChatGPT can further enhance CLIP’s accuracy by incorporating class-specific knowledge in descriptions. However, CLIP still exhibits a bias towards certain classes and generates similar descriptions for similar classes, disregarding their differences. To address this problem, we present a novel image classification framework via hierarchical comparisons. By recursively comparing and grouping classes with LLMs, we construct a class hierarchy. With such a hierarchy, we can classify an image by descending from the top to the bottom of the hierarchy, comparing image and text embeddings at each level. Through extensive experiments and analyses, we demonstrate that our proposed approach is intuitive, effective, and explainable. Code will be released upon publication.
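
A minimal sketch of the top-down descent, assuming a hand-written hierarchy and a stub similarity function in place of the LLM-built hierarchy and CLIP scores:

```python
# Classify by descending a class hierarchy, comparing group members in turn.
hierarchy = {
    "animal": {"dog": {}, "cat": {}},
    "vehicle": {"car": {}, "bicycle": {}},
}

def similarity(image, label):
    """Stub for a CLIP-style image-text similarity score."""
    scores = {"animal": 0.6, "vehicle": 0.4, "dog": 0.7, "cat": 0.3}
    return scores.get(label, 0.0)

def classify(image, level=None):
    level = hierarchy if level is None else level
    best = max(level, key=lambda k: similarity(image, k))
    # Recurse into the winning group until we hit a leaf class.
    return best if not level[best] else classify(image, level[best])

print(classify("img"))    # descends animal -> dog
```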

Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
Hejie Cui Xinyu Fang Zihan Zhang Ran Xu Xuan Kan Xin Liu Yue Yu Manling Li Yangqiu Song Carl Yang



Research question: Existing visual-knowledge-extraction methods usually rely on a pre-defined format or vocabulary, limiting the expressiveness of the extracted knowledge.
Motivation: Images contain rich relational knowledge that can help machines understand the world.
Method: A new open visual knowledge extraction paradigm, comprising an open relational region detector and a visual knowledge generator that prompts a large multimodality model with the detected region of interest.
Results: Extensive knowledge-quality evaluations demonstrate the correctness and uniqueness of the open visual knowledge extracted by OpenVik, and integrating the extracted knowledge into various visual-reasoning applications yields consistent improvements, indicating OpenVik's real-world applicability.

Images contain rich relational knowledge that can help machines understand the world. Existing methods on visual knowledge extraction often rely on the pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first exploration to a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik which consists of an open relational region detector to detect regions potentially containing relational knowledge and a visual knowledge generator that generates format-free knowledge by prompting the large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the extracted open visual knowledge by OpenVik. Moreover, integrating our extracted knowledge across various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik.

Connecting Pre-trained Language Model and Downstream Task via Properties of Representation
Chenwei Wu Holden Lee Rong Ge



Research question: This paper examines the relationship between pre-training performance and downstream-task performance.
Motivation: Although the representations of large pre-trained language models are useful in various downstream tasks, there is little theoretical understanding of how pre-training performance relates to downstream performance.
Method: The relationship is studied by analyzing a log-linear model in which a word is predicted from its context through a network with softmax as its last layer.
Results: Even when the downstream task is highly structured and depends on a simple function of the hidden representation, a low pre-training loss cannot guarantee good downstream performance. On the other hand, the authors propose and empirically validate the existence of an "anchor vector" in the representation space and show that this assumption, together with properties of the downstream task, guarantees performance transfer.

Recently, researchers have found that representations learned by large-scale pre-trained language models are useful in various downstream tasks. However, there is little theoretical understanding of how pre-training performance is related to downstream task performance. In this paper, we analyze how this performance transfer depends on the properties of the downstream task and the structure of the representations. We consider a log-linear model where a word can be predicted from its context through a network having softmax as its last layer. We show that even if the downstream task is highly structured and depends on a simple function of the hidden representation, there are still cases when a low pre-training loss cannot guarantee good performance on the downstream task. On the other hand, we propose and empirically validate the existence of an ``anchor vector'' in the representation space, and show that this assumption, together with properties of the downstream task, guarantees performance transfer.

EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning
Ping Guo Xiangpeng Wei Yue Hu Baosong Yang Dayiheng Liu Fei Huang jun xie



Research question: How to learn universal cross-lingual representations from large-scale non-parallel data.
Motivation: Due to the sparsity and scarcity of parallel data, learning authentic "universals" for any two languages remains a major challenge.
Method: Emma-X, an EM-like multilingual pre-training algorithm, learns cross-lingual universals with the aid of massive multilingual non-parallel data, unifying cross-lingual representation learning with an extra semantic-relation prediction task inside an EM framework.
Results: Experiments on xrete, a new benchmark of 12 widely studied cross-lingual tasks that fully depend on sentence-level representations, show that Emma-X achieves state-of-the-art performance.

Expressing universal semantics common to all languages is helpful to understand the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic ``universals'' for any two languages. In this paper, we propose Emma-X: an EM-like Multilingual pre-training Algorithm, to learn Cross-lingual universals with the aid of excessive multilingual non-parallel data. Emma-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each other until convergence. To evaluate Emma-X, we conduct experiments on xrete, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that Emma-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of Emma-X over advanced models.

Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong Erik Jones Jacob Steinhardt



Research question: Deployed multimodal models can fail in ways evaluators did not anticipate; how can these failures be found?
Motivation: To find such failures before deployment, the authors introduce MultiMon, a system that automatically identifies systematic failures.
Method: Scrape for examples of erroneous agreement, inputs that produce the same output but should not, then prompt a language model to identify common categories and describe them in natural language.
Results: MultiMon finds 14 systematic failures of the CLIP text encoder and can also steer toward failures relevant to specific use cases, such as self-driving cars.

Deployed multimodal models can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures---generalizable, natural-language descriptions that describe categories of individual failures. To uncover systematic failures, MultiMon scrapes for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts a language model to identify common categories and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal models, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others. MultiMon can also steer towards failures relevant to specific use cases, such as self-driving cars. We see MultiMon as a step towards evaluation that autonomously explores the long-tail of potential system failures.
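
The scraping step reduces to a simple search for near-identical embeddings of inputs that should differ. A minimal sketch with a stub encoder deliberately built to ignore quantifiers, mimicking the reported CLIP failure:

```python
# Scrape for "erroneous agreement": different inputs, (nearly) identical output.
import numpy as np

def text_encoder(s):
    """Stub encoder that, like the reported CLIP failure, is blind to
    quantifiers: 'few' and 'many' are dropped before hashing."""
    norm = " ".join(w for w in s.split() if w not in {"few", "many"})
    r = np.random.default_rng(abs(hash(norm)) % (2**32))
    v = r.normal(size=16)
    return v / np.linalg.norm(v)

corpus = ["a shelf with few books", "a shelf with many books",
          "a red car", "a blue car"]

pairs = []
for i in range(len(corpus)):
    for j in range(i + 1, len(corpus)):
        sim = float(text_encoder(corpus[i]) @ text_encoder(corpus[j]))
        if sim > 0.99:                      # same output, different input
            pairs.append((corpus[i], corpus[j]))
print(pairs)   # such pairs are handed to an LLM to name the failure category
```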

Brain-like Flexible Visual Inference by Harnessing Feedback Feedforward Alignment
Tahereh Toosi Elias Issa



Research question: How feedback connections in the visual cortex support flexible visual functions remains unclear.
Motivation: Top-down effects may emerge through alignment between the feedforward and feedback pathways, each optimizing its own objective.
Method: Feedback-Feedforward Alignment (FFA), a learning algorithm that leverages the feedback and feedforward pathways as mutual credit-assignment computational graphs, enabling alignment.
Results: Experiments demonstrate FFA's effectiveness at co-optimizing classification and reconstruction on the MNIST and CIFAR10 datasets. The alignment mechanism endows feedback connections with emergent visual-inference functions, including denoising, resolving occlusions, hallucination, and imagination, and FFA is more biologically plausible to implement than traditional backpropagation (BP). The study presents FFA as a promising proof of concept for how feedback connections in the visual cortex support flexible visual functions, contributes to the broader field of visual inference underlying perceptual phenomena, and has implications for developing more biologically inspired learning algorithms.

In natural vision, feedback connections support versatile visual inference capabilities such as making sense of the occluded or noisy bottom-up sensory information or mediating pure top-down processes such as imagination. However, the mechanisms by which the feedback pathway learns to give rise to these capabilities flexibly are not clear. We propose that top-down effects emerge through alignment between feedforward and feedback pathways, each optimizing its own objectives. To achieve this co-optimization, we introduce Feedback-Feedforward Alignment (FFA), a learning algorithm that leverages feedback and feedforward pathways as mutual credit assignment computational graphs, enabling alignment. In our study, we demonstrate the effectiveness of FFA in co-optimizing classification and reconstruction tasks on widely used MNIST and CIFAR10 datasets. Notably, the alignment mechanism in FFA endows feedback connections with emergent visual inference functions, including denoising, resolving occlusions, hallucination, and imagination. Moreover, FFA offers bio-plausibility compared to traditional backpropagation (BP) methods in implementation. By repurposing the computational graph of credit assignment into a goal-driven feedback pathway, FFA alleviates weight transport problems encountered in BP, enhancing the bio-plausibility of the learning algorithm. Our study presents FFA as a promising proof-of-concept for the mechanisms underlying how feedback connections in the visual cortex support flexible visual functions. This work also contributes to the broader field of visual inference underlying perceptual phenomena and has implications for developing more biologically inspired learning algorithms.

Efficient Equivariant Transfer Learning from Pretrained Models
Sourya Basu Pulkit Katdare Prasanna Sattigeri Vijil Chenthamarakshan Katherine Rose Driggs-Campbell Payel Das Lav R. Varshney



Research question: How to improve the efficiency of foundation models on diverse downstream tasks, especially when data is limited.
Motivation: Efficient transfer-learning methods are key to handling diverse downstream tasks successfully.
Method: λ-equitune, an equivariant averaging method that weights features from group-transformed inputs by importance weights λ learned directly from the data, yielding better features.
Results: Experiments show the method outperforms the existing group-averaging approach (equitune) on zero-shot and fine-tuned tasks, and its effectiveness is validated across a range of applications and models.

Efficient transfer learning algorithms are key to the success of foundation models on diverse downstream tasks even with limited data. Recent works of Basu et al. (2023) and Kaba et al. (2022) propose group averaging (equitune) and optimization-based methods, respectively, over features from group-transformed inputs to obtain equivariant outputs from non-equivariant neural networks. While Kaba et al. (2022) are only concerned with training from scratch, we find that equitune performs poorly on equivariant zero-shot tasks despite good finetuning results. We hypothesize that this is because pretrained models provide better quality features for certain transformations than others and simply averaging them is deleterious. Hence, we propose λ-equitune that averages the features using importance weights, λs. These weights are learned directly from the data using a small neural network, leading to excellent zero-shot and finetuned results that outperform equitune. Further, we prove that λ-equitune is equivariant and a universal approximator of equivariant functions. Additionally, we show that the method of Kaba et al. (2022) used with appropriate loss functions, which we call equizero, also gives excellent zero-shot and finetuned performance. Both equitune and equizero are special cases of λ-equitune. To show the simplicity and generality of our method, we validate on a wide range of diverse applications and models such as 1) image classification using CLIP, 2) deep Q-learning, 3) fairness in natural language generation (NLG), 4) compositional generalization in languages, and 5) image classification using pretrained CNNs such as Resnet and Alexnet.
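
A minimal sketch of λ-equitune for the group of 90-degree rotations, with toy stand-ins for the pretrained feature extractor and the weight network. Because the weights are computed from the transformed inputs themselves, the weighted pool is exactly invariant in this invariant-feature special case:

```python
# Weighted group averaging over the orbit of 90-degree rotations.
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Stub pretrained, non-equivariant feature extractor."""
    return np.array([x.sum(), (x * np.arange(x.size).reshape(x.shape)).sum()])

def lam(z):
    """Stub importance weight; the paper learns this with a small network."""
    return float(np.exp(0.1 * features(z)[0]))

def lambda_equitune(x):
    transformed = [np.rot90(x, k) for k in range(4)]   # group orbit of x
    w = np.array([lam(z) for z in transformed])
    w = w / w.sum()                                    # normalized weights
    return sum(wi * features(z) for wi, z in zip(w, transformed))

x = rng.normal(size=(4, 4))
print(lambda_equitune(x))
print(lambda_equitune(np.rot90(x)))   # identical: the orbit is unchanged
```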

Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Abulhair Saparov Richard Yuanzhe Pang Vishakh Padmakumar Nitish Joshi Mehran Kazemi Najoung Kim He He



Research question: Do large language models possess general deductive reasoning ability, and how do they generalize across proof complexity and deduction-rule diversity?
Motivation: How LLMs perform on complex proof-based reasoning tasks is unclear and requires further study and testing.
Method: Construct a new controllable, synthetic, and programmable reasoning dataset covering diverse deduction rules and proof complexities, and experiment on four LLMs of various sizes and training objectives.
Results: The LLMs can generalize to compositional proofs, but they struggle with longer proofs and with particular proof styles (proof by cases and proof by contradiction), requiring explicit demonstrations to produce hypothetical subproofs.

Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.

TOA: Task-oriented Active VQA
Xiaoying Xing Mingfu Liang Ying Wu



Research question: How to let large language models effectively understand image inputs, extracting visual information as input for knowledge-based visual question answering.
Motivation: Current large language models perform well on knowledge-driven tasks but cannot effectively understand image inputs, so a way to extract image information and feed it to large language models is needed.
Method: Let the large language model make an initial hypothesis based on its knowledge, then actively collect the visual evidence needed to verify the hypothesis, drawing on vision modules from the perspectives of spatial attention (where to look) and attribute attention (what to look at), similar to human cognition.
Results: Experiments show the method outperforms baselines on open-ended knowledge-based VQA datasets and exhibits a clearer reasoning procedure with better interpretability.

Knowledge-based visual question answering (VQA) requires external knowledge to answer the question about an image. Early methods explicitly retrieve knowledge from external knowledge bases, which often introduce noisy information. Recently large language models like GPT-3 have shown encouraging performance as an implicit knowledge source and revealed planning abilities. However, current large language models cannot effectively understand image inputs, thus it remains an open problem to extract image information and feed it to large language models. Prior works have used image captioning and object descriptions to represent the image. However, they may either drop the essential visual information to answer the question correctly or involve irrelevant objects to the task-of-interest. To address this problem, we propose to let large language models make an initial hypothesis according to their knowledge, then actively collect the visual evidence required to verify the hypothesis. In this way, the model can attend to the essential visual information in a task-oriented manner. We leverage several vision modules from the perspectives of spatial attention (i.e., Where to look) and attribute attention (i.e., What to look), which is similar to human cognition. The experiments show that our proposed method outperforms the baselines on open-ended knowledge-based VQA datasets and presents a clear reasoning procedure with better interpretability.

Statistical Knowledge Assessment for Large Language Models
Qingxiu Dong Jingjing Xu Lingpeng Kong Zhifang Sui Lei Li



Research question: Can a large language model (LLM) reliably generate factually correct answers given varying prompts?
Motivation: Existing LLMs may produce different responses to the same fact, so a method is needed to assess and quantify the knowledge in an LLM.
Method: KaRR, a statistical approach to assessing factual knowledge in LLMs. The main idea is to estimate the ratio between the probability that the LLM generates text corresponding to the answer entity, given diverse prompts for the subject and the queried relation, and the probability that it does so by random chance.
Results: Experiments show the method correlates strongly with human assessments of LLMs (Kendall's tau of 0.43). The results reveal that knowledge in LLMs sharing a backbone architecture follows the scaling law, while tuning on instruction-following data sometimes compromises a model's ability to reliably generate factually correct text.

Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? Existing LLMs may generate distinct responses for different prompts. In this paper, we study the problem of quantifying knowledge contained in an LLM regarding a given set of facts. We propose KaRR, a statistical approach to assess factual knowledge for LLMs. The main idea is to estimate the ratio of the LLM generating text corresponding to the answer entity given diverse prompts of the subject and the querying relation, versus generating it by random chance. Our assessment suite contains a comprehensive set of 994,123 entities and 600 relations, with 1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a strong correlation (0.43 Kendall's $\tau$) with the results of human assessment on LLMs. Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.

Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Jae Sung Park Jack Hessel Khyathi Chandu Paul Pu Liang Ximing Lu Peter West Youngjae Yu Qiuyuan Huang Jianfeng Gao Ali Farhadi Yejin Choi



Research question: Existing vision-language models do not directly let users "point to" and access specific regions within images, which matters both for reference-grounded vision-language benchmarks and for practical applications requiring precise within-image reasoning.
Motivation: To address this, the authors build a Localized Visual Commonsense model that allows users to specify multiple input regions.
Method: The model is trained by sampling localized commonsense knowledge from a large language model: the LLM is prompted to produce commonsense knowledge given a global literal image description and local literal region descriptions automatically generated by a set of vision-language models. The pipeline is scalable and fully automatic, requiring no aligned or human-authored image-text pairs.
Results: With a separately trained critic model selecting high-quality examples, training on the localized commonsense corpus expanded solely from images successfully distills existing vision-language models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings show the distillation yields vision-language models that reason more precisely than a baseline that passes generated referring expressions.

Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also, for practical applications that require precise within-image reasoning. We build a Localized Visual Commonsense model which allows users to specify (multiple) regions-as-input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt a LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.

Scaling laws for language encoding models in fMRI
Richard Antonello Aditya Vaidya Alexander Huth



Research question: This paper tests whether larger open-source models, such as those from the OPT and LLaMA families, are better at predicting brain responses recorded with functional MRI (fMRI).
Motivation: Most studies comparing language models with brains have used GPT-2 or similarly sized models; the authors ask whether larger open-source models predict brain responses more effectively.
Method: Compare language models of different sizes (from 125M to 30B parameters), measuring predictive performance via correlation with a held-out test set across three subjects.
Results: Brain prediction performance scales logarithmically with model size, with roughly 15% higher encoding performance; similar log-linear behavior is observed when scaling the fMRI training set. Comparable improvements with model size are found for acoustic encoding models using HuBERT, WavLM, and Whisper. A noise-ceiling analysis shows these large, high-performance encoding models are nearing the theoretical maximum in brain areas such as the precuneus and higher auditory cortex. These results suggest that scaling both models and data will yield highly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.

Representations from transformer-based unidirectional language models are known to be effective at predicting brain responses to natural language. However, most studies comparing language models to brains have used GPT-2 or similarly sized language models. Here we tested whether larger open-source models such as those from the OPT and LLaMA families are better at predicting brain responses recorded using fMRI. Mirroring scaling results from other contexts, we found that brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with ~15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects. Similar log-linear behavior was observed when scaling the size of the fMRI training set. We also characterized scaling for acoustic encoding models that use HuBERT, WavLM, and Whisper, and we found comparable improvements with model size. A noise ceiling analysis of these large, high-performance encoding models showed that performance is nearing the theoretical maximum for brain areas such as the precuneus and higher auditory cortex. These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.

LIMA: Less Is More for Alignment
Chunting Zhou Pengfei Liu Puxin Xu Srini Iyer Jiao Sun Yuning Mao Xuezhe Ma Avia Efrat Ping Yu LILI YU Susan Zhang Gargi Ghosh Mike Lewis Luke Zettlemoyer Omer Levy



Research question: Large language models are trained in two stages, unsupervised pre-training and large-scale instruction tuning with reinforcement learning; this work measures the relative importance of the two stages.
Motivation: To probe how much each stage contributes, the authors fine-tune a 65B-parameter LLaMa model with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human-preference modeling.
Method: Train LIMA, which learns specific response formats from only a handful of examples in the training data, including complex queries ranging from planning trip itineraries to speculating about alternate history.
Results: LIMA shows remarkably strong performance and generalizes well to unseen tasks. In a controlled human study, LIMA's responses are equivalent to or preferred over GPT-4's in 43% of cases; the figure rises to 58% against Bard and 65% against DaVinci003, which was trained with human feedback. These results strongly suggest that almost all knowledge in large language models is learned during pre-training, and only limited instruction-tuning data is needed to teach models to produce high-quality output.

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43\% of cases; this statistic is as high as 58\% when compared to Bard and 65\% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning
Yingcong Li Kartik Sreenivasan Angeliki Giannou Dimitris Papailiopoulos Samet Oymak



Research question: How does chain-of-thought (CoT) affect language models on complex reasoning tasks, and what mechanism underlies it?
Motivation: Despite CoT's success on complex reasoning tasks, its underlying mechanics are not yet fully understood.
Method: Apply CoT to in-context learning in transformers and study its effect on learning multi-layer perceptrons (MLPs), a simple yet general family of compositional functions.
Results: The success of CoT can be attributed to decomposing the problem into two distinct phases: focusing on and filtering the data relevant to each step of the composition, and in-context learning the single-step composition function. Experimental and theoretical evidence shows that CoT significantly reduces the sample complexity of in-context learning (ICL) and enables the learning of complex functions that non-CoT methods struggle with. Transformers can also move from vanilla in-context learning to mastering compositional functions simply by adding layers that perform the data filtering CoT requires. Beyond these test-time benefits, CoT accelerates pretraining by learning shortcuts that represent complex functions, with filtering playing an important role in the process. Together, these findings illuminate the mechanics of CoT and invite further study of its role in complex reasoning tasks.

Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering data related to each step of the composition and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT by simply incorporating additional layers that perform the necessary data-filtering for CoT via the attention mechanism. In addition to these test-time benefits, we show CoT helps accelerate pretraining by learning shortcuts to represent complex functions and filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.

Generating Images with Multimodal Language Models
Jing Yu Koh Daniel Fried Ruslan Salakhutdinov



Research question: How to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models to obtain cross-modal capabilities.
Motivation: Existing multimodal language models that process image and text inputs often cannot generate coherent image (and text) outputs. This work proposes fusing a frozen text LLM with pre-trained image encoder and decoder models by mapping between their embedding spaces.
Method: An efficient mapping network translates the LLM's hidden text representations into the embedding space of the visual models, leveraging the LLM's strong text representations for visual outputs. A learned decision module chooses at inference time whether to retrieve or generate an image.
Results: The approach outperforms baseline generation models on tasks with longer and more complex language. The model can also retrieve images from a prespecified dataset and decide between retrieval and generation at inference time, outperforming non-LLM-based generation models across several text-to-image tasks that measure context dependence.

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text — outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.

Improving Language Plasticity via Pretraining with Active Forgetting
Yihong Chen Kelly Marchisio Roberta Raileanu David Ifeoluwa Adelani Pontus Stenetorp Sebastian Riedel Mikel Artetxe



Research question: How can pretrained language models (PLMs) be adapted quickly to new languages?
Motivation: Despite strong downstream performance, PLMs are hard to apply to new languages, which limits universal access to their capabilities.
Method: Use an active forgetting mechanism during pretraining: reset the embedding layer every K updates, encouraging the model to improve its ability to learn new embeddings within a limited number of updates, akin to a meta-learning effect.
Results: Models pretrained with the forgetting mechanism not only converge faster during language adaptation but also outperform standard models in low-data regimes, especially for languages distant from English.

Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data and compute inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English. Code will be available at https://github.com/facebookresearch/language-model-plasticity.
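
The mechanism itself is simple enough to sketch. Below is a minimal, hypothetical version of the pretraining loop with embedding resets every K updates; `model.get_input_embeddings()` follows the HuggingFace convention, and the reset interval and init scale are placeholder values, not the paper's settings.

```python
import torch

K = 1000  # reset interval (placeholder value)

def reset_embeddings(model):
    # Re-initialize only the token-embedding layer; the transformer body
    # keeps its weights, which is what drives the meta-learning effect.
    emb = model.get_input_embeddings()
    torch.nn.init.normal_(emb.weight, mean=0.0, std=0.02)

def pretrain(model, dataloader, optimizer, total_steps):
    for step, batch in zip(range(total_steps), dataloader):
        loss = model(**batch).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % K == 0:  # active forgetting
            reset_embeddings(model)
```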

RECKONING: Reasoning through Dynamic Knowledge Encoding
Zeming Chen Gail Weiss Eric Mitchell Asli Celikyilmaz Antoine Bosselut



Research question: Because provided knowledge is not filtered for a particular question, existing transformer-based language models are easily distracted by irrelevant facts, leading to reasoning failures.
Motivation: Address these reasoning failures by improving the model's ability to distinguish the knowledge needed to answer a question from irrelevant information.
Method: Propose RECKONING, a bi-level learning algorithm that folds contextual knowledge into the model's parameters so that questions are answered with the updated parameters. During training, an inner loop rapidly adapts the model weights to encode contextual knowledge, while an outer loop teaches the model to use the updated weights to reproduce and answer reasoning questions about the memorized knowledge.
Results: On three diverse multi-hop reasoning datasets, RECKONING outperforms an in-context reasoning baseline by up to 4.5%. Compared with in-context reasoning, it also generalizes better to unseen longer reasoning chains, is more robust to distractors in the context, and is computationally more efficient when multiple questions are asked about the same knowledge.

Recent studies on transformer-based language models show that they can answer questions by reasoning over knowledge provided as part of the context (i.e., in-context reasoning). However, since the available knowledge is often not filtered for a particular question, in-context reasoning can be sensitive to distractor facts, additional content that is irrelevant to a question but that may be relevant for a different question (i.e., not necessarily random noise). In these situations, the model fails to distinguish the necessary knowledge to answer the question, leading to spurious reasoning and degraded performance. This reasoning failure contrasts with the model’s apparent ability to distinguish its contextual knowledge from all the knowledge it has memorized during pre-training. Following this observation, we propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model’s parameters before presenting it with a question. Our method, RECKONING, is a bi-level learning algorithm that teaches language models to reason by updating their parametric knowledge through back-propagation, allowing them to answer questions using the updated parameters. During training, the inner loop rapidly adapts a copy of the model weights to encode contextual knowledge into its parameters. In the outer loop, the model learns to use the updated weights to reproduce and answer reasoning questions about the memorized knowledge. Our experiments on three diverse multi-hop reasoning datasets show that RECKONING’s performance improves over the in-context reasoning baseline (by up to 4.5%). We also find that compared to in-context reasoning, RECKONING generalizes better to longer reasoning chains unseen during training, is more robust to distractors in the context, and is computationally more efficient when multiple questions are asked about the same knowledge.
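
As a rough illustration of the bi-level structure, here is a heavily simplified, first-order sketch. The paper back-propagates through the inner updates; this version only mirrors the control flow, with HuggingFace-style `.loss` outputs assumed and all hyperparameters as placeholders.

```python
import copy
import torch

def reckoning_step(model, outer_opt, knowledge_batch, qa_batch,
                   inner_steps=2, inner_lr=1e-4):
    # Inner loop: adapt a copy of the weights to encode the contextual knowledge.
    fast = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        loss = fast(**knowledge_batch).loss  # memorize facts via the LM loss
        loss.backward()
        inner_opt.step()
    # Outer loop: answer questions with the updated weights; this loss is the
    # meta-objective that teaches the base model to be adaptable.
    fast.zero_grad()
    outer_loss = fast(**qa_batch).loss
    outer_loss.backward()
    # First-order approximation: copy the gradients back to the base model.
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = None if fp.grad is None else fp.grad.clone()
    outer_opt.step()
    outer_opt.zero_grad()
```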

The Quantization Model of Neural Scaling
Eric J Michaud Ziming Liu Uzay Girit Max Tegmark



Research question: Propose a quantization model of neural scaling laws that explains both the observed power-law drop of loss with model and data size and the sudden emergence of new capabilities with scale.
Motivation: Under the Quantization Hypothesis, network knowledge and skills are "quantized" into discrete chunks (quanta); if quanta are learned in decreasing order of use frequency, a power law in use frequencies explains the observed power-law scaling of loss.
Method: Using language-model gradients, automatically decompose model behavior into a diverse set of skills (quanta), and measure how often these quanta are used in the training distribution.
Results: The frequencies with which the quanta are used are tentatively found to roughly follow a power law whose exponent corresponds to the empirical scaling exponent of language models, as the theory predicts; the model thus accounts for the power-law relation between loss and scale and for the emergence of new capabilities as scale grows.

We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are "quantized" into discrete chunks (quanta). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.

Propagating Knowledge Updates to LMs Through Distillation
Shankar Padmanabhan Yasumasa Onoe Michael JQ Zhang Greg Durrett Eunsol Choi



Research question: How can knowledge stored in modern language models be updated and propagated so that the model can make broader inferences?
Motivation: Existing methods can successfully inject atomic facts, but the updated models fail to reason on the basis of the injected facts.
Method: Propose a context-distillation approach: generate a transfer set, then distill on it to impart entity knowledge and propagate it to enable broader inferences.
Results: Experiments show the approach propagates knowledge updates more effectively than fine-tuning and other gradient-based knowledge-editing methods, without compromising performance in other contexts, even when injecting the definitions of up to 150 entities at once.

Modern language models have the capacity to store and use immense amounts of knowledge about real-world entities, but it remains unclear how to update such knowledge stored in model parameters. While prior methods for updating knowledge in LMs successfully inject atomic facts, updated LMs fail to make inferences based on injected facts. In this work, we demonstrate that a context distillation-based approach can both impart knowledge about entities *and* propagate that knowledge to enable broader inferences. Our approach consists of two stages: transfer set generation and distillation on the transfer set. We first generate a transfer set by prompting a language model to generate continuations from the entity definition. Then, we update the model parameters so that the distribution of the LM (the 'student') matches the distribution of the LM conditioned on the definition (the 'teacher') on the transfer set. Our experiments demonstrate that this approach is more effective at propagating knowledge updates than fine-tuning and other gradient-based knowledge-editing methods. Moreover, it does not compromise performance in other contexts, even when injecting the definitions of up to 150 entities at once.
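
A schematic of the distillation step, assuming the teacher and student are the same LM read with and without the definition in context, HuggingFace-style `.logits`, and a standard forward KL; token alignment and batching details are simplified for illustration.

```python
import torch
import torch.nn.functional as F

def distill_step(model, optimizer, definition_ids, continuation_ids):
    # Teacher: the LM conditioned on the entity definition (no gradients).
    with torch.no_grad():
        teacher_logits = model(
            input_ids=torch.cat([definition_ids, continuation_ids], dim=1)
        ).logits[:, definition_ids.size(1):]
    # Student: the same LM reading the continuation without the definition.
    student_logits = model(input_ids=continuation_ids).logits
    # Match the student's next-token distributions to the teacher's.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```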

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
Ziniu Hu Ahmet Iscen Chen Sun Kai-Wei Chang Yizhou Sun David A Ross Cordelia Schmid Alireza Fathi



Research question: How to effectively answer questions that require external knowledge, such as "What event is commemorated by the building in this image?"
Motivation: Existing visual question answering systems often require hand-designed, complex strategies and decision processes for questions that need external knowledge.
Method: Propose AVIS, an autonomous information-seeking visual question answering framework that uses a large language model to dynamically strategize tool use and analyzes the tools' outputs via tree search to acquire the knowledge needed to answer.
Results: Human decision-making data collected through a user study informs a system composed of a planner, a reasoner, and a working-memory component. Experiments show that AVIS achieves state-of-the-art results on knowledge-based visual question answering benchmarks such as Infoseek and OK-VQA.

In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs via tree search, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-based visual question answering benchmarks such as Infoseek and OK-VQA.

Learning to Reason and Memorize with Self-Notes
Jack Lanchantin Shubham Toshniwal Jason E Weston Arthur Szlam Sainbayar Sukhbaatar



Research question: Large language models struggle with multi-step reasoning and do not retain previous reasoning steps.
Motivation: Propose a method that lets the model take self-notes to address both problems.
Method: Unlike recent chain-of-thought or scratchpad approaches, the model can deviate from the input context at any time to explicitly think and write down its thoughts. This lets the model reason on the fly while reading the context and even integrate previous reasoning steps, enhancing its memory and enabling multi-step reasoning.
Results: Experiments show the method can outperform chain-of-thought and scratchpad approaches by taking self-notes interleaved with the input text.

Large language models have been shown to struggle with multi-step reasoning, and do not retain previous reasoning steps for future use. We propose a simple method for solving both of these problems by allowing the model to take Self-Notes. Unlike recent chain-of-thought or scratchpad approaches, the model can deviate from the input context at any time to explicitly think and write down its thoughts. This allows the model to perform reasoning on the fly as it reads the context and even integrate previous reasoning steps, thus enhancing its memory with useful information and enabling multi-step reasoning. Experiments across a wide variety of tasks demonstrate that our method can outperform chain-of-thought and scratchpad methods by taking Self-Notes that interleave the input text.

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
Mohammadreza Pourreza Davood Rafiei



Research question: There is currently a significant gap on the challenging text-to-SQL task between fine-tuned models and prompting approaches with large language models.
Motivation: To improve LLM performance in the reasoning process, study how decomposing the task into smaller sub-tasks can help.
Method: Break the generation problem into sub-problems and feed the sub-problem solutions into the LLM, which substantially improves its performance.
Results: This approach consistently improves the simple few-shot performance of LLMs by roughly 10%, pushing accuracy toward or past the state of the art. On the Spider holdout test set it reaches 85.3% execution accuracy, surpassing the previous best of 79.9%; on the BIRD benchmark it achieves 55.9% execution accuracy, a new state of the art.

There is currently a significant gap between the performance of fine-tuned models and prompting approaches using Large Language Models (LLMs) on the challenging task of text-to-SQL, as evaluated on datasets such as Spider. To improve the performance of LLMs in the reasoning process, we study how decomposing the task into smaller sub-tasks can be effective. In particular, we show that breaking down the generation problem into sub-problems and feeding the solutions of those sub-problems into LLMs can be an effective approach for significantly improving their performance. Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA, in terms of execution accuracy, was 79.9 and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set.
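
The decomposition lends itself to a short pipeline sketch. The four stages below mirror the paper's schema linking, query classification/decomposition, generation, and self-correction modules; the `llm` callable and the prompt wording are placeholders, not the paper's prompts.

```python
def text_to_sql(llm, question, schema):
    # 1) Schema linking: ground the question in tables, columns, and values.
    links = llm(f"Schema: {schema}\nQuestion: {question}\n"
                "List the tables, columns, and values this question refers to.")
    # 2) Classification & decomposition: pick a query class and sub-steps.
    plan = llm(f"Question: {question}\nSchema links: {links}\n"
               "Classify the query (easy / nested / join) and outline sub-steps.")
    # 3) Generation: produce SQL conditioned on the sub-problem solutions.
    sql = llm(f"Question: {question}\nSchema links: {links}\nPlan: {plan}\n"
              "Write the SQL query.")
    # 4) Self-correction: ask the model to check and fix its own query.
    return llm(f"Schema: {schema}\nQuestion: {question}\nSQL: {sql}\n"
               "Check this SQL for mistakes and return a corrected query.")
```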

Are Diffusion Models Vision-And-Language Reasoners?
Benno Krojer Elinor Poole-Dayan Vikram Voleti Christopher Pal Siva Reddy



Research question: How to perform automatic, fine-grained quantitative evaluation of diffusion-based text-conditioned image generation models.
Motivation: Diffusion-based text-conditioned image generators have shown immense qualitative success, but automatic fine-grained quantitative evaluation of high-level phenomena such as compositionality remains challenging.
Method: Two innovations: first, DiffusionITM, a novel method that applies a diffusion model (here, Stable Diffusion) to any image-text matching (ITM) task; second, GDBench, a Generative-Discriminative Evaluation Benchmark with seven complex vision-and-language tasks, bias evaluation, and detailed analysis.
Results: Stable Diffusion + DiffusionITM is competitive on many tasks and outperforms CLIP on compositional tasks such as CLEVR and Winoground. Fine-tuning on MS-COCO while retaining generative capabilities further improves its compositional performance. Measuring stereotypical bias in diffusion models shows that Stable Diffusion 2.1 is, for the most part, less biased than Stable Diffusion 1.5. Overall, the results point in an exciting direction, bringing discriminative and generative model evaluation closer together.

Text-conditioned image generation models have recently shown immense qualitative success using denoising diffusion processes. However, unlike discriminative vision-and-language models, it is a non-trivial task to subject these diffusion-based generative models to automatic fine-grained quantitative evaluation of high-level phenomena such as compositionality. Towards this goal, we perform two innovations. First, we transform diffusion-based models (in our case, Stable Diffusion) for any image-text matching (ITM) task using a novel method called DiffusionITM. Second, we introduce the Generative-Discriminative Evaluation Benchmark (GDBench) with 7 complex vision-and-language tasks, bias evaluation and detailed analysis. We find that Stable Diffusion + DiffusionITM is competitive on many tasks and outperforms CLIP on compositional tasks like CLEVR and Winoground. We further boost its compositional performance with a transfer setup by fine-tuning on MS-COCO while retaining generative capabilities. We also measure the stereotypical bias in diffusion models, and find that Stable Diffusion 2.1 is, for the most part, less biased than Stable Diffusion 1.5. Overall, our results point in an exciting direction bringing discriminative and generative model evaluation closer. We will release code and benchmark setup soon.

Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
Ying Fan Olivia Watkins Yuqing Du Hao Liu Moonkyung Ryu Craig Boutilier Pieter Abbeel Mohammad Ghavamzadeh Kangwook Lee Kimin Lee



Research question: How to improve text-to-image models from human feedback.
Motivation: Although reward functions can be learned to improve text-to-image models, fine-tuning with the reward function remains challenging.
Method: Propose an online reinforcement learning approach that frames text-to-image fine-tuning as an RL problem and updates a pre-trained text-to-image diffusion model with policy gradient to maximize the feedback-trained reward. The method, DPOK, integrates policy optimization with KL regularization.
Results: Experiments show that DPOK is generally superior to supervised fine-tuning in both image-text alignment and image quality.

Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigated, fine-tuning text-to-image models with the reward function remains challenging. In this work, we propose using online reinforcement learning (RL) to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem, and updating the pre-trained text-to-image diffusion models using policy gradient to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization. We conduct an analysis of KL regularization for both RL fine-tuning and supervised fine-tuning. In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality. Our code is available at https://github.com/google-research/google-research/tree/master/dpok.
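
In schematic form, the KL-regularized objective DPOK optimizes can be written as below; the weight β and the notation are illustrative, and the paper formulates the problem at the level of the diffusion sampling process rather than directly over final images.

```latex
\max_{\theta}\;
\mathbb{E}_{c \sim p(c),\, x_0 \sim p_\theta(\cdot \mid c)}\!\big[ r(x_0, c) \big]
\;-\; \beta\, \mathbb{E}_{c}\!\left[
  \mathrm{KL}\big( p_\theta(x_0 \mid c) \,\big\|\, p_{\mathrm{pre}}(x_0 \mid c) \big)
\right]
```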

Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan Niket Tandon Prakhar Gupta Skyler Hallinan Luyu Gao Sarah Wiegreffe Uri Alon Nouha Dziri Shrimai Prabhumoye Yiming Yang Shashank Gupta Bodhisattwa Prasad Majumder Katherine Hermann Sean Welleck Amir Yazdanbakhsh Peter Clark



Research question: How to improve the initial outputs of large language models through iterative feedback and self-refinement.
Motivation: Inspired by how humans revise written text, propose improving the quality of initial LLM outputs through iterative feedback and refinement.
Method: Generate an initial output with an LLM, then have the same model provide feedback on its output and use that feedback to refine it, iterating the process. This requires no supervised training data, additional training, or reinforcement learning; a single LLM serves as generator, refiner, and feedback provider.
Results: Self-Refine was evaluated on seven diverse tasks, from dialogue response generation to mathematical reasoning, using state-of-the-art LLMs (GPT-3.5, ChatGPT, and GPT-4). Across all tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over conventional one-step generation from the same LLM, improving average task performance by about 20%. This shows that even state-of-the-art LLMs such as GPT-4 can be further improved at test time with this simple, standalone approach.

Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides *feedback* for its output and uses it to *refine* itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner and the feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by $\sim$20\% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test-time using our simple, standalone approach.
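
The loop is easy to state in code. Below is an illustrative version assuming a generic `llm(prompt)` completion function; the prompts and the stopping rule are simplified stand-ins for the paper's task-specific ones.

```python
def self_refine(llm, task_prompt, max_iters=4):
    output = llm(task_prompt)  # initial generation
    for _ in range(max_iters):
        feedback = llm(f"{task_prompt}\n\nDraft:\n{output}\n\n"
                       "Give concrete, actionable feedback on the draft.")
        if "no further improvements" in feedback.lower():
            break  # the model judges its own output good enough
        output = llm(f"{task_prompt}\n\nDraft:\n{output}\n\n"
                     f"Feedback:\n{feedback}\n\n"
                     "Rewrite the draft, applying the feedback.")
    return output
```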

TART: A plug-and-play Transformer module for task-agnostic reasoning
Kush Bhatia Avanika Narayan Christopher De Sa Christopher Re



Research question: Are large language models capable of learning to reason in a task-agnostic manner?
Motivation: Despite their in-context learning abilities, LLMs consistently underperform task-specific fine-tuning approaches.
Method: Propose TART, which improves an LLM's reasoning abilities with a Transformer-based reasoning module trained only on synthetic logistic-regression tasks and composed with the pre-trained model without any additional training.
Results: Experiments show that TART significantly improves performance across model families, model sizes, and tasks; on the RAFT benchmark it lifts GPT-Neo (125M) above Bloom (176B) and to within 4% of GPT-3.

Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and, as a proof of concept, propose TART which generically improves an LLM's reasoning abilities using a synthetically trained reasoning module. TART trains this Transformer-based reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, Bloom), model sizes (100M - 6B), tasks (14 NLP classification tasks), and even across different modalities (audio and vision). On the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms Bloom (176B), and is within $4$% of GPT-3.

The Transient Nature of Emergent In-Context Learning in Transformers
Aaditya K Singh Stephanie C.Y. Chan Ted Moskovitz Erin Grant Andrew M Saxe Felix Hill



Research question: This work examines in-context learning (ICL) in transformers during training, studying how it emerges and disappears.
Motivation: Transformers exhibit striking ICL abilities without being explicitly trained for them, yet prior work largely treats ICL as a persistent phenomenon: once it emerges, it is assumed to stay.
Method: Train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can produce correct predictions, and observe the emergence, disappearance, and hand-off between the two.
Results: ICL during transformer training is typically transient: it emerges, then fades and gives way to IWL. This finding raises new questions about how much to "overtrain" transformers when seeking compact, cheaper-to-run models. L2 regularization may offer a path to more persistent ICL, removing the need for early stopping based on ICL-style validation tasks.

Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL), despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g., through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is treated largely as a persistent phenomenon; namely, once ICL emerges, it is assumed to persist asymptotically. Here, we show that the emergence of ICL during transformer training is, in fact, often transient. We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to “overtrain” transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks.

Im-Promptu: In-Context Composition from Image Prompts
Bhishma Dedhia Michael Chang Jake Snell Thomas L. Griffiths Niraj Jha



Research question: Does the attention mechanism of language models support analogical reasoning, and can composable elements of visual stimuli be composed in context?
Motivation: Large language models solve diverse tasks from a handful of demonstrations, suggesting that attention over word tokens may play a role in analogical reasoning; for visual stimuli, however, the appropriate compositional granularity for in-context learning is usually unspecified.
Method: Propose Im-Promptu, an analogy-based in-context learning framework, and train multiple agents with different compositional granularities, including vector representations, patch representations, and object slots, to test their generalization properties.
Results: Non-compositional representations can extend learned composition rules to unseen domains but perform poorly on combinatorial tasks. Patch-based representations require patches to contain entire objects for robust extrapolation. Object-centric tokenizers coupled with a cross-attention module produce consistent, high-fidelity solutions, and these inductive biases are especially important for compositional generalization. Finally, the authors demonstrate Im-Promptu as an intuitive programming interface for image generation.

Large language models are few-shot learners that can solve diverse tasks from a handful of demonstrations. This implicit understanding of tasks suggests that the attention mechanisms over word tokens may play a role in analogical reasoning. In this work, we investigate whether analogical reasoning can enable in-context composition over composable elements of visual stimuli. First, we introduce a suite of three benchmarks to test the generalization properties of a visual in-context learner. We formalize the notion of an analogy-based in-context learner and use it to design a meta-learning framework called Im-Promptu. Whereas the requisite token granularity for language is well established, the appropriate compositional granularity for enabling in-context generalization in visual stimuli is usually unspecified. To this end, we use Im-Promptu to train multiple agents with different levels of compositionality, including vector representations, patch representations, and object slots. Our experiments reveal tradeoffs between extrapolation abilities and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks. Patch-based representations require patches to contain entire objects for robust extrapolation. At the same time, object-centric tokenizers coupled with a cross-attention module generate consistent and high-fidelity solutions, with these inductive biases being particularly crucial for compositional generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive programming interface for image generation.

A Logic for Expressing Log-Precision Transformers
William Merrill Ashish Sabharwal



Research question: Characterize the reasoning power of transformer-based language models by expressing them in first-order logic.
Motivation: Recent work showed that finite-precision transformer classifiers can be expressed in a generalization of first-order logic, but such transformers are a weak variant; the authors ask whether a minimally more expressive model that can attend universally can also be characterized in logic.
Method: Analyze transformers whose forward pass is computed in log n precision on contexts of length n, and prove that any log-precision transformer classifier can be equivalently expressed as a first-order logic sentence that, beyond standard universal and existential quantifiers, may contain majority-vote quantifiers.
Results: This is the tightest known upper bound and the first logical characterization of log-precision transformers.

One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformer classifiers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a single head can only attend to a constant number of tokens and, in particular, cannot represent uniform attention. Since attending broadly is a core capability for transformers, we ask whether a minimally more expressive model that can attend universally can also be characterized in logic. To this end, we analyze transformers whose forward pass is computed in $\log n$ precision on contexts of length $n$. We prove any log-precision transformer classifier can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.

Analyzing Vision Transformers for Image Classification in Class Embedding Space
Martina G. Vilas Timothy Schaumlöffel Gemma Roig



Research question: Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still lacking.
Motivation: Inspired by prior NLP research, this work introduces a method to reverse-engineer Vision Transformers trained for image classification.
Method: Project internal representations at any level of the hierarchy onto the learned class embedding space to reveal how these networks build categorical representations for their predictions.
Results: Image tokens develop class-specific representations that depend on attention mechanisms and contextual information, with self-attention and MLP layers contributing differentially to this categorical composition. The method can also identify the parts of an image important for detecting the class of interest, and it shows clear advantages over traditional linear probing approaches.

Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class embedding space to uncover how these networks build categorical representations for their predictions. We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information, and give insights on how self-attention and MLP layers differentially contribute to this categorical composition. We additionally demonstrate that this method (1) can be used to determine the parts of an image that would be important for detecting the class of interest, and (2) exhibits significant advantages over traditional linear probing approaches. Taken together, our results position our proposed framework as a powerful tool for mechanistic interpretability and explainability research.

Taking the neural sampling code very seriously: A data-driven approach for evaluating generative models of the visual system
Suhas Shrinivasan Konstantin-Klemens Lurz Kelli Restivo George Denfield Andreas S. Tolias Edgar Y. Walker Fabian H. Sinz



Research question: Address the lack of precise alignment between current theories of perception and neurophysiological data, especially neuronal recordings under natural stimuli.
Motivation: Theories of perception such as the Neural Sampling Code (NSC), while theoretically elegant, do not specify the exact form of the generative model or prescribe how to link the theory to recorded neuronal activity.
Method: Propose a new formalization of NSC that can be fit directly to neuronal activity recorded in response to natural images, supports richer and more flexible generative models, and uses standard metrics to quantitatively evaluate different generative models.
Results: Comparing classical and flexible deep-learning-based generative models on macaque primary visual cortex (V1) recordings shows that the flexible models outperform the classical ones in both generative- and predictive-model performance, a step toward an experimentally informed understanding of the probabilistic computational principles underlying perception and behavior.

Prevailing theories of perception hypothesize that the brain implements perception via Bayesian inference in a generative model of the world. One prominent theory, the Neural Sampling Code (NSC), posits that neuronal responses to a stimulus represent samples from the posterior distribution over latent world state variables that cause the stimulus. Although theoretically elegant, NSC does not specify the exact form of the generative model or prescribe how to link the theory to recorded neuronal activity. Previous works assume simple generative models and test their qualitative agreement with neurophysiological data. Currently, there is no precise alignment of the normative theory with neuronal recordings, especially in response to natural stimuli, and a quantitative, experimental evaluation of models under NSC has been lacking. Here, we propose a novel formalization of NSC, that (a) allows us to directly fit NSC generative models to recorded neuronal activity in response to natural images, (b) formulate richer and more flexible generative models, and (c) employ standard metrics to quantitatively evaluate different generative models under NSC. Furthermore, we derive a stimulus-conditioned predictive model of neuronal responses from the trained generative model using our formalization that we compare to neural system identification models. We demonstrate our approach by fitting and comparing classical- and flexible deep learning-based generative models on population recordings from the macaque primary visual cortex (V1) to natural images, and show that the flexible models outperform classical models in both their generative- and predictive-model performance. Overall, our work is an important step towards a quantitative evaluation of NSC. It provides a framework that lets us \textit{learn} the generative model directly from neuronal population recordings, paving the way for an experimentally-informed understanding of probabilistic computational principles underlying perception and behavior.

Systematic Visual Reasoning through Object-Centric Relational Abstraction
Taylor Whittington Webb Shanka Subhra Mondal Jonathan Cohen



Research question: Achieve strong systematic generalization on tasks involving complex visual displays (including CLEVR-ART, a new dataset with greater visual complexity) by combining explicit representations of objects and relations.
Motivation: Human visual reasoning can identify abstract patterns from only a small number of examples and systematically generalize them to novel inputs, a capacity that depends largely on representing complex visual inputs in terms of objects and relations.
Method: Introduce Object-Centric Relational Abstraction (OCRA), a model that extracts explicit representations of both objects and abstract relations and achieves strong systematic generalization on tasks with complex visual displays.
Results: Experiments show that OCRA performs strongly across tasks, notably on the visually more complex CLEVR-ART dataset.

Human visual reasoning is characterized by an ability to identify abstract patterns from only a small number of examples, and to systematically generalize those patterns to novel inputs. This capacity depends in large part on our ability to represent complex visual inputs in terms of both objects and relations. Recent work in computer vision has introduced models with the capacity to extract object-centric representations, leading to the ability to process multi-object visual inputs, but falling short of the systematic generalization displayed by human reasoning. Other recent models have employed inductive biases for relational abstraction to achieve systematic generalization of learned abstract rules, but have generally assumed the presence of object-focused inputs. Here, we combine these two approaches, introducing Object-Centric Relational Abstraction (OCRA), a model that extracts explicit representations of both objects and abstract relations, and achieves strong systematic generalization in tasks (including a novel dataset, CLEVR-ART, with greater visual complexity) involving complex visual displays.

Goal Driven Discovery of Distributional Differences via Language Descriptions
Ruiqi Zhong Peter Zhang Steve Li Jinwoo Ahn Dan Klein Jacob Steinhardt



Research question: How to effectively compare the differences between two large corpora.
Motivation: Manually exploring large corpora is time-consuming and inefficient, motivating a new task of automatically discovering differences.
Method: Propose the new task D5, which discovers differences between two large corpora in a goal-driven way. The input is a user-specified research goal and a corpus pair; the output is a goal-related description (discovery) of how the corpora differ, e.g., how the side effects of two drugs differ.
Results: A D5 system is built and evaluated on both synthetic and real datasets. Experiments confirm that language models can leverage user-specified goals to propose more relevant candidate discoveries, sometimes producing discoveries previously unknown to the authors, such as demographic differences in discussion topics, political stances in speech, insights in commercial reviews, and error patterns in NLP models. The current system, however, discovers correlation rather than causation and could reinforce societal biases when misused, so practitioners should treat its outputs with caution.

Exploring large corpora can generate useful discoveries but is time-consuming for humans. We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way. The task input is a problem comprising a user-specified research goal (“*comparing the side effects of drug A and drug B*”) and a corpus pair (two large collections of patients' self-reported reactions after taking each drug). The output is a goal-related description (discovery) of how these corpora differ (patients taking drug A “*mention feelings of paranoia*” more often). We build a D5 system, and to quantitatively evaluate its performance, we 1) build a diagnostic benchmark, SynD5, to test whether it can recover known differences between two synthetic corpora, and 2) contribute a meta-dataset, OpenD5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health. With both synthetic and real datasets, we confirm that language models can leverage the user-specified goals to propose more relevant candidate discoveries, and they sometimes produce discoveries previously unknown to the authors, including demographic differences in discussion topics, political stances in speech, insights in commercial reviews, and error patterns in NLP models. Finally, we discuss the limitations of the current D5 system, which discovers correlation rather than causation and has the potential to reinforce societal biases when misused; therefore, practitioners should treat the outputs of our system with caution.

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin Muchao Ye Tianrong Zhang Tianyu Du Jinguo Zhu Han Liu Jinghui Chen Ting Wang Fenglong Ma



Research question: Explore the adversarial robustness of vision-language pre-trained models against black-box fine-tuned models under realistic conditions.
Motivation: Existing approaches mostly study adversarial robustness in the white-box setting, which is unrealistic in practice; this work proposes a new practical task of using pre-trained vision-language models to attack black-box fine-tuned models.
Method: Propose the VLATTACK framework, which generates adversarial samples by fusing image and text perturbations at both the single-modal and multimodal levels. At the single-modal level, a new block-wise similarity attack (BSA) strategy learns image perturbations that disrupt universal representations, and an existing text attack strategy generates text perturbations independently of the image-modal attack. At the multimodal level, a novel iterative cross-search attack (ICSA) method periodically updates adversarial image-text pairs, starting from the single-modal outputs.
Results: Extensive experiments attack three widely used vision-language pre-trained models on six tasks across eight datasets. VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a significant blind spot in the deployment of pre-trained vision-language models.

Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multi-modal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. Besides, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multi-modal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting with the outputs from the single-modal level. We conduct extensive experiments to attack three widely-used VL pretrained models for six tasks on eight datasets. Experimental results show that the proposed VLATTACK framework achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a significant blind spot in the deployment of pre-trained VL models.

Brain encoding models based on multimodal transformers can transfer across language and vision
Jerry Tang Meng Du Vy A. Vo Vasudev Lal Alexander Huth



Research question: Explore how multimodal transformers can provide insight into the brain's capacity for multimodal processing.
Motivation: Current encoding models are typically trained and tested on brain responses to each modality in isolation, even though language and vision rely on similar concept representations.
Method: Use representations from multimodal transformers to train encoding models that transfer between fMRI responses to stories and movies.
Results: Encoding models trained on brain responses to one modality successfully predict brain responses to the other, particularly in cortical regions that represent conceptual meaning. Comparing encoding models trained with multimodal versus unimodal transformer representations shows that multimodal transformers learn more aligned concept representations across language and vision.

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain’s capacity for multimodal processing.

Meet in the Middle: A New Pre-training Paradigm
Anh Tuan Nguyen Nikos Karampatziakis Weizhu Chen



Research question: Most language models are trained and applied autoregressively, left to right, which ignores that the full sequence is available during training.
Motivation: To improve data efficiency, this work proposes a new pre-training paradigm, "Meet in the Middle" (MIM), which trains in both directions, left-to-right and right-to-left, and encourages the respective models to agree on the token distribution at each position.
Method: The primary outcome is an improved left-to-right language model, with secondary gains on infilling; the two pre-trained directions are leveraged in an infilling procedure that builds the completion simultaneously from both sides.
Results: Extensive experiments on programming and natural languages show that MIM significantly surpasses existing pre-training paradigms in both left-to-right generation and infilling.

Most language models (LMs) are trained and applied in an autoregressive left-to-right fashion, predicting the next token from the preceding ones. However, this ignores that the full sequence is available during training. In this paper, we introduce ``Meet in the Middle'' (MIM) a new pre-training paradigm that improves data efficiency by training in two directions, left-to-right and right-to-left, and encouraging the respective models to agree on their token distribution for each position. While the primary outcome is an improved left-to-right LM, we also obtain secondary benefits in the infilling task. There, we leverage the two pre-trained directions to propose an infilling procedure that builds the completion simultaneously from both sides. We conduct extensive experiments on both programming and natural languages and show that MIM significantly surpasses existing pre-training paradigms, in both left-to-right generation as well as infilling. Code and models available at https://github.com/microsoft/Meet-in-the-Middle
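
Schematically, the training objective combines two direction-specific language-modeling losses with an agreement term at each position; the divergence D, the weight λ, and the notation below are illustrative rather than the paper's exact formulation.

```latex
\mathcal{L}(\theta) =
  -\sum_i \log p^{\rightarrow}_\theta(x_i \mid x_{<i})
  \;-\; \sum_i \log p^{\leftarrow}_\theta(x_i \mid x_{>i})
  \;+\; \lambda \sum_i
    D\!\big( p^{\rightarrow}_\theta(\cdot \mid x_{<i}) \,\big\|\,
             p^{\leftarrow}_\theta(\cdot \mid x_{>i}) \big)
```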

Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh Benjamin Elizalde Rita Singh Huaming Wang



Research question: Current audio processing models cannot handle open-ended tasks such as audio captioning or audio question answering.
Motivation: Frame all audio tasks as text-generation tasks and exploit transfer learning to build a new audio language model.
Method: Propose Pengi, which encodes the input audio and text as sequences of continuous embeddings used as a prefix to prompt a pre-trained frozen language model, with no additional fine-tuning or task-specific extensions.
Results: Evaluated on 21 downstream tasks, Pengi achieves state-of-the-art performance on several of them, showing that connecting language models with audio models is a major step toward general-purpose audio understanding.

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input, and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables open-ended tasks and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 21 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.

Beyond MLE: Convex Learning for Text Generation
Chenze Shao Zhengrui Ma Min Zhang Yang Feng



Research question: Argue that maximum likelihood estimation (MLE) is not always necessary or optimal for closed-ended text generation, and propose a new class of training objectives based on convex functions.
Motivation: In closed-ended tasks such as machine translation, the model's goal is to generate the most appropriate response, which does not require estimating the entire data distribution.
Method: Propose convex-function-based training objectives that let text generation models focus on highly probable outputs without estimating the entire data distribution, and study the theoretical properties of the optimal predicted distribution when convex functions are applied to the loss.
Results: Experiments show that the approach substantially improves various text generation tasks and models, enabling autoregressive models to bridge the gap between greedy and beam search and markedly enhancing the generative capability of large language models (LLMs) across tasks.

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary and optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at \url{https://github.com/ictnlp/Convex-Learning}.

Cognitive Steering in Deep Neural Networks via Long-Range Modulatory Feedback Connections
Talia Konkle George A. Alvarez



Research question: How can vision models be given human-like, goal-directed information processing?
Motivation: Standard vision models cannot internally direct attention to enhance goal-relevant information from the rich visual input the way humans can.
Method: Introduce cognitively and biologically inspired long-range modulatory pathways to enable "cognitive steering" in vision models.
Results: Models equipped with these feedback pathways show improved image recognition, adversarial robustness, and brain alignment relative to baseline models, and achieve dramatic gains in recognizing categories within composite multi-category images.

Given the rich visual information available in each glance, humans can internally direct their visual attention to enhance goal-relevant information---a capacity often absent in standard vision models. Here we introduce cognitively and biologically-inspired long-range modulatory pathways to enable 'cognitive steering' in vision models. First, we show that models equipped with these feedback pathways naturally show improved image recognition, adversarial robustness, and increased brain alignment, relative to baseline models. Further, these feedback projections from the final layer of the vision backbone provide a meaningful steering interface, where goals can be specified as vectors in the output space. We show that there are effective ways to steer the model that dramatically improve recognition of categories in composite images of multiple categories, succeeding where baseline feed-forward models without flexible steering fail. And, our multiplicative modulatory motif prevents rampant hallucination of the top-down goal category, dissociating what the model is looking for, from what it is looking at. Thus, these long-range modulatory pathways enable new behavioral capacities for goal-directed visual encoding, offering a flexible communication interface between cognitive and visual systems.

TIES-Merging: Resolving Interference When Merging Models
Prateek Yadav Derek Tam Leshem Choshen Colin Raffel Mohit Bansal



Research question: How to effectively merge multiple pre-trained models into a single model that can perform multiple tasks.
Motivation: Existing merging techniques often ignore interference between the parameters of different models, causing large performance drops when merging several task-specific models.
Method: Propose TrIm, Elect Sign & Merge (TIES-Merging), which merges models in three steps: resetting parameters that changed little during fine-tuning, resolving sign conflicts, and merging only the parameters that agree with the final elected sign.
Results: TIES-Merging outperforms existing methods across modalities, domains, numbers of tasks, model sizes, architectures, and fine-tuning settings. Further analysis shows that different types of interference affect model parameters differently, highlights the importance of signs, and shows that estimating signs with validation data can further improve performance.

Transfer learning – i.e., further fine-tuning a pre-trained model on a downstream task – can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model merging techniques have emerged as a solution to combine multiple task-specific models into a single multitask model without performing additional training. However, existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. In this paper, we demonstrate that prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter’s values across models. To address this, we propose our method, TrIm, Elect Sign & Merge (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign. We find that TIES-Merging outperforms existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, highlight the importance of signs, and show that estimating the signs using the validation data could further improve performance.
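
The three steps translate directly into tensor operations. Here is a minimal sketch operating on a list of task vectors (fine-tuned minus pre-trained weights) for a single parameter tensor; the top-k fraction is an illustrative choice, not the authors' exact setting.

```python
import torch

def ties_merge(task_vectors, k=0.2):
    # 1) Trim: keep only the top-k fraction of each task vector by magnitude.
    trimmed = []
    for tv in task_vectors:
        flat = tv.abs().flatten()
        thresh = flat.kthvalue(int((1 - k) * flat.numel())).values
        trimmed.append(torch.where(tv.abs() > thresh, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)            # (n_models, *param_shape)
    # 2) Elect sign: majority sign, weighted by total magnitude per entry.
    elected = torch.sign(stacked.sum(dim=0))
    # 3) Disjoint merge: average only the entries agreeing with the sign.
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    count = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / count
```

The merged task vector is then added back onto the pre-trained weights to obtain the multitask model.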

Joint processing of linguistic properties in brains and language models
SUBBA REDDY OOTA Manish Gupta Mariya Toneva



Research question: Understand the correspondence between the brain's detailed processing of linguistic information and language models.
Motivation: To probe this correspondence more deeply, eliminate information about specific linguistic properties from language model representations and observe how this affects alignment with fMRI recordings of participants listening to stories.
Method: A direct approach: remove information related to specific linguistic properties from the language model representations and measure the effect on brain alignment.
Results: Eliminating each class of linguistic property (surface, syntactic, and semantic) leads to a significant drop in brain alignment. In particular, syntactic properties (top constituents and tree depth) have the largest effect on the trend of brain alignment across model layers. These findings provide clear evidence for the correspondence between brains and language models and open new avenues for mapping their joint information processing.

Language models have been shown to be very effective in predicting brain recordings of subjects experiencing complex language stimuli. For a deeper understanding of this alignment, it is important to understand the correspondence between the detailed processing of linguistic information by the human brain versus language models. We investigate this correspondence via a direct approach, in which we eliminate information related to specific linguistic properties in the language model representations and observe how this intervention affects the alignment with fMRI brain recordings obtained while participants listened to a story. We investigate a range of linguistic properties (surface, syntactic, and semantic) and find that the elimination of each one results in a significant decrease in brain alignment. Specifically, we find that syntactic properties (i.e. Top Constituents and Tree Depth) have the largest effect on the trend of brain alignment across model layers. These findings provide clear evidence for the role of specific linguistic information in the alignment between brain and language models, and open new avenues for mapping the joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/linguistic-properties-brain-alignment].

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
Jiazheng Xu Xiao Liu Yuchen Wu Yuxuan Tong Qinkai Li Ming Ding Jie Tang Yuxiao Dong



Research question: How to learn from human preference feedback to improve text-to-image models.
Motivation: Existing text-to-image models do not effectively encode human preferences, calling for a better way to optimize them.
Method: Build ImageReward, a general-purpose text-to-image human preference reward model, trained via a systematic annotation pipeline covering rating and ranking that collected 137k expert comparisons; additionally, propose Reward Feedback Learning (ReFL), a direct tuning algorithm for optimizing diffusion models against a scorer.
Results: In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. Both human and automatic evaluation support ReFL's advantages over compared methods. All code and datasets are available at the provided URL.

We present a comprehensive solution to learn and improve text-to-image models from human preference feedback. To begin with, we build ImageReward---the first general-purpose text-to-image human preference reward model---to effectively encode human preferences. Its training is based on our systematic annotation pipeline including rating and ranking, which collects 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. On top of it, we propose Reward Feedback Learning (ReFL), a direct tuning algorithm to optimize diffusion models against a scorer. Both automatic and human evaluation support ReFL's advantages over compared methods. All code and datasets are provided at \url{https://github.com/THUDM/ImageReward}.

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Fuzhao Xue Yao Fu Wangchunshu Zhou Zangwei Zheng Yang You



Research question: Large language models are extremely data-hungry during pre-training, and high-quality web text may be approaching its scaling limit; how can LLM performance be improved further?
Motivation: Repeating the pre-training data for additional epochs is a straightforward way to further train large language models.
Method: Empirically study this approach: explore the consequences of repeating pre-training data, finding that models are prone to overfitting and suffer multi-epoch degradation, and examine the key contributing factors, including dataset size, model parameters, and training objectives.
Results: Most regularization techniques do little to alleviate multi-epoch degradation, but dropout proves remarkably effective. In addition, mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs, potentially benefiting efficient LLM development on a broader scale.

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is likely to be approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.

Parts of Speech–Grounded Subspaces in Vision-Language Models
James Oldfield Christos Tzelepis Yannis Panagakis Mihalis Nicolaou Ioannis Patras



Research question: Latent image representations from vision-language models are useful for many downstream tasks, but their utility is limited by entanglement across different visual attributes.
Motivation: Recent work shows that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in unpredictable ways.
Method: Separate representations of different visual modes of variation in CLIP's joint vision-language space by exploiting the association between parts of speech and visual variation (e.g., nouns relate to objects, adjectives describe appearance). This is achieved with a component analysis model that learns subspaces capturing variability corresponding to a specific part of speech while jointly minimizing variability for the rest.
Results: The subspaces yield closed-form disentangled representations of the different visual attributes of an image or text while respecting the geometry of the underlying manifold. The model also facilitates learning subspaces for specific visual appearances (such as artists' painting styles), enabling the selective removal of entire visual themes from CLIP-based text-to-image synthesis. The model is validated qualitatively, by visualizing subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class-invariance metrics and improvements over baseline zero-shot classification.

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.

Diffused Redundancy in Pre-trained Representations
Vedant Nanda Till Speicher John P Dickerson Krishna P. Gummadi Soheil Feizi Adrian Weller



Research question: Investigate how features are encoded in the representations that neural networks learn when pre-trained on large datasets.
Motivation: The authors observe that representations in a given layer exhibit a degree of diffuse redundancy: any randomly chosen subset of neurons larger than a threshold size is highly similar to the full layer and performs comparably on a variety of downstream tasks.
Method: Pre-train different architectures (including CNNs and Transformers) on ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks from the VTAB benchmark.
Results: The loss and dataset used during pre-training largely govern the degree of diffuse redundancy, and the "critical mass" of neurons needed often depends on the downstream task, suggesting a task-inherent redundancy-performance Pareto frontier. These findings shed light on the representations learned by pre-trained deep networks and suggest that entire layers may not be necessary for many downstream tasks.

Representations learned by pre-training a neural network on a large dataset are increasingly used successfully to perform a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and performs similarly to the full layer on a variety of downstream tasks. For example, a linear probe trained on $20\%$ of randomly picked neurons from the penultimate layer of a ResNet50 pre-trained on ImageNet1k achieves an accuracy within $5\%$ of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss and dataset used during pre-training largely govern the degree of diffuse redundancy and the "critical mass" of neurons needed often depends on the downstream task, suggesting that there is a task-inherent redundancy-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization for downstream tasks and also draw caution to certain possible unintended consequences. Our code is available at \url{https://github.com/nvedant07/diffused-redundancy}.
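
The redundancy claim above reduces to a small experiment: fit a linear probe on a random fraction of pre-extracted features and compare it with a full-layer probe. Below is a minimal sketch of that protocol; the feature arrays, the ResNet50 feature extraction, and the 20% fraction are assumptions standing in for the paper's full setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(train_feats, train_y, test_feats, test_y, frac=1.0, seed=0):
    """Linear-probe accuracy using a random fraction of feature dimensions."""
    rng = np.random.default_rng(seed)
    d = train_feats.shape[1]
    idx = rng.choice(d, size=max(1, int(frac * d)), replace=False)
    clf = LogisticRegression(max_iter=2000).fit(train_feats[:, idx], train_y)
    return clf.score(test_feats[:, idx], test_y)

# Hypothetical usage with precomputed penultimate-layer features:
# full = probe_accuracy(f_tr, y_tr, f_te, y_te, frac=1.0)
# sub  = probe_accuracy(f_tr, y_tr, f_te, y_te, frac=0.2)  # 20% of neurons
```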

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
Arjun Majumdar Karmesh Yadav Sergio Arnaud Yecheng Jason Ma Claire Chen Sneha Silwal Aryan Jain Vincent-Pierre Berges Tingfan Wu Jay Vakil Pieter Abbeel Jitendra Malik Dhruv Batra Yixin Lin Oleksandr Maksymets Aravind Rajeswaran Franziska Meier



Research question: Conduct the largest and most comprehensive empirical study of pre-trained visual representations (PVRs), or visual "foundation models", for Embodied AI.
Motivation: No single dominant PVR has emerged, and the effect of pre-training data scale and diversity on performance remains unclear.
Method: The authors curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation, and systematically evaluate existing PVRs by training vision transformers of different sizes with Masked Auto-Encoding (MAE) on over 4,000 hours of egocentric video from 7 different sources (over 4.3M images) plus ImageNet.
Results: Scaling dataset size and diversity does not improve performance universally (though it does on average). The largest model, VC-1, performs best among all PVRs on average but does not universally dominate either. Task- or domain-specific adaptation of VC-1 yields substantial gains, with VC-1 (adapted) achieving competitive or superior performance on all CortexBench benchmarks. In real-world hardware experiments, VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, the paper proposes no new techniques but offers a rigorous systematic evaluation, a set of findings about PVRs (in some cases contradicting prior work in narrow domains), and open-sourced code and models (requiring over 10,000 GPU-hours to train) for the research community.

We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual ‘foundation models’ for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Next, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving competitive or superior performance than the best known results on all of the benchmarks in CortexBench. Finally, we present real-world hardware experiments, in which VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, this paper presents no new techniques but a rigorous systematic evaluation, a broad set of findings about PVRs (that in some cases, refute those made in narrow domains in prior work), and open-sourced code and models (that required over 10,000 GPU-hours to train) for the benefit of the research community.

AmadeusGPT: a natural language interface for interactive animal behavioral analysis
Shaokai Ye Jessy Lauer Mu Zhou Alexander Mathis Mackenzie W Mathis



Research question: How to turn natural language descriptions of animal behavior into machine-executable code, and how to address the limits of large language models in understanding long, complex context.
Motivation: Bridge the gap between the understanding of animal behavior and the machine learning expertise that behavior analysis requires, and work around the limits of large language models on long conversational memory.
Method: Propose AmadeusGPT, a natural language interface that turns natural language descriptions of behavior into machine-executable code, together with a novel dual-memory mechanism that allows communication between short-term and long-term memory to overcome LLMs' limits on long conversational memory.
Results: Benchmarked on the MABe 2022 behavior challenge tasks, AmadeusGPT shows excellent performance. The system merges deep learning knowledge, large language models, and core computer vision modules into a more naturally intelligent system.

The process of quantifying and analyzing animal behavior involves translating the naturally occurring descriptive language of their actions into machine-readable code. Yet, codifying behavior analysis is often challenging without deep understanding of animal behavior and technical machine learning knowledge. To limit this gap, we introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code. Large-language models (LLMs) such as GPT3.5 and GPT4 allow for interactive language-based queries that are potentially well suited for interactive behavior analysis. However, the comprehension capability of these LLMs is limited by the context window size, which prevents them from remembering distant conversations. To overcome the context window limitation, we implement a novel dual-memory mechanism to allow communication between short-term and long-term memory using symbols as context pointers for retrieval and saving. Concretely, users directly use language-based definitions of behavior and our augmented GPT develops code based on the core AmadeusGPT API, which contains machine learning, computer vision, spatio-temporal reasoning, and visualization modules. Users then can interactively refine results, and seamlessly add new behavioral modules as needed. We used the MABe 2022 behavior challenge tasks to benchmark AmadeusGPT and show excellent performance. Note, an end-user would not need to write any code to achieve this. Thus, collectively AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system. Code and demos can be found at: https://github.com/AdaptiveMotorControlLab/AmadeusGPT

Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language.
Eghbal A. Hosseini Evelina Fedorenko



Research question: Explore how a predictive objective shapes the language representations of autoregressive transformer models.
Motivation: Inspired by work in vision neuroscience, the authors test a hypothesis about the predictive representations of autoregressive transformer models.
Method: Quantify straightness with a 1-dimensional curvature metric and test whether the neural trajectory of a sequence of words in a sentence becomes progressively straighter as it passes through the layers of the network.
Results: In trained models, curvature decreases progressively from the first to the middle layers of the network; larger models trained on larger datasets exhibit greater curvature reduction, which may explain their advantage in language-modeling performance; and model-generated sequences have lower curvature than the ground truth, indicating that models favor straighter trajectories for prediction. These results support the trajectory-straightening hypothesis and suggest a mechanism by which the geometry of internal representations of autoregressive models supports next-word prediction.

Predicting upcoming events is critical to our ability to effectively interact with our environment and conspecifics. In natural language processing, transformer models, which are trained on next-word prediction, appear to construct a general-purpose representation of language that can support diverse downstream tasks. However, we still lack an understanding of how a predictive objective shapes such representations. Inspired by recent work in vision neuroscience Hénaff et al. (2019), here we test a hypothesis about predictive representations of autoregressive transformer models. In particular, we test whether the neural trajectory of a sequence of words in a sentence becomes progressively more straight as it passes through the layers of the network. The key insight behind this hypothesis is that straighter trajectories should facilitate prediction via linear extrapolation. We quantify straightness using a 1-dimensional curvature metric, and present four findings in support of the trajectory straightening hypothesis: i) In trained models, the curvature progressively decreases from the first to the middle layers of the network. ii) Models that perform better on the next-word prediction objective, including larger models and models trained on larger datasets, exhibit greater decreases in curvature, suggesting that this improved ability to straighten sentence neural trajectories may be the underlying driver of better language modeling performance. iii) Given the same linguistic context, the sequences that are generated by the model have lower curvature than the ground truth (the actual continuations observed in a language corpus), suggesting that the model favors straighter trajectories for making predictions. iv) A consistent relationship holds between the average curvature and the average surprisal of sentences in the middle layers of models, such that sentences with straighter neural trajectories also have lower surprisal. Importantly, untrained models don’t exhibit these behaviors. In tandem, these results support the trajectory straightening hypothesis and provide a possible mechanism for how the geometry of the internal representations of autoregressive models supports next word prediction.
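
The 1-dimensional curvature metric admits a simple discrete reading: the average angle between successive difference vectors of a token trajectory at a given layer. The sketch below implements that reading; the exact discretization is our assumption, which the paper defines precisely.

```python
import numpy as np

def mean_curvature(states):
    """Average discrete curvature of a hidden-state trajectory.

    states: (seq_len, dim) array, one vector per token at a given layer.
    The curvature at step t is the angle between consecutive unit difference
    vectors; straighter trajectories give smaller average angles.
    """
    v = np.diff(states, axis=0)                    # (seq_len - 1, dim)
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # unit difference vectors
    cos = np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))          # radians
```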

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Ibrahim Alabdulmohsin Xiaohua Zhai Alexander Kolesnikov Lucas Beyer



Research question: How can compute-optimal model shapes, such as width and depth, be inferred to optimize the performance of vision transformers?
Motivation: Existing approaches mainly improve performance by scaling up model size, which increases compute cost; a more effective way to optimize model shape is needed.
Method: Advance and refine scaling-law methods to infer compute-optimal model shapes, including width and depth, and apply them to vision transformers.
Results: Experiments show the approach effectively optimizes vision transformer performance, outperforming much larger models on multiple tasks while greatly reducing compute cost.

Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also incurring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for a more informed scaling.

Meta-in-context learning in large language models
Julian Coda-Forno Marcel Binz Zeynep Akata Matthew Botvinick Jane X Wang Eric Schulz



Research question: Investigate the meta-in-context learning ability of large language models, i.e., recursively improving their in-context learning through in-context learning itself.
Motivation: In-context learning is seen as one of the main contributors to the strong task performance of large language models; this work asks whether that ability can itself be improved in context.
Method: Study two idealized domains, a one-dimensional regression task and a two-armed bandit task, and then broaden the investigation to two diverse benchmarks: real-world regression problems and multiple NLP tasks.
Results: Meta-in-context learning adaptively reshapes a large language model's priors over expected tasks and modifies its in-context learning strategies, achieving performance competitive with traditional learning algorithms on both benchmarks.

Large language models have shown tremendous performance in a variety of tasks. In-context learning -- the ability to improve at a task after being provided with a number of demonstrations -- is seen as one of the main contributors to their success. In the present paper, we demonstrate that the in-context learning abilities of large language models can be recursively improved via in-context learning itself. We coin this phenomenon meta-in-context learning. Looking at two idealized domains, a one-dimensional regression task and a two-armed bandit task, we show that meta-in-context learning adaptively reshapes a large language model's priors over expected tasks. Furthermore, we find that meta-in-context learning modifies the in-context learning strategies of such models. Finally, we broaden the scope of our investigation to encompass two diverse benchmarks: one focusing on real-world regression problems and the other encompassing multiple NLP tasks. In both cases, we observe competitive performance comparable to that of traditional learning algorithms. Taken together, our work improves our understanding of in-context learning and paves the way toward adapting large language models to the environment they are applied purely through meta-in-context learning rather than traditional finetuning.

ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training
Antonio Norelli Marco Fumero Valentino Maiorca Luca Moschella Emanuele Rodolà Francesco Locatello



Research question: How to create a shared vision-language space that solves many visual tasks without explicit training.
Motivation: Current image and text encoders must be trained on huge datasets; this work proposes a way to create a common space without any training at all.
Method: Use single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs to create the shared space.
Results: Experiments on standard zero-shot visual benchmarks show the typical transfer ability of image-text models, providing a simple yet strong baseline for foundation multimodal models and raising important questions about their data efficiency and the role of retrieval in machine learning.

CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required training image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
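
The training-free common space can be read as a relative-representation construction over the coupled image-text pairs: each input is represented by its similarities to the anchor pairs, so images and texts land in the same anchor-indexed space. The sketch below is a simplified paraphrase of that idea (the paper additionally sparsifies and re-weights the similarity vectors); all embeddings are assumed to be precomputed by frozen unimodal encoders.

```python
import numpy as np

def relative_repr(z, anchors):
    """Represent an embedding by its cosine similarities to anchor embeddings."""
    z = z / np.linalg.norm(z)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return a @ z

def asif_zero_shot(img_emb, class_txt_embs, anchor_img_embs, anchor_txt_embs):
    """Pick the class whose text is most similar to the image in anchor space.

    anchor_img_embs[i] and anchor_txt_embs[i] come from the same image-text
    pair; this coupling is what places the two unimodal spaces in register.
    """
    r_img = relative_repr(img_emb, anchor_img_embs)
    scores = [relative_repr(t, anchor_txt_embs) @ r_img for t in class_txt_embs]
    return int(np.argmax(scores))
```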

Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
Jaemin Cho Abhay Zala Mohit Bansal



Research question: How to use language models as controllers of visual modules for text-to-image (T2I) generation and evaluation.
Motivation: Existing work focuses mainly on equipping language models with visual understanding; this work instead proposes two interpretable/explainable visual programming frameworks for T2I generation and evaluation.
Method: First, introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes generation into three steps: object/count generation, layout generation, and image generation. A language model, finetuned on text-layout pairs, handles the first two steps (object/count generation and layout generation), providing stronger spatial control than end-to-end models; leveraging the world knowledge of pretrained language models also overcomes the limitation of prior layout-guided T2I work restricted to predefined object classes.
Results: VPGen gives better control over object counts/spatial relations/scales than state-of-the-art T2I generation models. Second, VPEval, an interpretable and explainable visual-programming-based evaluation framework for T2I generation, produces evaluation programs that invoke a set of visual modules that are experts in different skills (unlike prior single-scoring-model evaluations that are accurate in some skills but unreliable in others) and provides visual+textual explanations of the results. Analysis shows VPEval correlates better with human judgment on skill-specific and open-ended prompts than widely used single-model baselines.

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that our VPGen has improved control in counts/spatial relations/scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations with a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. We hope that our work encourages future progress on interpretable/explainable generation and evaluation for T2I models.

Trial matching: capturing variability with data-constrained spiking neural networks
Christos Sourmpis Carl C. H. Petersen Wulfram Gerstner Guillaume Bellec



Research question: How to reveal the interactions between neural activity and behavior?
Motivation: Simultaneous behavioral and electrophysiological recordings call for new methods to reveal the interactions between neural activity and behavior.
Method: Model the mouse cortical sensory-motor pathway with a large recurrent spiking neural network (RSNN), fitted to the recordings via gradient-based optimization.
Results: Optimal transport defines a distance between the distributions of generated and recorded trials; applied to artificial data and neural recordings covering six cortical areas, the fitted RSNN generates realistic cortical activity and predicts jaw movements.

Simultaneous behavioral and electrophysiological recordings call for new methods to reveal the interactions between neural activity and behavior. A milestone would be an interpretable model of the co-variability of spiking activity and behavior across trials. Here, we model a mouse cortical sensory-motor pathway in a tactile detection task reported by licking with a large recurrent spiking neural network (RSNN), fitted to the recordings via gradient-based optimization. We focus specifically on the difficulty to match the trial-to-trial variability in the data. Our solution relies on optimal transport to define a distance between the distributions of generated and recorded trials. The technique is applied to artificial data and neural recordings covering six cortical areas. We find that the resulting RSNN can generate realistic cortical activity and predict jaw movements across the main modes of trial-to-trial variability. Our analysis also identifies an unexpected mode of variability in the data corresponding to task-irrelevant movements of the mouse.
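
The optimal-transport distance between generated and recorded trial distributions has an easily checkable special case: with uniform weights over per-trial summary vectors, OT reduces to an optimal one-to-one matching. The sketch below shows that evaluation-style special case; the paper instead uses optimal transport as a differentiable training loss, and the per-trial summaries here are an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def trial_distance(generated, recorded):
    """Matching-based distance between two sets of per-trial summaries.

    generated, recorded: (n_trials, features) arrays (e.g., per-trial
    summaries of firing rates or jaw traces). With uniform weights, optimal
    transport between equal-sized sets reduces to a one-to-one matching.
    """
    cost = np.linalg.norm(generated[:, None, :] - recorded[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return float(cost[rows, cols].mean())
```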

On Evaluating Adversarial Robustness of Large Vision-Language Models
Yunqing Zhao Tianyu Pang Chao Du Xiao Yang Chongxuan Li Ngai-man Cheung Min Lin



Research question: Large vision-language models (VLMs) achieve remarkable performance in response generation, but multimodal generation exacerbates safety concerns, since adversaries may evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision).
Motivation: Evaluate the robustness of open-source large VLMs in the most realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning targeted responses.
Method: First craft targeted adversarial examples against pretrained models such as CLIP and BLIP, then transfer these adversarial examples to other VLMs such as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. Black-box queries on these VLMs further improve the effectiveness of targeted evasion, yielding surprisingly high success rates for generating targeted responses.
Results: The findings provide a quantitative understanding of the adversarial vulnerability of large VLMs and call for a more thorough examination of their potential security flaws before practical deployment.

Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented performance in response generation, especially with visual inputs, enabling more creative and adaptable interaction than large language models such as ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision). To this end, we propose evaluating the robustness of open-source large VLMs in the most realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning the targeted responses. In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP, and then transfer these adversarial examples to other VLMs such as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we observe that black-box queries on these VLMs can further improve the effectiveness of targeted evasion, resulting in a surprisingly high success rate for generating targeted responses. Our findings provide a quantitative understanding regarding the adversarial vulnerability of large VLMs and call for a more thorough examination of their potential security flaws before deployment in practice. Our project page: https://yunqing-me.github.io/AttackVLM/.

The Learnability of In-Context Learning
Noam Wies Yoav Levine Amnon Shashua



Research question: How can modern large language models, without any weight updates, be tuned to perform downstream natural language tasks simply by including training examples of those tasks in their input?
Motivation: Although this emergent learning paradigm has been disruptive for many practical applications of large language models, it is not well understood from a theoretical perspective.
Method: Propose a first-of-its-kind PAC-based framework for in-context learnability and use it to provide the first finite sample-complexity results for the in-context learning setup.
Results: The theoretical analysis shows that in this setting, in-context learning is more about identifying the task than learning it, in line with a series of recent empirical findings. The authors hope the proposed in-context learnability framework will aid further understanding of this important new learning paradigm.

In-context learning is a surprising and important phenomenon that emerged when modern language models were scaled to billions of learned parameters. Without modifying a large language model's weights, it can be tuned to perform various downstream natural language tasks simply by including concatenated training examples of these tasks in its input. Though disruptive for many practical applications of large language models, this emergent learning paradigm is not well understood from a theoretical perspective. In this paper, we propose a first-of-its-kind PAC based framework for in-context learnability, and use it to provide the first finite sample complexity results for the in-context learning setup. Our framework includes an initial pretraining phase, which fits a function to the pretraining distribution, and then a second in-context learning phase, which keeps this function constant and concatenates training examples of the downstream task in its input. We use our framework in order to prove that, under mild assumptions, when the pretraining distribution is a mixture of latent tasks (a model often considered for natural language pretraining), these tasks can be efficiently learned via in-context learning, even though the model's weights are unchanged and the input significantly diverges from the pretraining distribution. Our theoretical analysis reveals that in this setting, in-context learning is more about identifying the task than about learning it, a result which is in line with a series of recent empirical findings. We hope that the in-context learnability framework presented in this paper will facilitate future progress towards a deeper understanding of this important new learning paradigm.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai Junnan Li Dongxu Li Anthony Tiong Junqi Zhao Weisheng Wang Boyang Li Pascale Fung Steven Hoi



Research question: Building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input.
Motivation: Although vision-language pretraining has been widely studied, vision-language instruction tuning based on the pretrained BLIP-2 models remains under-explored.
Method: Gather 26 publicly available datasets covering a wide variety of tasks and capabilities and transform them into instruction-tuning format; additionally, introduce an instruction-aware Query Transformer that extracts informative features tailored to the given instruction.
Results: Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance on all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo models. The model also achieves state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts), and qualitative results show advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-source.

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-source.

Composing Parameter-Efficient Modules with Arithmetic Operation
Jinghan Zhang Shiqi Chen Junteng Liu Junxian He



Research question: How to efficiently adapt pretrained language models to different domains and tasks via parameter-efficient fine-tuning.
Motivation: Conventional full fine-tuning is inefficient; parameter-efficient fine-tuning (PEFT) is becoming the prevailing alternative, but the ability to integrate its modules needs improvement.
Method: Compose parameter-efficient modules through linear arithmetic operations in the weight space, enabling highly flexible module composition without any additional training.
Results: Experiments show the approach produces new and effective parameter-efficient modules that significantly outperform existing ones across all settings.

As an efficient alternative to conventional full fine-tuning, parameter-efficient fine-tuning (PEFT) is becoming the prevailing method to adapt pretrained language models. In PEFT, a lightweight module is learned on each dataset while the underlying pretrained language model remains unchanged, resulting in multiple compact modules representing diverse skills when applied to various domains and tasks. In this paper, we propose to compose these parameter-efficient modules through linear arithmetic operations in the weight space, thereby integrating different module capabilities. Specifically, we first define an addition and negation operator for the module, and then further compose these two basic operators to perform flexible arithmetic. Our approach requires no additional training and enables highly flexible module composition. We apply different arithmetic operations to compose the parameter-efficient modules for (1) distribution generalization, (2) multi-tasking, (3) detoxifying, and (4) domain transfer. Additionally, we extend our approach to detoxify Alpaca-LoRA, the latest instruction-tuned large language model based on LLaMA. Empirical results demonstrate that our approach produces new and effective parameter-efficient modules that significantly outperform existing ones across all settings.
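
At its core, the composition operates directly on adapter weight deltas. The sketch below shows generic addition and negation operators on state dicts of such deltas; the paper defines precise operators per module type, so treat this as a weight-space caricature, with `task_a`, `task_b`, and `toxic` as hypothetical inputs.

```python
def add_modules(m1, m2, w1=1.0, w2=1.0):
    """Addition operator: linearly combine two modules' weight deltas."""
    return {k: w1 * m1[k] + w2 * m2[k] for k in m1}

def negate_module(m):
    """Negation operator: flip a module's deltas to subtract its skill."""
    return {k: -v for k, v in m.items()}

# Hypothetical usage on state dicts of adapter weight deltas:
# multi_task = add_modules(task_a, task_b, 0.5, 0.5)      # multi-tasking
# detoxified = add_modules(task_a, negate_module(toxic))  # detoxifying
```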

Soft-Unification in Deep Probabilistic Logic
Jaron Maene Luc De Raedt



Research question: A fundamental challenge in neuro-symbolic AI is how to devise primitives that fuse logical and neural concepts.
Motivation: Existing systems such as the Neural Theorem Prover do not satisfy desirable properties of the soft-unification operation, including non-redundancy in the proof, well-defined proof scores, and non-sparse gradients, so a more principled framework is needed.
Method: Propose DeepSoftLog, a framework based on probabilistic rather than fuzzy semantics.
Results: Experiments show DeepSoftLog outperforms the state of the art on neuro-symbolic benchmarks, highlighting the benefits of these properties.

A fundamental challenge in neuro-symbolic AI is to devise primitives that fuse the logical and neural concepts. The Neural Theorem Prover has proposed the notion of soft-unification to turn the symbolic comparison between terms (i.e. unification) into a comparison in embedding space. It has been shown that soft-unification is a powerful mechanism that can be used to learn logic rules in an end-to-end differentiable manner. We study soft-unification from a conceptual point of view and outline several desirable properties of this operation. These include non-redundancy in the proof, well-defined proof scores, and non-sparse gradients. Unfortunately, these properties are not satisfied by previous systems such as the Neural Theorem Prover. Therefore, we introduce a more principled framework called DeepSoftLog based on probabilistic rather than fuzzy semantics. Our experiments demonstrate that DeepSoftLog can outperform the state-of-the-art on neuro-symbolic benchmarks, highlighting the benefits of these properties.

Lift Yourself Up: Retrieval-augmented Text Generation with Self-Memory
Xin Cheng Di Luo Xiuying Chen Lemao Liu Dongyan Zhao Rui Yan



Research question: How to improve text generation through better memory.
Motivation: Traditional memory retrieval is constrained by the quality of a fixed corpus and cannot fully exploit human-written reference memory.
Method: Propose a novel framework, selfmem, which iteratively employs a retrieval-augmented generator to create an unbounded memory pool and uses a memory selector to choose one output as memory for the subsequent generation round, letting the model leverage its own output (self-memory) to improve generation.
Results: Evaluated on three distinct text generation tasks, neural machine translation, abstractive text summarization, and dialogue generation, achieving state-of-the-art results under two generation paradigms.

With direct access to human-written reference as memory, retrieval-augmented generation has achieved much progress in a wide range of text generation tasks. Better memory typically prompts better generation (we define this as the primal problem). The traditional approach for memory retrieval involves selecting memory that exhibits the highest similarity to the input. However, this method is constrained by the quality of the fixed corpus from which memory is retrieved. In this paper, by exploring the duality of the primal problem: better generation also prompts better memory, we propose a novel framework, selfmem, which addresses this limitation by iteratively employing a retrieval-augmented generator to create an unbounded memory pool and using a memory selector to choose one output as memory for the subsequent generation round. This enables the model to leverage its own output, referred to as self-memory, for improved generation. We evaluate the effectiveness of selfmem on three distinct text generation tasks: neural machine translation, abstractive text summarization, and dialogue generation, under two generation paradigms: fine-tuned small model and few-shot LLM. Our approach achieves state-of-the-art results in four directions in JRC-Acquis translation dataset, 50.3 ROUGE-1 in XSum, and 62.9 ROUGE-1 in BigPatent, demonstrating the potential of self-memory in enhancing retrieval-augmented generation models. Furthermore, we conduct thorough analyses of each component in the selfmem framework to identify current system bottlenecks and provide insights for future research.
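
The primal-dual loop is compact enough to sketch. Below, `generate` and `select` are hypothetical callables wrapping the trained retrieval-augmented generator and the memory selector; the sketch shows only the iteration structure that turns the model's own outputs into next-round memory.

```python
def selfmem_loop(generate, select, x, init_memory, rounds=3, k=8):
    """Iterate retrieval-augmented generation with self-produced memory.

    generate(x, memory, k): returns k candidate outputs for input x given
    the current memory; select(x, candidates): returns the candidate to use
    as memory in the next round. Both are hypothetical callables wrapping
    the trained generator and memory selector.
    """
    memory = init_memory
    for _ in range(rounds):
        candidates = generate(x, memory, k)  # unbounded self-generated pool
        memory = select(x, candidates)       # chosen output = self-memory
    return generate(x, memory, 1)[0]         # final generation
```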

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen Handong Li Qunbo Wang Zijia Zhao Mingzhen Sun Xinxin Zhu Jing Liu



Research question: Explore the connections between the vision, audio, and subtitle modalities of video and text, and build a corresponding model.
Motivation: Current video-text foundation models focus mainly on the vision and text modalities, while audio, subtitles, and other modalities have not received sufficient attention.
Method: Collect 27 million open-domain video clips, train separate vision and audio captioners, and use a pretrained large language model to integrate the generated captions with subtitles and instructional prompts into omni-modality captions. On the resulting VAST-27M dataset, train VAST, an omni-modality video-text foundation model that can perceive and process the vision, audio, and subtitle modalities of video.
Results: Experiments show VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks.

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we seek to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks.

Guiding Large Language Models via Directional Stimulus Prompting
Zekun Li Baolin Peng Pengcheng He Michel Galley Jianfeng Gao Xifeng Yan



Research question: How to guide large language models (LLMs) toward generating specific desired outputs?
Motivation: Directly adjusting LLMs is challenging, so a new approach to optimizing LLM behavior is needed.
Method: Propose a novel framework, Directional Stimulus Prompting, in which a small tunable policy model generates an auxiliary directional stimulus prompt for each input instance, acting as nuanced, instance-specific hints and clues that guide the LLM toward the desired outcome.
Results: Evaluated on summarization, dialogue response generation, and chain-of-thought reasoning, the method significantly improves the performance of LLMs such as ChatGPT, Codex, and InstructGPT, and with minimal labeled data outperforms some fully supervised state-of-the-art models.

We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) towards specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model (e.g., T5) to generate an auxiliary directional stimulus prompt for each input instance. These directional stimulus prompts act as nuanced, instance-specific hints and clues to guide LLMs in generating desired outcomes, such as including specific keywords in the generated summary. Our approach sidesteps the challenges of direct LLM tuning by optimizing the policy model to explore directional stimulus prompts that align LLMs with desired behaviors. The policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the LLM's output. We evaluate our method across various tasks, including summarization, dialogue response generation, and chain-of-thought reasoning. Our experiments indicate a consistent improvement in the performance of LLMs such as ChatGPT, Codex, and InstructGPT on these supervised tasks with minimal labeled data. Remarkably, by utilizing merely 80 dialogues from the MultiWOZ dataset, our approach boosts ChatGPT's performance by a relative 41.4%, achieving or exceeding the performance of some fully supervised state-of-the-art models. Moreover, the instance-specific chain-of-thought prompt generated through our method enhances InstructGPT's reasoning accuracy, outperforming both generalized human-crafted prompts and those generated through automatic prompt engineering. The code and data are publicly available at https://github.com/Leezekun/Directional-Stimulus-Prompting.

Foundation Model is Efficient Multimodal Multitask Model Selector
Fanqing Meng Wenqi Shao zhanglin peng Chonghe Jiang Kaipeng Zhang Yu Qiao Ping Luo



Research question: How to predict the performance of pre-trained neural networks on multimodal tasks without fine-tuning them.
Motivation: Existing methods are either computationally expensive (full fine-tuning) or dependent on task-specific prior knowledge (lightweight metrics), making them unsuitable for multimodal multi-task scenarios.
Method: Propose an efficient multi-task model selector (EMMS) that uses large-scale foundation models to unify the diverse label formats of different downstream tasks into noisy label embeddings and estimates a model's transferability via weighted linear regression.
Results: Extensive experiments on 5 downstream tasks and 24 datasets show EMMS is fast and effective at assessing the transferability of pre-trained models, the first model-selection method applicable in the multi-task scenario. For instance, compared with the state-of-the-art LogME method, EMMS achieves 9.0%, 26.3%, 20.1%, 54.8%, and 12.2% performance gains on image recognition, referring, captioning, visual question answering, and text question answering, with 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedups in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on multi-modal tasks such as image recognition, referring, captioning, visual question answering, and text question answering without fine-tuning them. A brute-force approach is to finetune all models on all target datasets, bringing high computational costs. Although recent approaches have employed lightweight metrics to measure models’ transferability, they often depend heavily on the prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform diverse label formats such as categories, texts, and bounding boxes of different downstream tasks into a unified noisy label embedding. EMMS can estimate a model’s transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0%, 26.3%, 20.1%, 54.8%, 12.2% performance gain on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13×, 6.29×, 3.59×, 6.19×, and 5.66× speedup in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.
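
The weighted linear regression with alternating minimization can be sketched directly. The version below is a simplified reading under assumed shapes: it alternates between least-squares regression coefficients and crudely projected simplex weights over the foundation-model label embeddings, returning the negated residual as a transferability score; the paper's exact objective, projection, and convergence analysis differ.

```python
import numpy as np

def emms_score(F, label_embs, iters=20):
    """Alternating least squares for an EMMS-style weighted regression.

    F: (n, d) features of one candidate pre-trained model on the target data.
    label_embs: list of K arrays of shape (n, m), noisy label embeddings
    from different foundation models. We alternate between regression
    coefficients B and simplex weights w in ||F @ B - sum_k w_k * Y_k||_F^2
    and return the negated final residual as a transferability score.
    """
    Y = np.stack(label_embs)                # (K, n, m)
    K = Y.shape[0]
    w = np.full(K, 1.0 / K)
    B = None
    for _ in range(iters):
        T = np.tensordot(w, Y, axes=1)      # (n, m) current regression target
        B, *_ = np.linalg.lstsq(F, T, rcond=None)
        P = (F @ B).ravel()                 # flattened prediction
        Yf = Y.reshape(K, -1)
        A = Yf @ Yf.T                       # (K, K) Gram matrix
        w = np.linalg.solve(A + 1e-6 * np.eye(K), Yf @ P)
        w = np.maximum(w, 0.0)              # crude projection onto the simplex
        w = w / (w.sum() + 1e-12)
    resid = F @ B - np.tensordot(w, Y, axes=1)
    return -float(np.mean(resid ** 2))
```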

Visual Instruction Inversion: Image Editing via Image Prompting
Thao Nguyen Yuheng Li Utkarsh Ojha Yong Jae Lee



Research question: How to effectively describe image-editing operations.
Motivation: Language can be ambiguous and ineffective at describing specific image edits; a more intuitive way to convey ideas is needed.
Method: Propose a method for image editing via visual prompting, leveraging the pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions.
Results: Experiments show that with just one example pair, the method achieves results competitive with state-of-the-art text-conditioned image-editing frameworks.

Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given example pairs that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions. Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks.

Complex Query Answering on Eventuality Knowledge Graph with Implicit Logical Constraints
Jiaxin Bai Xin Liu Weiqi Wang Chen Luo Yangqiu Song



Research question: How to query knowledge graphs with deep learning methods, leveraging reasoning and generalization to answer questions better.
Motivation: Traditional neural complex query answering (CQA) methods mostly work on entity-centric knowledge graphs, but in the real world we also need to make logical inferences about events, states, and activities (i.e., eventualities or situations) to push learning systems from System I to System II.
Method: Propose a new framework that uses neural methods to answer complex logical queries over an eventuality-centric knowledge graph (EVKG), satisfying not only traditional first-order logic constraints but also implicit logical constraints on the occurrence and order of eventualities.
Results: Benchmark datasets constructed with theorem provers ensure that answers satisfy the implicit logical constraints, and the proposed Memory-Enhanced Query Encoding (MEQE) approach significantly improves the performance of state-of-the-art neural query encoders on the CEQA task.

Querying knowledge graphs (KGs) using deep learning approaches can naturally leverage the reasoning and generalization ability to learn to infer better answers. Traditional neural complex query answering (CQA) approaches mostly work on entity-centric KGs. However, in the real world, we also need to make logical inferences about events, states, and activities (i.e., eventualities or situations) to push learning systems from System I to System II, as proposed by Yoshua Bengio. Querying logically from an EVentuality-centric KG (EVKG) can naturally support such intuitive and logical inference. Thus, in this paper, we propose a new framework to leverage neural methods to answer complex logical queries based on an EVKG, which can satisfy not only traditional first-order logic constraints but also implicit logical constraints over eventualities concerning their occurrences and orders. For instance, if we know that *Food is bad* happens before *PersonX adds soy sauce*, then *PersonX adds soy sauce* is unlikely to be the cause of *Food is bad* due to implicit temporal constraint. To facilitate consistent reasoning on EVKGs, we propose Complex Eventuality Query Answering (CEQA), a more rigorous definition of CQA that considers the implicit logical constraints governing the temporal order and occurrence of eventualities. In this manner, we propose to leverage theorem provers for constructing benchmark datasets to ensure the answers satisfy implicit logical constraints. We also propose a Memory-Enhanced Query Encoding (MEQE) approach to significantly improve the performance of state-of-the-art neural query encoders on the CEQA task.

Exploring Diverse In-Context Configurations for Image Captioning
Xu Yang Yongliang Wu Mingzhuo Yang Haokun Chen Xin Geng



Research question: Explore the effects of varying configurations on vision-language (VL) in-context learning.
Motivation: Image captioning, viewed as a visually conditioned language model, exhibits in-context learning with distinctive multimodal synergy, yet existing methods configure in-context image-text pairs only by random sampling.
Method: Devise four image-selection strategies and four caption-assignment strategies to configure in-context image-text pairs for the image captioning task.
Results: Comprehensive experiments show that optimized configurations improve the CIDEr score by an average of 20.9 points over the baseline, revealing the multimodal-synergy characteristics of VL in-context learning.

After discovering that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains also develop their few-shot learners, while they only use the simplest way, i.e., random sampling, to configure in-context image-text pairs. In order to explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Here Image Captioning is used as the case study since it can be seen as the visually-conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average enhancement of 20.9 CIDEr points compared to the baseline. The code is given in https://github.com/yongliang-wu/ExploreCfg.

Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions
Kai Liu Zhihang Fu Chao Chen Sheng Jin Ze Chen Mingyuan Tao Rongxin Jiang Jieping Ye



Research question: How to construct precise category descriptions for out-of-distribution (OOD) detection within a vision-language framework.
Motivation: Vision-language models such as CLIP provide significant advances in both generalized feature representation and precise category description, but constructing precise category descriptions is still in its infancy due to the absence of unseen categories.
Method: Introduce two hierarchical contexts, a perceptual context and a spurious context, learned via automatic prompt tuning, to carefully describe the precise category boundary: first roughly classify a sample to the predicted category, then delicately identify whether it is truly an in-distribution sample or actually OOD.
Results: The resulting CATegory-EXtensible OOD detection (CATEX) framework can efficiently extend the set of recognizable categories by merging hierarchical contexts learned under different sub-task settings, and consistently surpasses rivals by a large margin on the challenging ImageNet-1K dataset.

The key to OOD detection has two aspects: generalized feature representation and precise category description. Recently, vision-language models such as CLIP provide significant advances on both fronts, but constructing precise category descriptions is still in its infancy due to the absence of unseen categories. This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive the inter-category difference (e.g., cats vs apples) for current classification tasks, while spurious contexts further identify spurious (similar but exactly not) OOD samples for every single category (e.g., cats vs panthers, apples vs peaches). The two contexts hierarchically construct the precise description for a certain category, which is, first roughly classifying a sample to the predicted category and then delicately identifying whether it is truly an ID sample or actually OOD. Moreover, the precise descriptions for those categories within the vision-language framework present a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. And extensive experiments are conducted to demonstrate CATEX’s effectiveness, robustness, and category-extensibility. For instance, CATEX consistently surpasses the rivals by a large margin with several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up the prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (like GPT-3) to boost zero-shot applications.

What Makes Good Examples for Visual In-Context Learning?
Yuanhan Zhang Kaiyang Zhou Ziwei Liu



Research question: How to better utilize large vision models through in-context learning.
Motivation: Large vision models have great potential but are hard to adapt due to their large parameter size, and often only APIs are available.
Method: Through the lens of in-context learning, propose a prompt retrieval framework that automatically selects visual in-context examples without access to the internal weights of large vision models.
Results: Experiments show the method brings significant improvements over the commonly used random selection, improving visual in-context learning performance.

Large vision models with billions of parameters and trained on broad data have great potential in numerous downstream applications. However, these models are typically difficult to adapt due to their large parameter size and sometimes lack of access to their weights, since entities able to develop large vision models often provide APIs only. In this paper, we study how to better utilize large vision models through the lens of in-context learning, a concept that has been well-known in natural language processing but has only been studied very recently in computer vision. In-context learning refers to the ability to perform inference on tasks never seen during training by simply conditioning on in-context examples (i.e., input-output pairs) without updating any internal model parameters. To demystify in-context learning in computer vision, we conduct extensive research and identify a critical problem: downstream performance is highly sensitive to the choice of visual in-context examples. To address this problem, we propose a prompt retrieval framework specifically for large vision models, allowing the selection of in-context examples to be fully automated. Concretely, we provide two implementations: (i) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (ii) a supervised prompt retrieval method, which trains a neural network to choose examples that directly maximize in-context learning performance. Both methods do not require access to the internal weights of large vision models. Our results demonstrate that our methods can bring non-trivial improvements to visual in-context learning in comparison to the commonly-used random selection. Code and models will be released.
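
The unsupervised variant is essentially nearest-example search in an off-the-shelf feature space. A minimal sketch, assuming precomputed query and pool embeddings:

```python
import numpy as np

def retrieve_prompts(query_emb, pool_embs, k=3):
    """Return indices of the k most similar pool examples (cosine)."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    return np.argsort(-(p @ q))[:k]  # nearest examples first
```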

Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification
Neel Guha Mayee F Chen Kush Bhatia Azalia Mirhoseini Frederic Sala Christopher Re



Research question: How to improve prompt-based learning without additional labeled data?
Motivation: Improving a prompt requires significant labeled data, but modifying its predictions may not.
Method: Propose Embroid, which computes multiple representations of a dataset under different embedding functions and uses the consistency of the LM's predictions for neighboring samples to identify mispredictions. These neighborhoods are then used to create additional predictions for each sample, which are combined with a simple latent-variable graphical model to generate a final corrected prediction.
Results: A rigorous empirical evaluation across six different LMs and up to 95 different tasks shows Embroid substantially improves performance over original prompts (e.g., by an average of 7.3 points on GPT-JT), also realizes improvements for more sophisticated prompting strategies (such as chain-of-thought), and can be specialized to domains like law through the embedding functions.

Recent work has shown that language models' (LMs) prompt-based learning capabilities make them well suited for automating data labeling in domains where manual annotation is expensive. The challenge is that while writing an initial prompt is cheap, improving a prompt is costly---practitioners often require significant labeled data in order to evaluate the impact of prompt modifications. Our work asks whether it is possible to improve prompt-based learning _without_ additional labeled data. We approach this problem by attempting to modify the predictions of a prompt, rather than the prompt itself. Our intuition is that accurate predictions should also be consistent: samples which are similar under some feature representation should receive the same prompt prediction. We propose Embroid, a method which computes multiple representations of a dataset under different embedding functions, and uses the consistency between the LM predictions for neighboring samples to identify mispredictions. Embroid then uses these neighborhoods to create additional predictions for each sample, and combines these predictions with a simple latent variable graphical model in order to generate a final corrected prediction. In addition to providing a theoretical analysis of Embroid, we conduct a rigorous empirical evaluation across six different LMs and up to 95 different tasks. We find that (1) Embroid substantially improves performance over original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also realizes improvements for more sophisticated prompting strategies (e.g., chain-of-thought), and (3) can be specialized to domains like law through the embedding functions.
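
The neighborhood-consistency idea can be sketched with plain k-NN majority votes. The version below replaces the paper's latent-variable graphical model with a simple majority over the base prediction plus one auxiliary vote per embedding space, and assumes binary labels; it illustrates the smoothing mechanism, not the full method.

```python
import numpy as np

def neighborhood_vote(embeddings, preds, k=5):
    """k-NN majority vote over one embedding space (binary predictions)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    nbrs = np.argsort(-sims, axis=1)[:, :k]  # indices of k nearest neighbors
    return (preds[nbrs].mean(axis=1) > 0.5).astype(int)

def embroid_correct(preds, emb_spaces, k=5):
    """Combine the base LM predictions with one vote per embedding space."""
    votes = [preds] + [neighborhood_vote(E, preds, k) for E in emb_spaces]
    return (np.mean(votes, axis=0) > 0.5).astype(int)
```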

Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
Zirui Zhao Wee Sun Lee David Hsu



Research question: How to perform large-scale task planning effectively?
Motivation: Existing methods struggle with large-scale task planning, while large language models (LLMs) show great potential.
Method: Propose a new LLM-MCTS algorithm that uses the LLM both as a world model and as a policy to guide the search, improving the efficiency and effectiveness of task planning.
Results: Experiments show LLM-MCTS outperforms both MCTS alone and methods using the LLM solely as a policy on complex, novel tasks.

Large-scale task planning is a major challenge. Recent work exploits large language models (LLMs) directly as a policy and shows surprisingly interesting results. This paper shows that LLMs provide a commonsense model of the world in addition to a policy that acts on it. The world model and the policy can be combined in a search algorithm, such as Monte Carlo Tree Search (MCTS), to scale up task planning. In our new LLM-MCTS algorithm, the LLM-induced world model provides a commonsense prior belief for MCTS to achieve effective reasoning; the LLM-induced policy acts as a heuristic to guide the search, vastly improving search efficiency. Experiments show that LLM-MCTS outperforms both MCTS alone and policies induced by LLMs (GPT2 and GPT3.5) by a wide margin, for complex, novel tasks. Further experiments and analyses on multiple tasks -- multiplication, travel planning, object rearrangement -- suggest minimum description length (MDL) as a general guiding principle: if the description length of the world model is substantially smaller than that of the policy, using LLM as a world model for model-based planning is likely better than using LLM solely as a policy.

FACE: Evaluating Natural Language Generation with Fourier Analysis of Cross-Entropy
Zuhao Yang Yingfang Yuan Yang Xu SHUO ZHAN Huajun Bai Kefan Chen



Research question: How to measure the distance between machine-generated and human language.
Motivation: Inspired by empirical findings in psycholinguistics on the periodicity of entropy in language, propose FACE, a set of metrics based on Fourier analysis of the estimated cross-entropy of language, for measuring the similarity between model-generated and human-written language.
Method: Evaluate FACE on an open-ended generation task and on experimental data from previous studies.
Results: FACE effectively identifies the human-model gap, scales with model size, reflects the outcomes of different decoding and sampling methods, and correlates well with other evaluation metrics and with human judgment scores.

Measuring the distance between machine-produced and human language is a critical open problem. Inspired by empirical findings from psycholinguistics on the periodicity of entropy in language, we propose FACE, a set of metrics based on Fourier Analysis of the estimated Cross-Entropy of language, for measuring the similarity between model-generated and human-written languages. Based on an open-ended generation task and the experimental data from previous studies, we find that FACE can effectively identify the human-model gap, scales with model size, reflects the outcomes of different sampling methods for decoding, correlates well with other evaluation metrics and with human judgment scores.
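
The metric family is built from the spectrum of a token-level cross-entropy sequence. A minimal sketch, assuming per-token negative log-likelihoods are already computed by some scoring LM, and using a plain Pearson correlation of amplitude spectra where the paper defines several FACE variants:

```python
import numpy as np

def face_spectrum(nll):
    """Normalized amplitude spectrum of a token-level cross-entropy series."""
    x = np.asarray(nll, dtype=float)
    x = x - x.mean()                         # drop the DC component
    amps = np.abs(np.fft.rfft(x))
    return amps / (np.linalg.norm(amps) + 1e-12)

def face_similarity(nll_model, nll_human):
    """Spectral (Pearson) correlation between model and human NLL series."""
    n = min(len(nll_model), len(nll_human))  # truncate to a common length
    a = face_spectrum(nll_model[:n])
    b = face_spectrum(nll_human[:n])
    return float(np.corrcoef(a, b)[0, 1])
```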

Tuning Multi-mode Token-level Prompt Alignment across Modalities
Dongsheng Wang Miaoge Li Xinyang Liu MingSheng Xu Bo Chen Hanwang Zhang



Research question: How to optimize prompt tuning of vision-language models to enhance open-world visual concept comprehension.
Motivation: Current prompt tuning of vision-language models focuses mainly on single-mode prompts and holistic semantic alignment, which fails to capture sample diversity and leads to sub-optimal prompt discovery.
Method: Propose a multi-mode, token-level tuning framework based on optimal transport that learns and aligns a set of prompt tokens across modalities, with two key ingredients: multi-mode prompt discovery and token-level alignment.
Results: Experiments on popular image recognition benchmarks show superior generalization and few-shot ability, and the learned prompt tokens are able to capture diverse visual concepts.

Advancements in prompt tuning of vision-language models have underscored their potential in enhancing open-world visual concept comprehension. However, prior works primarily focus on single-mode (one prompt per modality) and holistic-level (image or sentence) semantic alignment, which fails to capture the sample diversity, leading to sub-optimal prompt discovery. To address the limitation, we propose a multi-mode token-level tuning framework that leverages the optimal transportation to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompts discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens have the ability to capture diverse visual concepts.

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering
Weizhe Lin Jinghong Chen Jingbiao Mei Alexandru Coca Bill Byrne



Research question: How to improve knowledge retrieval in knowledge-based visual question answering (KB-VQA) systems.
Motivation: Existing RA-VQA systems suffer from two problems on KB-VQA tasks: image representations obtained via image-to-text transforms can be incomplete and inaccurate, and similarity scores between queries and documents are computed with one-dimensional embeddings, which are insensitive to fine-grained similarities.
Method: Propose Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which, through a simple alignment network, uses a vision model aligned with an existing text-based retriever to obtain image representations that complement the image-to-text transform, and encodes images and questions with multi-dimensional embeddings to capture fine-grained similarities between queries and documents.
Results: FLMR improves the original RA-VQA retriever's PRRecall@5 by approximately 8% and achieves roughly 62% VQA score on the OK-VQA dataset.

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate and (2) similarity scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained similarities. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transform using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained similarities between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8\%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve $\sim62$% VQA score in the OK-VQA dataset.
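
The "finer-grained similarity" is late interaction over token-level (multi-dimensional) embeddings, in the style of ColBERT's MaxSim. A minimal sketch of that scoring rule, with query tokens assumed to concatenate text tokens and aligned visual tokens:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim: each query token takes its best-matching
    document token; the maxima are summed into one relevance score.

    query_tokens: (nq, d) token embeddings for the question plus aligned
    visual tokens; doc_tokens: (nd, d) token embeddings for one document.
    """
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                       # (nq, nd) token-level similarities
    return sim.max(dim=1).values.sum()  # MaxSim, summed over query tokens
```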

Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
Jinhyuk Lee Zhuyun Dai Sai Meher Karthik Duddu Tao Lei Iftekhar Naim Ming-Wei Chang Vincent Y Zhao



Research question: Simplify multi-vector retrieval models by rethinking the role of token retrieval, to improve information retrieval.
Motivation: Multi-vector retrieval models such as ColBERT achieve state-of-the-art results on many information retrieval benchmarks, but their non-linear scoring function cannot scale to millions of documents, requiring a complicated and slow three-stage inference pipeline: retrieving initial candidates via token retrieval, gathering all token vectors, and scoring the initial candidate documents.
Method: Propose XTR (ConteXtualized Token Retriever), which introduces a simple, novel objective function that encourages the model to retrieve the most important document tokens first. Improved token retrieval lets XTR rank candidates using only the retrieved tokens rather than all tokens in the document, enabling a newly designed scoring stage that is two to three orders of magnitude cheaper than ColBERT's.
Results: On the popular BEIR benchmark, XTR advances the state of the art by 2.8 nDCG@10 without any distillation. Detailed analysis confirms the decision to revisit the token-retrieval stage, as XTR achieves much better token-retrieval recall than ColBERT.

Multi-vector retrieval models such as ColBERT [Khattab et al., 2020] allow token-level interactions between queries and documents, and hence achieve state of the art on many information retrieval benchmarks. However, their non-linear scoring function cannot be scaled to millions of documents, necessitating a three-stage process for inference: retrieving initial candidates via token retrieval, accessing all token vectors, and scoring the initial candidate documents. The non-linear scoring function is applied over all token vectors of each candidate document, making the inference process complicated and slow. In this paper, we aim to simplify the multi-vector retrieval by rethinking the role of token retrieval. We present XTR, ConteXtualized Token Retriever, which introduces a simple, yet novel, objective function that encourages the model to retrieve the most important document tokens first. The improvement to token retrieval allows XTR to rank candidates only using the retrieved tokens rather than all tokens in the document, and enables a newly designed scoring stage that is two-to-three orders of magnitude cheaper than that of ColBERT. On the popular BEIR benchmark, XTR advances the state-of-the-art by 2.8 nDCG@10 without any distillation. Detailed analysis confirms our decision to revisit the token retrieval stage, as XTR demonstrates much better recall of the token retrieval stage compared to ColBERT.

Preference-grounded Token-level Guidance for Language Model Fine-tuning
Shentao Yang Shujian Zhang Congying Xia Yihao Feng Caiming Xiong Mingyuan Zhou



Research question: How to align language models with sequence-level preferences, an important problem in natural language generation.
Motivation: Preferences are typically provided at the sequence level, while language model training and generation both occur at the token level, creating a granularity mismatch that may complicate the learning problem.
Method: Develop an alternate training process that iterates between grounding sequence-level preferences into token-level training guidance and improving the language model with the learned guidance; the guidance-learning framework extends pairwise preference learning in imitation learning to variable-length LM generation and to exploiting preferences among multiple generations.
Results: Experiments show the method performs competitively on two representative language model tasks: discrete prompt generation and text summarization.

Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the *sequence level* while LM training and generation both occur at the *token level*. There is, therefore, a *granularity mismatch* between the preference and the LM training losses, which may complicate the learning problem. In this paper, we address this issue by developing an alternate training process, where we iterate between grounding the sequence-level preference into token-level training guidance, and improving the LM with the learned guidance. For guidance learning, we design a framework that extends the pairwise-preference learning in imitation learning to both variable-length LM generation and the utilization of the preference among multiple generations. For LM training, based on the amount of supervised data, we present two *minimalist* learning objectives that utilize the learned guidance. In experiments, our method performs competitively on two distinct representative LM tasks --- discrete-prompt generation and text summarization.

Multi-Head Adapter Routing for Cross-Task Generalization
Lucas Caccia Edoardo Ponti Zhan Su Matheus Pereira Nicolas Le Roux Alessandro Sordoni



Research question: Investigate the role adapter routing plays in cross-task generalization and design new variants based on the findings.
Motivation: Current parameter-efficient fine-tuning (PEFT) methods pre-train adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] ($\texttt{Poly}$) jointly learns an inventory of adapters and a routing function that selects a subset of adapters for each task.
Method: Propose Multi-Head Routing ($\texttt{MHR}$), which combines subsets of adapter parameters and outperforms $\texttt{Poly}$ under a comparable parameter budget; by fine-tuning only the routing function and not the adapters ($\texttt{MHR}$-$z$), extreme parameter efficiency with competitive performance is achieved.
Results: $\texttt{Poly}$/$\texttt{MHR}$ performance stems from better multi-task optimization rather than, as previously hypothesized, modular inductive biases that facilitate adapter recombination and local adaptation; indeed, $\texttt{MHR}$ exhibits high gradient alignment between training tasks. Routing is most beneficial during multi-task pre-training rather than few-shot adaptation, motivating $\texttt{MHR}$-$\mu$, which discards routing and fine-tunes the average of the pre-trained adapters on each downstream task, establishing $\texttt{MHR}$-$\mu$ as an effective single-adapter fine-tuning method. Training the average of the pre-trained adapters for a few additional steps on the multi-task training set also makes $\texttt{MHR}$-$\mu$ an effective zero-shot transfer method, yielding up to 3% absolute accuracy gains over the baselines. Code is available at https://github.com/microsoft/mttl.

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] ($\texttt{Poly}$) jointly learns an inventory of adapters and a *routing* function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation. In this paper, we investigate the role that adapter routing plays in its success and design new variants based on our findings. First, we build on the intuition that finer-grained routing provides more expressivity. Hence, we propose $\texttt{MHR}$ (Multi-Head Routing) which combines *subsets* of adapter parameters and outperforms $\texttt{Poly}$ under a comparable parameter budget; by only fine-tuning the routing function and not the adapters ($\texttt{MHR}$-$z$) we achieve competitive performance with extreme parameter efficiency. Second, we find that $\texttt{Poly}$/$\texttt{MHR}$ performance is a result of better multi-task optimization, rather than modular inductive biases that facilitate adapter recombination and local adaptation, as previously hypothesized. In fact, we find that $\texttt{MHR}$ exhibits high gradient alignment between training tasks. We find that routing is most beneficial during multi-task pre-training rather than during few-shot adaptation and propose $\texttt{MHR}$-$\mu$, which discards routing and fine-tunes the average of the pre-trained adapters on each downstream task. This establishes $\texttt{MHR}$-$\mu$ as an effective method for single-adapter fine-tuning. We also show that $\texttt{MHR}$-$\mu$ can be used as an effective zero-shot transfer method by training the average of the pre-trained adapters for a few additional steps on the multi-task training set: this yields gains up to 3\% on absolute accuracy w.r.t. the baselines. Code is available at https://github.com/microsoft/mttl.
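
The $\texttt{MHR}$-$\mu$ recipe itself is a one-liner over adapter checkpoints: drop the router and average the pre-trained adapters parameter-wise before fine-tuning. A minimal sketch, assuming the adapters are given as compatible state dicts (tensor- or array-valued):

```python
def average_adapters(adapter_state_dicts):
    """Parameter-wise average of pre-trained adapters (routing discarded).

    The averaged adapter is then fine-tuned on each downstream task, or
    trained for a few extra steps on the multi-task set for zero-shot use.
    """
    n = len(adapter_state_dicts)
    keys = adapter_state_dicts[0].keys()
    return {k: sum(sd[k] for sd in adapter_state_dicts) / n for k in keys}
```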

Explainable Brain Age Prediction using coVariance Neural Networks
Saurabh Sihag Gonzalo Mateos Corey McMillan Alejandro Ribeiro



Research question: How to predict an individual's "brain age" from brain imaging data while addressing the lack of transparency and methodological justification in existing algorithms.
Motivation: The gap between brain age and chronological age (the "brain age gap") can capture accelerated aging due to adverse health conditions, and therefore reflects increased vulnerability to neurological disease or cognitive impairment. However, the lack of transparency and methodological justification in most existing brain age prediction algorithms has hindered the adoption of brain age for clinical decision support.
Method: The paper proposes an explanation-driven, anatomically interpretable framework for brain age prediction from cortical thickness features. Specifically, the framework extends beyond the coarse metric of brain age gap in Alzheimer's disease (AD) and makes two key observations: (i) VNNs can assign anatomical interpretability to an elevated brain age gap in AD by identifying contributing brain regions; (ii) the interpretability offered by VNNs is contingent on their ability to exploit specific eigenvectors of the anatomical covariance matrix.
Results: Experiments show the framework predicts brain age effectively while offering an explainable and anatomically interpretable perspective on the task.

In computational neuroscience, there has been an increased interest in developing machine learning algorithms that leverage brain imaging data to provide estimates of "brain age" for an individual. Importantly, the discordance between brain age and chronological age (referred to as "brain age gap") can capture accelerated aging due to adverse health conditions and therefore, can reflect increased vulnerability towards neurological disease or cognitive impairments. However, widespread adoption of brain age for clinical decision support has been hindered due to lack of transparency and methodological justifications in most existing brain age prediction algorithms. In this paper, we leverage coVariance neural networks (VNN) to propose an explanation-driven and anatomically interpretable framework for brain age prediction using cortical thickness features. Specifically, our brain age prediction framework extends beyond the coarse metric of brain age gap in Alzheimer’s disease (AD) and we make two important observations: (i) VNNs can assign anatomical interpretability to elevated brain age gap in AD by identifying contributing brain regions, (ii) the interpretability offered by VNNs is contingent on their ability to exploit specific eigenvectors of the anatomical covariance matrix. Together, these observations facilitate an explainable and anatomically interpretable perspective to the task of brain age prediction.
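
For intuition, VNNs build on coVariance filters, which (to the best of my understanding of the VNN literature; treat this as an assumption rather than the paper's exact formulation) apply a learnable polynomial of the sample covariance matrix $C$ to a feature vector, $z = \sum_k h_k C^k x$. A minimal numpy sketch:

```python
# Hedged sketch of a coVariance filter, assumed to be the VNN building
# block: z = sum_k h[k] * C^k x, with learnable taps h[k] in a real VNN.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 subjects x 10 cortical-thickness features
C = np.cov(X, rowvar=False)             # anatomical covariance matrix (10 x 10)

def covariance_filter(C, x, h):
    """Apply z = sum_k h[k] * C^k x without forming matrix powers explicitly."""
    z = np.zeros_like(x)
    Ckx = x.copy()                      # C^0 x
    for hk in h:
        z += hk * Ckx
        Ckx = C @ Ckx                   # advance to C^(k+1) x
    return z

h = np.array([0.5, 0.3, 0.1])           # illustrative filter taps
z = covariance_filter(C, X[0], h)
print(z.shape)
```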

GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning
Haiteng Zhao Shengchao Liu Chang Ma Hannan Xu Jie Fu Zhi-Hong Deng Lingpeng Kong Qi Liu



Research question: How to address the label insufficiency in molecule property prediction caused by expensive lab experiments, and how to better leverage textual knowledge for the task.
Motivation: Existing molecule-text models perform poorly in the zero-shot setting, mainly due to inadequate treatment of instructions and limited capacity for graphs.
Method: The paper proposes GIMLET, which unifies language models for graph and text data. By adopting a generalized position embedding, the model encodes both graph structures and instruction text without additional graph encoding modules. GIMLET also decouples the encoding of graph features from task instructions in the attention mechanism, enhancing the generalization of graph features to novel tasks.
Results: Experiments show that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even approaching supervised GNN models on tasks such as toxcast and muv.

Molecule property prediction has gained significant attention in recent years. The main bottleneck is the label insufficiency caused by expensive lab experiments. In order to alleviate this issue and to better leverage textual knowledge for tasks, this study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. We discover that existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs. To overcome these issues, we propose GIMLET, which unifies language models for both graph and text data. By adopting generalized position embedding, our model is extended to encode both graph structures and instruction text without additional graph encoding modules. GIMLET also decouples the encoding of the graph from task instructions in the attention mechanism, enhancing the generalization of graph features across novel tasks. We construct a dataset consisting of more than two thousand molecule tasks with corresponding instructions derived from task descriptions. We pretrain GIMLET on the molecule tasks along with instructions, enabling the model to transfer effectively to a broad range of tasks. Experimental results demonstrate that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even achieving results close to those of supervised GNN models on tasks such as toxcast and muv.

Evaluating Cognitive Maps and Planning in Large Language Models with CogEval
Ida Momennejad Hosein Hasanbeig Felipe Vieira Frujeri Hiteshi Sharma Nebojsa Jojic Hamid Palangi Robert Ness Jonathan Larson



Research question: This paper addresses the lack of systematic evaluation of large language models (LLMs) and the salient failure modes they exhibit on planning tasks.
Motivation: Most studies of LLMs' cognitive abilities rely on anecdotal evidence, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests.
Method: The paper proposes CogEval, a cognitive-science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. Task prompts are based on human experiments, which offer established construct validity for evaluating planning and are absent from LLM training sets.
Results: While LLMs show apparent competence on a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes on planning tasks, including hallucinated invalid trajectories and falling into loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs, possibly because LLMs do not understand the latent relational structures underlying planning problems (cognitive maps) and fail at unrolling goal-directed trajectories based on that structure.

Recently an influx of studies claims emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities. Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and falling into loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.

Self-Evaluation Guided Beam Search for Reasoning
Yuxi Xie Kenji Kawaguchi Yiran Zhao Xu Zhao Min-Yen Kan Junxian He Qizhe Xie



Research question: Large language models suffer from uncertainty and error accumulation in multi-step reasoning.
Motivation: To address this uncertainty, the paper introduces a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of LLMs.
Method: A decoding algorithm integrates self-evaluation guidance via stochastic beam search. Self-evaluation serves as a better-calibrated automatic criterion that enables efficient search in the reasoning space and improves prediction quality, while stochastic beam search balances exploitation and exploration of the search space with temperature-controlled randomness.
Results: The approach surpasses the corresponding Codex-backboned baselines by 6.34%, 9.56%, and 5.46% in few-shot accuracy on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experiments on arithmetic reasoning also show it outperforms baseline methods under comparable computational budgets. Further analysis of multi-step reasoning finds that the self-evaluation guidance pinpoints logic failures and improves consistency and robustness.

Breaking down a problem into intermediate steps has demonstrated impressive performance in Large Language Model (LLM) reasoning. However, the growth of the reasoning chain introduces uncertainty and error accumulation, making it challenging to elicit accurate final results. To tackle this challenge of uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of LLMs. We propose a decoding algorithm integrating the self-evaluation guidance via stochastic beam search. The self-evaluation guidance serves as a better-calibrated automatic criterion, facilitating an efficient search in the reasoning space and resulting in superior prediction quality. Stochastic beam search balances exploitation and exploration of the search space with temperature-controlled randomness. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34$%, $9.56$%, and $5.46$% on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experiment results with Llama-2 on arithmetic reasoning demonstrate the efficiency of our method in outperforming the baseline methods with comparable computational budgets. Further analysis in multi-step reasoning finds our self-evaluation guidance pinpoints logic failures and leads to higher consistency and robustness. Our code is publicly available at [https://guideddecoding.github.io/](https://guideddecoding.github.io/).
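
A toy sketch of the decoding loop described above, with placeholder functions standing in for the LLM's step proposals and its self-evaluation score (both hypothetical); only the stochastic, temperature-controlled beam update reflects the described algorithm:

```python
# Minimal sketch of stochastic beam search guided by a self-evaluation
# score; `generate_candidates` and `self_eval` stand in for LLM calls.
import math
import random

def generate_candidates(prefix, k=3):
    # Placeholder: the LLM would propose k next reasoning steps here.
    return [prefix + [f"step{len(prefix)}_{i}"] for i in range(k)]

def self_eval(chain):
    # Placeholder: the LLM would score the chain's correctness in [0, 1].
    return random.random()

def stochastic_beam_search(beam_width=2, depth=3, temperature=0.5):
    beams = [([], 0.0)]                              # (chain, cumulative log-score)
    for _ in range(depth):
        pool = []
        for chain, score in beams:
            for cand in generate_candidates(chain):
                pool.append((cand, score + math.log(self_eval(cand) + 1e-9)))
        # Temperature-controlled sampling balances exploitation and exploration.
        weights = [math.exp(s / temperature) for _, s in pool]
        beams = random.choices(pool, weights=weights, k=beam_width)
    return max(beams, key=lambda b: b[1])

print(stochastic_beam_search())
```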

Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Jannik Kossen Mark Collier Basil Mustafa Xiao Wang Xiaohua Zhai Lucas Beyer Andreas Peter Steiner Jesse Berent Rodolphe Jenatton Effrosyni Kokiopoulou



Research question: How to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers.
Motivation: Contrastive models are usually trained from scratch, yet pretrained classifier embeddings can improve performance. Directly replacing the image tower with frozen embeddings, however, excludes any potential benefits of contrastive training.
Method: A flexible strategy that introduces a third tower containing the frozen pretrained embeddings and encourages alignment between this third tower and the main image-text towers.
Results: Experiments show the method consistently outperforms LiT and the CLIP-style from-scratch baseline on retrieval tasks; for classification, it slightly underperforms LiT for JFT-pretrained models but outperforms LiT for ImageNet-21k and Places365 pretraining.

We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits from training the image tower contrastively. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.
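
A hedged sketch of the objective as described: the usual CLIP-style contrastive loss between the two main towers, plus contrastive alignment terms pulling each main tower toward a frozen third tower of pretrained embeddings. The exact 3T loss may differ in weighting and details; this is an illustration, not the paper's implementation.

```python
# Sketch: contrastive image-text loss plus alignment to a frozen third tower.
import torch
import torch.nn.functional as F

def clip_loss(a, b, t=0.07):
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / t
    labels = torch.arange(len(a))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

B, D = 16, 64
img = torch.randn(B, D, requires_grad=True)    # trainable image tower output
txt = torch.randn(B, D, requires_grad=True)    # trainable text tower output
with torch.no_grad():
    third = torch.randn(B, D)                  # frozen pretrained embeddings

loss = clip_loss(img, txt) + clip_loss(img, third) + clip_loss(txt, third)
loss.backward()
print(loss.item())
```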

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
Yun Xing Jian Kang Aoran Xiao Jiahao Nie Ling Shao Shijian Lu



Research question: Existing vision-language pre-training suffers from a clear semantic misalignment: many visual concepts that appear in images are missing from their paired captions.
Motivation: To close this gap, the paper proposes Concept Curation (CoCu), which builds concept archives and uses vision-driven expansion together with text-to-vision-guided ranking to compensate for the missing semantics.
Method: For each image-text pair, a concept archive maintains potentially visually-matched concepts; relevant concepts are then identified via cluster-guided sampling and fed into pre-training, bridging the gap between visual and textual semantics.
Results: Experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and greatly boosts the language-supervised segmentation baseline, demonstrating the value of closing the semantic gap in pre-training data.

Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from a clear semantic gap between visual and textual modalities: many visual concepts appearing in images are missing from their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such a semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and greatly boosts the language-supervised segmentation baseline by a large margin, suggesting the value of closing the semantic gap in pre-training data.

A Theory of Unsupervised Translation Motivated by Understanding Animal Communication
Shafi Goldwasser David Gruber Adam Tauman Kalai Orr Paradise



Research question: An analytical framework for unsupervised machine translation (UMT) when no parallel translations are available and the source and target corpora address unrelated subject domains or differ in linguistic structure.
Motivation: Given the progress of neural networks in UMT, it is natural to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals.
Method: A theoretical framework for analyzing UMT without parallel translations, related domains, or shared linguistic structure, exemplified with two stylized models of language for which the framework provides bounds on the necessary sample complexity.
Results: The bounds, formally proven and experimentally verified on synthetic data, show that error rates are inversely related to language complexity and the amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.

Neural networks are capable of translating between languages—in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. We propose a theoretical framework for analyzing UMT when no parallel translations are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. We exemplify this theory with two stylized models of language, for which our framework provides bounds on necessary sample complexity; the bounds are formally proven and experimentally verified on synthetic data. These bounds show that the error rates are inversely related to the language complexity and amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.

Beyond Deep Ensembles: A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift
Florian Seligmann Philipp Becker Michael Volpp Gerhard Neumann



Research question: This paper systematically evaluates modern Bayesian deep learning (BDL) algorithms on real-world datasets, focusing on accuracy and calibration under distribution shift.
Motivation: Although BDL is a promising approach to well-calibrated predictions on distribution-shifted data, no large-scale survey has systematically evaluated recent SOTA methods on diverse, realistic, and challenging benchmark tasks.
Method: Modern BDL algorithms are evaluated on real-world datasets from the WILDS collection, which contain challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift, across a wide range of large convolutional and transformer-based neural network architectures.
Results: Ensembling single-mode approximations generally improves the generalization capability and calibration of the models by a significant margin, but a failure mode of ensembles appears when fine-tuning large transformer-based language models. In that setting, variational-inference-based approaches such as last-layer Bayes By Backprop outperform other methods in accuracy by a large margin, while modern approximate inference algorithms such as SWAG achieve the best calibration.

Bayesian deep learning (BDL) is a promising approach to achieve well-calibrated predictions on distribution-shifted data. Nevertheless, there exists no large-scale survey that evaluates recent SOTA methods on diverse, realistic, and challenging benchmark tasks in a systematic manner. To provide a clear picture of the current state of BDL research, we evaluate modern BDL algorithms on real-world datasets from the WILDS collection containing challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift. We compare the algorithms on a wide range of large, convolutional and transformer-based neural network architectures. In particular, we investigate a signed version of the expected calibration error that reveals whether the methods are over- or underconfident, providing further insight into the behavior of the methods. Further, we provide the first systematic evaluation of BDL for fine-tuning large pre-trained models, where training from scratch is prohibitively expensive. Finally, given the recent success of Deep Ensembles, we extend popular single-mode posterior approximations to multiple modes by the use of ensembles. While we find that ensembling single-mode approximations generally improves the generalization capability and calibration of the models by a significant margin, we also identify a failure mode of ensembles when finetuning large transformer-based language models. In this setting, variational inference based approaches such as last-layer Bayes By Backprop outperform other methods in terms of accuracy by a large margin, while modern approximate inference algorithms such as SWAG achieve the best calibration.
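
The signed calibration metric can be sketched as ECE without the absolute value, so its sign distinguishes over- from underconfidence; the equal-width binning below is an assumption, not necessarily the paper's exact estimator.

```python
# Hedged numpy sketch of a signed expected calibration error: positive
# values indicate overconfidence, negative values underconfidence.
import numpy as np

def signed_ece(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    sece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = confidences[mask].mean() - correct[mask].mean()
            sece += mask.mean() * gap      # weight by bin occupancy, keep the sign
    return sece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < conf - 0.1).astype(float)  # systematically overconfident
print(signed_ece(conf, correct))   # expected to come out positive
```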

ANPL: Towards Natural Programming with Interactive Decomposition
Di Huang Ziyuan Nan Xing Hu Pengwei Jin Shaohui Peng Yuanbo Wen Rui Zhang Zidong Du Qi Guo Yewen Pu Yunji Chen



Research question: How to interact effectively with pretrained language models in order to further revise generated programs.
Motivation: Although current pretrained language models can generate plausible programs, it is difficult for users to revise the generated program to meet their specific requirements.
Method: The paper introduces ANPL, an interactive programming system that, via structured decomposition, lets users continually refine generated code by expressing the control/data-flow "sketch" in precise code (e.g., Python) and describing the "holes" (sub-modules) to be implemented in natural language.
Results: ANPL performs well on challenging tasks such as the Abstraction and Reasoning Corpus (ARC), outperforming baseline programming systems that cannot decompose tasks interactively or guarantee that modules compose correctly. Additional evaluations on APPS, HumanEval, and real-world programming tasks validate that the ANPL framework applies to multiple programming domains.

Though LLMs are capable of generating plausible programs, it’s challenging to interact with the LLMs further to revise the program, especially if the user’s specific requirements are different from the initial proposal. In this paper, we introduce ANPL, an interactive programming system that ensures users can always refine the generated code towards their specific programmatic intents via structured decompositions. Borrowing the paradigm of sketching from program synthesis, an ANPL program consists of a set of input-outputs that it must satisfy, a “sketch” — control/data flow expressed in precise code (e.g. Python), and “holes” — sub-modules to be implemented by the LLM specified with natural language. The user revises an ANPL program by either modifying the sketch, changing the language used to describe the holes, or providing additional input-outputs to a particular hole, turning it into a sub-ANPL program that can be solved recursively. This workflow allows the users to offload programming burdens to the LLM as much as possible while retaining the ability to pinpoint and resolve bugs locally, without exposing the rest of the program to the LLM. We deploy ANPL on the Abstraction and Reasoning Corpus (ARC), a set of unique tasks that are challenging for state-of-the-art AI systems, showing it outperforms baseline programming systems that (a) lack the ability to decompose tasks interactively and (b) lack the guarantee that the modules can be correctly composed together. Additional evaluations on APPS, HumanEval, and real-world programming tasks have validated that the ANPL framework is applicable to multiple programming domains. We release the ANPL solutions to the ARC tasks as a dataset, providing insights into how humans decompose novel tasks programmatically.

GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph
Xin Li Dongze Lian Zhihe Lu Jiawang Bai Zhibo Chen Xinchao Wang



Research question: How to use adapter-style efficient transfer learning (ETL) to optimize the performance of vision-language models (VLMs) in the low-data regime.
Motivation: Most adapter-style works face two limitations: they model task-specific knowledge with a single modality only, and they overlook the exploitation of inter-class relationships in downstream tasks, leading to sub-optimal solutions.
Method: An effective adapter-style tuning strategy, GraphAdapter, which explicitly models the dual-modality structure knowledge of the textual and visual modalities with a dual knowledge graph, thereby enriching the textual feature of each prompt.
Results: Extensive experiments on 11 benchmark datasets show that GraphAdapter significantly outperforms previous adapter-based methods.

Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) modeling task-specific knowledge with a single modality only; and (ii) overlooking the exploitation of the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate that, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively. This enables the textual feature of each prompt to leverage the task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms the previous adapter-based methods.

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Wenhai Wang Zhe Chen Xiaokang Chen Jiannan Wu Xizhou Zhu Gang Zeng Ping Luo Tong Lu Jie Zhou Yu Qiao Jifeng Dai



Research question: How to bring the open-ended task capabilities of large language models (LLMs) to computer vision.
Motivation: Despite powerful vision foundation models (VFMs), computer vision remains restricted to tasks in pre-defined forms and cannot match the open-ended task capabilities of LLMs.
Method: An LLM-based framework for vision-centric tasks, VisionLLM, which treats images as a foreign language and aligns vision-centric tasks with language tasks that can be flexibly defined and managed through language instructions; an LLM-based decoder then makes appropriate predictions for open-ended tasks based on those instructions.
Results: Experiments show that VisionLLM achieves different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, the model achieves over 60% mAP on COCO, on par with detection-specific models. The authors hope the model can set a new baseline for generalist vision and language models; the code will be released.

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The code shall be released.

What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom Yonatan Bitton Soravit Changpinyo Roee Aharoni Jonathan Herzig Oran Lang Eran Ofek Idan Szpektor



Research question: Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks.
Motivation: Existing evaluation of text-image alignment is limited; the authors introduce SeeTRUE, a comprehensive evaluation set spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements of whether each text-image pair is semantically aligned.
Method: Two automatic methods to determine alignment: a pipeline based on question generation and visual question answering models, and an end-to-end classification approach obtained by fine-tuning multimodal pretrained models. Both surpass prior approaches across various text-image alignment tasks, with significant improvements on challenging cases involving complex composition or unnatural images.
Results: The methods can localize specific misalignments between an image and a given text and can be used to automatically re-rank candidates in text-to-image generation.

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.

Chatting Makes Perfect: Chat-based Image Retrieval
Matan Levy Rami Ben-Ari Nir Darshan Dani Lischinski



Research question: Existing image retrieval approaches mostly handle a single query-to-image round, and the use of chat for image retrieval has been largely overlooked.
Motivation: Motivated by the capabilities of today's foundation models, the authors leverage large language models to generate follow-up questions to an initial image description, forming a dialog with the user to retrieve the desired image from a large corpus.
Method: ChatIR, a chat-based image retrieval system that converses with the user to elicit information beyond the initial query and clarify the user's search intent.
Results: Experiments show that engaging in dialog yields significant gains in retrieval: after 5 dialog rounds, the system retrieves the target image from a pool of 50K images with over 78% success, compared to 75% when questions are asked by humans and 64% for single-shot text-to-image retrieval.

Chats emerge as an effective user-friendly approach for information retrieval, and are successfully employed in many domains, such as customer service, healthcare, and finance. However, existing image retrieval approaches typically address the case of a single query-to-image round, and the use of chats for image retrieval has been mostly overlooked. In this work, we introduce ChatIR: a chat-based image retrieval system that engages in a conversation with the user to elicit information, in addition to an initial query, in order to clarify the user's search intent. Motivated by the capabilities of today's foundation models, we leverage Large Language Models to generate follow-up questions to an initial image description. These questions form a dialog with the user in order to retrieve the desired image from a large corpus. In this study, we explore the capabilities of such a system tested on a large dataset and reveal that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system is capable of retrieving the target image from a pool of 50K images with over 78% success rate after 5 dialogue rounds, compared to 75% when questions are asked by humans, and 64% for a single shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of ChatIR under different settings. Project repository is available at https://github.com/levymsn/ChatIR.

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation
Yujie Lu Xianjun Yang Xiujun Li Xin Eric Wang William Yang Wang



Research question: Existing automatic evaluation of text-to-image synthesis provides only an image-text matching score without considering object-level compositionality, leading to poor correlation with human judgments.
Motivation: To address this, the paper proposes LLMScore, a new framework offering evaluation scores with multi-granularity compositionality.
Method: LLMScore leverages large language models (LLMs) to evaluate text-to-image models. It first transforms the image into image-level and object-level visual descriptions, then feeds an evaluation instruction into the LLM to measure the alignment between the synthesized image and the text, ultimately producing a score accompanied by a rationale.
Results: Empirical analysis shows that LLMScore attains the highest correlation with human judgments across a wide range of datasets (Attribute Binding Contrast, Concept Conjunction, MSCOCO, DrawBench, PaintSkills). Notably, its Kendall's tau correlation with human evaluations is 58.8% and 31.2% higher than the commonly used text-image matching metrics CLIP and BLIP, respectively.

Existing automatic evaluation on text-to-image synthesis can only provide an image-text matching score, without considering the object-level compositionality, which results in poor correlation with human judgments. In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages the large language models (LLMs) to evaluate text-to-image models. Initially, it transforms the image into image-level and object-level visual descriptions. Then an evaluation instruction is fed into the LLMs to measure the alignment between the synthesized image and the text, ultimately generating a score accompanied by a rationale. Our substantial analysis reveals the highest correlation of LLMScore with human judgments on a wide range of datasets (Attribute Binding Contrast, Concept Conjunction, MSCOCO, DrawBench, PaintSkills). Notably, our LLMScore achieves Kendall's tau correlation with human evaluations that is 58.8% and 31.2% higher than the commonly-used text-image matching metrics CLIP and BLIP, respectively.

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Yongliang Shen Kaitao Song Xu Tan Dongsheng Li Weiming Lu Yueting Zhuang



Research question: How to use large language models (LLMs) as a controller to manage existing AI models for solving complicated AI tasks.
Motivation: Although many AI models exist for different domains and modalities, they cannot handle complicated AI tasks autonomously. Given the exceptional abilities of LLMs in language understanding, generation, interaction, and reasoning, the authors advocate that LLMs can act as a controller that manages existing AI models, with language serving as a generic interface.
Method: HuggingGPT, an LLM-powered agent that uses LLMs (e.g., ChatGPT) to connect the various AI models in machine learning communities (e.g., Hugging Face). On receiving a user request, ChatGPT performs task planning, selects models according to their function descriptions available in Hugging Face, executes each subtask with the selected AI model, and summarizes the response from the execution results.
Results: By leveraging ChatGPT's strong language capability and the abundant AI models in Hugging Face, HuggingGPT tackles a wide range of sophisticated AI tasks spanning different modalities and domains, achieving impressive results in language, vision, speech, and other challenging tasks, and paving a new way toward artificial general intelligence.

Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.
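
The four-stage loop is simple to sketch in plain Python; `chat` and `run_model` below are hypothetical stand-ins for the ChatGPT API and Hugging Face inference, and the canned strings exist only so the sketch runs.

```python
# Sketch of the four HuggingGPT stages: plan, select, execute, summarize.
def chat(prompt: str) -> str:
    # Placeholder for a ChatGPT API call (hypothetical).
    return "image-classification: label the photo\nimage-captioning: describe the photo"

def run_model(model_id: str, task: str) -> str:
    # Placeholder for Hugging Face model inference (hypothetical).
    return f"<output of {model_id} on {task!r}>"

def hugging_gpt(user_request: str) -> str:
    # 1. Task planning: the LLM decomposes the request into subtasks.
    plan = chat(f"Decompose into subtasks: {user_request}")
    results = []
    for task in plan.splitlines():
        # 2. Model selection: pick a model by its function description.
        model_id = chat(f"Pick the best Hugging Face model for: {task}").splitlines()[0]
        # 3. Task execution: run the selected model on the subtask.
        results.append(run_model(model_id, task))
    # 4. Response generation: summarize the execution results for the user.
    return chat(f"Summarize these results: {results}")

print(hugging_gpt("What is in this photo, and can you caption it?"))
```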

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models
Lin Li Jun Xiao Guikun Chen Jian Shao Yueting Zhuang Long Chen



Research question: How to use pretrained vision-language models for zero-shot visual recognition, particularly for the relation detection task.
Motivation: Existing approaches that naively use CLIP for zero-shot visual recognition have several weaknesses, e.g., they struggle to distinguish fine-grained relation types and neglect the spatial information of the two objects.
Method: A new method, RECODE, which solves relation detection via composite description prompts. Each predicate category is first decomposed into subject, object, and spatial components; large language models then generate description-based prompts (visual cues) for each component. The different visual cues enhance the discriminability of similar relation categories from different perspectives, significantly boosting VRD performance.
Results: Extensive experiments on four VRD benchmarks demonstrate the effectiveness and interpretability of RECODE.

Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.
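
A hedged sketch of the scoring idea: a predicate's score combines CLIP similarities of LLM-generated cues for its subject, object, and spatial components, fused with LLM-suggested weights. All names, cues, and weights below are illustrative placeholders, not the paper's actual prompts.

```python
# Sketch of RECODE-style composite-cue scoring; `clip_sim` is a stand-in
# for a real CLIP image-text similarity call.
import numpy as np

def clip_sim(image_region, text: str) -> float:
    # Placeholder: deterministic pseudo-similarity per cue text.
    return np.random.default_rng(abs(hash(text)) % 2**32).uniform()

cues = {
    "riding": {
        "subject": ["a person sitting astride something"],
        "object":  ["an animal or vehicle being sat on"],
        "spatial": ["one object directly above another"],
    },
}
# Weights would come from chain-of-thought prompting of the LLM.
weights = {"subject": 0.4, "object": 0.4, "spatial": 0.2}

def score_predicate(image_region, predicate: str) -> float:
    return sum(
        weights[part] * np.mean([clip_sim(image_region, c) for c in cues[predicate][part]])
        for part in cues[predicate]
    )

print(score_predicate(None, "riding"))
```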

Intriguing Properties of Quantization at Scale
Arash Ahmadian Saurabh Dash Hongyu Chen Bharat Venkitesh Zhen Stephen Gou Phil Blunsom Ahmet Üstün Sara Hooker



Research question: Are quantization-induced performance drops solely a factor of scale?
Motivation: Recent work suggests that the performance trade-off incurred by quantization is an emergent property of large models. This study investigates whether that property is determined by scale alone.
Method: By optimizing and quantizing models at different scales, the authors find that outlier dimensions are not an inherent product of scale but are instead sensitive to the optimization conditions present during pre-training.
Results: Models ranging from 410M to 52B parameters are successfully quantized with minimal degradation in performance.

Emergent properties have been widely adopted as a term to describe behavior not present in smaller models but observed in larger models (Wei et al., 2022a). Recent work suggests that the trade-off incurred by quantization is also an emergent property, with sharp drops in performance in models over 6B parameters. In this work, we ask _are quantization cliffs in performance solely a factor of scale?_ Against a backdrop of increased research focus on why certain emergent properties surface at scale, this work provides a useful counter-example. We posit that it is possible to optimize for a quantization friendly training recipe that suppresses large activation magnitude outliers. Here, we find that outlier dimensions are not an inherent product of scale, but rather sensitive to the optimization conditions present during pre-training. This both opens up directions for more efficient quantization, and poses the question of whether other emergent properties are inherent or can be altered and conditioned by optimization and architecture design choices. We successfully quantize models ranging in size from 410M to 52B with minimal degradation in performance.

InfoPrompt: Information-Theoretic Soft Prompt Tuning for Natural Language Understanding
Junda Wu Tong Yu Rui Wang Zhao Song Ruiyi Zhang Handong Zhao Chaochao Lu Shuai Li Ricardo Henao



Research question: How to improve the performance and robustness of soft prompt tuning.
Motivation: Current soft prompt tuning methods are sensitive to prompt initialization and cannot learn sufficient task-relevant information from prompt tokens.
Method: An information-theoretic framework that formulates soft prompt tuning as maximizing the mutual information between prompts and other model parameters (or encoded representations). Two novel mutual-information loss functions are developed: one to explore proper prompt initialization for the downstream task and learn sufficient task-relevant information from prompt tokens, and one to encourage the pretrained language model's output representation to be more aware of the task-relevant information captured in the learned prompts.
Results: Experiments show that the method significantly accelerates the convergence of prompt tuning and outperforms traditional prompt tuning methods.

Soft prompt tuning achieves superior performances across a wide range of few-shot tasks. However, the performances of prompt tuning can be highly sensitive to the initialization of the prompts. We have also empirically observed that conventional prompt tuning methods cannot encode and learn sufficient task-relevant information from prompt tokens. In this work, we develop an information-theoretic framework that formulates soft prompt tuning as maximizing the mutual information between prompts and other model parameters (or encoded representations). This novel view helps us to develop a more efficient, accurate and robust soft prompt tuning method, InfoPrompt. With this framework, we develop two novel mutual information based loss functions, to (i) explore proper prompt initialization for the downstream tasks and learn sufficient task-relevant information from prompt tokens and (ii) encourage the output representation from the pretrained language model to be more aware of the task-relevant information captured in the learnt prompts. Extensive experiments validate that InfoPrompt can significantly accelerate the convergence of the prompt tuning and outperform traditional prompt tuning methods. Finally, we provide a formal theoretical result to show that a gradient descent type algorithm can be used to train our mutual information loss.

Large Language Models of Code Fail at Completing Code with Potential Bugs
Tuan Dinh Jinman Zhao Samson Tan Renato Negrinho Leonard Lausen Sheng Zha George Karypis



Research question: Existing large language models of code ignore the possible presence of bugs in the code context for completion, which are inevitable in software development.
Motivation: Inspired by the realistic scenario of real-time code suggestion, the authors introduce and study the problem of code completion in the presence of potential bugs.
Method: Two datasets are introduced: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). The study finds that potential bugs significantly degrade the generation performance of high-performing Code-LLMs.
Results: For example, given a single potential bug in the context, the passing rates of CODEGEN-2B-MONO on buggy-HumanEval test cases drop by more than 50%. Several post-hoc methods for mitigating the adverse effect of potential bugs are investigated, and a large gap in post-mitigation performance remains.

Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs – anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a large gap in post-mitigation performance.
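
A semantics-altering operator change of the kind used to build buggy-HumanEval can be sketched with Python's `ast` module; the transform below (flipping the first `+` to `-`) is an illustrative stand-in for the dataset's actual construction.

```python
# Inject a potential bug by flipping one semantics-altering operator.
import ast

class FlipFirstBinOp(ast.NodeTransformer):
    """Replace the first '+' with '-' to create a potential bug."""
    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, ast.Add):
            node.op, self.done = ast.Sub(), True
        return node

src = (
    "def running_sum(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total = total + x\n"
    "    return total\n"
)
buggy = ast.unparse(FlipFirstBinOp().visit(ast.parse(src)))
print(buggy)   # 'total = total - x' -- the potential bug left in the context
```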

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Alexander H. Liu Heng-Jui Chang Michael Auli Wei-Ning Hsu James R. Glass



Research question: This paper introduces self-distillation and online clustering for self-supervised speech representation learning.
Motivation: The method combines masked language modeling, self-distillation, and online clustering, showing that these concepts complement each other and yield a strong representation learning model for speech.
Method: A teacher network first extracts contextualized embeddings from the input audio; online clustering of the embeddings then yields a machine-discovered phone inventory; finally, the discretized tokens guide a student network.
Results: Experiments show that DinoSR surpasses previous state-of-the-art performance on several downstream tasks, and the paper provides a detailed analysis of the model and the learned discrete units.

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
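
A hedged PyTorch sketch of the training signal: teacher embeddings are snapped to their nearest codebook entry (the online-clustering step), the student is trained to predict those discrete targets, and the teacher tracks the student by exponential moving average. Dimensions and update rules are illustrative, not the paper's exact recipe.

```python
# Sketch of a DinoSR-style teacher/student step with a discrete codebook.
import torch
import torch.nn.functional as F

D, K = 32, 8                                   # embedding dim, codebook size
teacher = torch.nn.Linear(16, D)
student = torch.nn.Linear(16, D)
codebook = F.normalize(torch.randn(K, D), dim=-1)
head = torch.nn.Linear(D, K)                   # student prediction over code indices

x = torch.randn(4, 16)                         # stand-in for masked audio frames
with torch.no_grad():
    t = F.normalize(teacher(x), dim=-1)
    targets = (t @ codebook.T).argmax(dim=-1)  # machine-discovered "phone" ids

loss = F.cross_entropy(head(student(x)), targets)
loss.backward()

with torch.no_grad():                          # EMA teacher update
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(0.999).add_(p_s, alpha=0.001)
print(loss.item())
```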

Neural Algorithmic Reasoning Without Intermediate Supervision
Gleb Rodionov Liudmila Prokhorenkova



Research question: This paper addresses a key challenge in neural algorithmic reasoning: generalizing to out-of-distribution data, particularly inputs of significantly larger size.
Motivation: Existing work tackles this by learning the algorithm step by step, which requires supervision over the original algorithm's trajectory. This paper instead learns neural algorithmic reasoning from input-output pairs alone, without intermediate supervision.
Method: Simple but effective architectural improvements, together with a self-supervised objective that regularizes the model's intermediate computations without access to the algorithm trajectory.
Results: Experiments show the approach is competitive with its trajectory-supervised counterpart on tasks from the CLRS Algorithmic Reasoning Benchmark and achieves new state-of-the-art results on several problems, including sorting. Learning without intermediate supervision is thus a promising direction for further research on neural reasoners.

Neural algorithmic reasoning is an emerging area of machine learning focusing on building models that can imitate the execution of classic algorithms, such as sorting, shortest paths, etc. One of the main challenges is to learn algorithms that are able to generalize to out-of-distribution data, in particular with significantly larger input sizes. Recent work on this problem has demonstrated the advantages of learning algorithms step-by-step, giving models access to all intermediate steps of the original algorithm. In this work, we instead focus on learning neural algorithmic reasoning only from the input-output pairs without appealing to the intermediate supervision. We propose simple but effective architectural improvements and also build a self-supervised objective that can regularise intermediate computations of the model without access to the algorithm trajectory. We demonstrate that our approach is competitive to its trajectory-supervised counterpart on tasks from the CLRS Algorithmic Reasoning Benchmark and achieves new state-of-the-art results for several problems, including sorting, where we obtain significant improvements. Thus, learning without intermediate supervision is a promising direction for further research on neural reasoners.

Does Visual Pretraining Help End-to-End Reasoning?
Chen Sun Calvin Luo Xingyi Zhou Anurag Arnab Cordelia Schmid



Research question: Whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining.
Motivation: A common belief holds that explicit visual abstraction (e.g., object detection) is essential for compositional generalization in visual reasoning; this paper aims to show that a neural network "generalist" can solve visual recognition and reasoning tasks, refuting that belief.
Method: A simple and general self-supervised framework that compresses each video frame into a small set of tokens with a transformer network and reconstructs the remaining frames from the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation of each image while capturing temporal dynamics and object permanence from the temporal context.
Results: Evaluation on two visual reasoning benchmarks, CATER and ACRE, shows that pretraining is essential for compositional generalization in end-to-end visual reasoning; the proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.

We aim to investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network ''generalist'' to solve visual recognition and reasoning tasks. We propose a simple and general self-supervised framework which ''compresses'' each video frame into a small set of tokens with a transformer network, and reconstructs the remaining frames based on the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image, as well as capture temporal dynamics and object permanence from temporal context. We perform evaluation on two visual reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.

Unlimiformer: Long-Range Transformers with Unlimited Length Input
Amanda Bertsch Uri Alon Graham Neubig Matthew R. Gormley



Research question: Existing transformer models are limited to bounded input lengths because they must attend to every token in the input.
Motivation: The paper proposes Unlimiformer, a general approach that wraps any pretrained encoder-decoder transformer and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, with the returned kNN distances serving as the attention dot-product scores.
Method: The kNN index can be kept in GPU or CPU memory and queried in sub-linear time; this allows indexing practically unlimited input sequences, with every attention head in every decoder layer retrieving its top-k keys instead of attending to every key.
Results: Evaluated on several long-document and book summarization benchmarks, Unlimiformer processes inputs as long as 500k tokens from the BookSum dataset without any truncation at test time. It improves pretrained models such as BART and Longformer by extending them to unlimited inputs, without additional learned weights and without modifying their code.

Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single $k$-nearest-neighbor ($k$NN) index, while the returned $k$NN distances are the attention dot-product scores. This $k$NN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-$k$ keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even **500k** token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. Our code and models are publicly available at https://github.com/abertsch72/unlimiformer , and support LLaMA-2 as well.
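
The retrieval step is easy to sketch: each attention query fetches only its top-$k$ keys, and the kNN inner products serve directly as attention logits. The brute-force `topk` below stands in for a real sub-linear index such as FAISS; shapes and the scaling factor are illustrative.

```python
# Sketch of kNN-based cross-attention over a very long encoded input.
import torch

L, D, k = 100_000, 64, 16                    # long input, head dim, retrieved keys
keys   = torch.randn(L, D)                   # in practice stored in a kNN index
values = torch.randn(L, D)
query  = torch.randn(D)

scores = keys @ query                        # brute-force stand-in for an index query
top_scores, top_idx = scores.topk(k)         # kNN scores double as attention logits
attn = torch.softmax(top_scores / D**0.5, dim=-1)
output = attn @ values[top_idx]              # attend only over the retrieved keys
print(output.shape)
```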

Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Changdae Oh Junhyuk So Hoyoon Byun YongTaek Lim Minchul Shin Jong-June Jeon Kyungwoo Song



Research question: The learned embeddings of pre-trained multi-modal models such as CLIP remain relatively unexplored, and their transferability can be improved.
Motivation: Although pre-trained multi-modal models like CLIP perform well in diverse applications, analysis of their learned multi-modal embeddings is limited and the transferability of the embeddings leaves room for improvement.
Method: Observing that CLIP holds separated embedding subspaces for the two modalities, the authors measure the quality of the learned representation through the lens of uniformity-alignment. Both theoretically and empirically, CLIP is shown to retain poor uniformity and alignment even after fine-tuning; the authors therefore devise a new fine-tuning method that yields robust representations with better alignment and uniformity.
Results: Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that the method provides transferable representations, enabling robust model adaptation on diverse tasks.

Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show promising results in diverse applications. However, the analysis of learned multi-modal embeddings is relatively unexplored, and the embedding transferability can be improved. In this work, we observe that CLIP holds separated embedding subspaces for two different modalities, and then we investigate it through the lens of \textit{uniformity-alignment} to measure the quality of learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack of alignment and uniformity might restrict the transferability and robustness of embeddings. To this end, we devise a new fine-tuning method for robust representation equipping better alignment and uniformity. First, we propose a \textit{Geodesic Multi-Modal Mixup} that mixes the embeddings of image and text to generate hard negative samples on the hypersphere. Then, we fine-tune the model on hard negatives as well as original negatives and positives with contrastive loss. Based on the theoretical analysis about hardness guarantee and limiting behavior, we justify the use of our method. Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks.
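
Interpreting the geodesic mixup as spherical interpolation (slerp) between normalized image and text embeddings, an assumption consistent with the mixing on the hypersphere described above, a minimal sketch looks like this:

```python
# Hedged sketch of geodesic (spherical) mixup between image and text
# embeddings; the mixed points serve as hard negatives for contrastive loss.
import torch
import torch.nn.functional as F

def geodesic_mixup(u, v, lam=0.5, eps=1e-7):
    u, v = F.normalize(u, dim=-1), F.normalize(v, dim=-1)
    theta = torch.acos((u * v).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - lam) * theta) * u + torch.sin(lam * theta) * v) / torch.sin(theta)

img, txt = torch.randn(8, 64), torch.randn(8, 64)
hard_neg = geodesic_mixup(img, txt, lam=0.5)
print(hard_neg.norm(dim=-1))   # stays (approximately) on the unit sphere
```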

What’s Left? Concept Grounding with Logic-Enhanced Foundation Models
Joy Hsu Jiayuan Mao Joshua B. Tenenbaum Jiajun Wu



Research question: How can large language models perform general, logic-based reasoning across different domains?
Motivation: Existing LLM-based visual reasoning models operate only in limited domains (e.g., 2D images) and do not fully exploit the generality of language: abstract concepts like "*left*" can also be grounded in 3D, temporal, and action data.
Method: The Logic-Enhanced Foundation Model (LEFT), which learns to ground and reason with concepts across domains via a differentiable, domain-independent, first-order logic-based program executor.
Results: LEFT flexibly learns concepts in four domains and exhibits strong reasoning ability across a wide variety of complex tasks, including ones unseen during training, and can easily be applied to new domains.

Recent works such as VisProg and ViperGPT have smartly composed foundation models for visual reasoning—using large language models (LLMs) to produce programs that can be executed by pre-trained vision-language models. However, they operate in limited domains, such as 2D images, not fully exploiting the generalization of language: abstract concepts like “*left*” can also be grounded in 3D, temporal, and action data, as in moving to your *left*. This limited generalization stems from these inference-only methods’ inability to learn or adapt pre-trained models to a new domain. We propose the **L**ogic-**E**nhanced **F**ounda**T**ion Model (**LEFT**), a unified framework that *learns* to ground and reason with concepts across domains with a differentiable, domain-independent, first-order logic-based program executor. LEFT has an LLM interpreter that outputs a program represented in a general, logic-based reasoning language, which is shared across all domains and tasks. LEFT’s executor then executes the program with trainable domain-specific grounding modules. We show that LEFT flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, and robotic manipulation. It exhibits strong reasoning ability in a wide variety of tasks, including those that are complex and not seen during training, and can be easily applied to new domains.

Language Model Tokenizers Introduce Unfairness Between Languages
Aleksandar Petrov Emanuele La Malfa Philip Torr Adel Bibi



Research question: Despite the impressive multilingual performance of recent language models, the quality with which they process different languages varies.
Motivation: This disparity arises largely at the tokenization stage, well before a model is even invoked.
Method: The paper makes the case for training future language models with multilingually fair subword tokenizers.
Results: Using multilingually fair tokenizers would reduce the disparities in how different languages are processed and improve model fairness.

Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair subword tokenizers.
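
Measuring this disparity is straightforward with any Hugging Face tokenizer; the model name and the (roughly) parallel sentences below are illustrative choices, not the paper's evaluation set.

```python
# Compare tokenization lengths of (roughly) parallel sentences.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
parallel = {
    "en": "The weather is nice today.",
    "de": "Das Wetter ist heute schön.",
    "ru": "Сегодня хорошая погода.",
    "hi": "आज मौसम अच्छा है।",
}
lengths = {lang: len(tok.encode(s)) for lang, s in parallel.items()}
baseline = lengths["en"]
for lang, n in lengths.items():
    print(f"{lang}: {n} tokens ({n / baseline:.1f}x English)")
```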

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation
Marco Bellagente Manuel Brack Hannah Benita Teufel Felix Friedrich Björn Deiseroth Constantin Eichenberg Andrew Dai Robert John Nicholas Baldock Souradeep Nanda Koen Oostermeijer Andres Felipe Cruz-Salinas Patrick Schramowski Kristian Kersting Samuel Weinbach



Research question: How to use pretrained models and multimodal, multilingual inputs to improve text-to-image generation.
Motivation: Existing text-to-image models struggle to express complex or nuanced concepts in text alone; a model that handles multimodal, multilingual input is needed.
Method: MultiFusion aligns pretrained modules and integrates them into a cohesive system, avoiding the need for extensive training from scratch.
Results: Experiments show that the fusion of all independent components lets the image generation module utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks
Minki Kang Seanie Lee Jinheon Baek Kenji Kawaguchi Sung Ju Hwang



Research question: Large language models excel at knowledge-intensive reasoning tasks, but real-world deployment is hindered by high computational requirements and data privacy concerns.
Motivation: To address this, the paper proposes Knowledge-Augmented Reasoning Distillation (KARD), a new method that fine-tunes small language models to generate rationales, using knowledge retrieved from an external knowledge base.
Method: First, rationales are obtained from an LLM; these rationales are then augmented with knowledge retrieved from an external knowledge base; finally, a neural reranker obtains documents relevant to rationale generation.
Results: Experiments show that KARD significantly improves small T5 and GPT models on challenging knowledge-intensive reasoning datasets (MedQA-USMLE, StrategyQA, and OpenbookQA). Notably, on the MedQA-USMLE and StrategyQA benchmarks, 250M-parameter T5 models outperform fine-tuned 3B models.

Large Language Models (LLMs) have shown promising performance in knowledge-intensive reasoning tasks that require a compound understanding of knowledge. However, deployment of the LLMs in real-world applications can be challenging due to their high computational requirements and concerns about data privacy. Previous studies have focused on building task-specific small Language Models (LMs) by fine-tuning them with labeled data or distilling LLMs. However, these approaches are ill-suited for knowledge-intensive reasoning tasks due to the limited capacity of small LMs in memorizing the knowledge required. Motivated by our theoretical analysis on memorization, we propose Knowledge-Augmented Reasoning Distillation (KARD), a novel method that fine-tunes small LMs to generate rationales obtained from LLMs with augmented knowledge retrieved from an external knowledge base. Moreover, we further propose a neural reranker to obtain documents relevant to rationale generation. We empirically show that KARD significantly improves the performance of small T5 and GPT models on the challenging knowledge-intensive reasoning datasets, namely MedQA-USMLE, StrategyQA, and OpenbookQA. Notably, our method enables 250M-parameter T5 models to outperform fine-tuned 3B models, which have 12 times more parameters, on both the MedQA-USMLE and StrategyQA benchmarks.

Textually Pretrained Speech Language Models
Michael Hassid Tal Remez Tu Anh Nguyen Itai Gat Alexis Conneau Felix Kreuk Jade Copet Alexandre Défossez Gabriel Synnaeve Emmanuel Dupoux Roy Schwartz Yossi Adi



Research question: How to use pretrained text language models to train speech language models.
Motivation: Current speech language models, which process and generate acoustic data without textual supervision, perform unsatisfactorily.
Method: TWIST, a method that trains speech language models with a warm start from a pretrained textual language model.
Results: Experiments show that TWIST outperforms a cold-start speech language model across the board, and that both model design and dataset scale play important roles in building better-performing speech language models.

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available.
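
A hedged sketch of the warm-start idea using the Hugging Face API: load a pretrained text LM and re-purpose its embedding table for a discrete speech-unit vocabulary. TWIST's actual tokenizer, backbone models, and vocabulary size differ; `gpt2` and the 500-unit vocabulary below are placeholders.

```python
# Warm-start sketch: text LM weights, speech-unit vocabulary.
from transformers import AutoModelForCausalLM

N_SPEECH_UNITS = 500                           # illustrative discrete-unit vocabulary size
model = AutoModelForCausalLM.from_pretrained("gpt2")   # warm start from a textual LM
model.resize_token_embeddings(N_SPEECH_UNITS)  # re-purpose embeddings for speech tokens
# Training then continues on sequences of speech units instead of text tokens.
print(model.get_input_embeddings().weight.shape)
```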

Uncovering and Quantifying Social Biases in Code Generation
Yan Liu Xiaokang Chen Yan Gao Zhe Su Fengji Zhang Daoguang Zan Jian-Guang Lou Pin-Yu Chen Tsung-Yi Ho



Research question: This study explores the social bias problem in pre-trained code generation models.
Motivation: With the popularity of automatic code generation tools such as Copilot, the study of the potential hazards of these tools is gaining importance.
Method: A new paradigm for constructing code prompts that successfully uncovers social biases in code generation models, together with a dataset and three metrics that quantify the severity of social bias in generated code.
Results: Experiments on three pre-trained code generation models of varying sizes (Codex, InCoder, and CodeGen) reveal severe social biases; the accompanying analysis offers useful insights for choosing code generation models with low social bias.

With the popularity of automatic code generation tools, such as Copilot, the study of the potential hazards of these tools is gaining importance. In this work, we explore the social bias problem in pre-trained code generation models. We propose a new paradigm to construct code prompts and successfully uncover social biases in code generation models. To quantify the severity of social biases in generated code, we develop a dataset along with three metrics to evaluate the overall social bias and fine-grained unfairness across different demographics. Experimental results on three pre-trained code generation models (Codex, InCoder, and CodeGen) with varying sizes, reveal severe social biases. Moreover, we conduct analysis to provide useful insights for further choice of code generation models with low social bias.

Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang Li Dong Wenhui Wang Yaru Hao Saksham Singhal Shuming Ma Tengchao Lv Lei Cui Owais Khan Mohammed Barun Patra Qiang Liu Kriti Aggarwal Zewen Chi Johan Bjorck Vishrav Chaudhary Subhojit Som Xia Song Furu Wei



Research question: How to achieve a big convergence of language, multimodal perception, action, and world modeling as a step toward artificial general intelligence.
Motivation: Language models alone cannot perceive general modalities; a multimodal large language model (MLLM) that can learn in context (few-shot) and follow instructions (zero-shot) is needed.
Method: KOSMOS-1 is trained from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data, and is evaluated in zero-shot, few-shot, and multimodal chain-of-thought settings without any gradient updates or fine-tuning.
Results: Experiments show that KOSMOS-1 achieves impressive performance on language understanding and generation (including OCR-free NLP), perception-language tasks (multimodal dialogue, image captioning, visual question answering), and vision tasks; MLLMs also benefit from cross-modal transfer, and a new Raven IQ test dataset diagnoses their nonverbal reasoning capability.

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train KOSMOS-1 from scratch on web-scale multi-modal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Haoyi Duan Yan Xia Mingze Zhou Li Tang Jieming Zhu Zhou Zhao



Research question: Existing large-scale pre-trained models underperform when extracting features for multi-modal tasks because irrelevant modality-specific information is introduced during encoding.
Motivation: To address this, the paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
Method: The mechanism uses the audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, DG-SCT modules incorporate trainable cross-modal interaction layers into pre-trained audio-visual encoders, adaptively extracting crucial information from the current modality while preserving the frozen parameters of the large-scale pre-trained models.
Results: Experimental evaluations show the model achieves state-of-the-art results on multiple downstream tasks, including AVE, AVVP, AVS, and AVQA, and performs well in challenging few-shot and zero-shot scenarios.

In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.

FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning
Kun Song Huimin Ma Bochao Zou Huishuai Zhang Weiran Huang



Research problem: With limited data, existing few-shot learning methods trained from scratch fail to achieve satisfactory performance.
Motivation: Unlike training from scratch, large-scale pre-trained models such as CLIP show remarkable few-shot and zero-shot ability; however, fine-tuning degrades a pre-trained model's generalization under distribution shift, and the small sample sizes of few-shot learning make the model prone to overfitting.
Method: The paper proposes a fine-tuning method called Feature Discrimination Alignment (FD-Align), which strengthens generalization by preserving the consistency of spurious features throughout fine-tuning.
Results: Extensive experiments validate the method's effectiveness on both in-distribution (ID) and out-of-distribution (OOD) tasks, and the fine-tuned model integrates seamlessly with existing methods to improve performance.

Due to the limited availability of data, existing few-shot learning methods trained from scratch fail to achieve satisfactory performance. In contrast, large-scale pre-trained models such as CLIP demonstrate remarkable few-shot and zero-shot capabilities. To enhance the performance of pre-trained models for downstream tasks, fine-tuning the model on downstream data is frequently necessary. However, fine-tuning the pre-trained model leads to a decrease in its generalizability in the presence of distribution shift, while the limited number of samples in few-shot learning makes the model highly susceptible to overfitting. Consequently, existing methods for fine-tuning few-shot learning primarily focus on fine-tuning the model's classification head or introducing additional structure. In this paper, we introduce a fine-tuning approach termed Feature Discrimination Alignment (FD-Align). Our method aims to bolster the model's generalizability by preserving the consistency of spurious features across the fine-tuning process. Extensive experimental results validate the efficacy of our approach for both ID and OOD tasks. Once fine-tuned, the model can seamlessly integrate with existing methods, leading to performance improvements. Our code can be found at https://github.com/skingorz/FD-Align.

Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification
Rui Wang Pei Pei Li Huaibo Huang Chunshui Cao Ran He Zhaofeng He



Research problem: This paper addresses overfitting in ordinal classification caused by the additional ordering relations among labels.
Motivation: Because ordinal labels carry extra ordering relations, relying solely on training data invites overfitting. Inspired by pre-trained vision-language models, the authors convert the original task into a vision-language alignment task to exploit the rich ordinal priors in human language.
Method: A language-driven ordering alignment method named L2RCLIP is proposed. First, a complementary prompt tuning technique called RankFormer enhances the ordering relation of the original rank prompts. Second, to further incorporate language priors, the approximate bound optimization of the vanilla cross-entropy loss is revisited and restructured within the cross-modal embedding space, yielding a cross-modal ordinal pairwise loss that refines the CLIP feature space so that texts and images maintain both semantic alignment and ordering alignment.
Results: Extensive experiments on three ordinal classification tasks, namely facial age estimation, historical color image (HCI) classification, and aesthetic assessment, demonstrate promising performance.

We present a novel language-driven ordering alignment method for ordinal classification. The labels in ordinal classification contain additional ordering relations, making them prone to overfitting when relying solely on training data. Recent developments in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language by converting the original task into a vision-language alignment task. Consequently, we propose L2RCLIP, which fully utilizes the language priors from two perspectives. First, we introduce a complementary prompt tuning technique called RankFormer, designed to enhance the ordering relation of original rank prompts. It employs token-level attention with residual-style prompt blending in the word embedding space. Second, to further incorporate language priors, we revisit the approximate bound optimization of vanilla cross-entropy loss and restructure it within the cross-modal embedding space. We then propose a cross-modal ordinal pairwise loss to refine the CLIP feature space, where texts and images maintain both semantic alignment and ordering alignment. Extensive experiments on three ordinal classification tasks, including facial age estimation, historical color image (HCI) classification, and aesthetic assessment, demonstrate its promising performance.

Improving Compositional Generalization using Iterated Learning and Simplicial Embeddings
Yi Ren Samuel Lavoie Mikhail Galkin Danica J. Sutherland Aaron Courville



Research problem: How to improve the compositional generalization of deep neural networks so they can generalize to unseen combinations of latent factors.
Motivation: Compositional generalization is easy for humans but hard for deep networks. Inspired by the "iterated learning" process from cognitive science, the authors propose applying iterated learning to models with simplicial embeddings to improve compositional generalization.
Method: Apply iterated learning to models equipped with simplicial embeddings, which approximately discretize representations, thereby improving compositional generalization.
Results: This combination generalizes compositionally better than other approaches, both on vision tasks with well-understood latent factors and on real molecular graph prediction tasks with unknown latent structure.

Compositional generalization, the ability of an agent to generalize to unseen combinations of latent factors, is easy for humans but hard for deep neural networks. A line of research in cognitive science has hypothesized a process, "iterated learning," to help explain how human language developed this ability; the theory rests on simultaneous pressures towards compressibility (when an ignorant agent learns from an informed one) and expressivity (when it uses the representation for downstream tasks). Inspired by this process, we propose to improve the compositional generalization of deep networks by using iterated learning on models with simplicial embeddings, which can approximately discretize representations. This approach is further motivated by an analysis of compositionality based on Kolmogorov complexity. We show that this combination of changes improves compositional generalization over other approaches, demonstrating these improvements both on vision tasks with well-understood latent factors and on real molecular graph prediction tasks where the latent structure is unknown.
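A simplicial embedding is straightforward to sketch: the representation vector is split into groups and a softmax is applied within each group, so every group lies on a probability simplex and the representation is approximately discretized. The following is a minimal version of that layer; the function name and the temperature parameter are illustrative.

```python
import torch
import torch.nn.functional as F

def simplicial_embedding(z: torch.Tensor, n_groups: int, temperature: float = 1.0):
    """Project a representation onto a product of simplices: split the feature
    vector into n_groups chunks and softmax within each chunk. Low temperature
    pushes each chunk toward a near-one-hot (discrete) code."""
    b, d = z.shape
    assert d % n_groups == 0, "feature dim must divide evenly into groups"
    z = z.view(b, n_groups, d // n_groups)
    return F.softmax(z / temperature, dim=-1).view(b, d)

x = torch.randn(4, 64)
s = simplicial_embedding(x, n_groups=8)  # each 8-dim chunk sums to 1
```

Iterated learning then alternates between training a fresh "ignorant" agent on these compressed codes and using the codes for the downstream task, mirroring the compressibility/expressivity pressures the paper describes.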

Punctuation-level Attack: Single-shot and Single Punctuation Can Fool Text Models
Wenqiang Wang Chongyang Du Tao Wang Kaihao Zhang Wenhan Luo Lin Ma Wei Liu Xiaochun Cao



Research problem: This paper proposes a new mode of textual attack, the punctuation-level attack, together with a search method named Text Position Punctuation Embedding and Paraphrase (TPPEP) that accelerates finding the optimal attack position.
Motivation: Existing textual attack models fool models mainly through character-, word-, or sentence-level perturbations, ignoring their influence on human perception.
Method: Using perturbation types including insertion, displacement, deletion, and replacement, the attack achieves high fooling rates against SOTA models on typical textual tasks while keeping the impact on human perception and understanding minimal, since only a single punctuation mark is perturbed in a single shot. The TPPEP search method accelerates the pursuit of the optimal attack position without exhaustive search.
Results: Experiments on public datasets and SOTA models demonstrate the effectiveness of the punctuation-level attack and the proposed TPPE. Applying the single-punctuation attack to summarization, semantic-similarity scoring, and text-to-image tasks also yields encouraging results.

The adversarial attacks have attracted increasing attention in various fields including natural language processing. The current textual attacking models primarily focus on fooling models by adding character-/word-/sentence-level perturbations, ignoring their influence on human perception. In this paper, for the first time in the community, we propose a novel mode of textual attack, punctuation-level attack. With various types of perturbations, including insertion, displacement, deletion, and replacement, the punctuation-level attack achieves promising fooling rates against SOTA models on typical textual tasks and maintains minimal influence on human perception and understanding of the text by mere perturbation of single-shot single punctuation. Furthermore, we propose a search method named Text Position Punctuation Embedding and Paraphrase (TPPEP) to accelerate the pursuit of optimal position to deploy the attack, without exhaustive search, and we present a mathematical interpretation of TPPEP. Thanks to the integrated Text Position Punctuation Embedding (TPPE), the punctuation attack can be applied at a constant cost of time. Experimental results on public datasets and SOTA models demonstrate the effectiveness of the punctuation attack and the proposed TPPE. We additionally apply the single punctuation attack to summarization, semantic-similarity-scoring, and text-to-image tasks, and achieve encouraging results.
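For intuition, here is the exhaustive single-punctuation baseline that TPPEP is designed to avoid: try one punctuation edit at every position and keep the most damaging one. This sketch only shows the insertion perturbation type; `loss_fn` is a hypothetical stand-in for querying the victim model.

```python
def naive_punctuation_attack(tokens, loss_fn, marks=",.;:!?"):
    """Exhaustive baseline: insert one punctuation mark at every position and
    keep the single edit that hurts the victim model most. TPPEP replaces this
    O(positions x marks) scan with an embedding-based position search."""
    best, best_loss = None, loss_fn(tokens)
    for i in range(len(tokens) + 1):
        for m in marks:
            cand = tokens[:i] + [m] + tokens[i:]
            loss = loss_fn(cand)
            if loss > best_loss:  # higher loss on the true label = more fooling
                best, best_loss = cand, loss
    return best if best is not None else tokens
```

The displacement, deletion, and replacement perturbation types follow the same pattern with different candidate generators.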

Thrust: Adaptively Propels Large Language Models with External Knowledge
Xinran Zhao Hongming Zhang Xiaoman Pan Wenlin Yao Dong Yu Jianshu Chen



Research problem: How to effectively inject external knowledge into pre-trained language models to improve their performance on various NLP tasks.
Motivation: Although pre-trained language models encode rich knowledge, their internal knowledge can be opaque or static, making external knowledge necessary; yet existing information retrieval techniques can be costly and may even introduce noisy or misleading knowledge.
Method: Instance-level Adaptive Propulsion of External Knowledge (IAPEK) performs retrieval only when necessary. To this end, a new metric, Thrust, uses the representation distribution of a small number of seen instances to estimate whether a pre-trained language model already has enough knowledge to solve a given instance.
Results: Experiments show that Thrust is a good measure of instance-level knowledgeability. Using the Thrust score as the retrieval indicator achieves higher cost-efficiency than naively using external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement. These findings inform the practical deployment of knowledge-enhanced language models under computation latency or cost constraints.

Although large-scale pre-trained language models (PTLMs) are shown to encode rich knowledge in their model parameters, the inherent knowledge in PTLMs can be opaque or static, making external knowledge necessary. However, the existing information retrieval techniques could be costly and may even introduce noisy and sometimes misleading knowledge. To address these challenges, we propose the instance-level adaptive propulsion of external knowledge (IAPEK), where we only conduct the retrieval when necessary. To achieve this goal, we propose to model whether a PTLM contains enough knowledge to solve an instance with a novel metric, Thrust, which leverages the representation distribution of a small amount of seen instances. Extensive experiments demonstrate that Thrust is a good measurement of models' instance-level knowledgeability. Moreover, we can achieve higher cost-efficiency with the Thrust score as the retrieval indicator than the naive usage of external knowledge on 88% of the evaluated tasks with 26% average performance improvement. Such findings shed light on the real-world practice of knowledge-enhanced LMs with a limited budget for knowledge seeking due to computation latency or costs.
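The following is an illustrative stand-in for the flavor of metric the abstract describes, not the paper's exact Thrust formula: seen instances are grouped per class, and inverse-square distances from a query embedding to each class centroid are combined as vectors, so a large resultant suggests the model's representation already places the query decisively and retrieval can be skipped. All names and the precise weighting are assumptions.

```python
import numpy as np

def thrust_like_score(query_emb: np.ndarray, class_embs: list) -> float:
    """Toy instance-level knowledgeability score. class_embs is a list with
    one (n_c, d) array of seen-instance embeddings per class. Each class
    contributes a vector pointing from the query to its centroid, weighted by
    class size over squared distance; the magnitude of the sum is the score."""
    vecs = []
    for embs in class_embs:
        centroid = embs.mean(axis=0)
        diff = centroid - query_emb
        dist = np.linalg.norm(diff) + 1e-8
        vecs.append(len(embs) * diff / dist**3)  # (N_c / dist^2) * unit direction
    return float(np.linalg.norm(np.sum(vecs, axis=0)))

# Low score -> representation is ambiguous -> trigger external retrieval.
```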

RRHF: Rank Responses to Align Language Models with Human Feedback
Hongyi Yuan Zheng Yuan Chuanqi Tan Wei Wang Songfang Huang Fei Huang



Research problem: How to align large language models with human feedback so as to improve the quality of human-model interaction.
Motivation: Existing reinforcement learning methods such as PPO are sensitive to hyperparameters and require multiple models in their standard implementations, making them hard to train and scale up.
Method: A new learning paradigm, RRHF, scores sampled responses from different sources via the logarithm of conditional probabilities and learns to align these probabilities with human preferences through a ranking loss.
Results: RRHF needs only 1 to 2 models during tuning and aligns language models with human preferences effectively without complex hyperparameter tuning. Experiments show that RRHF's performance is highly correlated with sampling quality, suggesting it is a best-of-n learner.

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. RRHF can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality which suggests RRHF is a best-of-$n$ learner.
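Since the abstract spells out the ingredients (per-response log conditional probabilities, a ranking loss against preference scores), a minimal sketch of the objective is easy to write down. This is a simplified reading, assuming `logprobs` holds length-normalized log-probabilities of each candidate under the policy and `rewards` holds reward-model or human scores.

```python
import torch
import torch.nn.functional as F

def rrhf_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Sketch of an RRHF-style objective over n candidate responses:
    a pairwise rank loss pushes log p_i above log p_j whenever
    reward_i > reward_j, plus an SFT-style term on the best candidate."""
    rank_loss = logprobs.new_zeros(())
    n = logprobs.size(0)
    for i in range(n):
        for j in range(n):
            if rewards[i] > rewards[j]:
                rank_loss = rank_loss + F.relu(logprobs[j] - logprobs[i])
    sft_loss = -logprobs[rewards.argmax()]  # keep imitating the best response
    return rank_loss + sft_loss

loss = rrhf_loss(torch.tensor([-1.2, -0.7, -2.0], requires_grad=True),
                 torch.tensor([0.1, 0.9, 0.3]))
```

Because the loss only needs candidate log-probabilities, sampling can come from the model itself, other LLMs, or human experts, which is what keeps the model count at 1 to 2.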

Language Models can Solve Computer Tasks
Geunwoo Kim Pierre Baldi Stephen Marcus McAleer



Research problem: How to enable large language model (LLM) agents to execute computer tasks from natural-language instructions.
Motivation: Existing approaches require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks.
Method: A simple prompting scheme in which the agent Recursively Criticizes and Improves its output (RCI) to carry out computer tasks.
Results: Experiments show that RCI significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning and reinforcement learning approaches on the MiniWoB++ benchmark.

Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent \textbf{R}ecursively \textbf{C}riticizes and \textbf{I}mproves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting with external feedback. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https://github.com/posgnu/rci-agent.
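The RCI loop itself is a plain prompting pattern, sketched below. The exact prompt wording used in the paper differs; here `llm` is any text-completion callable and the templates are illustrative.

```python
def rci(llm, task: str, max_rounds: int = 3) -> str:
    """Recursive Criticism and Improvement (RCI) prompting sketch:
    propose an output, ask the model to critique it, then ask the model
    to improve it based on its own critique, for a few rounds."""
    answer = llm(f"Task: {task}\nProposed output:")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nOutput: {answer}\n"
            "Review the output above. What is wrong with it?"
        )
        answer = llm(
            f"Task: {task}\nOutput: {answer}\nCritique: {critique}\n"
            "Based on the critique, give an improved output:"
        )
    return answer
```

For computer tasks, the same loop is applied to grounding (is the action valid in the current page state?) before the action is executed.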

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
Cheng Cheng Lin Song Ruoyi Xue Hang Wang Hongbin Sun Yixiao Ge Ying Shan



Research problem: How to improve the efficiency and generalization ability of CLIP-based few-shot learning methods.
Motivation: Current CLIP-based few-shot methods require offline fine-tuning of parameters, which lengthens inference time and risks overfitting in certain domains.
Method: Meta-Adapter, a lightweight residual-style adapter, refines CLIP features guided by the few-shot samples in an online manner.
Results: With only a few training samples, the method achieves effective few-shot learning and generalizes to unseen data or tasks without additional fine-tuning, delivering competitive performance and high efficiency on eight image classification datasets.

The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of overfitting in certain domains. To tackle these challenges, we propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6\% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.

Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval
Shijie Wang Jianlong Chang Haojie Li Zhihui Wang Wanli Ouyang Qi Tian



Research problem: How to transform retrieval models trained with image-level supervision from category semantic extraction to attribute modeling, so they can handle open-set fine-grained retrieval of unknown categories.
Motivation: Existing approaches require substantial manual annotation, which is labor-intensive and inefficient. It is therefore worth exploring how to learn visual attributes from known categories and parameterize them into the retrieval model without any attribute annotations.
Method: A novel Visual Attribute Parameterization Network (VAPNet) mines rich, detailed semantics from local image patches and distills visual attributes from them, then feeds these attributes back as supervisory signals during training, achieving attribute parameterization.
Results: Extensive experiments on open-set fine-grained retrieval datasets show that VAPNet outperforms existing solutions.

Open-set fine-grained retrieval is an emerging challenging task that allows retrieval of unknown categories beyond the training set. The best solution for handling unknown categories is to represent them using a set of visual attributes learnt from known categories, as widely used in zero-shot learning. Though important, attribute modeling usually requires significant manual annotations and thus is labor-intensive. Therefore, it is worth investigating how to transform retrieval models trained by image-level supervision from category semantic extraction to attribute modeling. To this end, we propose a novel Visual Attribute Parameterization Network (VAPNet) to learn visual attributes from known categories and parameterize them into the retrieval model, without the involvement of any attribute annotations. In this way, VAPNet could utilize its parameters to parse a set of visual attributes from unknown categories and precisely represent them. Technically, VAPNet explicitly attains some semantics with rich details by making use of local image patches and distills the visual attributes from these discovered semantics. Additionally, it integrates the online refinement of these visual attributes into the training process to iteratively enhance their quality. Simultaneously, VAPNet treats these attributes as supervisory signals to tune the retrieval models, thereby achieving attribute parameterization. Extensive experiments on open-set fine-grained retrieval datasets validate the superior performance of our VAPNet over existing solutions.

Post Hoc Explanations of Language Models Can Improve Language Models
Satyapriya Krishna Jiaqi Ma Dylan Z Slack Asma Ghandeharioun Sameer Singh Himabindu Lakkaraju



Research problem: How to obtain, at scale, the in-context learning gains that human-annotated rationales (e.g., chain-of-thought prompting) provide to large language models.
Motivation: Although large language models perform complex tasks well, incorporating human-annotated rationales into in-context learning requires substantial human involvement and is hard to scale.
Method: A new framework, AMPLIFY, automates rationale generation. Post hoc explanation methods output attribution scores capturing the influence of each input feature on model predictions; automated natural-language rationales embedding these insights are then constructed to provide corrective signals to the large language model.
Results: Extensive experiments on real-world datasets show that AMPLIFY improves prediction accuracy by about 10-25% across a wide range of tasks, including those where prior approaches relying on human-annotated rationales such as chain-of-thought prompting fall short.

Large Language Models (LLMs) have demonstrated remarkable capabilities in performing complex tasks. Moreover, recent research has shown that incorporating human-annotated rationales (e.g., Chain-of-Thought prompting) during in-context learning can significantly enhance the performance of these models, particularly on tasks that require reasoning capabilities. However, incorporating such rationales poses challenges in terms of scalability as this requires a high degree of human involvement. In this work, we present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY), which addresses the aforementioned challenges by automating the process of rationale generation. To this end, we leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. More specifically, we construct automated natural language rationales that embed insights from post hoc explanations to provide corrective signals to LLMs. Extensive experimentation with real-world datasets demonstrates that our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks, including those where prior approaches which rely on human-annotated rationales such as Chain-of-Thought prompting fall short. Our work makes one of the first attempts at highlighting the potential of post hoc explanations as valuable tools for enhancing the effectiveness of LLMs. Furthermore, we conduct additional empirical analyses and ablation studies to demonstrate the impact of each of the components of AMPLIFY, which, in turn, lead to critical insights for refining in context learning.
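The rationale-construction step lends itself to a short sketch: take post hoc attribution scores over the input tokens of a (mis)classified example and verbalize the top-scoring tokens as a corrective hint in the prompt. The template wording and function name below are illustrative assumptions, not the paper's exact prompt.

```python
def build_rationale(tokens, attributions, k: int = 5) -> str:
    """AMPLIFY-style rationale sketch: rank input tokens by the magnitude of
    their post hoc attribution scores (e.g., from gradient x input or SHAP)
    and verbalize the top-k as a hint to prepend to the few-shot prompt."""
    top = sorted(zip(tokens, attributions), key=lambda t: -abs(t[1]))[:k]
    keywords = ", ".join(tok for tok, _ in top)
    return f"The key words: {keywords} are important clues to predict the label."

print(build_rationale(["the", "movie", "was", "dreadful"],
                      [0.01, 0.20, 0.02, 0.90], k=2))
```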

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna Ollie Liu Alexandre Variengien



Research problem: This paper investigates the basic mathematical abilities through which a pre-trained language model performs tasks it was never explicitly trained on.
Motivation: Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood.
Method: Mechanistic interpretability techniques are used to explain the limited mathematical abilities of GPT-2 small. As a case study, the authors examine its ability to take sentences such as "The war lasted from the year 1732 to the year 17" and predict valid two-digit end years (years > 32).
Results: GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.

Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
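The probe itself is easy to reproduce: measure how much next-token probability mass GPT-2 small places on valid two-digit end years. A minimal sketch using the Hugging Face transformers library follows; it assumes, as the paper's setup exploits, that two-digit numbers "00" through "99" each tokenize to a single GPT-2 token.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The war lasted from the year 1732 to the year 17"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
probs = logits.softmax(-1)

# Probability of each two-digit continuation "00".."99" (assumed single tokens).
year_ids = [tok.encode(f"{y:02d}")[0] for y in range(100)]
p = probs[year_ids]
print("share of two-digit-year mass on valid years (> 32):",
      (p[33:].sum() / p.sum()).item())
```

A well-behaved model concentrates this mass above the start year, and circuit analysis then asks which heads and MLPs produce that shift.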

Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals
Tam Minh Nguyen Tan Minh Nguyen Richard Baraniuk



Research problem: Transformer models have achieved remarkable success in natural language processing and computer vision, but as depth grows, token representations degrade and become identical, the over-smoothing problem.
Motivation: Over-smoothing arises because the self-attention layers in transformers minimize a functional that promotes smoothness, driving tokens toward uniformity.
Method: A novel regularizer penalizes the difference between the self-attention output tokens and the input tokens, preserving token fidelity. Minimizing the resulting regularized energy functional yields NeuTRENO, a new class of transformer models that mitigates over-smoothing.
Results: Experiments show NeuTRENO outperforms baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.

Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
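One plausible reading of the abstract's fidelity regularizer, in code: standard self-attention output plus a term that pulls each output token back toward its input token, counteracting the averaging that causes over-smoothing. This is a sketch of the idea only; the exact form and placement of the correction in NeuTRENO may differ, and `lam` is an assumed tunable strength.

```python
import torch
import torch.nn.functional as F

def fidelity_attention(q, k, v, lam: float = 0.6):
    """Self-attention with a residual fidelity term: the softmax average
    smooths tokens, and lam * (v - out) pulls outputs back toward inputs."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    out = F.softmax(scores, dim=-1) @ v   # smooth, averaged tokens
    return out + lam * (v - out)          # fidelity term resists uniformity

x = torch.randn(2, 16, 64)
y = fidelity_attention(x, x, x)
```

With lam = 0 this reduces to vanilla attention; larger lam trades smoothness for fidelity to the input tokens.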

From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Models to Pre-trained Machine Reader
Weiwen Xu Xin Li Wenxuan Zhang Meng Zhou Wai Lam Luo Si Lidong Bing



Research problem: This paper proposes Pre-trained Machine Reader (PMR), a new method for retrofitting pre-trained masked language models (MLMs) into pre-trained machine reading comprehension (MRC) models without acquiring labeled data.
Motivation: Existing MLMs suffer from a discrepancy between model pre-training and downstream fine-tuning, which PMR can resolve.
Method: A large volume of general-purpose, high-quality MRC-style training data is constructed from Wikipedia hyperlinks, and a Wiki Anchor Extraction task is designed to guide the MRC-style pre-training of the proposed PMR.
Results: PMR is simple yet effective on extraction tasks such as extractive question answering and named entity recognition, especially in low-resource scenarios. When sequence classification is cast in the MRC formulation, PMR extracts high-quality rationales that explain the classification process, improving prediction explainability. PMR also has the potential to serve as a unified model for various extraction and classification tasks in the MRC formulation.

We present Pre-trained Machine Reader (PMR), a novel method for retrofitting pre-trained masked language models (MLMs) to pre-trained machine reading comprehension (MRC) models without acquiring labeled data. PMR can resolve the discrepancy between model pre-training and downstream fine-tuning of existing MLMs. To build the proposed PMR, we constructed a large volume of general-purpose and high-quality MRC-style training data by using Wikipedia hyperlinks and designed a Wiki Anchor Extraction task to guide the MRC-style pre-training. Apart from its simplicity, PMR effectively solves extraction tasks, such as Extractive Question Answering and Named Entity Recognition. PMR shows tremendous improvements over existing approaches, especially in low-resource scenarios. When applied to the sequence classification task in the MRC formulation, PMR enables the extraction of high-quality rationales to explain the classification process, thereby providing greater prediction explainability. PMR also has the potential to serve as a unified model for tackling various extraction and classification tasks in the MRC formulation.

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu Chunqiu Steven Xia Yuyao Wang LINGMING ZHANG



Research problem: Existing programming benchmarks have test cases that are limited in both quantity and quality for assessing the functional correctness of code generated by large language models (LLMs), so a more rigorous evaluation framework is needed.
Motivation: To address these limitations, the authors propose EvalPlus, a code-synthesis evaluation framework that rigorously benchmarks the functional correctness of LLM-generated code.
Method: EvalPlus augments a given evaluation dataset with new test cases produced by an automatic test-input generator powered by both LLM-based and mutation-based strategies. The test cases of the popular HumanEval benchmark are extended to build HumanEval+.
Results: Extensive evaluation shows that HumanEval+ catches large amounts of previously undetected wrong LLM-generated code, reducing pass@k by 19.3-28.9%. Test insufficiency can even cause mis-ranking: for example, WizardCoder-CodeLlama and Phind-CodeLlama outperform ChatGPT on HumanEval+ but not on HumanEval. The study shows that existing benchmark results do not accurately reflect LLMs' true code-synthesis performance, and it opens a new direction for improving such benchmarks through automated testing.

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus – a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing pass@k by up to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

Language Models Can Improve Event Prediction by Few-Shot Abductive Reasoning
Xiaoming Shi Siqiao Xue Kangrui Wang Fan Zhou James Y. Zhang JUN ZHOU Chenhao Tan Hongyuan Mei



Research problem: This paper explores whether large language models can reason about real-world events and help improve the prediction performance of event sequence models.
Motivation: Pre-trained language models excel at reasoning tasks, and the authors investigate whether this ability can be harnessed for event prediction.
Method: A framework named LAMP integrates a large language model into event prediction. The language model assists an event sequence model via abductive reasoning: the event model proposes predictions of future events from the past; guided by a few expert-annotated demonstrations, the language model learns to suggest possible causes for each proposal; a search module retrieves previous events matching those causes; and a scoring function learns to check whether the retrieved events could actually cause the proposal.
Results: Extensive experiments on several challenging real-world datasets show that, thanks to the reasoning ability of large language models, the framework significantly outperforms state-of-the-art event sequence models.

Large language models have shown astonishing performance on a wide range of reasoning tasks. In this paper, we investigate whether they could reason about real-world events and help improve the prediction performance of event sequence models. We design LAMP, a framework that integrates a large language model in event prediction. Particularly, the language model performs abductive reasoning to assist an event sequence model: the event model proposes predictions on future events given the past; instructed by a few expert-annotated demonstrations, the language model learns to suggest possible causes for each proposal; a search module finds out the previous events that match the causes; a scoring function learns to examine whether the retrieved events could actually cause the proposal. Through extensive experiments on several challenging real-world datasets, we demonstrate that our framework---thanks to the reasoning capabilities of large language models---could significantly outperform the state-of-the-art event sequence models.

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Pan Lu Baolin Peng Hao Cheng Michel Galley Kai-Wei Chang Ying Nian Wu Song-Chun Zhu Jianfeng Gao



Research problem: Large language models (LLMs) have made remarkable progress on natural language processing tasks, but they cannot access up-to-date information, use external tools, or perform precise mathematical and logical reasoning.
Motivation: To mitigate these inherent limitations of LLMs, the paper presents Chameleon, an AI system that augments LLMs for compositional reasoning.
Method: Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) to accomplish complex reasoning tasks. At its heart is an LLM-based planner that assembles the sequence of tools to execute in order to generate the final response.
Results: On two multimodal knowledge-intensive reasoning tasks, ScienceQA and TabMWP, Chameleon demonstrates its effectiveness. On ScienceQA it achieves 86.54% overall accuracy, improving the best published few-shot result by 11.37%; on TabMWP, GPT-4-powered Chameleon improves accuracy by 17.0%, lifting the state of the art to 98.78%. Analysis also shows that the GPT-4-powered planner selects tools more consistently and rationally than a ChatGPT-powered planner by inferring potential constraints from the instructions.

Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations as they are incapable of accessing up-to-date information (stored on the Web or in task-specific knowledge bases), using external tools, and performing precise mathematical and logical reasoning. In this paper, we present Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) for accomplishing complex reasoning tasks. At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools to execute to generate the final response. We showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0%, lifting the state of the art to 98.78%. Our analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection via inferring potential constraints from instructions, compared to a ChatGPT-powered planner.

PointGPT: Auto-regressively Generative Pre-training from Point Clouds
Guangyan Chen Meiling Wang Yi Yang Kai Yu Li Yuan Yufeng Yue



Research problem: How to extend the concept of the generative pre-training transformer (GPT) to point clouds while addressing their disorder, low information density, and the gap with downstream tasks.
Motivation: Inspired by advances in GPT, the authors propose PointGPT, a new approach better suited to point cloud data.
Method: The input point cloud is partitioned into multiple point patches, which are arranged in an ordered sequence by spatial proximity; an extractor-generator transformer decoder with a dual masking strategy then learns latent representations conditioned on the preceding point patches and predicts the next patch auto-regressively.
Results: The approach learns high-capacity models with good generalization, achieving state-of-the-art performance on various downstream tasks. The model reaches classification accuracies of 94.9% on ModelNet40 and 93.4% on ScanObjectNN, outperforming all other transformer models, and attains new state-of-the-art accuracy on all four few-shot learning benchmarks.

Large language models (LLMs) based on the generative pre-training transformer (GPT) have demonstrated remarkable effectiveness across a diverse range of downstream tasks. Inspired by the advancements of the GPT, we present PointGPT, a novel approach that extends the concept of GPT to point clouds, addressing the challenges associated with disorder properties, low information density, and task gaps. Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models. Our method partitions the input point cloud into multiple point patches and arranges them in an ordered sequence based on their spatial proximity. Then, an extractor-generator based transformer decoder, with a dual masking strategy, learns latent representations conditioned on the preceding point patches, aiming to predict the next one in an auto-regressive manner. Our scalable approach allows for learning high-capacity models that generalize well, achieving state-of-the-art performance on various downstream tasks. In particular, our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models. Furthermore, our method also attains new state-of-the-art accuracies on all four few-shot learning benchmarks.

LLM-Pruner: On the Structural Pruning of Large Language Models
Xinyin Ma Gongfan Fang Xinchao Wang



Research problem: This paper explores compression of large language models (LLMs) to address the deployment, inference, and training challenges posed by their model size.
Motivation: Large language models excel at language understanding and generation, but their enormous size imposes heavy costs on deployment, inference, and training.
Method: The proposed LLM-Pruner applies structural pruning, selectively removing non-critical coupled structures based on gradient information so as to maximally preserve the LLM's functionality. With the tuning technique LoRA, the pruned model's performance can be efficiently recovered in merely 3 hours using only 50K data.
Results: LLM-Pruner is validated on three LLMs, LLaMA, Vicuna, and ChatGLM, and the compressed models still show satisfactory ability on zero-shot classification and generation tasks.

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. With LLM being a general-purpose task solver, we explore its compression in a task-agnostic manner, which aims to preserve the multi-task solving and language generation ability of the original LLM. One challenge to achieving this is the enormous size of the training corpus of LLM, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, maximally preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered with the tuning technique LoRA in merely 3 hours, requiring only 50K data. We validate the LLM-Pruner on three LLMs, including LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation. The code will be made public.
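The core scoring step admits a short sketch: after a backward pass on a small calibration set, each prunable structure is scored by a first-order Taylor estimate of the loss change its removal would cause. This illustrates the per-matrix version only; the real method aggregates scores over groups of coupled structures that must be removed together, and the function name is an assumption.

```python
import torch

def structure_importance(weight: torch.Tensor) -> torch.Tensor:
    """Gradient-based importance of each output channel (row) of one weight
    matrix: |w * grad(w)| summed over the row, a first-order Taylor estimate
    of the loss increase if that channel were removed. Requires that
    loss.backward() has already been run on calibration data."""
    assert weight.grad is not None, "run loss.backward() on calibration data first"
    return (weight * weight.grad).abs().sum(dim=1)  # one score per row/channel

# Rows with the lowest scores are pruning candidates; a short LoRA fine-tune
# then recovers most of the lost performance.
```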

Find What You Want: Learning Demand-conditioned Object Attribute Space for Demand-driven Navigation
Hongcheng Wang Andy Guan Hong Chen Xiaoqi Li Mingdong Wu Hao Dong



Research problem: How can an intelligent agent in an unknown environment find an object that satisfies a user's demand.
Motivation: Traditional Visual Object Navigation (VON) requires the user to know the target object's name and requires that object to be present in the scene, conditions that are often hard to meet in reality. The paper therefore proposes Demand-driven Navigation (DDN), which takes the user's demand as the task instruction and asks the agent to find an object matching that demand.
Method: Common-sense knowledge is extracted from a large language model (LLM) to obtain textual attribute features of objects, which are then aligned with visual attribute features using Contrastive Language-Image Pre-training (CLIP).
Results: Experiments show that incorporating visual attribute features improves the agent's navigation performance, outperforming the baseline methods commonly used in VON and VLN tasks as well as methods using LLMs.

The task of Visual Object Navigation (VON) involves an agent's ability to locate a particular object within a given scene. To successfully accomplish the VON task, two essential conditions must be fulfilled: 1) the user knows the name of the desired object; and 2) the user-specified object actually is present within the scene. To meet these conditions, a simulator can incorporate predefined object names and positions into the metadata of the scene. However, in real-world scenarios, it is often challenging to ensure that these conditions are always met. Humans in an unfamiliar environment may not know which objects are present in the scene, or they may mistakenly specify an object that is not actually present. Nevertheless, despite these challenges, humans may still have a demand for an object, which could potentially be fulfilled by other objects present within the scene in an equivalent manner. Hence, this paper proposes Demand-driven Navigation (DDN), which leverages the user's demand as the task instruction and prompts the agent to find an object which matches the specified demand. DDN aims to relax the stringent conditions of VON by focusing on fulfilling the user's demand rather than relying solely on specified object names. This paper proposes a method of acquiring textual attribute features of objects by extracting common sense knowledge from a large language model (LLM). These textual attribute features are subsequently aligned with visual attribute features using Contrastive Language-Image Pre-training (CLIP). Incorporating the visual attribute features as prior knowledge enhances the navigation process. Experiments on AI2Thor with the ProcThor dataset demonstrate that the visual attribute features improve the agent's navigation performance and outperform the baseline methods commonly used in the VON and VLN tasks, as well as methods with LLMs. The codes and demonstrations can be viewed at https://sites.google.com/view/demand-driven-navigation.

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Zhengxuan Wu Atticus Geiger Thomas Icard Christopher Potts Noah Goodman



Research problem: How to obtain human-interpretable explanations of large, general-purpose language models while ensuring the explanation method is faithful to the causal dynamics underlying model behavior and generalizes robustly to unseen inputs.
Motivation: AI safety requires the ability to understand and explain the behavior of large language models, and interpretability methods must reflect the causal structure of model behavior and generalize robustly.
Method: Building on Distributed Alignment Search (DAS), a powerful gradient-descent method grounded in the theory of causal abstraction, the remaining brute-force search steps are replaced with learned parameters, yielding Boundless DAS. This enables efficient search for interpretable causal structure in large language models while they follow instructions.
Results: Applying Boundless DAS to the Alpaca model (7B parameters), the authors find that it solves a simple numerical reasoning problem by implementing a causal model with two interpretable boolean variables, and that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of the largest and most widely deployed language models.

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner-workings of our largest and most widely deployed language models.

Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan Vijay Kumar b g Samuel Schulter Manmohan Chandraker Yun Fu



Research problem: This paper addresses the fact that visual question answering (VQA) devotes the same effort to every question, and explores a question decomposition strategy.
Motivation: VQA has traditionally been treated as a single-step task in which every question receives the same treatment, unlike natural human question-answering strategies.
Method: The authors probe the ability of recently developed large vision-language models to use human-written decompositions and to produce their own decompositions of visual questions, and introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
Results: Validation on eight VQA tasks across three domains shows consistent accuracy improvements, including gains of more than 20% on medical VQA datasets, and boosts BLIP-2's zero-shot performance above chance on a VQA reformulation of the challenging Winoground task.

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/

Collaborative Alignment of NLP Models
Fereshte Khani Marco Tulio Ribeiro



Research problem: Existing natural language processing models require post-training adjustments to enforce business rules, rectify undesired behavior, and align with user values, yet defining all possible concepts is a difficult task for any single entity.
Motivation: To address this, the authors propose CoAlign, a multi-user collaborative model-alignment framework.
Method: CoAlign lets multiple users interact with the model to operationalize their concepts, learning a local model for each concept and a global model that integrates the original data with all concepts, then steering a large language model to generate instances within concept boundaries where the local and global models disagree.
Results: Experiments show that CoAlign effectively helps multiple users operationalize concepts and avoid interference across a variety of scenarios, tasks, and models.

Despite substantial advancements, Natural Language Processing (NLP) models often require post-training adjustments to enforce business rules, rectify undesired behavior, and align with user values. These adjustments involve operationalizing "concepts"—dictating desired model responses to certain inputs. However, it's difficult for a single entity to enumerate and define all possible concepts, indicating a need for a multi-user, collaborative model alignment framework. Moreover, the exhaustive delineation of a concept is challenging, and an improper approach can create shortcuts or interfere with original data or other concepts. To address these challenges, we introduce CoAlign, a framework that enables multi-user interaction with the model, thereby mitigating individual limitations. CoAlign aids users in operationalizing their concepts using Large Language Models, and relying on the principle that NLP models exhibit simpler behaviors in local regions. Our main insight is learning a \emph{local} model for each concept, and a \emph{global} model to integrate the original data with all concepts. We then steer a large language model to generate instances within concept boundaries where local and global disagree. Our experiments show CoAlign is effective at helping multiple users operationalize concepts and avoid interference for a variety of scenarios, tasks, and models.

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
Yasheng SUN Yifan Yang Houwen Peng Yifei Shen Yuqing Yang Han Hu Lili Qiu Hideki Koike



Research problem: How can an image manipulation task be described accurately and comprehensively so that it faithfully reflects human intention?
Motivation: Because of the inherent uncertainty and ambiguity of linguistic expression, precisely and exhaustively describing an image manipulation task in natural language is laborious and sometimes impossible.
Method: A novel manipulation method, ImageBrush, learns visual instructions for more accurate image editing. The key idea is to use a pair of transformation images as the visual instruction, which both captures human intention precisely and is easy to obtain in real-world scenarios.
Results: Experiments show that the method generates appealing manipulation results conforming to the transformations entailed in the demonstrations, and the model exhibits strong generalization on downstream tasks such as pose transfer, image translation, and video inpainting.

While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.

Fairness-guided Few-shot Prompting for Large Language Models
Huan Ma Changqing Zhang Yatao Bian Lemao Liu Zhirui Zhang Peilin Zhao Shu Zhang Huazhu Fu Qinghua Hu Bingzhe Wu



Research problem: The in-context learning performance of large language models is affected by the choice of training examples, their order, and the prompt format; how can suitable prompts be constructed to improve it?
Motivation: Prior research shows that in-context learning performance varies with training examples, example order, and prompt format, so constructing an appropriate prompt is essential for improving in-context learning.
Method: The paper revisits the problem from the view of predictive bias, introducing a metric that evaluates the predictive bias of a fixed prompt against labels or a given attribute, and proposes a novel greedy search strategy to identify near-optimal prompts that improve in-context learning performance.
Results: Comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks show that the method improves the model's in-context learning performance in an effective and interpretable manner.

Large language models have demonstrated surprising ability to perform in-context learning, i.e., these models can be directly applied to solve numerous downstream tasks by conditioning on a prompt constructed by a few input-output examples. However, prior research has shown that in-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats. Therefore, the construction of an appropriate prompt is essential for improving the performance of in-context learning. In this paper, we revisit this problem from the view of predictive bias. Specifically, we introduce a metric to evaluate the predictive bias of a fixed prompt against labels or a given attribute. Then we empirically show that prompts with higher bias always lead to unsatisfactory predictive quality. Based on this observation, we propose a novel greedy search strategy to identify the near-optimal prompt for improving the performance of in-context learning. We perform comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks. Our results indicate that our method can enhance the model's in-context learning performance in an effective and interpretable manner.
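A hedged sketch of the two components follows. The bias metric here probes the prompt with near-content-free inputs and measures how far the predicted label distribution is from uniform; that probe design is an assumption in the spirit of the abstract, not the paper's exact metric, and `predict` is a hypothetical stand-in for the LLM.

```python
import numpy as np

def predictive_bias(predict, prompt, probe_inputs, n_labels: int) -> float:
    """Total-variation distance between the label distribution induced by a
    fixed prompt on probe inputs and the uniform distribution. Higher means
    the prompt pushes predictions toward some labels regardless of content."""
    counts = np.zeros(n_labels)
    for x in probe_inputs:
        counts[predict(prompt, x)] += 1
    freq = counts / counts.sum()
    return 0.5 * np.abs(freq - 1.0 / n_labels).sum()

def greedy_prompt_search(candidates, score, k: int = 4):
    """Greedily grow the demonstration set, keeping at each step the example
    whose addition yields the lowest bias score for the resulting prompt."""
    prompt = []
    for _ in range(k):
        best = min(candidates, key=lambda c: score(prompt + [c]))
        prompt.append(best)
        candidates = [c for c in candidates if c is not best]
    return prompt
```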

RADAR: Robust AI-Text Detection via Adversarial Learning
Xiaomeng Hu Pin-Yu Chen Tsung-Yi Ho



Research problem: How to distinguish human-written text from AI text generated by large language models (LLMs), and how to address the misuse and fairness problems this entails.
Motivation: Current AI-text detectors are not robust to LLM-based paraphrasing, so a new method that reliably identifies AI text is needed.
Method: A new framework, RADAR, jointly trains a robust AI-text detector via adversarial learning. RADAR is built on adversarial training of a paraphraser and a detector: the paraphraser aims to generate realistic content that evades AI-text detection, while the detector's feedback is used to update the paraphraser, and vice versa.
Results: Evaluated with 8 different LLMs across 4 datasets, RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. RADAR also transfers strongly from instruction-tuned LLMs to other LLMs, and its improved capability is further evaluated via GPT-3.5-Turbo.

Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a $\underline{r}$obust $\underline{A}$I-text $\underline{d}$etector via $\underline{a}$dversarial lea$\underline{r}$ning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.

Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning
Xinyi Wang Wanrong Zhu Michael Saxon Mark Steyvers William Yang Wang



Research problem: This paper examines the in-context learning phenomenon of pre-trained large language models through a Bayesian lens and investigates the mechanism behind it.
Motivation: Existing literature highlights the sensitivity of pre-trained large language models' in-context learning to the selection of few-shot demonstrations, yet current understanding remains disconnected from real-world pre-trained language models.
Method: Real-world pre-trained language models are viewed as latent variable models, and an algorithm is proposed that selects optimal demonstrations from a set of annotated data using a small language model, then generalizes the selected demonstrations directly to larger language models.
Results: The method outperforms baselines on average over eight GPT models on eight real-world text classification datasets and shows practical value on GSM8K, a math word problem dataset. These empirical findings support the hypothesis that pre-trained language models implicitly infer a latent variable containing task information.

In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.

VPGTrans: Transfer Visual Prompt Generator across LLMs
Ao Zhang Hao Fei Yuan Yao Wei Ji Li Li Zhiyuan Liu Tat-Seng Chua



Research problem: How to reduce the cost of training the visual prompt generator (VPG) in multimodal large language models (MLLMs)?
Motivation: Pre-training a brand-new MLLM on tremendous image-text pairs is extremely resource-consuming, so connecting an existing language model with a comparatively lightweight visual prompt generator is a feasible paradigm; yet even tuning the VPG remains costly.
Method: The first study of VPG transferability across language models, aiming to reduce VPG training cost. Specifically, VPG transfer is explored across language models of different sizes and types; based on key factors that maximize transfer efficiency, a simple yet highly effective two-stage transfer framework, VPGTrans, is developed.
Results: VPGTrans enables VPG transfer from BLIP-2 OPT 2.7B to BLIP-2 OPT 6.7B with less than 10% of the GPU hours and only 10.7% of the training data required to train a new VPG from scratch. The practical value of the approach is further demonstrated by customizing two new MLLMs, VL-LLaMA and VL-Vicuna, from the recently released LLaMA and Vicuna language models.

Since developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm. However, further tuning the VPG component of the MLLM still incurs significant computational costs, such as thousands of GPU hours and millions of training data points. An alternative solution is transferring an existing VPG from one MLLM to the target MLLM. In this work, we investigate VPG transferability across LLMs for the first time, aiming to reduce the cost of VPG training. Specifically, we explore VPG transfer across different LLM sizes (e.g., small-to-large) and types. We identify key factors to maximize transfer efficiency, based on which we develop a simple yet highly effective two-stage transfer framework, called VPGTrans. Notably, it enables VPG transfer from BLIP-2 OPT 2.7B to BLIP-2 OPT 6.7B with less than 10% of the GPU hours using only 10.7% of the training data compared to training a VPG for OPT 6.7B from scratch. Furthermore, we provide a series of intriguing findings and discuss potential explanations behind them. Finally, we showcase the practical value of our VPGTrans approach, by customizing two novel MLLMs, including VL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs.

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
Ge Zheng Bin Yang Jiajin Tang Hong-Yu Zhou Sibei Yang



Research problem: How to transfer the multi-step reasoning ability of large language models to multimodal settings and address the challenges this introduces.
Motivation: Current large language models reason well in the language modality alone, but multimodal reasoning still faces challenges such as the impractical need for labor-intensive annotation and limited flexibility, generalizability, and explainability.
Method: This work proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and brings multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition, then integrating the visual recognition capability of visual models into the joint reasoning process.
Results: The rationales generated by DDCoT not only significantly improve the reasoning ability of both large and small language models in zero-shot prompting and fine-tuning, substantially outperforming state-of-the-art methods, but also exhibit impressive generalizability and explainability.

A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: “keeping critical thinking” and “letting everyone do their jobs” in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.

Large Language Models are Visual Reasoning Coordinators
Liangyu Chen Bo Li Sheng Shen Jingkang Yang Chunyuan Li Kurt Keutzer Trevor Darrell Ziwei Liu



Research problem: How to harness multiple vision-language models (VLMs) for effective visual reasoning.
Motivation: Existing methods struggle to aggregate these complementary VLMs with the desired higher-order communication.
Method: A new paradigm, Cola, coordinates multiple VLMs through natural-language communication that leverages their distinct and complementary capabilities.
Results: Experiments show that the instruction-tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering, outside-knowledge VQA, visual entailment, and visual spatial reasoning, while the in-context learning variant, Cola-Zero, is competitive in zero- and few-shot settings without fine-tuning.

Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desired higher-order communications. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a large language model (LLM) can efficiently coordinate multiple VLMs by facilitating natural language communication that leverages their distinct and complementary capabilities. Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities.

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Rui Yang Lin Song Yanwei Li Sijie Zhao Yixiao Ge Xiu Li Ying Shan



Research problem: How to enable large language models to use multimodal tools effectively.
Motivation: Advanced proprietary LLMs such as ChatGPT and GPT-4 have shown great potential for tool use through sophisticated prompt engineering, but these models typically rely on prohibitive computational costs and publicly inaccessible data.
Method: GPT4Tools, based on self-instruction, enables open-source LLMs such as LLaMA and OPT to use tools. It generates an instruction-following dataset by prompting an advanced teacher with diverse multimodal contexts, then uses Low-Rank Adaptation (LoRA) optimization to help open-source LLMs solve a range of visual problems, including visual comprehension and image generation.
Results: Extensive experiments show that the method is effective across various language models, significantly improving the accuracy of invoking seen tools while also enabling zero-shot use of unseen tools.

This paper aims to efficiently enable Large Language Models (LLMs) to use multi-modal tools. The advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. Nevertheless, these models typically rely on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose the GPT4Tools based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts. By using the Low-Rank Adaptation (LoRA) optimization, our approach facilitates the open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, which is performed in both zero-shot and fine-tuning ways. Extensive experiments demonstrate the effectiveness of our method on various language models, which not only significantly improves the accuracy of invoking seen tools, but also enables the zero-shot capacity for unseen tools.

Universal Prompt Tuning for Graph Neural Networks
Taoran Fang Yunchao Mercer Zhang Yang Yang Chunping Wang Lei CHEN



Research problem: How to design a prompt-based tuning method for graph neural networks (GNNs) that works under any pre-training strategy.
Motivation: The graph field exhibits diverse pre-training strategies, making it challenging to design suitable prompt-based tuning methods. Although pioneering work has devised specialized prompting functions for models pre-trained with edge prediction, those methods are limited to specific pre-trained GNN models and lack broader applicability.
Method: This paper proposes a universal prompt-based tuning method called Graph Prompt Feature (GPF) for pre-trained GNN models under any pre-training strategy. GPF operates on the input graph's feature space and can theoretically achieve an effect equivalent to any form of prompting function, so the prompting function corresponding to each pre-training strategy no longer needs to be specified explicitly; instead, GPF adaptively obtains the prompted graph for the downstream task.
Results: The method beats fine-tuning by about 1.4% on average in full-shot scenarios and about 3.2% in few-shot scenarios. Moreover, it significantly outperforms existing specialized prompt-based tuning methods when applied to models using the pre-training strategies those methods specialize in. These advantages make it a compelling alternative to fine-tuning for downstream adaptation.

In recent years, prompt tuning has sparked a research surge in adapting pre-trained models. Unlike the unified pre-training strategy employed in the language field, the graph field exhibits diverse pre-training strategies, posing challenges in designing appropriate prompt-based tuning methods for graph neural networks. While some pioneering work has devised specialized prompting functions for models that employ edge prediction as their pre-training tasks, these methods are limited to specific pre-trained GNN models and lack broader applicability. In this paper, we introduce a universal prompt-based tuning method called Graph Prompt Feature (GPF) for pre-trained GNN models under any pre-training strategy. GPF operates on the input graph's feature space and can theoretically achieve an equivalent effect to any form of prompting function. Consequently, we no longer need to illustrate the prompting function corresponding to each pre-training strategy explicitly. Instead, we employ GPF to obtain the prompted graph for the downstream task in an adaptive manner. We provide rigorous derivations to demonstrate the universality of GPF and guarantee its effectiveness. The experimental results under various pre-training strategies indicate that our method performs better than fine-tuning, with an average improvement of about 1.4% in full-shot scenarios and about 3.2% in few-shot scenarios. Moreover, our method significantly outperforms existing specialized prompt-based tuning methods when applied to models utilizing the pre-training strategy they specialize in. These numerous advantages position our method as a compelling alternative to fine-tuning for downstream adaptations.
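Since GPF operates directly on the input feature space, a minimal sketch is short: a single learnable vector added to every node feature, tuned for the downstream task while the pre-trained GNN stays frozen. This reading follows the abstract; the class name and the frozen-backbone usage pattern are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphPromptFeature(nn.Module):
    """Sketch of a GPF-style prompt: one learnable vector p added to every
    node's input features, so the 'prompted graph' is (X + p, A). Only p is
    trained for the downstream task; the pre-trained GNN remains frozen."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.p = nn.Parameter(torch.zeros(feature_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.p  # x: (num_nodes, feature_dim)

# Hypothetical usage with a frozen pre-trained GNN:
#   logits = frozen_gnn(GraphPromptFeature(64)(node_feats), edge_index)
```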

Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions
Chengzhi Cao Chao Yang Ruimao Zhang Shuang Li



研究问题:提出一种可解释的模型,通过分析人类运动轨迹来揭示其行为模式。
动机:人类行为受意图和周围环境因素的影响,如与周围物体的空间关系。
方法:使用一组包含意图变量的空间-时间逻辑规则来模拟这种行为,并设计了一种EM学习算法来学习模型参数和规则内容。
效果:在行人和NBA篮球运动员数据集上,该模型显示出优越的可解释性和预测性能,取得了有希望的结果。

We propose an interpretable model to uncover the behavioral patterns of human movements by analyzing their trajectories. Our approach is based on the belief that human actions are driven by intentions and are influenced by environmental factors such as spatial relationships with surrounding objects. To model this, we use a set of spatial-temporal logic rules that include intention variables as principles. These rules are automatically discovered and used to capture the dynamics of human actions. To learn the model parameters and rule content, we design an EM learning algorithm that treats the unknown rule content as a latent variable. In the E-step, we evaluate the posterior over the latent rule content, and in the M-step, we optimize the rule generator and model parameters by maximizing the expected log-likelihood. Our model has wide-ranging applications in areas such as sports analytics, robotics, and autonomous cars. We demonstrate the model's superior interpretability and prediction performance on both pedestrian and NBA basketball player datasets, achieving promising results.

A Theory of Multimodal Learning
Zhou Lu



研究问题:本研究旨在探索多模态学习的理论框架,以解释多模态模型在单模态任务上的性能优势。
动机:尽管多模态学习的实践已经显示出优越性,但其理论依据尚不明确。
方法:通过研究多模态学习算法的泛化特性,提出了一个理论框架来解释这一现象。
效果:研究发现,当模态之间存在连接和异质性时,多模态学习可以实现比单模态学习更好的泛化性能,其优势可达到$O(\sqrt{n})$倍,其中$n$代表样本大小。

Human perception of the empirical world involves recognizing the diverse appearances, or 'modalities', of underlying objects. Despite the longstanding consideration of this perspective in philosophy and cognitive science, the study of multimodality remains relatively under-explored within the field of machine learning. Nevertheless, current studies of multimodal machine learning are limited to empirical practices, lacking theoretical foundations beyond heuristic arguments. An intriguing finding from the practice of multimodal learning is that a model trained on multiple modalities can outperform a finely-tuned unimodal model, even on unimodal tasks. This paper provides a theoretical framework that explains this phenomenon, by studying generalization properties of multimodal learning algorithms. We demonstrate that multimodal learning allows for a superior generalization bound compared to unimodal learning, up to a factor of $O(\sqrt{n})$, where $n$ represents the sample size. Such advantage occurs when both connection and heterogeneity exist between the modalities.

Random-Access Infinite Context Length for Transformers
Amirkeivan Mohtashami Martin Jaggi



研究问题:Transformers在处理长文本时,由于其注意力机制的大量内存需求,限制了其处理长上下文的能力。
动机:现有的方法如循环记忆或基于检索的增强,或者牺牲了注意力的随机访问灵活性,或者依赖于与模型的注意力不兼容的单独机制进行相关上下文检索。
方法:本文提出了一种新的方法,通过使用地标令牌来代表输入的每一块,并训练注意力选择相关的块,使得可以直接通过注意力机制检索块,而不需要依赖单独的机制。
效果:该方法可以获得与Transformer-XL相当的性能,同时显著减少每一步检索的令牌数量。此外,使用该方法对LLaMA 7B进行微调,成功将其上下文长度容量扩展到超过32k个令牌,使其能够在GPT-4的上下文长度下进行推理。

While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to over 32k tokens, allowing for inference at the context lengths of GPT-4. We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.
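
A simplified sketch of the retrieval step, assuming one landmark key per block and top-k block selection; the actual method folds this selection into the attention computation itself rather than running it as a separate stage.

```python
import torch

def retrieve_blocks(query: torch.Tensor,
                    landmark_keys: torch.Tensor,
                    k: int) -> torch.Tensor:
    """query: [d]; landmark_keys: [num_blocks, d], one key per input block.
    Returns the indices of the k blocks whose landmark tokens the query
    attends to most strongly; those blocks' tokens then enter attention."""
    scores = landmark_keys @ query          # attention logits over landmarks
    return torch.topk(scores, k).indices
```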

Deductive Verification of Chain-of-Thought Reasoning
Zhan Ling Yunhao Fang Xuanlin Li Zhiao Huang Mingu Lee Roland Memisevic Hao Su



研究问题:大型语言模型在进行各种推理任务时,如何通过链式思维提示进行精确的演绎推理,并确保其推理过程的可信性。
动机:当前的链式思维提示虽然能让模型产生更全面的推理过程,但其对中间推理步骤的强调可能会无意中引入幻觉和累积错误,从而限制模型解决复杂推理任务的能力。
方法:我们提出了一种基于自然语言的演绎推理格式——自然程序(Natural Program),将推理验证过程分解为一系列逐步的子过程,每个子过程只接收其必要的上下文和前提。
效果:通过将验证过程整合到每个演绎推理阶段,我们显著提高了生成推理步骤的严谨性和可信度,同时也提高了复杂推理任务的答案正确率。

Large Language Models (LLMs) significantly benefit from Chain-of-thought (CoT) prompting in performing various reasoning tasks. While CoT allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors, thereby limiting models’ ability to solve complex reasoning tasks. Inspired by how humans engage in careful and meticulous deductive logical reasoning processes to solve tasks, we seek to enable language models to perform explicit and rigorous deductive reasoning, and also ensure the trustworthiness of their reasoning process through self-verification. However, directly verifying the validity of an entire deductive reasoning process is challenging, even with advanced models like ChatGPT. In light of this, we propose to decompose a reasoning verification process into a series of step-by-step subprocesses, each only receiving their necessary context and premises. To facilitate this procedure, we propose Natural Program, a natural language-based deductive reasoning format. Our approach enables models to generate precise reasoning steps where subsequent steps are more rigorously grounded on prior steps. It also empowers language models to carry out reasoning self-verification in a step-by-step manner. By integrating this verification process into each deductive reasoning stage, we significantly enhance the rigor and trustfulness of generated reasoning steps. Along this process, we also improve the answer correctness on complex reasoning tasks.

Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran Yashin Dicente Cid Amal Lahiani Fabian J Theis Tingying Peng Eldad Klaiman



研究问题:训练多模态基础模型具有挑战性,因为多模态数据集的可用性有限。
动机:尽管许多公共数据集将图像与文本配对,但很少有数据集将图像与音频或文本与音频结合,同时对齐全部三种模态的数据集更为罕见;医疗、基础设施和交通等关键领域尤其受缺失模态的影响。
方法:我们引入LoReTTa(以传递性和交换性预训练策略链接模态)来解决这一问题。我们的自监督框架将因果建模和掩码建模与交换律和传递律相结合,使我们可以在模态内部和模态之间进行转换。
效果:我们在合成、医学和强化学习数据集上广泛评估了我们的方法。在不同的领域中,我们的通用多模态转换器在涉及缺失模态元组的任务上始终优于强大的基线模型,如GPT、BERT和CLIP。

Training multimodal foundation models is challenging due to the limited availability of multimodal datasets. While many public datasets pair images with text, few combine images with audio or text with audio. Even rarer are datasets that align all three modalities at once. Critical domains such as healthcare, infrastructure, or transportation are particularly affected by missing modalities. This makes it difficult to integrate all modalities into a large pre-trained neural network that can be used out-of-the-box or fine-tuned for different downstream tasks. We introduce LoReTTa ($\textbf{L}$inking m$\textbf{O}$dalities with a t$\textbf{R}$ansitive and commutativ$\textbf{E}$ pre-$\textbf{T}$raining s$\textbf{T}$r$\textbf{A}$tegy) to address this understudied problem. Our self-supervised framework unifies causal modeling and masked modeling with the rules of commutativity and transitivity. This allows us to transition within and between modalities. As a result, our pre-trained models are better at exploring the true underlying joint probability distribution. Given a dataset containing only the disjoint combinations $(A, B)$ and $(B, C)$, LoReTTa can model the relation $A \leftrightarrow C$ with $A \leftrightarrow B \leftrightarrow C$. In particular, we show that a transformer pre-trained with LoReTTa can handle any mixture of modalities at inference time, including the never-seen pair $(A, C)$ and the triplet $(A, B, C)$. We extensively evaluate our approach on a synthetic, medical, and reinforcement learning dataset. Across different domains, our universal multimodal transformer consistently outperforms strong baselines such as GPT, BERT, and CLIP on tasks involving the missing modality tuple.

CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models
Zhijing Jin Yuen Chen Felix Leeb Luigi Gresele Ojasv Kamal Zhiheng LYU Kevin Blin Fernando Gonzalez Adauto Max Kleiman-Weiner Mrinmaya Sachan Bernhard Schölkopf



研究问题:大型语言模型是否能够连贯地进行因果关系推理?
动机:现有的自然语言处理工作主要集中在评估大型语言模型的常识因果关系推理,而没有评估模型是否能根据一组明确的形式规则进行因果推断。
方法:我们提出了一个新的自然语言处理任务——自然语言中的因果推理,并构建了包含1万个样本的大型数据集CLadder。基于一系列因果图和查询(关联、干预和反事实),我们通过一个oracle因果推理引擎得到符号化问题及其真值答案,再将其翻译成自然语言。
效果:我们的实验表明,这个任务对大型语言模型来说极具挑战性。我们进行了深入的分析,以深入了解大型语言模型的因果推理能力。

The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating _commonsense_ causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined _formal rules_. To address this, we propose a new NLP task, _causal inference in natural language_, inspired by the _“causal inference engine”_ postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs.

Convolutional Visual Prompt for Robust Visual Perception
Yun-Yun Tsai Chengzhi Mao Junfeng Yang



研究问题:本文旨在解决视觉模型在面对分布外(OOD)样本时容易受到干扰的问题,并提出了一种无需标签的测试时间适应方法。
动机:现有的视觉提示方法虽然可以对大规模视觉模型进行输入空间的适应,但需要依赖高维的附加向量和标记数据,这会导致在无标签的自监督测试时间设置中进行模型适应时过拟合。
方法:本文提出了一种卷积视觉提示(CVP)的方法,用于标签自由的测试时间适应,以实现鲁棒的视觉感知。由于CVP的结构性质,其所需的可训练参数少于标准视觉提示的1%,从而防止了过拟合。
效果:在各种OOD视觉感知任务上的大量实验和分析证明了该方法的有效性,相比多种大规模模型,其鲁棒性最高提升5.87%。

Vision models are often vulnerable to out-of-distribution (OOD) samples without adapting. While visual prompts offer a lightweight method of input-space adaptation for large-scale vision models, they rely on a high-dimensional additive vector and labeled data. This leads to overfitting when adapting models in a self-supervised test-time setting without labels. We introduce convolutional visual prompts (CVP) for label-free test-time adaptation for robust visual perception. The structured nature of CVP demands fewer trainable parameters, less than 1\% compared to standard visual prompts, combating overfitting. Extensive experiments and analysis on a wide variety of OOD visual perception tasks show that our approach is effective, improving robustness by up to 5.87\% over several large-scale models.

Learning to Compress Prompts with Gist Tokens
Jesse Mu Xiang Lisa Li Noah Goodman



研究问题:如何有效地利用语言模型的多任务能力,同时避免在输入上下文窗口中重复编码相同的提示。
动机:目前的预训练语言模型在处理多任务时,需要反复编码相同的提示,这既占用了宝贵的输入空间,又降低了计算效率。
方法:提出一种新的方法——gisting,通过训练语言模型将提示压缩为更小的“要点”令牌集,这些令牌集会被缓存并重复使用以提高计算效率。
效果:在解码器(LLaMA-7B)和编码器-解码器(FLAN-T5-XXL)语言模型上,gisting可将提示压缩至多26倍,带来高达40%的FLOPs减少、4.2%的墙钟时间加速以及存储节省,同时输出质量损失极小。

Prompting is the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient. Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task. To avoid this trade-off entirely, we present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens which can be cached and reused for compute efficiency. Gist models can be trained with no additional cost over standard instruction finetuning by simply modifying Transformer attention masks to encourage prompt compression. On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, and storage savings, all with minimal loss in output quality.
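
The attention-mask modification can be sketched as follows. This is an illustrative reconstruction, assuming the gist tokens occupy a contiguous span `[gist_start, gist_end)`; it shows how positions after the gist tokens are cut off from the raw prompt, so the prompt must be compressed into the gists.

```python
import torch

def gist_mask(seq_len: int, gist_start: int, gist_end: int) -> torch.Tensor:
    """Boolean causal mask with gisting: True = attention allowed.
    Tokens in [0, gist_start) are the prompt, [gist_start, gist_end) are the
    gist tokens, and everything after must route through the gists."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[gist_end:, :gist_start] = False   # post-gist tokens cannot see the prompt
    return mask
```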

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Gen Luo Yiyi Zhou Tianhe Ren Shengxin Chen Xiaoshuai Sun Rongrong Ji



研究问题:如何有效地将大型语言模型(LLMs)扩展到多模态能力,如视觉-语言学习。
动机:现有的多模态解决方案成本过高,需要优化大量参数,并在多模态指令调优前进行大规模的预训练。
方法:提出一种新颖且经济的解决方案,称为混合模态适应(MMA)。MMA采用轻量级模块(适配器)连接图像编码器和LLM,实现图像和语言模型的联合优化,并配备路由算法帮助LLM在单模态和多模态指令之间自动切换。
效果:实验结果表明,MMA和LaVIN在多模态科学问答和多模态对话两种设置下的性能和训练效率均优于现有多模态LLMs,且LaVIN作为通用聊天机器人具有巨大潜力。更重要的是,LaVIN的实际开销非常低,验证了MMA的有效性。

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. To validate MMA, we apply it to a recent LLM called LLaMA and term the resulting large vision-language instructed model LaVIN. We conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and the superior training efficiency of LaVIN compared to existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our code is anonymously released at: https://anonymous.4open.science/r/LaVIN--1067.

Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
Kalpesh Krishna Yixiao Song Marzena Karpinska John Frederick Wieting Mohit Iyyer



研究问题:大型语言模型的恶意使用(如虚假内容创建和学术剽窃)促使人们开发识别AI生成文本的方法,包括基于水印或异常检测的方法。然而,这些检测算法对AI生成文本被改写(paraphrase)后的鲁棒性尚不清楚。
动机:为了对这些检测器进行压力测试,我们构建了一个11B参数的段落改写模型(DIPPER),它可以改写段落、以周围上下文为条件,并控制词汇多样性和内容重排。
方法:通过DIPPER改写三个大型语言模型(包括GPT3.5-davinci-003)生成的文本,成功地避开了几种检测器,包括水印、GPTZero、DetectGPT和OpenAI的文本分类器。
效果:为了提高AI生成文本检测对改写攻击的鲁棒性,我们引入了一种简单的防御策略:检索语义相似的生成结果,并由语言模型API提供商维护。给定候选文本,我们的算法在API以前生成的序列数据库中搜索,寻找与候选文本相似度在一定阈值内匹配的序列。

The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings while only classifying 1% of human-written sequences as AI-generated. We open-source our models, code and data.
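
A sketch of the defense's core lookup, under the assumption that both the candidate and the stored generations are embedded with some semantic similarity model and compared by cosine similarity; the threshold value here is illustrative.

```python
import numpy as np

def flag_as_ai_generated(candidate_emb: np.ndarray,
                         db_embs: np.ndarray,
                         threshold: float = 0.75) -> bool:
    """db_embs: [num_generations, d] unit-normalized embeddings of every
    sequence the API previously generated; candidate_emb: [d]."""
    candidate_emb = candidate_emb / np.linalg.norm(candidate_emb)
    sims = db_embs @ candidate_emb          # cosine similarity to each generation
    return bool(sims.max() >= threshold)    # paraphrases still land near the source
```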

Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Rame Guillaume Couairon Corentin Dancette Jean-Baptiste Gaya Mustafa Shukor Laure Soulier Matthieu Cord



研究问题:本文旨在解决预训练模型在面对多样化奖励时,由于代理奖励的不完美性可能导致训练效果不佳的问题。
动机:目前的预训练模型主要依赖大规模的无监督数据集进行预训练,然后通过人类反馈的强化学习进行微调。然而,代理奖励的不完美性可能会阻碍训练并导致次优结果。
方法:本文提出了一种多策略的方法,通过奖励汤(rewarded soup)来应对多样化的奖励。具体来说,我们首先独立地专门化多个网络(每个代理奖励一个),然后线性插值它们的权重。这种方法在实践中是成功的,因为我们发现,当从共享的预训练初始状态微调不同的奖励时,这些权重仍然是线性连接的。
效果:我们在文本到文本(摘要、问答、有帮助的助手、评论)、文本到图像(图像字幕、文本到图像生成、视觉基础)和控制(移动)任务上展示了该方法的有效性。我们希望提高深度模型的对齐度,以及它们与世界多样性的交互方式。

Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfections in the proxy reward may hinder the training and lead to suboptimal results; the diversity of objectives in real-world tasks and human opinions exacerbate the issue. This paper proposes embracing the heterogeneity of diverse rewards by following a multi-policy strategy. Rather than focusing on a single a priori reward, we aim for Pareto-optimal generalization across the entire space of preferences. To this end, we propose rewarded soup, first specializing multiple networks independently (one for each proxy reward) and then interpolating their weights linearly. This succeeds empirically because we show that the weights remain linearly connected when fine-tuned on diverse rewards from a shared pre-trained initialization. We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding), and control (locomotion) tasks. We hope to enhance the alignment of deep models, and how they interact with the world in all its diversity.
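
The interpolation itself is a one-liner over state dicts; a minimal sketch follows, assuming preference weights that sum to one.

```python
def rewarded_soup(state_dicts: list[dict], prefs: list[float]) -> dict:
    """Linearly interpolate N networks fine-tuned on different proxy rewards
    from a shared pre-trained initialization; `prefs` sums to 1 and encodes
    the desired trade-off between the rewards."""
    return {key: sum(w * sd[key] for w, sd in zip(prefs, state_dicts))
            for key in state_dicts[0]}
```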

Strong and Precise Modulation of Human Percepts via Robustified ANNs
Guy Gaziv Michael J. Lee James J. DiCarlo



研究问题:人工神经网络(ANNs)的视觉对象类别报告对微小的对抗性图像扰动非常敏感,而人类类别报告则相对稳定。本研究旨在探究ANNs是否能够准确地引导对人类感知的强烈和精确的干预。
动机:由于人类类别报告相对于微小的图像扰动是相对稳定的,这表明ANNs在科学上是不完整的人类视觉感知模型。因此,本研究希望探究经过强化的ANNs是否能够可靠地发现低范数图像扰动,从而对人类感知产生强烈的干扰。
方法:通过使用标准ANN模型生成小范数图像扰动,并观察人类对象类别感知的稳定性。同时,使用经过强化的ANNs来发现低范数图像扰动,以改变人类类别感知向特定预设感知的方向。
效果:研究发现,经过强化的ANNs能够可靠地发现低范数图像扰动,这些扰动对人类感知产生了强烈的干扰。此外,经过强化的ANNs还能够支持精确的感知状态干预,即通过构建低范数图像扰动来将人类类别感知强烈地改变为特定的预设感知。综上所述,现代生物视觉处理模型已经足够准确,可以对人类感知进行强烈和精确的干预。

The visual object category reports of artificial neural networks (ANNs) are notoriously sensitive to tiny, adversarial image perturbations. Because human category reports (aka human percepts) are thought to be insensitive to those same small-norm perturbations -- and locally stable in general -- this argues that ANNs are incomplete scientific models of human visual perception. Consistent with this, we show that when small-norm image perturbations are generated by standard ANN models, human object category percepts are indeed highly stable. However, in this very same "human-presumed-stable" regime, we find that robustified ANNs reliably discover low-norm image perturbations that strongly disrupt human percepts. These previously undetectable human perceptual disruptions are massive in amplitude, approaching the same level of sensitivity seen in robustified ANNs. Further, we show that robustified ANNs support precise perceptual state interventions: they guide the construction of low-norm image perturbations that strongly alter human category percepts toward specific prescribed percepts. In sum, these contemporary models of biological visual processing are now accurate enough to guide strong and precise interventions on human perception.

MotionGPT: Human Motion as a Foreign Language
Biao Jiang Xin Chen Wen Liu Jingyi Yu Gang YU Tao Chen



研究问题:如何将语言和其他多模态数据(如运动)统一到一个模型中,以提升相关任务的性能。
动机:尽管预训练的大型语言模型取得了进步,但构建一个用于处理语言和其他多模态数据的统一的模型仍然具有挑战性。
方法:通过融合语言数据和大规模的运动模型,提出了一种名为MotionGPT的统一、多功能、用户友好的运动-语言模型,用于处理多种与运动相关的任务。具体来说,我们使用离散向量量化来表示人体运动,并将3D运动转化为运动令牌,类似于生成词令牌的过程。
效果:大量的实验表明,MotionGPT在多个运动任务上实现了最先进的性能,包括基于文本的运动生成、运动描述、运动预测和运动插值等。

Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multimodal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

TFLEX: Temporal Feature-Logic Embedding Framework for Complex Reasoning over Temporal Knowledge Graph
Xueyuan Lin Haihong E Chengjin Xu Gengxian Zhou Haoran Luo Tianyi Hu Fenglong Su Ningyuan Li Mingzhi Sun



研究问题:本文旨在解决知识图谱上的多跳逻辑推理问题,特别是在处理时态知识图谱(TKGs)的复杂查询时的缺失。
动机:现有的复杂查询嵌入方法主要关注静态知识图谱,对时态知识图谱的研究尚不充分。在时态知识图谱上推理存在两个挑战:1. 查询的答案应为实体或时间戳;2. 运算符应同时考虑实体集上的集合逻辑和时间戳集上的时间逻辑。
方法:我们提出了一种名为“Temporal Feature-Logic Embedding framework”(TFLEX)的时态复杂查询嵌入方法,以解答时态复杂查询。具体来说,我们使用模糊逻辑来计算时态特征逻辑嵌入的逻辑部分,从而自然地对实体集进行所有一阶逻辑运算。此外,我们还在时间戳集上进一步扩展了模糊逻辑,以应对三个额外的时间操作符(**After**、**Before**和**Between**)。
效果:我们在许多查询模式上进行了实验,证明了我们的方法的有效性。

Multi-hop logical reasoning over knowledge graphs plays a fundamental role in many artificial intelligence tasks. Recent complex query embedding methods for reasoning focus on static KGs, while temporal knowledge graphs (TKGs) have not been fully explored. Reasoning over TKGs has two challenges: 1. The query should answer entities or timestamps; 2. The operators should consider both set logic on entity set and temporal logic on timestamp set. To bridge this gap, we introduce the multi-hop logical reasoning problem on TKGs and then propose the first temporal complex query embedding named Temporal Feature-Logic Embedding framework (TFLEX) to answer the temporal complex queries. Specifically, we utilize fuzzy logic to compute the logic part of the Temporal Feature-Logic embedding, thus naturally modeling all first-order logic operations on the entity set. In addition, we further extend fuzzy logic on timestamp set to cope with three extra temporal operators (**After**, **Before** and **Between**). Experiments on numerous query patterns demonstrate the effectiveness of our method.
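
The three temporal operators can be illustrated on fuzzy membership vectors over a discretized timeline. The semantics below (After as a running maximum, Between as the fuzzy conjunction of After and Before) are an illustrative reading, not the paper's exact parameterization.

```python
import numpy as np

def t_after(t: np.ndarray) -> np.ndarray:
    """s belongs to After(T) if some timestamp <= s belongs to T."""
    return np.maximum.accumulate(t)

def t_before(t: np.ndarray) -> np.ndarray:
    """Mirror image of After: some timestamp >= s belongs to T."""
    return np.maximum.accumulate(t[::-1])[::-1]

def t_between(t1: np.ndarray, t2: np.ndarray) -> np.ndarray:
    """Fuzzy AND (product t-norm) of After(t1) and Before(t2)."""
    return t_after(t1) * t_before(t2)
```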

Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner
Zhengxiang Shi Aldo Lipani



研究问题:本文旨在重新审视预训练语言模型在无标签数据上继续预训练能否提高下游任务的微调性能。
动机:传统的持续预训练对句子对任务或使用提示式微调时可能无效,甚至有害。
方法:提出基于提示的持续预训练(PCP),通过无监督预训练目标将任务相关文本和提示模板同时呈现给语言模型,然后进行目标任务的微调。
效果:实验表明,PCP在半监督和全监督设置中都能显著提升最先进的提示式微调方法的性能,且只需数百个无标签示例即可实现,简化了过程并消除了迭代过程和额外的数据增强的需求。

Language models (LMs) trained on vast quantities of unlabelled data have greatly advanced the field of natural language processing (NLP). In this study, we re-visit the widely accepted notion in NLP that continued pre-training LMs on task-related texts improves the performance of fine-tuning (FT) in downstream tasks. Through experiments on eight single-sentence tasks and eight sentence-pair tasks in both semi-supervised and fully-supervised settings, we find that conventional continued pre-training does not consistently provide benefits and can even be detrimental for sentence-pair tasks or when prompt-based FT is used. To tackle these issues, we propose Prompt-based Continued Pre-training (PCP), which combines the idea of instruction tuning with conventional continued pre-training. Our approach aims to improve the performance of prompt-based FT by presenting both task-related texts and prompt templates to LMs through unsupervised pre-training objectives before fine-tuning for the target task. Our empirical evaluations on 21 benchmarks demonstrate that the PCP consistently improves the performance of state-of-the-art prompt-based FT approaches (up to 20.1% absolute) in both semi-supervised and fully-supervised settings, even with only hundreds of unlabelled examples. Additionally, prompt-based FT with PCP outperforms state-of-the-art semi-supervised approaches with greater simplicity, eliminating the need for an iterative process and extra data augmentation. Our further analysis explores the performance lower bound of the PCP and reveals that the advantages of PCP persist across different sizes of models and datasets.

Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models
Haonan Duan Adam Dziedzic Nicolas Papernot Franziska Boenisch



研究问题:大型语言模型在上下文学习方面表现出色,但其提示中包含的数据敏感性引发了隐私问题。
动机:本文首次证明了这些担忧是合理的,针对用于提示大型语言模型的数据,实现了一种有效的成员推断攻击。
方法:我们提出私有地学习提示。首先,我们证明可以通过在下游数据上进行梯度下降私有地获取软提示;然而,对于离散提示并非如此。因此,我们在一组以不同提示呈现的大型语言模型(即一群"随机鹦鹉")之间组织一次带噪声的投票,将群体的知识私有地转移到一个公共提示中。
效果:实验结果表明,使用我们的私有算法提示的大型语言模型与非私有基线非常接近。例如,在使用GPT3作为基础模型时,我们在sst2数据集上实现了92.7%的下游准确率,同时保持了$(\varepsilon=0.147, \delta=10^{-6})$的差分隐私,而非私有基线的准确率为95.2%。此外,通过实验我们还发现,基于提示的方法可以轻松地与现有的商业API一起部署。

Large language models (LLMs) are excellent in-context learners. However, the sensitivity of data contained in prompts raises privacy concerns. Our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt LLMs. To address this vulnerability, one could forego prompting and resort to fine-tuning LLMs with known algorithms for private gradient descent. However, this comes at the expense of the practicality and efficiency offered by prompting. Therefore, we propose to privately learn to prompt. We first show that soft prompts can be obtained privately through gradient descent on downstream data. However, this is not the case for discrete prompts. Thus, we orchestrate a noisy vote among an ensemble of LLMs presented with different prompts, i.e., a flock of stochastic parrots. The vote privately transfers the flock's knowledge into a single public prompt. We show that LLMs prompted with our private algorithms closely match the non-private baselines. For example, using GPT3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with $(\varepsilon=0.147, \delta=10^{-6})$-differential privacy vs. 95.2% for the non-private baseline. Through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial~APIs.
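
The flock's vote can be sketched as a PATE-style noisy argmax; the Gaussian noise scale below is an assumed free parameter tied to the privacy budget, not a value from the paper.

```python
import numpy as np

def noisy_vote(vote_counts: np.ndarray, sigma: float) -> int:
    """vote_counts[c]: how many differently-prompted LLMs predicted class c.
    Adding calibrated noise before the argmax privatizes the ensemble's answer."""
    noisy = vote_counts + np.random.normal(0.0, sigma, size=vote_counts.shape)
    return int(noisy.argmax())
```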

Language-based Action Concept Spaces Improve Video Self-Supervised Learning
Kanchana Ranasinghe Michael S Ryoo



研究问题:如何将对比性语言图像预训练模型适应到视频领域,以实现最小监督的迁移学习。
动机:现有的图像CLIP模型在视频领域的适应性问题尚未解决。
方法:通过使用语言绑定的自我监督学习,将图像CLIP模型适应到视频领域。修改了一个用于时间建模的骨干网络,并在自我蒸馏设置下进行训练,其训练目标在动作概念空间中运行。从语言编码器中提取各种动作概念的特征向量构建了这个空间。一个了解动作及其属性的大型语言模型生成了相关的文本提示。
效果:引入了两种训练目标,即概念蒸馏和概念对齐,它们在保留原始表示的一般性的同时,强化了动作及其属性之间的关系。该方法在三个动作识别基准测试上提高了零样本和线性探测性能。

Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domain with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space. Feature vectors of various action concepts extracted from a language encoder using relevant textual prompts construct this space. A large language model aware of actions and their attributes generates the relevant textual prompts. We introduce two train objectives, concept distillation and concept alignment, that retain generality of original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.

Active Reasoning in an Open-World Environment
Manjie Xu Guangyuan Jiang Wei Liang Chi Zhang Yixin Zhu



研究问题:现有的视觉语言学习模型主要通过整合大量世界知识在完全信息的问题回答数据集上取得了显著的成功,但这些模型大多被动地根据预先存储的知识回答问题,缺乏主动探索和推理的能力。
动机:为了弥补这一差距,我们引入了Conan,这是一个用于评估主动推理的交互式开放世界环境。
方法:Conan促进主动探索和多轮溯因推理,要求智能体与周围环境积极互动,将新证据与先前知识相结合,从不完全的观察中阐明事件。我们还探索了"从演绎中进行溯因"(Abduction from Deduction)的方法,使智能体能够利用贝叶斯法则将溯因挑战重新表述为演绎过程。
效果:我们的分析表明,当前的最先进模型在主动探索和理解复杂场景方面存在不足。通过Conan,我们希望推动主动推理的进步,为下一代能够动态参与环境的人工智能代理奠定基础。

Recent advances in vision-language learning have achieved notable success on *complete-information* question-answering datasets through the integration of extensive world knowledge. Yet, most models operate *passively*, responding to questions based on pre-stored knowledge. In stark contrast, humans possess the ability to *actively* explore, accumulate, and reason using both newfound and existing information to tackle *incomplete-information* questions. In response to this gap, we introduce **Conan**, an interactive open-world environment devised for the assessment of *active reasoning*. **Conan** facilitates active exploration and promotes multi-round abductive inference, reminiscent of rich, open-world settings like Minecraft. Diverging from previous works that lean primarily on single-round deduction via instruction following, **Conan** compels agents to actively interact with their surroundings, amalgamating new evidence with prior knowledge to elucidate events from incomplete observations. Our analysis on **Conan** underscores the shortcomings of contemporary state-of-the-art models in active exploration and understanding complex scenarios. Additionally, we explore *Abduction from Deduction*, where agents harness Bayesian rules to recast the challenge of abduction as a deductive process. Through **Conan**, we aim to galvanize advancements in active reasoning and set the stage for the next generation of artificial intelligence agents adept at dynamically engaging in environments.

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback
Jaskirat Singh Liang Zheng



研究问题:如何准确评估和改善文本到图像的对齐,特别是在给定文本输入复杂度增加的情况下。
动机:当前的扩散模型在处理复杂文本输入时可能无法生成准确传达给定提示语义的图像,而预训练的多模态模型如CLIP往往无法检测这种不对准。
方法:本文提出一种分解式方法来评估和改进文本到图像的对齐。首先引入分解对齐分数(Decompositional-Alignment-Score),将复杂的标题分解为一组不相交的断言;然后使用VQA模型衡量每个断言与生成图像的对齐程度;最后将不同断言的对齐分数事后组合,得到最终的文本到图像对齐分数。
效果:实验分析表明,所提出的对齐度量与传统的CLIP、BLIP分数相比,与人类评分有显著更高的相关性。此外,我们还发现,断言级别的对齐分数提供了有用的反馈,可以用于简单的迭代过程,逐渐增加最终图像输出中不同断言的表现力。用户研究表明,该方法在总体文本到图像对齐准确性方面比之前最先进的方法提高了8.7%。

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While revolutionary, as the complexity of given text input increases, the current state of art diffusion models may still fail in generating images that accurately convey the semantics of the given prompt. Furthermore, such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper, we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which given a complex caption decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for different assertions are combined aposteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows a significantly higher correlation with human ratings as opposed to traditional CLIP, BLIP scores. Furthermore, we also find that the assertion level alignment scores also provide useful feedback which can then be used in a simple iterative procedure to gradually increase the expressivity of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
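
A sketch of the decomposed score, reconstructed from the description above; `vqa_yes_prob` is an assumed callable returning the VQA model's probability that an assertion holds, and averaging is one simple choice of combination rule.

```python
def alignment_score(image, assertions, vqa_yes_prob) -> float:
    """`assertions` is the set of disjoint assertions extracted from the
    caption; each is scored against the generated image by a VQA model, and
    the per-assertion scores are combined into one alignment score."""
    scores = [vqa_yes_prob(image, a) for a in assertions]
    return sum(scores) / len(scores)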

SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models
Hongxin Li Jingran Su Yuntao Chen Qing Li Zhaoxiang Zhang



研究问题:如何利用大型语言模型(LLMs)实现自然语言任务导向的电子表格控制。
动机:日常的电子表格处理和项目时间线规划等任务重复且易出错,但大多数终端用户缺乏自动化这些繁重工作的技术水平。随着大型语言模型的出现,通过自然语言用户请求指导软件成为可能。
方法:提出SheetCopilot代理,将自然语言任务与电子表格控制相结合以满足需求。设计一套原子操作作为电子表格软件功能的抽象,并进一步为大型语言模型设计基于状态机的任务规划框架以实现与电子表格的稳健交互。
效果:创建了一个包含221个电子表格控制任务的代表性数据集,并建立了一个全自动化的评估管道,严格衡量大型语言模型在软件控制任务中的能力。SheetCopilot在单次生成中正确完成了44.3%的任务,大大超过了强大的代码生成基线。

Computer end users have spent billions of hours completing daily tasks like tabular data processing and project timeline scheduling. Most of these tasks are repetitive and error-prone, yet most end users lack the skill to automate these burdensome works. With the advent of large language models (LLMs), directing software with natural language user requests becomes a reachable goal. In this work, we propose a SheetCopilot agent that takes natural language tasks and controls spreadsheets to fulfill the requirements. We propose a set of atomic actions as an abstraction of spreadsheet software functionalities. We further design a state machine-based task planning framework for LLMs to robustly interact with spreadsheets. We curate a representative dataset containing 221 spreadsheet control tasks and establish a fully automated evaluation pipeline for rigorously benchmarking the ability of LLMs in software control tasks. Our SheetCopilot correctly completes 44.3\% of tasks for a single generation, outperforming the strong code generation baseline by a wide margin. Our project page: https://sheetcopilot.github.io/.

Improving CLIP Training with Language Rewrites
Lijie Fan Dilip Krishnan Phillip Isola Dina Katabi Yonglong Tian



研究问题:本文旨在解决预训练视觉模型在训练过程中语言输入未发生变化,限制了对同一图像的多样化文本暴露的问题。
动机:通过语言重写增强对比性语言-图像预训练(CLIP)的训练效果,提高模型的迁移性能。
方法:利用大型语言模型的上下文学习能力,重新编写与每张图像关联的文本描述,生成具有多样性的句子结构和词汇,同时保留原始的关键概念和含义。在训练过程中,LaCLIP随机选择原始文本或重写版本作为每个图像的文本增强。
效果:实验结果表明,使用语言重写的CLIP预训练显著提高了迁移性能,且在训练过程中没有计算或内存开销。具体来说,对于ImageNet零样本准确率,LaCLIP在CC12M上比CLIP高出8.2%,在LAION-400M上高出2.4%。

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves the transfer performance without computation or memory overhead during training. Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
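
The training-time text augmentation itself is simple to sketch; a uniform choice over the original caption and its rewrites is assumed here.

```python
import random

def sample_text_augmentation(original: str, rewrites: list[str]) -> str:
    """LaCLIP's text augmentation at each training step: pair the image with
    either its original caption or one of its LLM-rewritten variants,
    chosen uniformly at random."""
    return random.choice([original] + rewrites)
```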

LANCE: Stress-testing Visual Models by Generating Language-guided Counterfactual Images
Viraj Uday Prabhu Sriram Yenamandra Prithvijit Chattopadhyay Judy Hoffman



研究问题:本文旨在提出一种自动算法,通过生成语言引导的反事实测试图像(LANCE)来对训练好的视觉模型进行压力测试。
动机:仅靠IID测试集难以暴露训练好的视觉模型的弱点,需要在不改变模型权重的情况下,用多样化、真实且具有挑战性的测试图像对模型进行压力测试。
方法:我们的方法利用大型语言建模和基于文本的图像编辑的最新进展来增强IID测试集,而无需更改模型权重。
效果:我们在生成的数据上对一组多样化的预训练模型进行了基准测试,观察到显著且一致的性能下降。我们还分析了模型在不同类型编辑中的敏感性,并展示了其在揭示ImageNet中以前未知的类别级模型偏见方面的适用性。

We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE). Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights. We benchmark the performance of a diverse set of pre-trained models on our generated data and observe significant and consistent performance drops. We further analyze model sensitivity across different types of edits, and demonstrate its applicability at surfacing previously unknown class-level model biases in ImageNet. Code is available at https://github.com/virajprabhu/lance.

Natural Language Instruction-following with Task-related Language Development and Translation
Jing-Cheng Pang Xinyu Yang Si-Hang Yang Xiong-Hui Chen Yang Yu



研究问题:如何让智能体更好地理解并执行人类的语言指令?
动机:现有的语言条件强化学习方法通常需要处理大量的自然语言示例,这增加了解决问题的复杂性,也分散了智能体的注意力。
方法:提出了一种由内而外的自然语言条件强化学习方法,通过开发一个任务相关的、易于被智能体理解的任务语言(TL),以减轻智能体的学习负担。同时,使用翻译器将自然语言翻译成任务语言,用于高效的策略训练。
效果:实验表明,该方法不仅能更好地理解自然语言指令,还能产生更好的指令遵循策略,显著提高成功率,并能适应未见过的自然语言指令表达。此外,任务语言也是一种有效的子任务抽象,与分层强化学习兼容。

Natural language-conditioned reinforcement learning (RL) enables agents to follow human instructions. Previous approaches generally implemented language-conditioned RL by providing the policy with human instructions in natural language (NL) and training the policy to follow instructions. In this outside-in approach, the policy must comprehend the NL and manage the task simultaneously. However, the unbounded NL examples often bring much extra complexity for solving concrete RL tasks, which can distract policy learning from completing the task. To ease the learning burden of the policy, we investigate an inside-out scheme for natural language-conditioned RL by developing a task language (TL) that is task-related and easily understood by the policy, thus reducing the policy learning burden. Besides, we employ a translator to translate natural language into the TL, which is used in RL to achieve efficient policy training. We implement this scheme as TALAR (TAsk Language with predicAte Representation) that learns multiple predicates to model object relationships as the TL. Experiments indicate that TALAR not only better comprehends NL instructions but also leads to a better instruction-following policy that significantly improves the success rate over baselines and adapts to unseen expressions of NL instruction. Besides, the TL is also an effective sub-task abstraction compatible with hierarchical RL.

Focus Your Attention when Few-Shot Classification
Haoqing Wang Shibo Jie Zhi-Hong Deng



研究问题:如何将预训练的视觉转换器适应于少样本图像分类任务。
动机:在少样本图像分类任务中,模型可能无法关注与当前任务相关的类别实体,即使对支持样本进行微调,来自与类别无关的实体的噪音信息也会损害性能。
方法:首先提出一种使用注意力和梯度信息自动定位支持图像中关键实体位置的方法,这些位置称为位置提示;然后通过关键实体的many-hot表示与注意力logits之间的交叉熵损失来优化模型,使其在微调过程中将注意力集中在关键实体上。
效果:该方法可以改善全参数或参数高效微调方法在少样本任务上的性能,适用于不同的视觉转换器和预训练方式。

Since many pre-trained vision transformers emerge and provide strong representation for various downstream tasks, we aim to adapt them to few-shot image classification tasks in this work. The input images typically contain multiple entities. The model may not focus on the class-related entities for the current few-shot task, even with fine-tuning on support samples, and the noise information from the class-independent ones harms performance. To this end, we first propose a method that uses the attention and gradient information to automatically locate the positions of key entities, denoted as position prompts, in the support images. Then we employ the cross-entropy loss between their many-hot presentation and the attention logits to optimize the model to focus its attention on the key entities during fine-tuning. This ability then can generalize to the query samples. Our method is applicable to different vision transformers (e.g., columnar or pyramidal ones), and also to different pre-training ways (e.g., single-modal or vision-language pre-training). Extensive experiments show that our method can improve the performance of full or parameter-efficient fine-tuning methods on few-shot tasks. Code is available at https://github.com/Haoqing-Wang/FORT.
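
A sketch of the focusing loss, reconstructed from the description above; which attention head and query token the logits come from is an assumption here, not specified by the abstract.

```python
import torch
import torch.nn.functional as F

def focus_loss(attn_logits: torch.Tensor, key_mask: torch.Tensor) -> torch.Tensor:
    """attn_logits: [batch, num_patches], e.g. a class token's attention
    logits over image patches; key_mask: [batch, num_patches] many-hot mask
    marking patches that contain key entities (the position prompts)."""
    target = key_mask.float()
    target = target / target.sum(dim=-1, keepdim=True)  # many-hot -> distribution
    return -(target * F.log_softmax(attn_logits, dim=-1)).sum(dim=-1).mean()
```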

Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution
Ying Wang Tim G. J. Rudner Andrew Gordon Wilson



研究问题:如何提高视觉语言预训练模型的可解释性。
动机:视觉语言预训练模型在安全关键领域的应用受限于其缺乏可解释性。
方法:提出一种多模态信息瓶颈(M2IB)方法,学习压缩无关信息同时保留相关视觉和文本特征的潜在表示。
效果:实验证明M2IB可以应用于视觉语言预训练模型的属性分析,提高属性准确性,并在医疗等安全关键领域提高模型的可解释性。

Vision-language pretrained models have seen remarkable success, but their application to safety-critical settings is limited by their lack of interpretability. To improve the interpretability of vision-language models such as CLIP, we propose a multi-modal information bottleneck (M2IB) approach that learns latent representations that compress irrelevant information while preserving relevant visual and textual features. We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as healthcare. Crucially, unlike commonly used unimodal attribution methods, M2IB does not require ground truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities but no ground-truth data is available. Using CLIP as an example, we demonstrate the effectiveness of M2IB attribution and show that it outperforms gradient-based, perturbation-based, and attention-based attribution methods both qualitatively and quantitatively.

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
Weixi Feng Wanrong Zhu Tsu-Jui Fu Varun Jampani Arjun Reddy Akula Xuehai He S Basu Xin Eric Wang William Yang Wang



研究问题:如何让大型语言模型(LLMs)通过文本条件生成布局,从而与视觉生成模型协作,提高用户在视觉生成中的控制能力。
动机:复杂的精细输入如布局对用户来说是一个重大负担,而大型语言模型可以通过文本条件生成布局,解决这一问题。
方法:提出了LayoutGPT方法,该方法能在样式表语言中编写上下文视觉演示,以提高LLMs的视觉规划能力。
效果:实验表明,LayoutGPT可以在多个领域生成合理的布局,包括2D图像和3D室内场景。在将具有挑战性的语言概念转化为布局安排以实现忠实的文本到图像生成方面,LayoutGPT表现出优越的性能。当与下游图像生成模型结合时,LayoutGPT比文本到图像模型/系统提高了20-40%的性能,并在设计数字和空间正确的视觉布局方面达到了与人类用户相当的水平。最后,LayoutGPT在3D室内场景合成方面取得了与有监督方法相当的性能,证明了其在多个视觉领域中的有效性和潜力。

Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language to enhance visual planning skills of LLMs. We show that LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts like numerical and spatial relations to layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40\% and achieves comparable performance as human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT achieves comparable performance to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.
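
To make "in-context visual demonstrations in style sheet language" concrete, here is an illustrative serialization of a 2D layout in a CSS-like format; the exact schema LayoutGPT uses may differ.

```python
def layout_to_css(elements: list[tuple]) -> str:
    """Serialize a 2D layout as CSS-like rules, the kind of in-context
    demonstration LayoutGPT composes for the LLM.
    Each element: (name, left, top, width, height) in pixels."""
    return "\n".join(
        f"{name} {{ left: {l}px; top: {t}px; width: {w}px; height: {h}px; }}"
        for name, l, t, w, h in elements
    )

print(layout_to_css([("a gray cat", 10, 40, 120, 90),
                     ("a red sofa", 0, 100, 256, 120)]))
```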

Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning
Xiaoqian Wu Yong-Lu Li Jianhua Sun Cewu Lu



研究问题:如何提高视觉活动理解的可解释性、泛化性和数据效率。
动机:现有的类似System-1的方法在视觉活动理解中,需要结合System-2处理以提高性能。
方法:构建一个由符号和规则组成的符号系统,其中一条规则连接多个符号,以体现人类的知识和推理能力。提出一种新的符号系统,具有广泛覆盖的符号和合理的规则两大理想特性。利用大型语言模型(LLMs)作为这两个理想特性的近似,即来自大型语言模型的符号(Symbol-LLM)。然后,给定一张图像,从图像中提取视觉内容并核验为符号,再通过模糊逻辑计算基于规则推理出活动语义。
效果:该方法在广泛的活动理解任务上表现出优越性。

Human reasoning can be understood as a cooperation between the intuitive, associative "System-1'' and the deliberative, logical "System-2''. For existing System-1-like methods in visual activity understanding, it is crucial to integrate System-2 processing to improve explainability, generalization, and data efficiency. One possible path of activity reasoning is building a symbolic system composed of symbols and rules, where one rule connects multiple symbols, implying human knowledge and reasoning abilities. Previous methods have made progress, but are defective with limited symbols from handcraft and limited rules from visual-based annotations, failing to cover the complex patterns of activities and lacking compositional generalization. To overcome the defects, we propose a new symbolic system with two ideal important properties: broad-coverage symbols and rational rules. Collecting massive human knowledge via manual annotations is expensive to instantiate this symbolic system. Instead, we leverage the recent advancement of LLMs (Large Language Models) as an approximation of the two ideal properties, i.e., Symbols from Large Language Models (Symbol-LLM). Then, given an image, visual contents from the images are extracted and checked as symbols and activity semantics are reasoned out based on rules via fuzzy logic calculation. Our method shows superiority in extensive activity understanding tasks. Code and data are available at https://mvig-rhos.com/symbol_llm.

topic-9

Topic words :  diffusion,  image,  models,  model,  generation,  generative,  images,  text

Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
Diederik P Kingma Ruiqi Gao



研究问题:最先进扩散模型的训练目标与最大似然及证据下界(ELBO)之间的关系。
动机:为追求最高感知质量,最先进的扩散模型所优化的目标通常与最大似然和ELBO目标看起来相去甚远,二者的联系尚不清楚。
方法:证明所有常用的扩散模型目标实际上等价于不同噪声水平上ELBO的加权积分,权重取决于所用的具体目标;在权重单调的条件下,扩散目标恰好等于结合简单数据增强(高斯噪声扰动)的ELBO。
效果:验证了该单调性条件对多个最先进的扩散模型成立;实验中探索了新的单调加权方案并证明其有效性,在高分辨率ImageNet基准上取得了最先进的FID分数。

To achieve the highest perceptual quality, state-of-the-art diffusion models are optimized with objectives that typically look very different from the maximum likelihood and the Evidence Lower Bound (ELBO) objectives. In this work, we reveal that diffusion model objectives are actually closely related to the ELBO. Specifically, we show that all commonly used diffusion model objectives equate to a weighted integral of ELBOs over different noise levels, where the weighting depends on the specific objective used. Under the condition of monotonic weighting, the connection is even closer: the diffusion objective then equals the ELBO, combined with simple data augmentation, namely Gaussian noise perturbation. We show that this condition holds for a number of state-of-the-art diffusion models. In experiments, we explore new monotonic weightings and demonstrate their effectiveness, achieving state-of-the-art FID scores on the high-resolution ImageNet benchmark.
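
The central identity can be stated compactly; the notation below is a plausible rendering of the abstract's claim (with $w$ the objective-specific weighting over noise levels $t$ and $\mathcal{L}_{\mathrm{ELBO}}(t,\mathbf{x})$ the ELBO of data perturbed to noise level $t$), not the paper's verbatim formulation.

```latex
% Commonly used diffusion objectives as a weighted integral of per-noise-level ELBOs:
\mathcal{L}_w(\mathbf{x}) \;=\; \int_0^1 w(t)\,\mathcal{L}_{\mathrm{ELBO}}(t,\mathbf{x})\,\mathrm{d}t .
% For monotonic w(t), this equals the ELBO combined with Gaussian-noise data
% augmentation: sample a noise level from the (normalized) weighting, perturb
% the data with Gaussian noise at that level, and apply the plain ELBO.
```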

The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation
Saurabh Saxena Charles Herrmann Junhwa Hur Abhishek Kar Mohammad Norouzi Deqing Sun David J. Fleet



研究问题:本文旨在探讨去噪扩散概率模型在图像生成、光流估计和单目深度估计任务上的应用。
动机:去噪扩散概率模型具有高保真度和多样性,作者希望探究其在光流估计和单目深度估计等任务上的潜力。
方法:采用自监督预训练、合成数据与真实数据的联合监督训练以及处理噪声不完整训练数据的技术创新(填充和逐步展开的去噪扩散训练),训练出用于深度和光流估计的先进扩散模型,并进行零样本粗到细的精化以获得高分辨率估计。
效果:实验结果表明,该模型在室内NYU基准测试中的相对深度误差为0.074,在KITTI光学流基准测试中的Fl-all得分为3.26%,比已发表的最佳方法提高了约25%。

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, one can train state-of-the-art diffusion models for depth and optical flow estimation, with additional zero-shot coarse-to-fine refinement for high resolution estimates. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all score of 3.26\% on the KITTI optical flow benchmark, about 25\% better than the best published method.

High-Fidelity Audio Compression with Improved RVQGAN
Rithesh Kumar Prem Seetharaman Alejandro Luebs Ishaan Kumar Kundan Kumar



研究问题:如何利用神经网络压缩模型将高维自然信号(如图像、语音和音乐)压缩成低维离散标记。
动机:现有的语言模型可以成功应用于多种自然信号的建模,其中关键的一环是高质量的神经网络压缩模型。
方法:通过结合高保真音频生成、图像领域的矢量量化技术以及改进的对抗性和重建损失,提出了一种高保真通用神经网络音频压缩算法。
效果:该算法可以将44.1 KHz的音频压缩至仅8kbps的带宽,压缩比达到~90x。在与竞争性音频压缩算法的比较中,该方法表现出显著的优势。

Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.
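
As a sanity check on the ~90x figure, assuming a 16-bit mono PCM reference (an assumption; the exact reference format sets the precise ratio):

```latex
44{,}100~\tfrac{\text{samples}}{\text{s}} \times 16~\tfrac{\text{bits}}{\text{sample}} = 705.6~\text{kbps},
\qquad \frac{705.6~\text{kbps}}{8~\text{kbps}} \approx 88\times \approx 90\times .
```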

LinkerNet: Fragment Poses and Linker Co-Design with 3D Equivariant Diffusion
Jiaqi Guan Xingang Peng PeiQi Jiang Yunan Luo Jian Peng Jianzhu Ma



研究问题:设计一种连接不同分子片段以形成稳定药物候选分子的连接器,其中片段在3D空间中的位置是未知的。
动机:现有的连接器设计模型假设片段的相对位置已知,但在实际情况下可能并非如此。
方法:开发了一种3D等变扩散模型,该模型联合学习了片段姿势和连接器3D结构的生成过程。通过将片段视为刚体,设计了一种受刚体力学中牛顿-欧拉方程启发的片段姿势预测模块。
效果:在ZINC和PROTAC-DB数据集上的实证研究表明,我们的模型可以在无约束和有约束的生成设置下生成化学有效、可合成和低能分子。

Targeted protein degradation techniques, such as PROteolysis TArgeting Chimeras (PROTACs), have emerged as powerful tools for selectively removing disease-causing proteins. One challenging problem in this field is designing a linker to connect different molecular fragments to form a stable drug-candidate molecule. Existing models for linker design assume that the relative positions of the fragments are known, which may not be the case in real scenarios. In this work, we address a more general problem where the poses of the fragments are *unknown* in 3D space. We develop a 3D equivariant diffusion model that jointly learns the generative process of both fragment poses and the 3D structure of the linker. By viewing fragments as rigid bodies, we design a fragment pose prediction module inspired by the Newton-Euler equations in rigid body mechanics. Empirical studies on ZINC and PROTAC-DB datasets demonstrate that our model can generate chemically valid, synthetically-accessible, and low-energy molecules under both unconstrained and constrained generation settings.

Object-Centric Slot Diffusion
Jindong Jiang Fei Deng Gautam Singh Sungjin Ahn



研究问题:探索将扩散模型整合到以对象为中心的学习中的可行性和潜力,并研究这种方法的优缺点。
动机:尽管扩散模型在图像生成中具有高表现力,但它们在以对象为中心的学习中的集成尚未得到充分探索。
方法:介绍了一种新的模型Latent Slot Diffusion(LSD),它是第一个用条件于对象槽位的潜在扩散模型替换传统插槽解码器的对象中心学习模型,也是第一个无需文本等监督注释的无监督组合条件扩散模型。
效果:通过在各种以对象为中心的任务上进行实验,包括在该领域首次应用FFHQ数据集,证明LSD显著优于最先进的基于Transformer的解码器,特别是在更复杂的场景中,并展现出更优的无监督组合生成质量。此外,还对预训练扩散模型在LSD中的集成进行了初步研究,并证明了其在真实世界图像分割和生成中的有效性。

The recent success of transformer-based image generative models in object-centric learning highlights the importance of powerful image generators for handling complex scenes. However, despite the high expressiveness of diffusion models in image generation, their integration into object-centric learning remains largely unexplored in this domain. In this paper, we explore the feasibility and potential of integrating diffusion models into object-centric learning and investigate the pros and cons of this approach. We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes: it is the first object-centric learning model to replace conventional slot decoders with a latent diffusion model conditioned on object slots, and it is also the first unsupervised compositional conditional diffusion model that operates without the need for supervised annotations like text. Through experiments on various object-centric tasks, including the first application of the FFHQ dataset in this field, we demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders, particularly in more complex scenes, and exhibits superior unsupervised compositional generation quality. In addition, we conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD and demonstrate its effectiveness in real-world image segmentation and generation. Project page is available at https://latentslotdiffusion.github.io

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
Eric Nguyen Michael Poli Marjan Faizi Armin W Thomas Michael Wornow Callum Birch-Sykes Stefano Massaroli Aman Patel Clayton M. Rabideau Yoshua Bengio Stefano Ermon Christopher Re Stephen Baccus



研究问题:如何在单核苷酸分辨率下对超长基因组(DNA)序列进行建模。
动机:由于注意力机制的二次方复杂度,以往基于Transformer的基因组模型只能使用512到4k个令牌的上下文(不足人类基因组的0.001%),严重限制了对DNA长程相互作用的建模;而且这些方法依赖分词器或固定k-mer来聚合DNA单元,丢失了单核苷酸分辨率。
方法:利用基于隐式卷积的大型语言模型Hyena的长程能力,提出HyenaDNA——在人类参考基因组上预训练的基因组基础模型,单核苷酸级别的上下文长度可达100万个令牌,序列长度呈次二次方扩展,且每一层都具有完整的全局上下文。
效果:在Nucleotide Transformer的微调基准上,HyenaDNA用参数量和预训练数据少几个数量级的模型在18个数据集中的12个上达到最先进水平;在GenomicBenchmarks上,在8个数据集中的7个上平均提高10个准确率点。

Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution (i.e. DNA "characters") where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena’s new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level – an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics for simple adaptation to novel tasks without updating pretrained model weights. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.

Score-based Generative Models with Lévy Processes
Eunbi Yoon Keehun Park Sungwoong Kim Sungbin Lim



Research question: Finding the optimal stochastic process beyond the Gaussian for noise injection in score-based generative models.
Motivation: Light-tailed processes such as Brownian motion suffer from mode collapse on imbalanced data and converge slowly.
Method: A new score-based generative model, the Lévy-Itō Model (LIM), built on isotropic α-stable Lévy processes. The authors derive an exact reverse-time stochastic differential equation driven by the Lévy process and develop the corresponding fractional denoising score matching.
Results: Compared with existing diffusion models, LIM allows faster and more diverse sampling while maintaining high fidelity across image datasets such as CIFAR10, CelebA, and the imbalanced CIFAR10LT. On CelebA it achieves better Fréchet Inception Distance (FID) and recall than DDPM, and at NFE 500 it shows the best performance with a 2x faster total wall-clock time than the baseline.

Investigating the optimal stochastic process beyond Gaussian for noise injection in a score-based generative model remains an open question. Brownian motion is a light-tailed process with continuous paths, which leads to a slow convergence rate for the Number of Function Evaluation (NFE). Recent studies have shown that diffusion models suffer from mode-collapse issues on imbalanced data. In order to overcome the limitations of Brownian motion, we introduce a novel score-based generative model referred to as Lévy-Itō Model (LIM). This model utilizes isotropic $\alpha$-stable Lévy processes. We first derive an exact reverse-time stochastic differential equation driven by the Lévy process and develop the corresponding fractional denoising score matching. The proposed generative model takes advantage of the heavy-tailed properties of the Lévy process. Our experimental results show LIM allows for faster and more diverse sampling while maintaining high fidelity compared to existing diffusion models across various image datasets such as CIFAR10, CelebA, and imbalanced dataset CIFAR10LT. Comparing our results to those of DDPM with 3.21 Fréchet Inception Distance (FID) and 0.6437 Recall on the CelebA dataset, we achieve 1.58 FID and 0.7006 Recall using the same architecture. LIM shows the best performance in NFE 500 with $2\times$ faster total wall-clock time than the baseline.
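
As a rough illustration of the heavy-tailed noise-injection idea, the sketch below draws coordinate-wise symmetric α-stable noise with SciPy as a simplified stand-in for the isotropic α-stable process; the linear mixing schedule is a placeholder, not the paper's forward SDE:

```python
import numpy as np
from scipy.stats import levy_stable

# Illustrative forward perturbation with alpha-stable noise instead of
# Gaussian noise. alpha=2.0 recovers the Gaussian case; alpha<2 gives the
# heavy tails LIM exploits. Coordinate-wise sampling and the mixing below
# are simplifying assumptions, not the paper's isotropic construction.
rng = np.random.default_rng(0)

def alpha_stable_noise(shape, alpha=1.8):
    # Symmetric (beta=0) alpha-stable samples, one per coordinate.
    return levy_stable.rvs(alpha, 0.0, size=shape, random_state=rng)

x0 = rng.standard_normal((4, 8))   # stand-in for clean data
t = 0.5                            # diffusion time in [0, 1]
xt = (1.0 - t) * x0 + t * alpha_stable_noise(x0.shape, alpha=1.8)
print(xt.shape)
```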

Complexity Matters: Rethinking the Latent Space for Generative Modeling
Tianyang Hu Fei Chen Haonan Wang Jiawei Li Wenjia Wang Jiacheng Sun Zhenguo Li



Research question: How to choose and characterize the optimal low-dimensional latent space for generative modeling.
Motivation: Although the choice of latent space is empirically pivotal in generative models, determining the optimal latent space and the process of identifying it remain unclear.
Method: Inspired by classic generative adversarial networks (GANs), the paper proposes a novel "distance" between the latent and data distributions whose minimization coincides with minimizing generator complexity. The latent distribution is then parameterized by an encoder network trained with a two-stage strategy called the Decoupled Autoencoder (DAE).
Results: The theoretical findings are corroborated by extensive experiments on models such as VQGAN and Diffusion Transformer, where the proposed modifications significantly improve sample quality while decreasing model complexity.

In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion models the latent space induced by an encoder and generates images through a paired decoder. Although the selection of the latent space is empirically pivotal, determining the optimal choice and the process of identifying it remain unclear. In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity. Our investigation starts with the classic generative adversarial networks (GANs). Inspired by the GAN training objective, we propose a novel "distance" between the latent and data distributions, whose minimization coincides with that of the generator complexity. The minimizer of this distance is characterized as the optimal data-dependent latent that most effectively capitalizes on the generator's capacity. Then, we consider parameterizing such a latent distribution by an encoder network and propose a two-stage training strategy called Decoupled Autoencoder (DAE), where the encoder is only updated in the first stage with an auxiliary decoder and then frozen in the second stage while the actual decoder is being trained. DAE can improve the latent distribution and as a result, improve the generative performance. Our theoretical analyses are corroborated by comprehensive experiments on various models such as VQGAN and Diffusion Transformer, where our modifications yield significant improvements in sample quality with decreased model complexity.

Parallel Sampling of Diffusion Models
Andy Shih Suneel Belkhale Stefano Ermon Dorsa Sadigh Nima Anari



Research question: Diffusion models are powerful generative models but sample slowly, often requiring 1000 sequential denoising steps per sample.
Motivation: Existing work speeds up sampling by reducing the number of denoising steps, which hurts sample quality. This paper instead explores an orthogonal approach: can the denoising steps be run in parallel, trading compute for speed?
Method: Despite the sequential nature of the denoising steps, sampling can be parallelized via Picard iterations, by guessing the solutions of future denoising steps and iteratively refining them until convergence. Building on this insight, the paper presents ParaDiGMS, which accelerates sampling of pretrained diffusion models by executing multiple denoising steps in parallel.
Results: ParaDiGMS improves sampling speed by 2-4x across a range of robotics and image generation models with no measurable degradation in task reward, FID score, or CLIP score.

Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.
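
A minimal sketch of the Picard-iteration idea, assuming a toy drift function in place of a trained denoiser: every step of a guessed trajectory is refined jointly until the parallel solution matches the sequential one.

```python
import numpy as np

# Minimal sketch of Picard iteration for parallelizing a sequential update
# x_{i+1} = x_i + f(x_i, t_i) * dt (e.g., a discretized probability-flow ODE).
# All steps are refined jointly from a guessed trajectory; in ParaDiGMS the
# per-step drift evaluations would run in parallel on accelerators.
# The toy drift below is an assumption for illustration only.

def drift(x, t):
    return -x * (1.0 + t)  # placeholder denoising drift

def picard_solve(x0, n_steps=100, n_sweeps=20, dt=0.01):
    ts = np.arange(n_steps) * dt
    xs = np.tile(x0, (n_steps + 1, 1))          # initial guess: constant path
    for _ in range(n_sweeps):                   # each sweep refines all steps
        fs = np.stack([drift(xs[j], ts[j]) for j in range(n_steps)])
        xs_new = xs.copy()
        xs_new[1:] = x0 + np.cumsum(fs, axis=0) * dt
        if np.max(np.abs(xs_new - xs)) < 1e-6:  # converged: matches sequential
            return xs_new
        xs = xs_new
    return xs

path = picard_solve(np.ones(3))
print(path[-1])
```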

Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors
Paul Steven Scotti Atmadeep Banerjee Jimmie Goode Stepan Shabalin Alex Nguyen Cohen Ethan Aidan James Dempster Nathalie Verlinde Elad Yundler David Weisberg Kenneth Norman Tanishq Mathew Abraham



Research question: How to retrieve and reconstruct viewed images from fMRI brain activity.
Motivation: Current retrieval and reconstruction methods struggle with high-dimensional multimodal latent spaces; a method is needed that can map fMRI brain activity to any such space.
Method: MindEye comprises two parallel submodules, one specialized for retrieval (using contrastive learning) and one for reconstruction (using a diffusion prior). It maps fMRI brain activity to any high-dimensional multimodal latent space, such as CLIP image space, and reconstructs images with generative models that accept embeddings from that space.
Results: MindEye achieves state-of-the-art performance on both reconstruction and retrieval tasks. It can retrieve the original image even among highly similar candidates, indicating that its brain embeddings retain fine-grained image-specific information. Ablations show the gains come mainly from the specialized retrieval and reconstruction submodules, improved training techniques, and training models with orders of magnitude more parameters.

We present MindEye, a novel fMRI-to-image approach to retrieve and reconstruct viewed images from brain activity. Our model comprises two parallel submodules that are specialized for retrieval (using contrastive learning) and reconstruction (using a diffusion prior). MindEye can map fMRI brain activity to any high dimensional multimodal latent space, like CLIP image space, enabling image reconstruction using generative models that accept embeddings from this latent space. We comprehensively compare our approach with other existing methods, using both qualitative side-by-side comparisons and quantitative evaluations, and show that MindEye achieves state-of-the-art performance in both reconstruction and retrieval tasks. In particular, MindEye can retrieve the exact original image even among highly similar candidates, indicating that its brain embeddings retain fine-grained image-specific information. This allows us to accurately retrieve images even from large-scale databases like LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods result from specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters. Furthermore, we show that MindEye can better preserve low-level image features in the reconstructions by using img2img, with outputs from a separate autoencoder. All code is available on GitHub.
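
A minimal sketch of the kind of bidirectional contrastive objective the retrieval submodule relies on (a generic CLIP-style InfoNCE loss; the embedding dimension and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

# Sketch of a CLIP-style bidirectional contrastive (InfoNCE) loss between
# fMRI embeddings and image embeddings, the kind of objective a retrieval
# submodule can be trained with. Dimensions and temperature are assumptions.

def contrastive_loss(brain_emb, image_emb, temperature=0.07):
    brain = F.normalize(brain_emb, dim=-1)
    image = F.normalize(image_emb, dim=-1)
    logits = brain @ image.T / temperature       # (batch, batch) similarity
    targets = torch.arange(len(logits))          # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```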

Towards Symmetry-Aware Generation of Periodic Materials
Youzhi Luo Chengkai Liu Shuiwang Ji



Research question: Generating periodic materials with deep learning models.
Motivation: While symmetry-aware molecule generation has been studied extensively, periodic materials possess different symmetries that existing methods do not fully capture.
Method: SyMat, a new material generation approach that captures the physical symmetries of periodic material structures. SyMat generates atom types and lattices by generating atom-type sets, lattice lengths, and lattice angles with a variational autoencoder, and generates atom coordinates with a score-based diffusion model whose coordinate diffusion process uses a novel symmetry-aware probabilistic model.
Results: SyMat is proven theoretically invariant to all symmetry transformations of materials and achieves promising performance on random generation and property optimization tasks. The code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).

We consider the problem of generating periodic materials with deep models. While symmetry-aware molecule generation has been studied extensively, periodic materials possess different symmetries, which have not been completely captured by existing methods. In this work, we propose SyMat, a novel material generation approach that can capture physical symmetries of periodic material structures. SyMat generates atom types and lattices of materials through generating atom type sets, lattice lengths and lattice angles with a variational auto-encoder model. In addition, SyMat employs a score-based diffusion model to generate atom coordinates of materials, in which a novel symmetry-aware probabilistic model is used in the coordinate diffusion process. We show that SyMat is theoretically invariant to all symmetry transformations on materials and demonstrate that SyMat achieves promising performance on random generation and property optimization tasks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).

ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting
Zongsheng Yue Jianyi Wang Chen Change Loy



Research question: Diffusion-based image super-resolution (SR) methods are mainly limited by low inference speed, as they require hundreds or even thousands of sampling steps.
Motivation: Existing acceleration techniques inevitably sacrifice performance to some extent, leading to over-blurry SR results.
Method: A novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps, eliminating the need for post-hoc acceleration at inference and its associated performance deterioration. The method constructs a Markov chain that transfers between the high-resolution and low-resolution images by shifting the residual between them, substantially improving transition efficiency; a carefully designed noise schedule flexibly controls the shifting speed and noise strength during diffusion.
Results: Extensive experiments show that, even with only 20 sampling steps, the method performs better than or comparably to current state-of-the-art methods on both synthetic and real-world datasets. Code and models will be released publicly.

Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed due to the requirements of hundreds or even thousands of sampling steps. Existing acceleration sampling techniques inevitably sacrifice performance to some extent, leading to over-blurry SR results. To address this issue, we propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps, thereby eliminating the need for post-acceleration during inference and its associated performance deterioration. Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual between them, substantially improving the transition efficiency. Additionally, an elaborate noise schedule is developed to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experiments demonstrate that the proposed method obtains superior or at least comparable performance to current state-of-the-art methods on both synthetic and real-world datasets, \textit{\textbf{even with only 20 sampling steps}}. Our code and model will be made publicly available.
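
A toy sketch of a residual-shifting forward transition in the spirit of the method; the schedule value eta_t and the noise scale kappa below are placeholders, not the paper's settings:

```python
import numpy as np

# Sketch of a residual-shifting forward step: the chain moves from the HR
# image x0 toward the (upsampled) LR image y by shifting the residual
# e = y - x0, with noise scaled by the schedule. eta_t in [0, 1] and kappa
# are illustrative assumptions.

rng = np.random.default_rng(0)

def forward_sample(x0, y, eta_t, kappa=1.0):
    """Sample x_t given clean x0 and degraded y by shifting the residual."""
    residual = y - x0
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * residual + kappa * np.sqrt(eta_t) * noise

x0 = rng.standard_normal((16, 16))   # stand-in HR image
y = rng.standard_normal((16, 16))    # stand-in (upsampled) LR image
xt = forward_sample(x0, y, eta_t=0.3)
print(xt.shape)  # at eta_t = 1 the chain sits on y plus noise
```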

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
Zhengyi Wang Cheng Lu Yikai Wang Fan Bao Chongxuan Li Hang Su Jun Zhu



Research question: Addressing over-saturation, over-smoothing, and low diversity in text-to-3D generation.
Motivation: Existing score distillation sampling (SDS) methods show great promise for text-to-3D generation but suffer from over-saturation, over-smoothing, and low diversity.
Method: Model the 3D parameter as a random variable rather than a constant as in SDS, and propose variational score distillation (VSD), a principled particle-based variational framework that explains and addresses the issues above.
Results: VSD works well across various CFG weights while improving sample diversity and quality. The paper also presents further improvements in the text-to-3D design space, such as the distillation time schedule and density initialization. The overall approach, ProlificDreamer, generates high-resolution (512x512), high-fidelity NeRF with rich structure and complex effects such as smoke and drops.

Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present *variational score distillation* (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., 7.5). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed *ProlificDreamer*, can generate high rendering resolution (i.e., 512$\times$512) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic.

Aligning Synthetic Medical Images with Clinical Knowledge using Human Feedback
Shenghuan Sun Gregory Goldgof Atul Butte Ahmed Alaa



Research question: How to assess the clinical plausibility of synthetic medical images and improve their quality.
Motivation: Although modern generative models can synthesize visually realistic medical images, their clinical plausibility may be called into question. Existing evaluation metrics cannot incorporate clinical knowledge, and the ways in which generative models fail to produce clinically plausible images are hard to anticipate.
Method: A pathologist-in-the-loop framework for generating clinically plausible synthetic medical images, with three steps: (1) pretrain a conditional diffusion model to generate medical images conditioned on a clinical concept; (2) have expert pathologists evaluate whether the generated images satisfy clinical desiderata; (3) train a reward model that predicts human feedback on new samples and incorporate it into the diffusion model's fine-tuning objective.
Results: Human feedback significantly improves synthetic image quality in terms of fidelity, diversity, utility in downstream applications, and expert-assessed plausibility; it can also teach the model new clinical concepts not annotated in the original training data.

Generative models capable of precisely capturing nuanced clinical features in medical images hold great promise for facilitating clinical data sharing, enhancing rare disease datasets, and efficiently synthesizing (annotated) medical images at scale. Despite their potential, assessing the quality of synthetic medical images remains a challenge. While modern generative models can synthesize visually-realistic medical images, the clinical plausibility of these images may be called into question. Domain-agnostic scores, such as FID score, precision, and recall, cannot incorporate clinical knowledge and are, therefore, not suitable for assessing clinical sensibility. Additionally, there are numerous unpredictable ways in which generative models may fail to synthesize clinically plausible images, making it challenging to anticipate potential failures and design automated scores for their detection. To address these challenges, this paper introduces a pathologist-in-the-loop framework for generating clinically-plausible synthetic medical images. Our framework comprises three steps: (1) pretraining a conditional diffusion model to generate medical images conditioned on a clinical concept, (2) expert pathologist evaluation of the generated images to assess whether they satisfy clinical desiderata, and (3) training a reward model that predicts human feedback on new samples, which we use to incorporate expert knowledge into the finetuning objective of the diffusion model. Our results show that human feedback significantly improves the quality of synthetic images in terms of fidelity, diversity, utility in downstream applications, and plausibility as evaluated by experts. We also demonstrate that human feedback can teach the model new clinical concepts not annotated in the original training data. Our results demonstrate the value of incorporating human feedback in clinical applications where generative models may struggle to capture extensive domain knowledge from raw data alone.

Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction
Zuobai Zhang Minghao Xu Aurelie Lozano Vijil Chenthamarakshan Payel Das Jian Tang



Research question: Pre-training protein models. Most current approaches focus on either protein sequences or structures alone, neglecting their joint distribution, which is crucial for a comprehensive understanding of protein function that integrates co-evolutionary information and structural characteristics.
Motivation: Inspired by the success of denoising diffusion models in generative tasks, the paper proposes DiffPreT, which pre-trains a protein encoder by sequence-structure joint diffusion modeling.
Method: DiffPreT guides the encoder to recover native protein sequences and structures from perturbed ones, thereby acquiring their joint distribution. To capture essential protein conformational variations, DiffPreT is enhanced with Siamese Diffusion Trajectory Prediction (SiamDiff), which captures the correlation between structurally related conformers.
Results: DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance across all tasks.

Self-supervised pre-training methods on proteins have recently gained attention, with most approaches focusing on either protein sequences or structures, neglecting the exploration of their joint distribution, which is crucial for a comprehensive understanding of protein functions by integrating co-evolutionary information and structural characteristics. In this work, inspired by the success of denoising diffusion models in generative tasks, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure joint diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the joint diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. Code will be released upon acceptance.

AbDiffuser: full-atom generation of in-vitro functioning antibodies
Karolis Martinkus Jan Ludwiczak WEI-CHING LIANG Julien Lafrance-Vanasse Isidro Hotzel Arvind Rajpal Yan Wu Kyunghyun Cho Richard Bonneau Vladimir Gligorijevic Andreas Loukas



Research question: Developing AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences.
Motivation: Existing antibody generation models have shortcomings such as the inability to handle sequence-length changes and high memory complexity.
Method: A new representation of protein structure, a novel architecture for aligned proteins built on top of it, and strong diffusion priors that improve the denoising process and thereby protein diffusion.
Results: AbDiffuser generates antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 discovered HER2 antibodies were expressed at high levels and that 57.1% of the selected designs were tight binders.

We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude, enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of the selected designs were tight binders.

Transition-constant Normalization for Image Enhancement
Jie Huang Man Zhou JingHao Zhang Gang Yang Mingde Yao Chongyi Li Zhiwei Xiong Feng Zhao



Research question: How normalization techniques affect image enhancement performance.
Motivation: Although image enhancement can be viewed as a form of style transformation, little work has explored the effect of normalization on enhancement performance.
Method: A novel Transition-Constant Normalization (TCN) for various image enhancement tasks, consisting of two streams of normalization operations arranged under an invertible constraint, together with a feature sub-sampling operation that satisfies the normalization constraint.
Results: Across extensive experiments on multiple image enhancement tasks, such as low-light enhancement, exposure correction, SDR2HDR translation, and image dehazing, TCN consistently improves performance; it also shows strong ability on other tasks, including pan-sharpening and medical segmentation.

Normalization techniques that capture image style by statistical representation have become a popular component in deep neural networks. Although image enhancement can be considered as a form of style transformation, there has been little exploration of how normalization affects the enhancement performance. To fully leverage the potential of normalization, we present a novel Transition-Constant Normalization (TCN) for various image enhancement tasks. Specifically, it consists of two streams of normalization operations arranged under an invertible constraint, along with a feature sub-sampling operation that satisfies the normalization constraint. TCN enjoys several merits, including being parameter-free, plug-and-play, and incurring no additional computational costs. We provide various formats to utilize TCN for image enhancement, including seamless integration with enhancement networks, incorporation into encoder-decoder architectures for downsampling, and implementation of efficient architectures. Through extensive experiments on multiple image enhancement tasks, like low-light enhancement, exposure correction, SDR2HDR translation, and image dehazing, our TCN consistently demonstrates performance improvements. It also demonstrates strong ability on other tasks, including pan-sharpening and medical segmentation. The code is available at \textit{\textcolor{blue}{https://github.com/huangkevinj/TCNorm}}.

Stable Diffusion is Unstable
Chengbin Du Yanxi Li Zhongwei Qiu Chang Xu



Research question: The robustness of text-to-image models during generation: small perturbations of the text prompt can cause the primary subject to blend with other categories or disappear entirely from the generated images.
Motivation: Despite their powerful generative capacity, text-to-image models lack robustness in the generation process; small perturbations can effectively prevent the model from generating the desired subject.
Method: Auto-attack on Text-to-image Models (ATM), a gradient-based attack. By learning a Gumbel-Softmax distribution, the discrete process of word replacement or extension is made continuous, ensuring the differentiability of perturbation generation; once the distribution is learned, ATM can sample multiple attack samples simultaneously.
Results: ATM achieves a 91.1% success rate on short-text attacks and 81.2% on long-text attacks. Further empirical analysis reveals three attack patterns based on: 1) variability in generation speed; 2) similarity of coarse-grained characteristics; 3) polysemy of words.

Recently, text-to-image models have been thriving. Despite their powerful generative capacity, our research has uncovered a lack of robustness in this generation process. Specifically, the introduction of small perturbations to the text prompts can result in the blending of primary subjects with other categories or their complete disappearance in the generated images. In this paper, we propose **Auto-attack on Text-to-image Models (ATM)**, a gradient-based approach, to effectively and efficiently generate such perturbations. By learning a Gumbel Softmax distribution, we can make the discrete process of word replacement or extension continuous, thus ensuring the differentiability of the perturbation generation. Once the distribution is learned, ATM can sample multiple attack samples simultaneously. These attack samples can prevent the generative model from generating the desired subjects without tampering with the category keywords in the prompt. ATM has achieved a 91.1\% success rate in short-text attacks and an 81.2\% success rate in long-text attacks. Further empirical analysis revealed three attack patterns based on: 1) variability in generation speed, 2) similarity of coarse-grained characteristics, and 3) polysemy of words. The code is available at https://github.com/duchengbin8/Stable_Diffusion_is_Unstable

Real-World Image Variation by Aligning Diffusion Inversion Chain
Yuechen ZHANG Jinbo Xing Eric Lo Jiaya Jia



Research question: Existing diffusion models face a domain gap when generating high-quality variations of real-world images.
Motivation: This domain gap originates from a gap between the latent distributions of different diffusion processes.
Method: RIVAL, a new inference pipeline that aligns the image generation process with the source image's inversion chain, using a diffusion model to generate image variations from a single image exemplar.
Results: RIVAL outperforms existing methods in semantic similarity and perceptual quality, and can readily be applied to other diffusion-based generation tasks.

Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. Our investigation uncovers that this domain gap originates from a gap between the latent distributions of different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes diffusion models to generate image variations from a single image exemplar. Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To attain this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that our proposed approach outperforms existing methods concerning semantic similarity and perceptual quality. This generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and stylization. Project page: https://rival-diff.github.io

Full-Atom Protein Pocket Design via Iterative Refinement
ZAIXI ZHANG Zepu Lu Zhongkai Hao Marinka Zitnik Qi Liu



Research question: Designing functional proteins that bind specific ligand molecules, which is crucial in domains such as therapeutics and bio-engineering.
Motivation: Existing methods fall short in generation efficiency, context modeling (of the ligand molecule), and the ability to generate side-chain atoms.
Method: A Full-Atom Iterative Refinement framework (FAIR) for jointly designing protein pocket sequences (i.e., residue types) and 3D structures. FAIR consists of two steps following a coarse-to-fine pipeline, from backbone atoms to full atoms including side chains.
Results: Experiments show that FAIR outperforms baselines in efficiently designing high-quality pocket sequences and structures, with average improvements of over 10% on AAR and RMSD.

The design of \emph{de novo} functional proteins that bind with specific ligand molecules is crucial in various domains like therapeutics and bio-engineering. One vital yet challenging step is to design the protein pocket, the cavity region of protein where the ligand binds with. Existing methods suffer from inefficient generation, insufficient context modeling (ligand molecule), and incapability of generating sidechain atoms. To overcome the limitations, we propose a \textbf{F}ull-\textbf{A}tom \textbf{I}terative \textbf{R}efinement framework (\textbf{FAIR}) for protein pocket sequence (i.e., residue types) and 3D structure co-design. Generally, FAIR consists of two steps that follow a coarse-to-fine pipeline (backbone atoms to full atoms including sidechain) for full-atom generation. For efficiency, all residue types and structures are updated together in each round (i.e., full-shot refinement). In the first step, the residue types and backbone coordinates are updated with a hierarchical context encoder and two structure refinement modules capturing inter-residue and pocket-ligand interactions. The second step further models the sidechain atoms of pockets and updates residue types to achieve sequence-structure consistency. The structure of the binding ligand is also updated along with the above refinement iterations accounting for its flexibility. Finally, extensive evaluations show that FAIR outperforms baselines in efficiently designing high-quality pocket sequences and structures. Specifically, the average improvements on AAR and RMSD are over 10$\%$.

Hierarchical Integration Diffusion Model for Realistic Image Deblurring
Zheng Chen Yulun Zhang Ding Liu Bin Xia Jinjin Gu Linghe Kong Xin Yuan



Research question: Addressing the heavy computational cost of diffusion models in image deblurring and the misalignment between their synthesized distribution and the target results.
Motivation: Diffusion models perform well on image deblurring but require massive computational resources, and the distribution they synthesize is often misaligned with the target results.
Method: The Hierarchical Integration Diffusion Model (HI-Diff) runs the diffusion model in a highly compacted latent space to generate a prior feature for the deblurring process, which is then carried out by a regression-based method; a hierarchical integration module fuses the prior into the regression-based model at multiple scales.
Results: Experiments on synthetic and real-world blur datasets show that HI-Diff outperforms existing state-of-the-art methods.

Diffusion models (DMs) have recently been introduced in image deblurring and exhibited promising performance, particularly in terms of details reconstruction. However, the diffusion model requires a large number of inference iterations to recover the clean image from pure Gaussian noise, which consumes massive computational resources. Moreover, the distribution synthesized by the diffusion model is often misaligned with the target results, leading to restrictions in distortion-based metrics. To address the above issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring. Specifically, we perform the DM in a highly compacted latent space to generate the prior feature for the deblurring process. The deblurring process is implemented by a regression-based method to obtain better distortion accuracy. Meanwhile, the highly compact latent space ensures the efficiency of the DM. Furthermore, we design the hierarchical integration module to fuse the prior into the regression-based model from multiple scales, enabling better generalization in complex blurry scenarios. Comprehensive experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods. Code and trained models are available at https://github.com/zhengchen1999/HI-Diff.

Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception?
Xiaoxiao Sun Nidham Gazagnadou Vivek Sharma Lingjuan Lyu Hongdong Li Liang Zheng



Research question: Do existing hand-crafted image quality metrics (such as PSNR and SSIM), when used to evaluate model privacy risk under reconstruction attacks, faithfully reflect human perception of privacy information?
Motivation: There is no guarantee that these metrics capture human judgments of how much privacy a reconstructed image leaks, which risks misjudging model privacy leakage.
Method: Multiple human annotators assess whether reconstructed images are recognizable. The study finds that hand-crafted metrics correlate only weakly with human judgments of privacy leakage and often contradict each other. A learning-based measure, SemSim, is therefore proposed to evaluate the semantic similarity between original and reconstructed images.
Results: SemSim correlates with human judgment significantly better than existing metrics, and this strong correlation generalizes to unseen datasets, models, and attack methods. The work is envisioned as a milestone toward image quality evaluation closer to the human level.

Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which offer a trustworthy judgement of model privacy leakage. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods. We envision this work as a milestone for image quality evaluation closer to the human level. The project webpage can be accessed at https://sites.google.com/view/semsim.
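
A minimal sketch of the triplet objective described above, with a placeholder encoder standing in for SemSim's actual network:

```python
import torch
import torch.nn.functional as F

# Sketch of the standard triplet objective: the original image is the anchor,
# a human-recognizable reconstruction is the positive, and an unrecognizable
# one is the negative. The linear encoder and margin are illustrative
# assumptions, not SemSim's architecture.

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

emb = torch.nn.Linear(3 * 32 * 32, 128)          # placeholder image encoder
imgs = torch.randn(3, 8, 3 * 32 * 32)            # (anchor, pos, neg) batches
loss = triplet_loss(emb(imgs[0]), emb(imgs[1]), emb(imgs[2]))
print(loss.item())
```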

Puzzlefusion: Unleashing the Power of Diffusion Models for Spatial Puzzle Solving
Sepidehsadat Hosseini Mohammad Amin Shabani Saghar Irandoust Yasutaka Furukawa



Research question: How to use diffusion models for spatial puzzle solving, particularly jigsaw puzzle and room arrangement tasks.
Motivation: Existing methods struggle with spatial puzzle tasks, and the authors seek an effective solution.
Method: PuzzleFusion, an end-to-end neural architecture based on diffusion models that aligns room layout pieces by estimating their 2D translations and rotations, akin to solving a jigsaw puzzle of room layouts.
Results: Trained on new datasets with ground-truth arrangements, the approach significantly outperforms competing methods on all three spatial puzzle tasks.

This paper presents an end-to-end neural architecture based on Diffusion Models for spatial puzzle solving, particularly jigsaw puzzle and room arrangement tasks. In the latter task, for instance, the proposed system ``PuzzleFusion'' takes a set of room layouts as polygonal curves in the top-down view and aligns the room layout pieces by estimating their 2D translations and rotations, akin to solving the jigsaw puzzle of room layouts. A surprising discovery of the paper is that the simple use of a Diffusion Model effectively solves these challenging spatial puzzle tasks as a conditional generation process. To enable learning of an end-to-end neural system, the paper introduces new datasets with ground-truth arrangements: 1) 2D Voronoi Jigsaw Dataset, a synthetic one where pieces are generated by the Voronoi diagram of a 2D point set; and 2) MagicPlan Dataset, a real one from a production pipeline by MagicPlan, where pieces are room layouts constructed by an augmented-reality app used by real-estate consumers. The qualitative and quantitative evaluations demonstrate that the proposed approach outperforms the competing methods by significant margins in all three spatial puzzle tasks. We have provided code and data at https://sepidsh.github.io/puzzlefusion.

Protein Design with Guided Discrete Diffusion
Nate Gruver Samuel Don Stanton Nathan C. Frey Tim G. J. Rudner Isidro Hotzel Julien Lafrance-Vanasse Arvind Rajpal Kyunghyun Cho Andrew Gordon Wilson



Research question: How to combine a generative model with a discriminative model for protein design.
Motivation: To overcome the limitations of structure-based methods, such as scarce data and challenging inverse design.
Method: diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network.
Results: Applying NOS to generalize LaMBO yields stronger performance with limited edits, and in a real-world application it optimizes antibodies for higher expression yield and binding rate.

A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99\% expression rate and 40\% binding rate in exploratory in vitro experiments.

Disentangled Wasserstein Autoencoder for T-Cell Receptor Engineering
Tianxiao Li Hongyu Guo Filippo Grazioli Mark Gerstein Martin Renqiang Min



Research question: How to automatically identify and modify functional sites in proteins from a data-driven perspective.
Motivation: The separation between functional sites and the overall structure is a fundamental concept in protein biophysics; identifying and modifying functional sites is critical for protein engineering but computationally non-trivial and demands significant domain knowledge.
Method: A disentangled Wasserstein autoencoder with an auxiliary classifier that, with theoretical guarantees, isolates function-related patterns from the rest, enabling one-pass protein sequence editing and a better understanding of the resulting sequences and editing actions.
Results: Applied to T-cell receptors (TCRs), the method alters TCR function without changing the structural backbone, outperforms several competing methods in generation quality and efficiency, and needs only 10% of the running time of baseline models. To the authors' knowledge, it is the first approach to use disentangled representations for TCR engineering.

In protein biophysics, the separation between the functionally important residues (forming the active site or binding surface) and those that create the overall structure (the fold) is a well-established and fundamental concept. Identifying and modifying those functional sites is critical for protein engineering but computationally non-trivial, and requires significant domain knowledge. To automate this process from a data-driven perspective, we propose a disentangled Wasserstein autoencoder with an auxiliary classifier, which isolates the function-related patterns from the rest with theoretical guarantees. This enables one-pass protein sequence editing and improves the understanding of the resulting sequences and editing actions involved. To demonstrate its effectiveness, we apply it to T-cell receptors (TCRs), a well-studied structure-function case. We show that our method can be used to alter the function of TCRs without changing the structural backbone, outperforming several competing methods in generation quality and efficiency, and requiring only 10\% of the running time needed by baseline models. To our knowledge, this is the first approach that utilizes disentangled representations for TCR engineering.

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
Zhong-Qiu Wang Shinji Watanabe



Research question: Unsupervised speech separation from mixture signals recorded by multiple microphones under reverberant, multi-speaker conditions.
Motivation: In over-determined conditions where the microphones outnumber the speakers, the solutions can be narrowed down to the speaker images by using each mixture signal as a constraint (the estimated speaker images at a microphone should add up to the mixture), enabling unsupervised speech separation.
Method: UNSSOR, an algorithm for unsupervised neural speech separation that leverages over-determined training mixtures. At each training step, an input mixture is fed to a deep neural network (DNN) to produce an intermediate estimate for each speaker; the estimates are linearly filtered, and a loss is optimized so that, at each microphone, the filtered estimates of all speakers sum to the mixture, satisfying the above constraint. The linear filters are computed per sub-band from the mixture and the DNN estimates via the forward convolutive prediction (FCP) algorithm, and a loss term minimizing intra-source magnitude scattering addresses the frequency permutation problem incurred by sub-band FCP.
Results: The loss is shown to promote unsupervised speaker separation. Although UNSSOR requires over-determined training mixtures, the trained DNNs can perform under-determined separation (e.g., unsupervised monaural speech separation). Evaluation on two-speaker separation in reverberant conditions demonstrates the effectiveness and potential of UNSSOR.

In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions where the microphones out-number speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture). Equipped with this insight, we propose UNSSOR, an algorithm for $\underline{u}$nsupervised $\underline{n}$eural $\underline{s}$peech $\underline{s}$eparation by leveraging $\underline{o}$ver-determined training mixtu$\underline{r}$es. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers can add up to the mixture to satisfy the above constraint. We show that this loss can promote unsupervised separation of speakers. The linear filters are computed in each sub-band based on the mixture and DNN estimates through the forward convolutive prediction (FCP) algorithm. To address the frequency permutation problem incurred by using sub-band FCP, a loss term based on minimizing intra-source magnitude scattering is proposed. Although UNSSOR requires over-determined training mixtures, we can train DNNs to achieve under-determined separation (e.g., unsupervised monaural speech separation). Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR.
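
A deliberately simplified sketch of the mixture constraint: per-microphone filters are reduced here to scalar gains fit by least squares, whereas the actual method uses sub-band convolutive filters computed with FCP.

```python
import numpy as np

# Highly simplified sketch of the mixture constraint: at each microphone,
# linearly filtered speaker estimates should sum to the observed mixture.
# Scalar per-speaker gains replace UNSSOR's sub-band convolutive FCP filters;
# this is an assumption made to keep the illustration short.

def mixture_constraint_loss(estimates, mixtures):
    """estimates: (n_speakers, n_samples) DNN outputs.
    mixtures: (n_mics, n_samples) observed signals."""
    loss = 0.0
    for y in mixtures:
        # Least-squares gains g so that g @ estimates ~ y at this mic.
        g, *_ = np.linalg.lstsq(estimates.T, y, rcond=None)
        loss += np.mean((y - g @ estimates) ** 2)
    return loss / len(mixtures)

rng = np.random.default_rng(0)
est = rng.standard_normal((2, 1000))     # two speaker estimates
mix = rng.standard_normal((3, 1000))     # three microphones (over-determined)
print(mixture_constraint_loss(est, mix))
```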

SpecTr: Fast Speculative Decoding via Optimal Transport
Ziteng Sun Ananda Theertha Suresh Jae Hun Ro Ahmad Beirami Himanshu Jain Felix Yu



Research question: Autoregressive sampling from large language models achieves state-of-the-art results on several natural language tasks but is slow, and even prohibitive for some tasks.
Motivation: To speed up sampling, speculative decoding uses a small model to generate a draft (a block or sequence of tokens), then scores all draft tokens with the large language model in parallel; a statistical rule accepts a subset of the draft tokens (rejecting the rest) while guaranteeing that the final output follows the large model's distribution.
Method: The paper develops a principled understanding of speculative decoding through the lens of optimal transport (OT) with membership cost. The new formulation generalizes speculative decoding to a set of k candidates at the token level, yielding an improved optimal membership cost. The optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in k, so the paper proposes an efficient draft selection algorithm whose acceptance probability is (1-1/e)-optimal multiplicatively and which can be computed in time almost linear in the size of a single token's domain.
Results: Building on this draft selection algorithm, the new autoregressive sampling algorithm SpecTr speeds up decoding while guaranteeing no quality degradation in the decoded output. On standard benchmarks with state-of-the-art large language models, the proposed approach achieves a 2.13x wall-clock speedup, a further 1.37x speedup over speculative decoding.

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time making it slow, and even prohibitive in certain tasks. One way to speed up sampling is *speculative decoding*: use a small model to sample a *draft* (block or sequence of tokens), and then score all tokens in the draft by the large language model in parallel. A subset of the tokens in the draft are accepted (and the rest rejected) based on a statistical method to guarantee that the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with *membership cost*. This framework can be viewed as an extension of the well-known *maximal-coupling* problem. This new formulation enables us to generalize the speculative decoding method to allow for a set of $k$ candidates at the token-level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose a valid draft selection algorithm whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear with size of domain of a single token. Using this new draft selection algorithm, we develop a new autoregressive sampling algorithm called *SpecTr*, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
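
For reference, a sketch of the single-draft (k=1) acceptance rule that SpecTr generalizes to k candidates; the rule guarantees the accepted token is distributed according to the large model:

```python
import numpy as np

# Sketch of the standard speculative decoding acceptance rule: accept the
# small model's token x with probability min(1, p(x)/q(x)); on rejection,
# resample from the residual distribution max(p - q, 0) renormalized.
# This guarantees the output token is distributed exactly as p.

rng = np.random.default_rng(0)

def accept_or_resample(x, p, q):
    """x: drafted token id; p: large-model probs; q: draft-model probs."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())

p = np.array([0.5, 0.3, 0.2])   # target distribution (large model)
q = np.array([0.2, 0.5, 0.3])   # draft distribution (small model)
draft = rng.choice(3, p=q)
print(accept_or_resample(draft, p, q))
```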

Generating Behaviorally Diverse Policies with Latent Diffusion Models
Shashank Hegde Sumeet Batra K.R. Zentner Gaurav S. Sukhatme



Research question: How to compress a large, behaviorally diverse collection of policies into a single model while retaining performance and coverage.
Motivation: Quality Diversity Reinforcement Learning (QD-RL) methods learn collections of well-performing, behaviorally diverse policies, but typically must store thousands of policies, resulting in high space complexity and poor scaling to additional behaviors.
Method: Use a diffusion model to distill the policy archive into a single generative model over policy parameters, achieving a 13x compression ratio while retaining the performance and coverage of the original collection.
Results: The method recovers 98% of the original rewards and 89% of the original humanoid archive coverage, and the conditioning mechanism of diffusion models allows behaviors to be flexibly selected and sequenced, including via language.

Recent progress in Quality Diversity Reinforcement Learning (QD-RL) has enabled learning a collection of behaviorally diverse, high performing policies. However, these methods typically involve storing thousands of policies, which results in high space-complexity and poor scaling to additional behaviors. Condensing the archive into a single model while retaining the performance and coverage of the original collection of policies has proved challenging. In this work, we propose using diffusion models to distill the archive into a single generative model over policy parameters. We show that our method achieves a compression ratio of 13x while recovering 98% of the original rewards and 89% of the original humanoid archive coverage. Further, the conditioning mechanism of diffusion models allows for flexibly selecting and sequencing behaviors, including using language. Project website: https://sites.google.com/view/policydiffusion/home.

Spatially Resolved Gene Expression Prediction from Histology Images via Bi-modal Contrastive Learning
Ronald Xie Kuan Pang Sai W Chung Catia Perciani Sonya MacParland BO WANG Gary Bader



Research question: How to make effective use of histology imaging for medical diagnosis and research, and how to understand the molecular mechanisms underlying tissue architecture.
Motivation: Histology imaging is an important tool in medical diagnosis and research; understanding the underlying molecular mechanisms is critical for uncovering disease mechanisms and developing effective treatments.
Method: BLEEP (Bi-modaL Embedding for Expression Prediction), a bi-modal embedding framework capable of generating spatially resolved gene expression profiles of whole-slide stained histology images. BLEEP uses contrastive learning to construct a low-dimensional joint embedding space from a reference dataset of paired images and expression profiles at micrometer resolution.
Results: Benchmarks on a human liver tissue dataset captured with the 10x Visium platform demonstrate BLEEP's effectiveness in gene expression prediction, with significant improvements over existing methods. This shows BLEEP's potential for revealing the molecular mechanisms underlying tissue architecture and opens new avenues for the diagnosis and study of various diseases.

Histology imaging is an important tool in medical diagnosis and research, enabling the examination of tissue structure and composition at the microscopic level. Understanding the underlying molecular mechanisms of tissue architecture is critical in uncovering disease mechanisms and developing effective treatments. Gene expression profiling provides insight into the molecular processes underlying tissue architecture, but the process can be time-consuming and expensive. We present BLEEP (Bi-modaL Embedding for Expression Prediction), a bi-modal embedding framework capable of generating spatially resolved gene expression profiles of whole-slide Hematoxylin and eosin (H&E) stained histology images. BLEEP uses contrastive learning to construct a low-dimensional joint embedding space from a reference dataset using paired image and expression profiles at micrometer resolution. With this approach, the gene expression of any query image patch can be imputed using the expression profiles from the reference dataset. We demonstrate BLEEP’s effectiveness in gene expression prediction by benchmarking its performance on a human liver tissue dataset captured using the 10x Visium platform, where it achieves significant improvements over existing methods. Our results demonstrate the potential of BLEEP to provide insights into the molecular mechanisms underlying tissue architecture, with important implications in diagnosis and research of various diseases. The proposed approach can significantly reduce the time and cost associated with gene expression profiling, opening up new avenues for high-throughput analysis of histology images for both research and clinical applications.

Compositional Sculpting of Iterative Generative Processes
Timur Garipov Sebastiaan De Peuter Ge Yang Vikas Garg Samuel Kaski Tommi S. Jaakkola



Research question: How to effectively compose and adapt generative models to meet specific task goals.
Motivation: The high training cost of generative models and the need for task-specific fine-tuning have made model reuse and composition an active research topic.
Method: Compositional Sculpting, a general approach for defining compositions of iterative generative processes, together with a classifier-guidance-based method for sampling from these compositions.
Results: The paper shows how to accomplish compositional sculpting in both GFlowNets and diffusion models and provides empirical results on image and molecular generation tasks. Project codebase: https://github.com/timgaripov/compositional-sculpting.

High training costs of generative models and the need to fine-tune them for specific tasks have created a strong interest in model reuse and composition. A key challenge in composing iterative generative processes, such as GFlowNets and diffusion models, is that to realize the desired target distribution, all steps of the generative process need to be coordinated, and satisfy delicate balance conditions. In this work, we propose Compositional Sculpting: a general approach for defining compositions of iterative generative processes. We then introduce a method for sampling from these compositions built on classifier guidance. We showcase ways to accomplish compositional sculpting in both GFlowNets and diffusion models. We highlight two binary operations: the \textit{harmonic mean} ($p_1 \otimes p_2$) and the \textit{contrast} ($p_1 \,\unicode{x25D1}\, p_2$) between pairs, and the generalization of these operations to multiple component distributions. We offer empirical results on image and molecular generation tasks. Project codebase: https://github.com/timgaripov/compositional-sculpting.
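
As a toy illustration, the sketch below composes two discrete densities with a pointwise harmonic mean, assuming from the operation's name the form p1*p2/(p1+p2) up to normalization; the method itself samples such compositions via classifier guidance rather than explicit enumeration.

```python
import numpy as np

# Toy composition of two densities on a discrete grid with a pointwise
# harmonic mean. The exact form (p1*p2)/(p1+p2), up to normalization, is an
# assumption based on the operation's name; the paper's sampling procedure
# uses classifier guidance instead of enumerating the grid.

def harmonic_mean_composition(p1, p2, eps=1e-12):
    unnormalized = (p1 * p2) / (p1 + p2 + eps)
    return unnormalized / unnormalized.sum()

xs = np.linspace(-4, 4, 200)
p1 = np.exp(-0.5 * (xs + 1) ** 2); p1 /= p1.sum()
p2 = np.exp(-0.5 * (xs - 1) ** 2); p2 /= p2.sum()
p12 = harmonic_mean_composition(p1, p2)
print(xs[np.argmax(p12)])  # mass concentrates where both components agree
```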

MarioGPT: Open-Ended Text2Level Generation through Large Language Models
Shyam Sudhakaran Miguel González-Duque Matthias Freiberger Claire Glanois Elias Najarro Sebastian Risi



Research question: How to use large language models to generate meaningful content that reflects specific intentions and constraints, and how to achieve open-ended content generation.
Motivation: Procedural Content Generation (PCG) can automatically produce complex and diverse environments, but generating meaningful content with specific intentions and constraints remains challenging, and many PCG algorithms lack the ability to generate content in an open-ended manner.
Method: MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in this case Super Mario Bros levels. MarioGPT not only generates diverse levels but also supports controllable, text-prompted level generation, addressing one of the key challenges of current PCG techniques.
Results: To the authors' knowledge, MarioGPT is the first text-to-level model; combined with novelty search, it generates diverse levels with varying play-style dynamics (i.e., player paths) and enables open-ended discovery of an increasingly diverse range of content.

Procedural Content Generation (PCG) is a technique to generate complex and diverse environments in an automated way. However, while generating content with PCG methods is often straightforward, generating meaningful content that reflects specific intentions and constraints remains challenging. Furthermore, many PCG algorithms lack the ability to generate content in an open-ended manner. Recently, Large Language Models (LLMs) have shown to be incredibly effective in many diverse domains. These trained LLMs can be fine-tuned, re-using information and accelerating training for new tasks. Here, we introduce MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in our case Super Mario Bros levels. MarioGPT can not only generate diverse levels, but can be text-prompted for controllable level generation, addressing one of the key challenges of current PCG techniques. As far as we know, MarioGPT is the first text-to-level model and combined with novelty search it enables the generation of diverse levels with varying play-style dynamics (i.e. player paths) and the open-ended discovery of an increasingly diverse range of content. Code available at https://github.com/shyamsn97/mario-gpt.

A Regularized Conditional GAN for Posterior Sampling in Image Recovery Problems
Matthew C Bendel Rizwan Ahmad Philip Schniter



Research question: In image recovery, inferring an image from distorted, incomplete, and/or noise-corrupted measurements.
Motivation: Such problems arise in magnetic resonance imaging (MRI), computed tomography, deblurring, super-resolution, inpainting, phase retrieval, image-to-image translation, and other applications; the goal is to rapidly and accurately sample from the posterior distribution.
Method: A regularized conditional Wasserstein GAN that generates dozens of high-quality posterior samples per second, with a regularization comprising an $\ell_1$ penalty and an adaptively weighted standard-deviation reward.
Results: Using quantitative metrics such as conditional Fréchet inception distance, the method produces state-of-the-art posterior samples in both multicoil MRI and large-scale inpainting applications.

In image recovery problems, one seeks to infer an image from distorted, incomplete, and/or noise-corrupted measurements. Such problems arise in magnetic resonance imaging (MRI), computed tomography, deblurring, super-resolution, inpainting, phase retrieval, image-to-image translation, and other applications. Given a training set of signal/measurement pairs, we seek to do more than just produce one good image estimate. Rather, we aim to rapidly and accurately sample from the posterior distribution. To do this, we propose a regularized conditional Wasserstein GAN that generates dozens of high-quality posterior samples per second. Our regularization comprises an $\ell_1$ penalty and an adaptively weighted standard-deviation reward. Using quantitative evaluation metrics like conditional Fréchet inception distance, we demonstrate that our method produces state-of-the-art posterior samples in both multicoil MRI and large-scale inpainting applications. The code for our model can be found here: https://github.com/matt-bendel/rcGAN.

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
Sungwon Kim Kevin J. Shih Rohan Badlani Joao Felipe Santos Evelina Bakhturina Mikyas T. Desta Rafael Valle Sungroh Yoon Bryan Catanzaro



Research question: Training a fast, data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation.
Motivation: Large-scale neural codec language models show significant improvements in zero-shot TTS but suffer from a lack of robustness, slow sampling, and reliance on pre-trained neural codec representations.
Method: P-Flow, which uses speech prompts for speaker adaptation and a flow-matching generative decoder for high-quality, fast speech synthesis.
Results: Trained with continuous speech prompts, P-Flow matches the speaker similarity of large-scale zero-shot TTS models with two orders of magnitude less training data and samples significantly faster than real time. It shows better pronunciation and is preferred in human likeness and speaker similarity over recent state-of-the-art counterparts, making it an attractive and desirable alternative.

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than in real-time. Unlike the neural codec language models, we specifically train P-Flow on LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of the large-scale zero-shot TTS models with two orders of magnitude less training data and has more than 20$\times$ faster sampling speed. Our results show that P-Flow has better pronunciation and is preferred in human likeness and speaker similarity to its recent state-of-the-art counterparts, thus defining P-Flow as an attractive and desirable alternative. We provide audio samples on our demo page: \url{https://research.nvidia.com/labs/adlr/pflow}
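
A generic sketch of the flow-matching objective such a decoder builds on (straight-line interpolation form; the speaker conditioning and the actual architecture are omitted as assumptions):

```python
import torch

# Generic conditional flow-matching objective (sketch, not P-Flow's exact
# conditioning): regress a velocity field v_theta(x_t, t) onto the
# straight-line displacement x1 - x0 along x_t = (1 - t) * x0 + t * x1.

def flow_matching_loss(model, x1):
    x0 = torch.randn_like(x1)          # noise endpoint
    t = torch.rand(x1.size(0), 1)      # uniform times in [0, 1]
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = model(xt, t)                # speaker conditioning omitted here
    return ((pred - target) ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(81, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 80))
model = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = flow_matching_loss(model, torch.randn(16, 80))  # toy mel frames
print(loss.item())
```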

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
Grace Luo Lisa Dunlap Dong Huk Park Aleksander Holynski Trevor Darrell



Research question: How to extract meaningful internal representations from diffusion models and consolidate multi-scale, multi-timestep feature maps for downstream tasks.
Motivation: Diffusion models can generate high-quality images, but the feature maps encoding their internal information are spread across network layers and diffusion timesteps, making useful descriptors hard to extract.
Method: Diffusion Hyperfeatures, a framework that consolidates multi-scale and multi-timestep feature maps into per-pixel feature descriptors usable for downstream tasks; descriptors can be extracted for both synthetic and real images via the generation and inversion processes.
Results: The method achieves superior performance on semantic keypoint correspondence, with excellent results on the SPair-71k real-image benchmark. It is also flexible and transferable, working on synthetic image pairs with unseen objects and compositions.

Diffusion models have been shown to be capable of generating high-quality images, suggesting that they could contain meaningful internal representations. Unfortunately, the feature maps that encode a diffusion model's internal information are spread not only over layers of the network, but also over diffusion timesteps, making it challenging to extract useful descriptors. We propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors that can be used for downstream tasks. These descriptors can be extracted for both synthetic and real images using the generation and inversion processes. We evaluate the utility of our Diffusion Hyperfeatures on the task of semantic keypoint correspondence: our method achieves superior performance on the SPair-71k real image benchmark. We also demonstrate that our method is flexible and transferable: our feature aggregation network trained on the inversion features of real image pairs can be used on the generation features of synthetic image pairs with unseen objects and compositions. Our code is available at https://diffusion-hyperfeatures.github.io.
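
A simplified sketch of the consolidation idea: upsample feature maps gathered across layers and timesteps to a common resolution and mix them with learned weights. The softmax-weighted summation below is a simplifying assumption standing in for the paper's aggregation network.

```python
import torch
import torch.nn.functional as F

# Sketch of consolidating multi-layer, multi-timestep diffusion feature maps
# into per-pixel descriptors with learned mixing weights. Shapes, the weight
# parameterization, and the lack of per-map projections are assumptions.

def hyperfeatures(feature_maps, weights, out_size=(64, 64)):
    """feature_maps: list of (B, C, H_i, W_i) tensors over (layer, timestep);
    weights: learnable logits, one per map."""
    mix = torch.softmax(weights, dim=0)
    ups = [F.interpolate(f, size=out_size, mode="bilinear",
                         align_corners=False) for f in feature_maps]
    return sum(w * f for w, f in zip(mix, ups))   # (B, C, 64, 64) descriptors

maps = [torch.randn(1, 32, s, s) for s in (8, 16, 32)]  # toy feature pyramid
w = torch.nn.Parameter(torch.zeros(len(maps)))
print(hyperfeatures(maps, w).shape)
```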

Precision-Recall Divergence Optimization for Generative Modeling with GANs and Normalizing Flows
Alexandre Verine benjamin negrevergne Muni Sreenivas Pydi Yann Chevaleyre



Research question: Balancing image quality (precision) against diversity (recall) is a significant challenge in the domain of generative models.
Motivation: Current state-of-the-art models primarily rely on optimizing heuristics such as the Fréchet Inception Distance. Although recent work has introduced principled methods for evaluating precision and recall, these have not yet been successfully integrated into the training of generative models.
Method: A novel training method for generative models, such as generative adversarial networks and normalizing flows, that explicitly optimizes a user-defined trade-off between precision and recall. More specifically, achieving a specified precision-recall trade-off corresponds to minimizing a unique f-divergence from a family called the PR-divergences; conversely, any f-divergence can be written as a combination of PR-divergences and corresponds to a weighted precision-recall trade-off.
Results: Comprehensive evaluations show that the approach improves existing state-of-the-art models such as BigGAN in terms of either precision or recall on datasets such as ImageNet.

Achieving a balance between image quality (precision) and diversity (recall) is a significant challenge in the domain of generative models. Current state-of-the-art models primarily rely on optimizing heuristics, such as the Fr\'echet Inception Distance. While recent developments have introduced principled methods for evaluating precision and recall, they have yet to be successfully integrated into the training of generative models. Our main contribution is a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows, which explicitly optimizes a user-defined trade-off between precision and recall. More precisely, we show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the \mbox{\em PR-divergences}. Conversely, any $f$-divergence can be written as a linear combination of PR-divergences and corresponds to a weighted precision-recall trade-off. Through comprehensive evaluations, we show that our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning
Haoran He Chenjia Bai Kang Xu Zhuoran Yang Weinan Zhang Dong Wang Bin Zhao Xuelong Li



Research question: Investigating the effectiveness of a single diffusion model for modeling large-scale multi-task offline data, which is challenging due to diverse and multimodal data distributions.
Motivation: Diffusion models show strong generative capability in vision and NLP and can model complex policies or trajectories in offline reinforcement learning, but prior work is limited to single-task settings and lacks a generalist agent capable of addressing multi-task predicaments.
Method: Multi-Task Diffusion Model (\textsc{MTDiff}), a diffusion-based method that combines Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings. \textsc{MTDiff} leverages the knowledge available in multi-task data and performs implicit knowledge sharing across tasks.
Results: For generative planning, \textsc{MTDiff} outperforms state-of-the-art algorithms across 50 Meta-World tasks and 8 Maze2D maps. For data synthesis, \textsc{MTDiff} generates high-quality data for test tasks given a single demonstration as a prompt, enhancing low-quality datasets even for unseen tasks.

Diffusion models have demonstrated highly-expressive generative capabilities in vision and NLP. Recent studies in reinforcement learning (RL) have shown that diffusion models are also powerful in modeling complex policies or trajectories in offline datasets. However, these works have been limited to single-task settings where a generalist agent capable of addressing multi-task predicaments is absent. In this paper, we aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distribution. Specifically, we propose Multi-Task Diffusion Model (\textsc{MTDiff}), a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings. \textsc{MTDiff} leverages vast amounts of knowledge available in multi-task data and performs implicit knowledge sharing among tasks. For generative planning, we find \textsc{MTDiff} outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D. For data synthesis, \textsc{MTDiff} generates high-quality data for testing tasks given a single demonstration as a prompt, which enhances the low-quality datasets for even unseen tasks.

Tree-Rings Watermarks: Invisible Fingerprints for Diffusion Images
Yuxin Wen John Kirchenbauer Jonas Geiping Tom Goldstein



Research question: Watermarking the outputs of generative models for copyright tracing and for mitigating potential harms from AI-generated content.
Motivation: Existing watermarking techniques post-process images after sampling, whereas Tree-Ring Watermarking subtly influences the entire sampling process, yielding a model fingerprint invisible to humans.
Method: Tree-Ring Watermarking embeds a pattern into the initial noise vector used for sampling; the patterns are structured in Fourier space so that they are invariant to convolutions, crops, dilations, flips, and rotations. After image generation, the watermark signal is detected by inverting the diffusion process to retrieve the noise vector, which is then checked for the embedded signal.
Results: The technique can easily be applied to arbitrary diffusion models, including text-conditioned Stable Diffusion, as a plug-in with negligible FID loss. The watermark is semantically hidden in image space and far more robust than currently deployed watermarking alternatives.

Watermarking the outputs of generative models is a crucial technique for tracing copyright and preventing potential harm from AI-generated content. In this paper, we introduce a novel technique called Tree-Ring Watermarking that robustly fingerprints diffusion model outputs. Unlike existing methods that perform post-hoc modifications to images after sampling, Tree-Ring Watermarking subtly influences the entire sampling process, resulting in a model fingerprint that is invisible to humans. The watermark embeds a pattern into the initial noise vector used for sampling. These patterns are structured in Fourier space so that they are invariant to convolutions, crops, dilations, flips, and rotations. After image generation, the watermark signal is detected by inverting the diffusion process to retrieve the noise vector, which is then checked for the embedded signal. We demonstrate that this technique can be easily applied to arbitrary diffusion models, including text-conditioned Stable Diffusion, as a plug-in with negligible loss in FID. Our watermark is semantically hidden in the image space and is far more robust than watermarking alternatives that are currently deployed.
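
A toy sketch of ring embedding and detection in Fourier space; the ring radii, key scale, and detection score are illustrative assumptions, and in the actual method the noise is first recovered by inverting the diffusion process (e.g., DDIM inversion):

```python
import numpy as np

# Sketch of a Tree-Ring-style watermark: write a symmetric key into a ring of
# Fourier coefficients of the initial noise (rings are invariant to rotations,
# crops, and dilations), then detect by comparing the spectrum of recovered
# noise to the key. All constants below are illustrative assumptions.

rng = np.random.default_rng(0)
N = 64
f = np.fft.fftfreq(N)
r = np.hypot(f[:, None], f[None, :])
ring = (0.15 < r) & (r < 0.20)                  # ring mask in frequency space

neg = (-np.arange(N)) % N                       # index map k -> -k
K = rng.standard_normal((N, N))
K = 0.5 * (K + K[neg][:, neg])                  # symmetric: K[-i,-j] = K[i,j]
key = 50.0 * K[ring]

def embed(noise):
    spec = np.fft.fft2(noise)
    spec[ring] = key                            # real, symmetric -> real ifft
    return np.real(np.fft.ifft2(spec))

def score(noise):
    return np.abs(np.fft.fft2(noise)[ring] - key).mean()  # small => marked

z = embed(rng.standard_normal((N, N)))
print(score(z), score(rng.standard_normal((N, N))))  # ~0 vs. large
```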

Collaborative Score Distillation for Consistent Visual Editing
Subin Kim Kyungmin Lee June Suk Choi Jongheon Jeong Kihyuk Sohn Jinwoo Shin



Research question: How to adapt the generative priors of large-scale text-to-image diffusion models to complex visual modalities, especially achieving consistency across multiple images such as video frames or views of a 3D scene.
Motivation: Existing text-to-image diffusion models struggle to maintain consistency across a set of images when handling complex visual modalities.
Method: Propose Collaborative Score Distillation (CSD), built on Stein Variational Gradient Descent (SVGD): multiple samples are treated as "particles" and their score functions are combined in the SVGD update, distilling generative priors over a set of images synchronously.
Results: Experiments show CSD is effective across a variety of editing tasks, including visual editing of panorama images, videos, and 3D scenes, demonstrating it as a versatile method for improving inter-sample consistency and broadening the applicability of text-to-image diffusion models.

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video or 3D scene), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as “particles” in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates the seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of editing tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
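
A minimal NumPy sketch of the SVGD-style update at the heart of CSD: each "particle" is moved by a kernel-weighted combination of all particles' scores plus a repulsive kernel-gradient term. The SVGD update itself is standard; the Gaussian toy score below is a stand-in for a diffusion model's score, and the RBF bandwidth is an illustrative choice.

```python
import numpy as np

def rbf_kernel(X, bandwidth):
    """RBF kernel matrix K[a, b] = k(x_a, x_b) and its gradient w.r.t. the first argument."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))
    gradK = -K[..., None] * (X[:, None, :] - X[None, :, :]) / bandwidth ** 2  # (n, n, d)
    return K, gradK

def svgd_step(X, score_fn, step=0.1, bandwidth=1.0):
    n = X.shape[0]
    K, gradK = rbf_kernel(X, bandwidth)
    # phi_i = (1/n) * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K @ score_fn(X) + gradK.sum(axis=0)) / n
    return X + step * phi

toy_score = lambda X: -X          # score of a standard Gaussian: grad log N(0, I) = -x
X = np.random.default_rng(0).normal(3.0, 1.0, size=(8, 2))   # 8 particles in 2-D
for _ in range(200):
    X = svgd_step(X, toy_score)
print(X.mean(axis=0))             # particles drift toward the Gaussian's mode at 0
```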

Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise
Arpit Bansal Eitan Borgnia Hong-Min Chu Jie S. Li Hamid Kazemi Furong Huang Micah Goldblum Jonas Geiping Tom Goldstein



Research question: Whether the generative behavior of diffusion models depends strongly on the choice of image degradation, and whether varying this choice can yield an entire family of generative models.
Motivation: The authors observe that the training and test-time update rules of diffusion models generalize easily to create generative models even when using completely deterministic degradations such as blur or masking.
Method: By varying the choice of image degradation, an entire family of generative models can be constructed. The success of these fully deterministic models challenges the community's understanding that diffusion models rely on noise, whether via gradient Langevin dynamics or variational inference.
Results: This paves the way for generalized diffusion models that invert arbitrary processes.

Standard diffusion models involve an image transform -- adding Gaussian noise -- and an image restoration operator that inverts this degradation. We observe that the generative behavior of diffusion models is not strongly dependent on the choice of image degradation, and in fact, an entire family of generative models can be constructed by varying this choice. Even when using completely deterministic degradations (e.g., blur, masking, and more), the training and test-time update rules that underlie diffusion models can be easily generalized to create generative models. The success of these fully deterministic models calls into question the community's understanding of diffusion models, which relies on noise in either gradient Langevin dynamics or variational inference, and paves the way for generalized diffusion models that invert arbitrary processes.
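
A minimal sketch of the generalized sampling rule with a deterministic degradation, following the improved sampler reported in the Cold Diffusion paper: x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1), where D degrades a clean image to level t. The linear fade-to-gray degradation and its exact inverse are toy stand-ins for a learned restoration network.

```python
import numpy as np

def degrade(x0, t, T=10):
    """Toy deterministic degradation D(x0, t): fade toward a constant gray value."""
    a = t / (T + 1)
    return (1 - a) * x0 + a * 0.5

def restore(xt, t, T=10):
    """Stand-in for the learned restoration network R(x_t, t) ~ x0; exact for this toy."""
    a = t / (T + 1)
    return (xt - a * 0.5) / (1 - a)

def cold_sample(xT, T=10):
    x = xT
    for t in range(T, 0, -1):
        x0_hat = restore(x, t, T)
        # Improved update from the Cold Diffusion paper:
        x = x - degrade(x0_hat, t, T) + degrade(x0_hat, t - 1, T)
    return x

x0 = np.random.default_rng(0).random((8, 8))
print(np.abs(cold_sample(degrade(x0, 10)) - x0).max())  # ~1e-15 with this exact toy inverse
```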

CSLP-AE: A Contrastive Split-Latent Permutation Autoencoder Framework for Zero-Shot Electroencephalography Signal Conversion
Anders Vestergaard Nørskov Alexander Neergaard Zahid Morten Mørup



Research question: How to extract the underlying neural activation (content) from EEG data while accounting for individual subject variability (style).
Motivation: EEG data exhibit a high degree of noise and inter-subject variability, so a signal-conversion method is needed that extracts latent representations accounting for both content and style.
Method: Inspired by advances in voice conversion, a novel contrastive split-latent permutation autoencoder (CSLP-AE) framework is proposed that directly optimizes for EEG conversion. Contrastive learning guides the latent representations so that the latent splits explicitly represent subject (style) and task (content).
Results: Compared with conventional supervised, unsupervised (AE), and self-supervised (contrastive learning) training, the approach provides favorable, generalizable characterizations of task and subject, and additionally enables zero-shot conversion between unseen subjects.

Electroencephalography (EEG) is a prominent non-invasive neuroimaging technique providing insights into brain function. Unfortunately, EEG data exhibit a high degree of noise and variability across subjects hampering generalizable signal extraction. Therefore, a key aim in EEG analysis is to extract the underlying neural activation (content) as well as to account for the individual subject variability (style). We hypothesize that the ability to convert EEG signals between tasks and subjects requires the extraction of latent representations accounting for content and style. Inspired by recent advancements in voice conversion technologies, we propose a novel contrastive split-latent permutation autoencoder (CSLP-AE) framework that directly optimizes for EEG conversion. Importantly, the latent representations are guided using contrastive learning to promote the latent splits to explicitly represent subject (style) and task (content). We contrast CSLP-AE to conventional supervised, unsupervised (AE), and self-supervised (contrastive learning) training and find that the proposed approach provides favorable generalizable characterizations of subject and task. Importantly, the procedure also enables zero-shot conversion between unseen subjects. While the present work only considers conversion of EEG, the proposed CSLP-AE provides a general framework for signal conversion and extraction of content (task activation) and style (subject variability) components of general interest for the modeling and analysis of biological signals.
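
A minimal sketch of the split-latent conversion idea: the encoder output is split into a style (subject) half and a content (task) half, and conversion recombines halves from two signals before decoding. The linear encoder/decoder and the even split are toy assumptions; CSLP-AE additionally trains with contrastive losses and latent permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 16, 8                                 # signal dim, latent dim (split into two halves)
We = rng.normal(size=(L, D)) / np.sqrt(D)    # toy linear encoder
Wd = rng.normal(size=(D, L)) / np.sqrt(L)    # toy linear decoder

def encode(x):
    z = We @ x
    return z[: L // 2], z[L // 2 :]          # (style, content) split

def decode(style, content):
    return Wd @ np.concatenate([style, content])

x_a, x_b = rng.normal(size=D), rng.normal(size=D)
style_a, _ = encode(x_a)
_, content_b = encode(x_b)
# Conversion: subject (style) of A performing the task (content) of B.
converted = decode(style_a, content_b)
print(converted.shape)
```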

PromptIR: Prompting for All-in-One Image Restoration
Vaishnav Potlapalli Syed Waqas Zamir Salman Khan Fahad Khan



Research question: This paper addresses deep learning's limited generalization across degradation types and levels in image restoration.
Motivation: Current deep learning methods require training a separate model for each specific degradation and knowing the input degradation type in order to apply the relevant model, which restricts real-world application.
Method: The paper presents PromptIR, a prompt-based learning approach for all-in-one image restoration that effectively restores images across various degradation types and levels. The method uses prompts to encode degradation-specific information, which then dynamically guides the restoration network.
Results: Experiments show PromptIR generalizes to different degradation types and levels while achieving state-of-the-art results on image denoising, deraining, and dehazing.

Image restoration involves recovering a high-quality clean image from its degraded version. Deep learning-based methods have significantly improved image restoration performance, however, they have limited generalization ability to different degradation types and levels. This restricts their real-world application since it requires training individual models for each specific degradation and knowing the input degradation type to apply the relevant model. We present a prompt-based learning approach, PromptIR, for All-In-One image restoration that can effectively restore images from various types and levels of degradation. In particular, our method uses prompts to encode degradation-specific information, which is then used to dynamically guide the restoration network. This allows our method to generalize to different degradation types and levels, while still achieving state-of-the-art results on image denoising, deraining, and dehazing. Overall, PromptIR offers a generic and efficient plugin module with few lightweight prompts that can be used to restore images of various types and levels of degradation with no prior information on the corruptions present in the image. Our code and pre-trained models are available here: https://github.com/va1shn9v/PromptIR
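
A minimal sketch of a prompt-guided restoration block: a small pool of learnable prompt components is mixed with weights predicted from the incoming features, and the mixed prompt modulates those features. The shapes, global-average pooling, and fusion by addition are illustrative assumptions, not the exact PromptIR module.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, N = 8, 4, 4, 5              # channels, spatial dims, number of prompt components
prompts = rng.normal(size=(N, C))    # learnable degradation prompts (random stand-ins)
W_mix = rng.normal(size=(C, N))      # linear head predicting prompt weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def prompt_block(feat):
    """feat: (C, H, W) feature map -> degradation-aware modulated features."""
    gap = feat.mean(axis=(1, 2))             # global average pooling, (C,)
    w = softmax(gap @ W_mix)                 # input-dependent prompt weights, (N,)
    prompt = w @ prompts                     # mixed prompt vector, (C,)
    return feat + prompt[:, None, None]      # inject degradation info into the features

feat = rng.normal(size=(C, H, W))
print(prompt_block(feat).shape)              # (8, 4, 4)
```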

Diffusion Model for Graph Inverse Problems: Towards Effective Source Localization on Complex Networks
Xin Yan Hui Fang Qiang He



Research question: Information diffusion problems, such as the spread of epidemics or rumors, are widespread in society. The graph diffusion inverse problems of locating sources and identifying diffusion paths from currently observed diffusion graphs are crucial to controlling the spread of information.
Motivation: Diffusion source localization is highly ill-posed, which is a major obstacle to accurately assessing the uncertainty involved. Moreover, while understanding how information propagates through a graph is crucial, research on reconstructing information propagation paths is scarce.
Method: We propose DDMSL (Discrete Diffusion Model for Source Localization), a probabilistic model based on the natural diffusion process of information propagation over complex networks, which can be formulated with a message-passing function. First, forward information diffusion is modeled with Markov chains; then a reversible residual network is designed to build a denoising diffusion model in discrete space for both source localization and reconstruction of information diffusion paths.
Results: We provide rigorous theoretical guarantees for DDMSL and demonstrate its effectiveness through extensive experiments on five real-world datasets.

Information diffusion problems, such as the spread of epidemics or rumors, are widespread in society. The inverse problems of graph diffusion, which involve locating the sources and identifying the paths of diffusion based on currently observed diffusion graphs, are crucial to controlling the spread of information. The problem of localizing the source of diffusion is highly ill-posed, presenting a major obstacle in accurately assessing the uncertainty involved. Besides, while comprehending how information diffuses through a graph is crucial, there is a scarcity of research on reconstructing the paths of information propagation. To tackle these challenges, we propose a probabilistic model called DDMSL (Discrete Diffusion Model for Source Localization). Our approach is based on the natural diffusion process of information propagation over complex networks, which can be formulated using a message-passing function. First, we model the forward diffusion of information using Markov chains. Then, we design a reversible residual network to construct a denoising-diffusion model in discrete space for both source localization and reconstruction of information diffusion paths. We provide rigorous theoretical guarantees for DDMSL and demonstrate its effectiveness through extensive experiments on five real-world datasets.

Beta Diffusion
Mingyuan Zhou Tianqi Chen Zhendong Wang Huangjie Zheng



Research question: This paper introduces beta diffusion, a new generative modeling method that integrates demasking and denoising to generate data within bounded ranges.
Motivation: Traditional diffusion-based generative models rely on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), whereas beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence.
Method: Beta diffusion uses scaled and shifted beta distributions and multiplicative transitions over time to build forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals given the data at any time point.
Results: Experiments show the proposed KLUBs optimize beta diffusion more effectively than negative ELBOs, demonstrating a unique capability for generative modeling of range-bounded data and validating the effectiveness of KLUBs for optimizing diffusion models, making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.

We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion compared to negative ELBOs, which can also be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.

Simple and Controllable Music Generation
Jade Copet Felix Kreuk Itai Gat Tal Remez Gabriel Synnaeve Yossi Adi Alexandre Défossez



Research question: This paper tackles the task of conditional music generation.
Motivation: Existing music generation models require multiple stages or cascades of several models, whereas the proposed MusicGen needs only a single language model with efficient token interleaving patterns, avoiding that complexity.
Method: MusicGen is a single language model operating over several streams of compressed discrete music representations (tokens), enabling high-quality samples conditioned on textual descriptions or melodic features.
Results: Experiments show MusicGen outperforms the baselines in both automatic and human evaluations, demonstrating the strength of the approach.

We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
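
A minimal sketch of the "delay" interleaving pattern, one of the codebook patterns the paper studies, which lets a single-stage LM model several parallel codebook streams: codebook k is shifted right by k steps, so at sequence position t the model handles codebook k's token for frame t - k. The pad token and this particular pattern choice are illustrative assumptions.

```python
from typing import List

PAD = -1

def delay_interleave(codes: List[List[int]]) -> List[List[int]]:
    """codes[k][t]: token of codebook k at frame t -> delayed layout of length T + K - 1."""
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

def delay_deinterleave(delayed: List[List[int]]) -> List[List[int]]:
    """Inverse of delay_interleave: undo the per-codebook shifts."""
    K = len(delayed)
    T = len(delayed[0]) - K + 1
    return [[delayed[k][t + k] for t in range(T)] for k in range(K)]

codes = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]  # 3 codebooks, 4 frames
delayed = delay_interleave(codes)
assert delay_deinterleave(delayed) == codes
for row in delayed:
    print(row)
```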

xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data
Jing Gong Minsheng Hao Xingyi Cheng Xin Zeng Chiming Liu Jianzhu Ma Xuegong Zhang Taifeng Wang Le Song



Research question: How to learn representations that can fully ingest the rapidly growing body of single-cell RNA-seq (scRNA-seq) data, which already exceeds 50M human records, each measuring about 20,000 genes.
Motivation: Classical transformer architectures are prohibitively expensive in both computation and memory to train on data of this scale, so unsupervised representation learning needs a more scalable design.
Method: The paper proposes xTrimoGene$^\alpha$ (xTrimoGene for short), an asymmetric encoder-decoder transformer for scRNA-seq data that exploits the sparsity of the data to scale up pre-training, reducing FLOPs by one to two orders of magnitude compared with classical transformers while maintaining high accuracy.
Results: Performance improves as model size scales, and the model achieves SOTA results on downstream tasks such as cell type annotation, perturb-seq effect prediction, and drug combination prediction.

Advances in high-throughput sequencing technology have led to significant progress in measuring gene expressions at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for humans with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^\alpha$ (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design of xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today. Our experiments also show that the performance of xTrimoGene improves as we scale up the model sizes, and it also leads to SOTA performance over various downstream tasks, such as cell type annotation, perturb-seq effect prediction, and drug combination prediction. xTrimoGene model is now available for use as a service via the following link: https://api.biomap.com/xTrimoGene/apply.

Energy Guided Diffusion for Generating Neurally Exciting Images
Paweł A. Pierzchlewicz Konstantin Friedrich Willeke Arne Nix Pavithra Elumalai Kelli Restivo Tori Shinn Cate Nealley Gabrielle Rodriguez Saumil Patel Katrin Franke Andreas S. Tolias Fabian H. Sinz



Research question: How to predict the activity of neurons in biological and artificial visual systems more effectively?
Motivation: As one moves up the visual hierarchy, the complexity of neuronal computations increases, making neuronal activity harder to model.
Method: The paper introduces a new readout architecture inspired by visual attention, the attention readout, which together with a data-driven convolutional core outperforms previous task-driven models in predicting neuronal activity in macaque area V4. To counter the overfitting that comes with deeper, more complex predictive models, a diffusion-based method for generating MEIs via Energy Guidance (EGG) is proposed.
Results: For models of macaque V4, EGG generates single-neuron MEIs that generalize better across model architectures than the state-of-the-art gradient ascent (GA) while reducing computational cost by a factor of 4.7, facilitating experimentally challenging closed-loop experiments. EGG diffusion can also generate other neurally exciting images, such as most exciting naturalistic images and image reconstructions that generalize well across architectures. Finally, EGG is simple to implement, requires no retraining of the diffusion model, and readily generalizes to other characterizations of the visual system, such as invariances, providing a general and flexible framework for studying the coding properties of the visual system in the context of natural images.

In recent years, most exciting inputs (MEIs) synthesized from encoding models of neuronal activity have become an established method for studying tuning properties of biological and artificial visual systems. However, as we move up the visual hierarchy, the complexity of neuronal computations increases. Consequently, it becomes more challenging to model neuronal activity, requiring more complex models. In this study, we introduce a novel readout architecture inspired by the mechanism of visual attention. This new architecture, which we call attention readout, together with a data-driven convolutional core outperforms previous task-driven models in predicting the activity of neurons in macaque area V4. However, as our predictive network becomes deeper and more complex, synthesizing MEIs via straightforward gradient ascent (GA) can struggle to produce qualitatively good results and overfit to idiosyncrasies of a more complex model, potentially decreasing the MEI's model-to-brain transferability. To solve this problem, we propose a diffusion-based method for generating MEIs via Energy Guidance (EGG). We show that for models of macaque V4, EGG generates single neuron MEIs that generalize better across varying model architectures than the state-of-the-art GA, while at the same time reducing computational costs by a factor of 4.7x, facilitating experimentally challenging closed-loop experiments. Furthermore, EGG diffusion can be used to generate other neurally exciting images, like most exciting naturalistic images that are on par with a selection of highly activating natural images, or image reconstructions that generalize better across architectures. Finally, EGG is simple to implement, requires no retraining of the diffusion model, and can easily be generalized to provide other characterizations of the visual system, such as invariances. Thus, EGG provides a general and flexible framework to study the coding properties of the visual system in the context of natural images.

Unified Segment-to-Segment Framework for Simultaneous Sequence Generation
Shaolei Zhang Yang Feng



Research question: How to generate target sequences simultaneously with the source, for low-latency, high-quality real-time scenarios such as streaming speech recognition, simultaneous machine translation, and simultaneous speech translation.
Motivation: Existing methods often rely on task-specific strategies for different sequence types, limiting the model's capacity to adaptively learn the source-target mapping and hindering the exploration of multi-task learning.
Method: The paper proposes a unified segment-to-segment framework (Seg2Seg) for simultaneous sequence generation that learns the mapping in an adaptive and unified manner. During simultaneous generation, the model alternates between waiting for a source segment and generating a target segment, making the segment a natural bridge between source and target.
Results: Experiments show Seg2Seg achieves state-of-the-art performance on multiple simultaneous generation tasks and exhibits better generality across tasks.

Simultaneous sequence generation is a pivotal task for real-time scenarios, such as streaming speech recognition, simultaneous machine translation and simultaneous speech translation, where the target sequence is generated while receiving the source sequence. The crux of achieving high-quality generation with low latency lies in identifying the optimal moments for generating, accomplished by learning a mapping between the source and target sequences. However, existing methods often rely on task-specific heuristics for different sequence types, limiting the model’s capacity to adaptively learn the source-target mapping and hindering the exploration of multi-task learning for various simultaneous tasks. In this paper, we propose a unified segment-to-segment framework (Seg2Seg) for simultaneous sequence generation, which learns the mapping in an adaptive and unified manner. During the process of simultaneous generation, the model alternates between waiting for a source segment and generating a target segment, making the segment serve as the natural bridge between the source and target. To accomplish this, Seg2Seg introduces a latent segment as the pivot between source to target and explores all potential source-target mappings via the proposed expectation training, thereby learning the optimal moments for generating. Experiments on multiple simultaneous generation tasks demonstrate that Seg2Seg achieves state-of-the-art performance and exhibits better generality across various tasks.

Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation and Elaboration
Haitao Lin Yufei Huang Odin Zhang Yunfan Liu Lirong Wu Siyuan Li Zhiyuan Chen Stan Z. Li



Research question: In recent years, AI-assisted drug design methods have been proposed to generate molecules given the pocket structures of target proteins. Most are atom-level methods, which makes it hard to generate realistic fragments with complicated structures.
Motivation: To solve this, we propose D3FG, a functional-group-based diffusion model for pocket-specific molecule generation and elaboration.
Method: D3FG decomposes molecules into two categories of components: functional groups defined as rigid bodies, and linkers as mass points. Together, these components can form complicated fragments that enhance ligand-protein interactions. In the diffusion process, D3FG diffuses the data distribution of the components' positions, orientations, and types into a prior distribution; in the generative process, noise is gradually removed from the three variables by denoisers parameterized with designed equivariant graph neural networks.
Results: Experiments show the method generates molecules with more realistic 3D structures, competitive affinities toward the protein targets, and better drug properties. As a solution to the new task of molecule elaboration, D3FG can also generate high-affinity molecules based on existing ligands and the hotspots of target proteins.

In recent years, AI-assisted drug design methods have been proposed to generate molecules given the pockets' structures of target proteins. Most of them are {\em atom-level-based} methods, which consider atoms as basic components and generate atom positions and types. In this way, however, it is hard to generate realistic fragments with complicated structures. To solve this, we propose \textsc{D3FG}, a {\em functional-group-based} diffusion model for pocket-specific molecule generation and elaboration. \textsc{D3FG} decomposes molecules into two categories of components: functional groups defined as rigid bodies and linkers as mass points. And the two kinds of components can together form complicated fragments that enhance ligand-protein interactions. To be specific, in the diffusion process, \textsc{D3FG} diffuses the data distribution of the positions, orientations, and types of the components into a prior distribution; In the generative process, the noise is gradually removed from the three variables by denoisers parameterized with designed equivariant graph neural networks. In the experiments, our method can generate molecules with more realistic 3D structures, competitive affinities toward the protein targets, and better drug properties. Besides, \textsc{D3FG} as a solution to a new task of molecule elaboration, could generate molecules with high affinities based on existing ligands and the hotspots of target proteins.

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
Geon Yeong Park Jeongsol Kim Beomsu Kim Sang Wan Lee Jong Chul Ye



Research question: Although text-to-image diffusion models excel at image generation, the generated images sometimes fail to capture the intended semantic content of the text prompts, a phenomenon known as semantic misalignment.
Motivation: To address this, we propose a novel energy-based model (EBM) framework for adaptive context control that models the posterior distribution of context vectors.
Method: We formulate EBMs over the latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. We then obtain the gradient of the log posterior of the context vectors, which can be updated and transferred to the subsequent cross-attention layer, implicitly minimizing a nested hierarchy of energy functions.
Results: Our latent EBMs further enable zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Experiments show the method is highly effective across image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.

Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have raised the issue that generated images sometimes cannot capture the intended semantic contents of the text prompts, which phenomenon is often called semantic misalignment. To address this, here we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention.

Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation
Susung Hong Donghoon Ahn Seungryong Kim



Research question: Existing text-to-3D generation techniques often suffer from view inconsistency, most notably the "Janus" (two-faced) problem.
Motivation: To address this, we examine existing frameworks and identify the main cause as the embedded bias of 2D diffusion models.
Method: We propose two debiasing approaches. The first, score debiasing, cuts off the scores estimated by 2D diffusion models and gradually increases the truncation value during optimization; the second, prompt debiasing, uses a language model to identify conflicting words between user prompts and view prompts, and adjusts the discrepancy between view prompts and the object's viewing direction.
Results: Experiments show both methods significantly reduce artifacts in the generated 3D objects and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency.

Existing score-distilling text-to-3D generation techniques, despite their considerable promise, often encounter the view inconsistency problem. One of the most notable issues is the Janus problem, where the most canonical view of an object (\textit{e.g}., face or head) appears in other views. In this work, we explore existing frameworks for score-distilling text-to-3D generation and identify the main causes of the view inconsistency problem---the embedded bias of 2D diffusion models. Based on these findings, we propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation. Our first approach, called score debiasing, involves cutting off the score estimated by 2D diffusion models and gradually increasing the truncation value throughout the optimization process. Our second approach, called prompt debiasing, identifies conflicting words between user prompts and view prompts using a language model, and adjusts the discrepancy between view prompts and the viewing direction of an object. Our experimental results show that our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead. Our project page is available at~\url{https://susunghong.github.io/Debiased-Score-Distillation-Sampling/}.

Disentangling Voice and Content with Self-Supervision for Speaker Recognition
TIANCHI LIU Kong Aik Lee Qiongqiong Wang Haizhou Li



Research question: How to extract an accurate speaker representation from speech, given that speech mixes speaker traits and content.
Motivation: The entanglement of speaker traits and content in speech makes extracting accurate speaker representations difficult.
Method: The paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each containing a learnable transition model that extracts a distinct speech component; in particular, a strengthened transition model is designed to capture complex speech dynamics. A self-supervision method is also proposed to dynamically disentangle content without any labels other than speaker identities.
Results: Experiments on the VoxCeleb and SITW datasets validate the framework, with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor extra data is needed, the method is easy to apply in practice.

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56\% and 8.24\% average reductions in EER and minDCF, respectively. Since neither additional model training nor data is specifically needed, it is easily applicable in practical use.

FABind: Fast and Accurate Protein-Ligand Binding
Qizhi Pei Kaiyuan Gao Lijun Wu Jinhua Zhu Yingce Xia Shufang Xie Tao Qin Kun He Tie-Yan Liu Rui Yan



Research question: How to accurately model the interaction between proteins and ligands and predict their binding structures, a critical yet challenging task in drug discovery.
Motivation: Although deep learning shows promise on this challenge, the two prominent approaches, sampling-based and regression-based methods, have notable limitations.
Method: We propose FABind, an end-to-end model that combines pocket prediction and docking for fast and accurate protein-ligand binding. FABind incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation.
Results: Extensive experiments on benchmark datasets show FABind has clear advantages in both effectiveness and efficiency over existing methods.

Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose FABind, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. FABind incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed FABind demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind.

SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
Shuchen Xue Mingyang Yi Weijian Luo Shifeng Zhang Jiacheng Sun Zhenguo Li Zhi-Ming Ma



Research question: Diffusion probabilistic models (DPMs) succeed at generation tasks, but sampling is time-consuming. This work analyzes sampling methods for diffusion stochastic differential equations (SDEs) and proposes an efficient stochastic Adams method for generating high-quality data.
Motivation: Although improved differential-equation solvers can accelerate sampling, stochastic sampling can offer additional advantages in generating diverse, high-quality data.
Method: Stochastic sampling is analyzed from two aspects: variance-controlled diffusion SDEs and linear multi-step SDE solvers. Based on this analysis, the paper proposes SA-Solver, an efficient stochastic Adams method for solving diffusion SDEs to generate high-quality data.
Results: Experiments show SA-Solver achieves: 1) improved or comparable performance versus existing state-of-the-art sampling methods for few-step sampling; 2) SOTA FID on substantial benchmark datasets under a suitable number of function evaluations.

Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation tasks. As sampling from DPMs is equivalent to solving diffusion SDE or ODE which is time-consuming, numerous fast sampling methods built upon improved differential equation solvers are proposed. The majority of such techniques consider solving the diffusion ODE due to its superior efficiency. However, stochastic sampling could offer additional advantages in generating diverse and high-quality data. In this work, we engage in a comprehensive analysis of stochastic sampling from two aspects: variance-controlled diffusion SDE and linear multi-step SDE solver. Based on our analysis, we propose SA-Solver, which is an improved efficient stochastic Adams method for solving diffusion SDE to generate data with high quality. Our experiments show that SA-Solver achieves: 1) improved or comparable performance compared with the existing state-of-the-art (SOTA) sampling methods for few-step sampling; 2) SOTA FID on substantial benchmark datasets under a suitable number of function evaluations (NFEs).

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
Zhiyu Jin Xuli Shen Bin Li Xiangyang Xue



Research question: How to adapt text-to-image diffusion models to images of various specific sizes and aspect ratios while maintaining visual fidelity.
Motivation: Current diffusion models are trained and evaluated on fixed-size images, yet users demand images of various sizes and aspect ratios.
Method: The authors observe that lower-resolution images suffer from incomplete object portrayal during synthesis, while higher-resolution images exhibit repetitively disordered presentation. They then establish a statistical relationship showing that attention entropy changes with token quantity, indicating that models aggregate spatial information in proportion to image resolution, and propose a scaling factor to alleviate the change in attention entropy and mitigate the observed defective patterns.
Results: Extensive experiments validate the effectiveness of the proposed scaling factor, enabling better visual effects, image quality, and text alignment without any additional training or fine-tuning.

Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis. Abiding by the tradition in deep learning, DMs are trained and evaluated on images with fixed sizes. However, users demand images of various specific sizes and aspect ratios. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First, we observe that, during the synthesis, lower resolution images suffer from incomplete object portrayal, while higher resolution images exhibit repetitively disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with token quantity, suggesting that models aggregate spatial information in proportion to image resolution. Our interpretation of these observations is that objects are incompletely depicted due to limited spatial information for low resolutions, while repetitively disorganized presentation arises from redundant spatial information for high resolutions. From this perspective, we propose a scaling factor to alleviate the change of attention entropy and mitigate the defective pattern observed. Extensive experimental results validate the efficacy of the proposed scaling factor, enabling models to achieve better visual effects, image quality, and text alignment. Notably, these improvements are achieved without additional training or fine-tuning techniques.
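
A sketch of entropy-aware attention scaling. Softmax attention entropy grows with the number of tokens n, so the paper introduces a scaling factor tied to n to keep entropy near its training-resolution value. The concrete form below, multiplying the usual 1/sqrt(d) by sqrt(log n / log n_train), is an assumption consistent with the described log-n relationship, not a verified transcription of the paper's factor.

```python
import numpy as np

def attention(Q, K, V, n_train=None):
    """Scaled dot-product attention with an optional entropy-mitigating factor."""
    d, n = Q.shape[-1], K.shape[0]
    scale = 1.0 / np.sqrt(d)
    if n_train is not None:
        scale *= np.sqrt(np.log(n) / np.log(n_train))  # assumed entropy correction
    logits = Q @ K.T * scale
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 1024, 64                       # inference with 4x the training token count
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(Q, K, V, n_train=256)
print(out.shape)
```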

Multi-scale Diffusion Denoised Smoothing
Jongheon Jeong Jinwoo Shin



Research question: How to provide adversarial robustness for large-scale pre-trained models while preserving their accuracy?
Motivation: Randomized smoothing is one of the few practical approaches that offers adversarial robustness at scale. Given an accurate denoiser such as a diffusion model, a simple "denoise-and-classify" pipeline yields so-called denoised smoothing.
Method: We present a scalable method to address the current trade-off between certified robustness and accuracy in denoised smoothing. The key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be implemented efficiently with a single diffusion model. We further propose a new objective for comparing the collective robustness of multi-scale smoothed classifiers and investigate which diffusion-model representation maximizes it.
Results: Experiments show the proposed multi-scale smoothing scheme, combined with diffusion fine-tuning, achieves strong certified robustness at high noise scales while maintaining accuracy close to non-smoothed classifiers.

Along with recent diffusion models, randomized smoothing has become one of a few tangible approaches that offers adversarial robustness to models at scale, e.g., those of large pre-trained models. Specifically, one can perform randomized smoothing on any classifier via a simple "denoise-and-classify" pipeline, so-called denoised smoothing, given that an accurate denoiser is available - such as diffusion model. In this paper, we present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing. Our key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be efficiently implemented with a single diffusion model. This approach also suggests a new objective to compare the collective robustness of multi-scale smoothed classifiers, and questions which representation of diffusion model would maximize the objective. To address this, we propose to further fine-tune diffusion model (a) to perform consistent denoising whenever the original image is recoverable, but (b) to generate rather diverse outputs otherwise. Our experiments show that the proposed multi-scale smoothing scheme, combined with diffusion fine-tuning, not only allows strong certified robustness at high noise scales but also maintains accuracy close to non-smoothed classifiers. Code is available at https://github.com/jh-jeong/smoothing-multiscale.
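
A minimal sketch of the "denoise-and-classify" smoothed prediction underlying denoised smoothing: add Gaussian noise at scale sigma, denoise, classify, and take a majority vote. The denoiser and classifier below are toy stand-ins; the paper's multi-scale scheme runs this across several sigmas with a selection rule, and the abstention logic used for certification is omitted here.

```python
import numpy as np

def smoothed_predict(x, denoiser, classifier, sigma, n=100, rng=None):
    """Majority-vote prediction of the smoothed classifier at noise scale sigma."""
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n):
        x_noisy = x + sigma * rng.normal(size=x.shape)
        label = classifier(denoiser(x_noisy, sigma))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy stand-ins: identity "denoiser" and a nearest-centroid classifier.
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
denoiser = lambda x, sigma: x
classifier = lambda x: int(np.argmin(((centroids - x) ** 2).sum(-1)))
print(smoothed_predict(np.array([0.4, 0.2]), denoiser, classifier, sigma=0.5))
```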

DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models
XiMing Xing Chuang Wang Haitao Zhou Jing Zhang Qian Yu Dong Xu



Research question: Although trained mainly on images, pretrained diffusion models show impressive power in guiding sketch synthesis.
Motivation: This paper presents DiffSketcher, an innovative algorithm that creates vectorized free-hand sketches from natural language input.
Method: DiffSketcher builds on a pretrained text-to-image diffusion model. It performs the task by directly optimizing a set of Bézier curves with an extended version of the score distillation sampling (SDS) loss, allowing a raster-level diffusion model to serve as a prior for a parametric vectorized sketch generator. The attention maps embedded in the diffusion model are also exploited for effective stroke initialization to speed up generation.
Results: The generated sketches exhibit multiple levels of abstraction while preserving the recognizability, underlying structure, and essential visual details of the drawn subject. Experiments show DiffSketcher achieves higher quality than prior work.

Even though trained mainly on images, we discover that pretrained diffusion models show impressive power in guiding sketch synthesis. In this paper, we present DiffSketcher, an innovative algorithm that creates \textit{vectorized} free-hand sketches using natural language input. DiffSketcher is developed based on a pre-trained text-to-image diffusion model. It performs the task by directly optimizing a set of Bézier curves with an extended version of the score distillation sampling (SDS) loss, which allows us to use a raster-level diffusion model as a prior for optimizing a parametric vectorized sketch generator. Furthermore, we explore attention maps embedded in the diffusion model for effective stroke initialization to speed up the generation process. The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and essential visual details of the subject drawn. Our experiments show that DiffSketcher achieves greater quality than prior work. The code and demo of DiffSketcher can be found at https://ximinng.github.io/DiffSketcher-project/.

DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
Kaiwen Zheng Cheng Lu Jianfei Chen Jun Zhu



Research question: Diffusion probabilistic models (DPMs) excel at high-fidelity image generation but suffer from inefficient sampling.
Motivation: Existing fast ODE solvers accelerate sampling, but they rely heavily on a specific parameterization during inference (such as noise/data prediction), which may not be the optimal choice.
Method: We propose a new formulation of the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Building on it, we propose DPM-Solver-v3, a new fast ODE solver for DPMs based on several coefficients computed efficiently on the pretrained model, which we call empirical model statistics. We further incorporate multistep methods and a predictor-corrector framework, and propose techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales.
Results: Experiments show DPM-Solver-v3 achieves consistently better or comparable performance for both unconditional and conditional sampling with pixel-space and latent-space DPMs, especially at 5 to 10 NFE, with an unconditional FID of 12.21 (5 NFE) on CIFAR10 and an MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, a 15% to 30% speed-up over previous state-of-the-art training-free methods. Code is available at https://github.com/thu-ml/DPM-Solver-v3.

Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they highly rely on specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose \textit{DPM-Solver-v3}, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call \textit{empirical model statistics}. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15\%$\sim$30\% compared to previous state-of-the-art training-free methods. Code is available at \url{https://github.com/thu-ml/DPM-Solver-v3}.

Likelihood-Based Diffusion Language Models
Ishaan Gulrajani Tatsunori Hashimoto



Research question: Diffusion models have not attained meaningful likelihoods on standard language modeling benchmarks; this work aims to close the likelihood gap between autoregressive and diffusion language models.
Motivation: Despite growing interest in diffusion-based language models, existing work has not shown that these models can achieve nontrivial likelihoods on standard language modeling benchmarks.
Method: Through algorithmic improvements, scaling laws, and increased compute, we introduce several methodological improvements for the maximum-likelihood training of diffusion language models.
Results: Using our methods and scaling analysis, we train and release Plaid 1B, a large diffusion language model that outperforms GPT-2 124M in likelihood on benchmark datasets and generates fluent samples in unconditional and zero-shot control settings.

Despite a growing interest in diffusion-based language models, existing work has not shown that these models can attain nontrivial likelihoods on standard language modeling benchmarks. In this work, we take the first steps towards closing the likelihood gap between autoregressive and diffusion-based language models, with the goal of building and releasing a diffusion model which outperforms a small but widely-known autoregressive model. We pursue this goal through algorithmic improvements, scaling laws, and increased compute. On the algorithmic front, we introduce several methodological improvements for the maximum-likelihood training of diffusion language models. We then study scaling laws for our diffusion models and find compute-optimal training regimes which differ substantially from autoregressive models. Using our methods and scaling analysis, we train and release Plaid 1B, a large diffusion language model which outperforms GPT-2 124M in likelihood on benchmark datasets and generates fluent samples in unconditional and zero-shot control settings.

Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent
Giannis Daras Yuval Dagan Alex Dimakis Constantinos Costis Daskalakis



Research question: The shift between the training and sampling distributions of diffusion models.
Motivation: Due to the recursive nature of the generation process, errors in earlier steps cause sampling iterates to drift away from the training distribution, yet the standard denoising score matching (DSM) objective only optimizes over non-drifted data.
Method: Propose a Consistency property (CP): the model's predictions on its own generated data should be consistent across time. Theoretically, the differential equation describing CP, together with the one describing a conservative vector field, has a unique solution given an initial condition. Consequently, if the score is learned well on non-drifted points via DSM (enforcing the true initial condition), enforcing CP on drifted points propagates the true score values.
Results: Experiments show enforcing CP improves generation quality for conditional and unconditional generation on CIFAR-10, AFHQ, and FFHQ.

Imperfect score-matching leads to a shift between the training and the sampling distribution of diffusion models. Due to the recursive nature of the generation process, errors in previous steps yield sampling iterates that drift away from the training distribution. However, the standard training objective via Denoising Score Matching (DSM) is only designed to optimize over non-drifted data. To train on drifted data, we propose to enforce a \emph{Consistency} property (CP) which states that predictions of the model on its own generated data are consistent across time. Theoretically, we show that the differential equation that describes CP together with the one that describes a conservative vector field, has a unique solution given some initial condition. Consequently, if the score is learned well on non-drifted points via DSM (enforcing the true initial condition) then enforcing CP on drifted points propagates true score values. Empirically, we show that enforcing CP improves the generation quality for conditional and unconditional generation on CIFAR-10, AFHQ, and FFHQ. We open-source our code and models: https://github.com/giannisdaras/cdm.

DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement
Wenxin Tai Yue Lei Fan Zhou Goce Trajcevski Ting Zhong



Research question: How to incorporate condition information into denoising diffusion probabilistic models (DDPMs) for speech enhancement (SE).
Motivation: Although deterministic deep learning models have been widely used for SE, recent studies show generative approaches such as DDPMs can also be effective; incorporating condition information into DDPMs, however, remains a challenge.
Method: We propose DOSE, a model-agnostic method that employs two efficient condition-augmentation techniques: first, training the model with a dropout operation forces it to prioritize the condition factor when generating samples; second, an informative adaptive prior injects condition information into the sampling process.
Results: Experiments show substantial improvements in high-quality, stable speech generation, consistency with the condition factor, and inference efficiency. Code is publicly available at https://github.com/ICDM-UESTC/DOSE.

Speech enhancement (SE) aims to improve the intelligibility and quality of speech in the presence of non-stationary additive noise. Deterministic deep learning models have traditionally been used for SE, but recent studies have shown that generative approaches, such as denoising diffusion probabilistic models (DDPMs), can also be effective. However, incorporating condition information into DDPMs for SE remains a challenge. We propose a model-agnostic method called DOSE that employs two efficient condition-augmentation techniques to address this challenge, based on two key insights: (1) We force the model to prioritize the condition factor when generating samples by training it with dropout operation; (2) We inject the condition information into the sampling process by providing an informative adaptive prior. Experiments demonstrate that our approach yields substantial improvements in high-quality and stable speech generation, consistency with the condition factor, and inference efficiency. Codes are publicly available at https://github.com/ICDM-UESTC/DOSE.
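
A minimal sketch of the two condition-augmentation ideas described in the abstract: (1) condition dropout during training, so the model learns to rely on the condition when it is present, and (2) seeding sampling from an informative, condition-derived prior rather than pure noise. The stand-in model, the null-condition convention, and the prior construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
NULL_COND = None

def training_step(model, clean, noisy_cond, p_drop=0.2):
    """One schematic training step with condition dropout."""
    cond = NULL_COND if rng.random() < p_drop else noisy_cond
    return model(clean, cond)   # stand-in for the usual DDPM denoising loss

def adaptive_prior_sample(noisy_cond, sigma=0.1):
    """Start sampling near the noisy condition instead of from pure Gaussian noise."""
    return noisy_cond + sigma * rng.normal(size=noisy_cond.shape)

toy_model = lambda x, cond: 0.0 if cond is None else float(np.mean((x - cond) ** 2))
noisy = rng.normal(size=16)
print(training_step(toy_model, 0.5 * noisy, noisy))
print(adaptive_prior_sample(noisy).shape)
```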

Diffusion Self-Guidance for Controllable Image Generation
Dave Epstein Allan Jabri Ben Poole Alexei A Efros Aleksander Holynski



Research question: How to precisely control the attributes of generated images by guiding internal representations?
Motivation: While large-scale generative models can produce high-quality images from detailed prompts, many aspects of an image are difficult or impossible to convey through text.
Method: Introduce self-guidance, a method that provides precise control over properties of the generated image by guiding the internal representations of diffusion models.
Results: Experiments show that object size, location, and appearance can be extracted from these representations and used to steer the sampling process. Self-guidance proves flexible and effective across challenging image manipulations, such as moving or resizing a single object (keeping the rest of the image unchanged), merging the appearance of objects in one image with the layout of another, and composing objects from multiple images into one. A new self-guidance-based reconstruction method further extends the approach to editing real images.

Large-scale generative models are capable of producing high-quality images from detailed prompts. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides precise control over properties of the generated image by guiding the internal representations of diffusion models. We demonstrate that the size, location, and appearance of objects can be extracted from these representations, and show how to use them to steer the sampling process. Self-guidance operates similarly to standard classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We demonstrate the flexibility and effectiveness of self-guided generation through a wide range of challenging image manipulations, such as modifying the position or size of a single object (keeping the rest of the image unchanged), merging the appearance of objects in one image with the layout of another, composing objects from multiple images into one, and more. We also propose a new method for reconstruction using self-guidance, which allows extending our approach to editing real images.

IMPRESS: Evaluating the Resilience of Imperceptible Perturbations Against Unauthorized Data Usage in Diffusion-Based Generative AI
Bochuan Cao Changjiang Li Ting Wang Jinyuan Jia Bo Li Jinghui Chen



Research question: Diffusion-based image generation models such as Stable Diffusion or DALL·E 2 can be used to maliciously edit original images without authorization from their owners.
Motivation: In response, researchers have tried to protect original images from unauthorized data usage by adding imperceptible perturbations designed to mislead the diffusion model so that it cannot properly generate new samples.
Method: The authors introduce a perturbation purification platform named IMPRESS, based on the key observation that imperceptible perturbations can lead to a perceptible inconsistency between the original image and its diffusion-reconstructed counterpart; this observation is used to devise a new optimization strategy for purifying the image.
Results: IMPRESS offers a comprehensive evaluation of several contemporary protection methods and can serve as an evaluation platform for future protection methods.

Diffusion-based image generation models, such as Stable Diffusion or DALL·E 2, are able to learn from given images and generate high-quality samples following the guidance from prompts. For instance, they can be used to create artistic images that mimic the style of an artist based on his/her original artworks or to maliciously edit the original images for fake content. However, such ability also brings serious ethical issues without proper authorization from the owner of the original images. In response, several attempts have been made to protect the original images from such unauthorized data usage by adding imperceptible perturbations, which are designed to mislead the diffusion model and make it unable to properly generate new samples. In this work, we introduce a perturbation purification platform, named IMPRESS, to evaluate the effectiveness of imperceptible perturbations as a protective measure. IMPRESS is based on the key observation that imperceptible perturbations could lead to a perceptible inconsistency between the original image and the diffusion-reconstructed image, which can be used to devise a new optimization strategy for purifying the image, which may weaken the protection of the original image from unauthorized data usage (e.g., style mimicking, malicious editing). The proposed IMPRESS platform offers a comprehensive evaluation of several contemporary protection methods, and can be used as an evaluation platform for future protection methods.
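
A minimal sketch of the purification objective suggested by the abstract: find a purified image that stays close to the protected input while being consistent with its own (diffusion) reconstruction. The quadratic consistency term, the plain gradient descent with the reconstruction treated as constant, and the toy `reconstruct` stand-in are illustrative assumptions, not the IMPRESS algorithm itself.

```python
import numpy as np

def purify(x_protected, reconstruct, lam=0.5, lr=0.1, steps=100):
    """Minimize ||x - reconstruct(x)||^2 + lam * ||x - x_protected||^2 over x."""
    x = x_protected.copy()
    for _ in range(steps):
        # Gradient, approximating reconstruct(x) as constant at each step.
        grad = 2 * (x - reconstruct(x)) + 2 * lam * (x - x_protected)
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
clean = rng.random(32)
protected = clean + 0.05 * rng.normal(size=32)   # imperceptibly perturbed image
reconstruct = lambda x: clean                     # toy stand-in for a diffusion reconstruction
purified = purify(protected, reconstruct)
# The purified image lands closer to the clean image than the protected one did.
print(np.abs(purified - clean).mean(), np.abs(protected - clean).mean())
```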

Non-autoregressive Machine Translation with Probabilistic Context-free Grammar
Shangtong Gui Chenze Shao Zhengrui Ma Xishan Zhang Yunji Chen Yang Feng



Research question: How to improve the expressive power and performance of non-autoregressive Transformers (NAT) in neural machine translation?
Motivation: Conventional NAT models assume conditional independence among target tokens, which limits their expressive power and performance compared with autoregressive (AT) models.
Method: Propose PCFG-NAT, which leverages a specially designed Probabilistic Context-Free Grammar (PCFG) to enhance the ability of NAT models to capture complex dependencies among output tokens.
Results: Experiments show PCFG-NAT further narrows the translation-quality gap between NAT and AT models. Moreover, PCFG-NAT facilitates a deeper understanding of the generated sentences, addressing the lack of satisfactory explainability in neural machine translation.

Non-autoregressive Transformer (NAT) significantly accelerates the inference of neural machine translation. However, conventional NAT models suffer from limited expression power and performance degradation compared to autoregressive (AT) models due to the assumption of conditional independence among target tokens. To address these limitations, we propose a novel approach called PCFG-NAT, which leverages a specially designed Probabilistic Context-Free Grammar (PCFG) to enhance the ability of NAT models to capture complex dependencies among output tokens. Experimental results on major machine translation benchmarks demonstrate that PCFG-NAT further narrows the gap in translation quality between NAT and AT models. Moreover, PCFG-NAT facilitates a deeper understanding of the generated sentences, addressing the lack of satisfactory explainability in neural machine translation. Code is publicly available at https://github.com/ictnlp/PCFG-NAT.

PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
Yizhe Zhang Jiatao Gu Zhuofeng Wu Shuangfei Zhai Joshua M. Susskind Navdeep Jaitly



Research question: This paper addresses the repetitive, low-quality output that autoregressive text generation can produce.
Motivation: Errors accumulate during autoregressive generation and degrade output quality; denoising diffusion models can revisit and revise their output, but they are computationally expensive and produce less fluent text.
Method: The paper proposes PLANNER, a model that combines latent semantic diffusion with autoregressive generation, using a decoding module together with a planning module to generate fluent text while exercising global control over paragraphs.
Results: Experiments on semantic generation, text completion, and summarization show PLANNER effectively generates high-quality long-form text.

Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during the steps of generation. This issue is often attributed to exposure bias -- the difference between how a model is trained, and how it is used during inference. Denoising diffusion models provide an alternative approach in which a model can revisit and revise its output. However, they can be computationally expensive and prior efforts on text have led to models that produce less fluent output compared to autoregressive models, especially for longer text and paragraphs. In this paper, we propose PLANNER, a model that combines latent semantic diffusion with autoregressive generation, to generate fluent text while exercising global control over paragraphs. The model achieves this by combining an autoregressive "decoding" module with a "planning" module that uses latent diffusion to generate semantic paragraph embeddings in a coarse-to-fine manner. The proposed method is evaluated on various conditional generation tasks, and results on semantic generation, text completion and summarization show its effectiveness in generating high-quality long-form text in an efficient manner.

LayoutPrompter: Awaken the Design Ability of Large Language Models
Jiawei Lin Jiaqi Guo Shizhao Sun Zijiang James Yang Jian-Guang Lou Dongmei Zhang



Research question: How to address the limited versatility and data efficiency of existing layout generation methods through in-context learning.
Motivation: Although existing layout generation work performs well, its lack of versatility and data efficiency limits practical application.
Method: Propose LayoutPrompter, which leverages large language models (LLMs) via in-context learning, built from three key components: input-output serialization, dynamic exemplar selection, and layout ranking.
Results: Experiments on four public datasets show that LayoutPrompter can match or outperform existing methods without any model training or fine-tuning, demonstrating its effectiveness and data efficiency.

Conditional graphic layout generation, which automatically maps user constraints to high-quality layouts, has attracted widespread attention today. Although recent works have achieved promising performance, the lack of versatility and data efficiency hinders their practical applications. In this work, we propose LayoutPrompter, which leverages large language models (LLMs) to address the above problems through in-context learning. LayoutPrompter is made up of three key components, namely input-output serialization, dynamic exemplar selection and layout ranking. Specifically, the input-output serialization component meticulously designs the input and output formats for each layout generation task. Dynamic exemplar selection is responsible for selecting the most helpful prompting exemplars for a given input. And a layout ranker is used to pick the highest quality layout from multiple outputs of LLMs. We conduct experiments on all existing layout generation tasks using four public datasets. Despite the simplicity of our approach, experimental results show that LayoutPrompter can compete with or even outperform state-of-the-art approaches on these tasks without any model training or fine-tuning. This demonstrates the effectiveness of this versatile and training-free approach. In addition, the ablation studies show that LayoutPrompter is significantly superior to the training-based baseline in a low-data regime, further indicating the data efficiency of LayoutPrompter. Our project is available at https://github.com/microsoft/LayoutGeneration/tree/main/LayoutPrompter.
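
A minimal sketch of the in-context pipeline: serialize a layout task into text, pick the most similar exemplars, and assemble a few-shot prompt for an LLM. The serialization format, the token-overlap similarity, and the layout string convention are illustrative assumptions; the real system also ranks multiple LLM outputs, which is omitted here.

```python
from typing import Dict, List

def serialize(task: Dict) -> str:
    """Turn a layout task (constraints + element types) into a text block."""
    elems = " | ".join(e["type"] for e in task["elements"])
    return f"Constraints: {task['constraints']}\nElements: {elems}"

def similarity(a: str, b: str) -> float:
    """Jaccard overlap on tokens, a simple stand-in for exemplar relevance."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def build_prompt(query: Dict, exemplars: List[Dict], k: int = 2) -> str:
    q = serialize(query)
    ranked = sorted(exemplars, key=lambda e: -similarity(q, serialize(e)))
    shots = "\n\n".join(f"{serialize(e)}\nLayout: {e['layout']}" for e in ranked[:k])
    return f"{shots}\n\n{q}\nLayout:"

exemplars = [
    {"constraints": "poster, title on top", "elements": [{"type": "title"}, {"type": "image"}],
     "layout": "title(0,0,100,20) image(0,20,100,80)"},
    {"constraints": "app screen, button bottom", "elements": [{"type": "button"}],
     "layout": "button(10,85,80,10)"},
]
query = {"constraints": "poster, title on top", "elements": [{"type": "title"}]}
print(build_prompt(query, exemplars))   # feed this prompt to an LLM, then rank its outputs
```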

ResoNet: a Physics-Informed DL Framework for Off-Resonance Correction in MRI Trained with Noise
Alfredo De Goyeneche Shreya Ramachandran Ke Wang Ekin Karasan Joseph Yitan Cheng Stella X. Yu Michael Lustig



Research question: How to effectively correct off-resonance effects in magnetic resonance imaging (MRI)?
Motivation: Conventional MRI acquisition samples k-space inefficiently, while more efficient non-Cartesian sampling trajectories are more susceptible to magnetic field inhomogeneities, which lead to off-resonance artifacts.
Method: The paper presents a physics-informed deep learning framework for off-resonance correction in MRI. The framework models and separates fat/water partial volume effects and supports parallel imaging acceleration. Through end-to-end training on synthetic randomized data, the network learns to reverse off-resonance effects across diverse anatomies and contrasts without retraining.
Results: Experiments on phantom and in-vivo data show the method effectively removes off-resonance effects, potentially facilitating clinical adoption of non-Cartesian sampling trajectories for fast, efficient, and motion-robust MRI scans.

Magnetic Resonance Imaging (MRI) is a powerful medical imaging modality that offers diagnostic information without harmful ionizing radiation. Unlike optical imaging, MRI sequentially samples the spatial Fourier domain (k-space) of the image. Measurements are collected in multiple shots, or readouts, and in each shot, data along a smooth trajectory is sampled. Conventional MRI data acquisition relies on sampling k-space row-by-row in short intervals, which is slow and inefficient. More efficient, non-Cartesian sampling trajectories (e.g., Spirals) use longer data readout intervals, but are more susceptible to magnetic field inhomogeneities, leading to off-resonance artifacts. Spiral trajectories cause off-resonance blurring in the image, and the mathematics of this blurring resembles that of optical blurring, where magnetic field variation corresponds to depth and readout duration to aperture size. Off-resonance blurring is a system issue with a physics-based, accurate forward model. We present a physics-informed deep learning framework for off-resonance correction in MRI, which is trained exclusively on synthetic data. Our approach allows for fat/water partial volume effects modeling and separation, and parallel imaging acceleration. Through end-to-end training using synthetic randomized data (i.e., images, coil sensitivities, field maps), we train the network to reverse off-resonance effects across diverse anatomies and contrasts without retraining. We demonstrate the effectiveness of our approach through results on phantom and in-vivo data. This work has the potential to facilitate the clinical adoption of non-Cartesian sampling trajectories, enabling efficient, rapid, and motion-robust MRI scans. Code is publicly available at: https://github.com/mikgroup/ResoNet.

Self-Supervised Visual Acoustic Matching
Arjun Somayazulu Changan Chen Kristen Grauman



Research question: Existing acoustic matching methods require paired training data, which limits the diversity of training data or requires simulated data or heuristics to create paired samples.
Motivation: To address this, the authors propose a self-supervised visual acoustic matching approach that uses only the target scene's image and audio as training samples, without acoustically mismatched source audio for reference.
Method: The method jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio.
Results: Trained on either in-the-wild web data or simulated data, the method outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.

Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio---without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.

DESSERT: An Efficient Algorithm for Vector Set Search with Vector Set Queries
Joshua Engels Benjamin Coleman Vihan Lakshman Anshumali Shrivastava



Research question: This paper studies the vector set search problem, where both the query and each element in the collection are sets of vectors.
Motivation: The problem is a core subroutine in semantic search applications, yet existing solutions are unacceptably slow.
Method: A new approximate search algorithm, DESSERT, which efficiently searches sets of embeddings via retrieval tables.
Results: Integrating DESSERT into ColBERT, a state-of-the-art semantic search model, yields a 2-5x speedup on the MS MARCO and LoTTE retrieval benchmarks with minimal loss in recall.

We study the problem of $\text{\emph{vector set search}}$ with $\text{\emph{vector set queries}}$. This task is analogous to traditional near-neighbor search, with the exception that both the query and each element in the collection are $\text{\textit{sets}}$ of vectors. We identify this problem as a core subroutine for semantic search applications and find that existing solutions are unacceptably slow. Towards this end, we present a new approximate search algorithm, DESSERT ($\text{\bf D}$ESSERT $\text{\bf E}$fficiently $\text{\bf S}$earches $\text{\bf S}$ets of $\text{\bf E}$mbeddings via $\text{\bf R}$etrieval $\text{\bf T}$ables). DESSERT is a general tool with strong theoretical guarantees and excellent empirical performance. When we integrate DESSERT into ColBERT, a state-of-the-art semantic search model, we find a 2-5x speedup on the MS MARCO and LoTTE retrieval benchmarks with minimal loss in recall, underscoring the effectiveness and practical applicability of our proposal.
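
A minimal sketch of the underlying idea: estimate the similarity between each query vector and a target set by counting hash collisions under a locality-sensitive hash (signed random projections here), then aggregate a max-style per-query-vector score over the set. The table layout and the Chamfer-style sum aggregation are simplified assumptions relative to DESSERT's actual retrieval tables and guarantees.

```python
import numpy as np

def srp_hashes(X, planes):
    """Signed-random-projection hash bits for each vector: (n, num_hashes) booleans."""
    return (X @ planes.T) > 0

def set_similarity(query_set, target_set, planes):
    Hq = srp_hashes(query_set, planes)   # (m, L)
    Ht = srp_hashes(target_set, planes)  # (n, L)
    # Collision rate between each query/target pair estimates angular similarity;
    # take the best match per query vector and sum over the query set.
    collisions = (Hq[:, None, :] == Ht[None, :, :]).mean(-1)   # (m, n)
    return collisions.max(axis=1).sum()

rng = np.random.default_rng(0)
planes = rng.normal(size=(64, 16))       # 64 hash functions over 16-dim vectors
A = rng.normal(size=(4, 16))
print(set_similarity(A, A, planes))                          # self-similarity: 4.0 (max)
print(set_similarity(A, rng.normal(size=(4, 16)), planes))   # a random set scores lower
```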

PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
Jialu Li Mohit Bansal



Research question: In Vision-and-Language Navigation, how to effectively train agents given the limited availability of real training environments.
Motivation: The difficulty of acquiring real environments limits agents' ability to generalize to new, unseen environments.
Method: Propose PanoGen, a generation method that can create a potentially infinite variety of new panoramic environments conditioned on text. Specifically, image descriptions are collected by captioning images of existing environments, a state-of-the-art text-to-image diffusion model generates new panoramic environments, and recursive outpainting produces consistent 360-degree panorama views.
Results: Applying PanoGen in VLN pre-training and fine-tuning, experiments show the generated environments achieve new state-of-the-art results on the Room-to-Room, Room-for-Room, and CVDN datasets. Pre-training with PanoGen-generated environments is especially effective for CVDN, and training with more generated environments further improves generalization to unseen environments.

Vision-and-Language Navigation requires the agent to follow language instructions to navigate through 3D environments. One main challenge in Vision-and-Language Navigation is the limited availability of photorealistic training environments, which makes it hard to generalize to new and unseen environments. To address this problem, we propose PanoGen, a generation method that can potentially create an infinite number of diverse panoramic environments conditioned on text. Specifically, we collect room descriptions by captioning the room images in existing Matterport3D environments, and leverage a state-of-the-art text-to-image diffusion model to generate the new panoramic environments. We use recursive outpainting over the generated images to create consistent 360-degree panorama views. Our new panoramic environments share similar semantic information with the original environments by conditioning on text descriptions, which ensures the co-occurrence of objects in the panorama follows human intuition, and creates enough diversity in room appearance and layout with image outpainting. Lastly, we explore two ways of utilizing PanoGen in VLN pre-training and fine-tuning. We generate instructions for paths in our PanoGen environments with a speaker built on a pre-trained vision-and-language model for VLN pre-training, and augment the visual observation with our panoramic environments during agents' fine-tuning to avoid overfitting to seen environments. Empirically, learning with our PanoGen environments achieves the new state-of-the-art on the Room-to-Room, Room-for-Room, and CVDN datasets. Besides, we find that pre-training with our PanoGen speaker data is especially effective for CVDN, which has under-specified instructions and needs commonsense knowledge to reach the target. Lastly, we show that the agent can benefit from training with more generated panoramic environments, suggesting promising results for scaling up the PanoGen environments to enhance agents' generalization to unseen environments.

Any-to-Any Generation via Composable Diffusion
Zineng Tang Ziyi Yang Chenguang Zhu Michael Zeng Mohit Bansal



Research question: This paper presents CoDi, a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio.
Motivation: Existing generative AI systems typically handle only a single input and output modality and require large training datasets, whereas CoDi can generate multiple modalities in parallel, and its input is not limited to a subset of modalities such as text or images.
Method: CoDi aligns modalities in both the input and output spaces, allowing it to condition freely on any input combination even when certain modality combinations are absent from the training data. It employs a novel composable generation strategy that builds a shared multimodal space by bridging alignment in the diffusion process, enabling synchronized generation of intertwined modalities such as temporally aligned video and audio.
Results: CoDi is highly customizable and flexible, achieves strong joint-modality generation quality, and outperforms or matches the unimodal state of the art for single-modality synthesis.

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis.

Learning to Tokenize for Generative Retrieval
Weiwei Sun Lingyong Yan Zheng Chen Shuaiqiang Wang Haichao Zhu Pengjie Ren Zhumin Chen Dawei Yin Maarten de Rijke Zhaochun Ren



Research question: How to assign each document a unique docid (document tokenization) is a critical problem in information retrieval, because it determines whether a generative retrieval model can precisely retrieve any document simply by decoding its docid.
Motivation: Most existing methods adopt rule-based tokenization, which is ad hoc and does not generalize well.
Method: This paper proposes GenRet, a document tokenization learning method that encodes the complete document semantics into docids via discrete auto-encoding.
Results: Experiments on the NQ320K, MS MARCO, and BEIR datasets show that GenRet establishes new state-of-the-art performance. Compared with generative retrieval baselines, GenRet achieves significant improvements on unseen documents and also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.

As a new paradigm in information retrieval, generative retrieval directly generates a ranked list of document identifiers (docids) for a given query using generative language models (LMs). How to assign each document a unique docid (denoted as document tokenization) is a critical problem, because it determines whether the generative retrieval model can precisely retrieve any document by simply decoding its docid. Most existing methods adopt rule-based tokenization, which is ad-hoc and does not generalize well. In contrast, in this paper we propose a novel document tokenization learning method, GenRet, which learns to encode the complete document semantics into docids. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. We develop a progressive training scheme to capture the autoregressive nature of docids and diverse clustering techniques to stabilize the training process. Based on the semantic-embedded docids of any set of documents, the generative retrieval model can learn to generate the most relevant docid only according to the docids' semantic relevance to the queries. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets. GenRet establishes the new state-of-the-art on the NQ320K dataset. Compared to generative retrieval baselines, GenRet can achieve significant improvements on unseen documents. Moreover, GenRet can also outperform comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.

CELL-E 2: Translating Proteins to Pictures and Back with a Bidirectional Text-to-Image Transformer
Emaad Khwaja Yun S. Song Aaron Agarunov Bo Huang



Research question: How to translate amino acid sequences into images of protein subcellular localization, and conversely how to generate amino acid sequences from images.
Motivation: Protein localization is a challenging problem that requires integrating sequence and image information, which most existing methods ignore.
Method: CELL-E 2, a novel bidirectional transformer that captures the spatial complexity of protein localization and produces localization probability estimates atop nucleus images, while also generating sequences from images, enabling de novo protein design.
Results: CELL-E 2 is trained and fine-tuned on two large-scale human protein datasets and is used to create hundreds of novel nuclear localization signals (NLS).

We present CELL-E 2, a novel bidirectional transformer that can generate images depicting protein subcellular localization from the amino acid sequences (and vice versa). Protein localization is a challenging problem that requires integrating sequence and image information, which most existing methods ignore. CELL-E 2 extends the work of CELL-E, not only capturing the spatial complexity of protein localization and producing probability estimates of localization atop a nucleus image, but also being able to generate sequences from images, enabling de novo protein design. We train and finetune CELL-E 2 on two large-scale datasets of human proteins. We also demonstrate how to use CELL-E 2 to create hundreds of novel nuclear localization signals (NLS). Results and interactive demos are featured at https://bohuanglab.github.io/CELL-E_2/.

PoET: A generative model of protein families as sequences-of-sequences
Timothy Fei Truong Jr Tristan Bepler



Research question: How to design proteins with desired functions.
Motivation: Current generative protein models are either difficult to direct toward a specific family of interest or must be trained on a large multiple sequence alignment from that family, preventing transfer learning across families.
Method: The Protein Evolutionary Transformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related protein sequences across tens of millions of natural protein sequence clusters.
Results: Experiments show that PoET outperforms existing protein language models and evolutionary sequence models on deep mutational scanning datasets and can controllably generate new protein sequences.

Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose **P**r**o**tein **E**volutionary **T**ransformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer; we model tokens sequentially within sequences while attending between sequences order invariantly, allowing PoET to scale to context lengths beyond those used during training. In extensive experiments on deep mutational scanning datasets, we show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths. We also demonstrate PoET's ability to controllably generate new protein sequences.

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
George Stein Jesse C. Cresswell Rasa Hosseinzadeh Yi Sui Brendan Leigh Ross Valentin Villecroze Zhaoyan Liu Anthony L. Caterini Eric Taylor Gabriel Loaiza-Ganem



Research question: This paper studies a wide variety of generative models across semantically diverse image datasets in order to understand and improve the feature extractors and metrics used to evaluate them.
Motivation: The best practice for measuring the human-perceived realism of generated samples is a large-scale human study, yet no existing metric correlates strongly with human evaluations.
Method: Comparing 17 modern metrics covering overall performance, fidelity, diversity, rarity, and memorization, the authors find that the state-of-the-art perceptual realism of diffusion models, as judged by humans, is not reflected in commonly reported metrics such as FID. They address this by studying alternative self-supervised feature extractors and find that the semantic information a network encodes depends strongly on its training procedure.
Results: The experiments show that current metrics do not properly detect data memorization. To advance generative models and their evaluation, the authors release all generated image datasets, the human evaluation data, and a modular library that computes 17 common metrics for 9 different encoders.

We systematically study a wide variety of generative models spanning semantically-diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization: none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation we release all generated image datasets, human evaluation data, and a modular library to compute 17 common metrics for 9 different encoders at https://github.com/layer6ai-labs/dgm-eval.
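Since the paper's critique centers on FID and its reliance on Inception-V3 features, a minimal sketch of the metric helps make the argument concrete. The Fréchet distance below is the standard formula between Gaussians fitted to real and generated features; swapping Inception-V3 for DINOv2-ViT-L/14 changes only the feature arrays passed in. This is generic reference code, not the released library.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.
    Swapping the encoder (e.g., Inception-V3 vs. DINOv2) changes only
    the feature arrays passed in, not the metric itself."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2).real   # matrix sqrt; drop numerical imaginaries
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```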

Latent Diffusion for Language Generation
Justin Lovelace Varsha Kishore Chao Wan Eliot Seo Shekhtman Kilian Q Weinberger



Research question: Diffusion models have achieved great success on continuous modalities such as images, audio, and video, but have seen limited use in discrete domains such as language.
Motivation: Apply diffusion to language by treating diffusion models as complementary to existing pretrained language models rather than as replacements.
Method: Use encoder-decoder language models to learn high-quality language autoencoders, then learn continuous diffusion models in the autoencoder's latent space, sampling continuous latent representations that the pretrained decoder maps back to natural language.
Results: Across multiple diverse datasets, the latent language diffusion models are significantly more effective than previous diffusion language models on unconditional, class-conditional, and sequence-to-sequence generation.

Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to existing pretrained language models. We view diffusion and existing language models as complementary. We demonstrate that encoder-decoder language models can be utilized to efficiently learn high-quality language autoencoders. We then demonstrate that continuous diffusion models can be learned in the latent space of the language autoencoder, enabling us to sample continuous latent representations that can be decoded into natural language with the pretrained decoder. We validate the effectiveness of our approach for unconditional, class-conditional, and sequence-to-sequence language generation. We demonstrate across multiple diverse data sets that our latent language diffusion models are significantly more effective than previous diffusion language models. Our code is available at \url{https://github.com/justinlovelace/latent-diffusion-for-language}.
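A minimal sketch of one training step may clarify the two-stage recipe; the module names (`frozen_encoder`, `denoiser`) and the epsilon-prediction objective are assumptions standing in for the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def latent_diffusion_step(frozen_encoder, denoiser, tokens, alphas_cumprod):
    """One hypothetical training step: diffuse the frozen autoencoder's
    continuous latents and train the denoiser to predict the added noise."""
    with torch.no_grad():
        z0 = frozen_encoder(tokens)                  # (B, L, d) continuous latents
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1)             # noise schedule at sampled t
    eps = torch.randn_like(z0)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps      # forward diffusion in latent space
    return F.mse_loss(denoiser(zt, t), eps)          # epsilon-prediction objective
```

At sampling time, latents drawn from the learned diffusion are passed to the pretrained decoder to produce text.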

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Shihao Zhao Dongdong Chen Yen-Chun Chen Jianmin Bao Shaozhe Hao Lu Yuan Kwan-Yee K. Wong



Research question: Despite great progress, existing text-to-image diffusion models still struggle to understand complex text and to generate correspondingly controlled images.
Motivation: Overcoming this difficulty calls for a flexible, composable framework that can simultaneously exploit different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings).
Method: Uni-ControlNet, a unified framework that allows different local and global controls to be used simultaneously, flexibly, and composably within one single model. Unlike existing methods, Uni-ControlNet only requires fine-tuning two additional adapters on top of a frozen pretrained text-to-image diffusion model, with no training from scratch.
Results: In both quantitative and qualitative comparisons, Uni-ControlNet outperforms existing methods in controllability, generation quality, and composability.

Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one single model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.

Debiasing Pretrained Generative Models by Uniformly Sampling Semantic Attributes
Walter Gerych Kevin Hickey Luke Buquicchio Kavin Chandrasekaran Abdulaziz Alajaji Elke Rundensteiner Emmanuel Agu



Research question: Generative models are increasingly used in science and industry, but they often perpetuate the biases in their training sets, such as societal biases that leave certain groups underrepresented in the data.
Motivation: Image generators may, for instance, overwhelmingly produce images of white people because their training data contain few non-white samples. Generative models should therefore be debiased so that they synthesize an equal number of instances for each group, without the prohibitive cost of retraining.
Method: A distribution mapping module that produces samples from a fair noise distribution, such that the pretrained generative model, conditioned on these samples, produces semantically uniform outputs with an equal number of instances per group. This requires neither retraining the generator nor any real training data.
Results: Experiments on debiasing generators trained on popular real-world datasets show that the method outperforms existing approaches.

Generative models are being increasingly used in science and industry applications. Unfortunately, they often perpetuate the biases present in their training sets, such as societal biases causing certain groups to be underrepresented in the data. For instance, image generators may overwhelmingly produce images of white people due to few non-white samples in their training data. It is imperative to debias generative models so they synthesize an equal number of instances for each group, while not requiring retraining of the model to avoid prohibitive expense. We thus propose a *distribution mapping module* that produces samples from a *fair noise distribution*, such that the pretrained generative model produces *semantically uniform* outputs - an equal number of instances for each group - when conditioned on these samples. This does *not* involve retraining the generator, nor does it require *any* real training data. Experiments on debiasing generators trained on popular real-world datasets show that our method outperforms existing approaches.
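A rough sketch of how such a distribution mapping module could be trained, assuming a frozen generator and a differentiable group classifier; all interfaces and the objective below are hypothetical and the paper's actual training recipe may differ.

```python
import torch
import torch.nn.functional as F

def fair_noise_loss(mapper, generator, group_clf, batch=256, z_dim=128):
    """Train only the mapping network: a frozen generator conditioned on the
    mapped noise should produce a uniform marginal over groups, as judged
    by a differentiable group classifier. All interfaces are hypothetical."""
    z = torch.randn(batch, z_dim)
    z_fair = mapper(z)                                # candidate fair noise
    group_probs = group_clf(generator(z_fair)).softmax(dim=-1)
    marginal = group_probs.mean(dim=0)                # group distribution over the batch
    uniform = torch.full_like(marginal, 1.0 / marginal.numel())
    return F.kl_div(marginal.log(), uniform, reduction="sum")
```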

CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
Sihan Xu Ziqiao Ma Yidong Huang Honglak Lee Joyce Chai



Research question: Diffusion models (DMs) have enabled breakthroughs in image synthesis but lack an intuitive interface for image-to-image (I2I) translation.
Motivation: Address the consistency problem of pre-trained DMs on unpaired I2I translation tasks.
Method: Cyclenet, a novel method that incorporates cycle consistency into DMs to regularize image manipulation.
Results: Validated on unpaired I2I tasks of different granularities, Cyclenet shows superior translation consistency and quality and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. It is a practical framework that trains robustly even with very limited data (around 2k samples) and minimal compute (1 GPU).

Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency. This paper introduces Cyclenet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate Cyclenet on unpaired I2I tasks of different granularities. Besides the scene and object level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that Cyclenet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. Cyclenet is a practical framework, which is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train. Project homepage: https://cyclenetweb.github.io/

Speculative Decoding with Big Little Decoder
Sehoon Kim Karttikeya Mangalam Suhong Moon Jitendra Malik Michael W. Mahoney Amir Gholami Kurt Keutzer



Research question: The inference latency of large language models limits their deployment in real-time applications, especially for autoregressive generation, where tokens are produced iteratively and token-level parallelization cannot be exploited.
Motivation: To reduce this latency, the paper proposes Big Little Decoder (BiLD), a framework in which two models of different sizes collaboratively generate text, improving inference efficiency and latency.
Method: In BiLD, the small model generates text autoregressively at low inference cost, while the large model is invoked only when necessary to correct the small model's inaccurate predictions in a non-autoregressive manner. Two simple yet effective policies coordinate the models: (1) a fallback policy that decides when to hand control to the large model, and (2) a rollback policy that decides when the large model must correct the small model's inaccurate predictions.
Results: Evaluations across tasks and models show that BiLD substantially speeds up inference in diverse text generation scenarios with minimal loss in generation quality, achieving up to a 2.12x speedup on an NVIDIA T4 GPU, and it applies directly without any changes to the training process or model architecture.

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model’s inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.
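The fallback and rollback policies can be sketched in a few lines. In the sketch below, `small` and `large` are assumed to map token ids of shape (1, length) to logits; batch size 1 and both thresholds are illustrative assumptions, not the paper's tuned policies.

```python
import torch

@torch.no_grad()
def bild_generate(small, large, ids, max_len=64, fallback_p=0.3, rollback_gap=2.0):
    """Sketch of fallback/rollback coordination. `small` and `large` map token
    ids (1, L) to logits (1, L, V); batch size 1 and the thresholds are
    illustrative assumptions, not the paper's tuned policies."""
    draft_start = ids.shape[1]
    while ids.shape[1] < max_len:
        probs = small(ids)[:, -1].softmax(dim=-1)     # small model drafts one token
        conf, tok = probs.max(dim=-1)
        ids = torch.cat([ids, tok[:, None]], dim=1)
        if conf.item() >= fallback_p:                 # confident: keep drafting
            continue
        # fallback: one non-autoregressive pass of the large model over the draft
        big_logp = large(ids).log_softmax(dim=-1)
        for i in range(draft_start, ids.shape[1]):
            drafted = ids[0, i]
            best = big_logp[0, i - 1].argmax()
            if big_logp[0, i - 1, best] - big_logp[0, i - 1, drafted] > rollback_gap:
                ids = torch.cat([ids[:, :i], best.view(1, 1)], dim=1)  # rollback
                break
        draft_start = ids.shape[1]
    return ids
```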

De novo Drug Design using Reinforcement Learning with Multiple GPT Agents
Xiuyuan Hu Guoqing Liu Yang Zhao Hao Zhang



Research question: How to generate drug molecules that have specific desired properties while also covering a wide range of diverse candidates.
Motivation: Advanced technologies such as transformer models and reinforcement learning have been applied to drug design, but their potential has not been fully realized.
Method: MolRL-MGPT, a reinforcement learning algorithm with multiple GPT agents for drug molecular generation, in which the agents are encouraged to collaborate in searching for desirable molecules in diverse directions, promoting molecular diversity.
Results: The algorithm shows promising results on the GuacaMol benchmark and proves effective at designing inhibitors against SARS-CoV-2 protein targets.

*De novo* drug design is a pivotal issue in pharmacology and a new area of focus in AI for science research. A central challenge in this field is to generate molecules with specific properties while also producing a wide range of diverse candidates. Although advanced technologies such as transformer models and reinforcement learning have been applied in drug design, their potential has not been fully realized. Therefore, we propose MolRL-MGPT, a reinforcement learning algorithm with multiple GPT agents for drug molecular generation. To promote molecular diversity, we encourage the agents to collaborate in searching for desirable molecules in diverse directions. Our algorithm has shown promising results on the GuacaMol benchmark and exhibits efficacy in designing inhibitors against SARS-CoV-2 protein targets. The codes are available at: https://github.com/HXYfighter/MolRL-MGPT.

Efficient Neural Music Generation
Max W. Y. Lam Qiao Tian Tang Li Zongyu Yin Siyuan Feng Ming Tu Yuliang Ji Rui Xia Mingbo Ma Xuchen Song Jitong Chen Yuping Wang Yuxuan Wang



Research question: How to make music generation efficient while matching the quality of the state-of-the-art MusicLM.
Motivation: Existing models such as MusicLM deliver excellent quality but must sample through a hierarchy of language models, which is computationally expensive and unsuitable for real-time generation.
Method: MeLoDy, an LM-guided diffusion model that inherits the highest-level semantic LM from MusicLM and uses a novel dual-path diffusion model together with an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveforms.
Results: Experiments show that MeLoDy has practical advantages in sampling speed and infinitely continuable generation while reaching state-of-the-art musicality, audio quality, and text correlation.

Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present **M**e**L**o**D**y (**M** for music; **L** for LM; **D** for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing 95.7\% to 99.6\% of the forward passes in MusicLM for sampling 10s to 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.

Towards Efficient Image Compression Without Autoregressive Models
Muhammad Salman Ali Yeongwoong Kim Maryam Qamar Sung-Chang Lim Donghyun Kim Chaoning Zhang Sung-Ho Bae Hui Yong Kim



Research question: How to improve the performance of learned image compression (LIC) while reducing its computational complexity.
Motivation: Because latents of real images are spatially correlated, existing Gaussian hyperprior entropy models rely on autoregressive context models, which improve performance but greatly increase computational complexity.
Method: A correlation loss that minimizes the spatial correlation of the latents so that they better fit an independent probability model.
Results: The method significantly improves LIC performance at low computational complexity: it attains 90% and 98% of the AR-based BD-Rate gains (with Checkerboard-CM and ChARM-CM, respectively) while being around 50 and 30 times faster than AR-based methods.

Recently, learned image compression (LIC) has garnered increasing interest with its rapidly improving performance surpassing conventional codecs. A key ingredient of LIC is a hyperprior-based entropy model, where the underlying joint probability of the latent image features is modeled as a product of Gaussian distributions from each latent element. Since latents from the actual images are not spatially independent, autoregressive (AR) context based entropy models were proposed to handle the discrepancy between the assumed distribution and the actual distribution. Though the AR-based models have proven effective, the computational complexity is significantly increased due to the inherent sequential nature of the algorithm. In this paper, we present a novel alternative to the AR-based approach that can provide a significantly better trade-off between performance and complexity. To minimize the discrepancy, we introduce a correlation loss that forces the latents to be spatially decorrelated and better fitted to the independent probability model. Our correlation loss is proved to act as a general plug-in for the hyperprior (HP) based learned image compression methods. The performance gain from our correlation loss is ‘free’ in terms of computation complexity for both inference time and decoding time. To our knowledge, our method gives the best trade-off between complexity and performance: combined with the Checkerboard-CM, it attains **90%**, and combined with ChARM-CM, it attains **98%** of the AR-based BD-Rate gains, yet it is around **50 times** and **30 times** faster than AR-based methods, respectively.
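A minimal sketch of what a spatial decorrelation penalty on the latents could look like: after per-channel normalization, it penalizes the average correlation between each latent element and its right and down neighbors. The exact form of the paper's correlation loss may differ; this is an assumption-laden illustration.

```python
import torch

def correlation_loss(y: torch.Tensor) -> torch.Tensor:
    """Hypothetical decorrelation penalty on latents y of shape (B, C, H, W):
    after per-channel normalization, penalize the average correlation between
    each latent element and its right and down neighbors, pushing the latents
    toward the spatial independence the hyperprior entropy model assumes."""
    y = y - y.mean(dim=(2, 3), keepdim=True)
    y = y / (y.pow(2).mean(dim=(2, 3), keepdim=True) + 1e-6).sqrt()
    corr_right = (y[:, :, :, :-1] * y[:, :, :, 1:]).mean()
    corr_down = (y[:, :, :-1, :] * y[:, :, 1:, :]).mean()
    return corr_right.abs() + corr_down.abs()
```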

FaceDNeRF: Semantics-Driven Face Reconstruction, Prompt Editing and Relighting with Diffusion Models
Hao ZHANG Tianyuan DAI Yanbo Xu Yu-Wing Tai Chi-Keung Tang



Research question: How to generate a high-quality 3D face model from a single image, with semantic editing and relighting capabilities.
Motivation: With the growth of applications such as video conferencing, AR/VR, and advanced video editing in the movie industry, the ability to create high-quality 3D faces from a single image is increasingly important.
Method: This paper proposes Face Diffusion NeRF (FaceDNeRF), a new generative method that reconstructs high-quality face NeRFs from single images, complete with semantic editing and relighting. FaceDNeRF leverages high-resolution 3D GAN inversion and an expertly trained 2D latent diffusion model, allowing users to manipulate and construct face NeRFs in zero-shot learning without explicit 3D data.
Results: With carefully designed illumination and identity-preserving losses and multi-modal pre-training, FaceDNeRF offers unparalleled editing control, letting users create and edit face NeRFs from just single-view images, text prompts, and explicit target lighting. It is designed to produce more impressive results than existing 2D editing approaches that rely on 2D segmentation maps for editable attributes, and experiments show highly realistic results and unprecedented editing flexibility compared with state-of-the-art 3D face reconstruction and editing methods.

The ability to create high-quality 3D faces from a single image has become increasingly important with wide applications in video conferencing, AR/VR, and advanced video editing in movie industries. In this paper, we propose Face Diffusion NeRF (FaceDNeRF), a new generative method to reconstruct high-quality Face NeRFs from single images, complete with semantic editing and relighting capabilities. FaceDNeRF utilizes high-resolution 3D GAN inversion and expertly trained 2D latent-diffusion model, allowing users to manipulate and construct Face NeRFs in zero-shot learning without the need for explicit 3D data. With carefully designed illumination and identity preserving loss, as well as multi-modal pre-training, FaceDNeRF offers users unparalleled control over the editing process enabling them to create and edit face NeRFs using just single-view images, text prompts, and explicit target lighting. The advanced features of FaceDNeRF have been designed to produce more impressive results than existing 2D editing approaches that rely on 2D segmentation maps for editable attributes. Experiments show that our FaceDNeRF achieves exceptionally realistic results and unprecedented flexibility in editing compared with state-of-the-art 3D face reconstruction and editing methods. Our code will be available at https://github.com/BillyXYB/FaceDNeRF.

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Matthew Le Apoorv Vyas Bowen Shi Brian Karrer Leda Sari Rashel Moritz Mary Williamson Vimal Manohar Yossi Adi Jay Mahadeokar Wei-Ning Hsu



Research question: This paper addresses the limited scale and task generalization of speech generation models.
Motivation: Large-scale generative models such as GPT and DALL-E have achieved remarkable results for text, but speech generation models remain primitive in comparison.
Method: Voicebox, a versatile text-guided model for speech generation at scale: a non-autoregressive flow-matching model that infills speech given audio context and text.
Results: Experiments show that Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility and audio similarity while being up to 20 times faster.

Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9\% vs 1.9\% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found in \url{https://voicebox.metademolab.com}.
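A hedged sketch of a flow-matching infilling objective of the kind described: interpolate noise toward the target speech features along a straight path, regress the model's velocity onto the path derivative, and compute the loss only on masked frames, with unmasked frames supplied as audio context. All interfaces and the masking convention below are assumptions.

```python
import torch
import torch.nn.functional as F

def infill_flow_matching_loss(v_model, x1, text_emb, mask):
    """Sketch of a flow-matching infilling objective: interpolate noise toward
    speech features x1 (B, T, d) along a straight path, regress the predicted
    velocity onto (x1 - x0), and train only on masked (to-infill) frames,
    with unmasked frames supplied as audio context. Interfaces are assumed."""
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1)         # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # straight-line interpolant
    target_v = x1 - x0                        # velocity of the path
    context = x1 * (~mask)                    # visible (not-to-infill) frames
    pred_v = v_model(xt, t.squeeze(), context, text_emb)
    return F.mse_loss(pred_v * mask, target_v * mask)
```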

Diffusion-Based Adversarial Sample Generation for Improved Stealthiness and Controllability
Haotian Xue Alexandre Araujo Bin Hu Yongxin Chen



Research question: Neural networks are susceptible to adversarial samples, small crafted perturbations of natural examples that deliberately mislead models.
Motivation: Although adversarial samples are easy to generate with gradient-based techniques in digital and physical scenarios, they often deviate strongly from the actual data distribution of natural images, yielding a trade-off between strength and stealthiness.
Method: This paper proposes Diffusion-Based Projected Gradient Descent (Diff-PGD), a framework for generating realistic adversarial samples. By exploiting a gradient guided by a diffusion model, Diff-PGD keeps adversarial samples close to the original data distribution while maintaining their effectiveness, and it is easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks.
Results: Compared with traditional gradient-based methods, samples generated with Diff-PGD show better transferability and anti-purification power.

Neural networks are known to be susceptible to adversarial samples: small variations of natural examples crafted to deliberately mislead the models. While they can be easily generated using gradient-based techniques in digital and physical scenarios, they often differ greatly from the actual data distribution of natural images, resulting in a trade-off between strength and stealthiness. In this paper, we propose a novel framework dubbed Diffusion-Based Projected Gradient Descent (Diff-PGD) for generating realistic adversarial samples. By exploiting a gradient guided by a diffusion model, Diff-PGD ensures that adversarial samples remain close to the original data distribution while maintaining their effectiveness. Moreover, our framework can be easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks. Compared with existing methods for generating natural-style adversarial samples, our framework enables the separation of optimizing adversarial loss from other surrogate losses (e.g. content/smoothness/style loss), making it more stable and controllable. Finally, we demonstrate that the samples generated using Diff-PGD have better transferability and anti-purification power than traditional gradient-based methods.
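The core loop can be pictured as PGD with a diffusion purifier in the gradient path: each candidate is denoised by a one-step diffusion operation and the adversarial gradient is taken through that operation, so updates stay near the natural-image manifold. The `denoise` interface and the step sizes below are assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def diff_pgd(x, label, classifier, denoise, steps=10, alpha=2 / 255, eps=8 / 255):
    """Minimal sketch of diffusion-guided PGD: each candidate is purified by a
    one-step diffusion operation `denoise`, and the adversarial gradient is
    taken through it, keeping updates near the natural-image manifold."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        x_pur = denoise(x_adv)                         # diffusion-guided projection
        loss = F.cross_entropy(classifier(x_pur), label)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascend the adversarial loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```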

Predicting mutational effects on protein-protein binding via a side-chain diffusion probabilistic model
Shiwei Liu Tian Zhu Milong Ren Yu Chungong Dongbo Bu Haicang Zhang



Research question: Predicting the effect of amino acid mutations on protein-protein binding, particularly when experimental data are scarce.
Motivation: Such predictions matter for protein engineering and therapeutic discovery, but the scarcity of labelled experimental data poses a major challenge for computational methods.
Method: SidechainDiff, a novel representation learning approach that leverages unlabelled experimental protein structures. It uses a Riemannian diffusion model to learn the generative process of side-chain conformations and yields structural-context representations of mutations on the protein-protein interface.
Results: Leveraging the learned representations, the method achieves state-of-the-art performance in predicting mutational effects on protein-protein binding. SidechainDiff is also the first diffusion-based generative model for side-chains, unlike prior work that has focused predominantly on generating protein backbone structures.

Many crucial biological processes rely on networks of protein-protein interactions. Predicting the effect of amino acid mutations on protein-protein binding is important in protein engineering, including therapeutic discovery. However, the scarcity of annotated experimental data on binding energy poses a significant challenge for developing computational approaches, particularly deep learning-based methods. In this work, we propose SidechainDiff, a novel representation learning-based approach that leverages unlabelled experimental protein structures. SidechainDiff utilizes a Riemannian diffusion model to learn the generative process of side-chain conformations and can also give the structural context representations of mutations on the protein-protein interface. Leveraging the learned representations, we achieve state-of-the-art performance in predicting the mutational effects on protein-protein binding. Furthermore, SidechainDiff is the first diffusion-based generative model for side-chains, distinguishing it from prior efforts that have predominantly focused on the generation of protein backbone structures.

Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models
Weijian Luo Tianyang Hu Shifeng Zhang Jiacheng Sun Zhenguo Li Zhihua Zhang



Research question: How to learn from pre-trained diffusion models and transfer their knowledge to other generative models in a data-free fashion.
Motivation: Pre-trained diffusion models contain intricate information about data distributions and are valuable assets for downstream applications.
Method: Diff-Instruct, a general framework that can instruct the training of any generative model whose samples are differentiable with respect to the model parameters. It rests on a rigorous mathematical foundation in which the instruction process corresponds to minimizing a novel divergence, the Integral Kullback-Leibler (IKL) divergence.
Results: Experiments on distilling pre-trained diffusion models and refining existing GAN models show that Diff-Instruct yields state-of-the-art single-step diffusion-based models and consistently improves pre-trained GAN generators across various settings.

Due to the ease of training, ability to scale, and high sample quality, diffusion models (DMs) have become the preferred option for generative modeling, with numerous pre-trained models available for a wide variety of datasets. Containing intricate information about data distributions, pre-trained DMs are valuable assets for downstream applications. In this work, we consider learning from pre-trained DMs and transferring their knowledge to other generative models in a data-free fashion. Specifically, we propose a general framework called Diff-Instruct to instruct the training of arbitrary generative models as long as the generated samples are differentiable with respect to the model parameters. Our proposed Diff-Instruct is built on a rigorous mathematical foundation where the instruction process directly corresponds to minimizing a novel divergence we call Integral Kullback-Leibler (IKL) divergence. IKL is tailored for DMs by calculating the integral of the KL divergence along a diffusion process, which we show to be more robust in comparing distributions with misaligned supports. We also reveal non-trivial connections of our method to existing works such as DreamFusion, and generative adversarial training. To demonstrate the effectiveness and universality of Diff-Instruct, we consider two scenarios: distilling pre-trained diffusion models and refining existing GAN models. The experiments on distilling pre-trained diffusion models show that Diff-Instruct results in state-of-the-art single-step diffusion-based models. The experiments on refining GAN models show that the Diff-Instruct can consistently improve the pre-trained generators of GAN models across various settings.

The CLIP Model is Secretly an Image-to-Prompt Converter
Yuxuan Ding Chunna Tian Haoxuan Ding Lingqiao Liu



Research question: This paper addresses the limitation of text-to-image generation models such as Stable Diffusion in exploiting implicit information from reference images.
Motivation: Existing methods tackle this with expensive training procedures, whereas the proposed approach bridges images and text prompts more simply and flexibly.
Method: Use the CLIP model to convert images into text prompts via a linear projection matrix computed in closed form; the conversion can be further improved with a small amount of similar-domain training data or a few online training steps on the reference images.
Results: The method applies to tasks such as image variation and image editing, enabling more effective interaction between images and text prompts.

The Stable Diffusion model is a prominent text-to-image generation model that relies on a text prompt as its input, which is encoded using the Contrastive Language-Image Pre-Training (CLIP). However, text prompts have limitations when it comes to incorporating implicit information from reference images. Existing methods have attempted to address this limitation by employing expensive training procedures involving millions of training samples for image-to-image generation. In contrast, this paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts. Such an image-to-prompt conversion can be achieved by utilizing a linear projection matrix that is calculated in a closed form. Moreover, the paper showcases that this capability can be further enhanced by either utilizing a small amount of similar-domain training data (approximately 100 images) or incorporating several online training steps (around 30 iterations) on the reference images. By leveraging these approaches, the proposed method offers a simple and flexible solution to bridge the gap between images and text prompts. This methodology can be applied to various tasks such as image variation and image editing, facilitating more effective and seamless interaction between images and textual prompts.
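As a hedged illustration of a closed-form image-to-prompt map, the sketch below fits a ridge-regularized linear projection from CLIP image embeddings to text-embedding targets via ordinary least squares. The paper derives its own specific closed-form matrix, which this generic regression does not reproduce.

```python
import numpy as np

def fit_image_to_prompt_projection(img_embs, txt_embs, lam=1e-2):
    """Ridge-regularized least squares from CLIP image embeddings (N, d_img)
    to text-embedding targets (N, d_txt); a generic closed form standing in
    for the paper's specific projection matrix."""
    d = img_embs.shape[1]
    gram = img_embs.T @ img_embs + lam * np.eye(d)
    return np.linalg.solve(gram, img_embs.T @ txt_embs)   # W: (d_img, d_txt)

# usage sketch: prompt_emb = reference_image_emb @ W, then fed to the
# diffusion model in place of (or alongside) the encoded text prompt
```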

Extremal Domain Translation with Neural Optimal Transport
Milena Gazdieva Alexander Korotin Daniil Selikhanovych Evgeny Burnaev



Research question: In many unpaired image domain translation problems, how to keep the translated image similar to its respective input image.
Motivation: Tasks such as style transfer and super-resolution require preserving the similarity between translated images and their inputs.
Method: Extremal transport (ET), a mathematical formalization of the theoretically best possible unpaired translation between a pair of domains w.r.t. a given similarity function. Inspired by recent advances in neural optimal transport (OT), a scalable algorithm approximates ET maps as a limit of partial OT maps.
Results: The algorithm is tested on toy examples and on unpaired image-to-image translation tasks. The code is publicly available at https://github.com/milenagazdieva/ExtremalNeuralOptimalTransport.

In many unpaired image domain translation problems, e.g., style transfer or super-resolution, it is important to keep the translated image similar to its respective input image. We propose the extremal transport (ET) which is a mathematical formalization of the theoretically best possible unpaired translation between a pair of domains w.r.t. the given similarity function. Inspired by the recent advances in neural optimal transport (OT), we propose a scalable algorithm to approximate ET maps as a limit of partial OT maps. We test our algorithm on toy examples and on the unpaired image-to-image translation task. The code is publicly available at https://github.com/milenagazdieva/ExtremalNeuralOptimalTransport

Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
Thomas Hartvigsen Swami Sankaranarayanan Hamid Palangi Yoon Kim Marzyeh Ghassemi



Research question: Deployed language models decay due to shifting inputs, changing user needs, or emergent knowledge gaps; how to make targeted edits while avoiding expensive retraining.
Motivation: Current model editors, which modify the behavior of pre-trained models, quickly degrade performance over multiple sequential edits.
Method: GRACE, a lifelong model editing method that implements spot-fixes on a deployed model's streaming errors with minimal impact on unrelated inputs. GRACE writes new mappings into the pre-trained model's latent space, creating a discrete, local codebook of edits without altering model weights.
Results: Experiments on T5, BERT, and GPT models show state-of-the-art performance in making and retaining edits while generalizing to unseen inputs.

Deployed language models decay over time due to shifting inputs, changing user needs, or emergent world-knowledge gaps. When such problems are identified, we want to make targeted edits while avoiding expensive retraining. However, current model editors, which modify such behaviors of pre-trained models, degrade model performance quickly across multiple, sequential edits. We propose GRACE, a \textit{lifelong} model editing method, which implements spot-fixes on streaming errors of a deployed model, ensuring minimal impact on unrelated inputs. GRACE writes new mappings into a pre-trained model's latent space, creating a discrete, local codebook of edits without altering model weights. This is the first method enabling thousands of sequential edits using only streaming errors. Our experiments on T5, BERT, and GPT models show GRACE's state-of-the-art performance in making and retaining edits, while generalizing to unseen inputs. Our code is available at [github.com/thartvigsen/grace](https://www.github.com/thartvigsen/grace).
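A toy rendering of the discrete key-value adaptor idea: cached edits are (key, value) pairs in a chosen layer's latent space, a hidden state landing inside a key's radius returns the stored value, and everything else passes through unchanged. This illustrates the mechanism under stated assumptions, not the released implementation.

```python
import torch

class KeyValueAdaptor(torch.nn.Module):
    """Toy discrete key-value adaptor: cached edits are (key, value) pairs in
    a layer's latent space; a hidden state inside a key's radius returns the
    stored value, and all other inputs pass through unchanged."""
    def __init__(self):
        super().__init__()
        self.keys, self.values, self.radii = [], [], []

    def add_edit(self, key, value, radius=1.0):
        self.keys.append(key)
        self.values.append(value)
        self.radii.append(radius)

    def forward(self, h):                      # h: (d,) hidden state at this layer
        for k, v, r in zip(self.keys, self.values, self.radii):
            if torch.dist(h, k) < r:           # input falls in an edit's region
                return v                       # spot-fix without weight changes
        return h                               # unrelated input: model unchanged
```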

One-Line-of-Code Data Mollification Improves Optimization of Likelihood-based Generative Models
Ba-Hien Tran Giulio Franzese Pietro Michiardi Maurizio Filippone



Research question: This paper addresses the lower sample quality of likelihood-based generative models relative to score-based diffusion models.
Motivation: Likelihood-based generative models have been very successful in domains such as computer vision but typically trail score-based diffusion models in sample quality.
Method: Borrow a strength of score-based diffusion models by using data mollification for density estimation in low-density regions and for addressing manifold overfitting, and view data mollification within likelihood-based models as a continuation method.
Results: On real-world image datasets and UCI benchmarks with popular likelihood-based generative models, including variants of variational autoencoders and normalizing flows, the paper reports large improvements in FID score and density estimation.

Generative Models (GMs) have attracted considerable attention due to their tremendous success in various domains, such as computer vision where they are capable to generate impressive realistic-looking images. Likelihood-based GMs are attractive due to the possibility to generate new data by a single model evaluation. However, they typically achieve lower sample quality compared to state-of-the-art score-based Diffusion Models (DMs). This paper provides a significant step in the direction of addressing this limitation. The idea is to borrow one of the strengths of score-based DMs, which is the ability to perform accurate density estimation in low-density regions and to address manifold overfitting by means of data mollification. We propose a view of data mollification within likelihood-based GMs as a continuation method, whereby the optimization objective smoothly transitions from simple-to-optimize to the original target. Crucially, data mollification can be implemented by adding one line of code in the optimization loop, and we demonstrate that this provides a boost in generation quality of likelihood-based GMs, without computational overheads. We report results on real-world image data sets and UCI benchmarks with popular likelihood-based GMs, including variants of variational autoencoders and normalizing flows, showing large improvements in FID score and density estimation.
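The "one line of code" is easy to picture: perturb each minibatch with Gaussian noise whose scale anneals to zero over training, so the objective transitions smoothly from a mollified, easy-to-optimize target to the original one. The linear schedule below is an assumption; the paper specifies its own.

```python
import torch

def mollify(x, step, total_steps):
    """Gaussian data mollification as a continuation method: noise scale
    anneals to zero over training, smoothly moving from an easy smoothed
    objective to the original one. The linear schedule is an assumption."""
    sigma = max(0.0, 1.0 - step / (0.5 * total_steps))   # on for the first half
    return x + sigma * torch.randn_like(x)

# inside an ordinary VAE / normalizing-flow loop, the added line is just:
#   x = mollify(x, step, total_steps)
# before computing the usual likelihood-based loss on x
```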

DiffTraj: Generating GPS Trajectory with Diffusion Probabilistic Model
Yuanshao Zhu Yongchao Ye Shiyao Zhang Xiangyu Zhao James Yu



Research question: How to effectively generate high-quality, privacy-preserving GPS trajectory data.
Motivation: The proliferation of GPS-enabled devices and data acquisition technologies has led to exponential growth in GPS trajectory data, fostering spatial-temporal data mining research; however, GPS trajectories contain personal geolocation information, so working directly with raw data raises serious privacy concerns.
Method: DiffTraj, a spatial-temporal diffusion probabilistic model for trajectory generation, which combines the generative power of diffusion models with spatial-temporal features extracted from real trajectories, reconstructing and synthesizing geographic trajectories from white noise through a reverse trajectory denoising process. A Trajectory UNet (Traj-UNet) deep neural network embeds conditional information and accurately estimates noise levels during the reverse process.
Results: Experiments show that DiffTraj can be intuitively applied to generate high-fidelity trajectories while retaining the original distributions; the generated results support downstream trajectory analysis tasks and significantly outperform other methods on geo-distribution evaluations.

Pervasive integration of GPS-enabled devices and data acquisition technologies has led to an exponential increase in GPS trajectory data, fostering advancements in spatial-temporal data mining research. Nonetheless, GPS trajectories contain personal geolocation information, rendering serious privacy concerns when working with raw data. A promising approach to address this issue is trajectory generation, which involves replacing original data with generated, privacy-free alternatives. Despite the potential of trajectory generation, the complex nature of human behavior and its inherent stochastic characteristics pose challenges in generating high-quality trajectories. In this work, we propose a spatial-temporal diffusion probabilistic model for trajectory generation (DiffTraj). This model effectively combines the generative abilities of diffusion models with the spatial-temporal features derived from real trajectories. The core idea is to reconstruct and synthesize geographic trajectories from white noise through a reverse trajectory denoising process. Furthermore, we propose a Trajectory UNet (Traj-UNet) deep neural network to embed conditional information and accurately estimate noise levels during the reverse process. Experiments on two real-world datasets show that DiffTraj can be intuitively applied to generate high-fidelity trajectories while retaining the original distributions. Moreover, the generated results can support downstream trajectory analysis tasks and significantly outperform other methods in terms of geo-distribution evaluations.

SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process
Mengyu Wang Henghui Ding Jun Hao Liew Jiajun Liu Yao Zhao Yunchao Wei



Research question: This paper explores a principal way to enhance the quality of object masks produced by different segmentation models.
Motivation: Masks from existing segmentation models are often noisy and imprecise and call for further refinement.
Method: SegRefiner, a model-agnostic solution that interprets segmentation refinement as a data generation process, implemented as a series of denoising steps over a discrete diffusion process.
Results: Experiments show that SegRefiner consistently improves both segmentation and boundary quality across semantic, instance, and dichotomous image segmentation, outperforms previous model-agnostic refinement methods by a significant margin, and excels at capturing extremely fine details in high-resolution images.

In this paper, we explore a principal way to enhance the quality of object masks produced by different segmentation models. We propose a model-agnostic solution called SegRefiner, which offers a novel perspective on this problem by interpreting segmentation refinement as a data generation process. As a result, the refinement process can be smoothly implemented through a series of denoising diffusion steps. Specifically, SegRefiner takes coarse masks as inputs and refines them using a discrete diffusion process. By predicting the label and corresponding states-transition probabilities for each pixel, SegRefiner progressively refines the noisy masks in a conditional denoising manner. To assess the effectiveness of SegRefiner, we conduct comprehensive experiments on various segmentation tasks, including semantic segmentation, instance segmentation, and dichotomous image segmentation. The results demonstrate the superiority of our SegRefiner from multiple aspects. Firstly, it consistently improves both the segmentation metrics and boundary metrics across different types of coarse masks. Secondly, it outperforms previous model-agnostic refinement methods by a significant margin. Lastly, it exhibits a strong capability to capture extremely fine details when refining high-resolution images. The source code and trained models are available at [SegRefiner.git](https://github.com/MengyuWang826/SegRefiner)

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
Yuancheng Wang Zeqian Ju Xu Tan Lei He Zhizheng Wu Jiang Bian sheng zhao



Research question: This paper addresses audio editing tasks such as adding background sound effects, replacing a musical instrument, and repairing damaged audio.
Motivation: Existing diffusion-based methods achieve zero-shot audio editing conditioned on a text description of the output audio, but they have not been trained on editing tasks, can erroneously modify audio segments that need no editing, and require a complete description of the output audio.
Method: AUDIT, an instruction-guided audio editing model based on latent diffusion models. It constructs training data triplets (instruction, input audio, output audio) for different audio editing tasks and trains a diffusion model that generates the output audio conditioned on the instruction and the input audio.
Results: AUDIT achieves state-of-the-art results on several audio editing tasks, including adding, dropping, replacement, inpainting, and super-resolution.

Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have some problems: 1) they have not been trained on editing tasks and cannot ensure good editing effects; 2) they can erroneously modify audio segments that do not require editing; 3) they need a complete description of the output audio, which is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an instruction-guided audio editing model based on latent diffusion models. Specifically, \textbf{AUDIT} has three main design features: 1) we construct triplet training data (instruction, input audio, output audio) for different audio editing tasks and train a diffusion model using instruction and input (to be edited) audio as conditions and generating output (edited) audio; 2) it can automatically learn to only modify segments that need to be edited by comparing the difference between the input and output audio; 3) it only needs edit instructions instead of full target audio descriptions as text input. AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution). Demo samples are available at https://audit-demopage.github.io/.

Unsupervised Image Denoising with Score Function
Yutong Xie Mingze Yuan Bin Dong Quanzheng Li



Research question: This paper proposes a new method for single-image denoising under complicated noise models.
Motivation: Current unsupervised learning methods are constrained when the noise model is complicated.
Method: Using the score function, the gradient of the log-probability, the authors define a solving system for denoising: once the score function of the noisy images has been estimated, the denoised result is obtained by solving the system. The approach applies to many noise models, such as mixtures of multiplicative and additive noise combined with structured correlation.
Results: Experiments show the method is comparable to others when the noise model is simple and performs well in complicated cases where other methods are inapplicable or perform poorly.

Though achieving excellent performance in some cases, current unsupervised learning methods for single image denoising usually have constraints in applications. In this paper, we propose a new approach which is more general and applicable to complicated noise models. Utilizing the property of score function, the gradient of logarithmic probability, we define a solving system for denoising. Once the score function of noisy images has been estimated, the denoised result can be obtained through the solving system. Our approach can be applied to multiple noise models, such as the mixture of multiplicative and additive noise combined with structured correlation. Experimental results show that our method is comparable when the noise model is simple, and has good performance in complicated cases where other methods are not applicable or perform poorly.
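In the simplest setting the score-based solving system admits a familiar special case: with additive Gaussian noise, the posterior mean of the clean image follows Tweedie's formula from the score of the noisy distribution. The sketch below covers only that case; the paper's system generalizes to far more complex noise models.

```python
import torch

def gaussian_denoise_with_score(y, score_fn, sigma):
    """Special case only: for additive Gaussian noise y = x + sigma * n,
    Tweedie's formula gives the posterior mean from the noisy score.
    The paper's solving system generalizes beyond this setting."""
    return y + (sigma ** 2) * score_fn(y)      # E[x | y]
```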

Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities
Jingyuan Sun Mingxiao Li Yunhao Zhang Marie-Francine Moens Zijiao Chen Shaonan Wang



Research question: How to decode visual stimuli from neural responses recorded by functional Magnetic Resonance Imaging (fMRI) in order to understand human visual perception.
Motivation: The noisy nature of fMRI signals and the intricate patterns of the brain's visual representations make this task challenging.
Method: A two-phase fMRI representation learning framework. The first phase pre-trains an fMRI feature learner with a proposed double-contrastive masked auto-encoder to learn denoised representations. The second phase tunes the feature learner, guided by an image auto-encoder, to attend to the neural activation patterns most informative for visual reconstruction; the optimized feature learner then conditions a latent diffusion model to reconstruct image stimuli from brain activity.
Results: Experiments demonstrate superiority in generating high-resolution, semantically accurate images, exceeding the previous state of the art by 39.34% in 50-way top-1 semantic classification accuracy. Code will be available at https://github.com/soinx0629/vis_dec_neurips/.

Decoding visual stimuli from neural responses recorded by functional Magnetic Resonance Imaging (fMRI) presents an intriguing intersection between cognitive neuroscience and machine learning, promising advancements in understanding human visual perception. However, the task is challenging due to the noisy nature of fMRI signals and the intricate pattern of brain visual representations. To mitigate these challenges, we introduce a two-phase fMRI representation learning framework. The first phase pre-trains an fMRI feature learner with a proposed Double-contrastive Mask Auto-encoder to learn denoised representations. The second phase tunes the feature learner to attend to neural activation patterns most informative for visual reconstruction with guidance from an image auto-encoder. The optimized fMRI feature learner then conditions a latent diffusion model to reconstruct image stimuli from brain activities. Experimental results demonstrate our model's superiority in generating high-resolution and semantically accurate images, substantially exceeding previous state-of-the-art methods by 39.34% in the 50-way-top-1 semantic classification accuracy. The code implementations will be available at https://github.com/soinx0629/vis_dec_neurips/.

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
Tong Wu Zhihao Fan Xiao Liu Hai-Tao Zheng Yeyun Gong yelong shen Jian Jiao Juntao Li zhongyu wei Jian Guo Nan Duan Weizhu Chen



Research question: How to improve diffusion language models so that they better capture the sequential dependency of text.
Motivation: Most existing language models are trained with a left-to-right autoregressive approach, and natural language exhibits a strong sequential dependency that concurrent token generation ignores.
Method: An Auto-Regressive Diffusion (AR-Diffusion) model that uses a dynamic, position-dependent number of denoising steps, so that tokens generated earlier (on the left) influence the generation of later tokens (on the right).
Results: On text summarization, machine translation, and common-sense generation, AR-Diffusion clearly outperforms existing diffusion language models and can be $100\times$ to $600\times$ faster when achieving comparable results.

Diffusion models have gained significant attention in the realm of image generation due to their exceptional performance. Their success has been recently expanded to text generation via generating all tokens within a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency in comparison to images, and the majority of existing language models are trained with a left-to-right auto-regressive approach. To account for the inherent sequential characteristic of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps that vary based on token position. This results in tokens on the left undergoing fewer denoising steps than those on the right, thereby enabling them to generate earlier and subsequently influence the generation of tokens on the right. In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models and that it can be $100\times\sim600\times$ faster when achieving comparable results. Our code is available at https://github.com/microsoft/ProphetNet/tree/master/AR-diffusion.
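One way to picture the position-dependent denoising is a per-token timestep schedule in which, at any moment of sampling, left tokens sit at smaller effective noise levels than right tokens, so they finish first and condition what follows. The schedule below is purely illustrative; the paper defines its own dynamic step assignment.

```python
import torch

def position_timesteps(seq_len, t_base, total_steps, skew=0.2):
    """Purely illustrative position-dependent timesteps: left tokens sit at
    smaller effective noise levels than right tokens, so they finish
    denoising first and condition the tokens to their right."""
    pos = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
    t = t_base + skew * total_steps * pos      # later positions stay noisier
    return t.clamp(0, total_steps - 1).long()
```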

Constructing Non-isotropic Gaussian Diffusion Model Using Isotropic Gaussian Diffusion Model for Image Editing
Xi Yu Xiang Gu Haozhi Liu Jian Sun



Research question: This paper proposes a Non-isotropic Gaussian Diffusion Model (NGDM) for image editing, which must edit the source image while preserving the image regions irrelevant to the editing task.
Motivation: Score-based diffusion models have achieved state-of-the-art results in image generation but need further adaptation for image editing tasks.
Method: NGDM is constructed by adding independent Gaussian noises with different variances to different pixels. The NGDM is then rectified into an isotropic Gaussian diffusion model in which different pixels have different total forward diffusion times, and a sampling method reverses the diffusion by starting the denoising of different pixels at different times, using a pre-trained isotropic Gaussian diffusion model to generate images.
Results: Experiments show that NGDM achieves state-of-the-art performance on image editing tasks, balancing fidelity to the source image against alignment with the desired editing target.

Score-based diffusion models (SBDMs) have achieved state-of-the-art results in image generation. In this paper, we propose a Non-isotropic Gaussian Diffusion Model (NGDM) for image editing, which requires editing the source image while preserving the image regions irrelevant to the editing task. We construct NGDM by adding independent Gaussian noises with different variances to different image pixels. Instead of specifically training the NGDM, we rectify the NGDM into an isotropic Gaussian diffusion model with different pixels having different total forward diffusion time. We reverse the diffusion by designing a sampling method that starts denoising at different times for different pixels, generating images with the pre-trained isotropic Gaussian diffusion model. Experimental results show that NGDM achieves state-of-the-art performance for image editing tasks, considering the trade-off between the fidelity to the source image and alignment with the desired editing target.

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
Can Qin Shu Zhang Ning Yu Yihao Feng Xinyi Yang Yingbo Zhou Huan Wang Juan Carlos Niebles Caiming Xiong Silvio Savarese Stefano Ermon Yun Fu Ran Xu



Research question: Machine autonomy and human control often represent divergent objectives in the design of interactive AI systems.
Motivation: Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary language, but they often fall short in generating images with spatial, structural, or geometric controls.
Method: UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a single unified framework while still allowing arbitrary language prompts.
Results: UniControl augments pretrained text-to-image diffusion models with a task-aware HyperNet that modulates the diffusion model, adapting it to different C2I tasks. Trained on nine unique C2I tasks, it demonstrates impressive zero-shot generation with unseen visual conditions, and experiments show it often surpasses single-task-controlled methods of comparable model size, marking a significant advance in controllable visual generation.

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

Graph Denoising Diffusion for Inverse Protein Folding
Kai Yi Bingxin Zhou Yiqing Shen Pietro Lio Yu Guang Wang



Research question: Inverse protein folding is challenging because of its inherent one-to-many mapping, and existing discriminative models struggle to capture the diversity of plausible solutions.
Motivation: To address this, a novel graph denoising diffusion model is proposed in which a given protein backbone guides the diffusion process over amino acid residue types.
Method: The model infers the joint distribution of amino acids conditioned on the nodes' physicochemical properties and local environment. It also uses amino acid replacement matrices in the forward diffusion process, encoding biologically meaningful prior knowledge from each residue's spatial and sequential neighbors as well as itself, which reduces the sampling space of the generative process.
Results: Experiments show that the model outperforms a set of popular baselines in sequence recovery and exhibits great potential for generating diverse protein sequences for a determined protein backbone structure.

Inverse protein folding is challenging due to its inherent one-to-many mapping characteristic, where numerous possible amino acid sequences can fold into a single, identical protein backbone. This task involves not only identifying viable sequences but also representing the sheer diversity of potential solutions. However, existing discriminative models, such as transformer-based auto-regressive models, struggle to encapsulate the diverse range of plausible solutions. In contrast, diffusion probabilistic models, as an emerging genre of generative approaches, offer the potential to generate a diverse set of sequence candidates for determined protein backbones. We propose a novel graph denoising diffusion model for inverse protein folding, where a given protein backbone guides the diffusion process on the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physiochemical properties and local environment. Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically-meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. Our model achieves state-of-the-art performance over a set of popular baseline methods in sequence recovery and exhibits great potential in generating diverse protein sequences for a determined protein backbone structure.

Refining Diffusion Planner for Reliable Behavior Synthesis by Automatic Detection of Infeasible Plans
Kyowoon Lee Seongun Kim Jaesik Choi



Research question: Diffusion-based planners show promising results on long-horizon, sparse-reward tasks, but the plans they generate may be infeasible, limiting their use in safety-critical applications.
Motivation: Propose a new approach that refines unreliable plans generated by diffusion models by providing refining guidance to error-prone plans.
Method: A new metric, the restoration gap, evaluates the quality of individual plans generated by the diffusion model; a gap predictor produces restoration-gap guidance to refine the diffusion planner. An attribution map regularizer additionally prevents adversarial refining guidance that could arise from a sub-optimal gap predictor, enabling further refinement of infeasible plans.
Results: The approach is demonstrated on three benchmarks in offline control settings requiring long-horizon planning. Its interpretability is illustrated by presenting the gap predictor's attribution maps and highlighting error-prone transitions, allowing a deeper understanding of the generated plans.

Diffusion-based planning has shown promising results in long-horizon, sparse-reward tasks by training trajectory diffusion models and conditioning the sampled trajectories using auxiliary guidance functions. However, due to their nature as generative models, diffusion models are not guaranteed to generate feasible plans, resulting in failed execution and precluding planners from being useful in safety-critical applications. In this work, we propose a novel approach to refine unreliable plans generated by diffusion models by providing refining guidance to error-prone plans. To this end, we suggest a new metric named restoration gap for evaluating the quality of individual plans generated by the diffusion model. A restoration gap is estimated by a gap predictor which produces restoration gap guidance to refine a diffusion planner. We additionally present an attribution map regularizer to prevent adversarial refining guidance that could be generated from the sub-optimal gap predictor, which enables further refinement of infeasible plans. We demonstrate the effectiveness of our approach on three different benchmarks in offline control settings that require long-horizon planning. We also illustrate that our approach presents explainability by presenting the attribution maps of the gap predictor and highlighting error-prone transitions, allowing for a deeper understanding of the generated plans.

Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation
Quang Ho Nguyen Truong Tuan Vu Anh Tuan Tran Khoi Nguyen



Research question: How to prepare training data for deep vision models efficiently?
Motivation: Generative models can effectively produce synthetic data, but current models only yield image-level class labels.
Method: Propose a novel approach that uses the text-to-image generative model Stable Diffusion (SD) to generate pixel-level semantic segmentation labels. Using SD's text prompts, cross-attention, and self-attention, three new techniques are introduced: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation, which produce segmentation maps corresponding to the synthetic images.
Results: Evaluated on the PASCAL VOC and MSCOCO datasets, the method significantly outperforms concurrent work.

Preparing training data for deep vision models is a labor-intensive task. To address this, generative models have emerged as an effective solution for generating synthetic data. While current generative models produce image-level category labels, we propose a novel method for generating pixel-level semantic segmentation labels using the text-to-image generative model Stable Diffusion (SD). By utilizing the text prompts, cross-attention, and self-attention of SD, we introduce three new techniques: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation. These techniques enable us to generate segmentation maps corresponding to synthetic images. These maps serve as pseudo-labels for training semantic segmenters, eliminating the need for labor-intensive pixel-wise annotation. To account for the imperfections in our pseudo-labels, we incorporate uncertainty regions into the segmentation, allowing us to disregard loss from those regions. We conduct evaluations on two datasets, PASCAL VOC and MSCOCO, and our approach significantly outperforms concurrent work. Our benchmarks and code will be released at https://github.com/VinAIResearch/Dataset-Diffusion.

Restart Sampling for Improving Generative Processes
Yilun Xu Mingyang Deng Xiang Cheng Yonglong Tian Ziming Liu Tommi S. Jaakkola



Research question: Generative processes that involve solving differential equations, such as diffusion models, typically need to balance speed against quality.
Motivation: ODE-based samplers are fast but plateau in performance, while SDE-based samplers improve sample quality at the cost of increased sampling time.
Method: Propose a new sampling algorithm called Restart, which better balances discretization error and contraction by alternating between adding substantial noise in extra forward steps and strictly following a backward ODE.
Results: Experiments show the Restart sampler surpasses previous SDE and ODE samplers in both speed and accuracy, accelerating sampling by 10x / 2x on CIFAR-10 / ImageNet $64{\times}64$. Within comparable sampling times it attains better sample quality than ODE samplers, and on the large-scale text-to-image Stable Diffusion model it also achieves a better balance of text-image alignment/visual quality versus diversity than previous samplers. Code: https://github.com/Newbeeer/diffusion_restart_sampling.

Generative processes that involve solving differential equations, such as diffusion models, frequently necessitate balancing speed and quality. ODE-based samplers are fast but plateau in performance while SDE-based samplers deliver higher sample quality at the cost of increased sampling time. We attribute this difference to sampling errors: ODE-samplers involve smaller discretization errors while stochasticity in SDE contracts accumulated errors. Based on these findings, we propose a novel sampling algorithm called \textit{Restart} in order to better balance discretization errors and contraction. The sampling method alternates between adding substantial noise in additional forward steps and strictly following a backward ODE. Empirically, Restart sampler surpasses previous SDE and ODE samplers in both speed and accuracy. Restart not only outperforms the previous best SDE results, but also accelerates the sampling speed by 10-fold / 2-fold on CIFAR-10 / ImageNet $64{\times} 64$. In addition, it attains significantly better sample quality than ODE samplers within comparable sampling times. Moreover, Restart better balances text-image alignment/visual quality versus diversity than previous samplers in the large-scale text-to-image Stable Diffusion model pre-trained on LAION $512{\times} 512$. Code is available at https://github.com/Newbeeer/diffusion_restart_sampling
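
A compact sketch of the Restart idea under simplified assumptions (a variance-exploding process with sigma(t) = t, so jumping from t_min up to t_max requires Gaussian noise of variance $t_{max}^2 - t_{min}^2$); `ode_solve` stands for any deterministic backward-ODE integrator and is my placeholder, not the released sampler:

```python
import torch

def restart_interval(ode_solve, x, t_min, t_max, K=4):
    """Run K Restart iterations between noise levels t_min < t_max."""
    for _ in range(K):
        # forward: inject the extra noise needed to jump from sigma(t_min) to sigma(t_max)
        x = x + ((t_max**2 - t_min**2) ** 0.5) * torch.randn_like(x)
        # backward: deterministically follow the probability-flow ODE down to t_min
        x = ode_solve(x, t_max, t_min)
    return x
```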

Conditional Score Guidance for Text-Driven Image-to-Image Translation
Hyunsoo Lee Minsoo Kang Bohyung Han



Research question: Propose a new algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model.
Motivation: Existing techniques rely solely on the target prompt, whereas this method introduces a new score function that additionally considers both the source image and the source text prompt, tailored to specific translation tasks.
Method: Generate the target image by selectively editing the regions of interest in the source image while preserving the remaining parts, and introduce a simple yet effective mixup technique that fuses the two cross-attention maps derived from the source and target latents.
Results: Experiments demonstrate outstanding image-to-image translation performance on various tasks.

We present a novel algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our method aims to generate a target image by selectively editing the regions of interest in a source image, defined by a modifying text, while preserving the remaining parts. In contrast to existing techniques that solely rely on a target prompt, we introduce a new score function that additionally considers both the source image and the source text prompt, tailored to address specific translation tasks. To this end, we derive the conditional score function in a principled manner, decomposing it into the standard score and a guiding term for target image generation. For the gradient computation of the guiding term, we assume a Gaussian distribution of the posterior distribution and estimate its mean and variance to adjust the gradient without additional training. In addition, to improve the quality of the conditional score guidance, we incorporate a simple yet effective mixup technique, which combines two cross-attention maps derived from the source and target latents. This strategy is effective for promoting a desirable fusion of the invariant parts in the source image and the edited regions aligned with the target prompt, leading to high-fidelity target image generation. Through comprehensive experiments, we demonstrate that our approach achieves outstanding image-to-image translation performance on various tasks.

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
Zhendong Wang Yifan Jiang Huangjie Zheng Peihao Wang Pengcheng He Zhangyang Wang Weizhu Chen Mingyuan Zhou



Research question: How to reduce the training cost of diffusion models and improve their data efficiency.
Motivation: Existing diffusion models require large amounts of time and data to train, which limits their broad adoption.
Method: Propose Patch Diffusion, a generic patch-wise training framework: the patch location in the original image is included as additional coordinate channels, and the patch size is randomized and diversified throughout training to encode cross-region dependency at multiple scales, significantly reducing training cost and improving data efficiency.
Results: Patch Diffusion achieves at least 2x faster training while maintaining or improving generation quality. It also improves diffusion models trained on relatively small datasets, e.g., training from scratch with as few as 5,000 images, and attains outstanding FID scores in line with state-of-the-art benchmarks.

Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, $e.g.$, as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.
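
A small illustrative sketch, not the released training code, of the patch-plus-coordinates input described above: crop a random patch and append its normalized pixel coordinates as two extra channels, so a denoiser can tell where the patch sits in the full image.

```python
import torch

def random_patch_with_coords(img, patch_size):
    """Crop a random patch from img (C, H, W) and append (x, y) coordinate channels."""
    _, H, W = img.shape
    top = torch.randint(0, H - patch_size + 1, ()).item()
    left = torch.randint(0, W - patch_size + 1, ()).item()
    patch = img[:, top:top + patch_size, left:left + patch_size]
    ys = torch.linspace(-1, 1, H)[top:top + patch_size]       # global y coordinates
    xs = torch.linspace(-1, 1, W)[left:left + patch_size]     # global x coordinates
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.cat([patch, xx[None], yy[None]], dim=0)      # (C + 2, p, p)

x = torch.rand(3, 64, 64)
inp = random_patch_with_coords(x, patch_size=16)              # (5, 16, 16)
```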

Understanding and Mitigating Copying in Diffusion Models
Gowthami Somepalli Vasu Singla Micah Goldblum Jonas Geiping Tom Goldstein



Research question: Address the data replication problem in text-to-image diffusion models.
Motivation: Although duplicated images in the training set are widely believed to cause content replication at inference time, the model's text conditioning is found to play a similarly important role.
Method: Propose several techniques for reducing data replication by randomizing and augmenting image captions in the training set.
Results: Experiments show these techniques effectively reduce data replication at both training and inference time.

Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set. Code is available at https://github.com/somepago/DCR.
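
One of the proposed mitigations is randomizing and augmenting captions; the sketch below is a generic caption-perturbation routine of my own (random word drop plus random token insertion), not the authors' exact recipe:

```python
import random

def augment_caption(caption, vocab, p_drop=0.1, p_rand_token=0.1):
    """Weaken caption-image memorization by randomly perturbing the caption."""
    words = caption.split()
    words = [w for w in words if random.random() > p_drop]    # random word drop
    if random.random() < p_rand_token:                        # random token insertion
        words.insert(random.randrange(len(words) + 1), random.choice(vocab))
    return " ".join(words)

vocab = ["photo", "art", "scene", "image"]
print(augment_caption("a tabby cat on a sunlit window sill", vocab))
```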

Where Did I Come From? Origin Attribution of AI-Generated Images
Zhenting Wang Chen Chen Yi Zeng Lingjuan Lyu Shiqing Ma



Research question: How to accurately determine whether a specific image was generated by a particular generative model, i.e., origin attribution.
Motivation: As image generation techniques attract growing attention, so do concerns about potential misuse and intellectual property infringement, which calls for analyzing an image's origin to infer whether it was produced by a particular model.
Method: Develop an alteration-free, model-agnostic origin attribution method via reverse-engineering the image generation model, i.e., inverting the input of a particular model for a specific image.
Results: The method effectively distinguishes images generated by a specific generative model from other images (images generated by other models and real images), confirming its effectiveness.

Image generation techniques have been gaining increasing attention recently, but concerns have been raised about the potential misuse and intellectual property (IP) infringement associated with image generation models. It is, therefore, necessary to analyze the origin of images by inferring if a specific image was generated by a particular model, i.e., origin attribution. Existing methods only focus on specific types of generative models and require additional procedures during the training phase or generation phase. This makes them unsuitable for pre-trained models that lack these specific operations and may impair generation quality. To address this problem, we first develop an alteration-free and model-agnostic origin attribution method via reverse-engineering on image generation models, i.e., inverting the input of a particular model for a specific image. Given a particular model, we first analyze the differences in the hardness of reverse-engineering tasks for generated samples of the given model and other images. Based on our analysis, we then propose a method that utilizes the reconstruction loss of reverse-engineering to infer the origin. Our proposed method effectively distinguishes between generated images of a specific generative model and other images, i.e., images generated by other models and real images.
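
A hedged sketch of inversion-based attribution under my own assumptions (a latent-variable generator `G`, an MSE reconstruction loss, and a fixed threshold `tau`); the paper's analysis of reverse-engineering hardness is reduced here to a simple thresholding rule:

```python
import torch

def attribute_to(G, img, latent_dim, steps=200, lr=0.05, tau=0.01):
    """Return True if img is plausibly from generator G, judged by inversion loss."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = torch.mean((G(z) - img) ** 2)  # reconstruction loss of the inversion
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item() < tau                  # low loss: easy to invert, attribute to G
```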

On the choice of Perception Loss Function for Learned Video Compression
Sadaf Salehkalaibar Truong Buu Phan Jun Chen Wei Yu Ashish J Khisti



Research question: Study causal, low-latency, sequential video compression when the output is subject to both a mean squared-error (MSE) distortion loss and a perception loss.
Motivation: Motivated by prior approaches, two perception loss functions (PLFs) are considered: PLF-JD, based on the joint distribution of all video frames up to the current one, and PLF-FMD, based on the framewise marginal distributions between source and reconstruction.
Method: Using information-theoretic analysis and deep-learning-based experiments, show that the choice of PLF significantly affects reconstruction, especially at low bit rates. In particular, while PLF-JD reconstructions better preserve temporal correlation across frames, they incur a significant distortion penalty relative to PLF-FMD and make it harder to recover from errors in earlier output frames.
Results: Although the choice of PLF decisively affects reconstruction quality, committing to a particular PLF during encoding is shown to be unnecessary: the choice can be delegated to the decoder. In particular, encoded representations obtained by training a system to minimize MSE (without any PLF) can be near universal and yield close-to-optimal reconstructions for either choice of PLF at the decoder.

We study causal, low-latency, sequential video compression when the output is subjected to both a mean squared-error (MSE) distortion loss as well as a perception loss to target realism. Motivated by prior approaches, we consider two different perception loss functions (PLFs). The first, PLF-JD, considers the joint distribution (JD) of all the video frames up to the current one, while the second metric, PLF-FMD, considers the framewise marginal distributions (FMD) between the source and reconstruction. Using information theoretic analysis and deep-learning based experiments, we demonstrate that the choice of PLF can have a significant effect on the reconstruction, especially at low-bit rates. In particular, while the reconstruction based on PLF-JD can better preserve the temporal correlation across frames, it also imposes a significant penalty in distortion compared to PLF-FMD and further makes it more difficult to recover from errors made in the earlier output frames. Although the choice of PLF decisively affects reconstruction quality, we also demonstrate that it may not be essential to commit to a particular PLF during encoding and the choice of PLF can be delegated to the decoder. In particular, encoded representations generated by training a system to minimize the MSE (without requiring either PLF) can be {\em near universal} and can generate close to optimal reconstructions for either choice of PLF at the decoder. We validate our results using (one-shot) information-theoretic analysis, detailed study of the rate-distortion-perception tradeoff of the Gauss-Markov source model as well as deep-learning based experiments on moving MNIST and KTH datasets.

Neural Circuits for Fast Poisson Compressed Sensing in the Olfactory Bulb
Jacob A Zavatone-Veth Paul Masset William Lingxiao Tong Joseph Zak Venkatesh N Murthy Cengiz Pehlevan



Research question: How a compressed sensing model can explain the mammalian olfactory system's ability to decode odor identity and concentration from turbulent odor plumes.
Motivation: Existing compressed sensing models neither capture the anatomy and physiology of the olfactory bulb nor show that sensing can be achieved within the 100-millisecond timescale of a single sniff.
Method: Propose a rate-based Poisson compressed sensing circuit model that maps onto the neuron classes of the olfactory bulb and recapitulates salient features of their connectivity and physiology.
Results: For circuit sizes comparable to the human olfactory bulb, the model accurately detects tens of odors within the timescale of a single sniff, and it can also perform Bayesian posterior sampling for accurate uncertainty estimation.

Within a single sniff, the mammalian olfactory system can decode the identity and concentration of odorants wafted on turbulent plumes of air. Yet, it must do so given access only to the noisy, dimensionally-reduced representation of the odor world provided by olfactory receptor neurons. As a result, the olfactory system must solve a compressed sensing problem, relying on the fact that only a handful of the millions of possible odorants are present in a given scene. Inspired by this principle, past works have proposed normative compressed sensing models for olfactory decoding. However, these models have not captured the unique anatomy and physiology of the olfactory bulb, nor have they shown that sensing can be achieved within the 100-millisecond timescale of a single sniff. Here, we propose a rate-based Poisson compressed sensing circuit model for the olfactory bulb. This model maps onto the neuron classes of the olfactory bulb, and recapitulates salient features of their connectivity and physiology. For circuit sizes comparable to the human olfactory bulb, we show that this model can accurately detect tens of odors within the timescale of a single sniff. We also show that this model can perform Bayesian posterior sampling for accurate uncertainty estimation. Fast inference is possible only if the geometry of the neural code is chosen to match receptor properties, yielding a distributed neural code that is not axis-aligned to individual odor identities. Our results illustrate how normative modeling can help us map function onto specific neural circuits to generate new hypotheses.
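
As a toy illustration of the underlying computational problem (my simplification, not the paper's circuit model), sparse odor concentrations can be recovered from Poisson spike counts by projected gradient ascent on an L1-penalized Poisson log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, k = 50, 200, 3                       # receptors, odorants, active odors
A = rng.random((M, N))                     # receptor-odorant affinity matrix
c_true = np.zeros(N)
c_true[rng.choice(N, k, replace=False)] = 5.0
y = rng.poisson(A @ c_true)                # Poisson spike counts

c = np.ones(N)                             # concentration estimate
for _ in range(500):
    rate = A @ c + 1e-9
    grad = A.T @ (y / rate - 1.0) - 0.5    # log-likelihood gradient minus L1 weight
    c = np.maximum(c + 0.01 * grad, 0.0)   # projected (nonnegative) ascent step
```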

Learning a 1-layer conditional generative model in total variation
Ajil Jalal Justin Kang Ananya Uppal Kannan Ramchandran Eric Price



Research question: How to learn a conditional generative model for sampling from a conditional distribution.
Motivation: Generalization bounds for other learning models require assumptions on the input distribution; the model studied here needs no such assumptions.
Method: Given samples $(x, y)$, show how to learn a 1-layer ReLU conditional generative model in total variation.
Results: The guarantees hold for both linear regression and single-layer ReLU networks without any assumption on the distribution of the inputs $x$; given access to the internal activations of a deep generative model, the 1-layer guarantee can be composed to progressively learn the deep model using a near-linear number of samples.

A conditional generative model is a method for sampling from a conditional distribution $p(y \mid x)$. For example, one may want to sample an image of a cat given the label ``cat''. A feed-forward conditional generative model is a function $g(x, z)$ that takes the input $x$ and a random seed $z$, and outputs a sample $y$ from $p(y \mid x)$. Ideally the distribution of outputs $(x, g(x, z))$ would be close in total variation to the ideal distribution $(x, y)$. Generalization bounds for other learning models require assumptions on the distribution of $x$, even in simple settings like linear regression with Gaussian noise. We show these assumptions are unnecessary in our model, for both linear regression and single-layer ReLU networks. Given samples $(x, y)$, we show how to learn a 1-layer ReLU conditional generative model in total variation. As our result has no assumption on the distribution of inputs $x$, if we are given access to the internal activations of a deep generative model, we can compose our 1-layer guarantee to progressively learn the deep model using a near-linear number of samples.

Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting
Marcel Kollovieh Abdul Fatir Ansari Michael Bohlke-Schneider Jasper Zschiegner Hao Wang Bernie Wang



Research question: Explore the potential of unconditional time series diffusion models for a range of time series tasks.
Motivation: Existing time series diffusion models mainly build conditional models tailored to specific forecasting or imputation tasks; this work instead develops a task-agnostic, unconditionally trained model.
Method: Propose TSDiff, an unconditionally trained diffusion model for time series, together with a self-guidance mechanism that adapts the model to downstream tasks at inference time without auxiliary networks or changes to the training procedure.
Results: TSDiff performs well on three time series tasks (forecasting, refinement, and synthetic data generation): it is competitive with several task-specific conditional forecasting methods, can iteratively refine base forecasters' predictions at low computational overhead, and forecasters trained on its synthetic samples outperform those trained on samples from other state-of-the-art generative time series models, occasionally even surpassing models trained on real data.

Diffusion models have achieved state-of-the-art performance in generative modeling tasks across various domains. Prior works on time series diffusion models have primarily focused on developing conditional models tailored to specific forecasting or imputation tasks. In this work, we explore the potential of task-agnostic, unconditional diffusion models for several time series applications. We propose TSDiff, an unconditionally-trained diffusion model for time series. Our proposed self-guidance mechanism enables conditioning TSDiff for downstream tasks during inference, without requiring auxiliary networks or altering the training procedure. We demonstrate the effectiveness of our method on three different time series tasks: forecasting, refinement, and synthetic data generation. First, we show that TSDiff is competitive with several task-specific conditional forecasting methods (*predict*). Second, we leverage the learned implicit probability density of TSDiff to iteratively refine the predictions of base forecasters with reduced computational overhead over reverse diffusion (*refine*). Notably, the generative performance of the model remains intact — downstream forecasters trained on synthetic samples from TSDiff outperform forecasters that are trained on samples from other state-of-the-art generative time series models, occasionally even outperforming models trained on real data (*synthesize*).
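
A hedged sketch of what inference-time self-guidance can look like for an unconditionally trained model (my own construction following the abstract, not the TSDiff code): each reverse step is biased by the gradient of the mismatch between the model's denoised estimate and the observed context.

```python
import torch

def self_guided_step(predict_x0, reverse_step, x_t, t, obs, mask, scale=1.0):
    """One reverse step biased toward the observed part of the series.

    predict_x0(x_t, t)   -> the model's estimate of the clean series
    reverse_step(x_t, t) -> (mean, std) of the ordinary reverse transition
    """
    x_t = x_t.detach().requires_grad_(True)
    loss = (((predict_x0(x_t, t) - obs) ** 2) * mask).sum()  # observed steps only
    grad = torch.autograd.grad(loss, x_t)[0]
    mean, std = reverse_step(x_t.detach(), t)
    return mean - scale * grad + std * torch.randn_like(mean)
```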

Uncertainty Quantification via Neural Posterior Principal Components
Elias Nehme Omer Yair Tomer Michaeli



Research question: How to quantify the uncertainty of image restoration models effectively, particularly in safety-critical domains such as autonomous driving and biological imaging.
Motivation: Current uncertainty visualization methods focus on per-pixel estimates, which are typically of little practical use because they fail to capture strong correlations between pixels; a more natural measure of uncertainty is the variance along the principal components (PCs) of the posterior distribution.
Method: Propose a method that predicts the posterior PCs for any input image in a single forward pass of a neural network; it can either wrap around a pre-trained model trained to minimize the mean square error (MSE) or be trained from scratch to output both the predicted image and the posterior PCs.
Results: Demonstrated on multiple imaging inverse problems, including denoising, inpainting, super-resolution, and biological image-to-image translation, the method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable to posterior samplers while being orders of magnitude faster.

Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. Yet, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Code and examples are available on our [webpage](https://eliasnehme.github.io/NPPC/).
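
Predicting posterior PCs in a single forward pass requires the network head to emit a set of orthonormal directions; a minimal sketch of one way to do this (my assumption, not necessarily the paper's parameterization) is differentiable Gram-Schmidt over raw predicted vectors:

```python
import torch

def gram_schmidt(V):
    """Orthonormalize rows of V (K, D) so they can serve as K posterior PCs."""
    out = []
    for v in V:
        for u in out:
            v = v - (v @ u) * u           # remove components along earlier PCs
        out.append(v / (v.norm() + 1e-8))
    return torch.stack(out)

raw = torch.randn(5, 1024)                # e.g., 5 predicted directions for a 32x32 image
pcs = gram_schmidt(raw)                   # orthonormal: pcs @ pcs.T is close to identity
```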

PHOTOSWAP: Personalized Subject Swapping in Images
Jing Gu Yilin Wang Nanxuan Zhao Tsu-Jui Fu Wei Xiong Qing Liu Zhifei Zhang HE Zhang Jianming Zhang HyunJoon Jung Xin Eric Wang



Research question: How to replace a specific subject in an existing image with a personalized subject while preserving the image's original charm and composition?
Motivation: In a digital era dominated by images and visual content, the ability to manipulate and personalize these images has become a necessity.
Method: Propose Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images: it first learns the visual concept of the subject from reference images and then seamlessly swaps it into the target image using pre-trained diffusion models.
Results: Experiments show that Photoswap is effective and controllable for personalized subject swapping and significantly outperforms baseline methods in human ratings, revealing broad application potential from entertainment to professional editing.

In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present \emph{Photoswap}, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. \emph{Photoswap} first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of \emph{Photoswap} in personalized subject swapping. Furthermore, \emph{Photoswap} significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing.

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
Weijia Wu Yuzhong Zhao Hao Chen Yuchao Gu Rui Zhao Yefei He Hong Zhou Mike Zheng Shou Chunhua Shen



Research question: How to efficiently generate large-scale, diverse synthetic datasets with high-quality perception annotations.
Motivation: Current deep models require large amounts of training data, and collecting and annotating large-scale datasets is time-consuming and labor-intensive; by contrast, generative models such as DALL-E and diffusion models can generate synthetic data infinitely at minimal cost.
Method: Propose DatasetDM, a generic dataset generation model built on a pre-trained diffusion model that extends text-guided image synthesis to perception data generation; a decoder module decodes rich and accurate perception annotations from the diffusion model's rich latent code.
Results: The method achieves state-of-the-art results on downstream tasks such as semantic segmentation, instance segmentation, and depth estimation; compared with real data it is more efficient and robust in domain generalization; it also attains state-of-the-art results in the zero-shot segmentation setting and offers flexibility for efficient application and novel task composition (e.g., image editing).

Current deep networks are very data-hungry and benefit from training on large-scale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks, and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only needs less than 1% (around 100 images) of manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models on downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly more efficient and robust domain generalization than real data; 3) state-of-the-art results in the zero-shot segmentation setting; and 4) flexibility for efficient application and novel task composition (e.g., image editing).

Assessor360: Multi-sequence Network for Blind Omnidirectional Image Quality Assessment
Tianhe Wu Shuwei Shi Haoming Cai Mingdeng Cao Jing Xiao Yinqiang Zheng Yujiu Yang



Research question: Address the inability of existing blind omnidirectional image quality assessment (BOIQA) methods to objectively assess the human-perceived quality of omnidirectional images without pristine-quality reference information.
Motivation: With the continued advancement of virtual reality (VR) technology, omnidirectional image quality assessment is increasingly important, yet existing BOIQA methods lack modeling of the observer's browsing process, which severely hampers progress.
Method: Propose Assessor360, a novel multi-sequence network for BOIQA derived from the realistic multi-assessor quality assessment procedure for omnidirectional images. Specifically, a generalized Recursive Probability Sampling (RPS) method combines content and detail information to generate multiple pseudo viewport sequences from a given starting point; a Multi-scale Feature Aggregation (MFA) module with a Distortion-aware Block (DAB) fuses distortion and semantic features of each viewport; and a Temporal Modeling Module (TMM) learns viewport transitions in the temporal domain.
Results: Extensive experiments show that Assessor360 outperforms state-of-the-art methods on multiple omnidirectional image quality assessment datasets. Code and models: https://github.com/TianheWu/Assessor360.

Blind Omnidirectional Image Quality Assessment (BOIQA) aims to objectively assess the human perceptual quality of omnidirectional images (ODIs) without relying on pristine-quality image information. It is becoming more significant with the increasing advancement of virtual reality (VR) technology. However, the quality assessment of ODIs is severely hampered by the fact that the existing BOIQA pipeline lacks the modeling of the observer's browsing process. To tackle this issue, we propose a novel multi-sequence network for BOIQA called Assessor360, which is derived from the realistic multi-assessor ODI quality assessment procedure. Specifically, we propose a generalized Recursive Probability Sampling (RPS) method for the BOIQA task, combining content and details information to generate multiple pseudo viewport sequences from a given starting point. Additionally, we design a Multi-scale Feature Aggregation (MFA) module with a Distortion-aware Block (DAB) to fuse distorted and semantic features of each viewport. We also devise Temporal Modeling Module (TMM) to learn the viewport transition in the temporal domain. Extensive experimental results demonstrate that Assessor360 outperforms state-of-the-art methods on multiple OIQA datasets. The code and models are available at https://github.com/TianheWu/Assessor360.

Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences
Minsu Kim Federico Berto Sungsoo Ahn Jinkyoo Park



Research question: Optimize biological sequences (e.g., proteins, DNA, and RNA) to maximize a black-box score function evaluated only on an offline dataset.
Motivation: Existing methods do not optimize biological sequences effectively in this setting, motivating a new solution.
Method: Propose a new algorithm, bootstrapped training of a score-conditioned generator (BootGen): the sequence generator is trained with rank-based weights to improve generation accuracy at high scores, and the training dataset is augmented with self-generated data labeled by a proxy score function.
Results: The method outperforms competitive baselines on biological sequence design tasks, and reproducible source code is provided.

We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function that is only evaluated in an offline dataset. We propose a novel solution, bootstrapped training of score-conditioned generator (BootGen) algorithm. Our algorithm repeats a two-stage process. In the first stage, our algorithm trains the biological sequence generator with rank-based weights to enhance the accuracy of sequence generation based on high scores. The subsequent stage involves bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. Our key idea is to align the score-based generation with a proxy score function, which distills the knowledge of the proxy score function to the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce a diverse design. Extensive experiments show that our method outperforms competitive baselines on biological sequential design tasks. We provide reproducible source code: https://github.com/kaist-silab/bootgen.
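
A small sketch of one plausible reading of the rank-based weighting (not the released code): weight each training pair by the rank of its score, so the generator concentrates on high-scoring sequences; the smoothing constant `k` is a hypothetical hyperparameter.

```python
import numpy as np

def rank_weights(scores, k=1e-2):
    """Training weights from score ranks: rank 0 (best score) gets the largest weight."""
    order = np.argsort(np.argsort(-np.asarray(scores)))  # rank of each score, 0 = best
    w = 1.0 / (k * len(scores) + order)                  # heavier weight on top ranks
    return w / w.sum()

print(rank_weights([0.1, 0.9, 0.4]))  # the 0.9-scoring sequence dominates
```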

Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?
Jialu Gao Kaizhe Hu Guowei Xu Huazhe Xu



Research question: How to leverage pre-trained text-to-image generative models and advanced image editing techniques to guide robot learning.
Motivation: Compared with language, images usually convey information with more detail and less ambiguity; this motivates Learning from the Void (LfVoid), a method that harnesses pre-trained text-to-image models and advanced image editing techniques to guide robot learning.
Method: Given natural language instructions, LfVoid edits the original observations to obtain goal images, such as "wiping" a stain off a table; it then trains an ensembled goal discriminator on the generated images to provide reward signals for a reinforcement learning agent, guiding it to achieve the goal.
Results: LfVoid is evaluated on three simulated tasks and its feasibility is validated in the corresponding real-world scenarios; the work also offers insights into the key considerations for integrating visual generative models into robot learning workflows and represents a first step toward broader application of pre-trained visual generative models in robotics.

Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions. Compared with language, images usually convey information with more details and less ambiguity. In this study, we propose Learning from the Void (LfVoid), a method that leverages the power of pre-trained text-to-image models and advanced image editing techniques to guide robot learning. Given natural language instructions, LfVoid can edit the original observations to obtain goal images, such as "wiping" a stain off a table. Subsequently, LfVoid trains an ensembled goal discriminator on the generated image to provide reward signals for a reinforcement learning agent, guiding it to achieve the goal. The ability of LfVoid to learn with zero in-domain training on expert demonstrations or true goal observations (the void) is attributed to the utilization of knowledge from web-scale generative models. We evaluate LfVoid across three simulated tasks and validate its feasibility in the corresponding real-world scenarios. In addition, we offer insights into the key considerations for the effective integration of visual generative models into robot learning workflows. We posit that our work represents an initial step towards the broader application of pre-trained visual generative models in the robotics field. Our project page: https://lfvoid-rl.github.io/.

Global Structure-Aware Diffusion Process for Low-light Image Enhancement
Jinhui HOU Zhiyu Zhu Junhui Hou Hui LIU Huanqiang Zeng Hui Yuan



Research question: Address the problem of low-light image enhancement.
Motivation: Existing diffusion models can introduce noise and artifacts when processing low-light images, degrading image quality.
Method: Propose a diffusion-based framework that introduces a curvature regularization term and an uncertainty-guided regularization technique to preserve intricate image details and enhance contrast while mitigating the effects of noise and artifacts.
Results: Experiments show that the framework delivers substantial performance gains in low-light image enhancement, surpassing existing methods in image quality, noise suppression, and contrast enhancement.

This paper studies a diffusion-based framework to address the low-light image enhancement problem. To harness the capabilities of diffusion models, we delve into this intricate process and advocate for the regularization of its inherent ODE-trajectory. To be specific, inspired by the recent research that low curvature ODE-trajectory results in a stable and effective diffusion process, we formulate a curvature regularization term anchored in the intrinsic non-local structures of image data, i.e., global structure-aware regularization, which gradually facilitates the preservation of complicated details and the augmentation of contrast during the diffusion process. This incorporation mitigates the adverse effects of noise and artifacts resulting from the diffusion process, leading to a more precise and flexible enhancement. To additionally promote learning in challenging regions, we introduce an uncertainty-guided regularization technique, which wisely relaxes constraints on the most extreme regions of the image. Experimental evaluations reveal that the proposed diffusion-based framework, complemented by rank-informed regularization, attains distinguished performance in low-light enhancement. The outcomes indicate substantial advancements in image quality, noise suppression, and contrast amplification in comparison with state-of-the-art methods. We believe this innovative approach will stimulate further exploration and advancement in low-light image processing, with potential implications for other applications of diffusion models. The code is publicly available at https://github.com/jinnh/GSAD.

Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation
Minghui Hu Jianbin Zheng Daqing Liu Chuanxia Zheng Chaoyue Wang Dacheng Tao Tat-Jen Cham



Research question: Address the ambiguity of linguistic descriptions of the intended target image in text-guided diffusion models, which requires additional control signals to improve their efficacy.
Motivation: Current text-guided diffusion models excel at generating high-quality, diverse images, but because linguistic representations often describe the intended imagery ambiguously, additional control signals are needed.
Method: Propose Cocktail, a pipeline that mixes various modalities into one embedding, combined with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method to achieve multi-modal and spatially refined control. Specifically, a hyper-network gControlNet aligns and injects control signals from disparate modalities into the pre-trained diffusion model.
Results: Experiments show excellent control across various modalities, generating high-quality synthetic images with high fidelity to multiple external signals.

Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model. gControlNet is capable of accepting flexible modality signals, encompassing the simultaneous reception of any combination of modality signals, or the supplementary fusion of multiple modality signals. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm. Furthermore, our advanced spatial guidance sampling methodology proficiently incorporates the control signal into the designated region, thereby circumventing the manifestation of undesired objects within the generated image. We demonstrate the results of our method in controlling various modalities, proving high-quality synthesis and fidelity to multiple external signals.

Implicit Transfer Operator Learning: Multiple Time-Resolution Models for Molecular Dynamics
Mathias Schreiner Ole Winther Simon Olsson



Research question: How to estimate the Boltzmann distribution of molecular systems more accurately and model the simulation process quickly and accurately across different time scales?
Motivation: Stable molecular dynamics simulations require very small time steps, while convergence of some physical quantities can require much longer time scales, and every molecular system must be simulated separately.
Method: Propose the Implicit Transfer Operator (ITO) learning framework, implemented with denoising diffusion probabilistic models and a new SE(3)-equivariant architecture, to learn and model the simulation process at multiple time resolutions.
Results: The resulting models generate self-consistent stochastic dynamics across multiple time scales, even when the system is only partially observed. A coarse-grained CG-SE3-ITO model can further quantitatively model all-atom molecular dynamics using only coarse molecular representations, making ITO an important step toward multi-time- and space-resolution acceleration of molecular dynamics.

Computing properties of molecular systems relies on estimating expectations of the (unnormalized) Boltzmann distribution. Molecular dynamics (MD) is a broadly adopted technique to approximate such quantities. However, stable simulations rely on very small integration time-steps ($10^{-15}\,\mathrm{s}$), whereas convergence of some moments, e.g. binding free energy or rates, might rely on sampling processes on time-scales as long as $10^{-1}\, \mathrm{s}$, and these simulations must be repeated for every molecular system independently. Here, we present Implicit Transfer Operator (ITO) Learning, a framework to learn surrogates of the simulation process with multiple time-resolutions. We implement ITO with denoising diffusion probabilistic models with a new SE(3) equivariant architecture and show the resulting models can generate self-consistent stochastic dynamics across multiple time-scales, even when the system is only partially observed. Finally, we present a coarse-grained CG-SE3-ITO model which can quantitatively model all-atom molecular dynamics using only coarse molecular representations. As such, ITO provides an important step towards multiple time- and space-resolution acceleration of MD. Code is available at \href{https://github.com/olsson-group/ito}{https://github.com/olsson-group/ito}.

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Robin San Roman Yossi Adi Antoine Deleforge Romain Serizel Gabriel Synnaeve Alexandre Défossez



Research question: How to generate high-fidelity audio from low-bitrate discrete representations with deep generative models?
Motivation: Current generative models are prone to audible artifacts when their conditioning is flawed or imperfect, while diffusion models, though able to generate relatively low sampling-rate signals, have mainly been used as speech vocoders or for specific audio types.
Method: Propose a high-fidelity multi-band diffusion-based framework that can generate any type of audio (e.g., speech, music, environmental sounds) from low-bitrate discrete representations.
Results: At equal bit rate, the method outperforms state-of-the-art generative techniques in perceptual quality. Training and evaluation code are available in the facebookresearch/audiocraft GitHub project; samples are at https://ai.honu.io/papers/mbd/.

Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code are available on the facebookresearch/audiocraft GitHub project. Samples are available at https://ai.honu.io/papers/mbd/.

Idempotent Learned Image Compression with Right-Inverse
Yanghao Li Tongda Xu Yan Wang Jingjing Liu Ya-Qin Zhang



Research question: Address the idempotence problem of learned image compression.
Motivation: Existing codecs lack stability under re-compression, i.e., they are not idempotent.
Method: First relax invertibility of the transform to right-invertibility, then implement an idempotent codec using the proposed blocked convolution and null-space enhancement.
Results: Experiments show the codec achieves state-of-the-art rate-distortion performance among idempotent codecs. Moreover, by relaxing right-invertibility, it extends to a near-idempotent codec with significantly less quality decay after 50 rounds of re-compression than other near-idempotent codecs.

We consider the problem of idempotent learned image compression (LIC). The idempotence of codec refers to the stability of codec to re-compression. To achieve idempotence, previous codecs adopt invertible transforms such as DCT and normalizing flow. In this paper, we first identify that invertibility of transform is sufficient but not necessary for idempotence. Instead, it can be relaxed into right-invertibility. And such relaxation allows wider family of transforms. Based on this identification, we implement an idempotent codec using our proposed blocked convolution and null-space enhancement. Empirical results show that we achieve state-of-the-art rate-distortion performance among idempotent codecs. Furthermore, our codec can be extended into near-idempotent codec by relaxing the right-invertibility. And this near-idempotent codec has significantly less quality decay after $50$ rounds of re-compression compared with other near-idempotent codecs.
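
The property itself is easy to state in code; the sketch below (mine, not the paper's) simply measures the drift of any encode-decode callable under repeated re-compression, which is zero for an exactly idempotent codec:

```python
import numpy as np

def recompression_drift(codec, img, rounds=50):
    """Yield the mean absolute change per re-compression round (all zeros if idempotent)."""
    x = codec(img)
    for _ in range(rounds - 1):
        x_next = codec(x)
        yield float(np.abs(x_next - x).mean())
        x = x_next
```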

PUCA: Patch-Unshuffle and Channel Attention for Enhanced Self-Supervised Image Denoising
Hyemi Jang Junsung Park Dahuin Jung Jaihyun Lew Ho Bae Sungroh Yoon



Research question: Although supervised image denoising networks perform remarkably on synthesized noisy images, they often fail in practice because real and synthesized noise differ.
Motivation: Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning that uses the noisy input itself as the target has been studied. To prevent a self-supervised denoising model from learning the identity mapping, each output pixel must not be influenced by its corresponding input pixel, a requirement known as J-invariance.
Method: Propose PUCA, a novel J-invariant U-Net architecture for self-supervised denoising. PUCA leverages patch-unshuffle/shuffle to dramatically expand receptive fields while maintaining J-invariance, and uses dilated attention blocks (DABs) to incorporate global context.
Results: Experiments show that PUCA achieves state-of-the-art performance, outperforming existing self-supervised image denoising methods.

Although supervised image denoising networks have shown remarkable performance on synthesized noisy images, they often fail in practice due to the difference between real and synthesized noise. Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning, which utilizes noisy input itself as a target, has been studied. To prevent a self-supervised denoising model from learning identical mapping, each output pixel should not be influenced by its corresponding input pixel; this requirement is known as J-invariance. Blind-spot networks (BSNs) have been a prevalent choice to ensure J-invariance in self-supervised image denoising. However, constructing variations of BSNs by injecting additional operations such as downsampling can expose blinded information, thereby violating J-invariance. Consequently, only convolutions designed specifically for BSNs have been allowed, limiting architectural flexibility. To overcome this limitation, we propose PUCA, a novel J-invariant U-Net architecture, for self-supervised denoising. PUCA leverages patch-unshuffle/shuffle to dramatically expand receptive fields while maintaining J-invariance and dilated attention blocks (DABs) for global context incorporation. Experimental results demonstrate that PUCA achieves state-of-the-art performance, outperforming existing methods in self-supervised image denoising.
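
A minimal sketch of the patch-unshuffle/shuffle pair at the heart of PUCA, using PyTorch's built-in ops; the actual architecture wraps a J-invariant network between the two rearrangements, which is elided here:

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64)
down = F.pixel_unshuffle(x, downscale_factor=2)   # (1, 12, 32, 32): 4 subsampled views
# ... a J-invariant denoising network would operate on `down` here ...
back = F.pixel_shuffle(down, upscale_factor=2)    # (1, 3, 64, 64): exact inverse
assert torch.allclose(x, back)
```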

VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models
Sheng-Yen Chou Pin-Yu Chen Tsung-Yi Ho



Research question: Expand the scope of backdoor analysis for diffusion models (DMs), which are vulnerable to backdoor injection attacks triggered by maliciously embedded input patterns.
Motivation: Recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output-manipulation attack triggered by a malicious pattern at the model input, yet existing analyses cover only a narrow range of DM configurations.
Method: Present VillanDiffusion, a unified backdoor attack framework that covers mainstream unconditional and conditional DMs (denoising-based and score-based) as well as various training-free samplers, enabling holistic security evaluation of different DM configurations.
Results: Experiments show that the unified framework facilitates backdoor analysis across DM configurations and provides new insights into caption-based backdoor attacks on DMs.

Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs.

Deep Optimal Transport: A Practical Algorithm for Photo-realistic Image Restoration
Theo Joseph Adrai Guy Ohayon Michael Elad Tomer Michaeli



Research question: Propose an image restoration algorithm that can control the perceptual quality and/or mean square error (MSE) of any pre-trained model, trading one against the other at test time.
Motivation: The approach is motivated by a recent theoretical result linking the minimum-MSE (MMSE) predictor to the predictor that minimizes MSE under a perfect perceptual quality constraint: the latter can be obtained by optimally transporting the output of the former so that its distribution matches that of the source data.
Method: To improve the perceptual quality of a predictor originally trained to minimize MSE, approximate the optimal transport by a linear transformation in the latent space of a variational auto-encoder, computed in closed form from empirical means and covariances.
Results: Demonstrated on a variety of degradations applied to general-content images, the algorithm significantly improves the perceptual quality and/or MSE of newly restored images without further training.

We propose an image restoration algorithm that can control the perceptual quality and/or the mean square error (MSE) of any pre-trained model, trading one over the other at test time. Our algorithm is few-shot: Given about a dozen images restored by the model, it can significantly improve the perceptual quality and/or the MSE of the model for newly restored images without further training. Our approach is motivated by a recent theoretical result that links the minimum MSE (MMSE) predictor and the predictor that minimizes the MSE under a perfect perceptual quality constraint. Specifically, it has been shown that the latter can be obtained by optimally transporting the output of the former, such that its distribution matches that of the source data. Thus, to improve the perceptual quality of a predictor that was originally trained to minimize MSE, we approximate the optimal transport by a linear transformation in the latent space of a variational auto-encoder, which we compute in closed-form using empirical means and covariances. Going beyond the theory, we find that applying the same procedure on models that were initially trained to achieve high perceptual quality typically improves their perceptual quality even further. And by interpolating the results with the original output of the model, we can improve their MSE at the expense of perceptual quality. We illustrate our method on a variety of degradations applied to general content images with arbitrary dimensions.
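
The closed-form optimal transport map between two Gaussians that this approach relies on is standard; a sketch of it (requiring scipy, with `mu0, S0` standing for the empirical latent statistics of restored images and `mu1, S1` for those of natural images, per the abstract's few-shot setting) looks like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu0, S0, mu1, S1):
    """Closed-form optimal transport map sending N(mu0, S0) to N(mu1, S1)."""
    S0h = np.real(sqrtm(S0))
    S0h_inv = np.linalg.inv(S0h)
    T = S0h_inv @ np.real(sqrtm(S0h @ S1 @ S0h)) @ S0h_inv  # linear part of the map
    return lambda z: mu1 + (z - mu0) @ T.T

# Usage sketch: estimate (mu, S) empirically from latents of ~a dozen images each.
rng = np.random.default_rng(0)
Z0, Z1 = rng.normal(size=(12, 8)), rng.normal(loc=1.0, size=(12, 8))
f = gaussian_ot_map(Z0.mean(0), np.cov(Z0.T), Z1.mean(0), np.cov(Z1.T))
z_transported = f(Z0[0])
```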

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
Zeyue Xue Guanglu Song Qiushan Guo Boxiao Liu Zhuofan Zong Yu Liu Ping Luo



Research question: Generate highly artistic text-to-image results that accurately portray text prompts spanning multiple nouns, adjectives, and verbs.
Motivation: Despite remarkable recent progress in text-to-image generation, accurately reflecting complex prompts in highly artistic images remains challenging.
Method: Introduce RAPHAEL, a text-conditional image diffusion model that stacks tens of mixture-of-experts (MoE) layers (space-MoE and time-MoE), enabling billions of diffusion paths from input to output; each path intuitively acts as a "painter" depicting a particular textual concept onto a specified image region at a diffusion timestep.
Results: RAPHAEL outperforms recent cutting-edge models such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2 in image quality and aesthetic appeal, achieves a state-of-the-art zero-shot FID of 6.61 on COCO, and significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark.

Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images, which accurately portray the text prompts, encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" for depicting a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on an anonymous webpage: https://raphaelpainting.github.io/.

Lossy Image Compression with Conditional Diffusion Models
Ruihan Yang Stephan Mandt



Research question: Propose an end-to-end optimized lossy image compression framework using diffusion generative models.
Motivation: In existing VAE-based neural compression the decoder is a deterministic neural network, whereas here the decoder is a conditional diffusion model that can better store image information.
Method: Follow the transform-coding paradigm: map the image into a latent space for entropy coding and map back to the data space for reconstruction; the decoder is a conditional diffusion model with an additional "content" latent variable that stores information about the image.
Results: Experiments across multiple datasets and image quality metrics show the approach outperforms a GAN-based model while remaining competitive with VAE-based models on several distortion metrics. Moreover, training the diffusion model with $\mathcal{X}$-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly improving the model's practicality.

This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with $\mathcal{X}$-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly improving the model's practicality.

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
Yanyu Li Huan Wang Qing Jin Ju Hu Pavlo Chemerys Yun Fu Yanzhi Wang Sergey Tulyakov Jian Ren



Research question: How to reduce the computational cost of running text-to-image diffusion models so they can run quickly on mobile devices.
Motivation: Existing text-to-image diffusion models use complex network architectures and many denoising iterations, making them computationally expensive and slow; this forces reliance on high-end GPUs and cloud-based inference, which is costly and raises privacy concerns.
Method: Propose a generic approach that combines an efficient network architecture with improved step distillation. Specifically, an efficient UNet is obtained by identifying redundancy in the original model and reducing the computation of the image decoder via data distillation; step distillation is further enhanced by exploring training strategies and introducing regularization from classifier-free guidance.
Results: Extensive experiments on MS-COCO show that the model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps, bringing powerful text-to-image diffusion models to users and democratizing content creation.

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in **less than 2 seconds**. We achieve so by introducing efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Casey Meehan Florian Bordes Pascal Vincent Kamalika Chaudhuri Chuan Guo



Research question: Study the unintended memorization of image-specific information in self-supervised learning (SSL) models, referred to as déjà vu memorization.
Motivation: When taken to the extreme, SSL models can unintendedly memorize specific parts of individual training samples rather than learning semantically meaningful associations, posing unknown privacy risks.
Method: Perform a systematic study of déjà vu memorization in SSL models, showing that given a crop of a training image containing only background (e.g., water, sky, grass), the foreground object can be inferred with high accuracy or even visually reconstructed.
Results: Déjà vu memorization is found to be common across SSL algorithms, exacerbated by certain design choices, and undetectable by conventional techniques for evaluating representation quality; the study reveals previously unknown privacy risks in SSL models and suggests potential practical mitigation strategies.

Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as déjà vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that déjà vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of déjà vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies.

3D molecule generation by denoising voxel grids
Pedro O. Pinheiro Joshua Rackers joseph Kleinhenz Michael Maser Omar Mahmood Andrew Martin Watkins Stephen Ra Vishnu Sresht Saeed Saremi



Research question: Propose a new score-based approach that represents 3D molecules as atomic densities on regular grids.
Motivation: The method differs from the current state of the art (diffusion models applied to atomic point clouds) in data representation, noise model, network architecture, and generative modeling algorithm.
Method: First train a denoising neural network that learns to map from a smooth distribution of noisy molecules to the distribution of real molecules; then, following the neural empirical Bayes framework [Saremi and Hyvarinen, 2019], generate molecules in two steps: (i) sample noisy density grids from the smooth distribution via underdamped Langevin Markov chain Monte Carlo, and (ii) recover the "clean" molecule by denoising the noisy grid in a single step.
Results: The resulting method, VoxMol, captures the distribution of drug-like molecules better than the state of the art while being faster at generating samples.

We propose a new score-based approach to generate 3D molecules represented as atomic densities on regular grids. First, we train a denoising neural network that learns to map from a smooth distribution of noisy molecules to the distribution of real molecules. Then, we follow the _neural empirical Bayes_ framework [Saremi and Hyvarinen, 2019] and generate molecules in two steps: (i) sample noisy density grids from a smooth distribution via underdamped Langevin Markov chain Monte Carlo, and (ii) recover the "clean" molecule by denoising the noisy grid with a single step. Our method, _VoxMol_, generates molecules in a fundamentally different way than the current state of the art (i.e., diffusion models applied to atom point clouds). It differs in terms of the data representation, the noise model, the network architecture and the generative modeling algorithm. Our experiments show that VoxMol captures the distribution of drug-like molecules better than the state of the art, while being faster to generate samples.
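
A compact sketch of the two-step neural-empirical-Bayes sampler described above, under my own simplifications (a fixed noise level `sigma`, a denoiser approximating E[x | y] so Tweedie's formula gives the score, and an Euler discretization of underdamped Langevin dynamics):

```python
import torch

def walk_jump(denoiser, shape, sigma, n_steps=100, dt=0.05, gamma=1.0):
    """(i) walk on the smoothed density with Langevin MCMC, (ii) jump via one denoise."""
    y = torch.randn(shape) * sigma             # start from the noisy (smoothed) density
    v = torch.zeros_like(y)                    # velocity for underdamped dynamics
    for _ in range(n_steps):
        score = (denoiser(y) - y) / sigma**2   # Tweedie: score of the smoothed density
        v = v + dt * (score - gamma * v) + (2 * gamma * dt) ** 0.5 * torch.randn_like(v)
        y = y + dt * v                         # walk step
    return denoiser(y)                         # jump: recover the clean sample
```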

Learning Re-sampling Methods with Parameter Attribution for Image Super-resolution
Xiaotong Luo Yuan Xie Yanyun Qu



Research question: Existing single image super-resolution (SISR) methods focus mainly on network architecture design and optimization schemes, while paying little attention to the training data.
Motivation: Most existing SR methods train on patch pairs sampled uniformly from whole images, but uneven image content makes the training data distribution unbalanced: easily reconstructed (smooth) regions occupy most of the data, while hard-to-reconstruct regions (edges or texture) contribute few samples.
Method: Propose a simple yet effective Bi-Sampling Parameter Attribution (BSPA) method for accurate image SR. The bi-sampling combines uniform sampling, which preserves the intrinsic data distribution, with inverse sampling, which strengthens the model's feature extraction on hard samples. Integrated gradients are then used to attribute each parameter's contribution in the models trained alternately on the two kinds of sampled data, so that trivial parameters can be filtered out for further dynamic refinement; by progressively decoupling the allocation of parameters, the SR model learns a more compact representation.
Results: Extensive experiments on public datasets show the method effectively boosts the performance of baseline methods from the data re-sampling perspective.

Single image super-resolution (SISR) has made a significant breakthrough benefiting from the prevalent rise of deep neural networks and large-scale training samples. The mainstream deep SR models primarily focus on network architecture design as well as optimization schemes, while few pay attention to the training data. In fact, most of the existing SR methods train the model on uniformly sampled patch pairs from the whole image. However, the uneven image content makes the training data present an unbalanced distribution, i.e., the easily reconstructed region (smooth) occupies the majority of the data, while the hard reconstructed region (edge or texture) has only a few samples. Based on this phenomenon, we consider rethinking the current paradigm of merely using uniform data sampling for training SR models. In this paper, we propose a simple yet effective Bi-Sampling Parameter Attribution (BSPA) method for accurate image SR. Specifically, the bi-sampling consists of uniform sampling and inverse sampling, which is introduced to reconcile the unbalanced inherent data bias. The former aims to keep the intrinsic data distribution, and the latter is designed to enhance the feature extraction ability of the model on the hard samples. Moreover, integrated gradient is introduced to attribute the contribution of each parameter in the alternate models trained by both sampling data so as to filter the trivial parameters for further dynamic refinement. By progressively decoupling the allocation of parameters, the SR model can learn a more compact representation. Extensive experiments on publicly available datasets demonstrate that our proposal can effectively boost the performance of baseline methods from the data re-sampling view.
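
A sketch of the "inverse" half of bi-sampling under my own difficulty proxy (mean gradient magnitude of a grayscale image; the paper's criterion may differ): candidate patches are drawn with probability proportional to how edgy or textured they are.

```python
import numpy as np

def inverse_sample_patches(img, patch, n, n_candidates=256, seed=0):
    """Draw n patch corners with probability proportional to local gradient energy."""
    rng = np.random.default_rng(seed)
    gy, gx = np.gradient(img.astype(float))        # img: 2-D grayscale array
    energy = np.hypot(gx, gy)                      # edge/texture strength per pixel
    H, W = img.shape
    tops = rng.integers(0, H - patch + 1, n_candidates)
    lefts = rng.integers(0, W - patch + 1, n_candidates)
    scores = np.array([energy[t:t + patch, l:l + patch].mean()
                       for t, l in zip(tops, lefts)])
    idx = rng.choice(n_candidates, size=n, p=scores / scores.sum())
    return [(int(tops[i]), int(lefts[i])) for i in idx]
```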

DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing
Yangtian Zhang Zuobai Zhang Bozitao Zhong Sanchit Misra Jian Tang



Research question: How to accurately predict protein side-chain conformations, which is essential for applications in protein structure prediction, design, and protein-protein interactions.
Motivation: Traditional computational methods are time- and labor-intensive, while existing machine learning methods treat the problem as a regression task and ignore the restrictions imposed by constant covalent bond lengths and angles.
Method: Present DiffPack, a torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising in torsional space. To avoid issues from perturbing all four torsional angles simultaneously, the four angles are generated autoregressively from $\chi_1$ to $\chi_4$, with a diffusion model trained for each angle.
Results: On several protein side-chain packing benchmarks, the model improves angle accuracy by 11.9% on CASP13 and 13.5% on CASP14 with a significantly smaller model size (60x fewer parameters); it also effectively enhances side-chain predictions in the AlphaFold2 model.

Proteins play a critical role in carrying out biological functions, and their 3D structures are essential in determining their functions. Accurately predicting the conformation of protein side-chains given their backbones is important for applications in protein structure prediction, design and protein-protein interactions. Traditional methods are computationally intensive and have limited accuracy, while existing machine learning methods treat the problem as a regression task and overlook the restrictions imposed by the constant covalent bond lengths and angles. In this work, we present DiffPack, a torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising on the torsional space. To avoid issues arising from simultaneous perturbation of all four torsional angles, we propose autoregressively generating the four torsional angles from $\chi_1$ to $\chi_4$ and training diffusion models for each torsional angle. We evaluate the method on several benchmarks for protein side-chain packing and show that our method achieves improvements of 11.9% and 13.5% in angle accuracy on CASP13 and CASP14, respectively, with a significantly smaller model size ($60\times$ fewer parameters). Additionally, we show the effectiveness of our method in enhancing side-chain predictions in the AlphaFold2 model. Code is available at https://github.com/DeepGraphLearning/DiffPack.

Inserting Anybody in Diffusion Models via Celeb Basis
Ge Yuan Xiaodong Cun Yong Zhang Maomao Li Chenyang Qi Xintao Wang Ying Shan Huicheng Zheng



Research question: How to customize a pretrained large text-to-image model, such as Stable Diffusion, with the user's own unique concept.
Motivation: New concepts added by existing customization methods show weaker abilities to combine with the original concepts, even given several training images.
Method: Propose a new personalization method that seamlessly integrates a unique individual into the pre-trained diffusion model using just one facial photograph and only 1024 learnable parameters, within 3 minutes.
Results: The new identity in the customized model shows better concept combination ability than previous personalization methods, and the model can learn several new identities at once and have them interact with each other.

Exquisite demand exists for customizing the pretrained large text-to-image model, $e.g.$ Stable Diffusion, to generate innovative concepts, such as the users themselves. However, the newly-added concept from previous customization methods often shows weaker combination abilities than the original ones even given several images during training. We thus propose a new personalization method that allows for the seamless integration of a unique individual into the pre-trained diffusion model using just $one\ facial\ photograph$ and only $1024\ learnable\ parameters$ under $3\ minutes$. So we can effortlessly generate stunning images of this person in any pose or position, interacting with anyone and doing anything imaginable from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pre-trained large text encoder. Then, given one facial photo as the target identity, we generate its own embedding by optimizing the weight of this basis and locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model showcases a better concept combination ability than previous personalization methods. Besides, our model can also learn several new identities at once and interact with each other where the previous customization model fails to. Project page is at: http://celeb-basis.github.io. Code is at: https://github.com/ygtxr1997/CelebBasis.

SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
Yuseung Lee Kunho Kim Hyunjin Kim Minhyuk Sung



Research question: When existing image diffusion models stitch multiple images into a montage, visible seams often appear in the result, and outputs that blend different scenes are frequently incoherent.
Motivation: To address this, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent on a perceptual similarity loss.
Method: At each denoising step, we compute the gradient of the perceptual loss on the predicted denoised images, providing meaningful guidance for generating coherent montages.
Results: Experiments show our method produces substantially more coherent outputs than previous approaches (66.35% vs. 33.65% in our user study) while maintaining fidelity and compatibility with the input prompt. We demonstrate the method's versatility on three plug-and-play applications: layout-guided image generation, conditional image generation, and 360-degree panorama generation.

The remarkable capabilities of pretrained image diffusion models have been utilized not only for generating fixed-size images but also for creating panoramas. However, naive stitching of multiple images often results in visible seams. Recent techniques have attempted to address this issue by performing joint diffusions in multiple windows and averaging latent features in overlapping regions. However, these approaches, which focus on seamless montage generation, often yield incoherent outputs by blending different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss. Specifically, we compute the gradient of the perceptual loss using the predicted denoised images at each denoising step, providing meaningful guidance for achieving coherent montages. Our experimental results demonstrate that our method produces significantly more coherent outputs compared to previous methods (66.35% vs. 33.65% in our user study) while still maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score). We further demonstrate the versatility of our method across three plug-and-play applications: layout-guided image generation, conditional image generation and 360-degree panorama generation. Our project page is at https://syncdiffusion.github.io.
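
A hedged sketch of the synchronization step, using LPIPS as the perceptual loss; `predict_x0` (the model's noise-to-clean prediction) and `decode` (latent-to-image decoding) are placeholders, and the guidance weight is an assumed value, not the paper's setting:

```python
import torch
import lpips

percep = lpips.LPIPS(net="vgg")              # perceptual similarity loss

def sync_step(latents, predict_x0, decode, weight=20.0):
    """One synchronization pass over per-window latents at the current timestep."""
    anchor_img = decode(predict_x0(latents[0])).detach()
    synced = [latents[0]]
    for z in latents[1:]:
        z = z.detach().requires_grad_(True)
        img = decode(predict_x0(z))           # the window's foreseen denoised image
        loss = percep(img, anchor_img).mean()
        grad = torch.autograd.grad(loss, z)[0]
        synced.append((z - weight * grad).detach())  # descend toward coherence
    return synced
```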

Norm-guided latent space exploration for text-to-image generation
Dvir Samuel Rami Ben-Ari Nir Darshan Haggai Maron Gal Chechik



Research question: This paper studies the latent-space structure of initial seeds in current diffusion models, its impact on generating various concepts, and the problems with existing seed-manipulation methods.
Motivation: Simple operations such as interpolation and finding the centroid of a set of seeds perform poorly under standard Euclidean or spherical latent-space metrics. Moreover, current training procedures expose diffusion models only to inputs within a narrow range of norm values, which has strong implications for methods that rely on seed manipulation for image generation, especially in few-shot and long-tail learning tasks.
Method: We propose a new interpolation method that defines a non-Euclidean metric accounting for a norm-based prior on seeds, describe a simple yet effective algorithm for approximating this interpolation, and use it to further define centroids in the latent seed space.
Results: Experiments show the new interpolation and centroid techniques significantly enhance the generation of rare-concept images and achieve state-of-the-art performance on few-shot and long-tail benchmarks, improving on prior methods in generation speed, image quality, and semantic content.

Text-to-image diffusion models show great potential in synthesizing a large variety of concepts in new compositions and scenarios. However, the latent space of initial seeds is still not well understood and its structure was shown to impact the generation of various concepts. Specifically, simple operations like interpolation and finding the centroid of a set of seeds perform poorly when using standard Euclidean or spherical metrics in the latent space. This paper makes the observation that, in current training procedures, diffusion models observed inputs with a narrow range of norm values. This has strong implications for methods that rely on seed manipulation for image generation, with applications to few-shot and long-tail learning tasks. To address this issue, we propose a novel method for interpolating between two seeds and demonstrate that it defines a new non-Euclidean metric that takes into account a norm-based prior on seeds. We describe a simple yet efficient algorithm for approximating this interpolation procedure and use it to further define centroids in the latent seed space. We show that our new interpolation and centroid techniques significantly enhance the generation of rare concept images. This further leads to state-of-the-art performance on few-shot and long-tail benchmarks, improving prior approaches in terms of generation speed, image quality, and semantic content.
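
One illustrative approximation of norm-aware seed interpolation (not the paper's exact metric or algorithm): interpolate the direction on the sphere and the norm separately, so intermediate seeds keep norms inside the narrow range the model saw during training.

```python
import torch

def slerp(u, v, t):
    """Spherical interpolation between unit vectors u and v."""
    omega = torch.arccos((u * v).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * omega) * u + torch.sin(t * omega) * v) / torch.sin(omega)

def norm_aware_interp(x, y, t):
    direction = slerp(x / x.norm(), y / y.norm(), t)
    radius = (1 - t) * x.norm() + t * y.norm()   # keep the norm in-distribution
    return radius * direction

seed_a, seed_b = torch.randn(4 * 64 * 64), torch.randn(4 * 64 * 64)
midpoint = norm_aware_interp(seed_a, seed_b, 0.5)  # norm stays near that of the endpoints
```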

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
Wenliang Zhao Lujia Bai Yongming Rao Jie Zhou Jiwen Lu



Research question: Diffusion probabilistic models (DPMs) show strong capabilities in high-resolution image synthesis, but sampling from a pre-trained DPM is time-consuming because the denoising network must be evaluated many times.
Motivation: Despite progress in fast sampler design, existing methods still cannot generate satisfactory images in many applications when few steps (e.g., 10) are used.
Method: We develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) supporting arbitrary order. Combining UniP and UniC, we propose UniPC, a unified predictor-corrector framework for fast DPM sampling.
Results: We validate the method through extensive experiments, including unconditional and conditional sampling with pixel-space and latent-space DPMs. With only 10 function evaluations, UniPC achieves 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional). Code is available at https://github.com/wl-zhao/UniPC.

Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., $<$10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC.

Learning Modulated Transformation in GANs
Ceyuan Yang Qihang Zhang Yinghao Xu Jiapeng Zhu Yujun Shen Bo Dai



Research question: How to improve style-based generators' ability to handle cross-instance variation and geometric deformation in data?
Motivation: Existing style-based generators introduce instance-wise stochasticity via convolutions at fixed locations, limiting their capacity to model geometric deformation.
Method: We propose a modulated transformation module (MTM) that applies convolution at variable locations for different instances, giving the model an additional degree of freedom to handle geometric deformation.
Results: Experiments show the approach generalizes to various generative tasks, including image generation, 3D-aware image synthesis, and video generation, and is compatible with state-of-the-art frameworks without any hyper-parameter tuning. On the challenging TaiChi dataset, it improves the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometric transformation.

The success of style-based generators largely benefits from style modulation, which helps take care of the cross-instance variation within data. However, the instance-wise stochasticity is typically introduced via regular convolution, where kernels interact with features at some fixed locations, limiting its capacity for modeling geometric variation. To alleviate this problem, we equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed as modulated transformation module (MTM). This module predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations for different instances, and hence offers the model an additional degree of freedom to handle geometry deformation. Extensive experiments suggest that our approach can be faithfully generalized to various generative tasks, including image generation, 3D-aware image synthesis, and video generation, and get compatible with state-of-the-art frameworks without any hyper-parameter tuning. It is noteworthy that, towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation. Code and models are available at https://github.com/limbo0000/mtm.
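
A minimal sketch of the modulated-transformation idea (not the official MTM): offsets are predicted under the control of the latent code and the convolution is applied at variable locations via torchvision's deformable convolution. Architecture choices here (style modulation via a linear layer, a single offset group) are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedTransform(nn.Module):
    def __init__(self, channels, latent_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.02)
        self.to_style = nn.Linear(latent_dim, channels)          # latent modulates features
        self.to_offset = nn.Conv2d(channels, 2 * k * k, 3, padding=1)  # per-location offsets
        self.k = k

    def forward(self, x, w):
        style = self.to_style(w).view(x.shape[0], -1, 1, 1)
        offsets = self.to_offset(x * style)       # offsets depend on the latent code
        return deform_conv2d(x, offsets, self.weight, padding=self.k // 2)

x, w = torch.randn(2, 64, 16, 16), torch.randn(2, 128)
out = ModulatedTransform(64, 128)(x, w)           # same spatial size, deformed sampling
```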

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Dongxu Li Junnan Li Steven Hoi



Research question: How to improve the generation efficiency and subject fidelity of subject-driven text-to-image generation models.
Motivation: Existing subject-driven text-to-image models suffer from lengthy fine-tuning and have difficulty preserving subject fidelity.
Method: We propose BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, taking subject images and text prompts as input. The model introduces a new multimodal encoder pre-trained to provide subject representation.
Results: Compared with previous methods such as DreamBooth, BLIP-Diffusion enables zero-shot subject-driven generation and can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt for novel subject-driven generation and editing applications.

Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Implementations are available at: https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion.

Improving Diffusion-Based Image Synthesis with Context Prediction
Ling Yang Jingwei Liu Shenda Hong Zhilong Zhang Zhilin Huang Zheming Cai Wentao Zhang Bin CUI



Research question: Existing diffusion models mainly reconstruct the input image under pixel-wise or feature-wise constraints, which may fail to preserve the neighborhood context of each predicted pixel/feature and thus impair diffusion-based image synthesis.
Motivation: To address this, we propose ConPreDiff, the first model to improve diffusion-based image synthesis with context prediction.
Method: During training, we append a context decoder at the end of the diffusion denoising blocks so that each point predicts its neighborhood context (i.e., multi-stride pixels/features); the decoder is removed at inference. In this way, each point reconstructs itself better while preserving its semantic connections with the neighborhood context.
Results: ConPreDiff performs strongly on unconditional image generation, text-to-image generation, and image inpainting. On MS-COCO, it sets a new SOTA for text-to-image generation with a zero-shot FID of 6.21.

Diffusion models are a new class of generative models, and have dramatically promoted image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride pixels/features) with a context decoder at the end of diffusion denoising blocks in training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
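
An illustrative sketch of the context-prediction auxiliary loss, under assumed shapes: each feature point is trained, through a small decoder head, to predict its axial neighbors at several strides; the decoder is dropped at inference. Head design and strides are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

strides = (1, 2, 4)

class ContextDecoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # one head per stride, each predicting the 4 axial neighbors at that stride
        self.heads = nn.ModuleList([nn.Conv2d(dim, 4 * dim, 1) for _ in strides])

    def forward(self, feat):
        loss = 0.0
        for head, s in zip(self.heads, strides):
            pred = head(feat).chunk(4, dim=1)            # predicted L/R/U/D neighbors
            for p, (dy, dx) in zip(pred, [(0, -s), (0, s), (-s, 0), (s, 0)]):
                target = torch.roll(feat, shifts=(dy, dx), dims=(2, 3)).detach()
                loss = loss + F.mse_loss(p, target)
        return loss / (4 * len(strides))

feat = torch.randn(2, 32, 16, 16)        # features from the end of a denoising block
aux_loss = ContextDecoder(32)(feat)      # added to the usual diffusion training loss
```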

Crystal Structure Prediction by Joint Equivariant Diffusion
Rui Jiao Wenbing Huang Peijia Lin Jiaqi Han Pin Chen Yutong Lu Yang Liu



Research question: This paper addresses crystal structure prediction (CSP), a core task across scientific disciplines that poses unique challenges due to the symmetries of crystal structures.
Motivation: Although existing generative models (e.g., diffusion models) can be applied to CSP, the task is uniquely challenging because of the symmetries of crystal structures: invariance to translation, rotation, and periodicity.
Method: We propose DiffCSP, a novel diffusion model that learns the structure distribution of stable crystals using a periodic-E(3)-equivariant denoising model, to better capture crystal geometry.
Results: Experiments show that DiffCSP significantly outperforms existing CSP methods, at a much lower computational cost than density functional theory (DFT)-based approaches. Its advantages persist when extended to ab initio crystal generation.

Crystal Structure Prediction (CSP) is crucial in various scientific disciplines. While CSP can be addressed by employing currently-prevailing generative models (**e.g.** diffusion models), this task encounters unique challenges owing to the symmetric geometry of crystal structures---the invariance of translation, rotation, and periodicity. To incorporate the above symmetries, this paper proposes DiffCSP, a novel diffusion model to learn the structure distribution from stable crystals. To be specific, DiffCSP jointly generates the lattice and atom coordinates for each crystal by employing a periodic-E(3)-equivariant denoising model, to better model the crystal geometry. Notably, different from related equivariant generative approaches, DiffCSP leverages fractional coordinates other than Cartesian coordinates to represent crystals, remarkably promoting the diffusion and the generation process of atom positions. Extensive experiments verify that our DiffCSP remarkably outperforms existing CSP methods, with a much lower computation cost in contrast to DFT-based methods. Moreover, the superiority of DiffCSP is still observed when it is extended for ab initio crystal generation.

Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
Yuval Kirstain Adam Polyak Uriel Singer Shahbuland Matiana Joe Penna Omer Levy



Research question: How to collect a large-scale dataset of human preferences over text-to-image outputs and use it for model evaluation and optimization.
Motivation: Large human text-to-image preference datasets are usually held by companies and inaccessible to the public. To address this, we created a web app that lets text-to-image users generate images and specify their preferences.
Method: Using this web app, we built Pick-a-Pic, a large open dataset of text-to-image prompts and real users' preferences over generated images, and used it to train PickScore, a CLIP-based scoring function that exhibits superhuman performance at predicting human preferences.
Results: Our experiments show that PickScore correlates better with human rankings than other automatic evaluation metrics when performing model evaluation. We therefore recommend using PickScore to evaluate future text-to-image models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we show how PickScore can enhance existing text-to-image models via ranking.

The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users’ preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. Then, we test PickScore’s ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics. Therefore, we recommend using PickScore for evaluating future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking.
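
A hedged sketch of CLIP-style preference scoring in the spirit of PickScore: a text-image similarity score per candidate, turned into a preference distribution via softmax. It uses the generic openai/clip-vit-base-patch32 weights for illustration, not the actual PickScore checkpoint or training recipe.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def preference_probs(prompt, images):
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_text[0]      # similarity of the prompt to each image
    return scores.softmax(dim=-1)        # preference distribution over candidates

imgs = [Image.new("RGB", (224, 224), c) for c in ("red", "blue")]
print(preference_probs("a photo of a red square", imgs))
```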

Contrastive Sampling Chains in Diffusion Models
Junyu Zhang Daochang Liu Shichao Zhang Chang Xu



Research question: How to reduce the discretization error introduced by numerical SDE solvers when using diffusion models to generate high-fidelity images.
Motivation: Discretization error is an unavoidable limitation when solving SDEs with numerical solvers.
Method: We combine a contrastive loss with score matching to construct a contrastive sampling chain for fine-tuning the pre-trained diffusion model, reducing the discretization error and thus narrowing the gap between the true data distribution and the model distribution.
Results: Experiments on CIFAR10 show that the method significantly improves the quality of generated images and reduces the number of neural function evaluations required.

The past few years have witnessed great success in the use of diffusion models (DMs) to generate high-fidelity images with the help of stochastic differential equations (SDEs). However, discretization error is an inevitable limitation when utilizing numerical solvers to solve SDEs. To address this limitation, we provide a theoretical analysis demonstrating that an appropriate combination of the contrastive loss and score matching serves as an upper bound of the KL divergence between the true data distribution and the model distribution. To obtain this bound, we utilize a contrastive loss to construct a contrastive sampling chain to fine-tune the pre-trained DM. In this manner, our method reduces the discretization error and thus yields a smaller gap between the true data distribution and our model distribution. Moreover, the presented method can be applied to fine-tuning various pre-trained DMs, with or without fast sampling algorithms, contributing to better sample quality or slightly faster sampling speeds. To validate the efficacy of our method, we conduct comprehensive experiments. For example, on CIFAR10, when applied to a pre-trained EDM, our method improves the FID from 2.04 to 1.88 with 35 neural function evaluations (NFEs), and reduces NFEs from 35 to 25 to achieve the same 2.04 FID.

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
Qingkai Fang Yan Zhou Yang Feng



Research question: How to achieve both high-quality and fast speech-to-speech translation.
Motivation: Due to linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, which poses challenges for speech-to-speech translation models.
Method: We propose DASpeech, a non-autoregressive direct S2ST model with a two-pass architecture that decomposes generation into two steps: a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech from the linguistic decoder's hidden states.
Results: On the CVSS Fr$\rightarrow$En benchmark, DASpeech matches or exceeds the state-of-the-art S2ST model Translatotron 2 while running up to 18.53× faster than the autoregressive baseline. Compared with previous non-autoregressive S2ST models, DASpeech achieves significant improvements in both translation quality and decoding speed, and preserves the speaker's voice from the source speech.

Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we calculate the expected hidden states for each target token via dynamic programming, and feed them into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take hidden states on that path as input to the acoustic decoder. Experiments on the CVSS Fr$\rightarrow$En benchmark demonstrate that DASpeech can achieve comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53$\times$ speedup compared to the autoregressive baseline. Compared with the previous non-autoregressive S2ST model, DASpeech does not rely on knowledge distillation and iterative decoding, achieving significant improvements in both translation quality and decoding speed. Furthermore, DASpeech shows the ability to preserve the speaker's voice of the source speech during translation.

Efficient Test-Time Adaptation for Super-Resolution with Second-Order Degradation and Reconstruction
Zeshuai Deng Zhuokun Chen Shuaicheng Niu Thomas H. Li Bohan Zhuang Mingkui Tan



Research question: How to quickly adapt image super-resolution (SR) models at test time to domains with different or unknown degradation types.
Motivation: The degradation of real test images often mismatches the degradation assumed during training; existing methods that estimate the degradation and train an image-specific model are time-consuming, and they largely focus on a single degradation type (e.g., blur), overlooking noise and JPEG artifacts in real-world scenarios.
Method: We present SRTTA, an efficient test-time adaptation framework for SR. A pre-trained degradation classifier predicts the degradation type of the test image, a second-order degradation scheme constructs paired data accordingly, and the SR model is adapted via feature-level reconstruction learning from the initial test image to its second-order degraded counterparts.
Results: Extensive experiments on newly synthesized corrupted DIV2K datasets with 8 degradation types and several real-world datasets show that SRTTA achieves an impressive improvement over existing methods at satisfying speed.

Image super-resolution (SR) aims to learn a mapping from low-resolution (LR) to high-resolution (HR) using paired HR-LR training images. Conventional SR methods typically gather the paired training data by synthesizing LR images from HR images using a predetermined degradation model, e.g., Bicubic down-sampling. However, the realistic degradation type of test images may mismatch with the training-time degradation type due to the dynamic changes of the real-world scenarios, resulting in inferior-quality SR images. To address this, existing methods attempt to estimate the degradation model and train an image-specific model, which, however, is quite time-consuming and impracticable to handle rapidly changing domain shifts. Moreover, these methods largely concentrate on the estimation of one degradation type (e.g., blur degradation), overlooking other degradation types like noise and JPEG in real-world test-time scenarios, thus limiting their practicality. To tackle these problems, we present an efficient test-time adaptation framework for SR, named SRTTA, which is able to quickly adapt SR models to test domains with different/unknown degradation types. Specifically, we design a second-order degradation scheme to construct paired data based on the degradation type of the test image, which is predicted by a pre-trained degradation classifier. Then, we adapt the SR model by implementing feature-level reconstruction learning from the initial test image to its second-order degraded counterparts, which helps the SR model generate plausible HR images. Extensive experiments are conducted on newly synthesized corrupted DIV2K datasets with 8 different degradations and several real-world datasets, demonstrating that our SRTTA framework achieves an impressive improvement over existing methods with satisfying speed. The source code is available at https://github.com/DengZeshuai/SRTTA.

A Unified Conditional Framework for Diffusion-based Image Restoration
Yi Zhang Xiaoyu Shi Dasong Li Xiaogang Wang Jian Wang Hongsheng Li



Research question: How to integrate conditional information into diffusion probabilistic models to guide image restoration tasks.
Motivation: Existing diffusion probabilistic models excel at image generation, but how to integrate conditional information to produce accurate and natural restorations has been largely overlooked.
Method: We present a unified conditional framework based on diffusion models for image restoration. A lightweight UNet predicts an initial guidance, and the diffusion model learns the residual of that guidance. By carefully designing the basic and integration modules of the diffusion model block, the guidance and other auxiliary conditional information are injected into every block, achieving spatially adaptive generation conditioning. For high-resolution images, we propose a simple yet effective inter-step patch-splitting strategy that produces arbitrary-resolution images without grid artifacts.
Results: We evaluate the framework on three challenging tasks, extreme low-light denoising, deblurring, and JPEG restoration, demonstrating significant improvements in perceptual quality and generalization to restoration tasks.

Diffusion Probabilistic Models (DPMs) have recently shown remarkable performance in image generation tasks, which are capable of generating highly realistic images. When adopting DPMs for image restoration tasks, the crucial aspect lies in how to integrate the conditional information to guide the DPMs to generate accurate and natural output, which has been largely overlooked in existing works. In this paper, we present a unified conditional framework based on diffusion models for image restoration. We leverage a lightweight UNet to predict initial guidance and the diffusion model to learn the residual of the guidance. By carefully designing the basic module and integration module for the diffusion model block, we integrate the guidance and other auxiliary conditional information into every block of the diffusion model to achieve spatially-adaptive generation conditioning. To handle high-resolution images, we propose a simple yet effective inter-step patch-splitting strategy to produce arbitrary-resolution images without grid artifacts. We evaluate our conditional framework on three challenging tasks: extreme low-light denoising, deblurring, and JPEG restoration, demonstrating its significant improvements in perceptual quality and the generalization to restoration tasks. The code will be released at https://zhangyi-3.github.io/project/UCDIR/.

StyleDrop: Text-to-Image Synthesis of Any Style
Kihyuk Sohn Lu Jiang Jarred Barber Kimin Lee Nataniel Ruiz Dilip Krishnan Huiwen Chang Yuanzhen Li Irfan Essa Michael Rubinstein Yuan Hao Glenn Entis Irina Blok Daniel Castro Chin



Research question: How to synthesize impressive images in arbitrary styles using pre-trained large text-to-image models with suitable text prompts.
Motivation: The inherent ambiguity of natural language and out-of-distribution effects make it hard to synthesize arbitrary image styles that leverage a specific design pattern, texture, or material.
Method: We introduce *StyleDrop*, a method that enables image synthesis faithfully following a specific style with a text-to-image model. StyleDrop is extremely versatile and captures the nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects.
Results: By efficiently learning a new style through fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving quality via iterative training with human or automated feedback, StyleDrop delivers impressive results even when the user supplies only a single image of the desired style. For the task of style-tuning text-to-image models, StyleDrop on Muse outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion.

Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language, and out-of-distribution effects make it hard to synthesize arbitrary image styles, leveraging a specific design pattern, texture or material. In this paper, we introduce *StyleDrop*, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. StyleDrop is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. StyleDrop works by efficiently learning a new style by fine-tuning very few trainable parameters (less than 1\% of total model parameters), and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a *single* image specifying the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: [https://styledrop.github.io](https://styledrop.github.io).

Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation
Wengong Jin Siranush Sarkizova Xun Chen Nir Hacohen Caroline Uhler



Research question: This paper addresses protein-ligand binding prediction, especially for antibody-type ligands where labelled data is limited.
Motivation: Traditional supervised learning works well for small-molecule ligands but is hard to apply to antibody-type ligands with scarce labels. We therefore explore unsupervised learning and reformulate binding energy prediction as a generative modeling task.
Method: We train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching (DSM) and interpret its log-likelihood as binding energy. Our main contribution is a new equivariant rotation prediction network, Neural Euler's Rotation Equations (NERE), for SE(3) DSM.
Results: On two protein-ligand and antibody-antigen binding affinity benchmarks, NERE outperforms all unsupervised baselines (physics-based potentials and protein language models) in both cases and surpasses supervised baselines in the antibody case.

Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Previous work focused on supervised learning methods for small molecules where binding affinity data is abundant, but it is hard to apply the same strategy to other ligand classes like antibodies where labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching (DSM) and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) DSM. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. Using two protein-ligand and antibody-antigen binding affinity prediction benchmarks, we show that NERE outperforms all unsupervised baselines (physics-based potentials and protein language models) in both cases and surpasses supervised baselines in the antibody case.

Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback
TaeHo Yoon Kibeom Myoung Keon Lee Jaewoong Cho Albert No Ernest K. Ryu



Research question: Pre-trained diffusion models excel at high-quality image generation but sometimes produce undesirable images; how can this be prevented?
Motivation: Train a reward model from minimal human feedback to achieve censored generation with a pre-trained diffusion model.
Method: Perform censored generation using a reward model trained on a small amount of human feedback.
Results: We show that censoring can be accomplished with extreme human-feedback efficiency: labels generated with a mere few minutes of human feedback are sufficient.

Diffusion models have recently shown remarkable success in high-quality image generation. Sometimes, however, a pre-trained diffusion model exhibits partial misalignment in the sense that the model can generate good images, but it sometimes outputs undesirable images. If so, we simply need to prevent the generation of the bad images, and we call this task censoring. In this work, we present censored generation with a pre-trained diffusion model using a reward model trained on minimal human feedback. We show that censoring can be accomplished with extreme human feedback efficiency and that labels generated with a mere few minutes of human feedback are sufficient.
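
A loose sketch of one way a learned reward can steer the reverse process; the paper's exact guidance scheme may differ, and `predict_x0`, `ddpm_step`, and `reward_model` are placeholders for the model's clean-sample prediction, the ordinary reverse update, and the feedback-trained reward.

```python
import torch

def censored_update(x_t, t, predict_x0, ddpm_step, reward_model, scale=1.0):
    """One guided reverse step: nudge the latent toward higher reward."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t, t)                        # foreseen clean sample
    log_r = torch.log(reward_model(x0_hat).clamp_min(1e-8)).sum()
    grad = torch.autograd.grad(log_r, x_t)[0]          # direction of higher reward
    return ddpm_step(x_t.detach(), t) + scale * grad   # guided reverse step
```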

GlyphControl: Glyph Conditional Control for Visual Text Generation
Yukang Yang Dongnan Gui Yuhui Yuan Weicong Liang Haisong Ding Han Hu Kai Chen



Research question: Developing a diffusion-based text-to-image model capable of generating coherent and well-formed visual text.
Motivation: Existing methods rely on character-aware text encoders and require retraining text-to-image models; our approach instead introduces additional glyph conditional information to improve the performance of the off-the-shelf Stable-Diffusion model at generating accurate visual text.
Method: We propose GlyphControl, a new method that incorporates glyph instructions so users can customize the content, location, and size of the generated text according to their specific requirements.
Results: Measuring OCR-based metrics, CLIP score, and FID of the generated visual text, our empirical evaluation shows that GlyphControl outperforms the recent DeepFloyd IF approach on OCR accuracy, CLIP score, and FID, highlighting the effectiveness of our method.

Recently, there has been an increasing interest in developing diffusion-based text-to-image generative models capable of generating coherent and well-formed visual text. In this paper, we propose a novel and efficient approach called GlyphControl to address this task. Unlike existing methods that rely on character-aware text encoders like ByT5 and require retraining of text-to-image models, our approach leverages additional glyph conditional information to enhance the performance of the off-the-shelf Stable-Diffusion model in generating accurate visual text. By incorporating glyph instructions, users can customize the content, location, and size of the generated text according to their specific requirements. To facilitate further research in visual text generation, we construct a training benchmark dataset called LAION-Glyph. We evaluate the effectiveness of our approach by measuring OCR-based metrics, CLIP score, and FID of the generated visual text. Our empirical evaluations demonstrate that GlyphControl outperforms the recent DeepFloyd IF approach in terms of OCR accuracy, CLIP score, and FID, highlighting the efficacy of our method.

HubRouter: Learning Global Routing via Hub Generation and Pin-hub Connection
Xingbo Du Chonghua Wang Ruizhe Zhong Junchi Yan



Research question: This paper addresses Global Routing (GR), a core task in VLSI systems, in particular how to generate definitely-connected routes with machine learning.
Motivation: Although generative models have been applied to global routing, the generated routes lack connectivity and must be post-processed with traditional approaches. We therefore propose a new definition, the "hub", which turns global routing from a pin-pin connection problem into a hub-pin connection problem.
Method: We propose HubRouter, a two-phase learning scheme: 1) a hub-generation phase with a condition-guided hub generator based on deep generative models; 2) a pin-hub-connection phase in which an RSMT construction module connects hubs and pins using an actor-critic model. In the first phase, we incorporate typical generative models into a multi-task learning framework for hub generation and use stripe mask learning to address the impact of sensitive noise points. In the second phase, HubRouter employs an actor-critic model to finish the routing, which is efficient and has very slight errors.
Results: Experiments on simulated and real-world global routing benchmarks show that HubRouter outperforms state-of-the-art generative global routing methods in wirelength, overflow, and running time, and is also strong in other applications such as RSMT construction and interactive path replanning.

Global Routing (GR) is a core yet time-consuming task in VLSI systems. It recently attracted efforts from the machine learning community, especially generative models, but they suffer from the non-connectivity of generated routes. We argue that the inherent non-connectivity can harm the advantage of its one-shot generation and has to be post-processed by traditional approaches. Thus, we propose a novel definition, called hub, which represents the key point in the route. Equipped with hubs, global routing is transferred from a pin-pin connection problem to a hub-pin connection problem. Specifically, to generate definitely-connected routes, this paper proposes a two-phase learning scheme named HubRouter, which includes 1) hub-generation phase: A condition-guided hub generator using deep generative models; 2) pin-hub-connection phase: An RSMT construction module that connects the hubs and pins using an actor-critic model. In the first phase, we incorporate typical generative models into a multi-task learning framework to perform hub generation and address the impact of sensitive noise points with stripe mask learning. During the second phase, HubRouter employs an actor-critic model to finish the routing, which is efficient and has very slight errors. Experiments on simulated and real-world global routing benchmarks demonstrate our approach's efficiency; in particular, HubRouter outperforms the state-of-the-art generative global routing methods in wirelength, overflow, and running time. Moreover, HubRouter also shows strength in other applications, such as RSMT construction and interactive path replanning.

Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Zeju Qiu Weiyang Liu Haiwen Feng Yuxuan Xue Yao Feng Zhen Liu Dan Zhang Adrian Weller Bernhard Schölkopf



Research question: How to effectively guide powerful text-to-image diffusion models to perform different downstream tasks.
Motivation: Existing methods cannot effectively control these powerful models, so a principled finetuning method is needed to adapt them to downstream tasks.
Method: We introduce Orthogonal Finetuning (OFT), a principled finetuning method that preserves the pairwise neuron relationship on the unit hypersphere, thereby retaining the semantic generation ability of text-to-image diffusion models.
Results: Experiments show that the OFT framework outperforms existing methods in both generation quality and convergence speed.

Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
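
A minimal sketch of the orthogonal-finetuning idea: the frozen pretrained weight is multiplied by a learned orthogonal matrix, kept orthogonal here via the Cayley transform of a skew-symmetric matrix so pairwise neuron angles are preserved throughout training. OFT's block-diagonal structure and COFT's radius constraint are omitted; this is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class OrthogonalFinetune(nn.Module):
    def __init__(self, pretrained_weight):
        super().__init__()
        d = pretrained_weight.shape[0]
        self.register_buffer("W", pretrained_weight)   # frozen pretrained weight
        self.A = nn.Parameter(torch.zeros(d, d))       # only this is trained

    def forward(self, x):
        S = self.A - self.A.T                          # skew-symmetric => Cayley gives orthogonal R
        I = torch.eye(S.shape[0], device=S.device)
        R = torch.linalg.solve(I + S, I - S)           # R = (I + S)^{-1} (I - S)
        return x @ (R @ self.W).T                      # rotate neurons, keep their angles

layer = OrthogonalFinetune(torch.randn(64, 32))
y = layer(torch.randn(4, 32))                          # (4, 64)
```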

One-Step Diffusion Distillation via Deep Equilibrium Models
Zhengyang Geng Ashwini Pokle J Zico Kolter



Research question: How to distill diffusion models into fast, high-quality one-step image generators.
Motivation: Existing distillation approaches require multiple training stages and complex procedures, and the resulting models perform poorly in single-step generation.
Method: We propose distilling a diffusion model directly from the initial noise to the resulting image, leveraging a Deep Equilibrium (DEQ) model, the Generative Equilibrium Transformer (GET), as the distilled architecture. The method trains fully offline using only noise/image pairs from the diffusion model and outperforms existing one-step methods under comparable training budgets.
Results: GET matches a 5× larger ViT in FID score while striking a critical balance between computational cost and image quality.

Diffusion models excel at producing high-quality samples but naively require hundreds of iterations, prompting multiple attempts to distill the generation process into a faster network. However, many existing approaches suffer from a variety of challenges: the process for distillation training can be complex, often requiring multiple training stages, and the resulting models perform poorly when utilized in single-step generative applications. In this paper, we introduce a simple yet effective means of distilling diffusion models *directly* from the initial noise to the resulting image. Of particular importance to our approach is to leverage a new Deep Equilibrium (DEQ) model as the distilled architecture: the Generative Equilibrium Transformer (GET). Our method enables fully offline training with just noise/image pairs from the diffusion model while achieving superior performance compared to existing one-step methods on comparable training budgets. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores while striking a critical balance of computational cost and image quality. Code, checkpoints, and datasets are available [here](https://github.com/locuslab/get).
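
Since the training is fully offline, the recipe reduces to ordinary regression on teacher-generated pairs. A schematic loop, with `student` (the GET architecture in the paper; any noise-to-image network here) and `pairs` as placeholders:

```python
import torch
import torch.nn.functional as F

def distill(student, pairs, epochs=10, lr=1e-4):
    """Train a one-step student on precomputed (noise, image) pairs from the teacher."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for noise, image in pairs:           # pairs sampled offline by the diffusion teacher
            loss = F.mse_loss(student(noise), image)
            opt.zero_grad(); loss.backward(); opt.step()
    return student
```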

Decorate3D: Text-Driven High-Quality Texture Generation for Mesh Decoration in the Wild
Yanhui Guo Xinxin Zuo Peng Dai Juwei Lu Xiaolin Wu Li Cheng Youliang Yan Songcen Xu Xiaofei Wu



Research question: This paper presents a versatile, user-friendly method for creating and editing 3D objects using images.
Motivation: Existing methods for creating and editing 3D objects require expertise and involve complex pipelines.
Method: Decorate3D models a real-world object of interest with a neural radiance field (NeRF) and decomposes the NeRF representation into an explicit mesh representation, a view-dependent texture, and a diffuse UV texture. Users can then edit the UV manually or provide a prompt to automatically generate a new 3D-consistent texture.
Results: With a structure-aware score distillation sampling method, a few-view resampling training method, and a super-resolution model for obtaining high-resolution (2048×2048) UV textures, Decorate3D shows superior performance in retexturing real-world 3D objects.

This paper presents Decorate3D, a versatile and user-friendly method for the creation and editing of 3D objects using images. Decorate3D models a real-world object of interest by neural radiance field (NeRF) and decomposes the NeRF representation into an explicit mesh representation, a view-dependent texture, and a diffuse UV texture. Subsequently, users can either manually edit the UV or provide a prompt for the automatic generation of a new 3D-consistent texture. To achieve high-quality 3D texture generation, we propose a structure-aware score distillation sampling method to optimize a neural UV texture based on user-defined text and empower an image diffusion model with 3D-consistent generation capability. Furthermore, we introduce a few-view resampling training method and utilize a super-resolution model to obtain refined high-resolution UV textures (2048$\times$2048) for 3D texturing. Extensive experiments collectively validate the superior performance of Decorate3D in retexturing real-world 3D objects. Project page: https://decorate3d.github.io/Decorate3D/.

Unifying GANs and Score-Based Diffusion as Generative Particle Models
Jean-Yves Franceschi Mike Gartrell Ludovic Dos Santos Thibaut Issenhuth Emmanuel de Bezenac Mickael Chen Alain Rakotomamonjy



Research question: This paper aims to unify particle-based and adversarial generative models by framing generator training as a generalization of particle models.
Motivation: Particle-based deep generative models, such as gradient flows and score-based diffusion models, have gained traction thanks to their striking performance, but their principle of displacing particle distributions via differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which train a pushforward generator network.
Method: We propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models.
Results: Experiments show the framework is viable: integrating a generator into a score-based diffusion model and training a GAN without a generator emerge naturally from it.

Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper we challenge this interpretation, and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.

Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion
Shengqiong Wu Hao Fei Hanwang Zhang Tat-Seng Chua



Research question: This paper investigates text-to-image (T2I) synthesis in the abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts.
Motivation: Inspired by the intuition of human imagination, we propose a novel scene-graph hallucination (SGH) mechanism for effective abstract-to-intricate T2I synthesis.
Method: Scene hallucination is performed by expanding the initial scene graph (SG) of the input prompt, where the structured semantic representation of the SG ensures high controllability of the intrinsic scene imagination. We build an SG-based hallucination diffusion system to approach the T2I synthesis.
Results: On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially on abstract-to-intricate T2I generation.

In this work, we investigate the task of text-to-image (T2I) synthesis under the abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts. Inspired by human imagination intuition, we propose a novel scene-graph hallucination (SGH) mechanism for effective abstract-to-intricate T2I synthesis. SGH carries out scene hallucination by expanding the initial scene graph (SG) of the input prompt with more feasible specific scene structures, in which the structured semantic representation of SG ensures high controllability of the intrinsic scene imagination. To approach the T2I synthesis, we deliberately build an SG-based hallucination diffusion system. First, we implement the SGH module based on the discrete diffusion technique, which evolves the SG structure by iteratively adding new scene elements. Then, we utilize another continuous-state diffusion model as the T2I synthesizer, where the overt image-generating process is navigated by the underlying semantic scene structure induced from the SGH module. On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially improving on the abstract-to-intricate T2I generation. Further in-depth analyses reveal how our methods advance.

SEGA: Instructing Text-to-Image Models using Semantic Guidance
Manuel Brack Felix Friedrich Dominik Hintersdorf Lukas Struppek Patrick Schramowski Kristian Kersting



Research question: How to make text-to-image diffusion models better match user intent while allowing subtle and extensive edits, composition, and style changes.
Motivation: Although current text-to-image diffusion models can generate high-fidelity images, one-shot generation that matches the user's intent is nearly impossible, and small changes to the input prompt often yield very different images, leaving the user with little semantic control.
Method: We propose semantic guidance (SEGA), which interacts with the diffusion process to flexibly steer it along semantic directions; it applies to any generative architecture that uses classifier-free guidance.
Results: Experiments on latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF demonstrate SEGA's effectiveness across a variety of tasks, showing its versatility and flexibility.

Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user’s intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, composition and style changes, and optimizing the overall artistic conception. We demonstrate SEGA’s effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility and flexibility.
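
A simplified reading of how semantic directions compose with classifier-free guidance: each edit concept contributes its own direction (its conditioned minus unconditioned noise estimate) with its own strength and sign (+ to add a concept, - to remove it). SEGA's thresholding, warm-up, and momentum terms are omitted, and `eps` is a placeholder denoiser call.

```python
def sega_noise(eps, x_t, t, prompt, edits, g=7.5):
    """edits: list of (concept_text, strength), e.g. [("smiling", 3.0), ("glasses", -3.0)]."""
    e_uncond = eps(x_t, t, "")
    e_cond = eps(x_t, t, prompt)
    guided = e_uncond + g * (e_cond - e_uncond)        # standard classifier-free guidance
    for concept, strength in edits:
        e_c = eps(x_t, t, concept)
        guided = guided + strength * (e_c - e_uncond)  # steer along the semantic direction
    return guided
```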

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Simian Luo Chuanhao Yan Chenxu Hu Hang Zhao



Research question: How to improve the generation quality of video-to-audio (V2A) models in terms of temporal synchronization and audio-visual relevance.
Motivation: Existing V2A methods have limited generation quality in terms of temporal synchronization and audio-visual relevance, hurting the quality of the generated audio.
Method: We propose Diff-Foley, a synchronized V2A synthesis method using a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. Contrastive audio-visual pretraining (CAVP) learns more temporally and semantically aligned features, and an LDM is then trained on the spectrogram latent space with CAVP-aligned visual features. These features let the LDM capture subtler audio-visual correlations via a cross-attention module, and "double guidance" further improves sample quality.
Results: Diff-Foley achieves state-of-the-art V2A performance on the current large-scale V2A dataset, and we demonstrate its practical applicability and adaptability via customized downstream finetuning.

The Video-to-Audio (V2A) model has recently gained attention for its practical application in generating audio directly from silent videos, particularly in video/film production. However, previous methods in V2A have limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method with a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features on spectrogram latent space. The CAVP-aligned features enable LDM to capture the subtler audio-visual correlation via a cross-attention module. We further significantly improve sample quality with `double guidance'. Diff-Foley achieves state-of-the-art V2A performance on current large scale V2A dataset. Furthermore, we demonstrate Diff-Foley's practical applicability and adaptability via customized downstream finetuning. Project Page: https://diff-foley.github.io/

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Yinghao Aaron Li Cong Han Vinay S Raghavan Gavin Mischler Nima Mesgarani



Research question: How to achieve human-level text-to-speech (TTS) synthesis through style diffusion and large speech language models.
Motivation: Current TTS models need reference speech to produce a suitable style, which is inefficient.
Method: By modeling style as a latent random variable sampled via diffusion models, the most suitable style for the text is generated without requiring reference speech, while large pre-trained speech language models are used for end-to-end adversarial training to improve speech naturalness.
Results: The model surpasses human recordings on a single-speaker dataset, matches them on a multispeaker dataset, and outperforms previous publicly available models on zero-shot speaker adaptation.

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

Subject-driven Text-to-Image Generation via Apprenticeship Learning
Wenhu Chen Hexiang Hu YANDONG LI Nataniel Ruiz Xuhui Jia Ming-Wei Chang William W. Cohen



Research question: How to reduce the training cost of models that generate images of a specific subject.
Motivation: Current text-to-image generation models require separate fine-tuning for each subject, which is computationally expensive.
Method: We propose SuTI, a subject-driven text-to-image generator that replaces subject-specific optimization with in-context learning. By mining large image clusters from the Internet, we train a massive number of expert models and then have a single apprentice model imitate their behavior.
Results: SuTI rapidly generates high-quality, subject-specific images, significantly outperforming optimization-based baselines and performing strongly in human evaluation.

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an ``expert model'' for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine tuning with {in-context} learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by {apprenticeship learning}, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth.

Directional diffusion models for graph representation learning
Run Yang Yuling Yang Fan Zhou Qiang Sun



Research question: Diffusion models have achieved remarkable success in image synthesis, super-resolution, and 3D molecule generation, yet their application to graph learning has received little attention.
Motivation: Diffusion models are limited when handling the anisotropic structures in graphs: the vanilla forward process keeps adding isotropic Gaussian noise, which can excessively dilute anisotropic signals and cause a rapid signal-to-noise conversion, making it hard to train the denoising network and to obtain semantically meaningful representations.
Method: We propose a new class of models, directional diffusion models, which adopt data-dependent, anisotropic, and directional noise in the forward diffusion process.
Results: Extensive experiments on 12 publicly available datasets, focusing on two distinct graph representation learning tasks, clearly establish that our models outperform state-of-the-art baselines, highlighting their effectiveness in capturing meaningful graph representations.

Diffusion models have achieved remarkable success in diverse domains such as image synthesis, super-resolution, and 3D molecule generation. Surprisingly, the application of diffusion models in graph learning has garnered little attention. In this paper, we aim to bridge this gap by exploring the use of diffusion models for unsupervised graph representation learning. Our investigation commences with the identification of anisotropic structures within graphs and the recognition of a crucial limitation in the vanilla forward diffusion process when dealing with these anisotropic structures. The original forward diffusion process continually adds isotropic Gaussian noise to the data, which may excessively dilute anisotropic signals, leading to rapid signal-to-noise conversion. This rapid conversion poses challenges for training denoising neural networks and obstructs the acquisition of semantically meaningful representations during the reverse process. To overcome this challenge, we introduce a novel class of models termed {\it directional diffusion models}. These models adopt data-dependent, anisotropic, and directional noises in the forward diffusion process. In order to assess the effectiveness of our proposed models, we conduct extensive experiments on 12 publicly available datasets, with a particular focus on two distinct graph representation learning tasks. The experimental results unequivocally establish the superiority of our models over state-of-the-art baselines, underscoring their effectiveness in capturing meaningful graph representations. Our research not only sheds light on the intricacies of the forward process in diffusion models but also underscores the vast potential of these models in addressing a wide spectrum of graph-related tasks. Our code is available at \url{https://github.com/statsle/DDM}.

InsActor: Instruction-driven Physics-based Characters
Jiawei Ren Mingyuan Zhang Cunjun Yu Xiao Ma Liang Pan Ziwei Liu



Research question: How to generate physically simulated animations that reflect high-level human instructions, enabling intuitive control of physics-based character animation.
Motivation: Because of the complexity of physical environments and the richness of human language, generating physically simulated animations that reflect high-level human instructions remains a difficult problem.
Method: We propose InsActor, a principled generative framework that leverages recent advances in diffusion-based human motion models to produce instruction-driven animations of physics-based characters.
Results: Experiments show that InsActor achieves state-of-the-art results on various tasks, including instruction-driven motion generation and instruction-driven waypoint navigation. Notably, its ability to generate physically simulated animations from high-level human instructions makes it a valuable tool, particularly for executing long-horizon tasks with rich instruction sets.

Generating animation of physics-based characters with intuitive control has long been a desirable task with numerous applications. However, generating physically simulated animations that reflect high-level human instructions remains a difficult problem due to the complexity of physical environments and the richness of human language. In this paper, we present $\textbf{InsActor}$, a principled generative framework that leverages recent advancements in diffusion-based human motion models to produce instruction-driven animations of physics-based characters. Our framework empowers InsActor to capture complex relationships between high-level human instructions and character motions by employing diffusion policies for flexibly conditioned motion planning. To overcome invalid states and infeasible state transitions in planned motions, InsActor discovers low-level skills and maps plans to latent skill sequences in a compact latent space. Extensive experiments demonstrate that InsActor achieves state-of-the-art results on various tasks, including instruction-driven motion generation and instruction-driven waypoint heading. Notably, the ability of InsActor to generate physically simulated animations using high-level human instructions makes it a valuable tool, particularly in executing long-horizon tasks with a rich set of instructions. Our project page is available at [jiawei-ren.github.io/projects/insactor/index.html](https://jiawei-ren.github.io/projects/insactor/index.html)

PaintSeg: Painting Pixels for Training-free Segmentation
Xiang Li Chung-Ching Lin Yinpeng Chen Zicheng Liu Jinglu Wang Rita Singh Bhiksha Raj



Research question: How to segment objects without any training.
Motivation: Existing training-based object segmentation methods require large amounts of annotated data and cannot effectively exploit unlabeled data.
Method: We propose an adversarial masked contrastive painting (AMCP) process, which creates a contrast between the original image and a painted image in which a masked area is painted using off-the-shelf generative models, alternating between filling in the background and recovering the missing parts of the foreground object.
Results: Experiments show the method outperforms existing approaches on coarse mask-prompt, box-prompt, and point-prompt segmentation tasks, providing a training-free solution for unsupervised segmentation.

The paper introduces PaintSeg, a new unsupervised method for segmenting objects without any training. We propose an adversarial masked contrastive painting (AMCP) process, which creates a contrast between the original image and a painted image in which a masked area is painted using off-the-shelf generative models. During the painting process, inpainting and outpainting are alternated, with the former masking the foreground and filling in the background, and the latter masking the background while recovering the missing part of the foreground object. Inpainting and outpainting, also referred to as I-step and O-step, allow our method to gradually advance the target segmentation mask toward the ground truth without supervision or training. PaintSeg can be configured to work with a variety of prompts, e.g. coarse masks, boxes, scribbles, and points. Our experimental results demonstrate that PaintSeg outperforms existing approaches in coarse mask-prompt, box-prompt, and point-prompt segmentation tasks, providing a training-free solution suitable for unsupervised segmentation. Code: https://github.com/lxa9867/PaintSeg.
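
A high-level sketch of the alternating painting loop only; `inpaint` stands in for any off-the-shelf generative inpainting model and `compare` for the contrast step that moves the mask toward regions where the original and painted images disagree. All helpers are hypothetical; only the I-step/O-step alternation follows the description above.

```python
def paintseg(image, mask, inpaint, compare, iters=10):
    """mask: boolean foreground mask; inpaint/compare are hypothetical stand-ins."""
    for i in range(iters):
        if i % 2 == 0:                        # I-step: mask foreground, repaint background
            painted = inpaint(image, mask)
        else:                                 # O-step: mask background, recover foreground
            painted = inpaint(image, ~mask)
        mask = compare(image, painted, mask)  # advance mask where images contrast
    return mask
```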

Unsupervised Semantic Correspondence Using Stable Diffusion
Eric Hedlin Gopal Sharma Shweta Mahajan Hossam Isack Abhishek Kar Andrea Tagliasacchi Kwang Moo Yi



Research question: This paper explores how, without any training, the semantic knowledge inside diffusion models can be used to find locations with the same semantic meaning across multiple images.
Motivation: Current text-to-image diffusion models can generate images indistinguishable from real ones, which requires understanding the semantics of the objects they are asked to generate.
Method: Given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. The optimized embeddings capture semantic information about the location, which can then be transferred to another image.
Results: Experiments show the method is on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperforms any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets (a 20.9% relative improvement on SPair-71k).

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences – locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly- or unsupervised method on PF-Willow, CUB-200 and SPair-71k datasets.

Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
Hanzhuo Huang Yufan Feng Cheng Shi Lan Xu Jingyi Yu Sibei Yang



Research question: This paper addresses data efficiency and cost efficiency in text-to-video generation, and how to generate semantically coherent videos.
Motivation: Existing text-to-video methods usually require large amounts of data and training, and the generated videos may lack semantic coherence.
Method: We propose Free-Bloom, a novel pipeline that uses a large language model as the director to generate a semantically coherent prompt sequence and a pre-trained latent diffusion model as the animator to generate high-fidelity frames. To ensure temporal, identity, and semantic coherence, we further propose a series of annotative modifications, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
Results: Without any video data or training, Free-Bloom generates vivid, high-quality videos with semantically meaningful frame sequences for complex scenes, and is naturally compatible with LDM-based extensions.

Text-to-video is a rapidly growing research area that aims to generate a semantic, identical, and temporal coherence sequence of frames that accurately align with the input text prompt. This study focuses on zero-shot text-to-video generation with data and cost efficiency in mind. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of ``moving images'', we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions.

Unlocking Feature Visualization for Deep Network with MAgnitude Constrained Optimization
Thomas FEL Thibaut Boissin Victor Boutin Agustin Martin Picard Paul Novello Julien Colin Drew Linsley Tom ROUSSEAU Remi Cadene Lore Goetschalckx Laurent Gardes Thomas Serre



Research question: The widespread adoption of feature visualization for deep neural networks is limited; issues with scaling and image generation must be solved.
Motivation: The work of Olah et al. (2017) popularized feature visualization, but the method scales poorly to deeper neural networks and relies on tricks to generate interpretable images.
Method: We propose MACO, which optimizes only an image's phase spectrum while keeping its magnitude constant, ensuring the generated explanations lie in the space of natural images and thereby addressing these shortcomings.
Results: Experiments show significant qualitative and quantitative improvements, enabling efficient and interpretable feature visualizations for state-of-the-art neural networks. The method also provides a spatial-importance attribution mechanism and enables quantitative evaluation of feature visualizations.

Feature visualization has gained significant popularity as an explainability method, particularly after the influential work by Olah et al. in 2017. Despite its success, its widespread adoption has been limited due to issues in scaling to deeper neural networks and the reliance on tricks to generate interpretable images. Here, we describe MACO, a simple approach to address these shortcomings. It consists in optimizing solely an image's phase spectrum while keeping its magnitude constant to ensure that the generated explanations lie in the space of natural images. Our approach yields significantly better results -- both qualitatively and quantitatively -- unlocking efficient and interpretable feature visualizations for state-of-the-art neural networks. We also show that our approach exhibits an attribution mechanism allowing to augment feature visualizations with spatial importance. Furthermore, we enable quantitative evaluation of feature visualizations by introducing 3 metrics: transferability, plausibility, and alignment with natural images. We validate our method on various applications and we introduce a website featuring MACO visualizations for all classes of the ImageNet dataset, which will be made available upon acceptance. Overall, our study unlocks feature visualizations for the largest, state-of-the-art classification networks without resorting to any parametric prior image model, effectively advancing a field that has been stagnating since 2017 (Olah et al, 2017).
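
A minimal sketch of magnitude-constrained optimization: only the phase spectrum is a trainable parameter, while the magnitude spectrum stays frozen, so every iterate remains in the fixed-magnitude image family. The magnitude initialization and `objective` are stand-ins (the paper uses a natural-image magnitude and maximizes a unit's activation).

```python
import torch

H = W = 128
objective = lambda img: img.square().mean()   # stand-in; replace with e.g. a class logit

magnitude = torch.rand(3, H, W // 2 + 1)      # frozen magnitude spectrum (rfft2 layout)
phase = torch.randn(3, H, W // 2 + 1, requires_grad=True)
opt = torch.optim.Adam([phase], lr=0.05)

for _ in range(256):
    spectrum = torch.polar(magnitude, phase)          # mag * exp(i * phase)
    img = torch.fft.irfft2(spectrum, s=(H, W))        # image with fixed magnitude
    loss = -objective(img)                            # maximize the objective
    opt.zero_grad(); loss.backward(); opt.step()
```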

DiffUTE: Universal Text Editing Diffusion Model
Haoxing Chen Zhuoer Xu Zhangxuan Gu Jun Lan Xing Zheng Yaohui Li Changhua Meng Huijia Zhu Weiqiang Wang



Research question: Existing diffusion models make mistakes when rendering text and text styles during generation; how can this problem be solved?
Motivation: We propose a universal self-supervised text editing diffusion model (DiffUTE) that replaces or modifies words in a source image while maintaining its realistic appearance.
Method: We build on a diffusion model and carefully modify the network structure to draw multilingual characters with the help of glyph and position information, and design a self-supervised learning framework that leverages large amounts of web data to improve the model's representation ability.
Results: Experiments show the method achieves impressive performance and enables highly realistic, controllable editing on in-the-wild images.

Diffusion model based language-guided image editing has achieved great success recently. However, existing state-of-the-art diffusion models struggle with rendering correct text and text style during generation. To tackle this problem, we propose a universal self-supervised text editing diffusion model (DiffUTE), which aims to replace or modify words in the source image with another one while maintaining its realistic appearance. Specifically, we build our model on a diffusion model and carefully modify the network structure to enable the model for drawing multilingual characters with the help of glyph and position information. Moreover, we design a self-supervised learning framework to leverage large amounts of web data to improve the representation ability of the model. Experimental results show that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity. Our code will be available in \url{https://github.com/chenhaoxing/DiffUTE}.

A Hierarchical Training Paradigm for Antibody Structure-sequence Co-design
Fang Wu Stan Z. Li



Research question: This paper proposes a hierarchical training paradigm (HTP) for antibody sequence-structure co-design.
Motivation: To excavate evolutionary information, which determines ligand binding pose and strength, from both geometric structures and vast antibody and non-antibody sequence databases.
Method: Through carefully crafted tasks, seamlessly and effectively integrate geometric graph neural networks with large-scale protein language models into an HTP comprising four levels of training stages.
Results: Experiments show that HTP sets new state-of-the-art performance on both the co-design problem and fixed-backbone design, offering a promising path toward unleashing the potential of deep generative architectures.

Therapeutic antibodies are an essential and rapidly flourishing drug modality. The binding specificity between antibodies and antigens is decided by complementarity-determining regions (CDRs) at the tips of these Y-shaped proteins. In this paper, we propose a \textbf{h}ierarchical \textbf{t}raining \textbf{p}aradigm (HTP) for the antibody sequence-structure co-design. HTP consists of four levels of training stages, each corresponding to a specific protein modality within a particular protein domain. Through carefully crafted tasks in different stages, HTP seamlessly and effectively integrates geometric graph neural networks (GNNs) with large-scale protein language models to excavate evolutionary information from not only geometric structures but also vast antibody and non-antibody sequence databases, which determines ligand binding pose and strength. Empirical experiments show HTP sets the new state-of-the-art performance in the co-design problem as well as the fixed-backbone design. Our research offers a hopeful path to unleash the potential of deep generative architectures and seeks to illuminate the way forward for the antibody sequence and structure co-design challenge.

AND: Adversarial Neural Degradation for Learning Blind Image Super-Resolution
Fangzhou Luo Xiaolin Wu Yanhui Guo



Research question: Deep neural networks learned for image super-resolution fail easily when the degradation model assumed in training mismatches the real degradation source at inference.
Motivation: Attempting to exhaust all degradation variants in simulation is unwieldy and impractical, so we propose a novel adversarial neural degradation (AND) model that generates a wide range of highly nonlinear, complex degradation effects without any explicit supervision.
Method: Train the AND model jointly with a deep restoration neural network under a minmax criterion.
Results: The AND model has a unique advantage over the current state of the art: it generalizes much better to unseen degradation variants and hence delivers significantly improved restoration performance on real-world images.

Learnt deep neural networks for image super-resolution fail easily if the assumed degradation model in training mismatches that of the real degradation source at the inference stage. Instead of attempting to exhaust all degradation variants in simulation, which is unwieldy and impractical, we propose a novel adversarial neural degradation (AND) model that can, when trained in conjunction with a deep restoration neural network under a minmax criterion, generate a wide range of highly nonlinear complex degradation effects without any explicit supervision. The AND model has a unique advantage over the current state of the art in that it can generalize much better to unseen degradation variants and hence deliver significantly improved restoration performance on real-world images.
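
As a rough illustration of the minmax training idea (not the paper's actual AND architecture), the sketch below alternates between a restorer minimizing reconstruction error and a degradation generator maximizing it under a mildness constraint; the networks, the random stand-in data, and the regularization weight are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

degrade = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 3, 3, padding=1))   # degradation generator
restore = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 3, 3, padding=1))   # restorer
opt_d = torch.optim.Adam(degrade.parameters(), lr=1e-4)
opt_r = torch.optim.Adam(restore.parameters(), lr=1e-4)

for step in range(1000):
    clean = torch.rand(8, 3, 64, 64)              # stand-in for HR training patches

    # Restorer step: minimize reconstruction error on the current degradations.
    degraded = clean + degrade(clean)             # residual degradation
    loss_r = F.l1_loss(restore(degraded.detach()), clean)
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # Degradation step: maximize the restorer's loss, but keep degradations mild.
    degraded = clean + degrade(clean)
    adv = -F.l1_loss(restore(degraded), clean)
    mild = (degraded - clean).pow(2).mean()       # proximity constraint (assumption)
    loss_d = adv + 0.1 * mild
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```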

Semi-Implicit Denoising Diffusion Models (SIDDMs)
Yanwu Xu Mingming Gong Shaoan Xie Wei Wei Matthias Grundmann Kayhan Batmanghelich Tingbo Hou



Research question: Despite the proliferation of generative models, achieving fast inference sampling without compromising sample diversity and quality remains challenging.
Motivation: Existing models such as denoising diffusion probabilistic models (DDPM) deliver high-quality, diverse samples but are slowed by their many iterative steps; denoising diffusion GANs (DDGAN) integrate a GAN to address this but hit scalability limits on large datasets.
Method: We propose a new approach that matches implicit and explicit factors: an implicit model matches the marginal distribution of noisy data with the explicit conditional distribution of the forward diffusion, which lets us match the joint denoising distribution. Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution for the reverse step, enabling large steps during inference; similar to DDPM but unlike DDGAN, we take advantage of the exact form of the diffusion process.
Results: Our experiments show that the proposed method obtains generative performance comparable to diffusion-based models and substantially better results than models with a small number of sampling steps.

Despite the proliferation of generative models, achieving fast sampling during inference without compromising sample diversity and quality remains challenging. Existing models such as Denoising Diffusion Probabilistic Models (DDPM) deliver high-quality, diverse samples but are slowed by an inherently high number of iterative steps. The Denoising Diffusion Generative Adversarial Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN model for larger jumps in the diffusion process. However, DDGAN encountered scalability limitations when applied to large datasets. To address these limitations, we introduce a novel approach that tackles the problem by matching implicit and explicit factors. More specifically, our approach involves utilizing an implicit model to match the marginal distributions of noisy data and the explicit conditional distribution of the forward diffusion. This combination allows us to effectively match the joint denoising distributions. Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution for the reverse step, enabling us to take large steps during inference. Similar to the DDPM but unlike DDGAN, we take advantage of the exact form of the diffusion process. We demonstrate that our proposed method obtains comparable generative performance to diffusion-based models and vastly superior results to models with a small number of sampling steps.

CRoSS: Diffusion Model Makes Controllable, Robust and Secure Image Steganography
Jiwen Yu Xuanyu Zhang Youmin Xu Jian Zhang



Research question: Current image steganography techniques mainly focus on cover-based methods, which risk leaking secret images and are poorly robust to degraded container images.
Motivation: Inspired by recent developments in diffusion models, we find that two of their properties, training-free translation between two images and robustness to noisy data, can improve the security and natural robustness of image steganography.
Method: We select Stable Diffusion, a conditional diffusion model, and fully utilize the latest open-source tools such as LoRAs and ControlNets to improve the controllability and diversity of container images. Overall, we propose a new image steganography framework, Controllable, Robust and Secure image steganography (CRoSS), which has significant advantages in controllability, robustness, and security over cover-based methods, obtained without any additional training.
Results: Detailed experiments demonstrate the advantages of the proposed CRoSS framework in controllability, robustness, and security; to our knowledge, this is the first work to introduce diffusion models to image steganography.

Current image steganography techniques are mainly focused on cover-based methods, which commonly have the risk of leaking secret images and poor robustness against degraded container images. Inspired by recent developments in diffusion models, we discovered that two properties of diffusion models, the ability to achieve translation between two images without training, and robustness to noisy data, can be used to improve security and natural robustness in image steganography tasks. For the choice of diffusion model, we selected Stable Diffusion, a type of conditional diffusion model, and fully utilized the latest tools from open-source communities, such as LoRAs and ControlNets, to improve the controllability and diversity of container images. In summary, we propose a novel image steganography framework, named Controllable, Robust and Secure Image Steganography (CRoSS), which has significant advantages in controllability, robustness, and security compared to cover-based image steganography methods. These benefits are obtained without additional training. To our knowledge, this is the first work to introduce diffusion models to the field of image steganography. In the experimental section, we conducted detailed experiments to demonstrate the advantages of our proposed CRoSS framework in controllability, robustness, and security.

Customizable Image Synthesis with Multiple Subjects
Zhiheng Liu Yifei Zhang Yujun Shen Kecheng Zheng Kai Zhu Ruili Feng Yu Liu Deli Zhao Jingren Zhou Yang Cao



Research question: How to effectively represent a particular subject and appropriately compose different subjects for controllable multi-subject image synthesis.
Motivation: Despite the success of existing algorithms in single-subject customization, their training cost grows and their success rate drops as the number of subjects increases.
Method: Learn a residual on top of the base embedding to robustly shift the raw subject to the customized subject under various text conditions, then employ layout as spatial guidance for arranging the subjects.
Results: Experiments show that the method significantly outperforms state-of-the-art alternatives under a variety of settings for multi-subject customization.

Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with an increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the locations of different subjects in the image. Using the cross-attention map as the intermediary, we can strengthen the signal of target subjects and weaken the signal of irrelevant subjects within a certain region, significantly alleviating the interference across subjects. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

Boundary Guided Learning-Free Semantic Control with Diffusion Models
Ye Zhu Yu Wu Zhiwei Deng Olga Russakovsky Yan Yan



Research question: How to effectively use pre-trained generative denoising diffusion models (DDMs) for downstream tasks such as image semantic editing without learning extra networks.
Motivation: Existing methods typically fine-tune DDMs or learn auxiliary editing networks; this paper proposes BoundaryDiffusion, which requires no extra networks.
Method: Seek a comprehensive understanding of the intermediate high-dimensional latent spaces by theoretically and empirically analyzing their probabilistic and geometric behaviors in the Markov chain, and propose an automatic search method for the critical step in the denoising trajectory of pre-trained DDMs.
Results: Extensive experiments on multiple DPM architectures and datasets achieve superior performance, demonstrating effectiveness in various task scenarios (image semantic editing, text-based editing, unconditional semantic control).

Applying pre-trained generative denoising diffusion models (DDMs) for downstream tasks such as image semantic editing usually requires either fine-tuning DDMs or learning auxiliary editing networks in the existing literature. In this work, we present our BoundaryDiffusion method for efficient, effective and light-weight semantic control with frozen pre-trained DDMs, without learning any extra networks. As one of the first learning-free diffusion editing works, we start by seeking a more comprehensive understanding of the intermediate high-dimensional latent spaces by theoretically and empirically analyzing their probabilistic and geometric behaviors in the Markov chain. We then propose to further explore the critical step in the denoising trajectory that characterizes the convergence of a pre-trained DDM and introduce an automatic search method. Last but not least, in contrast to the conventional understanding that DDMs have relatively poor semantic behaviors (in generic latent spaces), we prove that the critical latent space we found already forms semantic subspace boundaries at the generic level in unconditional DDMs, which allows us to do controllable manipulation by guiding the denoising trajectory towards the targeted boundary via a single-step operation. We conduct extensive experiments on multiple DPMs architectures (DDPM, iDDPM) and datasets (CelebA, CelebA-HQ, LSUN-church, LSUN-bedroom, AFHQ-dog) with different resolutions (64, 256), achieving superior or state-of-the-art performance in various task scenarios (image semantic editing, text-based editing, unconditional semantic control) to demonstrate the effectiveness.

StyleGAN knows Normal, Depth, Albedo, and More
Anand Bhattad Daniel McKee Derek Hoiem David Forsyth



Research question: How to induce StyleGAN to produce intrinsic images.
Motivation: Intrinsic images are image-like maps of scene properties such as depth, normals, albedo, or shading, and existing methods fall short on these tasks.
Method: Adding a fixed offset ${\bf d}_c$ to StyleGAN's latent ${\bf w}$ easily induces StyleGAN to produce intrinsic images.
Results: Experiments show that the intrinsic images produced by StyleGAN compare well, qualitatively and quantitatively, with those obtained by state-of-the-art image regression techniques, and are robust to relighting effects.

Intrinsic images, in the original sense, are image-like maps of scene properties like depth, normal, albedo, or shading. This paper demonstrates that StyleGAN can easily be induced to produce intrinsic images. The procedure is straightforward. We show that if StyleGAN produces $G({\bf w})$ from latent ${\bf w}$, then for each type of intrinsic image, there is a fixed offset ${\bf d}_c$ so that $G({\bf w}+{\bf d}_c)$ is that type of intrinsic image for $G({\bf w})$. Here ${\bf d}_c$ is {\em independent of ${\bf w}$}. The StyleGAN we used was pretrained by others, so this property is not some accident of our training regime. We show that there are image transformations StyleGAN will {\em not} produce in this fashion, so StyleGAN is not a generic image regression engine. It is conceptually exciting that an image generator should ``know'' and represent intrinsic images. There may also be practical advantages to using a generative model to produce intrinsic images. The intrinsic images obtained from StyleGAN compare well both qualitatively and quantitatively with those obtained by using SOTA image regression techniques; but StyleGAN's intrinsic images are robust to relighting effects, unlike SOTA methods.
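
The claimed structure, a single offset ${\bf d}_c$ shared across all latents, suggests a simple estimation procedure. The sketch below optimizes one offset against a few (latent, intrinsic-target) pairs; the toy generator and the zero targets are stand-ins, not the paper's StyleGAN or its supervision.

```python
import torch

def estimate_offset(G, ws, targets, steps=500, lr=0.01):
    """Fit a single offset d_c so that G(w + d_c) matches the intrinsic target for every w."""
    d_c = torch.zeros(ws.shape[1], requires_grad=True)   # shared across all latents
    opt = torch.optim.Adam([d_c], lr=lr)
    for _ in range(steps):
        loss = (G(ws + d_c) - targets).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return d_c.detach()

# Toy stand-in generator just to make the sketch runnable (assumption).
Wg = torch.randn(64, 3 * 8 * 8)
G = lambda w: torch.tanh(w @ Wg)
ws = torch.randn(16, 64)                                 # a handful of latents
targets = torch.zeros(16, 3 * 8 * 8)                     # e.g. normals from a SOTA predictor
d_c = estimate_offset(G, ws, targets)
print(d_c.shape)
```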

TextDiffuser: Diffusion Models as Text Painters
Jingye Chen Yupan Huang Tengchao Lv Lei Cui Qifeng Chen Furu Wei



Research question: Diffusion models struggle to render accurate and coherent text.
Motivation: To address this, we propose TextDiffuser, which focuses on generating images with visually appealing text that is coherent with the background.
Method: TextDiffuser has two stages: first, a Transformer model extracts keywords from the text prompt and generates a layout; then, a diffusion model generates images conditioned on the text prompt and the generated layout.
Results: Experiments and user studies show that TextDiffuser flexibly and controllably creates high-quality text images using text prompts alone or together with text template images, and performs text inpainting to reconstruct incomplete images. We will release the code, model, and dataset.

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we demonstrate that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. We will make the code, model and dataset publicly available.

PromptRestorer: A Prompting Image Restoration Method with Degradation Perception
Cong Wang Jinshan Pan Wei Wang Jiangxin Dong Mengzhu Wang Yakun Ju Junyang Chen



Research question: How to use raw degradation features to effectively guide deep restoration models, providing accurate degradation priors that facilitate better restoration.
Motivation: Restoration models that ignore degradation gradually forget it during network learning, which severely hinders model capacity.
Method: Propose a prompting image restorer (PromptRestorer) with two branches: a restoration branch that restores images, and a prompting branch that perceives degradation priors and prompts the restoration branch with reliable perceived content to guide the restoration process for better recovery.
Results: Experiments show that PromptRestorer achieves state-of-the-art results on four image restoration tasks: deraining, deblurring, dehazing, and desnowing.

We show that raw degradation features can effectively guide deep restoration models, providing accurate degradation priors that facilitate better restoration, while networks that ignore degradation gradually forget it during learning, which severely hinders model capacity. To address this, we propose a Prompting image Restorer, termed PromptRestorer. Specifically, PromptRestorer contains two branches: a restoration branch and a prompting branch. The former is used to restore images, while the latter perceives degradation priors and prompts the restoration branch with reliable perceived content to guide the restoration process for better recovery. To better perceive the degradation, which is extracted by a pre-trained model from given degradation observations, we propose a prompting degradation perception modulator, which adequately considers the characteristics of the self-attention mechanism and pixel-wise modulation, to better perceive the degradation priors from global and local perspectives. To control the propagation of the perceived content to the restoration branch, we propose gated degradation perception propagation, enabling the restoration branch to adaptively learn more useful features for better recovery. Extensive experimental results show that our PromptRestorer achieves state-of-the-art results on 4 image restoration tasks, including image deraining, deblurring, dehazing, and desnowing.

Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry
Yong-Hyun Park Mingi Kwon Jaewoong Choi Junghyo Jo Youngjung Uh



Research question: Despite the success of diffusion models (DMs), our understanding of their latent space remains limited.
Motivation: To understand the latent space, we analyze it from a geometric perspective.
Method: We derive a local latent basis in the latent space by leveraging the pullback metric associated with the encoding feature maps.
Results: The discovered local latent basis enables image editing by moving $\mathbf{x}_t$ along the basis vectors at specific timesteps. We further analyze how the geometric structure of DMs evolves over diffusion timesteps and differs across text conditions, confirming the known coarse-to-fine generation phenomenon and revealing new findings such as the discrepancy between $\mathbf{x}_t$ across timesteps, the effect of dataset complexity, and the time-varying influence of text prompts. To our knowledge, this is the first paper to perform image editing through $\mathbf{x}$-space traversal, editing only once at a specific timestep $t$ without any additional training, together with a thorough analysis of the latent structure of DMs.

Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. To understand the latent space $\mathbf{x}_t \in \mathcal{X}$, we analyze it from a geometrical perspective. Our approach involves deriving the local latent basis within $\mathcal{X}$ by leveraging the pullback metric associated with their encoding feature maps. Remarkably, our discovered local latent basis enables image editing capabilities by moving the latent $\mathbf{x}_t$ along the basis vectors at specific timesteps. We further analyze how the geometric structure of DMs evolves over diffusion timesteps and differs across different text conditions. This confirms the known phenomenon of coarse-to-fine generation, as well as reveals novel insights such as the discrepancy between $\mathbf{x}_t$ across timesteps, the effect of dataset complexity, and the time-varying influence of text prompts. To the best of our knowledge, this paper is the first to present image editing through $\mathbf{x}$-space traversal, editing only once at a specific timestep $t$ without any additional training, and providing thorough analyses of the latent structure of DMs. The code to reproduce our experiments can be found at the [link](https://github.com/enkeejunior1/Diffusion-Pullback).
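
A hedged sketch of the geometric core: take the Jacobian of an encoding feature map at $\mathbf{x}_t$ and use its right singular vectors as local latent directions. The toy MLP below stands in for the U-Net feature map; the real method's choice of layer and timestep handling are not reproduced.

```python
import torch

# Toy feature map standing in for the U-Net's internal representation (assumption).
h = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 32))
x_t = torch.randn(16)

J = torch.autograd.functional.jacobian(h, x_t)   # (32, 16): d h / d x_t
U, S, Vh = torch.linalg.svd(J, full_matrices=False)
basis = Vh                                        # rows: local latent directions

x_edited = x_t + 0.5 * basis[0]                   # move along the dominant direction
print(S[:3])                                      # singular values = local metric scales
```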

PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance
Peiqing Yang Shangchen Zhou Qingyi Tao Chen Change Loy



Research question: How to exploit pre-trained diffusion models for image restoration.
Motivation: Traditional task-specific training often fails to model complex degradation processes precisely, and existing methods that limit the solution space with explicit degradation models also frequently fall short.
Method: This paper introduces "partial guidance", a fresh perspective that is more adaptable to real-world degradations than existing works. Rather than specifically defining the degradation process, we model desired properties of high-quality images, such as structure and color statistics, and apply this guidance during the reverse diffusion process. These properties are readily available and make no assumptions about the degradation process. Combined with a diffusion prior, this partial guidance delivers appealing results across a range of restoration tasks; the method also extends to composite tasks by integrating the guidance from the respective tasks.
Results: Experiments show that the method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models.

Exploiting pre-trained diffusion models for restoration has recently become a favored alternative to the traditional task-specific training approach. Previous works have achieved noteworthy success by limiting the solution space using explicit degradation models. However, these methods often fall short when faced with complex degradations as they generally cannot be precisely modeled. In this paper, we introduce $\textit{partial guidance}$, a fresh perspective that is more adaptable to real-world degradations compared to existing works. Rather than specifically defining the degradation process, our approach models the desired properties, such as image structure and color statistics of high-quality images, and applies this guidance during the reverse diffusion process. These properties are readily available and make no assumptions about the degradation process. When combined with a diffusion prior, this partial guidance can deliver appealing results across a range of restoration tasks. Additionally, our method can be extended to handle composite tasks by consolidating multiple high-quality image properties, achieved by integrating the guidance from respective tasks. Experimental results demonstrate that our method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models.
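
To make the "partial guidance" idea concrete, the sketch below nudges the denoiser's clean-image estimate toward a target color statistic via the gradient of a property loss; the toy denoiser, the property chosen, and where the correction is applied are illustrative assumptions, not the paper's exact guidance rule.

```python
import torch

def guided_step(x_t, denoise, t, target_mean, scale=0.1):
    """One guidance step: pull the predicted clean image toward a color statistic."""
    x0_hat = denoise(x_t, t).detach().requires_grad_(True)
    prop_loss = (x0_hat.mean(dim=(1, 2)) - target_mean).pow(2).sum()
    grad, = torch.autograd.grad(prop_loss, x0_hat)
    return x0_hat - scale * grad                  # guided estimate

denoise = lambda x, t: 0.9 * x                    # toy stand-in for the diffusion model
x_t = torch.randn(3, 32, 32)
target_mean = torch.tensor([0.2, 0.3, 0.4])       # desired per-channel means (assumption)
x_guided = guided_step(x_t, denoise, t=10, target_mean=target_mean)
print(x_guided.shape)
```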

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Yonglong Tian Lijie Fan Phillip Isola Huiwen Chang Dilip Krishnan



Research question: Explore the potential of learning visual representations from synthetic images generated by text-to-image models.
Motivation: A natural question given the excellent performance of text-to-image models in generating high-quality images.
Method: We consider Stable Diffusion, one of the leading open-source text-to-image models, and show that (1) when the generative model is properly configured, training self-supervised methods on synthetic images can match or beat the real-image counterpart; and (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method called StableRep.

We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in the light of the excellent performance of such models in generating high-quality images. We specifically consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is properly configured, training self-supervised methods on synthetic images can match or beat the real image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images (10M captions) achieves better accuracy than CLIP trained with 50M real images (50M captions).
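
The multi-positive loss is easy to state in code. Below is a minimal sketch in which all images sharing a caption are mutual positives, trained with a soft-target cross-entropy over the similarity matrix; the encoder, batch layout, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_positive_loss(z, caption_ids, tau=0.1):
    """z: (N, d) L2-normalized embeddings; caption_ids: (N,) caption of each image."""
    sim = z @ z.t() / tau
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                       # exclude self-similarity
    pos = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~eye
    target = pos.float() / pos.sum(1, keepdim=True).clamp(min=1)
    return F.cross_entropy(sim, target)                    # soft-target cross-entropy

z = F.normalize(torch.randn(32, 128), dim=1)
caption_ids = torch.arange(8).repeat_interleave(4)         # 8 captions x 4 images each
print(multi_positive_loss(z, caption_ids))
```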

Optimal Transport-Guided Conditional Score-Based Diffusion Model
Xiang Gu Liwei Yang Jian Sun Zongben Xu



Research question: Existing conditional generative models require paired data as the condition, but real-world applications may not provide sufficient paired data.
Motivation: To handle applications with partially paired or unpaired datasets, this paper proposes a novel Optimal Transport-guided Conditional Score-based diffusion model (OTCS).
Method: Build the coupling relationship for unpaired or partially paired datasets via $L_2$-regularized unsupervised or semi-supervised optimal transport, then develop training objectives for the conditional score-based model in the unpaired and partially paired settings based on this coupling.
Results: Extensive experiments on unpaired super-resolution and semi-paired image-to-image translation demonstrate the effectiveness of the proposed OTCS model. From the viewpoint of optimal transport, OTCS provides a way to transport data across distributions on large-scale datasets, which is a challenge for OT; we theoretically prove that OTCS realizes the data transport in OT and give a theoretical bound.

Conditional score-based diffusion models (SBDMs) generate target data conditioned on paired data and have achieved great success in image translation. However, they require paired data as the condition, and sufficient paired data may not be available in real-world applications. To tackle applications with partially paired or even unpaired datasets, we propose a novel Optimal Transport-guided Conditional Score-based diffusion model (OTCS) in this paper. We build the coupling relationship for the unpaired or partially paired dataset based on $L_2$-regularized unsupervised or semi-supervised optimal transport, respectively. Based on the coupling relationship, we develop the objective for training the conditional score-based model for unpaired or partially paired settings, which is based on a reformulation and generalization of the conditional SBDM for the paired setting. With the estimated coupling relationship, we effectively train the conditional score-based model by designing a ``resampling-by-compatibility'' strategy to choose the sampled data with high compatibility as guidance. Extensive experiments on unpaired super-resolution and semi-paired image-to-image translation demonstrate the effectiveness of the proposed OTCS model. From the viewpoint of optimal transport, OTCS provides an approach to transport data across distributions, which is a challenge for OT on large-scale datasets. We theoretically prove that OTCS realizes the data transport in OT with a theoretical bound.

Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing
Kai Wang Fei Yang Shiqi Yang Muhammad Atif Butt Joost van de Weijer



Research question: When modifying a target region, current image editing techniques are prone to unintended modifications of non-target regions, such as the background or distractor objects that have some semantic or visual relationship with the target object.
Motivation: To address this, we propose Dynamic Prompt Learning (DPL), which forces cross-attention maps to focus on the correct noun words in the text prompt, enabling fine-grained editing of particular objects while preventing undesired changes to other image regions.
Method: Building on the publicly available Stable Diffusion, we update the dynamic tokens for nouns in the textual input using the proposed leakage repairment losses.
Results: Evaluated on a wide range of images, DPL obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (user evaluation), improving image editing results for word swap, prompt refinement, and attention re-weighting, especially in complex multi-object scenes.

Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as on the background or on distractor objects which have some semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose $\textit{Dynamic Prompt Learning}$ ($DPL$) to force cross-attention maps to focus on correct $\textit{noun}$ words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repairment losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method $DPL$, based on the publicly available $\textit{Stable Diffusion}$, is extensively evaluated on a wide range of images, and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (on user-evaluation). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.

Predicting a Protein's Stability under a Million Mutations
Jeffrey Ouyang-Zhang Daniel Jesus Diaz Adam Klivans Philipp Kraehenbuehl



Research question: How to efficiently predict mutations that improve a protein's stability.
Motivation: Identifying the scarce mutations that improve thermodynamic stability is a foundational step in protein engineering, but existing methods are computationally expensive and inefficient.
Method: Develop "Mutate Everything", a simple parallel decoding algorithm that predicts the effects of all single and double mutations in one forward pass and can even predict higher-order mutations with minimal computational overhead.
Results: Trained on the Mega-Scale cDNA proteolysis dataset, Mutate Everything achieves state-of-the-art performance on single and higher-order mutations on the S669, ProTherm, and ProteinGym datasets.

Stabilizing proteins is a foundational step in protein engineering. However, the evolutionary pressure of all extant proteins makes identifying the scarce number of mutations that will improve thermodynamic stability challenging. Deep learning has recently emerged as a powerful tool for identifying promising mutations. Existing approaches, however, are computationally expensive, as the number of model inferences scales with the number of mutations queried. Our main contribution is a simple, parallel decoding algorithm. Mutate Everything is capable of predicting the effect of all single and double mutations in one forward pass. It is even versatile enough to predict higher-order mutations with minimal computational overhead. We build Mutate Everything on top of ESM2 and AlphaFold, neither of which were trained to predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on S669, ProTherm, and ProteinGym datasets. Our code is available at https://github.com/jozhang97/MutateEverything.
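
The one-forward-pass decoding idea can be sketched independently of the ESM2/AlphaFold backbones. Below, per-residue features are computed once, a linear head emits an (L x 20) matrix of single-mutation effects, and double mutations are scored additively plus a learned pairwise correction; every component here is a placeholder, not the paper's model.

```python
import torch
import torch.nn as nn

L, d = 128, 64
feats = torch.randn(L, d)              # per-residue features from ONE forward pass
single_head = nn.Linear(d, 20)         # ddG of mutating residue i to each amino acid
pair_head = nn.Bilinear(d, d, 1)       # learned correction for interacting pairs

ddg_single = single_head(feats)        # (L, 20): every single mutation at once
i, j = 10, 57
pair = pair_head(feats[i:i + 1], feats[j:j + 1]).squeeze()
ddg_double = ddg_single[i].unsqueeze(1) + ddg_single[j].unsqueeze(0) + pair
print(ddg_single.shape, ddg_double.shape)   # (128, 20) and (20, 20)
```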

topic-10

Topic words :  learning,  data,  training,  model,  performance,  methods,  distribution,  datasets

Students Parrot Their Teachers: Membership Inference on Model Distillation
Matthew Jagielski Milad Nasr Katherine Lee Christopher A. Choquette-Choo Nicholas Carlini Florian Tramèr



Research question: This paper systematically studies the privacy that knowledge distillation provides to the teacher and student training sets by designing membership inference attacks.
Motivation: Existing empirical privacy defenses rely on the intuition that a "student" model protects the privacy of training data because it interacts with that data only indirectly, through a "teacher" model.
Method: Design membership inference attacks and conduct a systematic study of knowledge distillation across multiple domains.
Results: Experiments show that distillation alone provides only limited privacy in many domains; our attacks are most successful when the student and teacher datasets are similar or when the attacker can poison the teacher's dataset.

Model distillation is frequently proposed as a technique to reduce the privacy leakage of machine learning. These empirical privacy defenses rely on the intuition that distilled ``student'' models protect the privacy of training data, as they only interact with this data indirectly through a ``teacher'' model. In this work, we design membership inference attacks to systematically study the privacy provided by knowledge distillation to both the teacher and student training sets. Our new attacks show that distillation alone provides only limited privacy across a number of domains. We explain the success of our attacks on distillation by showing that membership inference attacks on a private dataset can succeed even if the target model is never queried on any actual training points, but only on inputs whose predictions are highly influenced by training data. Finally, we show that our attacks are strongest when student and teacher sets are similar, or when the attacker can poison the teacher set.

Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition
Samuel Dooley Rhea Sanjay Sukthanker John P Dickerson Colin White Frank Hutter Micah Goldblum



Research question: Face recognition systems are widely deployed in safety-critical applications, yet they exhibit bias across socio-demographic dimensions such as gender and race.
Motivation: Conventional wisdom holds that model bias arises from biased training data, but the authors find that bias is actually inherent to the neural network architecture itself.
Method: Conduct a joint neural architecture search and hyperparameter search, producing a suite of models that Pareto-dominate all other high-performance architectures and existing bias-mitigation methods in both accuracy and fairness.
Results: These models perform well on CelebA and VGGFace2, the two most widely used face recognition datasets, and generalize to other datasets and sensitive attributes.

Face recognition systems are widely deployed in safety-critical applications, including law enforcement, yet they exhibit bias across a range of socio-demographic dimensions, such as gender and race. Conventional wisdom dictates that model biases arise from biased training data. As a consequence, previous works on bias mitigation largely focused on pre-processing the training data, adding penalties to prevent bias from affecting the model during training, or post-processing predictions to debias them, yet these approaches have shown limited success on hard problems such as face recognition. In our work, we discover that biases are actually inherent to neural network architectures themselves. Following this reframing, we conduct the first neural architecture search for fairness, jointly with a search for hyperparameters. Our search outputs a suite of models which Pareto-dominate all other high-performance architectures and existing bias mitigation methods in terms of accuracy and fairness, often by large margins, on the two most widely used datasets for face identification, CelebA and VGGFace2. Furthermore, these models generalize to other datasets and sensitive attributes. We release our code, models and raw data files at https://github.com/dooleys/FR-NAS.

Can semi-supervised learning use all the data effectively? A lower bound perspective
Alexandru Tifrea Gizem Yüce Amartya Sanyal Fanny Yang



Research question: Prior theoretical and empirical work shows that semi-supervised learning (SSL) algorithms can use unlabeled data to improve the labeled sample complexity of supervised learning (SL). However, existing theory focuses on regimes where the unlabeled data alone suffices to learn a good decision boundary via unsupervised learning (UL). This raises the question: can SSL algorithms simultaneously improve upon both UL and SL?
Motivation: To answer this, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and unlabeled dataset sizes and the signal-to-noise ratio of the mixture distribution. Surprisingly, the result implies that for these distributions, no SSL algorithm improves upon the minimax-optimal statistical error rates of UL or SL algorithms.
Method: Analyze the performance of SSL algorithms on 2-Gaussian mixture models by deriving a tight lower bound.
Results: Although no improvement in rates can be shown, in our real-world experiments SSL algorithms often outperform UL and SL algorithms. Overall, our work suggests that while it is possible to prove performance gains for SSL algorithms, doing so requires careful tracking of constants in the theoretical analysis.

Prior theoretical and empirical works have established that semi-supervised learning algorithms can leverage the unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical work focuses on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This begs the question: Can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and the unlabeled dataset size as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm improves upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, in our real-world experiments, SSL algorithms can often outperform UL and SL algorithms. In summary, our work suggests that while it is possible to prove the performance gains of SSL algorithms, this would require careful tracking of constants in the theoretical analysis.

Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms
Dheeraj Baby Saurabh Garg Tzu-Ching Yen Sivaraman Balakrishnan Zachary Chase Lipton Yu-Xiang Wang



Research question: This paper studies supervised and unsupervised online label shift, where the class marginals $Q(y)$ vary but the class-conditionals $Q(x|y)$ remain invariant.
Motivation: In the unsupervised setting, the goal is to adapt a learner to the changing label distribution given unlabeled online data; in the supervised setting, we must simultaneously learn a classifier and adapt to the dynamically evolving class marginals using labeled online data only.
Method: We develop new algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the drift. Our solution is based on bootstrapping the estimates of *online regression oracles* that track the drifting proportions.
Results: Experiments across numerous simulated and real-world online label shift scenarios demonstrate superior performance, often with 1-3% accuracy improvements, while remaining sample- and computationally efficient. Code is available at https://github.com/Anon-djiwh/OnlineLabelShift

This paper focuses on supervised and unsupervised online label shift, where the class marginals $Q(y)$ varies but the class-conditionals $Q(x|y)$ remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the dynamically evolving class marginals given only labeled online data. We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution. Our solution is based on bootstrapping the estimates of *online regression oracles* that track the drifting proportions. Experiments across numerous simulated and real-world online label shift scenarios demonstrate the superior performance of our proposed approaches, often achieving 1-3% improvement in accuracy while being sample and computationally efficient. Code is publicly available at https://github.com/Anon-djiwh/OnlineLabelShift
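
As a rough illustration (replacing the paper's online-regression reduction with a simple exponential moving average, which is an assumption), the sketch below tracks the drifting class marginals from unlabeled predictions and reweights a fixed classifier accordingly.

```python
import numpy as np

def adapt_step(probs, q_est, p_train, lr=0.05):
    """probs: (n_t, K) classifier outputs on the unlabeled batch at time t."""
    q_obs = probs.mean(axis=0)                   # crude marginal estimate
    q_est = (1 - lr) * q_est + lr * q_obs        # online tracking of the drift
    w = q_est / p_train                          # per-class importance weights
    post = probs * w
    return q_est, post / post.sum(axis=1, keepdims=True)

K = 3
p_train = np.full(K, 1 / K)                      # training-time class marginals
q_est = p_train.copy()
rng = np.random.default_rng(0)
for t in range(5):
    probs = rng.dirichlet([4.0, 1.0, 1.0], size=64)   # drifted unlabeled batch
    q_est, post = adapt_step(probs, q_est, p_train)
print(np.round(q_est, 3))
```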

Combating Representation Learning Disparity with Geometric Harmonization
Zhihan Zhou Jiangchao Yao Feng Hong Ya Zhang Bo Han Yanfeng Wang



Research question: Existing self-supervised learning methods struggle to capture transferable and robust representations under the long-tailed distributions of real-world applications.
Motivation: Vanilla self-supervised methods pursue sample-level uniformity, leading to representation learning disparity: head classes (with many samples) dominate the feature space while tail classes (with few samples) passively collapse.
Method: Propose a novel Geometric Harmonization (GH) method that encourages category-level uniformity in representation learning, which is more benign to the minority and barely hurts the majority under long-tailed distributions. Specifically, GH measures the population statistics of the embedding space on top of self-supervised learning and infers a fine-grained instance-wise calibration that constrains the expansion of head classes and avoids the passive collapse of tail classes.
Results: Extensive results show high tolerance to distribution skewness, effectively addressing the long-tail challenge faced by existing self-supervised methods.

Self-supervised learning (SSL) as an effective paradigm of representation learning has achieved tremendous success on various curated datasets in diverse scenarios. Nevertheless, when facing the long-tailed distribution in real-world applications, it is still hard for existing methods to capture transferable and robust representation. The reason is that vanilla SSL methods pursuing sample-level uniformity easily lead to representation learning disparity, where head classes with a huge sample number dominate the feature regime but tail classes with a small sample number passively collapse. To address this problem, we propose a novel Geometric Harmonization (GH) method to encourage category-level uniformity in representation learning, which is more benign to the minority and almost does not hurt the majority under long-tailed distribution. Specifically, GH measures the population statistics of the embedding space on top of self-supervised learning, and then infers a fine-grained instance-wise calibration to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing methods in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of GH with high tolerance to the distribution skewness.

RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability
Chuning Zhu Max Simchowitz Siri Gadipudi Abhishek Gupta



Research question: Existing visual model-based reinforcement learning methods encode image observations without eliminating redundant information, leaving them susceptible to spurious variations.
Motivation: To make visual model-based RL robust to task-irrelevant variations so that it can operate in dynamic environments.
Method: Propose a training objective that encourages the representation to be maximally predictive of dynamics and reward while constraining the information flow from the observation to the latent representation, together with a reward-free alignment procedure that enables test-time adaptation of the encoder.
Results: Experiments show that the approach significantly bolsters the resilience of visual model-based RL to visual distractors and enables operation in dynamic environments; the reward-free alignment procedure allows quick adaptation to widely differing environments without relearning the dynamics and policy.

Visual model-based RL methods typically encode image observations into low-dimensional representations in a manner that does not eliminate redundant information. This leaves them susceptible to spurious variations -- changes in task-irrelevant components such as background distractors or lighting conditions. In this paper, we propose a visual model-based RL method that learns a latent representation resilient to such spurious variations. Our training objective encourages the representation to be maximally predictive of dynamics and reward, while constraining the information flow from the observation to the latent representation. We demonstrate that this objective significantly bolsters the resilience of visual model-based RL methods to visual distractors, allowing them to operate in dynamic environments. We then show that while the learned encoder is able to operate in dynamic environments, it is not invariant under significant distribution shift. To address this, we propose a simple reward-free alignment procedure that enables test time adaptation of the encoder. This allows for quick adaptation to widely differing environments without having to relearn the dynamics and policy. Our effort is a step towards making model-based RL a practical and useful tool for dynamic, diverse domains and we show its effectiveness in simulation tasks with significant spurious variations.

On the Connection between Pre-training Data Diversity and Fine-tuning Robustness
Vivek Ramanujan Thao Nguyen Sewoong Oh Ali Farhadi Ludwig Schmidt



Research question: How do properties of the pre-training distribution affect the robustness of downstream models?
Motivation: To explore how pre-training strategies affect the generalization properties of downstream models.
Method: Vary the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution, and study how these properties affect downstream robustness.
Results: Data quantity is the primary factor influencing downstream effective robustness, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4x while increasing the number of images per class by 4x (i.e., keeping total data quantity fixed) does not affect the robustness of fine-tuned models.

Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do properties of the pre-training distribution affect the robustness of a fine-tuned model? The properties we explore include the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution. We find that the primary factor influencing downstream effective robustness (Taori et al., 2020) is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4x while increasing the number of images per class by 4x (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCam-WILDS distribution shift as a test for robustness.

No Change, No Gain: Empowering Graph Neural Networks with Expected Model Change Maximization for Active Learning
Zixing Song Yifei Zhang Irwin King



Research question: How to improve the prediction performance of graph neural networks (GNNs) on unlabeled data.
Motivation: The success of GNNs relies on sufficient labeled data, which is often difficult to obtain.
Method: Propose a novel active learning (AL) method that extends the Expected Model Change Maximization (EMCM) principle to GNNs. By giving a Bayesian interpretation of the node embeddings generated by GNNs in the semi-supervised setting, we efficiently compute the closed-form EMCM acquisition function as the AL selection criterion without retraining.
Results: Experiments show the method's effectiveness over existing approaches in both accuracy and efficiency.

Graph Neural Networks (GNNs) are crucial for machine learning applications with graph-structured data, but their success depends on sufficient labeled data. We present a novel active learning (AL) method for GNNs, extending the Expected Model Change Maximization (EMCM) principle to improve prediction performance on unlabeled data. By presenting a Bayesian interpretation for the node embeddings generated by GNNs under the semi-supervised setting, we efficiently compute the closed-form EMCM acquisition function as the selection criterion for AL without re-training. Our method establishes a direct connection with expected prediction error minimization, offering theoretical guarantees for AL performance. Experiments demonstrate our method's effectiveness compared to existing approaches, in terms of both accuracy and efficiency.
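
A hedged sketch of a closed-form expected-model-change score: for each unlabeled node, the expected last-layer gradient norm under the model's own predictive distribution, computable from embeddings without retraining. The exact EMCM form and the Bayesian treatment in the paper are not reproduced here.

```python
import torch

def emcm_scores(H, logits):
    """H: (n, d) node embeddings; logits: (n, K) current classifier outputs."""
    p = logits.softmax(dim=1)
    eye = torch.eye(p.shape[1])
    # For candidate label y, the last-layer gradient is (p - e_y) outer h,
    # whose Frobenius norm factorizes as ||p - e_y|| * ||h||.
    grad_norms = (p.unsqueeze(1) - eye.unsqueeze(0)).norm(dim=2)   # (n, K)
    return (p * grad_norms).sum(dim=1) * H.norm(dim=1)   # expectation over y

H = torch.randn(100, 16)
logits = torch.randn(100, 7)
query = emcm_scores(H, logits).topk(5).indices    # nodes to label next
print(query)
```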

Zero-shot causal learning
Hamed Nilforoshan Michael Moor Yusuf H Roohani Yining Chen Anja Šurina Michihiro Yasunaga Sara Oblak Jure Leskovec



Research question: How to predict the causal effect of different interventions on a specific individual.
Motivation: Predicting the effect of a specific intervention on an individual matters in personalized medicine, public policy, online marketing, and beyond.
Method: Propose CaML (Causal Meta-Learning), a framework that formulates the personalized effect prediction of each intervention as a task and trains a single meta-model across tasks, leveraging both intervention information and individual features.
Results: Experiments show that CaML performs strongly on real-world datasets of large-scale medical claims and cell-line perturbations, even outperforming strong baselines trained directly on data from the test interventions.

Predicting how different interventions will causally affect a specific individual is important in a variety of domains such as personalized medicine, public policy, and online marketing. There are a large number of methods to predict the effect of an existing intervention based on historical data from individuals who received it. However, in many settings it is important to predict the effects of novel interventions (e.g., a newly invented drug), which these methods do not address. Here, we consider zero-shot causal learning: predicting the personalized effects of a novel intervention. We propose CaML, a causal meta-learning framework which formulates the personalized prediction of each intervention's effect as a task. CaML trains a single meta-model across thousands of tasks, each constructed by sampling an intervention, its recipients, and its nonrecipients. By leveraging both intervention information (e.g., a drug's attributes) and individual features (e.g., a patient's history), CaML is able to predict the personalized effects of novel interventions that do not exist at the time of training. Experimental results on real world datasets in large-scale medical claims and cell-line perturbations demonstrate the effectiveness of our approach. Most strikingly, CaML's zero-shot predictions outperform even strong baselines trained directly on data from the test interventions.

Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?
Erin George Michael Murray William Joseph Swartworth Deanna Needell



Research question: This work studies benign overfitting in two-layer ReLU networks trained with gradient descent and hinge loss on noisy binary classification data.
Motivation: We consider linearly separable data in which a small fraction of labels are corrupted or flipped, and we wish to understand, through conditions on the clean-data margin, three distinct training outcomes: benign overfitting, overfitting, and non-overfitting.
Method: We prove these results using a combinatorial approach that bounds the number of clean versus corrupt updates during the two phases of training.
Results: In benign overfitting, test data is classified correctly with high probability; in overfitting, the test misclassification probability is lower bounded by a constant; in non-overfitting, only clean points achieve zero loss, and test data is again classified correctly with high probability. Our analysis also provides a fine-grained description of neuron dynamics throughout training and reveals two distinct training phases.

We study benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted or flipped. We identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates during these phases of training.

Maximization of Average Precision for Deep Learning with Adversarial Ranking Robustness
Gang Li Wei Tong Tianbao Yang



Research question: Optimizing Average Precision (AP) while ensuring adversarial robustness, an area that has not been fully explored.
Motivation: Although adversarial training has been studied extensively, the focus has been on accuracy robustness, i.e., maintaining average accuracy on adversarially perturbed examples. This type of robustness is insufficient for many applications, since small perturbations on a single example can significantly affect AP while barely influencing the accuracy of the prediction system.
Method: We propose a new formulation that combines an AP surrogate loss with a regularization term representing adversarial ranking robustness, maintaining consistency between the rankings of clean and perturbed data, and we design an efficient stochastic optimization algorithm for the resulting objective.
Results: Empirical studies comparing against leading adversarial training baselines and other robust AP maximization strategies demonstrate the method's effectiveness. In particular, on CIFAR10 and CIFAR100 our method exceeds the state-of-the-art method (TRADES) by more than 4% in robust AP against PGD attacks while simultaneously achieving 7% higher AP on clean data.

This paper seeks to address a gap in optimizing Average Precision (AP) while ensuring adversarial robustness, an area that has not been extensively explored to the best of our knowledge. AP maximization for deep learning has widespread applications, particularly when there is a significant imbalance between positive and negative examples. Although numerous studies have been conducted on adversarial training, they primarily focus on robustness concerning accuracy, ensuring that the average accuracy on adversarially perturbed examples is well maintained. However, this type of adversarial robustness is insufficient for many applications, as minor perturbations on a single example can significantly impact AP while not greatly influencing the accuracy of the prediction system. To tackle this issue, we introduce a novel formulation that combines an AP surrogate loss with a regularization term representing adversarial ranking robustness, which maintains the consistency between ranking of clean data and that of perturbed data. We then devise an efficient stochastic optimization algorithm to optimize the resulting objective. Our empirical studies, which compare our method to current leading adversarial training baselines and other robust AP maximization strategies, demonstrate the effectiveness of the proposed approach. Notably, our methods outperform a state-of-the-art method (TRADES) by more than 4\% in terms of robust AP against PGD attacks while achieving 7\% higher AP on clean data simultaneously on CIFAR10 and CIFAR100.

Mechanism Design for Collaborative Normal Mean Estimation
Yiding Chen Jerry Zhu Kirthevasan Kandasamy



Research question: We study collaborative normal mean estimation, where m strategic agents collect i.i.d. samples from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ at a cost. They all wish to estimate the mean $\mu$; by sharing data with each other, agents can obtain better estimates while keeping the cost of data collection small.
Motivation: To facilitate this collaboration, we wish to design mechanisms that encourage agents to collect a sufficient amount of data and share it truthfully, so that all are better off than working alone. Under naive mechanisms, such as simply pooling and sharing all data, an individual agent may find it beneficial to under-collect and/or fabricate data, leading to poor social outcomes.
Method: We design a novel mechanism that overcomes these challenges via two key techniques: first, when sharing other agents' data with an agent, the mechanism corrupts the dataset in proportion to how much that agent's reported data differs from the others'; second, we design minimax optimal estimators for the corrupted dataset. Our mechanism is Nash incentive compatible and individually rational, and its social penalty (the sum of all agents' estimation errors and data collection costs) is at most twice the global minimum. Applied to high-dimensional (non-Gaussian) distributions with bounded variance, the mechanism retains these three properties, with slightly weaker results.
Results: In two special cases where we restrict the agents' strategy space, we design mechanisms that essentially achieve the global minimum.

We study collaborative normal mean estimation, where $m$ strategic agents collect i.i.d samples from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ at a cost. They all wish to estimate the mean $\mu$. By sharing data with each other, agents can obtain better estimates while keeping the cost of data collection small. To facilitate this collaboration, we wish to design mechanisms that encourage agents to collect a sufficient amount of data and share it truthfully, so that they are all better off than working alone. In naive mechanisms, such as simply pooling and sharing all the data, an individual agent might find it beneficial to under-collect and/or fabricate data, which can lead to poor social outcomes. We design a novel mechanism that overcomes these challenges via two key techniques: first, when sharing the others' data with an agent, the mechanism corrupts this dataset proportional to how much the data reported by the agent differs from the others; second, we design minimax optimal estimators for the corrupted dataset. Our mechanism, which is Nash incentive compatible and individually rational, achieves a social penalty (sum of all agents' estimation errors and data collection costs) that is at most a factor 2 of the global minimum. When applied to high dimensional (non-Gaussian) distributions with bounded variance, this mechanism retains these three properties, but with slightly weaker results. Finally, in two special cases where we restrict the strategy space of the agents, we design mechanisms that essentially achieve the global minimum.

Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts
Pritam Sarkar Ahmad Beirami Ali Etemad



Research question: This paper comprehensively studies the behavior and dynamics of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) under various forms of natural distribution shift.
Motivation: Although video self-supervised learning (VSSL) has made significant progress in recent years, the exact behavior and dynamics of these models under different forms of distribution shift are not yet known.
Method: Using public datasets and a series of evaluation protocols, carefully craft a test bed of 17 in-distribution and out-of-distribution benchmark pairs to stress-test the methods under the intended shifts.
Results: The study uncovers a series of intriguing behaviors of VSSL methods. For example, video models generally struggle with context shifts, whereas v-MAE and supervised learning are more robust. The study also finds that v-MAE is a strong temporal learner, while the contrastive methods v-SimCLR and v-MoCo perform strongly under viewpoint shifts. When studying open-set recognition, there is a trade-off between closed-set and open-set recognition performance if pre-trained VSSL encoders are used without fine-tuning.

Video self-supervised learning (VSSL) has made significant progress in recent years. However, the exact behavior and dynamics of these models under different forms of distribution shift are not yet known. In this paper, we comprehensively study the behavior of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown classes (zero-shot), and (vi) open-set recognition. To perform this extensive study, we carefully craft a test bed consisting of 17 in-distribution and out-of-distribution benchmark pairs using available public datasets and a series of evaluation protocols to stress-test the different methods under the intended shifts. Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods. For instance, we observe that while video models generally struggle with context shifts, v-MAE and supervised learning exhibit more robustness. Moreover, our study shows that v-MAE is a strong temporal learner, whereas contrastive methods, v-SimCLR and v-MoCo, exhibit strong performances against viewpoint shifts. When studying the notion of open-set recognition, we notice a trade-off between closed-set and open-set recognition performance if the pretrained VSSL encoders are used without finetuning. We hope that our work will contribute to the development of robust video representation learning frameworks for various real-world scenarios. The project page and code are available at: https://pritamqu.github.io/OOD-VSSL.

Skill-it! A data-driven skills framework for understanding and training language models
Mayee F Chen Nicholas Roberts Kush Bhatia Jue WANG Ce Zhang Frederic Sala Christopher Re



Research question: Given a fixed token budget, how to best select training data to optimize the downstream task performance of pre-trained large language models (LMs).
Motivation: Just as humans acquire interdependent skills in a deliberate order, language models may follow a natural order when learning a set of skills; if such an order exists, it can be used to improve our understanding of LMs and to enable data-efficient training.
Method: Develop a new framework built on this simple hypothesis. Using synthetic and real data, we demonstrate that such ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills.
Results: In the continual pre-training setting on LEGO synthetic data, Skill-It obtains 37.5 points higher accuracy than random sampling. In the fine-tuning setting on the Natural Instructions dataset, Skill-It reduces the validation loss on the target skill by 13.6%. Applying the skills framework to continual pre-training on the RedPajama dataset yields higher LM Evaluation Harness accuracy with 1B tokens than a baseline that samples uniformly over data sources with 3B tokens.

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 37.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
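
A minimal sketch of the online sampling idea (a multiplicative-weights stand-in for Skill-It's actual update, which is an assumption): maintain a weight per skill and shift the sampling mixture toward skills whose validation loss remains high.

```python
import numpy as np

rng = np.random.default_rng(0)
num_skills, eta = 5, 1.0
weights = np.ones(num_skills)

for t in range(20):
    mix = weights / weights.sum()                       # current sampling mixture
    batch_skills = rng.choice(num_skills, size=256, p=mix)
    # ... train the LM on data drawn from these skills ...
    val_loss = rng.uniform(0.5, 2.0, size=num_skills)   # placeholder per-skill losses
    weights *= np.exp(eta * val_loss / val_loss.sum())  # upweight hard skills
print(np.round(weights / weights.sum(), 3))
```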

Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems
Benjamin Coleman Wang-Cheng Kang Matthew Fahrbach Ruoxi Wang Lichan Hong Ed H. Chi Derek Zhiyuan Cheng



Research question: How to learn high-quality feature embeddings efficiently and effectively to power web-scale machine learning systems.
Motivation: With hundreds of features whose vocabularies span millions to billions of tokens, representing each feature value as a $d$-dimensional embedding introduces hundreds of billions of parameters, which becomes a performance bottleneck.
Method: Propose Feature Multiplexing, a simple yet highly effective framework in which a single representation space is shared by many different categorical features.
Results: Theoretical and empirical analysis shows that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features, and that multiplexed representations give Pareto-optimal space-accuracy trade-offs on three public benchmark datasets. We further propose a highly practical approach, Unified Embedding, with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Compared to strong competitive baselines, Unified Embedding delivers significant improvements in offline and online metrics across five web-scale search, ads, and recommender systems.

Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a $d$-dimensional embedding, which introduces hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used for many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations give Pareto-optimal space-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
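
A minimal sketch of feature multiplexing: one shared embedding table serves several categorical features, with a per-feature hash salt so that identical token ids from different features land on different rows. The table size and the cheap multiplicative hash are assumptions; a production system would use a stronger hash family.

```python
import torch
import torch.nn as nn

class UnifiedEmbedding(nn.Module):
    def __init__(self, num_rows=2 ** 20, dim=32, num_features=4):
        super().__init__()
        self.table = nn.Embedding(num_rows, dim)          # single shared space
        self.salts = torch.randint(1, 2 ** 31 - 1, (num_features,))
        self.num_rows = num_rows

    def forward(self, feature_idx, token_ids):
        h = (token_ids * self.salts[feature_idx]) % self.num_rows  # salted hash
        return self.table(h)

emb = UnifiedEmbedding()
user_country = emb(0, torch.tensor([17, 42]))     # feature 0
item_category = emb(1, torch.tensor([17, 42]))    # same ids, different feature -> rows
print(user_country.shape, torch.allclose(user_country, item_category))
```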

Minimum-Risk Recalibration of Classifiers
Zeyu Sun Dogyoon Song Alfred Hero



Research question: This paper addresses the recalibration of probabilistic classifiers to enhance the reliability and accuracy of predictive models.
Motivation: Despite the development of numerous recalibration algorithms, there is still no comprehensive theory that integrates calibration and sharpness (which is essential for maintaining predictive power).
Method: Introduce the concept of minimum-risk recalibration within a mean-squared-error (MSE) decomposition framework, providing a principled approach for evaluating and recalibrating probabilistic classifiers.
Results: Using this framework, we analyze the uniform-mass binning (UMB) recalibration method and establish a finite-sample risk upper bound of order $\tilde{O}(B/n + 1/B^2)$, where $B$ is the number of bins and $n$ is the sample size. We also propose a two-stage approach to the label shift challenge that adjusts the recalibration function using limited labeled data from the target domain; the results show that transferring a calibrated classifier requires significantly fewer target samples than recalibrating from scratch.

Recalibrating probabilistic classifiers is vital for enhancing the reliability and accuracy of predictive models. Despite the development of numerous recalibration algorithms, there is still a lack of a comprehensive theory that integrates calibration and sharpness (which is essential for maintaining predictive power). In this paper, we introduce the concept of minimum-risk recalibration within the framework of mean-squared-error (MSE) decomposition, offering a principled approach for evaluating and recalibrating probabilistic classifiers. Using this framework, we analyze the uniform-mass binning (UMB) recalibration method and establish a finite-sample risk upper bound of order $\tilde{O}(B/n + 1/B^2)$ where $B$ is the number of bins and $n$ is the sample size. By balancing calibration and sharpness, we further determine that the optimal number of bins for UMB scales with $n^{1/3}$, resulting in a risk bound of approximately $O(n^{-2/3})$. Additionally, we tackle the challenge of label shift by proposing a two-stage approach that adjusts the recalibration function using limited labeled data from the target domain. Our results show that transferring a calibrated classifier requires significantly fewer target samples compared to recalibrating from scratch. We validate our theoretical findings through numerical simulations, which confirm the tightness of the proposed bounds, the optimal number of bins, and the effectiveness of label shift adaptation.
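
Uniform-mass binning with the paper's $B \sim n^{1/3}$ scaling is short enough to sketch end to end; the synthetic scores below stand in for a real classifier's outputs on a held-out calibration set.

```python
import numpy as np

def umb_fit(scores, labels):
    n = len(scores)
    B = max(1, round(n ** (1 / 3)))               # the paper's bin-count scaling
    edges = np.quantile(scores, np.linspace(0, 1, B + 1))  # equal-mass bin edges
    edges[0], edges[-1] = 0.0, 1.0
    bins = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, B - 1)
    values = np.array([labels[bins == b].mean() if (bins == b).any() else 0.5
                       for b in range(B)])        # empirical accuracy per bin
    return edges, values

def umb_predict(edges, values, scores):
    b = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, len(values) - 1)
    return values[b]

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)
labels = rng.uniform(size=1000) < scores ** 2     # deliberately miscalibrated scores
edges, values = umb_fit(scores, labels)
print(umb_predict(edges, values, np.array([0.2, 0.5, 0.9])))
```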

Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases
Mazda Moayeri Wenxiao Wang Sahil Singla Soheil Feizi



Research question: How to measure and mitigate model biases caused by reliance on spurious cues.
Motivation: Models easily pick up spurious cues during training, biasing their predictions on certain samples.
Method: Propose a simple yet effective method that proxies spuriosity (the degree to which common spurious cues are present) via the deep neural features of an interpretable network and ranks images within their classes accordingly. With these rankings, minority subpopulations (low-spuriosity images) can be identified, and model bias can be assessed as the accuracy gap between high- and low-spuriosity images. Finally, fine-tuning the classification head on low-spuriosity images effectively removes a model's bias at little cost to accuracy, yielding fairer treatment of samples regardless of spuriosity.
Results: Experiments on ImageNet identify 630 spurious feature dependencies, and bias is assessed for 89 diverse models. The results show that model bias from spurious feature reliance is influenced far more by what the model is trained on than by how it is trained.

We present a simple but effective method to measure and mitigate model biases caused by reliance on spurious cues. Instead of requiring costly changes to one's data or model training, our method better utilizes the data one already has by sorting them. Specifically, we rank images within their classes based on spuriosity (the degree to which common spurious cues are present), proxied via deep neural features of an interpretable network. With spuriosity rankings, it is easy to identify minority subpopulations (i.e. low spuriosity images) and assess model bias as the gap in accuracy between high and low spuriosity images. One can even efficiently remove a model's bias at little cost to accuracy by finetuning its classification head on low spuriosity images, resulting in fairer treatment of samples regardless of spuriosity. We demonstrate our method on ImageNet, annotating $5000$ class-feature dependencies ($630$ of which we find to be spurious) and generating a dataset of $325k$ soft segmentations for these features along the way. Having computed spuriosity rankings via the identified spurious neural features, we assess biases for $89$ diverse models and find that class-wise biases are highly correlated across models. Our results suggest that model bias due to spurious feature reliance is influenced far more by what the model is trained on than how it is trained.
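
A minimal sketch of the ranking-and-gap computation on synthetic data: rank images by the mean activation of features flagged as spurious, then report the accuracy gap between the top and bottom quartiles. The feature set and the synthetic correlation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
spurious_acts = rng.gamma(2.0, 1.0, size=(n, 4))  # activations of 4 "spurious" features
correct = rng.uniform(size=n) < 0.6 + 0.08 * (spurious_acts.mean(1) > 2.0)

spuriosity = spurious_acts.mean(axis=1)           # per-image spuriosity score
order = np.argsort(spuriosity)
low, high = order[: n // 4], order[-n // 4:]      # bottom / top quartiles
bias_gap = correct[high].mean() - correct[low].mean()
print(f"accuracy gap (high - low spuriosity): {bias_gap:.3f}")
# Mitigation per the paper: finetune only the classification head on the
# low-spuriosity images (order[:k]) so predictions rely less on the cues.
```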

Scaling Open-Vocabulary Object Detection
Matthias Minderer Alexey A. Gritsenko Neil Houlsby



Research question: How to leverage pretrained vision-language models for open-vocabulary object detection and overcome the limited amount of available detection training data.
Motivation: Although detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining.
Method: We scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations for image-text pairs. The main challenges are the choice of label space, pseudo-annotation filtering, and training efficiency; we propose the OWLv2 model and the OWL-ST self-training recipe to address them.
Results: OWLv2 surpasses previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). With OWL-ST, we can scale to over 1B examples and further improve performance substantially: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes from 31.2% to 44.6% (a 43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modeling.

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling. Code and checkpoints are available on GitHub.

Imitation Learning from Imperfection: Theoretical Justifications and Algorithms
Ziniu Li Tian Xu Zeyu Qin Yang Yu Zhi-Quan Luo



Research question: How imitation learning algorithms can be improved with supplementary data when expert data is limited.
Motivation: Existing offline imitation learning methods have limitations; a more effective way of handling imperfect datasets collected from sub-optimal policies is needed.
Method: A new importance-sampling-based data selection technique for selecting data that lies within the expert distribution.
Results: Theoretical analysis and experiments show that the method eliminates the deficiency of naively applying behavioral cloning to the combined expert and supplementary data, and outperforms existing methods on tasks including robotics locomotion control, Atari video games, and image classification.

Imitation learning (IL) algorithms excel in acquiring high-quality policies from expert data for sequential decision-making tasks. But, their effectiveness is hampered when faced with limited expert data. To tackle this challenge, a novel framework called (offline) IL with supplementary data has emerged, which enhances learning by incorporating an additional yet imperfect dataset obtained inexpensively from sub-optimal policies. Nonetheless, learning becomes challenging due to the potential inclusion of out-of-expert-distribution samples. In this work, we pioneer the mathematical formalization of this framework, uncovering its limitations. Our theoretical analysis reveals that a naive approach—applying the behavioral cloning (BC) algorithm concept to the combined set of expert and supplementary data—may fall short of vanilla BC, which solely relies on expert data. This deficiency arises due to the distribution shift between the two data sources. To address this issue, we propose a new importance-sampling-based technique for selecting data within the expert distribution. We prove that the proposed method theoretically eliminates the gap of the naive approach, highlighting its efficacy when handling imperfect data. Empirical studies demonstrate that our method outperforms previous state-of-the-art methods in tasks including robotics locomotion control, Atari video games, and image classification. Overall, our work underscores the potential of improving IL by leveraging diverse data sources through effective data selection.
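
The data-selection idea can be pictured with a standard density-ratio trick. The sketch below is a hedged stand-in, not the paper's estimator: a probabilistic discriminator between expert and supplementary samples yields an importance weight for each supplementary sample, and only the most expert-like ones are kept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_in_expert_distribution(expert_x, supp_x, keep_ratio=0.5):
    """Keep the supplementary samples with the largest estimated density
    ratio p_expert(x) / p_supp(x), obtained from a discriminator."""
    X = np.vstack([expert_x, supp_x])
    z = np.concatenate([np.ones(len(expert_x)), np.zeros(len(supp_x))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p = clf.predict_proba(supp_x)[:, 1]
    ratio = p / (1.0 - p + 1e-8)              # density-ratio estimate
    k = int(keep_ratio * len(supp_x))
    keep = np.argsort(ratio)[-k:]             # most expert-like samples
    return supp_x[keep]

# Toy usage: supplementary data drawn from a shifted (sub-optimal) policy.
rng = np.random.default_rng(2)
expert = rng.normal(0.0, 1.0, size=(300, 4))
supp = rng.normal(1.5, 1.0, size=(600, 4))
print(select_in_expert_distribution(expert, supp).shape)
```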

Leveraging sparse and shared feature activations for disentangled representation learning
Marco Fumero Florian Wenzel Luca Zancato Alessandro Achille Emanuele Rodolà Stefano Soatto Bernhard Schölkopf Francesco Locatello



Research question: How to learn a common disentangled representation by extracting knowledge from a diverse set of supervised tasks.
Motivation: Existing approaches for recovering the latent factors of variation of high-dimensional data focus mostly on simple synthetic settings and have missed the positive implications for representation learning on real-world data.
Method: We propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation. Assuming each supervised task depends only on an unknown subset of the factors of variation, we disentangle the feature space of a supervised multi-task model, with features activating sparsely across different tasks and information being shared as appropriate.
Results: We validate the approach on six real-world distribution shift benchmarks and different data modalities (images, text), demonstrating how disentangled representations can transfer to real settings.

Recovering the latent factors of variation of high dimensional data has so far focused on simple synthetic settings. Mostly building on unsupervised and weakly-supervised objectives, prior work missed out on the positive implications for representation learning on real world data. In this work, we propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation. Assuming each supervised task only depends on an unknown subset of the factors of variation, we disentangle the feature space of a supervised multi-task model, with features activating sparsely across different tasks and information being shared as appropriate. Importantly, we never directly observe the factors of variations but establish that access to multiple tasks is sufficient for identifiability under sufficiency and minimality assumptions. We validate our approach on six real world distribution shift benchmarks, and different data modalities (images, text), demonstrating how disentangled representations can be transferred to real settings.

The Pursuit of Human Labeling: A New Perspective on Unsupervised Learning
Artyom Gadetsky Maria Brbic



Research question: This paper proposes HUME, a model-agnostic framework for inferring the human labeling of a given dataset without any external supervision.
Motivation: The key insight guiding the approach is that classes defined by many human labelings are linearly separable regardless of the representation space used to represent the dataset.
Method: HUME uses this insight to guide the search over all possible labelings of a dataset to discover an underlying human labeling. Only linear classifiers are trained on top of pretrained representations that remain fixed during training, making the framework compatible with any large pretrained and self-supervised model.
Results: Experiments show that HUME outperforms a supervised linear classifier on top of self-supervised representations on STL-10 by a large margin and achieves comparable performance on CIFAR-10. Compared to existing unsupervised baselines, HUME achieves state-of-the-art performance on four benchmark image classification datasets, including the large-scale ImageNet-1000 dataset.

We present HUME, a simple model-agnostic framework for inferring human labeling of a given dataset without any external supervision. The key insight behind our approach is that classes defined by many human labelings are linearly separable regardless of the representation space used to represent a dataset. HUME utilizes this insight to guide the search over all possible labelings of a dataset to discover an underlying human labeling. We show that the proposed optimization objective is strikingly well-correlated with the ground truth labeling of the dataset. In effect, we only train linear classifiers on top of pretrained representations that remain fixed during training, making our framework compatible with any large pretrained and self-supervised model. Despite its simplicity, HUME outperforms a supervised linear classifier on top of self-supervised representations on the STL-10 dataset by a large margin and achieves comparable performance on the CIFAR-10 dataset. Compared to the existing unsupervised baselines, HUME achieves state-of-the-art performance on four benchmark image classification datasets including the large-scale ImageNet-1000 dataset. Altogether, our work provides a fundamentally new view to tackle unsupervised learning by searching for consistent labelings between different representation spaces.
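
A toy rendering of the underlying idea follows, under the simplifying assumption that a candidate labeling is scored by how well linear probes trained on two different fixed representation spaces agree on held-out points (the paper optimizes a related generalization-based objective):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def labeling_score(reps_a, reps_b, labels):
    """Score a candidate labeling by the held-out agreement of linear
    probes trained on two fixed representation spaces."""
    ia, ib = train_test_split(np.arange(len(labels)), test_size=0.5, random_state=0)
    probe_a = LogisticRegression(max_iter=1000).fit(reps_a[ia], labels[ia])
    probe_b = LogisticRegression(max_iter=1000).fit(reps_b[ia], labels[ia])
    pa, pb = probe_a.predict(reps_a[ib]), probe_b.predict(reps_b[ib])
    return (pa == pb).mean()   # higher = labeling is consistent across spaces

# Toy usage: a labeling aligned with both spaces scores higher than noise.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=400)
reps_a = np.hstack([y[:, None] + 0.3 * rng.normal(size=(400, 1)), rng.normal(size=(400, 7))])
reps_b = np.hstack([2.0 * y[:, None] + 0.3 * rng.normal(size=(400, 1)), rng.normal(size=(400, 7))])
print(labeling_score(reps_a, reps_b, y),
      labeling_score(reps_a, reps_b, rng.integers(0, 2, size=400)))
```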

Rewiring Neurons in Non-Stationary Environments
Zhicheng Sun Yadong MU



Research question: How to harness the brain's neuroplasticity for network rewiring in continual reinforcement learning, in order to adapt to non-stationary environments.
Motivation: Existing rewiring approaches rely mainly on pruning or dynamic routing, which may limit network capacity and plasticity; this paper presents a new rewiring scheme that permutes hidden neurons.
Method: The neuron permutation is parameterized to be end-to-end learnable, so all available synapses can be rearranged to explore a large span of weight space, promoting adaptivity. Two main designs steer the rewiring process in continual reinforcement learning: first, a multi-mode rewiring strategy diversifies the policy and encourages exploration when new environments are encountered; second, to ensure stability on past tasks, the network caches each learned wiring while subtly updating its weights, allowing any previous state appropriate for a task to be recovered. In addition, an alignment mechanism jointly optimizes the cached wirings and weights for a better plasticity-stability tradeoff.
Results: A comprehensive evaluation on 18 continual reinforcement learning scenarios, ranging from locomotion to manipulation, demonstrates advantages over state-of-the-art competitors in performance-efficiency tradeoffs. Code is available at https://github.com/feifeiobama/RewireNeuron.

The human brain rewires itself for neuroplasticity in the presence of new tasks. We are inspired to harness this key process in continual reinforcement learning, prioritizing adaptation to non-stationary environments. In distinction to existing rewiring approaches that rely on pruning or dynamic routing, which may limit network capacity and plasticity, this work presents a novel rewiring scheme by permuting hidden neurons. Specifically, the neuron permutation is parameterized to be end-to-end learnable and can rearrange all available synapses to explore a large span of weight space, thereby promoting adaptivity. In addition, we introduce two main designs to steer the rewiring process in continual reinforcement learning: first, a multi-mode rewiring strategy is proposed which diversifies the policy and encourages exploration when encountering new environments. Secondly, to ensure stability on history tasks, the network is devised to cache each learned wiring while subtly updating its weights, allowing for retrospective recovery of any previous state appropriate for the task. Meanwhile, an alignment mechanism is curated to achieve better plasticity-stability tradeoff by jointly optimizing cached wirings and weights. Our proposed method is comprehensively evaluated on 18 continual reinforcement learning scenarios ranging from locomotion to manipulation, demonstrating its advantages over state-of-the-art competitors in performance-efficiency tradeoffs. Code is available at https://github.com/feifeiobama/RewireNeuron.

Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels
Zebin You Yong Zhong Fan Bao Jiacheng Sun Chongxuan Li Jun Zhu



Research question: How to further advance semi-supervised generative and classification tasks?
Motivation: Existing semi-supervised learning methods perform poorly with very few labels; more effective training strategies are needed.
Method: A strategy called dual pseudo training (DPT) with three stages: train a classifier on partially labeled data to predict pseudo-labels; train a conditional generative model on these pseudo-labels to generate pseudo images; and retrain the classifier on a mix of real and pseudo images.
Results: Experiments show that DPT achieves state-of-the-art semi-supervised generation and classification performance across various settings. In particular, with one or two labels per class, DPT reaches a Fréchet Inception Distance (FID) of 3.08 or 2.52 on ImageNet $256\times256$. DPT also substantially outperforms competitive semi-supervised baselines on ImageNet classification, achieving top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0) with one, two, or five labels per class, respectively.

In an effort to further advance semi-supervised generative and classification tasks, we propose a simple yet effective training strategy called *dual pseudo training* (DPT), built upon strong semi-supervised learners and diffusion models. DPT operates in three stages: training a classifier on partially labeled data to predict pseudo-labels; training a conditional generative model using these pseudo-labels to generate pseudo images; and retraining the classifier with a mix of real and pseudo images. Empirically, DPT consistently achieves SOTA performance of semi-supervised generation and classification across various settings. In particular, with one or two labels per class, DPT achieves a Fréchet Inception Distance (FID) score of 3.08 or 2.52 on ImageNet $256\times256$. Besides, DPT outperforms competitive semi-supervised baselines substantially on ImageNet classification tasks, *achieving top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0)* with one, two, or five labels per class, respectively. Notably, our results demonstrate that diffusion can generate realistic images with only a few labels (e.g., $<0.1$%) and generative augmentation remains viable for semi-supervised classification. Our code is available at *https://github.com/ML-GSAI/DPT*.

Rank-N-Contrast: Learning Continuous Representations for Regression
Kaiwen Zha Peng Cao Jeany Son Yuzhe Yang Dina Katabi



Research question: Deep regression models typically learn end-to-end without explicitly emphasizing a regression-aware representation; the learned representations are fragmented and fail to capture the continuous nature of sample orders, producing suboptimal results across a wide range of regression tasks.
Motivation: To fill this gap, we propose the Rank-N-Contrast (RNC) framework, which learns continuous representations for regression by contrasting samples based on their rankings in the target space.
Method: RNC contrasts samples against each other according to their rankings in the target space to learn continuous, regression-aware representations.
Results: Theory and experiments show that RNC guarantees that the order of the learned representations matches the target order, yielding not only better performance but also significantly improved robustness, efficiency, and generalization. Extensive experiments on five real-world regression datasets spanning computer vision, human-computer interaction, and healthcare verify that RNC achieves state-of-the-art performance, with better data efficiency, robustness to spurious targets and data corruptions, and generalization under distribution shift.

Deep regression models typically learn in an end-to-end fashion without explicitly emphasizing a regression-aware representation. Consequently, the learned representations exhibit fragmentation and fail to capture the continuous nature of sample orders, inducing suboptimal results across a wide range of regression tasks. To fill the gap, we propose Rank-N-Contrast (RNC), a framework that learns continuous representations for regression by contrasting samples against each other based on their rankings in the target space. We demonstrate, theoretically and empirically, that RNC guarantees the desired order of learned representations in accordance with the target orders, enjoying not only better performance but also significantly improved robustness, efficiency, and generalization. Extensive experiments using five real-world regression datasets that span computer vision, human-computer interaction, and healthcare verify that RNC achieves state-of-the-art performance, highlighting its intriguing properties including better data efficiency, robustness to spurious targets and data corruptions, and generalization to distribution shifts.
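
To make contrast-by-rank concrete, here is a simplified NumPy evaluation of an RNC-style loss, under one reading of the objective (not the reference implementation): for an anchor $i$ and positive $j$, the normalizing set contains every sample whose label distance to $i$ is at least $|y_i - y_j|$.

```python
import numpy as np

def rnc_style_loss(z, y, temperature=0.1):
    """Evaluate a Rank-N-Contrast-style loss on embeddings z with
    continuous targets y (O(n^2) loop, for exposition only)."""
    sim = z @ z.T / temperature
    dist = np.abs(y[:, None] - y[None, :])
    loss, count = 0.0, 0
    for i in range(len(y)):
        for j in range(len(y)):
            if i == j:
                continue
            mask = dist[i] >= dist[i, j]   # samples ranked at least as far as j
            mask[i] = False
            loss -= sim[i, j] - np.log(np.exp(sim[i][mask]).sum())
            count += 1
    return loss / count

# An order-preserving embedding should score lower than a shuffled one.
rng = np.random.default_rng(4)
y = rng.uniform(size=32)
z = np.stack([y, 1.0 - y], axis=1) + 0.01 * rng.normal(size=(32, 2))
print(rnc_style_loss(z, y), rnc_style_loss(z[rng.permutation(32)], y))
```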

Promises and Pitfalls of Threshold-based Auto-labeling
Harit Vishwakarma Heguang Lin Frederic Sala Ramya Korlakai Vinayak



Research question: How to reduce the reliance of supervised machine learning workflows on large-scale high-quality labeled datasets.
Motivation: Threshold-based auto-labeling (TBAL) can reduce reliance on manual annotation, but it requires a substantial amount of human-labeled validation data to guarantee the quality of machine-labeled data.
Method: Analyze TBAL systems and derive sample complexity bounds on the human-labeled validation data required, in order to understand when the data obtained from such auto-labeling systems can be relied upon.
Results: Experiments find that seemingly bad models can automatically and accurately label reasonable chunks of unlabeled data, while also revealing both the promise and the potential pitfalls of TBAL systems.

Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
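
The core mechanism is compact enough to state in code: search human-labeled validation data for a confidence threshold above which a target accuracy is met, then machine-label only the unlabeled points exceeding it. A minimal sketch of the threshold search (the bounds in the paper govern how much validation data this needs):

```python
import numpy as np

def find_auto_label_threshold(conf_val, correct_val, target_acc=0.95):
    """Return the loosest confidence threshold whose above-threshold
    validation accuracy is at least target_acc (None if impossible)."""
    order = np.argsort(conf_val)[::-1]            # most confident first
    conf, correct = conf_val[order], correct_val[order]
    acc_above = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    ok = np.where(acc_above >= target_acc)[0]
    return conf[ok[-1]] if len(ok) else None

# Toy usage with a roughly calibrated model.
rng = np.random.default_rng(5)
conf = rng.uniform(size=2000)
correct = (rng.uniform(size=2000) < conf).astype(float)
print(find_auto_label_threshold(conf, correct))
```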

MGDD: A Meta Generator for Fast Dataset Distillation
Songhua Liu Xinchao Wang



Research question: Existing dataset distillation (DD) techniques typically rely on iterative strategies to synthesize condensed datasets; their time efficiency remains unsatisfactory, and when synthetic datasets of different sizes are required, the iterative training must be repeated, which is cumbersome and inflexible.
Motivation: To address the poor time efficiency and lack of flexibility of existing DD methods, this paper proposes a generative approach to dataset distillation.
Method: A generator network produces synthetic samples conditioned on the initialization of DD, while synthetic labels are obtained by solving a least-squares problem in a feature space. A meta-learning algorithm is also proposed to find a satisfactory generator efficiently.
Results: Experiments show that a generator adapted with only a limited number of steps performs on par with state-of-the-art DD methods while achieving a $22\times$ speedup.

Existing dataset distillation (DD) techniques typically rely on iterative strategies to synthesize condensed datasets, where datasets before and after distillation are passed forward and backward through neural networks a massive number of times. Despite the promising results achieved, the time efficiency of prior approaches is still far from satisfactory. Moreover, when different sizes of synthetic datasets are required, they have to repeat the iterative training procedures, which is highly cumbersome and lacks flexibility. In this paper, different from the time-consuming forward-backward passes, we introduce a generative fashion for dataset distillation with significantly improved efficiency. Specifically, synthetic samples are produced by a generator network conditioned on the initialization of DD, while synthetic labels are obtained by solving a least-squares problem in a feature space. Our theoretical analysis reveals that the errors of synthetic datasets solved in the original space and then processed by any conditional generators are upper-bounded. To find a satisfactory generator efficiently, we propose a meta-learning algorithm, where a meta generator is trained on a large dataset so that only a few steps are required to adapt to a target dataset. The meta generator is termed MGDD in our approach. Once adapted, it can handle arbitrary sizes of synthetic datasets, even for those unseen during adaptation. Experiments demonstrate that the generator adapted with only a limited number of steps performs on par with state-of-the-art DD methods and yields $22\times$ acceleration.
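
The least-squares label step admits a short sketch. Everything below is a hedged guess at the setup (the paper's exact feature space and objective may differ): synthetic labels are chosen so that a ridge model fitted on the synthetic set predicts the real labels well.

```python
import numpy as np

def solve_synthetic_labels(feat_syn, feat_real, y_real, reg=1e-3):
    """Pick synthetic labels minimizing the error of a kernel-ridge model
    (fit on the synthetic set) on the real data, via least squares."""
    k_ss = feat_syn @ feat_syn.T + reg * np.eye(len(feat_syn))
    a = (feat_real @ feat_syn.T) @ np.linalg.inv(k_ss)  # synthetic labels -> real predictions
    y_syn, *_ = np.linalg.lstsq(a, y_real, rcond=None)  # argmin ||a @ y_syn - y_real||^2
    return y_syn

rng = np.random.default_rng(6)
feat_syn = rng.normal(size=(10, 16))    # 10 synthetic samples, 16-d features
feat_real = rng.normal(size=(200, 16))  # 200 real samples
y_real = np.eye(3)[rng.integers(0, 3, 200)]
print(solve_synthetic_labels(feat_syn, feat_real, y_real).shape)  # (10, 3)
```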

Proximity-Informed Calibration for Deep Neural Networks
Miao Xiong Ailin Deng Pang Wei Koh Jiaying Wu Shen Li Jianqing Xu Bryan Hooi



Research question: Existing calibration algorithms often overlook proximity bias: models are more overconfident on low-proximity data (data lying in sparse regions of the data distribution) than on high-proximity samples, and thus suffer from inconsistent miscalibration across samples of different proximity.
Motivation: Examining 504 pretrained ImageNet models, we find that proximity bias exists across a wide variety of model architectures and sizes, and that Transformer-based models are more susceptible to it than CNN-based models.
Method: We propose ProCal, a plug-and-play algorithm with a theoretical guarantee that adjusts sample confidence based on proximity. We also introduce the proximity-informed expected calibration error (PIECE) to further quantify how effectively calibration algorithms mitigate proximity bias.
Results: Experiments show that ProCal effectively addresses proximity bias and improves calibration in balanced, long-tail, and distribution-shift settings. We believe the findings on proximity bias will guide the development of fairer and better-calibrated models, contributing to the pursuit of trustworthy AI.

Confidence calibration is central to providing accurate and interpretable uncertainty estimates, especially under safety-critical scenarios. However, we find that existing calibration algorithms often overlook the issue of proximity bias, a phenomenon where models tend to be more overconfident in low proximity data (i.e., data lying in the sparse region of the data distribution) compared to high proximity samples, and thus suffer from inconsistent miscalibration across different proximity samples. We examine the problem over $504$ pretrained ImageNet models and observe that: 1) Proximity bias exists across a wide variety of model architectures and sizes; 2) Transformer-based models are relatively more susceptible to proximity bias than CNN-based models; 3) Proximity bias persists even after performing popular calibration algorithms like temperature scaling; 4) Models tend to overfit more heavily on low proximity samples than on high proximity samples. Motivated by the empirical findings, we propose ProCal, a plug-and-play algorithm with a theoretical guarantee to adjust sample confidence based on proximity. To further quantify the effectiveness of calibration algorithms in mitigating proximity bias, we introduce proximity-informed expected calibration error (PIECE) with theoretical analysis. We show that ProCal is effective in addressing proximity bias and improving calibration on balanced, long-tail, and distribution-shift settings under four metrics over various model architectures. We believe our findings on proximity bias will guide the development of fairer and better-calibrated models, contributing to the broader pursuit of trustworthy AI.
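
For intuition, proximity can be summarized by the distance of a sample to its nearest neighbors in an embedding space; the sketch below uses the negative mean k-NN distance as the proxy, which is an assumption of this illustration rather than the paper's exact definition.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def proximity(embeddings, k=10):
    """Negative mean distance to the k nearest neighbors: low values mark
    samples in sparse regions of the data distribution."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dist, _ = nn.kneighbors(embeddings)
    return -dist[:, 1:].mean(axis=1)   # column 0 is the self-distance

rng = np.random.default_rng(7)
emb = rng.normal(size=(500, 32))
prox = proximity(emb)
low = prox < np.quantile(prox, 0.2)    # "low proximity" samples
print(low.sum(), "low-proximity samples")
```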

Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations
Hammaad Adam Fan Yin Mary Hu Neil Tenenholtz Lorin Crawford Lester Mackey Allison Koenecke



Research question: Randomized experiments often need to be stopped prematurely because the treatment has unintended harmful effects. Existing methods are typically applied to the data in aggregate and do not account for treatment effect heterogeneity.
Motivation: Current methods often fail to stop experiments when the treatment harms a minority group of participants.
Method: Use causal machine learning to develop CLASH, the first broadly applicable method for heterogeneous early stopping.
Results: CLASH's performance on simulated and real data shows that it yields effective early stopping for both clinical trials and A/B tests.

Randomized experiments often need to be stopped prematurely due to the treatment having an unintended harmful effect. Existing methods that determine when to stop an experiment early are typically applied to the data in aggregate and do not account for treatment effect heterogeneity. In this paper, we study the early stopping of experiments for harm on heterogeneous populations. We first establish that current methods often fail to stop experiments when the treatment harms a minority group of participants. We then use causal machine learning to develop CLASH, the first broadly-applicable method for heterogeneous early stopping. We demonstrate CLASH's performance on simulated and real data and show that it yields effective early stopping for both clinical trials and A/B tests.

Conditional Mutual Information for Disentangled Representations in Reinforcement Learning
Mhairi Dunion Trevor McInroe Kevin Sebastian Luck Josiah P. Hanna Stefano V Albrecht



Research question: Reinforcement learning environments can produce misleading correlations between features, due to the amount of training data or its limited feature coverage. RL agents may encode these misleading correlations in their latent representations, preventing generalization when the correlation changes within the environment or when deployed in the real world.
Motivation: Existing disentanglement techniques require independent features in order to minimize the mutual information between features, so they cannot disentangle correlated features. We propose an auxiliary task that lets RL algorithms learn disentangled representations of high-dimensional observations with correlated features.
Method: An auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimizing the conditional mutual information between features in the representation.
Results: Experiments show that the approach improves generalization under correlation shifts and improves the training performance of RL algorithms in the presence of correlated features.

Reinforcement Learning (RL) environments can produce training data with spurious correlations between features due to the amount of training data or its limited feature coverage. This can lead to RL agents encoding these misleading correlations in their latent representation, preventing the agent from generalising if the correlation changes within the environment or when deployed in the real world. Disentangled representations can improve robustness, but existing disentanglement techniques that minimise mutual information between features require independent features, thus they cannot disentangle correlated features. We propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation. We demonstrate experimentally, using continuous control tasks, that our approach improves generalisation under correlation shifts, as well as improving the training performance of RL algorithms in the presence of correlated features.

Subspace Identification for Multi-Source Domain Adaptation
Zijian Li Ruichu Cai Guangyi Chen Boyang Sun Zhifeng Hao Kun Zhang



Research question: This paper aims to relax the stringent assumptions that multi-source domain adaptation (MSDA) methods must satisfy in practice.
Motivation: Existing MSDA methods require strict conditions, such as an adequate number of domains, monotonic transformations of latent variables, and invariant label distributions, which are hard to satisfy in real applications.
Method: A subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints on the number of domains and the transformation properties, thereby facilitating adaptation by minimizing the impact of domain shift on the invariant variables. Based on this theory, a Subspace Identification Guarantee (SIG) model leveraging variational inference is developed; the SIG model further incorporates class-aware conditional alignment to accommodate target shifts where label distributions change with the domain.
Results: Experiments show that the SIG model outperforms existing MSDA techniques on various benchmark datasets, highlighting its effectiveness in real-world applications.

Multi-source domain adaptation (MSDA) methods aim to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Although current methods achieve target joint distribution identifiability by enforcing minimal changes across domains, they often necessitate stringent conditions, such as an adequate number of domains, monotonic transformation of latent variables, and invariant label distributions. These requirements are challenging to satisfy in real-world applications. To mitigate the need for these strict assumptions, we propose a subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints regarding domain numbers and transformation properties and thereby facilitating domain adaptation by minimizing the impact of domain shifts on invariant variables. Based on this theory, we develop a Subspace Identification Guarantee (SIG) model that leverages variational inference. Furthermore, the SIG model incorporates class-aware conditional alignment to accommodate target shifts where label distributions change with the domain. Experimental results demonstrate that our SIG model outperforms existing MSDA techniques on various benchmark datasets, highlighting its effectiveness in real-world applications.

Alleviating the Semantic Gap for Generalized fMRI-to-Image Reconstruction
Tao Fang Qian Zheng Gang Pan



Research question: Existing fMRI-to-image reconstruction methods suffer from a semantic gap between training and testing data, leading to reconstructions with unstable and uncertain semantics.
Motivation: Alleviate the semantic gap in fMRI-to-image reconstruction.
Method: A pretrained CLIP model maps the training data to a compact feature representation, extending the sparse semantics of the training data into dense ones and thus alleviating the semantic gap for instances near known concepts (i.e., inside the training super-classes). Inspired by the robust low-level representations in fMRI data, structural information is used as a general cue to guide image reconstruction for instances far from known concepts. Semantic uncertainty is quantified via probability density estimation, and the expanded semantics and structural information are adaptively integrated within a diffusion process (GESS).
Results: Experiments show that the proposed GESS model outperforms state-of-the-art methods; a generalized scenario split strategy is also proposed to evaluate GESS's advantage in closing the semantic gap.

Although existing fMRI-to-image reconstruction methods could predict high-quality images, they do not explicitly consider the semantic gap between training and testing data, resulting in reconstruction with unstable and uncertain semantics. This paper addresses the problem of generalized fMRI-to-image reconstruction by explicitly alleviating the semantic gap. Specifically, we leverage the pre-trained CLIP model to map the training data to a compact feature representation, which essentially extends the sparse semantics of training data to dense ones, thus alleviating the semantic gap of the instances nearby known concepts (i.e., inside the training super-classes). Inspired by the robust low-level representation in fMRI data, which could help alleviate the semantic gap for instances that are far from the known concepts (i.e., outside the training super-classes), we leverage structural information as a general cue to guide image reconstruction. Further, we quantify the semantic uncertainty based on probability density estimation and achieve Generalized fMRI-to-image reconstruction by adaptively integrating Expanded Semantics and Structural information (GESS) within a diffusion process. Experimental results demonstrate that the proposed GESS model outperforms state-of-the-art methods, and we propose a generalized scenario split strategy to evaluate the advantage of GESS in closing the semantic gap.

Episodic Multi-Task Learning with Heterogeneous Neural Processes
Jiayi Shen Xiantong Zhen Cheems Wang Marcel Worring



Research question: This paper addresses the data-insufficiency problem in multi-task learning within an episodic training setup.
Motivation: Existing meta-learning methods often fail to exploit the heterogeneous information within a single episode, while multi-task learning models neglect reusing experience from earlier episodes.
Method: We develop Heterogeneous Neural Processes (HNPs), which, within a hierarchical Bayesian framework, effectively capitalize on prior experiences as meta-knowledge and capture the relatedness among heterogeneous tasks, mitigating data insufficiency.
Results: Experiments show that HNPs outperform typical baselines on novel heterogeneous tasks, and ablation studies verify the effectiveness of the designed inference modules.

This paper focuses on the data-insufficiency problem in multi-task learning within an episodic training setup. Specifically, we explore the potential of heterogeneous information across tasks and meta-knowledge among episodes to effectively tackle each task with limited data. Existing meta-learning methods often fail to take advantage of crucial heterogeneous information in a single episode, while multi-task learning models neglect reusing experience from earlier episodes. To address the problem of insufficient data, we develop Heterogeneous Neural Processes (HNPs) for the episodic multi-task setup. Within the framework of hierarchical Bayes, HNPs effectively capitalize on prior experiences as meta-knowledge and capture task-relatedness among heterogeneous tasks, mitigating data-insufficiency. Meanwhile, transformer-structured inference modules are designed to enable efficient inferences toward meta-knowledge and task-relatedness. In this way, HNPs can learn more powerful functional priors for adapting to novel heterogeneous tasks in each meta-test episode. Experimental results show the superior performance of the proposed HNPs over typical baselines, and ablation studies verify the effectiveness of the designed inference modules.

Generalizing Importance Weighting to A Universal Solver for Distribution Shift Problems
Tongtong Fang Nan Lu Gang Niu Masashi Sugiyama



Research question: This paper addresses distribution shift (DS) where the supports of the training and test distributions differ, in particular when the test support is wider than, or only partially overlaps with, the training support.
Motivation: Existing methods handle the cases where the two distributions match exactly or the training support is wider, but perform poorly when the test support is wider or only partially overlapping.
Method: A generalized importance weighting (GIW) method that splits the test support into an in-training (IT) part and an out-of-training (OOT) part, and decomposes the expected risk into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW.
Results: Experiments show that GIW is a universal solver for DS problems, outperforming existing importance-weighting methods when the test support is wider or partially overlapping.

Distribution shift (DS) may have two levels: the distribution itself changes, and the support (i.e., the set where the probability density is non-zero) also changes. When considering the support change between the training and test distributions, there can be four cases: (i) they exactly match; (ii) the training support is wider (and thus covers the test support); (iii) the test support is wider; (iv) they partially overlap. Existing methods are good at cases (i) and (ii), while cases (iii) and (iv) are more common nowadays but still under-explored. In this paper, we generalize importance weighting (IW), a golden solver for cases (i) and (ii), to a universal solver for all cases. Specifically, we first investigate why IW might fail in cases (iii) and (iv); based on the findings, we propose generalized IW (GIW) that could handle cases (iii) and (iv) and would reduce to IW in cases (i) and (ii). In GIW, the test support is split into an in-training (IT) part and an out-of-training (OOT) part, and the expected risk is decomposed into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW. Then, the implementation of GIW consists of three components: (a) the split of validation data is carried out by the one-class support vector machine, (b) the first term of the empirical risk can be handled by any IW algorithm given training data and IT validation data, and (c) the second term just involves OOT validation data. Experiments demonstrate that GIW is a universal solver for DS problems, outperforming IW methods in cases (iii) and (iv).
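
The decomposition itself is mechanical, and the abstract names its ingredients: a one-class SVM fitted on training inputs performs the IT/OOT split, the IT term reuses any IW algorithm's weights on training data, and the OOT term is a plain loss on OOT validation data. The toy model and loss below are illustrative only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def giw_risk(train_x, train_y, val_x, val_y, weights, loss_fn):
    """Weighted IT term on training data + standard OOT term on the
    validation points falling outside the training support."""
    in_support = OneClassSVM(gamma="scale", nu=0.1).fit(train_x).predict(val_x) == 1
    it_term = np.mean(weights * loss_fn(train_x, train_y))
    oot = ~in_support
    oot_term = loss_fn(val_x[oot], val_y[oot]).mean() if oot.any() else 0.0
    return it_term + oot_term

w_true = np.array([1.0, -1.0])
def loss_fn(x, y):                      # squared loss of a fixed toy model
    return (x @ w_true - y) ** 2

rng = np.random.default_rng(8)
xtr = rng.normal(0.0, 1.0, size=(300, 2)); ytr = xtr @ w_true + 0.1 * rng.normal(size=300)
xva = rng.normal(1.0, 1.5, size=(200, 2)); yva = xva @ w_true + 0.1 * rng.normal(size=200)
print(giw_risk(xtr, ytr, xva, yva, np.ones(300), loss_fn))
```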

Invariant Learning via Probability of Sufficient and Necessary Causes
Mengyue Yang Yonggang Zhang Zhen Fang Yali Du Furui Liu Jean-Francois Ton Jianhong Wang Jun Wang



Research question: How to make models generalize under unknown test distributions, given that existing causality-based methods focus mainly on the invariance of causes while overlooking sufficiency and necessity conditions.
Motivation: To address this, we propose a method based on the probability of sufficient and necessary causes (PNS) to better capture the information carried by sufficient and necessary causes.
Method: We adopt the classical concept of the probability of sufficient and necessary causes (PNS), relate it to OOD generalization, propose a PNS risk, and design an algorithm that learns representations with high PNS values.
Results: Experiments on both synthetic and real-world benchmarks demonstrate the effectiveness of the method.

Out-of-distribution (OOD) generalization is indispensable for learning models in the wild, where the testing distribution is typically unknown and different from the training one. Recent methods derived from causality have shown great potential in achieving OOD generalization. However, existing methods mainly focus on the invariance property of causes, while largely overlooking the property of sufficiency and necessity conditions. Namely, a necessary but insufficient cause (feature) is invariant to distribution shift, yet it may not achieve the required accuracy. By contrast, a sufficient yet unnecessary cause (feature) tends to fit specific data well but may carry a risk when adapting to a new domain. To capture the information of sufficient and necessary causes, we employ a classical concept, the probability of sufficient and necessary causes (PNS), which indicates the probability that a feature is both a necessary and a sufficient cause. To associate PNS with OOD generalization, we propose the PNS risk and formulate an algorithm to learn representations with a high PNS value. We theoretically analyze and prove the generalizability of the PNS risk. Experiments on both synthetic and real-world benchmarks demonstrate the effectiveness of the proposed method. The detailed implementation can be found at the GitHub repository: https://github.com/ymy4323460/CaSN.
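
For reference, the classical quantity the paper builds on is Pearl's probability of necessity and sufficiency; recalling its textbook form for a binary cause $X$ and outcome $Y$ (standard notation, not taken from the paper):

$$\mathrm{PNS} = P\!\left(Y_{x}=1,\; Y_{x'}=0\right),$$

the probability that the outcome would occur under $do(X=x)$ and would not occur under $do(X=x')$. Under the monotonicity assumption ($Y_{x'} \le Y_{x}$), it is identifiable from interventional distributions as

$$\mathrm{PNS} = P\!\left(Y=1 \mid do(X=x)\right) - P\!\left(Y=1 \mid do(X=x')\right).$$

The paper's PNS risk builds on this quantity, favoring representations whose features attain high PNS values.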

Learning List-Level Domain-Invariant Representations for Ranking
Ruicheng Xian Honglei Zhuang Zhen Qin Hamed Zamani Jing Lu Ji Ma Kai Hui Han Zhao Xuanhui Wang Michael Bendersky



Research question: How to transfer knowledge from data-rich source domains to low-resource target domains, in particular for ranking problems.
Motivation: Although existing methods have been applied extensively to classification and regression problems, their adoption for ranking is sporadic and lacks theoretical justification.
Method: We propose list-level alignment: learning domain-invariant representations at the higher level of lists.
Results: The method yields the first domain adaptation generalization bound for ranking, providing theoretical support for the proposed approach, and it achieves better unsupervised domain adaptation transfer performance on ranking tasks, including passage reranking.

Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment—learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.

Semi-Supervised Domain Generalization with Known and Unknown Classes
Lei Zhang Ji-Fu Li Wei Wang



Research question: How to train a model that generalizes to unseen target domains when only a few labels are available?
Motivation: Existing semi-supervised domain generalization methods assume that unlabeled training and testing samples all belong to known classes, but in practice known classes may be mixed with unknown classes in the unlabeled training and testing data.
Method: The Class-Wise Adaptive Exploration and Exploitation (CWAEE) method: detect known and unknown classes using one-vs-rest classifiers with class-wise adaptive thresholds, and exploit them through consistency regularization on Fourier-Transformation-based augmented samples to improve generalization to unseen domains.
Results: Experiments on real-world datasets verify the effectiveness and superiority of the method.

Semi-Supervised Domain Generalization (SSDG) aims to learn a model that is generalizable to an unseen target domain with only a few labels, and most existing SSDG methods assume that unlabeled training and testing samples are all known classes. However, a more realistic scenario is that known classes may be mixed with some unknown classes in unlabeled training and testing data. To deal with such a scenario, we propose the Class-Wise Adaptive Exploration and Exploitation (CWAEE) method. In particular, we explore unlabeled training data by using one-vs-rest classifiers and class-wise adaptive thresholds to detect known and unknown classes, and exploit them by adopting consistency regularization on augmented samples based on Fourier Transformation to improve the unseen domain generalization. The experiments conducted on real-world datasets verify the effectiveness and superiority of our method.

ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets
Damien Teney LIN Yong Seong Joon Oh Ehsan Abbasnejad



Research question: This paper compares the in-distribution (ID) and out-of-distribution (OOD) performance of models in computer vision and NLP and examines how the two are correlated.
Motivation: Past studies report that ID and OOD performance are usually positively correlated, but in some cases the relationship can invert. Whether such inverse patterns occur is crucial for determining whether ID performance can serve as a proxy for OOD generalization.
Method: Through an analysis of multiple real-world datasets, the paper shows that inverse correlations between ID and OOD performance do occur, and not only in artificial worst-case settings. It also explains theoretically how these cases arise and why past studies missed them because of biased model selection methodologies.
Results: The observations lead to recommendations that contradict much of the current literature: high OOD performance sometimes requires trading off ID performance; focusing on ID performance alone may not yield optimal OOD performance and can even produce diminishing OOD returns; and studies on OOD generalization that select models by ID performance will necessarily miss the best-performing models, leaving them blind to a whole range of phenomena.

Several studies have compared the in-distribution (ID) and out-of-distribution (OOD) performance of models in computer vision and NLP. They report a frequent positive correlation, and some surprisingly never observe an inverse correlation, which would indicate a necessary trade-off. The possibility of inverse patterns is important to determine whether ID performance can serve as a proxy for OOD generalization capabilities. This paper shows that inverse correlations between ID and OOD performance do happen with multiple real-world datasets, not only in artificial worst-case settings. We explain theoretically how these cases arise and how past studies missed them because of improper methodologies that examined a biased selection of models. Our observations lead to recommendations that contradict those found in much of the current literature: high OOD performance sometimes requires trading off ID performance; focusing on ID performance alone may not lead to optimal OOD performance and may produce diminishing (eventually negative) returns in OOD performance; and, in these cases, studies on OOD generalization that use ID performance for model selection (a common recommended practice) will necessarily miss the best-performing models, making these studies blind to a whole range of phenomena.

A Deep Instance Generative Framework for MILP Solvers Under Limited Data Availability
Zijie Geng Xijun Li Jie Wang Xiao Li Yongdong Zhang Feng Wu



Research question: Existing mixed-integer linear programming (MILP) instance generation techniques rely heavily on expert-designed formulations or struggle to capture the rich features of real-world instances.
Motivation: The limited availability of real-world instances leads to sub-optimal decisions and biased solver assessments, motivating synthetic MILP instance generation.
Method: We propose G2MILP, the first deep generative framework for MILP instances. G2MILP represents MILP instances as bipartite graphs and applies a masked variational autoencoder that iteratively corrupts and replaces parts of the original graphs to generate new ones.
Results: Experiments show that the method generates novel instances resembling real-world datasets in both structure and computational hardness, without prior expert-designed formulations. The generated instances can enhance MILP solvers under limited data availability.

In the past few years, there has been an explosive surge in the use of machine learning (ML) techniques to address combinatorial optimization (CO) problems, especially mixed-integer linear programs (MILPs). Despite the achievements, the limited availability of real-world instances often leads to sub-optimal decisions and biased solver assessments, which motivates a suite of synthetic MILP instance generation techniques. However, existing methods either rely heavily on expert-designed formulations or struggle to capture the rich features of real-world instances. To tackle this problem, we propose G2MILP, *the first* deep generative framework for MILP instances. Specifically, G2MILP represents MILP instances as bipartite graphs, and applies a masked variational autoencoder to iteratively corrupt and replace parts of the original graphs to generate new ones. The appealing feature of G2MILP is that it can learn to generate novel and realistic MILP instances without prior expert-designed formulations, while preserving the structures and computational hardness of real-world datasets, simultaneously. Thus the generated instances can facilitate downstream tasks for enhancing MILP solvers under limited data availability. We design a suite of benchmarks to evaluate the quality of the generated MILP instances. Experiments demonstrate that our method can produce instances that closely resemble real-world datasets in terms of both structures and computational hardness. The deliverables are released at [https://miralab-ustc.github.io/L2O-G2MILP](https://miralab-ustc.github.io/L2O-G2MILP).

A Graph-Theoretic Framework for Understanding Open-World Semi-Supervised Learning
Yiyou Sun Zhenmei Shi Yixuan Li



Research question: This paper fills the theoretical gap in open-world semi-supervised learning, which infers both known and novel classes in unlabeled data by harnessing prior knowledge from a labeled set of known classes.
Motivation: Despite its importance, the problem has lacked theoretical foundations.
Method: The problem is formalized in a graph-theoretic framework tailored to the open-world setting, in which clustering can be theoretically characterized by graph factorization. Based on this framework, the Spectral Open-world Representation Learning (SORL) algorithm is applied, and minimizing its loss is shown to be equivalent to performing spectral decomposition on the graph.
Results: Experiments show that SORL matches or outperforms several strong baselines on common benchmark datasets, which is appealing for practical use while enjoying theoretical guarantees.

Open-world semi-supervised learning aims at inferring both known and novel classes in unlabeled data, by harnessing prior knowledge from a labeled set with known classes. Despite its importance, there is a lack of theoretical foundations for this problem. This paper bridges the gap by formalizing a graph-theoretic framework tailored for the open-world setting, where the clustering can be theoretically characterized by graph factorization. Our graph-theoretic framework illuminates practical algorithms and provides guarantees. In particular, based on our graph formulation, we apply the algorithm called Spectral Open-world Representation Learning (SORL), and show that minimizing our loss is equivalent to performing spectral decomposition on the graph. Such equivalence allows us to derive a provable error bound on the clustering performance for both known and novel classes, and analyze rigorously when labeled data helps. Empirically, SORL can match or outperform several strong baselines on common benchmark datasets, which is appealing for practical usage while enjoying theoretical guarantees.

Adversarial Counterfactual Environment Model Learning
Xiong-Hui Chen Yang Yu Zhengmao Zhu ZhiHua Yu Chen Zhenjun Chenghe Wang Yinan Wu Rong-Jun Qin Hongqiu Wu Ruijin Ding Huang Fangsheng



Research question: How to learn an accurate environment dynamics model that supports downstream tasks such as counterfactual prediction, offline reinforcement learning, and off-policy evaluation.
Motivation: Current dynamics models are mostly learned by step-wise fitting of historical transition data; in sequential decision-making settings this can fail to predict counterfactual action effects because of the selection bias of behavior policies during data collection.
Method: A new model-learning objective, adversarial weighted empirical risk minimization (AWRM). AWRM introduces an adversarial policy that exploits the model to generate a data distribution that weakens the model's prediction accuracy, and the model is then learned under this adversarial data distribution.
Results: Experiments show that GALILEO, a practical algorithm for AWRM, accurately predicts counterfactual actions and improves various downstream tasks, including offline policy evaluation and improvement as well as online decision-making.

An accurate environment dynamics model is crucial for various downstream tasks, such as counterfactual prediction, off-policy evaluation, and offline reinforcement learning. Currently, these models are typically learned through empirical risk minimization (ERM) by step-wise fitting of historical transition data. However, we first show that, particularly in the sequential decision-making setting, this approach may catastrophically fail to predict counterfactual action effects due to the selection bias of behavior policies during data collection. To tackle this problem, we introduce a novel model-learning objective called adversarial weighted empirical risk minimization (AWRM). AWRM incorporates an adversarial policy that exploits the model to generate a data distribution that weakens the model's prediction accuracy, and subsequently, the model is learned under this adversarial data distribution. We implement a practical algorithm, GALILEO, for AWRM and evaluate it on two synthetic tasks, three continuous-control tasks, and *a real-world application*. The experiments demonstrate that GALILEO can accurately predict counterfactual actions and improve various downstream tasks, including offline policy evaluation and improvement, as well as online decision-making.

Online Constrained Meta-Learning: Provable Guarantees for Generalization
Siyuan Xu Minghui Zhu



Research question: This paper proposes an online constrained meta-learning framework that continuously learns meta-knowledge from sequential learning tasks, where the tasks are subject to hard constraints.
Motivation: Most existing meta-learning approaches can only learn from unconstrained tasks.
Method: Accounting for the dynamic regret of online learning and the generalization ability of the task-specific models, upper bounds on the optimality gaps and constraint violations produced by the framework are derived; a practical algorithm is also provided.
Results: Experiments on meta-imitation learning and few-shot image classification demonstrate the framework's superior effectiveness.

Meta-learning has attracted attention due to its strong ability to learn experiences from known tasks, which can speed up and enhance the learning process for new tasks. However, most existing meta-learning approaches can only learn from tasks without any constraints. This paper proposes an online constrained meta-learning framework, which continuously learns meta-knowledge from sequential learning tasks that are subject to hard constraints. Beyond existing meta-learning analyses, we provide the upper bounds of optimality gaps and constraint violations produced by the proposed framework, which considers the dynamic regret of online learning, as well as the generalization ability of the task-specific models. Moreover, we provide a practical algorithm for the framework, and validate its superior effectiveness through experiments conducted on meta-imitation learning and few-shot image classification.

Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality
Liyuan Wang Jingyi Xie Xingxing Zhang Mingyi Huang Hang Su Jun Zhu



Research question: Current prompt-based strategies fall short under self-supervised pre-training, struggling to incorporate task-specific knowledge into instructed representations.
Motivation: Address the integration of task-specific knowledge under pre-training so as to improve continual learning performance.
Method: Propose Hierarchical Decomposition (HiDe-)Prompt, which explicitly optimizes the hierarchical components of the continual learning objective through the joint optimization of task-specific prompts and statistics of both uninstructed and instructed representations.
Results: Experiments demonstrate HiDe-Prompt's superior performance and its robustness to pre-training paradigms, with significant gains on continual learning tasks.

Prompt-based continual learning is an emerging direction in leveraging pre-trained knowledge for downstream continual learning, and has almost reached the performance pinnacle under supervised pre-training. However, our empirical research reveals that the current strategies fall short of their full potential under the more realistic self-supervised pre-training, which is essential for handling vast quantities of unlabeled data in practice. This is largely due to the difficulty of task-specific knowledge being incorporated into instructed representations via prompt parameters and predicted by uninstructed representations at test time. To overcome the exposed sub-optimality, we conduct a theoretical analysis of the continual learning objective in the context of pre-training, and decompose it into hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. Following these empirical and theoretical insights, we propose Hierarchical Decomposition (HiDe-)Prompt, an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics of both uninstructed and instructed representations, further with the coordination of a contrastive regularization strategy. Our extensive experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning (e.g., up to 15.01% and 9.61% lead on Split CIFAR-100 and Split ImageNet-R, respectively).

Learning Generalizable Agents via Saliency-guided Features Decorrelation
Sili Huang Yanchao Sun Jifeng Hu Siyuan Guo Hechang Chen Yi Chang Lichao Sun Bo Yang



Research question: In visual reinforcement learning, inherent correlations among features in the state space make it difficult for agents to understand the effect of feature changes on decisions, so they fail to generalize to environmental variations unobserved during training.
Motivation: To address this, we propose Saliency-Guided Features Decorrelation (SGFD), which eliminates the correlations between features through sample reweighting.
Method: SGFD consists of two core techniques: Random Fourier Functions (RFF) and saliency maps. RFF estimates the complex non-linear correlations in high-dimensional images, while the saliency map identifies the changed features. Guided by the saliency map, SGFD reweights samples to minimize the estimated correlations related to the changed features, achieving decorrelation in visual RL tasks.
Results: Experiments show that SGFD generalizes well across a wide range of test environments and significantly outperforms existing methods in handling both task-irrelevant and task-relevant variations.

In visual-based Reinforcement Learning (RL), agents often struggle to generalize well to environmental variations in the state space that were not observed during training. The variations can arise in both task-irrelevant features, such as background noise, and task-relevant features, such as robot configurations, that are related to the optimal decisions. To achieve generalization in both situations, agents are required to accurately understand the impact of changed features on the decisions, i.e., establishing the true associations between changed features and decisions in the policy model. However, due to the inherent correlations among features in the state space, the associations between features and decisions become entangled, making it difficult for the policy to distinguish them. To this end, we propose Saliency-Guided Features Decorrelation (SGFD) to eliminate these correlations through sample reweighting. Concretely, SGFD consists of two core techniques: Random Fourier Functions (RFF) and the saliency map. RFF is utilized to estimate the complex non-linear correlations in high-dimensional images, while the saliency map is designed to identify the changed features. Under the guidance of the saliency map, SGFD employs sample reweighting to minimize the estimated correlations related to changed features, thereby achieving decorrelation in visual RL tasks. Our experimental results demonstrate that SGFD can generalize well on a wide range of test environments and significantly outperforms state-of-the-art methods in handling both task-irrelevant variations and task-relevant variations.

A Holistic Approach to Unifying Automatic Concept Extraction and Concept Importance Estimation
Thomas FEL Victor Boutin Louis Béthune Remi Cadene Mazda Moayeri Léo Andéol Mathieu Chalvidal Thomas Serre



Research question: This paper proposes a unified theoretical framework that joins the two key steps of concept-based interpretability methods, concept extraction and concept importance estimation, to better understand and explain the decisions of deep neural networks.
Motivation: In recent years, concept-based approaches have emerged as some of the most promising explainability methods for interpreting the decisions of deep neural networks (ANNs). These methods seek to discover intelligible visual "concepts" buried within the complex patterns of ANN activations.
Method: The framework recasts concept extraction as a special case of dictionary learning and formalizes concept importance estimation as a more general form of attribution method.
Results: The framework offers several advantages: new evaluation metrics for comparing different concept extraction approaches; the ability to leverage modern attribution methods and evaluation metrics to extend and systematically evaluate state-of-the-art concept-based approaches and importance estimation techniques; and theoretical guarantees on the optimality of such methods. The authors also present Lens, a website providing a complete collection of these visualizations for all classes of the ImageNet dataset.

In recent years, concept-based approaches have emerged as some of the most promising explainability methods to help us interpret the decisions of Artificial Neural Networks (ANNs). These methods seek to discover intelligible visual ``concepts'' buried within the complex patterns of ANN activations in two key steps: (1) concept extraction followed by (2) importance estimation. While these two steps are shared across methods, they all differ in their specific implementations. Here, we introduce a unifying theoretical framework that recasts the first step -- the concept extraction problem -- as a special case of **dictionary learning**, and we formalize the second step -- concept importance estimation -- as a more general form of **attribution method**. This framework offers several advantages as it allows us: (i) to propose new evaluation metrics for comparing different concept extraction approaches; (ii) to leverage modern attribution methods and evaluation metrics to extend and systematically evaluate state-of-the-art concept-based approaches and importance estimation techniques; (iii) to derive theoretical guarantees regarding the optimality of such methods. We further leverage our framework to try to tackle a crucial question in explainability: how to *efficiently* identify clusters of data points that are classified based on a similar shared strategy. To illustrate these findings and to highlight the main strategies of a model, we introduce a visual representation called the strategic cluster graph. Finally, we present Lens, a dedicated website that offers a complete compilation of these visualizations for all classes of the ImageNet dataset.
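
Since concept extraction is recast as dictionary learning, one familiar instance is non-negative matrix factorization of layer activations. The importance proxy at the end of this sketch is an assumption of the illustration, not the paper's attribution-based estimator:

```python
import numpy as np
from sklearn.decomposition import NMF

# Non-negative activations of 400 images at some ReLU layer (toy data).
rng = np.random.default_rng(9)
acts = np.abs(rng.normal(size=(400, 256)))

# Dictionary learning: acts ~ coeffs @ concepts, where rows of `concepts`
# are the discovered directions and `coeffs` their per-image presence.
nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
coeffs = nmf.fit_transform(acts)        # (400, 10)
concepts = nmf.components_              # (10, 256)

# Crude importance proxy: correlation of concept presence with a
# stand-in class logit (the paper uses proper attribution methods).
logit = acts @ rng.normal(size=256)
importance = np.array([np.corrcoef(coeffs[:, c], logit)[0, 1] for c in range(10)])
print(np.argsort(importance)[::-1])     # concepts ranked by importance
```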

Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection
Xilie Xu Jingfeng Zhang Feng Liu Masashi Sugiyama Mohan Kankanhalli



Research question: Adversarial contrastive learning (ACL) needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets.
Motivation: To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method.
Method: RCS searches, without label information, for an informative subset that minimizes a representational divergence; the problem is transformed into a surrogate problem of submodular maximization, for which greedy search is an efficient solution with an optimality guarantee for the original problem.
Results: Experiments show that RCS speeds up ACL by a large margin without significantly hurting the robustness transferability. On the large-scale ImageNet-1K dataset, we are the first to conduct ACL efficiently to obtain an effective robust representation via RCS.

Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks. However, ACL needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets. To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method. RCS does not require label information and searches for an informative subset that minimizes a representational divergence, which is the distance of the representation between natural data and their virtual adversarial variants. The vanilla solution of RCS via traversing all possible subsets is computationally prohibitive. Therefore, we theoretically transform RCS into a surrogate problem of submodular maximization, of which the greedy search is an efficient solution with an optimality guarantee for the original problem. Empirically, our comprehensive results corroborate that RCS can speed up ACL by a large margin without significantly hurting the robustness transferability. Notably, to the best of our knowledge, we are the first to conduct ACL efficiently on the large-scale ImageNet-1K dataset to obtain an effective robust representation via RCS. Our source code is at https://github.com/GodXuxilie/Efficient_ACL_via_RCS.
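
The computational core named in the abstract, greedy submodular maximization, looks as follows for a generic monotone objective. Facility location is used here as a stand-in set function; RCS's actual objective is the representational divergence, which this sketch does not reproduce.

```python
import numpy as np

def greedy_submodular(sim, budget):
    """Greedy maximization of f(S) = sum_i max_{j in S} sim[i, j];
    for monotone submodular f, greedy is (1 - 1/e)-optimal."""
    best = np.zeros(sim.shape[0])       # current coverage of each point
    selected = []
    for _ in range(budget):
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected

rng = np.random.default_rng(10)
x = rng.normal(size=(300, 8))
sim = x @ x.T
sim -= sim.min()                        # facility location needs sim >= 0
print(greedy_submodular(sim, budget=20))
```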

Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends
Wang Xinrui wan wenhai Chuanxing Geng Shao-Yuan Li Songcan Chen



Research question: How to learn binary classifiers from positive and unlabeled data, without any negative labels.
Motivation: In many real-world applications, verifying negative examples is difficult, making learning binary classifiers from positive and unlabeled data (PUL) vital. Despite the impressive empirical performance of recent PUL methods, challenges like accumulated errors and increased estimation bias persist due to the absence of negative labels.
Method: This paper unveils an intriguing yet long-overlooked observation in PUL: resampling the positive data in each training iteration, so that positive and unlabeled examples are balanced, yields strong early-stage performance; moreover, the predictive trends of the positive and negative classes display distinctly different patterns. We innovatively take a holistic view, interpreting the scores of each example as a temporal point process (TPP), and reformulate the core problem of PUL as recognizing trends in these scores. A novel TPP-inspired trend-detection measure is then proposed and proven to be asymptotically unbiased in predicting changes.
Results: Extensive experiments verify the superiority of the method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to 11.3% in key metrics.

Learning binary classifiers from positive and unlabeled data (PUL) is vital in many real-world applications, especially when verifying negative examples is difficult. Despite the impressive empirical performance of recent PUL methods, challenges like accumulated errors and increased estimation bias persist due to the absence of negative labels. In this paper, we unveil an intriguing yet long-overlooked observation in PUL: *resampling the positive data in each training iteration to ensure a balanced distribution between positive and unlabeled examples results in strong early-stage performance. Furthermore, predictive trends for positive and negative classes display distinctly different patterns.* Specifically, the scores (output probability) of unlabeled negative examples consistently decrease, while those of unlabeled positive examples show largely chaotic trends. Instead of focusing on classification within individual time frames, we innovatively adopt a holistic approach, interpreting the scores of each example as a temporal point process (TPP). This reformulates the core problem of PUL as recognizing trends in these scores. We then propose a novel TPP-inspired measure for trend detection and prove its asymptotic unbiasedness in predicting changes. Notably, our method accomplishes PUL without requiring additional parameter tuning or prior assumptions, offering an alternative perspective for tackling this problem. Extensive experiments verify the superiority of our method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to $11.3\%$ in key metrics.

CODA: Generalizing to Open and Unseen Domains with Compaction and Disambiguation
Chaoqi Chen Luyao Tang Yue Huang Xiaoguang Han Yizhou Yu



Research question: The generalization of existing machine learning systems degrades notably when the test distribution drifts from the training distribution.
Motivation: Although domain generalization (DG) methods enable machine learning models to generalize to unseen domains, most DG methods assume that training and test data share an identical label space, ignoring the unseen categories that arise in many real-world applications.
Method: A two-stage framework called Compaction and Disambiguation (CODA) for learning compact representations and adapting to open classes in the wild. CODA introduces virtual unknown classes and optimizes a new training objective that inserts unknowns into the latent space by compacting the embedding space of the source known classes; a test-time training objective then disambiguates the decision boundaries between known and unknown classes, mitigating the adaptivity gap and catastrophic forgetting.
Results: Experiments show that CODA significantly outperforms the previous best method on standard DG datasets and harmonizes the classification accuracy between known and unknown classes.

The generalization capability of machine learning systems degenerates notably when the test distribution drifts from the training distribution. Recently, Domain Generalization (DG) has been gaining momentum in enabling machine learning models to generalize to unseen domains. However, most DG methods assume that training and test data share an identical label space, ignoring the potential unseen categories in many real-world applications. In this paper, we delve into a more general but difficult problem termed Open Test-Time DG (OTDG), where both domain shift and open class may occur on the unseen test data. We propose Compaction and Disambiguation (CODA), a novel two-stage framework for learning compact representations and adapting to open classes in the wild. To meaningfully regularize the model's decision boundary, CODA introduces virtual unknown classes and optimizes a new training objective to insert unknowns into the latent space by compacting the embedding space of source known classes. To adapt target samples to the source model, we then disambiguate the decision boundaries between known and unknown classes with a test-time training objective, mitigating the adaptivity gap and catastrophic forgetting challenges. Experiments reveal that CODA can significantly outperform the previous best method on standard DG datasets and harmonize the classification accuracy between known and unknown classes.

A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning
Zitai Wang Qianqian Xu Zhiyong Yang Yuan He Xiaochun Cao Qingming Huang



Research question: Real-world data is typically imbalanced: only a few classes have numerous samples while many classes have only a few, so a naive ERM learning process is biased towards the majority classes and struggles to generalize to the minority classes.
Motivation: To address this, we propose a novel technique named data-dependent contraction to capture how modified losses (such as re-weighting and logit adjustment) handle different classes.
Method: On top of this technique, we establish a fine-grained generalization bound for imbalanced learning and develop a principled learning algorithm based on the theoretical insights.
Results: Empirical results not only validate the theory but also demonstrate the effectiveness of the proposed method.

Real-world datasets are typically imbalanced in the sense that only a few classes have numerous samples, while many classes are associated with only a few samples. As a result, a naive ERM learning process will be biased towards the majority classes, making it difficult to generalize to the minority classes. To address this issue, one simple but effective approach is to modify the loss function to emphasize the learning on minority classes, such as re-weighting the losses or adjusting the logits via class-dependent terms. However, existing generalization analysis of such losses is still coarse-grained and fragmented, failing to explain some empirical results. To bridge this gap between theory and practice, we propose a novel technique named data-dependent contraction to capture how these modified losses handle different classes. On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps reveal the mystery of re-weighting and logit-adjustment in a unified manner. Furthermore, a principled learning algorithm is developed based on the theoretical insights. Finally, the empirical results on benchmark datasets not only validate the theoretical results but also demonstrate the effectiveness of the proposed method.
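
One member of the modified-loss family that the analysis covers is the logit-adjusted cross-entropy, which shifts each logit by a class-prior term so minority classes receive relatively more pressure. A minimal NumPy version (the temperature tau and the toy priors are illustrative):

```python
import numpy as np

def logit_adjusted_ce(logits, labels, class_priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior) per class."""
    adj = logits + tau * np.log(class_priors)[None, :]
    adj -= adj.max(axis=1, keepdims=True)             # numerical stability
    log_probs = adj - np.log(np.exp(adj).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(11)
logits = rng.normal(size=(64, 3))
labels = rng.integers(0, 3, size=64)
priors = np.array([0.80, 0.15, 0.05])                 # long-tailed frequencies
print(logit_adjusted_ce(logits, labels, priors))
```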

Don’t blame Dataset Shift! Shortcut Learning due to Gradients and Cross Entropy
Aahlad Manas Puli Lily H Zhang Yoav Wald Rajesh Ranganath



Research question: Why do models trained in the default way, by minimizing log-loss with gradient descent (default-ERM), exhibit shortcut learning even when the stable feature determines the label and the shortcut provides no additional information?
Motivation: Common explanations assume that a shortcut improves prediction only under the training distribution, yet in perception tasks default-ERM still prefers shortcut solutions even though the loss can be driven to zero using the stable feature alone.
Method: By studying a linear perception task, the paper shows that default-ERM's implicit inductive bias towards maximizing the margin, even without overparameterization, leads to models that depend more on the shortcut than on the stable feature. It therefore considers inductive biases towards uniform margins instead, and proposes loss functions, termed margin control (MARG-CTRL), that encourage uniform-margin solutions.
Results: MARG-CTRL techniques mitigate shortcut learning on a variety of vision and language tasks, showing that changing the inductive bias can remove the need for complicated shortcut-mitigating methods in perception tasks.

Common explanations for shortcut learning assume that the shortcut improves prediction only under the training distribution. Thus, models trained in the typical way by minimizing log-loss using gradient descent, which we call default-ERM, should utilize the shortcut. However, even when the stable feature determines the label in the training distribution and the shortcut does not provide any additional information, like in perception tasks, default-ERM exhibits shortcut learning. Why are such solutions preferred when the loss can be driven to zero when using the stable feature alone? By studying a linear perception task, we show that default-ERM’s preference for maximizing the margin, even without overparameterization, leads to models that depend more on the shortcut than the stable feature. This insight suggests that default-ERM’s implicit inductive bias towards max-margin may be unsuitable for perception tasks. Instead, we consider inductive biases toward uniform margins. We show that uniform margins guarantee sole dependence on the perfect stable feature in the linear perception task and suggest alternative loss functions, termed margin control (MARG-CTRL), that encourage uniform-margin solutions. MARG-CTRL techniques mitigate shortcut learning on a variety of vision and language tasks, showing that changing inductive biases can remove the need for complicated shortcut-mitigating methods in perception tasks.

Two-Stage Learning to Defer with Multiple Experts
Anqi Mao Christopher Mohri Mehryar Mohri Yutao Zhong



Research question: This work studies learning to defer with multiple experts in a two-stage scenario, a problem that is crucial in many practical applications.
Motivation: The key question is how to assign the most suitable expert to each input; we address it by designing a new family of surrogate loss functions.
Method: A predictor is first obtained by training with a common loss function such as cross-entropy; in the second stage, a deferral function is learned to assign the most suitable expert to each input. We design new surrogate losses for both the score-based and predictor-rejector settings and prove that they benefit from $H$-consistency bounds, which implies their Bayes-consistency.
Results: While the main focus is theoretical, we also report experiments on the CIFAR-10 and SVHN datasets with good results.

We study a two-stage scenario for learning to defer with multiple experts, which is crucial in practice for many applications. In this scenario, a predictor is derived in a first stage by training with a common loss function such as cross-entropy. In the second stage, a deferral function is learned to assign the most suitable expert to each input. We design a new family of surrogate loss functions for this scenario both in the score-based and the predictor-rejector settings and prove that they are supported by $H$-consistency bounds, which implies their Bayes-consistency. Moreover, we show that, for a constant cost function, our two-stage surrogate losses are realizable $H$-consistent. While the main focus of this work is a theoretical analysis, we also report the results of several experiments on CIFAR-10 and SVHN datasets.

To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning
Ildus Sadrtdinov Dmitrii Pozdeev Dmitry P. Vetrov Ekaterina Lobacheva



Research question: How can the performance and robustness of neural networks be improved when pre-training is expensive?
Motivation: Because pre-training is costly, practitioners often ensemble models fine-tuned from a single pre-trained checkpoint; such models end up in the same "pre-train basin" of the loss landscape and therefore have limited diversity.
Method: Based on an analysis of existing exploration methods, we propose StarSSE, a more effective modification of Snapshot Ensembles (SSE) for the transfer learning setup that better explores the pre-train basin.
Results: Experiments show that StarSSE yields stronger ensembles and more uniform model soups, improving performance and robustness.

Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin; however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of the Snapshot Ensembles (SSE) for the transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.

Red Teaming Deep Neural Networks with Feature Synthesis Tools
Stephen Casper Tong Bu Yuxiao Li Jiawei Li Kevin Zhang Kaivalya Hariharan Dylan Hadfield-Menell



Research question: This paper examines how useful interpretability tools are for model debugging, in particular for surfacing previously unknown, out-of-distribution (OOD) bugs.
Motivation: Despite the attention interpretability research receives, there are comparatively few cases where these tools have found previously unknown bugs in models; we argue this is partly because many interpretability methods analyze model behavior only through a particular dataset.
Method: A growing body of work instead interprets models with dataset-free feature synthesis methods. We implant human-interpretable trojans into models and evaluate interpretability tools by whether they help humans discover them.
Results: Even under ideal conditions, with direct access to data containing the trojan trigger, 16 state-of-the-art feature attribution/saliency tools often fail to identify the trojans. We also evaluate 7 feature synthesis methods and introduce 2 new variants of the best-performing one.

Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by using a particular dataset. This only allows for the study of the model in the context of features that the user can sample in advance. To address this, a growing body of research involves interpreting models using feature synthesis methods that do not depend on a dataset. In this paper, we benchmark the usefulness of interpretability tools for model debugging. Our key insight is that we can implant human-interpretable trojans into models and then evaluate these tools based on whether they can help humans discover them. This is analogous to finding OOD bugs, except the ground truth is known, allowing us to know when a user's interpretation is correct. We make four contributions. (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. Even under ideal conditions, given direct access to data with the trojan trigger, these methods still often fail to identify bugs. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation.

Understanding the detrimental class-level effects of data augmentation
Polina Kirichenko Mark Ibrahim Randall Balestriero Diane Bouchacourt Shanmukha Ramakrishna Vedantam Hamed Firooz Andrew Gordon Wilson



Research question: How data augmentation affects model performance in image classification, and why its effect depends so strongly on the class.
Motivation: Although data augmentation improves average accuracy, it can clearly hurt the accuracy of individual classes, and these effects remain poorly understood.
Method: Using higher-quality multi-label annotations on ImageNet, we systematically categorize the affected classes and find that most are inherently ambiguous, co-occurring, or involve fine-grained distinctions, with augmentation controlling the model's bias among closely related classes.
Results: By analyzing class confusions, we propose a simple class-conditional augmentation strategy that improves performance on the negatively affected classes.

Data augmentation (DA) encodes invariance and provides implicit regularization critical to a model's performance in image classification tasks. However, while DA improves average accuracy, recent studies have shown that its impact can be highly class dependent: achieving optimal average accuracy comes at the cost of significantly hurting individual class accuracy by as much as 20% on ImageNet. There has been little progress in resolving class-level accuracy drops due to a limited understanding of these effects. In this work, we present a framework for understanding how DA interacts with class-level learning dynamics. Using higher-quality multi-label annotations on ImageNet, we systematically categorize the affected classes and find that the majority are inherently ambiguous, co-occur, or involve fine-grained distinctions, while DA controls the model's bias towards one of the closely related classes. While many of the previously reported performance drops are explained by multi-label annotations, we identify other sources of accuracy degradations by analyzing class confusions. We show that simple class-conditional augmentation strategies informed by our framework improve performance on the negatively affected classes.
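
A minimal sketch of what a class-conditional augmentation policy can look like, assuming PyTorch tensors and user-supplied strong/weak augmentation callables; the protected classes and the policy itself are illustrative, not the authors' released strategy.

import torch

def class_conditional_augment(batch, labels, affected_classes, strong_aug, weak_aug):
    # affected_classes: 1-D tensor of class ids that strong augmentation hurts
    # (e.g., ambiguous, co-occurring, or fine-grained classes). Those samples
    # receive weak augmentation; all others receive the strong policy.
    out = torch.empty_like(batch)
    hurt = torch.isin(labels, affected_classes)
    if hurt.any():
        out[hurt] = weak_aug(batch[hurt])
    if (~hurt).any():
        out[~hurt] = strong_aug(batch[~hurt])
    return out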

Compositional Generalization from First Principles
Thaddäus Wiedemer Prasanna Mayilvahanan Matthias Bethge Wieland Brendel



Research question: How to achieve compositional generalization, i.e., exploit the compositional nature of the world to expedite learning and facilitate generalization, a hallmark of human perception.
Motivation: In machine learning, compositional generalization has proven elusive even for models with explicit compositional priors, so we approach the problem from the bottom up.
Method: Inspired by identifiable representation learning, we treat compositionality as a property of the data-generating process rather than of the data itself. This reformulation lets us derive mild conditions, on only the support of the training distribution and the model architecture, that are sufficient for compositional generalization.
Results: We show how the theoretical framework applies to real-world scenarios, validate the findings empirically, and thereby provide a principled platform for the theoretical study of compositional generalization.

Leveraging the compositional nature of our world to expedite learning and facilitate generalization is a hallmark of human perception. In machine learning, on the other hand, achieving compositional generalization has proven to be an elusive goal, even for models with explicit compositional priors. To get a better handle on compositional generalization, we here approach it from the bottom up: Inspired by identifiable representation learning, we investigate compositionality as a property of the data-generating process rather than the data itself. This reformulation enables us to derive mild conditions on only the support of the training distribution and the model architecture, which are sufficient for compositional generalization. We further demonstrate how our theoretical framework applies to real-world scenarios and validate our findings empirically. Our results set the stage for a principled theoretical study of compositional generalization.

Feature Learning for Interpretable, Performant Decision Trees
Jack Henry Good Torin Kovach Kyle Miller Artur Dubrawski



Research question: In practice, decision trees grow deep, overfit easily, and are sensitive to input transformations, requiring extensive expert feature engineering.
Motivation: We aim to produce small, interpretable, and performant decision trees via a system that alternates sparse feature learning with differentiable decision tree construction.
Method: The system alternates between learning sparse features and constructing differentiable decision trees over them to optimize the tree model.
Results: We benchmark against conventional tree-based models, demonstrate several notions of interpretation of the model and its predictions, and obtain good performance across multiple tasks.

Decision trees are regarded for high interpretability arising from their hierarchical partitioning structure built on simple decision rules. However, in practice, this is not realized because axis-aligned partitioning of realistic data results in deep trees, and because ensemble methods are used to mitigate overfitting. Even then, model complexity and performance remain sensitive to transformation of the input, and extensive expert crafting of features from the raw data is common. We propose the first system to alternate sparse feature learning with differentiable decision tree construction to produce small, interpretable trees with good performance. We benchmark against conventional tree-based models and demonstrate several notions of interpretation of a model and its predictions.

Prioritizing Samples in Reinforcement Learning with Reducible Loss
Shiva Kanth Sujit Somjit Nath Pedro Braga Samira Ebrahimi Kahou



Research question: How to make effective use of the samples in an experience replay buffer for reinforcement learning.
Motivation: Not all samples carry the same significance, and simply assigning equal importance to each of them is a naive strategy.
Method: We propose prioritized sampling based on a sample's learn-ability, defined as the steady decrease of its training loss over time, and develop an algorithm that prioritizes samples with high learn-ability while down-weighting hard-to-learn samples, which are typically dominated by noise or stochasticity.
Results: Across multiple domains, the method is more robust than random sampling and than prioritizing purely by training loss, i.e., the temporal-difference loss used in prioritized experience replay.

Most reinforcement learning algorithms take advantage of an experience replay buffer to repeatedly train on samples the agent has observed in the past. Not all samples carry the same amount of significance and simply assigning equal importance to each of the samples is a naïve strategy. In this paper, we propose a method to prioritize samples based on how much we can learn from a sample. We define the learn-ability of a sample as the steady decrease of the training loss associated with this sample over time. We develop an algorithm to prioritize samples with high learn-ability, while assigning lower priority to those that are hard-to-learn, typically caused by noise or stochasticity. We empirically show that across multiple domains our method is more robust than random sampling and also better than just prioritizing with respect to the training loss, i.e. the temporal difference loss, which is used in prioritized experience replay.
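
A rough sketch of learn-ability-based prioritization, assuming per-sample losses can be tracked across revisits; the window size and the exact priority definition below are illustrative choices, not the paper's algorithm.

import numpy as np

class LearnabilityPrioritizer:
    # Track each sample's training loss over time; priority is the recent
    # decrease of that loss (its learn-ability). Samples whose loss does not
    # shrink (noise or stochasticity) receive low priority.
    def __init__(self, n_samples, window=5, eps=1e-3):
        self.history = [[] for _ in range(n_samples)]
        self.window, self.eps = window, eps

    def update(self, idx, loss):
        self.history[idx].append(float(loss))

    def priority(self, idx):
        h = self.history[idx][-self.window:]
        if len(h) < 2:
            return 1.0  # unseen samples get a default priority
        return max(h[0] - h[-1], 0.0) + self.eps

    def sample(self, batch_size):
        p = np.array([self.priority(i) for i in range(len(self.history))])
        p /= p.sum()
        return np.random.choice(len(self.history), size=batch_size, p=p)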

Group Robust Classification Without Any Group Information
Christos Tsirigotis Joao Monteiro Pau Rodriguez David Vazquez Aaron Courville



Research question: Empirical risk minimization (ERM) is sensitive to spurious correlations in the training data, which poses a significant risk when deploying such systems in high-stakes applications.
Motivation: The existing literature focuses on maximizing group-balanced or worst-group accuracy, but estimating these quantities requires costly bias annotations; we contend that current bias-unsupervised approaches to group robustness still rely on group information to achieve optimal performance.
Method: We propose a revised methodology for training and validating debiased models in an entirely bias-unsupervised manner, employing pretrained self-supervised models to reliably extract bias information, which enables combining a logit adjustment training loss with our validation criterion.
Results: Empirical analysis on synthetic and real-world tasks shows that the approach overcomes the identified challenges and consistently improves robust accuracy, competitive with or exceeding state-of-the-art methods that rely on bias labels for validation.

Empirical risk minimization (ERM) is sensitive to spurious correlations present in training data, which poses a significant risk when deploying systems trained under this paradigm in high-stakes applications. While the existing literature focuses on maximizing group-balanced or worst-group accuracy, estimating these quantities is hindered by costly bias annotations. This study contends that current bias-unsupervised approaches to group robustness continue to rely on group information to achieve optimal performance. Firstly, these methods implicitly assume that all group combinations are represented during training. To illustrate this, we introduce a systematic generalization task on the MPI3D dataset and discover that current algorithms fail to improve the ERM baseline when combinations of observed attribute values are missing. Secondly, bias labels are still crucial for effective model selection, restricting the practicality of these methods in real-world scenarios. To address these limitations, we propose a revised methodology for training and validating debiased models in an entirely bias-unsupervised manner. We achieve this by employing pretrained self-supervised models to reliably extract bias information, which enables the integration of a logit adjustment training loss with our validation criterion. Our empirical analysis on synthetic and real-world tasks provides evidence that our approach overcomes the identified challenges and consistently enhances robust accuracy, attaining performance that is competitive with or outperforms that of state-of-the-art methods, which, conversely, rely on bias labels for validation.

Adversarial Learning for Feature Shift Detection and Correction
Míriam Barrabés Daniel Mas Montserrat Margarita Geleta Xavier Giró-i-Nieto Alexander G Ioannidis



Research question: How to detect and correct feature shifts in real-world applications.
Motivation: Feature shifts arise in many datasets, including multi-sensor, tabular, and structured data, for example when sensors malfunction or data processing pipelines are faulty.
Method: Using the principles of adversarial learning, we train several discriminators to distinguish between two distributions and use their information to detect the corrupted features and fix them, removing the distribution shift between datasets.
Results: Combined with simple iterative heuristics, mainstream supervised classifiers such as random forests or gradient boosted trees can localize and correct feature shifts, outperforming current statistical and neural network-based techniques.

Data shift is a phenomenon present in many real-world applications, and while there are multiple methods attempting to detect shifts, the task of localizing and correcting the features originating such shifts has not been studied in depth. Feature shifts can occur in many datasets, including in multi-sensor data, where some sensors are malfunctioning, or in tabular and structured data, including biomedical, financial, and survey data, where faulty standardization and data processing pipelines can lead to erroneous features. In this work, we explore using the principles of adversarial learning, where the information from several discriminators trained to distinguish between two distributions is used to both detect the corrupted features and fix them in order to remove the distribution shift between datasets. We show that mainstream supervised classifiers, such as random forest or gradient boosting trees, combined with simple iterative heuristics, can localize and correct feature shifts, outperforming current statistical and neural network-based techniques. The code is available at https://github.com/AI-sandbox/DataFix.
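
A simplified sketch of the detection step, assuming scikit-learn: train a discriminator to tell the two datasets apart and read off its most important features as the likely origin of the shift. The full method iterates this and also corrects the flagged features; the names below are ours.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def locate_shifted_features(source, target, n_flag=1):
    # If the discriminator can separate source from target, the features it
    # relies on most are the likely carriers of the distribution shift.
    X = np.vstack([source, target])
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])
    disc = RandomForestClassifier(n_estimators=200).fit(X, y)
    ranked = np.argsort(disc.feature_importances_)[::-1]
    return ranked[:n_flag], disc.score(X, y)  # flagged features, separability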

Automated Classification of Model Errors on ImageNet
Momchil Peychev Mark Niklas Mueller Marc Fischer Martin Vechev



Research question: Label noise and ambiguity in the ImageNet dataset make top-1 accuracy an insufficient measure of model performance.
Motivation: New label sets and evaluation protocols have been proposed in response, but manual error categorization is time-consuming and requires trained experts.
Method: We propose the first automated error classification framework and use it to study how modeling choices affect error distributions.
Results: Across model architectures, scales, and pre-training corpora, top-1 accuracy is a strong predictor of the portion of all error types; in particular, the portion of severe errors drops significantly as top-1 accuracy improves, so although it underreports a model's true performance, it remains a valuable performance metric.

While the ImageNet dataset has been driving computer vision research over the past decade, significant label noise and ambiguity have made top-1 accuracy an insufficient measure of further progress. To address this, new label-sets and evaluation protocols have been proposed for ImageNet, showing that state-of-the-art models already achieve over 95% accuracy and shifting the focus to investigating why the remaining errors persist. Recent work in this direction employed a panel of experts to manually categorize all remaining classification errors for two selected models. However, this process is time-consuming, prone to inconsistencies, and requires trained experts, making it unsuitable for regular model evaluation and thus limiting its utility. To overcome these limitations, we propose the first automated error classification framework, a valuable tool to study how modeling choices affect error distributions. We use our framework to comprehensively evaluate the error distribution of over 900 models. Perhaps surprisingly, we find that across model architectures, scales, and pre-training corpora, top-1 accuracy is a strong predictor for the *portion* of all error types. In particular, we observe that the portion of severe errors drops significantly with top-1 accuracy, indicating that, while it underreports a model's true performance, it remains a valuable performance metric. We release all our code at https://github.com/eth-sri/automated-error-analysis.

Towards robust and generalizable representations of extracellular data using contrastive learning
Ankit Vishnubhotla Charlotte Loh Akash Srivastava Liam Paninski Cole Lincoln Hurwitz



Research question: How to use contrastive learning to extract robust and meaningful representations of neural activity and apply them to key primary data tasks such as spike sorting and cell-type classification.
Motivation: Although contrastive learning has been widely applied to neuronal population data, its adaptation to these key primary data tasks remains little explored.
Method: We propose CEED (Contrastive Embeddings for Extracellular Data), a novel contrastive learning framework for high-density extracellular recordings; through careful design of the network architecture and data augmentations, it generically extracts representations that far outperform current specialized approaches.
Results: We validate the method across multiple high-density extracellular recordings; all code used to run CEED can be found at https://github.com/ankitvishnu23/CEED.

Contrastive learning is quickly becoming an essential tool in neuroscience for extracting robust and meaningful representations of neural activity. Despite numerous applications to neuronal population data, there has been little exploration of how these methods can be adapted to key primary data analysis tasks such as spike sorting or cell-type classification. In this work, we propose a novel contrastive learning framework, CEED (Contrastive Embeddings for Extracellular Data), for high-density extracellular recordings. We demonstrate that through careful design of the network architecture and data augmentations, it is possible to generically extract representations that far outperform current specialized approaches. We validate our method across multiple high-density extracellular recordings. All code used to run CEED can be found at https://github.com/ankitvishnu23/CEED.

Spuriosity Didn’t Kill the Classifier: Using Invariant Predictions to Harness Spurious Features
Cian Eastwood Shashank Singh Andrei Liviu Nicolicioiu Marin Vlastelica Julius von Kügelgen Bernhard Schölkopf



Research question: How to use unstable features correctly to improve test-domain performance without relying on test-domain labels.
Motivation: Although unstable features may change their relationship with the label across domains, they often carry complementary information that can boost performance if used correctly.
Method: We propose Stable Feature Boosting (SFB), an algorithm that learns a predictor separating stable from conditionally independent unstable features and uses the stable-feature predictions to adapt the unstable-feature predictions in the test domain.
Results: We prove that SFB learns an asymptotically optimal predictor without test-domain labels, and experiments on real and synthetic data demonstrate its effectiveness.

To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
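
A toy sketch of the adaptation step under the paper's conditional-independence assumption: pseudo-label target data with the stable predictor, refit an unstable-feature head on those pseudo-labels, and combine the predictions. It assumes every class appears among the pseudo-labels and uses scikit-learn for brevity; this is our illustration, not the authors' SFB implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def adapt_with_pseudo_labels(stable_probs, unstable_feats):
    # stable_probs: (n, k) test-domain predictions from the stable predictor.
    pseudo = stable_probs.argmax(axis=1)
    head = LogisticRegression(max_iter=1000).fit(unstable_feats, pseudo)
    unstable_probs = head.predict_proba(unstable_feats)
    # Under conditional independence given the label:
    # p(y | x_stable, x_unstable) is proportional to
    # p(y | x_stable) * p(y | x_unstable) / p(y).
    prior = np.bincount(pseudo, minlength=stable_probs.shape[1]) / len(pseudo)
    joint = stable_probs * unstable_probs / np.maximum(prior, 1e-12)
    return joint / joint.sum(axis=1, keepdims=True)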

SaVeNet: A Scalable Vector Network for Enhanced Molecular Representation Learning
Sarp Aykent Tian Xia



Research question: How to effectively capture complicated geometric features of molecules across spatial dimensions, given the difficulty of modeling efficient geometric representations and learning the inherent correlations in 3D structural modeling.
Motivation: Despite the breakthroughs of geometric deep learning across molecular representation learning tasks, capturing such features remains underexplored because of these significant difficulties.
Method: We introduce the Scalable Vector Network (SaVeNet), an efficient and effective framework designed to accommodate a range of geometric requirements without depending on costly embeddings; the framework also scales effectively under introduced directional noise.
Results: Theoretical analysis and comprehensive experiments show that the model surpasses existing methods in both efficiency and expressiveness; results on synthetic and real-world datasets demonstrate state-of-the-art performance across various molecular representation learning tasks.

Geometric representation learning of molecules is challenging yet essential for applications in multiple domains. Despite the impressive breakthroughs made by geometric deep learning in various molecular representation learning tasks, effectively capturing complicated geometric features across spatial dimensions is still underexplored due to the significant difficulties in modeling efficient geometric representations and learning the inherent correlation in 3D structural modeling. These include computational inefficiency, underutilization of vectorial embeddings, and limited generalizability to integrate various geometric properties. To address the raised concerns, we introduce an efficient and effective framework, Scalable Vector Network (SaVeNet), designed to accommodate a range of geometric requirements without depending on costly embeddings. In addition, the proposed framework scales effectively with introduced direction noise. Theoretically, we analyze the desired properties (i.e., invariance and equivariant) and framework efficiency of the SaVeNet. Empirically, we conduct a comprehensive series of experiments to evaluate the efficiency and expressiveness of the proposed model. Our efficiency-focused experiments underscore the model's empirical superiority over existing methods. Experimental results on synthetic and real-world datasets demonstrate the expressiveness of our model, which achieves state-of-the-art performance across various tasks within molecular representation learning.

End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes
Alexandre Max Maraval Matthieu Zimmer Antoine Grosnit Haitham Bou Ammar



Research question: How to improve the sample efficiency of Bayesian optimisation by jointly training the surrogate model and the acquisition function.
Motivation: Existing methods can meta-learn either a surrogate model or an acquisition function independently, but training both jointly remains an open challenge.
Method: We propose the first end-to-end differentiable meta-Bayesian-optimisation framework, which learns acquisition functions via transformer architectures and uses reinforcement learning to address the lack of labelled acquisition data.
Results: The method achieves state-of-the-art regret on standard hyperparameter optimisation tasks and outperforms alternatives on the real-world problems of mixed-integer programming tuning, antibody design, and logic synthesis for electronic design automation.

Meta-Bayesian optimisation (meta-BO) aims to improve the sample efficiency of Bayesian optimisation by leveraging data from related tasks. While previous methods successfully meta-learn either a surrogate model or an acquisition function independently, joint training of both components remains an open challenge. This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures. We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data. Early on, we notice that training transformer-based neural processes from scratch with RL is challenging due to insufficient supervision, especially when rewards are sparse. We formalise this claim with a combinatorial analysis showing that the widely used notion of regret as a reward signal exhibits a logarithmic sparsity pattern in trajectory lengths. To tackle this problem, we augment the RL objective with an auxiliary task that guides part of the architecture to learn a valid probabilistic model as an inductive bias. We demonstrate that our method achieves state-of-the-art regret results against various baselines in experiments on standard hyperparameter optimisation tasks and also outperforms others in the real-world problems of mixed-integer programming tuning, antibody design, and logic synthesis for electronic design automation.

Improvements on Uncertainty Quantification for Node Classification via Distance Based Regularization
Russell Alan Hart Linlin Yu Yifei Lou Feng Chen



Research question: This paper addresses uncertainty quantification for deep neural networks, focusing on interdependent node-level classification.
Motivation: The predictions of current deep models are often unreliable, while uncertainty quantification is crucial for applications such as out-of-distribution (OOD) detection and misclassification detection.
Method: Starting from graph posterior networks (GPNs) that optimize the uncertainty cross-entropy (UCE) loss, we analyze the theoretical limitations of the widely used UCE loss and propose a distance-based regularization that encourages clustered OOD nodes to remain clustered in the latent space.
Results: Extensive experiments on eight standard datasets demonstrate that the proposed regularization outperforms the state of the art in both OOD detection and misclassification detection.

Deep neural networks have achieved significant success in the last decades, but they are not well-calibrated and often produce unreliable predictions. A large number of literature relies on uncertainty quantification to evaluate the reliability of a learning model, which is particularly important for applications of out-of-distribution (OOD) detection and misclassification detection. We are interested in uncertainty quantification for interdependent node-level classification. We start our analysis based on graph posterior networks (GPNs) that optimize the uncertainty cross-entropy (UCE)-based loss function. We describe the theoretical limitations of the widely-used UCE loss. To alleviate the identified drawbacks, we propose a distance-based regularization that encourages clustered OOD nodes to remain clustered in the latent space. We conduct extensive comparison experiments on eight standard datasets and demonstrate that the proposed regularization outperforms the state-of-the-art in both OOD detection and misclassification detection.
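
A minimal sketch of a distance-based regularizer in the spirit described: it pulls (proxy-)OOD node embeddings toward their own centroid so that nodes clustered in the graph remain clustered in latent space. The mask and the squared-distance form are our assumptions, not the paper's exact formulation.

import torch

def ood_clustering_regularizer(latent, ood_mask):
    # latent: (n_nodes, d) latent embeddings; ood_mask: boolean (n_nodes,)
    ood = latent[ood_mask]
    center = ood.mean(dim=0, keepdim=True)
    return ((ood - center) ** 2).sum(dim=1).mean()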

UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
Xin Li Sima Behpour Thang Doan Wenbin He Liang Gou Liu Ren



Research question: How to select instances for labeling from an unlabeled dataset so as to optimize performance on undefined downstream tasks under a limited annotation budget.
Motivation: Current data pre-selection methods rely mainly on visual features extracted from foundation models such as CLIP and BLIP-2, overlooking the power of text features.
Method: We propose UP-DP, a simple yet effective unsupervised prompt learning approach that trains text prompts to extract joint features with improved representation, ensuring a diverse cluster structure that covers the entire dataset.
Results: Extensive comparisons on seven benchmark datasets in different settings show performance gains of up to 20%; moreover, prompts learned on one dataset generalize remarkably and can be applied directly to enhance BLIP-2 feature extraction on other datasets.

In this study, we investigate the task of data pre-selection, which aims to select instances for labeling from an unlabeled dataset through a single pass, thereby optimizing performance for undefined downstream tasks with a limited annotation budget. Previous approaches to data pre-selection relied solely on visual features extracted from foundation models, such as CLIP and BLIP-2, but largely ignored the power of text features. In this work, we argue that, with proper design, the joint feature space of both vision and text can yield a better representation for data pre-selection. To this end, we introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models, like BLIP-2, for data pre-selection. Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation, ensuring a diverse cluster structure that covers the entire dataset. We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving up to a performance gain of 20%. Interestingly, the prompts learned from one dataset demonstrate significant generalizability and can be applied directly to enhance the feature extraction of BLIP-2 from other datasets. To the best of our knowledge, UP-DP is the first work to incorporate unsupervised prompt learning in a vision-language model for data pre-selection.

Beyond Invariance: Test-Time Label-Shift Adaptation for Addressing "Spurious" Correlations
Qingyao Sun Kevin Patrick Murphy Sayna Ebrahimi Alexander D'Amour



Research question: Changes in the data distribution at test time can have deleterious effects on the performance of predictive models.
Motivation: We consider settings where additional meta-data labels can account for such distribution changes, assuming the dependence between the class label and the "nuisance" factors may vary across domains.
Method: We propose a test-time label-shift correction that adapts to changes in the joint distribution by applying EM to unlabeled samples from the target-domain distribution.
Results: Evaluated on several standard image and text datasets as well as the CheXpert chest X-ray dataset, the method outperforms both approaches that target invariance to distribution changes and baseline empirical risk minimization.

Changes in the data distribution at test time can have deleterious effects on the performance of predictive models $p(y|x)$. We consider situations where there are additional meta-data labels (such as group labels), denoted by $z$, that can account for such changes in the distribution. In particular, we assume that the prior distribution $p(y,z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, either due to a change in the correlation between these terms, or a change in one of their marginals. However, we assume that the generative model for features $p(x|y,z)$ is invariant across domains. We note that this corresponds to an expanded version of the widely used "label shift" assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we propose a test-time label shift correction that adapts to changes in the joint distribution $p(y, z)$ using EM applied to unlabeled samples from the target domain distribution, $p_t(x)$. Importantly, we are able to avoid fitting a generative model $p(x|y,z)$, and merely need to reweight the outputs of a discriminative model $p_s(y,z|x)$ trained on the source distribution. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on several standard image and text datasets, as well as the CheXpert chest X-ray dataset, and show that it improves performance over methods that target invariance to changes in the distribution, as well as baseline empirical risk minimization methods. Code for reproducing experiments is available at https://github.com/nalzok/test-time-label-shift.
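
The EM reweighting at the core of this kind of label-shift correction fits in a few lines. Below is a generic sketch over the joint (y, z) classes, assuming the source model's posteriors and the source prior are available; variable names are ours.

import numpy as np

def em_label_shift(probs_src, prior_src, n_iters=100):
    # probs_src: (n, k) source-model posteriors p_s(y,z | x) on unlabeled
    # target data, with the k columns enumerating the joint (y, z) pairs.
    # E-step: reweight posteriors by the current prior ratio and renormalize.
    # M-step: re-estimate the target prior p_t(y, z) from those posteriors.
    prior_t = prior_src.copy()
    for _ in range(n_iters):
        post = probs_src * (prior_t / prior_src)
        post /= post.sum(axis=1, keepdims=True)
        prior_t = post.mean(axis=0)
    return prior_t, post  # adapted prior and per-sample adapted posteriors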

Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Cristina Menghini Andrew Delworth Stephen Bach



Research question: How to use pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning when labeled data are scarce.
Motivation: Fine-tuning vision-language models such as CLIP to downstream tasks is often necessary, but labeled data are limited; the zero-shot capabilities of these models enable pseudolabeling approaches that require no task-specific training on labeled data.
Method: Using zero-shot pseudolabels as a source of supervision, we show that semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function, and we develop versatile training strategies applicable across these paradigms, varying prompt modalities such as textual or visual prompts.
Results: Prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, 28.4 in transductive zero-shot learning, and 15.2 in unsupervised learning, and, unlike conventional semi-supervised pseudolabeling, lead to a more equitable distribution of per-class accuracy.

Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then generates labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data. By using zero-shot pseudolabels as a source of supervision, we observe that learning paradigms such as semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function. This unified view enables the development of versatile training strategies that are applicable across learning paradigms. We investigate them on image classification tasks where CLIP exhibits limitations, by varying prompt modalities, e.g., textual or visual prompts, and learning paradigms. We find that (1) unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, by 28.4 points in transductive zero-shot learning, and by 15.2 points in unsupervised learning, and (2) unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy. The code to reproduce the experiments is at https://github.com/BatsResearch/menghini-neurips23-code.

Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
Jameel Hassan Abdul Samadh Hanan Gani Noor Hazim Hussein Muhammad Uzair Khattak Muzammal Naseer Fahad Khan Salman Khan



Research question: How to adapt vision-language models to unseen domains by tuning prompts, explicitly addressing distribution shift.
Motivation: Existing test-time prompt tuning methods overlook distribution shift, the key cause of performance degradation on unseen domains.
Method: Using a single test sample, we adapt multi-modal prompts at test time by minimizing the feature distribution shift between the out-of-distribution test statistics and those of the source data, bridging the gap in the test domain.
Results: On the domain generalization benchmark, the method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe; in cross-dataset generalization over 10 datasets, it improves consistently over the existing state of the art.

The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption using prompt learning for numerous downstream tasks. Previous works have shown test-time prompt tuning using entropy minimization to adapt text prompts for unseen domains. While effective, this overlooks the key cause for performance degradation to unseen domains -- distribution shift. In this work, we explicitly handle this problem by aligning the out-of-distribution (OOD) test sample statistics to those of the source data using prompt tuning. We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain. Evaluating against the domain generalization benchmark, our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe. In cross-dataset generalization with unseen categories across 10 datasets, our method improves consistently across all datasets compared to the existing state-of-the-art. Our source code and models are available at https://jameelhassan.github.io/promptalign
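
A hedged sketch of a token-statistics alignment loss of the kind described, assuming source means/variances were precomputed offline; the exact statistics, layers, and distance used by the method may differ from this illustration.

import torch

def distribution_alignment_loss(test_token_feats, src_mean, src_var):
    # Match the mean/variance of the single test sample's token features to
    # the source statistics; minimized w.r.t. the prompt parameters only.
    mu = test_token_feats.mean(dim=0)
    var = test_token_feats.var(dim=0, unbiased=False)
    return (mu - src_mean).abs().mean() + (var - src_var).abs().mean()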

On Separate Normalization in Self-supervised Transformers
Xiaohui Chen Yinkai Wang Yuanqi Du Soha Hassoun Liping Liu



Research question: How should self-supervised transformers normalize the [CLS] symbol and the ordinary tokens, which play distinct roles?
Motivation: Previous transformer-based models such as masked autoencoders (MAE) use a single normalization layer for both the [CLS] symbol and the tokens, which may not be optimally aligned with their individual roles.
Method: We propose a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance.
Results: With separate normalization, the [CLS] embeddings better encode global contextual information and are distributed more uniformly; the change yields an average 2.7% improvement across the image, natural language, and graph domains.

Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. We propose in this paper a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that by utilizing a separate normalization layer, the [CLS] embeddings can better encode the global contextual information and are distributed more uniformly in its anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement over the image, natural language, and graph domains.
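
The proposed change is small enough to sketch directly. Below is a minimal PyTorch module with separate LayerNorms for the [CLS] slot and the remaining tokens, assuming the [CLS] token sits at position 0; this is an illustration of the idea, not the authors' code.

import torch
import torch.nn as nn

class SeparateNorm(nn.Module):
    # Normalize the [CLS] token and the ordinary tokens with their own
    # LayerNorms instead of sharing one set of normalization statistics.
    def __init__(self, dim):
        super().__init__()
        self.cls_norm = nn.LayerNorm(dim)
        self.tok_norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, 1 + n_tokens, dim), [CLS] first
        cls_tok, tokens = x[:, :1], x[:, 1:]
        return torch.cat([self.cls_norm(cls_tok), self.tok_norm(tokens)], dim=1)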

Energy-based learning algorithms for analog computing: a comparative study
Benjamin Scellier Maxence Ernoult Jack Kendall Suhas Kumar



Research question: This study compares seven energy-based learning algorithms, spanning contrastive learning, equilibrium propagation, and coupled learning, to assess their scalability and determine which to choose in practice.
Motivation: Although these algorithms are compatible with analog (post-digital) hardware, they had never been compared directly on the same models and datasets, making it difficult to assess their scalability and select among them.
Method: Using the seven algorithms, we train deep convolutional Hopfield networks (DCHNs) on five vision tasks (MNIST, F-MNIST, SVHN, CIFAR-10, and CIFAR-100); while all algorithms perform comparably on MNIST, performance differences grow with task difficulty.
Results: Our key findings are that negative perturbations outperform positive ones and that the centered variant of EP (equilibrium propagation), which uses two perturbations of opposite sign, performs best; we support these findings with theoretical arguments. We also establish new state-of-the-art DCHN results on all five datasets in both performance and speed: thanks to a novel energy-minimisation algorithm based on asynchronous updates and reduced (16-bit) precision, our DCHN simulations are 13.5 times faster than those of Laborieux et al. (2021).

Energy-based learning algorithms have recently gained a surge of interest due to their compatibility with analog (post-digital) hardware. Existing algorithms include contrastive learning (CL), equilibrium propagation (EP) and coupled learning (CpL), all consisting in contrasting two states, and differing in the type of perturbation used to obtain the second state from the first one. However, these algorithms have never been explicitly compared on equal footing with same models and datasets, making it difficult to assess their scalability and decide which one to select in practice. In this work, we carry out a comparison of seven learning algorithms, namely CL and different variants of EP and CpL depending on the signs of the perturbations. Specifically, using these learning algorithms, we train deep convolutional Hopfield networks (DCHNs) on five vision tasks (MNIST, F-MNIST, SVHN, CIFAR-10 and CIFAR-100). We find that, while all algorithms yield comparable performance on MNIST, important differences in performance arise as the difficulty of the task increases. Our key findings reveal that negative perturbations are better than positive ones, and highlight the centered variant of EP (which uses two perturbations of opposite sign) as the best-performing algorithm. We also endorse these findings with theoretical arguments. Additionally, we establish new SOTA results with DCHNs on all five datasets, both in performance and speed. In particular, our DCHN simulations are 13.5 times faster with respect to Laborieux et al. (2021), which we achieve thanks to the use of a novel energy minimisation algorithm based on asynchronous updates, combined with reduced precision (16 bits).

Conformal Prediction Sets for Ordinal Classification
Prasenjit Dey Srujana Merugu Sivaramakrishnan R Kaveri



Research question: How to adapt existing conformal prediction methods to ordinal classification so that they produce contiguous prediction sets with guaranteed coverage and minimal cardinality.
Motivation: In practice one often wants a small set of classes with a guaranteed high chance of containing the true class; existing conformal methods address classification with non-ordered labels, but the resulting prediction sets are often non-contiguous and unsuitable for ordinal classification.
Method: We propose a framework that adapts existing conformal prediction methods to generate contiguous sets, employing a novel non-parametric approach for modeling unimodal distributions.
Results: Experiments on synthetic and real-world datasets show the method outperforms state-of-the-art baselines by 4% on Accuracy@K and 8% on prediction-set size.

Ordinal classification (OC), i.e., labeling instances along classes with a natural ordering, is common in multiple applications such as size or budget based recommendations and disease severity labeling. Often in practical scenarios, it is desirable to obtain a small set of likely classes with a guaranteed high chance of including the true class. Recent works on conformal prediction (CP) address this problem for the classification setting with non-ordered labels but the resulting prediction sets (PS) are often non-contiguous and unsuitable for ordinal classification. In this work, we propose a framework to adapt existing CP methods to generate contiguous sets with guaranteed coverage and minimal cardinality. Our framework employs a novel non-parametric approach for modeling unimodal distributions. Empirical results on both synthetic and real-world datasets demonstrate our method outperforms SOTA baselines by 4% on Accuracy@K and 8% on PS size.
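
To make the contiguity requirement concrete, here is a naive sketch that returns the shortest contiguous run of ordinal classes whose probability mass reaches a conformal threshold. The paper's framework additionally calibrates the threshold and models unimodality non-parametrically, which this sketch omits.

import numpy as np

def smallest_contiguous_set(probs, threshold):
    # probs: calibrated class probabilities over ordinally arranged classes.
    k = len(probs)
    best = (0, k)  # fall back to the full set if nothing reaches threshold
    for lo in range(k):
        mass = 0.0
        for hi in range(lo, k):
            mass += probs[hi]
            if mass >= threshold:
                if hi + 1 - lo < best[1] - best[0]:
                    best = (lo, hi + 1)
                break
    return list(range(best[0], best[1]))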

Graph of Circuits with GNN for Exploring the Optimal Design Space
Aditya Hemant Shahane Saripilli Venkata Swapna Manjiri Ankesh Jain Sandeep Kumar



Research question: Design automation for analog circuits faces major challenges: a large design space, complex interdependencies between circuit specifications, and resource-intensive simulations.
Motivation: To address these challenges, this paper presents an innovative framework, the Graph of Circuits Explorer (GCX).
Method: Leveraging graph structure learning and graph neural networks, GCX builds a surrogate model for efficient exploration of the optimal design space within a semi-supervised learning framework, reducing the need for large labelled datasets. The approach has three key stages: first, learning a geometric representation of circuits enriched with technology information to form a comprehensive feature vector; second, combining feature-based graph learning with few-shot and zero-shot learning to improve generalization to unseen circuits; and finally, introducing two algorithms, EASCO and ASTROG, which integrate with GCX to optimize the available samples and yield optimal circuit configurations meeting the designer's criteria.
Results: Effectiveness is demonstrated through simulated performance evaluation of various circuits using derived parameters in 180nm CMOS technology, and the approach generalizes to higher-order topologies and other process nodes such as the 65nm and 45nm CMOS nodes.

The design automation of analog circuits poses significant challenges in terms of the large design space, complex interdependencies between circuit specifications, and resource-intensive simulations. To address these challenges, this paper presents an innovative framework called the Graph of Circuits Explorer (GCX). Leveraging graph structure learning along with graph neural networks, GCX enables the creation of a surrogate model that facilitates efficient exploration of the optimal design space within a semi-supervised learning framework which reduces the need for large labelled datasets. The proposed approach comprises three key stages. First, we learn the geometric representation of circuits and enrich it with technology information to create a comprehensive feature vector. Subsequently, integrating feature-based graph learning with few-shot and zero-shot learning enhances the generalizability in predictions for unseen circuits. Finally, we introduce two algorithms namely, EASCO and ASTROG which upon integration with GCX optimize the available samples to yield the optimal circuit configuration meeting the designer's criteria. The effectiveness of the proposed approach is demonstrated through simulated performance evaluation of various circuits, using derived parameters in 180nm CMOS technology. Furthermore, the generalizability of the approach is extended to higher-order topologies and different technology nodes such as 65nm and 45nm CMOS process nodes.

Resilient Constrained Learning
Ignacio Hounie Alejandro Ribeiro Luiz F. O. Chamon



Research question: Deployed machine learning solutions must satisfy multiple requirements beyond accuracy, such as fairness, robustness, and safety.
Motivation: Specifying such requirements is hindered by compromises and limited prior knowledge about the data, and their impact on performance can often only be evaluated by actually solving the learning problem.
Method: This paper presents a constrained learning approach that adapts the requirements while solving the learning task, relaxing the constraints by balancing the performance gains obtained from the relaxation against a user-defined cost of that relaxation.
Results: The resilient learning method shows its advantages in image classification tasks involving multiple potential invariances and in federated learning under distribution shift.

When deploying machine learning solutions, they must satisfy multiple requirements beyond accuracy, such as fairness, robustness, or safety. These requirements are imposed during training either implicitly, using penalties, or explicitly, using constrained optimization methods based on Lagrangian duality. Either way, specifying requirements is hindered by the presence of compromises and limited prior knowledge about the data. Furthermore, their impact on performance can often only be evaluated by actually solving the learning problem. This paper presents a constrained learning approach that adapts the requirements while simultaneously solving the learning task. To do so, it relaxes the learning constraints in a way that contemplates how much they affect the task at hand by balancing the performance gains obtained from the relaxation against a user-defined cost of that relaxation. We call this approach resilient constrained learning after the term used to describe ecological systems that adapt to disruptions by modifying their operation. We show conditions under which this balance can be achieved and introduce a practical algorithm to compute it, for which we derive approximation and generalization guarantees. We showcase the advantages of this resilient learning method in image classification tasks involving multiple potential invariances and in federated learning under distribution shift.

ExPT: Synthetic Pretraining for Few-Shot Experimental Design
Tung Nguyen Sudhanshu Agrawal Aditya Grover



Research question: This paper addresses the sample inefficiency of experimental design, focusing on the few-shot setting where only a few labeled data points are available.
Motivation: Existing approaches rely either on active data collection or on large labeled datasets of past experiments, both unrealistic in many real-world scenarios.
Method: We cast the problem as conditional generation, where a model conditions on a few labeled examples and the desired output to generate an optimal input design. To this end, we introduce Experiment Pretrained Transformers (ExPT), a foundation model for few-shot experimental design that combines synthetic pretraining with in-context learning.
Results: Evaluated on few-shot experimental design in challenging domains, ExPT demonstrates superior generality and performance over existing methods.

Experimental design is a fundamental problem in many science and engineering fields. In this problem, sample efficiency is crucial due to the time, money, and safety costs of real-world design evaluations. Existing approaches either rely on active data collection or access to large, labeled datasets of past experiments, making them impractical in many real-world scenarios. In this work, we address the more challenging yet realistic setting of few-shot experimental design, where only a few labeled data points of input designs and their corresponding values are available. We approach this problem as a conditional generation task, where a model conditions on a few labeled examples and the desired output to generate an optimal input design. To this end, we introduce Experiment Pretrained Transformers (ExPT), a foundation model for few-shot experimental design that employs a novel combination of synthetic pretraining with in-context learning. In ExPT, we only assume knowledge of a finite collection of unlabelled data points from the input domain and pretrain a transformer neural network to optimize diverse synthetic functions defined over this domain. Unsupervised pretraining allows ExPT to adapt to any design task at test time in an in-context fashion by conditioning on a few labeled data points from the target task and generating the candidate optima. We evaluate ExPT on few-shot experimental design in challenging domains and demonstrate its superior generality and performance compared to existing methods. The source code is available at https://github.com/tung-nd/ExPT.git.

Ess-InfoGAIL: Semi-supervised Imitation Learning from Imbalanced Demonstrations
Huiqiao Fu Kaiqiang Tang Yuanyang Lu Yiming Qi Guizhou Deng Flood Sung Chunlin Chen



Research question: This work tackles practical challenges in imitation learning, such as multi-modality, data imbalance, and expensive labeling processes.
Motivation: Existing imitation learning methods struggle with real-world demonstrations that exhibit these properties.
Method: We propose a novel semi-supervised imitation learning architecture that learns disentangled behavior representations from imbalanced demonstrations using limited labeled data. It comprises three key components: adapting semi-supervised generative adversarial networks to the imitation learning context; a learnable latent distribution that aligns the generated and expert data distributions; and regularized information maximization combined with an approximate label prior to further improve semi-supervised performance.
Results: Experiments show the method learns multi-modal behaviors from imbalanced demonstrations more effectively than baseline methods.

Imitation learning aims to reproduce expert behaviors without relying on an explicit reward signal. However, real-world demonstrations often present challenges, such as multi-modality, data imbalance, and expensive labeling processes. In this work, we propose a novel semi-supervised imitation learning architecture that learns disentangled behavior representations from imbalanced demonstrations using limited labeled data. Specifically, our method consists of three key components. First, we adapt the concept of semi-supervised generative adversarial networks to the imitation learning context. Second, we employ a learnable latent distribution to align the generated and expert data distributions. Finally, we utilize a regularized information maximization approach in conjunction with an approximate label prior to further improve the semi-supervised learning performance. Experimental results demonstrate the effectiveness of our method in learning multi-modal behaviors from imbalanced demonstrations compared to baseline methods.

Ensemble-based Deep Reinforcement Learning for Vehicle Routing Problems under Distribution Shift
Yuan Jiang Zhiguang Cao Yaoxin Wu Wen Song Jie Zhang



Research question: Existing neural methods for vehicle routing problems (VRPs) generalize poorly under distribution shift.
Motivation: To address this, we propose an ensemble-based deep reinforcement learning method that learns a group of diverse sub-policies to cope with various instance distributions.
Method: We induce diversity across sub-policies through Bootstrap with random initialization and further increase their differences with regularization terms during training.
Results: Experiments show the method outperforms state-of-the-art neural baselines on randomly generated instances of various distributions and generalizes well to benchmark instances from TSPLib and CVRPLib.

While performing favourably on the independent and identically distributed (i.i.d.) instances, most of the existing neural methods for vehicle routing problems (VRPs) struggle to generalize in the presence of a distribution shift. To tackle this issue, we propose an ensemble-based deep reinforcement learning method for VRPs, which learns a group of diverse sub-policies to cope with various instance distributions. In particular, to prevent convergence of the parameters to the same one, we enforce diversity across sub-policies by leveraging Bootstrap with random initialization. Moreover, we also explicitly pursue inequality between sub-policies by exploiting regularization terms during training to further enhance diversity. Experimental results show that our method is able to outperform the state-of-the-art neural baselines on randomly generated instances of various distributions, and also generalizes favourably on the benchmark instances from TSPLib and CVRPLib, confirming the effectiveness of the overall method and its respective designs.

Generalized test utilities for long-tail performance in extreme multi-label classification
Erik Schultheis Marek Wydmuch Wojciech Kotlowski Rohit Babbar Krzysztof Dembczynski



Research question: In extreme multi-label classification, most labels have only a few positive instances; how can predictions on these "tail" labels be measured and optimized properly?
Motivation: Existing evaluation metrics fail to capture tail-label performance, so a new family of metrics is needed.
Method: We analyze generalized metrics budgeted "at k" within the expected test utility (ETU) framework, derive optimal prediction rules, and construct computationally efficient approximations with provable regret guarantees.
Results: The resulting algorithm, based on block coordinate descent, scales effortlessly to extreme multi-label problems and delivers promising long-tail performance.

Extreme multi-label classification (XMLC) is a task of selecting a small subset of relevant labels from a very large set of possible labels. As such, it is characterized by long-tail labels, i.e., most labels have very few positive instances. With standard performance measures such as precision@k, a classifier can ignore tail labels and still report good performance. However, it is often argued that correct predictions in the tail are more "interesting" or "rewarding," but the community has not yet settled on a metric capturing this intuitive concept. The existing propensity-scored metrics fall short on this goal by confounding the problems of long-tail and missing labels. In this paper, we analyze generalized metrics budgeted "at k" as an alternative solution. To tackle the challenging problem of optimizing these metrics, we formulate it in the expected test utility (ETU) framework, which aims at optimizing the expected performance on a given test set. We derive optimal prediction rules and construct computationally efficient approximations of them with provable regret guarantees that are robust against model misspecification. Our algorithm, based on block coordinate descent, scales effortlessly to XMLC problems and obtains promising results in terms of long-tail performance.

Accessing Higher Dimensions for Unsupervised Word Translation
Sida Wang



Research question: Existing unsupervised word translation methods all rely on low-dimensional word vectors/pretraining, but whether this reliance is necessary has remained unsettled.
Motivation: This paper challenges that assumption by developing a method that can exploit high-dimensional signal directly.
Method: Freed from the limits of low dimensions, the method makes full use of high-dimensional signals together with better denoising.
Results: Experiments show strong performance on English-to-Finnish, Hungarian, and Chinese translation with modest resources: less than 80MB of memory and minutes of CPU time suffice for over 50% accuracy. Even under domain mismatch, the method works fully unsupervised on tasks such as English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia. These results challenge prevailing assumptions about the necessity and superiority of low-dimensional vectors, showing that high-dimensional signal can be exploited rather than thrown away.

The striking ability of unsupervised word translation has been demonstrated recently with the help of low-dimensional word vectors / pretraining, which is used by all successful methods and assumed to be necessary. We test and challenge this assumption by developing a method that can also make use of high dimensional signal. Freed from the limits of low dimensions, we show that relying on low-dimensional vectors and their incidental properties misses out on better denoising methods and signals in high dimensions, thus stunting the potential of the data. Our results show that unsupervised translation can be achieved more easily and robustly than previously thought -- less than 80MB and minutes of CPU time are required to achieve over 50% accuracy for English to Finnish, Hungarian, and Chinese translations when trained in the same domain; even under domain mismatch, the method still works fully unsupervised on English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. These results challenge prevailing assumptions on the necessity and superiority of low-dimensional vectors and show that the higher dimension signal can be used rather than thrown away.

Why Does Sharpness-Aware Minimization Generalize Better Than SGD?
Zixiang Chen Junkai Zhang Yiwen Kou Xiangning Chen Cho-Jui Hsieh Quanquan Gu



Research question: This paper addresses overfitting in deep learning, particularly for nonlinear neural networks and classification tasks.
Motivation: Overfitting, in which a model memorizes the training data and fails to generalize to test data, is a major challenge when training large neural networks. Sharpness-Aware Minimization (SAM) is a promising training method that improves generalization even in the presence of label noise, but a deep understanding of how SAM works for nonlinear networks and classification tasks is still missing.
Method: We fill this gap by showing why SAM generalizes better than stochastic gradient descent (SGD) for a certain data model and two-layer convolutional ReLU networks. The loss landscape of the studied problem is nonsmooth, so current explanations of SAM's success based on Hessian information are insufficient; our result explains SAM's benefits, in particular its ability to prevent noise learning in the early stages, which facilitates more effective feature learning.
Results: Experiments on both synthetic and real data corroborate the theory.

The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural networks. To tackle this challenge, Sharpness-Aware Minimization (SAM) has emerged as a promising training method, which can improve the generalization of neural networks even in the presence of label noise. However, a deep understanding of how SAM works, especially in the setting of nonlinear neural networks and classification tasks, remains largely missing. This paper fills this gap by demonstrating why SAM generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks. The loss landscape of our studied problem is nonsmooth, thus current explanations for the success of SAM based on the Hessian information are insufficient. Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, thereby facilitating more effective learning of features. Experiments on both synthetic and real data corroborate our theory.
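
For reference, a compact sketch of one SAM update (the method analyzed here) in PyTorch: ascend along the gradient to the worst-case nearby weights, compute the gradient there, restore the weights, and step. The radius rho and the epsilon in the norm are standard but illustrative.

import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    loss_fn(model(x), y).backward()  # gradient at the current weights
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        eps = {}
        for p in model.parameters():
            if p.grad is None:
                continue
            eps[p] = rho * p.grad / (grad_norm + 1e-12)
            p.add_(eps[p])  # ascend to the worst-case nearby weights
    model.zero_grad()
    loss_fn(model(x), y).backward()  # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)  # restore the original weights
    optimizer.step()  # descend using the perturbed-point gradient
    optimizer.zero_grad()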

Graph-Structured Gaussian Processes for Transferable Graph Learning
Jun Wu Lisa Ainsworth Andrew Leakey Haixun Wang Jingrui He



Research question: How to transfer knowledge from a source graph to a related target graph despite the distribution shift between them.
Motivation: Transferable graph learning is challenged by the distribution shift between source and target graphs induced by individual node attributes and complex graph structures.
Method: We propose GraphGP, a generic graph-structured Gaussian process framework that adaptively transfers knowledge across graphs under either homophily or heterophily assumptions.
Results: Extensive experiments on several transferable graph learning benchmarks demonstrate that GraphGP outperforms state-of-the-art Gaussian process baselines.

Transferable graph learning involves knowledge transferability from a source graph to a relevant target graph. The major challenge of transferable graph learning is the distribution shift between source and target graphs induced by individual node attributes and complex graph structures. To solve this problem, in this paper, we propose a generic graph-structured Gaussian process framework (GraphGP) for adaptively transferring knowledge across graphs with either homophily or heterophily assumptions. Specifically, GraphGP is derived from a novel graph structure-aware neural network in the limit on the layer width. The generalization analysis of GraphGP explicitly investigates the connection between knowledge transferability and graph domain similarity. Extensive experiments on several transferable graph learning benchmarks demonstrate the efficacy of GraphGP over state-of-the-art Gaussian process baselines.

Progressive Ensemble Distillation: Building Ensembles for Efficient Inference
Don Dennis Abhishek Shetty Anish Sevekari Kazuhito Koishida Virginia Smith



Research question: How to decompose a large pretrained teacher model into an ensemble of smaller, low-inference-cost student models.
Motivation: To reduce inference cost while preserving accuracy, allowing a flexible accuracy-versus-cost trade-off at inference time.
Method: Our method, B-DISTIL, uses a boosting procedure whose function-composition-based aggregation rules construct expressive ensembles from much smaller student models while retaining comparable performance.
Results: We demonstrate B-DISTIL's effectiveness by decomposing pretrained models across a variety of image, speech, and sensor datasets, and provide strong theoretical guarantees on convergence and generalization.

Knowledge distillation is commonly used to compress an ensemble of models into a single model. In this work we study the problem of progressive ensemble distillation: Given a large, pretrained teacher model, we seek to decompose the model into an ensemble of smaller, low-inference-cost student models. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost, which can be useful for a multitude of applications in efficient inference. Our method, B-DISTIL, uses a boosting procedure that allows function-composition-based aggregation rules to construct expressive ensembles with similar performance as using much smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across a variety of image, speech, and sensor datasets. Our method comes with strong theoretical guarantees in terms of convergence as well as generalization.

SPA: A Graph Spectral Alignment Perspective for Domain Adaptation
Zhiqing Xiao Haobo Wang Ying Jin Lei Feng Gang Chen Fei Huang Junbo Zhao



Research question: How to extend an in-domain model to distinct target domains in unsupervised domain adaptation (UDA) without sacrificing discriminability.
Motivation: Most prior UDA works capture inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability.
Method: We introduce SPA, a graph spectral alignment framework: the domain adaptation problem is cast into graph primitives, a coarse graph alignment mechanism with a novel spectral regularizer aligns the domain graphs in eigenspaces, and a fine-grained message propagation module built on a neighbor-aware self-training mechanism enhances discriminability in the target domain.
Results: On standardized benchmarks, extensive experiments show SPA surpasses existing state-of-the-art domain adaptation methods, with dense model analysis indicating superior efficacy, robustness, discriminability, and transferability.

Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to extend the in-domain model to the distinctive target domains where the data distributions differ. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. In this work, we introduce a novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The core of our method is briefly condensed as follows: (i)-by casting the DA problem to graph primitives, SPA composes a coarse graph alignment mechanism with a novel spectral regularizer towards aligning the domain graphs in eigenspaces; (ii)-we further develop a fine-grained message propagation module --- upon a novel neighbor-aware self-training mechanism --- in order for enhanced discriminability in the target domain. On standardized benchmarks, the extensive experiments of SPA demonstrate that its performance has surpassed the existing cutting-edge DA methods. Coupled with dense model analysis, we conclude that our approach indeed possesses superior efficacy, robustness, discriminability, and transferability. Code and data are available at: https://github.com/CrownX/SPA.

Finding Order in Chaos: A Novel Data Augmentation Method for Time Series in Contrastive Learning
Berken Utku Demirel Christian Holz



Research question: The success of contrastive learning depends on data augmentation, but augmenting time series is challenging because of complex data generation mechanisms, such as those of the cardiovascular system.
Motivation: While the degree of augmentation is well controlled by predefined techniques in domains like vision, there is no widely recognized, general time-series augmentation method applicable across different tasks.
Method: We propose a novel augmentation method for time-series tasks that aims to connect intra-class samples together and thereby find order in the latent space. It builds on the well-known mixup technique, adding a new approach that accounts for the non-stationary nature of time-series data.
Results: On three time-series tasks, heart rate estimation, human activity recognition, and cardiovascular disease detection, extensive experiments against state-of-the-art methods show that the approach outperforms prior work on optimal data generation and known augmentation techniques, reflecting its effectiveness.

The success of contrastive learning is well known to be dependent on data augmentation. Although the degree of data augmentations has been well controlled by utilizing pre-defined techniques in some domains like vision, time-series data augmentation is less explored and remains a challenging problem due to the complexity of the data generation mechanism, such as the intricate mechanism involved in the cardiovascular system. Moreover, there is no widely recognized and general time-series augmentation method that can be applied across different tasks. In this paper, we propose a novel data augmentation method for time-series tasks that aims to connect intra-class samples together, and thereby find order in the latent space. Our method builds upon the well-known data augmentation technique of mixup by incorporating a novel approach that accounts for the non-stationary nature of time-series data. Also, by controlling the degree of chaos created by data augmentation, our method leads to improved feature representations and performance on downstream tasks. We evaluate our proposed method on three time-series tasks, including heart rate estimation, human activity recognition, and cardiovascular disease detection. Extensive experiments against the state-of-the-art methods show that the proposed method outperforms prior works on optimal data generation and known data augmentation techniques in three tasks, reflecting the effectiveness of the presented method. The source code is withheld in accordance with the double-blind review policy.
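
Since the method builds on mixup, a minimal baseline mixup for a pair of time series is sketched below; the paper's actual contributions (accounting for non-stationarity and controlling the degree of chaos) are deliberately not reproduced here.

import numpy as np

def mixup_pair(x1, x2, alpha=0.2):
    # Vanilla mixup: convex-combine two same-length series (and, during
    # training, their labels) with a Beta-distributed coefficient.
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam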

ProteinNPT: Improving protein property prediction and design with non-parametric transformers
Pascal Notin Ruben Weitzman Debora Susan Marks Yarin Gal



Research question: This paper addresses the challenges of protein design: an expansive design space, sparse functional regions, and scarcity of available labels.
Motivation: Protein design holds immense potential for optimizing naturally occurring sequences, with broad applications in drug discovery, material design, and sustainability, yet computational methods for protein engineering face significant challenges.
Method: We present ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task optimization settings.
Results: Experiments show that ProteinNPT consistently outperforms all existing top-performing baselines across a diverse set of protein property prediction tasks, and several in silico Bayesian optimization experiments demonstrate its value for iterative protein design.

Protein design holds immense potential for optimizing naturally occurring sequences, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, including an expansive design space, sparse functional regions, and scarcity of available labels. Furthermore, real-life design scenarios often necessitate the simultaneous optimization of multiple properties, exacerbating label sparsity issues. In this paper, we present ProteinNPT, a non-parametric transformer variant tailored for protein sequences and particularly suited to label-scarce and multi-task optimization settings. We first expand the ProteinGym benchmark to evaluate models in supervised settings and develop several cross-validation schemes for robust assessment. Subsequently, we reimplement existing top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design in several in silico Bayesian optimization experiments.

Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy
Dongmin Park Seola Choi Doyoung Kim Hwanjun Song Jae-Gil Lee



Research question: How to downsize a large training set via data pruning while preserving model accuracy and generalization.
Motivation: Modern deep learning must handle large-scale datasets, which incurs enormous computational costs. Although many robust learning methods have been developed for data with annotation noise, data pruning for the noise-robust learning scenario has received little attention.
Method: This paper proposes Prune4Rel, a novel data pruning algorithm that finds a subset maximizing the total neighborhood confidence of all training examples, thereby improving re-labeling accuracy and the model's generalization performance.
Results: Extensive experiments on four real and one synthetic noisy datasets show that Prune4Rel outperforms baselines with Re-labeling models by up to 9.1% and those with a standard model by up to 21.6%.

Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Though large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art Re-labeling methods that self-correct erroneous labels while training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real and one synthetic noisy datasets show that Prune4Rel outperforms the baselines with Re-labeling models by up to 9.1% as well as those with a standard model by up to 21.6%.
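
The stated objective, maximizing total neighborhood confidence, admits a simple greedy sketch (the names and the Gaussian similarity are illustrative assumptions, not the authors' implementation):

import numpy as np

def greedy_prune(features, confidence, budget, sigma=1.0):
    # Greedily pick examples so that every training point has a selected
    # neighbor that is both close (high similarity) and confidently predicted.
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    gain = np.exp(-d2 / (2 * sigma ** 2)) * confidence[None, :]
    covered = np.zeros(n)          # best coverage each example has so far
    selected = []
    for _ in range(budget):
        marginal = np.maximum(gain - covered[:, None], 0.0).sum(0)
        j = int(marginal.argmax())
        selected.append(j)
        covered = np.maximum(covered, gain[:, j])
    return selected

subset = greedy_prune(np.random.randn(500, 16), np.random.rand(500), budget=50)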

Reverse Engineering Self-Supervised Learning
Ido Ben-Shaul Ravid Shwartz-Ziv Tomer Galanti Shai Dekel Yann LeCun



Research question: This paper "reverse engineers" self-supervised learning (SSL), analyzing in depth its learned internal representations across different models, architectures, and hyperparameters.
Motivation: Understanding the representations learned by self-supervised learning and the mechanisms underlying them is often challenging.
Method: Through an in-depth study of diverse models, architectures, and hyperparameters, we reveal an intriguing process in SSL training: an inherent facilitation of semantic label-based clustering, which is surprisingly driven by the regularization component of the SSL objective.
Results: Experiments show that this clustering not only enhances downstream classification but also compresses the information. Moreover, the SSL-trained representations align more strongly with semantic classes across hierarchical levels, and the alignment intensifies deeper in the network. This "reverse engineering" approach offers valuable insight into the inner mechanisms of SSL and their influence on performance across different class sets.

Understanding the learned representation and underlying mechanisms of Self-Supervised Learning (SSL) often poses a challenge. In this paper, we ‘reverse engineer’ SSL, conducting an in-depth empirical analysis of its learned internal representations, encompassing diverse models, architectures, and hyperparameters. Our study reveals an intriguing process within the SSL training: an inherent facilitation of semantic label-based clustering, which is surprisingly driven by the regularization component of the SSL objective. This clustering not only enhances downstream classification, but also compresses the information. We further illustrate that the alignment of the SSL-trained representation is more pronounced with semantic classes rather than random functions. Remarkably, the learned representations align with semantic classes across various hierarchical levels, with this alignment intensifying when going deeper into the network. This ‘reverse engineering’ approach provides valuable insights into the inner mechanism of SSL and their influences on the performance across different class sets.

Accelerating Molecular Graph Neural Networks via Knowledge Distillation
Filip Ekström Kelvinius Dimitar Georgiev Artur Toshev Johannes Gasteiger



Research question: How to use knowledge distillation (KD) to accelerate molecular graph neural networks (GNNs) while improving their predictive accuracy.
Motivation: Although recent GNNs have markedly advanced molecular property prediction and molecular simulation, their increasingly complex architectures and the demands of large-scale applications create performance bottlenecks in practice.
Method: We design dedicated KD strategies that distill the hidden representations of directional and equivariant GNNs and evaluate them on energy and force prediction tasks.
Results: Experiments show that the approach significantly boosts the predictive accuracy of student models while preserving the inference throughput of the lighter-weight models, closing the teacher-student accuracy gap by as much as 96.7% for energy prediction and 62.5% for force prediction.

Recent advances in graph neural networks (GNNs) have enabled more comprehensive modeling of molecules and molecular systems, thereby enhancing the precision of molecular property prediction and molecular simulations. Nonetheless, as the field has been progressing to bigger and more complex architectures, state-of-the-art GNNs have become largely prohibitive for many large-scale applications. In this paper, we explore the utility of knowledge distillation (KD) for accelerating molecular GNNs. To this end, we devise KD strategies that facilitate the distillation of hidden representations in directional and equivariant GNNs, and evaluate their performance on the regression task of energy and force prediction. We validate our protocols across different teacher-student configurations and datasets, and demonstrate that they can consistently boost the predictive accuracy of student models without any modifications to their architecture. Moreover, we conduct comprehensive optimization of various components of our framework, and investigate the potential of data augmentation to further enhance performance. All in all, we manage to close the gap in predictive accuracy between teacher and student models by as much as 96.7\% and 62.5\% for energy and force prediction respectively, while fully preserving the inference throughput of the more lightweight models.
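
A hedged PyTorch sketch of feature-based distillation of the kind described: an MSE penalty between projected student and frozen teacher node representations on top of the regression loss (the projection head and the weighting are assumptions):

import torch
import torch.nn as nn

class FeatureKD(nn.Module):
    def __init__(self, d_student, d_teacher, beta=100.0):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)  # aligns hidden widths
        self.beta = beta

    def forward(self, h_student, h_teacher, pred, target):
        task = nn.functional.l1_loss(pred, target)  # e.g., energy/force MAE
        distill = nn.functional.mse_loss(self.proj(h_student), h_teacher.detach())
        return task + self.beta * distill

kd = FeatureKD(d_student=64, d_teacher=256)
loss = kd(torch.randn(32, 64), torch.randn(32, 256), torch.randn(32, 1), torch.randn(32, 1))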

Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder
Huiwon Jang Jihoon Tack Daewon Choi Jongheon Jeong Jinwoo Shin



Research question: Despite the practical importance of self-supervised learning across many modalities, recent advances have concentrated on a few well-curated domains such as vision and language, often relying on domain-specific knowledge.
Motivation: For example, the Masked Auto-Encoder (MAE) has become a popular architecture in these domains, but its potential in other modalities remains underexplored.
Method: This paper develops MAE into a unified, modality-agnostic self-supervised learning framework. We argue that meta-learning is the key to interpreting MAE as a modality-agnostic learner and, motivated by improving its SSL across diverse modalities, propose enhancements to MAE, yielding MetaMAE.
Results: Our experiments show that MetaMAE is superior on the modality-agnostic SSL benchmark DABS, significantly outperforming prior baselines.

Despite its practical importance across a wide range of modalities, recent advances in self-supervised learning (SSL) have been primarily focused on a few well-curated domains, e.g., vision and language, often relying on their domain-specific knowledge. For example, Masked Auto-Encoder (MAE) has become one of the popular architectures in these domains, but its potential in other modalities has been less explored. In this paper, we develop MAE as a unified, modality-agnostic SSL framework. In turn, we argue that meta-learning is a key to interpreting MAE as a modality-agnostic learner, and propose enhancements to MAE motivated by jointly improving its SSL across diverse modalities, coined MetaMAE as a result. Our key idea is to view the mask reconstruction of MAE as a meta-learning task: masked tokens are predicted by adapting the Transformer meta-learner through the amortization of unmasked tokens. Based on this novel interpretation, we propose to integrate two advanced meta-learning techniques. First, we adapt the amortized latent of the Transformer encoder using gradient-based meta-learning to enhance the reconstruction. Then, we maximize the alignment between amortized and adapted latents through task contrastive learning, which guides the Transformer encoder to better encode task-specific knowledge. Our experiments demonstrate the superiority of MetaMAE on the modality-agnostic SSL benchmark (called DABS), where it significantly outperforms prior baselines.

Frequency Domain-Based Dataset Distillation
DongHyeok Shin Seungjae Shin Il-chul Moon



Research question: This paper proposes FreD, a new parameterization method for dataset distillation that uses the frequency domain to distill a small synthetic dataset from a large original dataset.
Motivation: Unlike conventional spatial-domain methods, FreD applies frequency-based transforms to optimize the frequency representation of each data instance. Because spatial-domain information concentrates on specific frequency components, FreD can intelligently select a subset of frequency dimensions to optimize, greatly reducing the budget needed to synthesize an instance.
Method: By selecting frequency dimensions according to explained variance, FreD operates effectively within a limited budget while preserving the information of the original dataset better than existing parameterization methods.
Results: Experiments show that FreD consistently outperforms existing distillation methods across evaluation scenarios on various benchmark datasets.

This paper presents FreD, a novel parameterization method for dataset distillation, which utilizes the frequency domain to distill a small-sized synthetic dataset from a large-sized original dataset. Unlike conventional approaches that focus on the spatial domain, FreD employs frequency-based transforms to optimize the frequency representations of each data instance. By leveraging the concentration of spatial-domain information on specific frequency components, FreD intelligently selects a subset of frequency dimensions for optimization, leading to a significant reduction in the budget required for synthesizing an instance. Through the selection of frequency dimensions based on the explained variance, FreD demonstrates both theoretical and empirical evidence of its ability to operate efficiently within a limited budget, while better preserving the information of the original dataset compared to conventional parameterization methods. Furthermore, based on the orthogonal compatibility of FreD with existing methods, we confirm that FreD consistently improves the performance of existing distillation methods across evaluation scenarios with different benchmark datasets. We release the code at https://github.com/sdh0818/FreD.
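
A small sketch of the frequency-domain parameterization idea using a DCT and a variance-based mask (a rough stand-in for FreD's explained-variance selection; the function names are illustrative):

import numpy as np
from scipy.fft import dctn, idctn

def select_frequency_mask(images, k):
    # Keep the k frequency positions with the largest variance across the
    # dataset, a rough proxy for selection by explained variance.
    coeffs = np.stack([dctn(img, norm='ortho') for img in images])
    var = coeffs.var(axis=0)
    return var >= np.sort(var.ravel())[-k]

def to_image(freq_params, mask):
    # A synthetic instance is stored as only mask.sum() coefficients.
    full = np.zeros(mask.shape)
    full[mask] = freq_params
    return idctn(full, norm='ortho')

images = np.random.rand(100, 32, 32)
mask = select_frequency_mask(images, k=64)
synthetic = to_image(np.random.randn(mask.sum()), mask)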

Simplifying Neural Network Training Under Class Imbalance
Ravid Shwartz-Ziv Micah Goldblum Yucen Lily Li C. Bayan Bruss Andrew Gordon Wilson



Research question: How to improve the performance of deep learning models on highly imbalanced real-world datasets.
Motivation: Real-world datasets often suffer severe class imbalance, which adversely affects deep model performance. Most research on training neural networks under class imbalance has focused on specialized loss functions and sampling techniques.
Method: We demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, architecture size, pre-training, optimizer, and label smoothing, achieves state-of-the-art performance without any specialized loss functions or samplers.
Results: Our experiments show that this approach substantially improves performance on class-imbalanced data, and we also provide key prescriptions and considerations for training under class imbalance.

Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models. The majority of research on training neural networks under class imbalance has focused on specialized loss functions and sampling techniques. Notably, we demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, architecture size, pre-training, optimizer, and label smoothing, can achieve state-of-the-art performance without any specialized loss functions or samplers. We also provide key prescriptions and considerations for training under class imbalance, and an understanding of why imbalance methods succeed or fail.

A Partially-Supervised Reinforcement Learning Framework for Visual Active Search
Anindya Sarkar Nathan Jacobs Yevgeniy Vorobeychik



Research question: How to effectively use visual cues to guide exploration over large geospatial areas and identify regions of interest.
Motivation: Existing visual active search (VAS) approaches, such as deep reinforcement learning (DRL) and traditional active search, excel in some domains but cannot fully exploit the supervised information obtained during training or actual search, limiting their use on search tasks that differ substantially from the training distribution.
Method: This paper combines DRL and traditional active search by decomposing the search policy into a prediction module, which produces a geospatial distribution of regions of interest from a task embedding and the search history, and a search module, which takes the predictions and search history as input and outputs the search distribution. A novel meta-learning approach jointly learns the resulting combined policy so that it effectively uses supervised information obtained at both training and decision time.
Results: Experiments show that the proposed representation and meta-learning framework significantly outperforms the state of the art in visual active search on several problem domains.

Visual active search (VAS) has been proposed as a modeling framework in which visual cues are used to guide exploration, with the goal of identifying regions of interest in a large geospatial area. Its potential applications include identifying hot spots of rare wildlife poaching activity, search-and-rescue scenarios, identifying illegal trafficking of weapons, drugs, or people, and many others. State of the art approaches to VAS include applications of deep reinforcement learning (DRL), which yield end-to-end search policies, and traditional active search, which combines predictions with custom algorithmic approaches. While the DRL framework has been shown to greatly outperform traditional active search in such domains, its end-to-end nature does not make full use of supervised information attained either during training, or during actual search, a significant limitation if search tasks differ significantly from those in the training distribution. We propose an approach that combines the strength of both DRL and conventional active search approaches by decomposing the search policy into a prediction module, which produces a geospatial distribution of regions of interest based on task embedding and search history, and a search module, which takes the predictions and search history as input and outputs the search distribution. In addition, we develop a novel meta-learning approach for jointly learning the resulting combined policy that can make effective use of supervised information obtained both at training and decision time. Our extensive experiments demonstrate that the proposed representation and meta-learning frameworks significantly outperform state of the art in visual active search on several problem domains.

Causal Effect Regularization: Automated Detection and Removal of Spurious Correlations
Abhinav Kumar Amit Deshpande Amit Sharma



Research question: In many classification datasets, task labels are spuriously correlated with some input attributes; classifiers trained on such datasets often rely on these attributes for prediction and fail to generalize when the attributes' correlation shifts at deployment.
Motivation: In real-world data, information about spurious attributes is typically unavailable. We therefore propose to automatically identify spurious attributes by estimating their causal effect on the label and then use a regularization objective to mitigate the classifier's reliance on them.
Method: Our method estimates each attribute's causal effect on the label and applies a regularization objective that reduces the classifier's dependence on the attributes identified as spurious.
Results: Compared to a recent method for identifying spurious attributes, ours removes the attribute from the learned model more accurately, especially when the spurious correlation is high, and it reduces reliance on spurious attributes even under noisy estimates of the causal effects.

In many classification datasets, the task labels are spuriously correlated with some input attributes. Classifiers trained on such datasets often rely on these attributes for prediction, especially when the spurious correlation is high, and thus fail to generalize whenever there is a shift in the attributes' correlation at deployment. If we assume that the spurious attributes are known a priori, several methods have been proposed to learn a classifier that is invariant to the specified attributes. However, in real-world data, information about spurious attributes is typically unavailable. Therefore, we propose a method to automatically identify spurious attributes by estimating their causal effect on the label and then use a regularization objective to mitigate the classifier's reliance on them. Compared to a recent method for identifying spurious attributes, we find that our method is more accurate in removing the attribute from the learned model, especially when spurious correlation is high. Specifically, across synthetic, semi-synthetic, and real-world datasets, our method shows significant improvement in a metric used to quantify the dependence of a classifier on spurious attributes ($\Delta$Prob), while obtaining better or similar accuracy. In addition, our method mitigates the reliance on spurious attributes even under noisy estimation of causal effects. To explain the empirical robustness of our method, we create a simple linear classification task with two sets of attributes: causal and spurious. We prove that our method only requires that the ranking of estimated causal effects is correct across attributes to select the correct classifier.

Active Negative Loss Functions for Learning with Noisy Labels
Xichen Ye Xiaoqiang Li Songmin Dai Tong Liu Yan Sun Weiqin Tong



Research question: How to train deep neural networks in the presence of noisy labels.
Motivation: Existing robust loss functions use Mean Absolute Error (MAE) as a necessary component, but MAE treats every sample equally, which slows convergence and can make training difficult.
Method: We propose a new class of theoretically robust passive loss functions, *Normalized Negative Loss Functions* (NNLFs), which focus more on memorized clean samples. By replacing the MAE in APL with the proposed NNLFs, we improve APL and obtain a new framework called *Active Negative Loss* (ANL).
Results: Experiments show that the new loss functions created by our ANL framework can outperform state-of-the-art methods.

Robust loss functions are essential for training deep neural networks in the presence of noisy labels. Some robust loss functions use Mean Absolute Error (MAE) as its necessary component. For example, the recently proposed Active Passive Loss (APL) uses MAE as its passive loss function. However, MAE treats every sample equally, slows down the convergence and can make training difficult. In this work, we propose a new class of theoretically robust passive loss functions different from MAE, namely *Normalized Negative Loss Functions* (NNLFs), which focus more on memorized clean samples. By replacing the MAE in APL with our proposed NNLFs, we improve APL and propose a new framework called *Active Negative Loss* (ANL). Experimental results on benchmark and real-world datasets demonstrate that the new set of loss functions created by our ANL framework can outperform state-of-the-art methods. The code is available at https://github.com/Virusdoll/Active-Negative-Loss.
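
To make the active/passive combination concrete, here is a hedged PyTorch sketch: the active term is the normalized cross entropy from the APL line of work, while passive_negative is only a stand-in illustrating the NNLF idea of emphasizing samples the model already fits, not the paper's exact definition:

import torch
import torch.nn.functional as F

def normalized_ce(logits, target):
    # Cross entropy divided by its sum over all candidate labels.
    log_p = F.log_softmax(logits, dim=1)
    ce = -log_p.gather(1, target[:, None]).squeeze(1)
    return (ce / (-log_p.sum(dim=1))).mean()

def passive_negative(logits, target):
    # Stand-in passive loss: weight (1 - p_y) by the detached confidence p_y,
    # so memorized clean (high-confidence) samples dominate the gradient.
    p_y = F.softmax(logits, dim=1).gather(1, target[:, None]).squeeze(1)
    return (p_y.detach() * (1.0 - p_y)).mean()

def anl_loss(logits, target, alpha=1.0, beta=1.0):
    return alpha * normalized_ce(logits, target) + beta * passive_negative(logits, target)

loss = anl_loss(torch.randn(32, 10), torch.randint(0, 10, (32,)))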

GAN You See Me? Enhanced Data Reconstruction Attacks against Split Inference
Ziang Li Mengda Yang Yaxin Liu Juan Wang Hongxin Hu Wenzhe Yi Xiaoyang Xu



Research question: This paper targets the computational constraints and data privacy issues of deep learning, particularly on edge devices.
Motivation: Split Inference (SI) is an emerging deep learning paradigm that eases edge devices' computational limits and protects data privacy, yet it is vulnerable to Data Reconstruction Attacks (DRA). Existing attack methods have various limitations: optimization-based DRAs cannot exploit public data effectively, while learning-based DRAs depend heavily on the quantity and distributional similarity of auxiliary data.
Method: To overcome these challenges, we propose a GAN-based LAtent Space Search attack (GLASS) that harnesses rich prior knowledge from public data via advanced StyleGAN technologies, and we further introduce GLASS++ to enhance reconstruction stability.
Results: Ours is the first GAN-based DRA against SI; extensive evaluation across different split points and adversary setups demonstrates its state-of-the-art performance. We also examine seven defense mechanisms in detail, highlighting our method's ability to reveal private information even in their presence.

Split Inference (SI) is an emerging deep learning paradigm that addresses computational constraints on edge devices and preserves data privacy through collaborative edge-cloud approaches. However, SI is vulnerable to Data Reconstruction Attacks (DRA), which aim to reconstruct users' private prediction instances. Existing attack methods suffer from various limitations. Optimization-based DRAs do not leverage public data effectively, while Learning-based DRAs depend heavily on auxiliary data quantity and distribution similarity. Consequently, these approaches yield unsatisfactory attack results and are sensitive to defense mechanisms. To overcome these challenges, we propose a GAN-based LAtent Space Search attack (GLASS) that harnesses abundant prior knowledge from public data using advanced StyleGAN technologies. Additionally, we introduce GLASS++ to enhance reconstruction stability. Our approach represents the first GAN-based DRA against SI, and extensive evaluation across different split points and adversary setups demonstrates its state-of-the-art performance. Moreover, we thoroughly examine seven defense mechanisms, highlighting our method's capability to reveal private information even in the presence of these defenses.
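
The core of a latent-space-search DRA can be sketched generically (G and f_edge are assumed callables: any pretrained generator, e.g. a StyleGAN whose W space is searched in the paper, and the victim's edge-side layers; G.latent_dim is a hypothetical attribute):

import torch

def latent_space_search(G, f_edge, smashed, steps=500, lr=0.05):
    # Find a latent whose generated image, pushed through the edge-side
    # layers, reproduces the intercepted intermediate activations.
    z = torch.randn(1, G.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(f_edge(G(z)), smashed)
        loss.backward()
        opt.step()
    return G(z).detach()  # the reconstructed private input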

Secure Out-of-Distribution Task Generalization with Energy-Based Models
Shengzhuang Chen Long-Kai Huang Jonathan Richard Schwarz Yilun Du Ying Wei



Research question: Meta-learning on out-of-distribution (OOD) tasks in the wild is hit-and-miss; how can the generalization of meta-learned prior knowledge to OOD tasks be safeguarded, especially in safety-critical applications?
Motivation: The reliability of existing Bayesian meta-learning methods for detecting OOD tasks and adapting the prior is limited by incomplete coverage of the feature distribution shift and insufficient expressiveness of the meta-learned prior.
Method: We build a single coherent framework that supports both detection and adaptation of OOD tasks while remaining compatible with off-the-shelf meta-learning backbones. The proposed Energy-Based Meta-Learning (EBML) framework characterizes any arbitrary meta-training task distribution with the composition of two expressive neural-network-based energy functions.
Results: Experiments on four regression and classification datasets demonstrate the effectiveness of the method.

The success of meta-learning on out-of-distribution (OOD) tasks in the wild has proved to be hit-and-miss. Safeguarding the generalization capability of the meta-learned prior knowledge to OOD tasks, particularly in safety-critical applications, necessitates detection of an OOD task followed by adaptation of the task towards the prior. Nonetheless, the reliability of estimated uncertainty on OOD tasks by existing Bayesian meta-learning methods is restricted by incomplete coverage of the feature distribution shift and insufficient expressiveness of the meta-learned prior. Besides, they struggle to adapt an OOD task, running parallel to the line of cross-domain task adaptation solutions, which are vulnerable to overfitting. To this end, we build a single coherent framework that supports both detection and adaptation of OOD tasks, while remaining compatible with off-the-shelf meta-learning backbones. The proposed Energy-Based Meta-Learning (EBML) framework learns to characterize any arbitrary meta-training task distribution with the composition of two expressive neural-network-based energy functions. We deploy the sum of the two energy functions, which is proportional to the joint distribution of a task, as a reliable score for detecting OOD tasks; during meta-testing, we adapt the OOD task towards in-distribution tasks by energy minimization. Experiments on four regression and classification datasets demonstrate the effectiveness of our proposal.
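
A toy sketch of the two ingredients, energy-based OOD scoring and energy-minimizing adaptation (the architecture, inputs, and threshold are placeholders, not the paper's EBML design):

import torch
import torch.nn as nn

class TaskEnergy(nn.Module):
    # Sum of two expressive energy functions scores a task; higher energy
    # means further from the meta-training task distribution.
    def __init__(self, d):
        super().__init__()
        self.e1 = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
        self.e2 = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, task_emb):
        return (self.e1(task_emb) + self.e2(task_emb)).mean()

energy = TaskEnergy(d=32)
task_emb = torch.randn(8, 32, requires_grad=True)   # embedded support examples
is_ood = energy(task_emb).item() > 0.0              # threshold set on meta-val tasks

opt = torch.optim.SGD([task_emb], lr=0.1)           # adapt: pull the task in-distribution
for _ in range(20):
    opt.zero_grad()
    energy(task_emb).backward()
    opt.step()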

PERFOGRAPH: A Numerical Aware Program Graph Representation for Performance Optimization and Program Analysis
Ali TehraniJamsaz Quazi Ishtiaque Mahmud Le Chen Nesreen K. Ahmed Ali Jannesari



Research question: How to represent programming languages effectively so that machine learning methods can better understand and reason about programs.
Motivation: Current program representations lack numerical awareness and aggregate data-structure information, and they present variables improperly, which limits their performance and range of applications.
Method: We propose PERFOGRAPH, a novel graph-based program representation that captures numerical information and aggregate data structures by introducing new nodes and edges, together with an adapted embedding method that incorporates numerical awareness.
Results: Experiments show that PERFOGRAPH excels in applications including program analysis, performance optimization, and parallelism discovery, and it sets new state-of-the-art results on the well-known Device Mapping challenge by reducing the error rate by 7.4% (AMD dataset) and 10% (NVIDIA dataset).

The remarkable growth and significant success of machine learning have expanded its applications into programming languages and program analysis. However, a key challenge in adopting the latest machine learning methods is the representation of programming languages which has a direct impact on the ability of machine learning methods to reason about programs. The absence of numerical awareness, aggregate data structure information, and improper way of presenting variables in previous representation works have limited their performances. To overcome the limitations and challenges of current program representations, we propose a novel graph-based program representation called PERFOGRAPH. PERFOGRAPH can capture numerical information and the aggregate data structure by introducing new nodes and edges. Furthermore, we propose an adapted embedding method to incorporate numerical awareness. These enhancements make PERFOGRAPH a highly flexible and scalable representation that can effectively capture programs' intricate dependencies and semantics. Consequently, it serves as a powerful tool for various applications such as program analysis, performance optimization, and parallelism discovery. Our experimental results demonstrate that PERFOGRAPH outperforms existing representations and sets new state-of-the-art results by reducing the error rate by 7.4% (AMD dataset) and 10% (NVIDIA dataset) in the well-known Device Mapping challenge. It also sets new state-of-the-art results in various performance optimization tasks like Parallelism Discovery and Numa and Prefetchers Configuration prediction.

TriRE: A Multi-Mechanism Learning Paradigm for Continual Knowledge Retention and Promotion
Preetha Vijayan Prashant Shivaram Bhat Bahram Zonooz Elahe Arani



Research question: This paper tackles the challenge deep networks face in continual learning: catastrophic forgetting (CF) of previously learned tasks.
Motivation: Existing techniques such as weight regularization, experience rehearsal, and parameter isolation alleviate forgetting to some extent, but they have remained largely orthogonal to one another, each with shortcomings, and they miss the advantages of competing strategies.
Method: Inspired by how the brain simultaneously leverages multiple neurophysiological processes, including neurogenesis, active forgetting, neuromodulation, metaplasticity, experience rehearsal, and context-dependent gating, to learn, adapt, and transfer knowledge across tasks, we propose TriRE, a novel continual learning paradigm that retains the most prominent neurons for each task, revises and solidifies the knowledge extracted from current and past tasks, and actively promotes inactive neurons for subsequent tasks through rewinding and relearning.
Results: Across various continual learning settings, TriRE significantly reduces task interference and surpasses different continual learning approaches considered in isolation.

Continual learning (CL) has remained a persistent challenge for deep neural networks due to catastrophic forgetting (CF) of previously learned tasks. Several techniques such as weight regularization, experience rehearsal, and parameter isolation have been proposed to alleviate CF. Despite their relative success, these research directions have predominantly remained orthogonal and suffer from several shortcomings, while missing out on the advantages of competing strategies. On the contrary, the brain continually learns, accommodates, and transfers knowledge across tasks by simultaneously leveraging several neurophysiological processes, including neurogenesis, active forgetting, neuromodulation, metaplasticity, experience rehearsal, and context-dependent gating, rarely resulting in CF. Inspired by how the brain exploits multiple mechanisms concurrently, we propose TriRE, a novel CL paradigm that encompasses retaining the most prominent neurons for each task, revising and solidifying the extracted knowledge of current and past tasks, and actively promoting less active neurons for subsequent tasks through rewinding and relearning. Across CL settings, TriRE significantly reduces task interference and surpasses different CL approaches considered in isolation.

Scalarization for Multi-Task and Multi-Domain Learning at Scale
Amelie Royer Tijmen Blankevoort Babak Ehteshami Bejnordi



Research question: How to optimize networks for multi-domain and multi-task learning, particularly given the discrepancies between different tasks or domains.
Motivation: Training a single model on multiple input domains and/or output tasks compresses information from multiple sources into a unified backbone, improving model efficiency; it also enables knowledge transfer across tasks/domains, improving accuracy and data-efficient training.
Method: We first devise a large-scale unified analysis to better understand the dynamics of scalarization across varied task/domain combinations and model sizes, and then propose leveraging population-based training to efficiently search for the optimal scalarization weights when dealing with a large number of tasks or domains.
Results: Experiments show that this approach effectively finds optimal scalarization weights for large numbers of tasks or domains, improving model efficiency and accuracy.

Training a single model on multiple input domains and/or output tasks allows for compressing information from multiple sources into a unified backbone hence improves model efficiency. It also enables potential positive knowledge transfer across tasks/domains, leading to improved accuracy and data-efficient training. However, optimizing such networks is a challenge, in particular due to discrepancies between the different tasks or domains: Despite several hypotheses and solutions proposed over the years, recent work has shown that uniform scalarization training, i.e., simply minimizing the average of the task losses, yields on-par performance with more costly SotA optimization methods. This raises the issue of how well we understand the training dynamics of multi-task and multi-domain networks. In this work, we first devise a large-scale unified analysis of multi-domain and multi-task learning to better understand the dynamics of scalarization across varied task/domain combinations and model sizes. Following these insights, we then propose to leverage population-based training to efficiently search for the optimal scalarization weights when dealing with a large number of tasks or domains.
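
Scalarization itself is one line; a minimal sketch, where the weight vector is what population-based training would search over:

import torch

def scalarized_loss(task_losses, weights=None):
    # Uniform scalarization (weights=None) is the strong baseline the paper
    # revisits; non-uniform weights form the search space for PBT.
    losses = torch.stack(list(task_losses))
    if weights is None:
        weights = torch.full_like(losses, 1.0 / losses.numel())
    return (weights * losses).sum()

loss = scalarized_loss([torch.tensor(0.5), torch.tensor(1.2), torch.tensor(0.8)])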

Revisiting Visual Model Robustness: A Frequency Long-Tailed Distribution View
Zhiyu Lin Yifei Gao Yunfan Yang Jitao Sang



Research question: What causes visual models' lack of robustness?
Motivation: A prevailing hypothesis holds that visual models exploit high-frequency components (HFC) imperceptible to the human eye, which accounts for their lack of robustness.
Method: This paper redefines HFC from a frequency long-tailed perspective and revisits the relationship between HFC and model robustness. In the frequency long-tailed scenario, experiments show that standardly trained models are highly sensitive to HFC because the information content of HFC is limited. Based on these findings, we propose a Balance Spectrum Sampling (BaSS) strategy that effectively counteracts the long-tailed effect and enhances the model's learning on HFC.
Results: Experiments show that the method achieves a substantially better robustness-accuracy trade-off when combined with existing defense methods, and that encouraging HFC learning can improve model performance.

A widely discussed hypothesis regarding the cause of visual models' lack of robustness is that they can exploit human-imperceptible high-frequency components (HFC) in images, which in turn leads to model vulnerabilities such as adversarial examples. However, (1) inconsistent findings regarding the validation of this hypothesis reflect a limited understanding of HFC, and (2) solutions inspired by the hypothesis tend to involve a robustness-accuracy trade-off and lean towards suppressing the model's learning on HFC. In this paper, inspired by the long-tailed characteristic observed in the frequency spectrum, we first formally define HFC from a long-tailed perspective and then revisit the relationship between HFC and model robustness. In the frequency long-tailed scenario, experimental results on common datasets and various network structures consistently indicate that models under standard training exhibit high sensitivity to HFC. We investigate the reason for this sensitivity, which is reflected in the model's under-fitting behavior on HFC. Furthermore, we attribute this under-fitting to the limited information content in HFC. Based on these findings, we propose a Balance Spectrum Sampling (BaSS) strategy, which effectively counteracts the long-tailed effect and enhances the model's learning on HFC. Extensive experimental results demonstrate that our method achieves a substantially better robustness-accuracy trade-off when combined with existing defense methods, while also indicating the potential of encouraging HFC learning for improving model performance.

Implicit variance regularization in non-contrastive SSL
Manu Srinath Halvagal Axel Laborieux Friedemann Zenke



Research question: How non-contrastive self-supervised learning methods such as BYOL and SimSiam use asymmetric predictor networks, rather than negative samples, to avoid representational collapse, and how the predictor network facilitates stable learning.
Motivation: Previous theoretical analyses assumed Euclidean losses, whereas most practical implementations rely on cosine similarity; to better understand non-contrastive SSL, we theoretically analyze the learning dynamics of closed-form linear predictor networks under both Euclidean and cosine similarity.
Method: We analyze the learning dynamics in the eigenspace of closed-form linear predictor networks under Euclidean and cosine similarity and find that both avoid collapse through implicit variance regularization, albeit via different dynamical mechanisms. We further find that the eigenvalues act as effective learning-rate multipliers and propose a family of isotropic loss functions (IsoLoss) that equalize convergence rates across eigenmodes.
Results: Experiments show that IsoLoss speeds up the initial learning dynamics and improves robustness, allowing us to dispense with the EMA target network typically used in non-contrastive methods. Our analysis reveals the variance regularization mechanisms of non-contrastive SSL and lays theoretical groundwork for loss functions that shape the learning dynamics of the predictor's spectrum.

Non-contrastive SSL methods like BYOL and SimSiam rely on asymmetric predictor networks to avoid representational collapse without negative samples. Yet, how predictor networks facilitate stable learning is not fully understood. While previous theoretical analyses assumed Euclidean losses, most practical implementations rely on cosine similarity. To gain further theoretical insight into non-contrastive SSL, we analytically study learning dynamics in conjunction with Euclidean and cosine similarity in the eigenspace of closed-form linear predictor networks. We show that both avoid collapse through implicit variance regularization albeit through different dynamical mechanisms. Moreover, we find that the eigenvalues act as effective learning rate multipliers and propose a family of isotropic loss functions (IsoLoss) that equalize convergence rates across eigenmodes. Empirically, IsoLoss speeds up the initial learning dynamics and increases robustness, thereby allowing us to dispense with the EMA target network typically used with non-contrastive methods. Our analysis sheds light on the variance regularization mechanisms of non-contrastive SSL and lays the theoretical grounds for crafting novel loss functions that shape the learning dynamics of the predictor's spectrum.

Auxiliary Losses for Learning Generalizable Concept-based Models
Ivaxi Sheth Samira Ebrahimi Kahou



Research question: How to improve the transparency of neural network models while avoiding learning irrelevant concept representations.
Motivation: Existing Concept Bottleneck Models (CBMs) improve model transparency but often learn irrelevant concept representations that harm model performance.
Method: We propose the cooperative Concept Bottleneck Model (coop-CBM) and introduce the concept orthogonal loss (COL) to encourage separation between concept representations and to reduce the intra-concept distance.
Results: Under various distributional shift settings, coop-CBM achieves higher accuracy on all datasets, even surpassing black-box models with the highest concept accuracy.

The increasing use of neural networks in various applications has led to increasing apprehension, underscoring the necessity to understand their operations beyond mere final predictions. As a solution to enhance model transparency, Concept Bottleneck Models (CBMs) have gained popularity since their introduction. CBMs essentially limit the latent space of a model to human-understandable high-level concepts. While beneficial, CBMs have been reported to often learn irrelevant concept representations that in turn damage model performance. To overcome the performance trade-off, we propose a cooperative Concept Bottleneck Model (coop-CBM). The concept representation of our model is particularly meaningful when fine-grained concept labels are absent. Furthermore, we introduce the concept orthogonal loss (COL) to encourage separation between the concept representations and to reduce the intra-concept distance. This paper presents extensive experiments on real-world datasets for image classification tasks, namely CUB, AwA2, CelebA and TIL. We also study the performance of coop-CBM models under various distributional shift settings. We show that our proposed method achieves higher accuracy in all distributional shift settings, even compared to black-box models with the highest concept accuracy.

Prompt-augmented Temporal Point Process for Streaming Event Sequence
Siqiao Xue Yan Wang Zhixuan Chu Xiaoming Shi Caigao JIANG Hongyan Hao Gangwei Jiang Xiaoyun Feng James Y. Zhang JUN ZHOU



Research question: How to continuously monitor continuous-time event sequences and learn from streaming events under privacy and memory constraints.
Motivation: In real-world applications, event data typically arrive as a stream, and the distribution of their patterns may shift over time.
Method: We adopt Continual Learning (CL) and propose a simple yet effective framework, PromptTPP, which integrates a base TPP with a continuous-time retrieval prompt pool.
Results: PromptTPP consistently sets state-of-the-art performance across two real user behavior datasets.

Neural Temporal Point Processes (TPPs) are the prevalent paradigm for modeling continuous-time event sequences, such as user activities on the web and financial transactions. In real world applications, the event data typically comes in a streaming manner, where the distribution of the patterns may shift over time. Under the privacy and memory constraints commonly seen in real scenarios, how to continuously monitor a TPP to learn the streaming event sequence is an important yet under-investigated problem. In this work, we approach this problem by adopting Continual Learning (CL), which aims to enable a model to continuously learn a sequence of tasks without catastrophic forgetting. While CL for event sequence is less well studied, we present a simple yet effective framework, PromptTPP, by integrating the base TPP with a continuous-time retrieval prompt pool. In our proposed framework, prompts are small learnable parameters, maintained in a memory space and jointly optimized with the base TPP so that the model is properly instructed to learn event streams arriving sequentially without buffering past examples or task-specific attributes. We formalize a novel and realistic experimental setup for modeling event streams, where PromptTPP consistently sets state-of-the-art performance across two real user behavior datasets.

On-the-Fly Adapting Code Summarization on Trainable Cost-Effective Language Models
Yufan Cai Yun Lin Chenyan Liu Jinglian Wu Yifan Zhang Yiming Liu Yeyun Gong Jin Song Dong



Research question: How to improve the performance of code comment generators, particularly when project-specific code does not align with the training corpus.
Motivation: When fitting code from a specific project, existing deep learning models can be misled by contradictory and inconsistent code samples from other projects, which degrades performance.
Method: We propose Adacom, a new approach that improves comment generators via on-the-fly model adaptation. It detects whether the model may underperform on the target code, retrieves helpful training samples that have contradictory counterparts in the training set, and re-trains on them to strengthen the helpful samples and unlearn the harmful ones.
Results: Extensive experiments on 7 comment generators and 4 public datasets show that the approach significantly boosts comment generation (BLEU4 up by 14.9% on average, METEOR by 12.2%, and ROUGE-L by 7.4%), that adapting on a single code sample is cost-effective and acceptable as an on-the-fly solution, and that it adapts well to out-of-distribution code samples.

Deep learning models are emerging that summarize source code into comments, facilitating code documentation and program comprehension. Scaled-up large language models trained on large open corpora have achieved good performance in such tasks. However, in practice, the subject code in a given project can be specific and may not align with the overall training corpus. Code samples from other projects may be contradictory and introduce inconsistencies when the models try to fit all the samples. In this work, we introduce a novel approach, Adacom, to improve the performance of comment generators by on-the-fly model adaptation. This research is motivated by the observation that deep comment generators often need to strike a balance as they must fit all the training samples. Specifically, for one certain target code $c$, some training samples $S_p$ could have made more contributions while other samples $S_o$ could have counter effects. However, traditionally fine-tuned models need to fit both $S_p$ and $S_o$ from a global perspective, leading to compromised performance on the target code $c$. In this context, we design Adacom to (1) detect whether the model might have compromised performance on a target code $c$, (2) retrieve a few helpful training samples $S_p$ that have contradictory samples in the training dataset, and (3) adapt the model on the fly by re-training on $S_p$ to strengthen the helpful samples and unlearn the harmful samples. Our extensive experiments on 7 comment generators and 4 public datasets show that (1) Adacom can significantly boost the performance of comment generation (BLEU4 score by on average 14.9\%, METEOR by 12.2\%, and ROUGE-L by 7.4\%), (2) the adaptation on one code sample is cost-effective and acceptable as an on-the-fly solution, and (3) Adacom can adapt well to out-of-distribution code samples.

Feature Likelihood Score: Evaluating the Generalization of Generative Models Using Samples
Marco Jiralerspong Joey Bose Ian Gemp Chongli Qin Yoram Bachrach Gauthier Gidel



Research question: Current methods for evaluating deep generative models are incomplete: standard likelihood metrics do not apply to high-dimensional complex data, and sample-based metrics are insensitive to overfitting.
Motivation: To address these problems, we propose the Feature Likelihood Score (FLS), a new metric that comprehensively assesses the novelty, fidelity, and diversity of generated samples.
Method: FLS is a parametric sample-based score that uses density estimation to provide a comprehensive trichotomic evaluation.
Results: Experiments show that FLS accurately identifies specific overfitting cases where previously proposed metrics fail, matches the intuitions of earlier metrics such as FID across various image datasets and model classes, and offers a more comprehensive evaluation of generative models.

The past few years have seen impressive progress in the development of deep generative models capable of producing high-dimensional, complex, and photo-realistic data. However, current methods for evaluating such models remain incomplete: standard likelihood-based metrics do not always apply and rarely correlate with perceptual fidelity, while sample-based metrics, such as FID, are insensitive to overfitting, i.e., inability to generalize beyond the training set. To address these limitations, we propose a new metric called the Feature Likelihood Score (FLS), a parametric sample-based score that uses density estimation to provide a comprehensive trichotomic evaluation accounting for novelty (i.e., different from the training samples), fidelity, and diversity of generated samples. We empirically demonstrate the ability of FLS to identify specific overfitting problem cases, where previously proposed metrics fail. We also extensively evaluate FLS on various image datasets and model classes, demonstrating its ability to match intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models.
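
A rough sketch of the density-estimation recipe behind such a score, using a Gaussian mixture over features from any frozen encoder (this approximates the idea; it is not the authors' exact estimator):

import numpy as np
from sklearn.mixture import GaussianMixture

def feature_likelihood(gen_feats, test_feats, n_components=10):
    # Fit a density model on generated-sample features, then score held-out
    # real test features; overfitting to the train set lowers this score.
    gm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gm.fit(gen_feats)
    return gm.score(test_feats)  # mean log-likelihood, higher is better

print(feature_likelihood(np.random.randn(1000, 64), np.random.randn(200, 64)))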

A Bayesian Approach To Analysing Training Data Attribution In Deep Learning
Elisa Nguyen Minjoon Seo Seong Joon Oh



Research question: How to accurately find the training data that influence a model's predictions.
Motivation: Training data attribution (TDA) techniques are conceptually useful but hard to apply to deep models in practice, particularly because of their sensitivity to model initialisation.
Method: We view the TDA task from a Bayesian perspective, treating the learned model as a Bayesian posterior and the TDA estimates as random variables.
Results: We find that the influence of an individual training sample is often overshadowed by the noise from model initialisation and SGD batch composition; TDA can therefore be reliably used to explain deep model predictions only for training-test pairs that are unaffected by such noise factors.

Training data attribution (TDA) techniques find influential training data for the model's prediction on the test data of interest. They approximate the impact of down- or up-weighting a particular training sample. While conceptually useful, they are hardly applicable to deep models in practice, particularly because of their sensitivity to different model initialisation. In this paper, we introduce a Bayesian perspective on the TDA task, where the learned model is treated as a Bayesian posterior and the TDA estimates as random variables. From this novel viewpoint, we observe that the influence of an individual training sample is often overshadowed by the noise stemming from model initialisation and SGD batch composition. Based on this observation, we argue that TDA can only be reliably used for explaining deep model predictions that are consistently influenced by certain training data, independent of other noise factors. Our experiments demonstrate the rarity of such noise-independent training-test data pairs but confirm their existence. We recommend that future researchers and practitioners trust TDA estimates only in such cases. Further, we find a disagreement between ground truth and estimated TDA distributions and encourage future work to study this gap. Code is provided at https://github.com/ElisaNguyen/bayesian-tda.
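
The Bayesian reading suggests a simple practical recipe, sketched below with hypothetical train_fn and influence_fn callables (retraining over seeds stands in for posterior sampling):

import numpy as np

def tda_over_posterior(train_fn, influence_fn, train_x, test_x, n_samples=10):
    # Treat each (seed -> trained model) as a posterior sample and the TDA
    # score as a random variable; report its mean and spread.
    scores = [influence_fn(train_fn(seed), train_x, test_x) for seed in range(n_samples)]
    scores = np.asarray(scores)
    # Only trust the attribution when the signal dominates the noise from
    # initialisation and batch composition, e.g. |mean| well above std.
    return scores.mean(), scores.std()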

Neural Harmonics: Bridging Spectral Embedding and Matrix Completion in Self-Supervised Learning
Marina Munkhoeva Ivan Oseledets



Research question: This paper seeks to understand how modern self-supervised representation learning methods work from the perspective of the Laplace operator, and connects their inductive bias to a low-rank matrix completion problem.
Motivation: Self-supervised methods have attracted tremendous attention thanks to their seemingly heuristic approach to learning representations that respect the semantics of the data without any apparent supervision in the form of labels.
Method: We leverage results from low-rank matrix completion to provide a theoretical analysis of the convergence of modern SSL methods and of a key property that affects their downstream performance.
Results: This yields a deeper, theoretically grounded understanding of how self-supervised learning methods work.

Self-supervised methods received tremendous attention thanks to their seemingly heuristic approach to learning representations that respect the semantics of the data without any apparent supervision in the form of labels. A growing body of literature is already being published in an attempt to build a coherent and theoretically grounded understanding of the workings of a zoo of losses used in modern self-supervised representation learning methods. In this paper, we attempt to provide an understanding from the perspective of a Laplace operator and connect the inductive bias stemming from the augmentation process to a low-rank matrix completion problem. To this end, we leverage the results from low-rank matrix completion to provide theoretical analysis on the convergence of modern SSL methods and a key property that affects their downstream performance.

UniTSFace: Unified Threshold Integrated Sample-to-Sample Loss for Face Recognition
Qiufu Li Xi Jia Jiancan Zhou Linlin Shen Jinming Duan



Research question: Existing sample-to-class-based face recognition models cannot fully explore the cross-sample relationships among large numbers of facial images, while sample-to-sample-based models require sophisticated pairing processes for training.
Motivation: To address these problems, this paper proposes a unified threshold integrated sample-to-sample loss (USS loss), which features an explicit unified threshold for distinguishing positive from negative pairs.
Method: Inspired by the USS loss, we also derive sample-to-sample based softmax and BCE losses and discuss their relationship. Extensive evaluation on multiple benchmark datasets, including MFR, IJB-C, LFW, CFP-FP, AgeDB, and MegaFace, shows that the proposed USS loss is highly efficient and can work seamlessly with sample-to-class-based losses.
Results: With the combined loss (USS plus sample-to-class softmax loss), we overcome the pitfalls of previous approaches; the trained face model, UniTSFace, exhibits exceptional performance and outperforms state-of-the-art methods such as CosFace, ArcFace, VPL, AnchorFace, and UNPG. Our code is available at https://github.com/CVI-SZU/UniTSFace.

Sample-to-class-based face recognition models can not fully explore the cross-sample relationship among large amounts of facial images, while sample-to-sample-based models require sophisticated pairing processes for training. Furthermore, neither method satisfies the requirements of real-world face verification applications, which expect a unified threshold separating positive from negative facial pairs. In this paper, we propose a unified threshold integrated sample-to-sample based loss (USS loss), which features an explicit unified threshold for distinguishing positive from negative pairs. Inspired by our USS loss, we also derive the sample-to-sample based softmax and BCE losses, and discuss their relationship. Extensive evaluation on multiple benchmark datasets, including MFR, IJB-C, LFW, CFP-FP, AgeDB, and MegaFace, demonstrates that the proposed USS loss is highly efficient and can work seamlessly with sample-to-class-based losses. The embedded loss (USS and sample-to-class Softmax loss) overcomes the pitfalls of previous approaches and the trained facial model UniTSFace exhibits exceptional performance, outperforming state-of-the-art methods, such as CosFace, ArcFace, VPL, AnchorFace, and UNPG. Our code is available at https://github.com/CVI-SZU/UniTSFace.
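
A hedged sketch of a unified-threshold sample-to-sample objective: every positive pair's cosine similarity should sit above a single threshold t and every negative pair's below it, with softplus as a smooth hinge (the paper's exact USS formulation may differ):

import torch
import torch.nn.functional as F

def uss_style_loss(emb, labels, t=0.3, s=16.0):
    # Assumes the batch contains at least one positive and one negative pair.
    sim = F.normalize(emb) @ F.normalize(emb).T
    same = labels[:, None].eq(labels[None, :])
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos, neg = sim[same & ~eye], sim[~same]
    # s scales the hinge sharpness; one shared threshold t separates all pairs.
    return F.softplus(s * (t - pos)).mean() + F.softplus(s * (neg - t)).mean()

loss = uss_style_loss(torch.randn(32, 128), torch.randint(0, 8, (32,)))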

Improving Self-supervised Molecular Representation Learning using Persistent Homology
Yuankai Luo Lei Shi Veronika Thost



Research question: This paper studies self-supervised learning for molecular representation learning, particularly given complex molecular graphs and large amounts of unlabelled data.
Motivation: The complexity of molecular graphs, the abundance of unlabelled data, the high cost of obtaining labels experimentally, and the resulting small training datasets make self-supervised learning highly promising for molecular representation learning.
Method: We employ persistent homology, a mathematical tool for modeling topological features of data that persist across multiple scales, within self-supervised learning: we design an autoencoder and a contrastive loss function to improve the representation space.
Results: Experiments show that after SSL with persistent homology, the molecular representations outperform the baselines on different probing tasks and offer more predictive power. Moreover, our method and loss function substantially improve performance on small datasets, a common scenario in practice.

Self-supervised learning (SSL) has great potential for molecular representation learning given the complexity of molecular graphs, the large amounts of unlabelled data available, the considerable cost of obtaining labels experimentally, and the hence often only small training datasets. The importance of the topic is reflected in the variety of paradigms and architectures that have been investigated recently. Yet the differences in performance seem often minor and are barely understood to date. In this paper, we study SSL based on persistent homology (PH), a mathematical tool for modeling topological features of data that persist across multiple scales. It has several unique features which particularly suit SSL, naturally offering: different views of the data, stability in terms of distance preservation, and the opportunity to flexibly incorporate domain knowledge. We (1) investigate an autoencoder, which shows the general representational power of PH, and (2) propose a contrastive loss that complements existing approaches. We rigorously evaluate our approach for molecular property prediction and demonstrate its particular features in improving the embedding space: after SSL, the representations are better and offer considerably more predictive power than the baselines over different probing tasks; our loss increases baseline performance, sometimes largely; and we often obtain substantial improvements over very small datasets, a common scenario in practice.

Mitigating the Effect of Incidental Correlations on Part-based Learning
Gaurav Bhatt Deepayan Das Leonid Sigal Vineeth N. Balasubramanian



Research question: Current part-learners struggle with incidental correlations arising from limited observations of objects that appear only in specific arrangements or against specific backgrounds, which can harm the generalization and interpretability of the learned part representations.
Motivation: This study argues that part-based representations can be made more interpretable and better-generalizing through two innovative regularization methods.
Method: First, a unique mixture-of-parts formulation separates the generative processes of foreground and background information, with a weakly-supervised loss imposing structural constraints on the parts to ensure the mixture-of-parts yields soft, object-agnostic masks for foreground and background. Second, a distillation loss ensures that the learned parts are invariant to incidental background correlations. Sparse and orthogonal constraints are additionally introduced to facilitate learning high-quality part representations.
Results: By reducing the impact of incidental background correlations on the learned parts, this work achieves state-of-the-art performance on few-shot learning tasks on benchmarks including MiniImagenet, TieredImageNet, and FC100. It also shows that the part representations obtained with this approach generalize better than existing techniques, even under background shifts and common data corruption on the ImageNet-9 dataset.

Intelligent systems possess a crucial characteristic of breaking complicated problems into smaller reusable components or parts and adjusting to new tasks using these part representations. However, current part-learners encounter difficulties in dealing with incidental correlations resulting from the limited observations of objects that may appear only in specific arrangements or with specific backgrounds. These incidental correlations may have a detrimental impact on the generalization and interpretability of learned part representations. This study asserts that part-based representations could be more interpretable and generalize better with limited data, employing two innovative regularization methods. The first regularization separates foreground and background information's generative process via a unique mixture-of-parts formulation. Structural constraints are imposed on the parts using a weakly-supervised loss, guaranteeing that the mixture-of-parts for foreground and background entails soft, object-agnostic masks. The second regularization assumes the form of a distillation loss, ensuring the invariance of the learned parts to the incidental background correlations. Furthermore, we incorporate sparse and orthogonal constraints to facilitate learning high-quality part representations. By reducing the impact of incidental background correlations on the learned parts, we exhibit state-of-the-art (SoTA) performance on few-shot learning tasks on benchmark datasets, including MiniImagenet, TieredImageNet, and FC100. We also demonstrate that the part-based representations acquired through our approach generalize better than existing techniques, even under domain shifts of the background and common data corruption on the ImageNet-9 dataset.

PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning
Hojoon Lee Hanseul Cho Hyunseung Kim Daehoon Gwak Joonkee Kim Jaegul Choo Se-Young Yun Chulhee Yun



Research question: How to improve sample efficiency in reinforcement learning, especially when data acquisition is costly and risky.
Motivation: In principle, off-policy reinforcement learning algorithms can improve sample efficiency by allowing multiple updates per environment interaction, but such multiple updates often make the model overfit to earlier interactions, a phenomenon known as loss of plasticity.
Method: We investigate the underlying causes of this phenomenon by dividing plasticity into input plasticity and label plasticity. Synthetic experiments on the CIFAR-10 dataset show that finding smoother minima of the loss landscape improves input plasticity, while refined gradient propagation improves label plasticity. Building on these findings, we propose the **PLASTIC** algorithm, which harmoniously combines techniques that address both concerns.
Results: With minimal architectural modifications, PLASTIC achieves competitive performance on benchmarks such as Atari-100k and the DeepMind Control Suite, underscoring the importance of preserving model plasticity for sample-efficient reinforcement learning. The code is available at https://github.com/dojeon-ai/plastic.

In Reinforcement Learning (RL), enhancing sample efficiency is crucial, particularly in scenarios where data acquisition is costly and risky. In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction. However, these multiple updates often lead the model to overfit to earlier interactions, which is referred to as the loss of plasticity. Our study investigates the underlying causes of this phenomenon by dividing plasticity into two aspects: input plasticity, which denotes the model's adaptability to changing input data, and label plasticity, which denotes the model's adaptability to evolving input-output relationships. Synthetic experiments on the CIFAR-10 dataset reveal that finding smoother minima of the loss landscape enhances input plasticity, whereas refined gradient propagation improves label plasticity. Leveraging these findings, we introduce the **PLASTIC** algorithm, which harmoniously combines techniques to address both concerns. With minimal architectural modifications, PLASTIC achieves competitive performance on benchmarks including Atari-100k and the DeepMind Control Suite. This result emphasizes the importance of preserving the model's plasticity to elevate sample efficiency in RL. The code is available at https://github.com/dojeon-ai/plastic.

Risk-Averse Active Sensing for Timely Outcome Prediction under Cost Pressure
Yuchao Qin Mihaela van der Schaar Changhee Lee



Research question: How to acquire patient covariates efficiently in healthcare monitoring so that adverse events can be detected and addressed early.
Motivation: In longitudinal follow-ups of a patient's health status, screening and lab tests are expensive, so covariates must be acquired in an effective and economical way.
Method: This paper proposes RAS, a novel risk-averse active sensing policy that decomposes the decision problem into two sub-problems: when to conduct acquisitions and which measurements to make. It also introduces a novel risk-aversion training strategy focusing on the underrepresented subgroup of high-risk patients.
Results: Experiments show that the method outperforms baseline active sensing approaches on both synthetic and real-world datasets, and case studies further demonstrate the significance of the policy decomposition and the necessity of a risk-averse sensing policy.

Timely outcome prediction is essential in healthcare to enable early detection and intervention of adverse events. However, in longitudinal follow-ups to patients' health status, cost-efficient acquisition of patient covariates is usually necessary due to the significant expense involved in screening and lab tests. To balance the timely and accurate outcome predictions with acquisition costs, an effective active sensing strategy is crucial. In this paper, we propose a novel risk-averse active sensing approach RAS that addresses the composite decision problem of when to conduct the acquisition and which measurements to make. Our approach decomposes the policy into two sub-policies: acquisition scheduler and feature selector, respectively. Moreover, we introduce a novel risk-aversion training strategy to focus on the underrepresented subgroup of high-risk patients for whom timely and accurate prediction of disease progression is of greater value. Our method outperforms baseline active sensing approaches in experiments with both synthetic and real-world datasets, and we illustrate the significance of our policy decomposition and the necessity of a risk-averse sensing policy through case studies.

Analyzing the Sample Complexity of Self-Supervised Image Reconstruction Methods
Tobit Klug Dogukan Atik Reinhard Heckel



Research question: This paper investigates the cost of self-supervised training in terms of sample complexity and its performance gap relative to supervised training.
Motivation: Although supervised training achieves state-of-the-art performance on many image reconstruction tasks, collecting training pairs of clean images and noisy measurements is difficult; self-supervised methods allow training from noisy measurements alone, without clean images.
Method: We study the cost of a class of self-supervised methods (including noise2noise methods) that enable computing unbiased estimates of the gradients of the supervised loss. Our analysis shows that a model trained with such self-supervision is as good as the same model trained with supervision, but self-supervised training requires more examples.
Results: We study self-supervised denoising and accelerated MRI empirically and characterize the cost of self-supervised training in terms of the number of additional samples required, finding that the performance gap between self-supervised and supervised training shrinks as the number of training examples grows, in line with our theoretical predictions.

Supervised training of deep neural networks on pairs of clean image and noisy measurement achieves state-of-the-art performance for many image reconstruction tasks, but such training pairs are difficult to collect. Self-supervised methods enable training based on noisy measurements only, without clean images. In this work, we investigate the cost of self-supervised training in terms of sample complexity for a class of self-supervised methods that enable the computation of unbiased estimates of gradients of the supervised loss, including noise2noise methods. We analytically show that a model trained with such self-supervised training is as good as the same model trained in a supervised fashion, but self-supervised training requires more examples than supervised training. We then study self-supervised denoising and accelerated MRI empirically and characterize the cost of self-supervised training in terms of the number of additional samples required, and find that the performance gap between self-supervised and supervised training vanishes as a function of the training examples, at a problem-dependent rate, as predicted by our theory.
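
The noise2noise member of this class fits in a few lines; a minimal sketch (model and opt are any denoiser and optimizer; y1 and y2 are two independent noisy observations of the same clean image):

import torch

def noise2noise_step(model, opt, y1, y2):
    # Regressing one noisy view onto another gives an unbiased estimate of
    # the supervised-loss gradient -- the class of methods analyzed above.
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(y1), y2)
    loss.backward()
    opt.step()
    return loss.item()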

Cross-Domain Policy Adaptation via Value-Guided Data Filtering
Kang Xu Chenjia Bai Xiaoteng Ma Dong Wang Bin Zhao Zhen Wang Xuelong Li Wei Li



Research question: In reinforcement learning, generalizing policies across domains, particularly under dynamics mismatch, is a major challenge.
Motivation: For example, a robot may learn its policy in a simulator, but when deployed in the real world, the environment dynamics can differ. Given source and target domains with dynamics mismatch, we consider the online dynamics adaptation problem, in which the agent can access sufficient source-domain data while online interactions with the target domain are limited.
Method: We propose the Value-Guided Data Filtering (VGDF) algorithm, which approaches the problem from the value difference perspective via a novel insight on value consistency across domains. Specifically, we selectively share source-domain transitions based on the proximity of paired value targets across the two domains.
Results: Experimental results on various environments with kinematic and morphology shifts show that our method achieves superior performance compared to prior approaches.

Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning. For example, a robot learns the policy in a simulator, but when it is deployed in the real world, the dynamics of the environment may be different. Given the source and target domain with dynamics mismatch, we consider the online dynamics adaptation problem, in which case the agent can access sufficient source domain data while online interactions with the target domain are limited. Existing research has attempted to solve the problem from the dynamics discrepancy perspective. In this work, we reveal the limitations of these methods and explore the problem from the value difference perspective via a novel insight on the value consistency across domains. Specifically, we present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets across the two domains. Empirical results on various environments with kinematic and morphology shifts demonstrate that our method achieves superior performance compared to prior approaches.
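
A simplified sketch of value-guided filtering (the q_target interface and the ranking criterion are illustrative stand-ins; VGDF's actual criterion compares paired value targets across the two domains):

import torch

def filter_source_batch(s, a, r, s2, q_target, gamma=0.99, keep_ratio=0.25):
    # Keep only source transitions whose value targets look consistent with
    # the target domain: small |r + gamma * V(s') - Q(s, a)| under the
    # target-domain critic.
    with torch.no_grad():
        gap = (r + gamma * q_target.value(s2) - q_target(s, a)).abs()
    k = max(1, int(keep_ratio * len(r)))
    idx = gap.topk(k, largest=False).indices
    return s[idx], a[idx], r[idx], s2[idx]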

Interpreting Unsupervised Anomaly Detection in Security via Rule Extraction
Ruoyu Li Qing Li Yu Zhang Dan Zhao Yong Jiang Yong Yang



Research question: How to globally explain a black-box unsupervised anomaly detection model.
Motivation: Because malicious data are extremely rare, many security applications require unsupervised anomaly detection trained only on unlabeled normal data; yet, lacking interpretability, such black-box models are hard for security operators to trust given the high stakes.
Method: This paper proposes a post-hoc method that globally explains black-box unsupervised anomaly detection models via rule extraction. First, we propose distribution decomposition rules that decompose the complex distribution of normal data into multiple compositional distributions; to find such rules, we design an unsupervised Interior Clustering Tree whose splitting criteria incorporate the model's predictions. We then propose the Compositional Boundary Exploration (CBE) algorithm to obtain boundary inference rules that estimate the original model's decision boundary on each compositional distribution. Merging the two types of rules into a rule set lets us present the reasoning of the unsupervised black-box model in a human-understandable way and simultaneously build a surrogate rule-based model for online deployment.
Results: We run comprehensive explanation experiments on four distinct unsupervised anomaly detection models over various real-world datasets; the evaluation shows that our method outperforms existing methods on diverse metrics including fidelity, correctness, and robustness.

Many security applications require unsupervised anomaly detection, as malicious data are extremely rare and often only unlabeled normal data are available for training (i.e., zero-positive). However, security operators are concerned about the high stakes of trusting black-box models due to their lack of interpretability. In this paper, we propose a post-hoc method to globally explain a black-box unsupervised anomaly detection model via rule extraction. First, we propose the concept of distribution decomposition rules that decompose the complex distribution of normal data into multiple compositional distributions. To find such rules, we design an unsupervised Interior Clustering Tree that incorporates the model prediction into the splitting criteria. Then, we propose the Compositional Boundary Exploration (CBE) algorithm to obtain the boundary inference rules that estimate the decision boundary of the original model on each compositional distribution. By merging these two types of rules into a rule set, we can present the inferential process of the unsupervised black-box model in a human-understandable way, and build a surrogate rule-based model for online deployment at the same time. We conduct comprehensive experiments on the explanation of four distinct unsupervised anomaly detection models on various real-world datasets. The evaluation shows that our method outperforms existing methods in terms of diverse metrics including fidelity, correctness and robustness.

LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching
Duy Minh Ho Nguyen Hoang Nguyen Nghiem Tuong Diep Tan Ngoc Pham Tri Cao Binh T. Nguyen Paul Swoboda Nhat Ho Shadi Albarqouni Pengtao Xie Daniel Sonntag Mathias Niepert



Research question: How to train pre-trained models on large-scale medical image datasets that can be adapted to new tasks, addressing the scarcity of annotated medical imaging samples.
Motivation: Although networks pre-trained on ImageNet and web-scale data, as well as vision-language foundation models, are the prevailing approaches, their effectiveness on medical tasks is limited by the significant domain shift between natural and medical images.
Method: We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We collected approximately 1.3 million medical images from 55 publicly available datasets, covering many organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm based on a graph-matching formulation.
Results: We comprehensively evaluate LVM-Med on 15 downstream medical tasks spanning segmentation, classification, and object detection, in both in-distribution and out-of-distribution settings. LVM-Med outperforms many state-of-the-art supervised, self-supervised, and foundation models; on challenging tasks such as brain tumor classification and diabetic retinopathy grading, it improves over previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.

Obtaining large pre-trained models that can be fine-tuned to new tasks with limited annotated samples has remained an open challenge for medical imaging data. While pre-trained networks on ImageNet and vision-language foundation models trained on web-scale data are the prevailing approaches, their effectiveness on medical tasks is limited due to the significant domain shift between natural and medical images. To bridge this gap, we introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets, covering a large number of organs and modalities such as CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art self-supervised algorithms on this dataset and propose a novel self-supervised contrastive learning algorithm using a graph-matching formulation. The proposed approach makes three contributions: (i) it integrates prior pair-wise image similarity metrics based on local and global information; (ii) it captures the structural constraints of feature embeddings through a loss function constructed through a combinatorial graph-matching objective, and (iii) it can be trained efficiently end-to-end using modern gradient-estimation techniques for black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream medical tasks ranging from segmentation and classification to object detection, and both for the in and out-of-distribution settings. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models. For challenging tasks such as Brain Tumor Classification or Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models trained on 1 billion masks by 6-7% while using only a ResNet-50.

Hyperbolic Space with Hierarchical Margin Boosts Fine-Grained Learning from Coarse Labels
ShuLin Xu Yifan Sun Faen Zhang Anqi Xu Xiu-Shen Wei Yi Yang



Research question: How to learn fine-grained embeddings from coarse labels, particularly for few-shot fine-grained recognition.
Motivation: Learning fine-grained embeddings from coarse labels is challenging because the detailed distinctions are missing, and the task is even more demanding in few-shot fine-grained recognition.
Method: We propose a novel method that embeds visual features into a hyperbolic space and enhances their discriminative ability with hierarchical cosine margins. The hyperbolic space offers the advantages of capturing hierarchical relationships and increased expressive power, which favor fine-grained object modeling.
Results: Extensive experiments on five benchmark datasets demonstrate the effectiveness of the method, which surpasses competing approaches and achieves state-of-the-art results.

Learning fine-grained embeddings from coarse labels is a challenging task due to limited label granularity supervision, i.e., lacking the detailed distinctions required for fine-grained tasks. The task becomes even more demanding when attempting few-shot fine-grained recognition, which holds practical significance in various applications. To address these challenges, we propose a novel method that embeds visual embeddings into a hyperbolic space and enhances their discriminative ability with a hierarchical cosine margins manner. Specifically, the hyperbolic space offers distinct advantages, including the ability to capture hierarchical relationships and increased expressive power, which favors modeling fine-grained objects. Based on the hyperbolic space, we further enforce relatively large/small similarity margins between coarse/fine classes, respectively, yielding the so-called hierarchical cosine margins manner. While enforcing similarity margins in the regular Euclidean space has become popular for deep embedding learning, applying it to the hyperbolic space is non-trivial and validating the benefit for coarse-to-fine generalization is valuable. Extensive experiments conducted on five benchmark datasets showcase the effectiveness of our proposed method, yielding state-of-the-art results surpassing competing methods.

Generalized Information-theoretic Multi-view Clustering
Weitian Huang Sirui Yang Hongmin Cai



Research question: This paper reformulates the multi-view clustering problem from an information-theoretic perspective and proposes a general theoretical model.
Motivation: Existing multi-view unsupervised learning methods often rely on strict assumptions of semantic consistency among samples.
Method: A multi-view variational lower bound is obtained by approximating the high-dimensional mutual information, and the KL divergence is used to deduce sample assignments. The resulting information-based method leverages deep neural networks and Stochastic Gradient Variational Bayes to perform representation learning and clustering simultaneously.
Results: Extensive experiments on a wide variety of synthetic and real datasets show that the method achieves more stable and superior clustering performance than state-of-the-art algorithms.

In an era of more diverse data modalities, multi-view clustering has become a fundamental tool for comprehensive data analysis and exploration. However, existing multi-view unsupervised learning methods often rely on strict assumptions on semantic consistency among samples. In this paper, we reformulate the multi-view clustering problem from an information-theoretic perspective and propose a general theoretical model. In particular, we define three desiderata under multi-view unsupervised learning in terms of mutual information, namely, comprehensiveness, concentration, and cross-diversity. The multi-view variational lower bound is then obtained by approximating the samples' high-dimensional mutual information. The Kullback–Leibler divergence is utilized to deduce sample assignments. Ultimately, the information-based multi-view clustering model leverages deep neural networks and Stochastic Gradient Variational Bayes to achieve representation learning and clustering simultaneously. Extensive experiments on a wide variety of both synthetic and real datasets demonstrate that the proposed method exhibits a more stable and superior clustering performance than state-of-the-art algorithms.

Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data
Boris van Breugel Nabeel Seedat Fergus Imrie Mihaela van der Schaar



Research question: How to accurately evaluate machine learning model performance on diverse and underrepresented subgroups, so as to ensure fairness and reliability in real-world applications.
Motivation: Accurate model evaluation is challenging due to the scarcity of test data (especially for small subgroups) and possible distributional shifts in the model's deployment environment.
Method: Propose 3S Testing, a deep generative modeling framework that facilitates model evaluation by generating synthetic test sets for small subgroups and simulating distributional shifts.
Results: Experiments show that 3S Testing outperforms traditional approaches that use only real test data when estimating model performance on minority subgroups and under plausible distributional shifts. In addition, 3S provides intervals around its performance estimates, covering the ground truth better than existing methods.

Evaluating the performance of machine learning models on diverse and underrepresented subgroups is essential for ensuring fairness and reliability in real-world applications. However, accurately assessing model performance becomes challenging due to two main issues: (1) a scarcity of test data, especially for small subgroups, and (2) possible distributional shifts in the model's deployment setting, which may not align with the available test data. In this work, we introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation by generating synthetic test sets for small subgroups and simulating distributional shifts. Our experiments demonstrate that 3S Testing outperforms traditional baselines---including real test data alone---in estimating model performance on minority subgroups and under plausible distributional shifts. In addition, 3S offers intervals around its performance estimates, exhibiting superior coverage of the ground truth compared to existing approaches. Overall, these results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.

Disentangling Cognitive Diagnosis with Limited Exercise Labels
Xiangzhi Chen Le Wu Fei Liu Lei Chen Kun Zhang Richang Hong Meng Wang



Research question: How to perform cognitive diagnosis when only a limited number of exercises are labeled with concepts.
Motivation: Because labeling exercises is very costly, a more practical scenario is that only limited exercises are annotated with concepts; cognitive diagnosis in this setting is under-explored.
Method: Propose the Disentanglement-based Cognitive Diagnosis (DCD) model, which uses students' response records to model student proficiency, exercise difficulty, and exercise label distribution. Two novel modules, group-based disentanglement and limited-labeled alignment, are introduced to disentangle the factors relevant to concepts and align them with the real limited labels.
Results: Extensive experiments on widely used benchmarks demonstrate the superiority of the proposed model.

Cognitive diagnosis is an important task in intelligence education, which aims at measuring students’ proficiency in specific knowledge concepts. Given a fully labeled exercise-concept matrix, most existing models have focused on mining students' response records for cognitive diagnosis. Despite their success, due to the huge cost of labeling exercises, a more practical scenario is that limited exercises are labeled with concepts. Performing cognitive diagnosis with limited exercise labels is under-explored and remains largely open. In this paper, we propose Disentanglement based Cognitive Diagnosis (DCD) to address the challenges of limited exercise labels. Specifically, we utilize students' response records to model student proficiency, exercise difficulty and exercise label distribution. Then, we introduce two novel modules - group-based disentanglement and limited-labeled alignment modules - to disentangle the factors relevant to concepts and align them with real limited labels. Particularly, we introduce the tree-like structure of concepts with negligible cost for group-based disentangling, as concepts of different levels exhibit different independence relationships. Extensive experiments on widely used benchmarks demonstrate the superiority of our proposed model.

Switching Temporary Teachers for Semi-Supervised Semantic Segmentation
Jaemin Na Jung-Woo Ha Hyung Jin Chang Dongyoon Han Wonjun Hwang



Research question: Existing teacher-student frameworks for semi-supervised semantic segmentation mainly use an exponential moving average (EMA) to update a single teacher's weights, but this couples the teacher's and student's weights and can cause a performance bottleneck.
Motivation: To resolve the weight-coupling problem in the teacher-student framework, this paper proposes Dual Teacher, which employs two temporary teachers to alleviate the coupling for the student.
Method: Dual Teacher lets the two temporary teachers take turns generating pseudo-labels to train the student model, preserving the student model's distinct characteristics in each epoch and preventing the teacher and student from becoming excessively close.
Results: Experimental results show that Dual Teacher achieves competitive performance on the PASCAL VOC, Cityscapes, and ADE20K benchmarks with markedly shorter training times than state-of-the-art methods, and it is compatible with both CNN and Transformer models.

The teacher-student framework, prevalent in semi-supervised semantic segmentation, mainly employs the exponential moving average (EMA) to update a single teacher's weights based on the student's. However, EMA updates raise a problem in that the weights of the teacher and student become coupled, causing a potential performance bottleneck. Furthermore, this problem may become more severe when training with more complicated labels such as segmentation masks but with few annotated data. This paper introduces Dual Teacher, a simple yet effective approach that employs dual temporary teachers aiming to alleviate the coupling problem for the student. The temporary teachers work in shifts and are progressively improved, thus consistently preventing the teacher and student from becoming excessively close. Specifically, the temporary teachers periodically take turns generating pseudo-labels to train a student model and maintain the distinct characteristics of the student model for each epoch. Consequently, Dual Teacher achieves competitive performance on the PASCAL VOC, Cityscapes, and ADE20K benchmarks with remarkably shorter training times than state-of-the-art methods. Moreover, we demonstrate that our approach is model-agnostic and compatible with both CNN- and Transformer-based models. Code is available at https://github.com/naver-ai/dual-teacher.
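
The switching idea can be sketched as follows, assuming a plain classification objective; the labeled-data loss, augmentations, and segmentation-specific details of the paper are omitted.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    # Standard mean-teacher EMA: teacher <- alpha * teacher + (1 - alpha) * student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)

def train_with_dual_teachers(student, unlabeled_loader, optimizer, num_epochs):
    # Two temporary teachers initialized from the student; they take turns
    # epoch by epoch, so the student never couples to a single EMA teacher.
    teachers = [copy.deepcopy(student), copy.deepcopy(student)]
    for epoch in range(num_epochs):
        teacher = teachers[epoch % 2]                 # active teacher this epoch
        for x in unlabeled_loader:
            with torch.no_grad():
                pseudo = teacher(x).argmax(dim=1)     # pseudo-labels from teacher
            loss = F.cross_entropy(student(x), pseudo)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ema_update(teacher, student)              # only the active teacher moves
```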

Fair Canonical Correlation Analysis
Zhuoping Zhou Davoud Ataee Tarzanagh Bojian Hou Boning Tong Jia Xu Yanbo Feng Qi Long Li Shen



Research question: This paper investigates whether fairness and bias exist in Canonical Correlation Analysis (CCA), a statistical technique widely used to examine the relationship between two sets of variables.
Motivation: Because CCA models can produce unfair outcomes with respect to protected attributes, a method is needed that mitigates unfairness by minimizing the correlation disparity error associated with protected attributes.
Method: We propose a framework that mitigates unfairness by having the CCA model learn global projection matrices from all data points while ensuring that these matrices yield correlation levels comparable to those of group-specific projection matrices.
Results: Experimental evaluations on synthetic and real-world datasets show that our method effectively reduces unfairness without sacrificing CCA accuracy. These findings underscore the importance of considering fairness when applying CCA to real-world problems.

This paper investigates fairness and bias in Canonical Correlation Analysis (CCA), a widely used statistical technique for examining the relationship between two sets of variables. We present a framework that alleviates unfairness by minimizing the correlation disparity error associated with protected attributes. Our approach enables the CCA model to learn global projection matrices from all data points while ensuring that these matrices yield comparable correlation levels to group-specific projection matrices. Experimental evaluation on both synthetic and real-world datasets demonstrates the efficacy of our method in reducing unfairness without compromising CCA model accuracy. These findings emphasize the importance of considering fairness in CCA applications to real-world problems.

Towards Test-Time Refusals via Concept Negation
Peiran Dong Song Guo Junxiao Wang Bingjie WANG Jiewei Zhang Ziming Liu



Research question: This paper addresses the unbounded outputs of generative models, in particular how to uphold the ethical and copyright integrity of synthesized content when working with widely adopted diffusion models.
Motivation: Although concept negation is a promising approach that has made valuable contributions to defining and governing a model's output space, it still suffers from significant limitations, such as the inability to handle the interconnected nature of concepts in reality.
Method: This paper proposes a new framework named $ProtoRe$ that improves the flexibility of concept negation via test-time negative-concept identification and purification in the feature space. Concretely, $ProtoRe$ incorporates CLIP's language-contrastive knowledge to identify the prototype of negative concepts, uses the prototype as a prompt to extract negative features from the outputs, and further refines the attention maps by retrieving negative features.
Results: Evaluations on multiple benchmarks show that $ProtoRe$ outperforms state-of-the-art methods under various settings, in terms of both purification effectiveness and the fidelity of generated images.

Generative models produce unbounded outputs, necessitating the use of refusal techniques to confine their output space. Employing generative refusals is crucial in upholding the ethical and copyright integrity of synthesized content, particularly when working with widely adopted diffusion models. "Concept negation" presents a promising paradigm to achieve generative refusals, as it effectively defines and governs the model's output space based on concepts, utilizing natural language interfaces that are readily comprehensible to humans. However, despite the valuable contributions of prior research to the field of concept negation, it still suffers from significant limitations. The existing concept negation methods, which operate based on the composition of score or noise predictions from the diffusion process, are limited to independent concepts (e.g., "a blonde girl" without "glasses") and fail to consider the interconnected nature of concepts in reality (e.g., "Mickey mouse eats ice cream" without "Disney characters"). Keeping the limitations in mind, we propose a novel framework, called $ProtoRe$, to improve the flexibility of concept negation via test-time negative concept identification along with purification in the feature space. $ProtoRe$ works by incorporating CLIP's language-contrastive knowledge to identify the prototype of negative concepts, extract the negative features from outputs using the prototype as a prompt, and further refine the attention maps by retrieving negative features. Our evaluation on multiple benchmarks shows that $ProtoRe$ outperforms state-of-the-art methods under various settings, in terms of the effectiveness of purification and the fidelity of generated images.

FiGURe: Simple and Efficient Unsupervised Node Representations with Filter Augmentations
Chanakya Ekbote Ajinkya Deshpande Arun Iyer SUNDARARAJAN SELLAMANICKAM Ramakrishna B Bairi



Research question: Existing unsupervised node representation learning methods perform well on downstream tasks, but they rely on augmentations that mimic low-pass filters, limiting their performance on tasks that require other parts of the eigen-spectrum.
Motivation: This paper proposes filter-based augmentations that capture different parts of the eigen-spectrum, thereby improving unsupervised node representation learning.
Method: Unsupervised node representations are learned with contrastive learning, using filter-based augmentations to capture different parts of the eigen-spectrum. In addition, simple random Fourier feature projections reduce the high-dimensional representations to lower dimensions, cutting the computational cost.
Results: Experimental results show significant improvements across datasets, with an average gain of up to 4.4% over existing unsupervised models.

Unsupervised node representations learnt using contrastive learning-based methods have shown good performance on downstream tasks. However, these methods rely on augmentations that mimic low-pass filters, limiting their performance on tasks requiring different eigen-spectrum parts. This paper presents a simple filter-based augmentation method to capture different parts of the eigen-spectrum. We show significant improvements using these augmentations. Further, we show that sharing the same weights across these different filter augmentations is possible, reducing the computational load. In addition, previous works have shown that good performance on downstream tasks requires high dimensional representations. Working with high dimensions increases the computations, especially when multiple augmentations are involved. We mitigate this problem and recover good performance through lower dimensional embeddings using simple random Fourier feature projections. Our method, FiGURe, achieves an average gain of up to 4.4\%, compared to the state-of-the-art unsupervised models, across all datasets in consideration, both homophilic and heterophilic. Our code can be found at: https://github.com/Microsoft/figure.
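
The dimensionality-reduction step can be illustrated with a generic random Fourier feature projection (in the Rahimi-Recht style); the kernel bandwidth and output dimension here are placeholder assumptions, not the paper's settings.

```python
import numpy as np

def random_fourier_features(X, out_dim, sigma=1.0, seed=0):
    """Project high-dimensional embeddings X (N, D) to `out_dim` dimensions
    with random Fourier features approximating an RBF kernel of bandwidth
    `sigma` (generic sketch, not the paper's exact configuration)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], out_dim))  # random directions
    b = rng.uniform(0.0, 2.0 * np.pi, size=out_dim)                # random phases
    return np.sqrt(2.0 / out_dim) * np.cos(X @ W + b)              # (N, out_dim)
```

The inner products of these projected features approximate the RBF kernel on the original embeddings, which is why much of the downstream performance survives the dimensionality reduction.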

CS-Isolate: Extracting Hard Confident Examples by Content and Style Isolation
Yexiong Lin Yu Yao Xiaolong Shi Mingming Gong Xu Shen Dong Xu Tongliang Liu



Research question: Label noise is pervasive in large-scale image datasets; how can its side effects be mitigated by leveraging semi-supervised learning to select confident examples?
Motivation: Existing methods focus on extracting hard confident examples close to the decision boundary, an ability that significantly influences the generalization of the learned classifier.
Method: This paper finds that a key reason some hard examples lie close to the decision boundary is the entanglement of style factors with content factors. Hard examples become more discriminative when only content factors, such as semantic information, are considered and style factors are ignored. However, given only noisy data, content factors are not directly observed and must be inferred.
Results: To infer content factors for classification when learning with noisy labels, the objective is to keep the content factors of all examples in the same underlying clean class unchanged as their style information changes. Styles are varied with different data augmentation techniques while content factors are regularized based on some confident examples. Training existing methods with the inferred content factors demonstrates the effectiveness of CS-Isolate in learning hard examples on benchmark datasets.

Label noise widely exists in large-scale image datasets. To mitigate the side effects of label noise, state-of-the-art methods focus on selecting confident examples by leveraging semi-supervised learning. Existing research shows that the ability to extract hard confident examples, which are close to the decision boundary, significantly influences the generalization ability of the learned classifier. In this paper, we find that a key reason for some hard examples being close to the decision boundary is due to the entanglement of style factors with content factors. The hard examples become more discriminative when we focus solely on content factors, such as semantic information, while ignoring style factors. Nonetheless, given only noisy data, content factors are not directly observed and have to be inferred. To tackle the problem of inferring content factors for classification when learning with noisy labels, our objective is to ensure that the content factors of all examples in the same underlying clean class remain unchanged as their style information changes. To achieve this, we utilize different data augmentation techniques to alter the styles while regularizing content factors based on some confident examples. By training existing methods with our inferred content factors, we demonstrate the effectiveness of CS-Isolate in learning hard examples on benchmark datasets. The implementation is available at https://github.com/tmllab/2023_NeurIPS_CS-isolate.

RDumb: A simple approach that questions our progress in continual test-time adaptation
Ori Press Steffen Schneider Matthias Kuemmerer Matthias Bethge



Research question: Test-time adaptation (TTA) allows pre-trained models to be updated to changing data distributions at deployment time, but are the existing methods actually effective?
Motivation: Early work only tested against single, fixed distribution shifts, while recent work has proposed and applied continual adaptation methods over long timescales. To examine the field's progress, we propose the Continually Changing Corruptions (CCC) benchmark to measure the asymptotic performance of TTA techniques.
Method: We evaluate all existing TTA methods and find that all but one state-of-the-art method eventually collapse and perform worse than a non-adapting model. We also introduce a simple baseline named "RDumb" that periodically resets the model to its pretrained state.
Results: Our results show that previous TTA approaches are neither effective at adapting while avoiding collapse nor able to outperform the simplistic resetting strategy.

Test-Time Adaptation (TTA) allows pre-trained models to be updated to changing data distributions at deployment time. While early work tested these algorithms for individual fixed distribution shifts, recent work proposed and applied methods for continual adaptation over long timescales. To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure asymptotic performance of TTA techniques. We find that eventually all but one state-of-the-art methods collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, "RDumb", that periodically resets the model to its pretrained state. RDumb performs better than or on par with the previously proposed state-of-the-art on all considered benchmarks. Our results show that previous TTA approaches are neither effective at regularizing adaptation to avoid collapse nor able to outperform a simplistic resetting strategy.
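
The resetting baseline is simple enough to sketch directly; the adaptation rule and the reset interval below are placeholders, since the paper's exact schedule is not reproduced here.

```python
import copy

class PeriodicResetTTA:
    """Minimal sketch of an RDumb-style baseline: adapt with any TTA update
    rule, and every `reset_every` batches restore the pretrained weights.
    `tta_update(model, batch)` is a hypothetical callable standing in for,
    e.g., an entropy-minimization step that returns predictions."""
    def __init__(self, model, tta_update, reset_every=1000):
        self.initial_state = copy.deepcopy(model.state_dict())  # pretrained snapshot
        self.model = model
        self.tta_update = tta_update
        self.reset_every = reset_every
        self.steps = 0

    def __call__(self, batch):
        if self.steps > 0 and self.steps % self.reset_every == 0:
            self.model.load_state_dict(self.initial_state)      # periodic reset
        self.steps += 1
        return self.tta_update(self.model, batch)               # adapt and predict
```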

Toward Understanding Generative Data Augmentation
Chenyu Zheng Guoqiang Wu Chongxuan Li



Research question: This paper aims to theoretically investigate the effect of generative data augmentation on learning tasks, particularly in the non-independent-and-identically-distributed (non-i.i.d.) setting.
Motivation: Although generative data augmentation improves classification performance in various learning tasks, its theoretical effect in the non-i.i.d. setting has not been adequately studied.
Method: This paper establishes a general stability bound to study the effect of generative data augmentation, and further specializes the learning setup to Gaussian mixture models and generative adversarial nets.
Results: The theoretical results show that when the training set is small, generative data augmentation can improve the learning guarantees at a constant level even when it does not accelerate the learning rate, which is significant for preventing overfitting. Both simulation and empirical results support the theoretical conclusions.

Generative data augmentation, which scales datasets by obtaining fake labeled examples from a trained conditional generative model, boosts classification performance in various learning tasks including (semi-)supervised learning, few-shot learning, and adversarially robust learning. However, little work has theoretically investigated the effect of generative data augmentation. To fill this gap, we establish a general stability bound in this not independently and identically distributed (non-i.i.d.) setting, where the learned distribution is dependent on the original train set and generally not the same as the true distribution. Our theoretical result includes the divergence between the learned distribution and the true distribution. It shows that generative data augmentation can enjoy a faster learning rate when the order of the divergence term is $o\left(\max\left(\log(m)\beta_m, 1/\sqrt{m}\right)\right)$, where $m$ is the train set size and $\beta_m$ is the corresponding stability constant. We further specify the learning setup to the Gaussian mixture model and generative adversarial nets. We prove that in both cases, though generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the train set is small, which is significant when severe overfitting occurs. Simulation results on the Gaussian mixture model and empirical results on generative adversarial nets support our theoretical conclusions.

Deep Insights into Noisy Pseudo Labeling on Graph Data
Botao WANG Jia Li Yang Liu Jiashun Cheng Yu Rong Wenjia Wang Fugee Tsung



Research question: This paper aims to provide deep insights into the impact of pseudo labeling (PL) strategies on graph learning models.
Motivation: Although PL has been shown to improve the performance of graph learning models, incorrect labels can be fatal to the graph training process, especially on graph data where noise can propagate; yet this error is rarely analyzed theoretically in the existing literature.
Method: We present the first error analysis of PL strategies by showing that the error is bounded by the confidence of the PL threshold and the consistency of multi-view predictions, and we theoretically illustrate the effect of PL on convergence. Based on this analysis, we propose a cautious pseudo-labeling method that pseudo-labels the samples with the highest confidence and multi-view consistency.
Results: Extensive experiments demonstrate that the proposed strategy improves the graph learning process and outperforms other PL strategies on link prediction and node classification tasks.

Pseudo labeling (PL) is a widely applied strategy to enlarge the labeled dataset by self-annotating the potential samples during the training process. Several works have shown that it can improve the graph learning model performance in general. However, we notice that the incorrect labels can be fatal to the graph training process. Inappropriate PL may result in performance degradation, especially on graph data where the noise can propagate. Surprisingly, the corresponding error is seldom theoretically analyzed in the literature. In this paper, we aim to give deep insights into PL on graph learning models. We first present the error analysis of the PL strategy by showing that the error is bounded by the confidence of the PL threshold and the consistency of multi-view prediction. Then, we theoretically illustrate the effect of PL on convergence property. Based on the analysis, we propose a cautious pseudo labeling methodology in which we pseudo label the samples with the highest confidence and multi-view consistency. Finally, extensive experiments demonstrate that the proposed strategy improves the graph learning process and outperforms other PL strategies on link prediction and node classification tasks.
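
A minimal sketch of the cautious selection rule, assuming two augmented views of the node logits; the selection fraction is an illustrative assumption.

```python
import torch

def cautious_pseudo_labels(logits_v1, logits_v2, top_frac=0.1):
    """Keep only nodes whose two views agree, then take the most confident
    `top_frac` of those (sketch of the high-confidence, multi-view-consistent
    selection; thresholds are assumptions)."""
    p1, p2 = logits_v1.softmax(dim=1), logits_v2.softmax(dim=1)
    pred1, pred2 = p1.argmax(dim=1), p2.argmax(dim=1)
    consistent = pred1 == pred2                           # multi-view consistency
    conf = torch.minimum(p1.max(dim=1).values, p2.max(dim=1).values)
    conf = torch.where(consistent, conf, torch.zeros_like(conf))
    k = max(1, int(top_frac * conf.numel()))
    selected = conf.topk(k).indices                       # highest-confidence nodes
    return pred1[selected], selected
```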

SLaM: Student-Label Mixing for Distillation with Unlabeled Examples
Vasilis Kontonis Fotis Iliopoulos Khoa Trinh Cenk Baykal Gaurav Menghani Erik Vee



Research question: How to exploit large amounts of unlabeled data for knowledge distillation to produce compact, lightweight student models.
Motivation: When abundant unlabeled data is available but labeled data is scarce, the teacher's pseudo-labels are often noisy, which impairs student performance.
Method: Propose a principled method named Student-Label Mixing (SLaM) that improves student performance by improving the quality of the pseudo-labels.
Results: Experiments show that SLaM outperforms existing methods on several standard benchmarks and comes with theoretical guarantees.

Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates "soft" pseudo-labels for the unlabeled dataset which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called "forward loss-adjustment" methods.
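
The label-mixing idea admits a compact sketch; the mixing coefficient and its schedule below are assumptions, whereas the paper derives a principled choice.

```python
import torch.nn.functional as F

def student_label_mixing_loss(student_logits, teacher_probs, mix=0.7):
    """Train against a convex combination of the teacher's soft pseudo-label
    and the student's own detached prediction (minimal sketch of the
    student-label-mixing idea; `mix` is an illustrative assumption)."""
    student_probs = F.softmax(student_logits, dim=1).detach()
    target = mix * teacher_probs + (1.0 - mix) * student_probs
    return F.cross_entropy(student_logits, target)  # cross-entropy with soft targets
```

Intuitively, blending in the student's own prediction damps the influence of noisy teacher labels on examples where the teacher is unreliable.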

D4Explainer: In-distribution Explanations of Graph Neural Network via Discrete Denoising Diffusion
Jialin Chen Shirley Wu Abhijit Gupta Zhitao Ying



Research question: This paper addresses the explainability of graph neural networks (GNNs), particularly for model auditing and trustworthy graph learning.
Motivation: Because GNNs are sensitive to out-of-distribution data, the in-distribution property must be taken into account for generated explanations to be reliable. However, existing explanation methods often constrain explanations to the structure of the original graph, overlooking the importance of the in-distribution property and yielding explanations that lack reliability.
Method: To this end, we propose D4Explainer, a novel GNN explanation method that provides in-distribution explanations for both counterfactual and model-level scenarios. It incorporates generative graph distribution learning into the optimization objective to achieve two goals: 1) generate diverse counterfactual graphs that conform to the in-distribution property for a given instance, and 2) identify the graph patterns that contribute most to a specific class prediction, serving as model-level explanations.
Results: Empirical evaluations on synthetic and real-world datasets show that D4Explainer achieves state-of-the-art performance in explanation accuracy, faithfulness, diversity, and robustness.

The widespread deployment of Graph Neural Networks (GNNs) sparks significant interest in their explainability, which plays a vital role in model auditing and ensuring trustworthy graph learning. The objective of GNN explainability is to discern the underlying graph structures that have the most significant impact on model predictions. Ensuring that explanations generated are reliable necessitates consideration of the in-distribution property, particularly due to the vulnerability of GNNs to out-of-distribution data. Unfortunately, prevailing explainability methods tend to constrain the generated explanations to the structure of the original graph, thereby downplaying the significance of the in-distribution property and resulting in explanations that lack reliability. To address these challenges, we propose D4Explainer, a novel approach that provides in-distribution GNN explanations for both counterfactual and model-level explanation scenarios. The proposed D4Explainer incorporates generative graph distribution learning into the optimization objective, which accomplishes two goals: 1) generate a collection of diverse counterfactual graphs that conform to the in-distribution property for a given instance, and 2) identify the most discriminative graph patterns that contribute to a specific class prediction, thus serving as model-level explanations. It is worth mentioning that D4Explainer is the first unified framework that combines both counterfactual and model-level explanations. Empirical evaluations conducted on synthetic and real-world datasets provide compelling evidence of the state-of-the-art performance achieved by D4Explainer in terms of explanation accuracy, faithfulness, diversity, and robustness.

Knowledge Diffusion for Distillation
Tao Huang Yuan Zhang Mingkai Zheng Shan You Fei Wang Chen Qian Chang Xu



Research question: The representation gap between teacher and student is an emerging topic in knowledge distillation.
Motivation: Current methods typically reduce the gap and improve performance through complicated training schemes, loss functions, and feature alignments, all of which are task-specific and feature-specific.
Method: This paper proposes DiffKD, a novel knowledge distillation method that explicitly denoises and matches features using diffusion models. It is based on the observation that student features typically contain more noise than teacher features because of the student model's smaller capacity; to address this, student features are denoised with a diffusion model trained on teacher features.
Results: Extensive experiments show that DiffKD is effective across various types of features and consistently achieves state-of-the-art performance on image classification, object detection, and semantic segmentation tasks.

The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noises than teacher features due to the smaller capacity of student model. To address this, we propose to denoise student features using a diffusion model trained by teacher features. This allows us to perform better distillation between the refined clean feature and teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code is available at https://github.com/hunto/DiffKD.

Towards a Unified Analysis of Kernel-based Methods Under Covariate Shift
Xingdong Feng Xin HE Caixing Wang Chao Wang Jingnan Zhang



Research question: This study addresses the covariate shift problem that arises widely in practice, where the input distributions of the source and target data differ substantially.
Motivation: Despite the practical importance of covariate shift across various learning problems, most existing methods focus only on specific learning tasks and lack thorough theoretical and numerical validation.
Method: We propose a unified analysis of general nonparametric methods in a reproducing kernel Hilbert space (RKHS) under covariate shift.
Results: Our theoretical results hold for general losses belonging to a rich loss-function family, which includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification. Focusing on two types of covariate shift problems, we establish sharp convergence rates for general loss functions, providing a unified theoretical analysis that concurs with the optimal results in the literature for the squared loss. Extensive numerical studies confirm our theoretical findings and further illustrate the effectiveness of the proposed method.

Covariate shift occurs prevalently in practice, where the input distributions of the source and target data are substantially different. Despite its practical importance in various learning problems, most of the existing methods only focus on some specific learning tasks and are not well validated theoretically and numerically. To tackle this problem, we propose a unified analysis of general nonparametric methods in a reproducing kernel Hilbert space (RKHS) under covariate shift. Our theoretical results are established for a general loss belonging to a rich loss function family, which includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification. Two types of covariate shift problems are the focus of this paper and the sharp convergence rates are established for a general loss function to provide a unified theoretical analysis, which concurs with the optimal results in literature where the squared loss is used. Extensive numerical studies on synthetic and real examples confirm our theoretical findings and further illustrate the effectiveness of our proposed method.

TRIAGE: Characterizing and auditing training data for improved regression
Nabeel Seedat Jonathan Crabbé Zhaozhi Qian Mihaela van der Schaar



Research question: Current data characterization methods focus mainly on classification settings, leaving regression settings relatively understudied.
Motivation: To address this gap, we propose TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
Method: TRIAGE uses conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score. We use the score to analyze individual samples' training dynamics and categorize samples as under-, over-, or well-estimated by the model.
Results: Experiments show that TRIAGE's characterization is consistent and practically useful, improving performance via data sculpting/filtering in multiple regression settings. Beyond the sample level, TRIAGE also enables new approaches to dataset selection and feature acquisition. Overall, TRIAGE highlights the value of data characterization in real-world regression applications.

Data quality is crucial for robust machine learning algorithms, with the recent interest in data-centric AI emphasizing the importance of training data characterization. However, current data characterization methods are largely focused on classification settings, with regression settings largely understudied. To address this, we introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score. We operationalize the score to analyze individual samples' training dynamics and characterize samples as under-, over-, or well-estimated by the model. We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings. Additionally, beyond sample level, we show TRIAGE enables new approaches to dataset selection and feature acquisition. Overall, TRIAGE highlights the value unlocked by data characterization in real-world regression applications.

Adversarial Self-Training Improves Robustness and Generalization for Gradual Domain Adaptation
Lianghe Shi Weiwei Liu



Research question: Although Gradual Domain Adaptation (GDA) has been studied theoretically and empirically in many contexts, its adversarial robustness remains unexplored.
Motivation: The adversarial robustness of GDA models is vital in security-critical scenarios.
Method: Adopt the effective gradual self-training method and replace vanilla self-training with adversarial self-training (AST), which first predicts labels on the unlabeled data and then adversarially trains the model on the pseudo-labeled distribution.
Results: We find that gradual AST improves not only adversarial accuracy but also clean accuracy on the target domain, because adversarial training (AT) performs better than standard training when the pseudo-labels contain a portion of incorrect labels. We further present generalization error bounds for gradual AST in the multiclass setting and use the optimal value of the Subset Sum Problem to bridge the standard error on the true distribution and the adversarial error on the pseudo-labeled distribution. The results indicate that AT may obtain a tighter bound than standard training on data with incorrect pseudo-labels.

Gradual Domain Adaptation (GDA), in which the learner is provided with additional intermediate domains, has been theoretically and empirically studied in many contexts. Despite its vital role in security-critical scenarios, the adversarial robustness of the GDA model remains unexplored. In this paper, we adopt the effective gradual self-training method and replace vanilla self-training with adversarial self-training (AST). AST first predicts labels on the unlabeled data and then adversarially trains the model on the pseudo-labeled distribution. Intriguingly, we find that gradual AST improves not only adversarial accuracy but also clean accuracy on the target domain. We reveal that this is because adversarial training (AT) performs better than standard training when the pseudo-labels contain a portion of incorrect labels. Accordingly, we first present the generalization error bounds for gradual AST in a multiclass classification setting. We then use the optimal value of the Subset Sum Problem to bridge the standard error on a real distribution and the adversarial error on a pseudo-labeled distribution. The result indicates that AT may obtain a tighter bound than standard training on data with incorrect pseudo-labels. We further present an example of a conditional Gaussian distribution to provide more insights into why gradual AST can improve the clean accuracy for GDA.

Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Yuxin Guo Shijie Ma Hu Su Zhiqing Wang Yuhao Zhao Wei Zou Siyang Sun Yun Zheng



Research question: This paper addresses Audio-Visual Source Localization (AVSL), i.e., locating the sounding objects within video frames given paired audio clips.
Motivation: Existing methods rely mainly on self-supervised contrastive learning of audio-visual correspondence; without bounding-box annotations, they struggle with precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, naive semi-supervised methods fail to exploit the abundance of unlabeled audio-visual pairs effectively.
Method: This paper proposes a novel semi-supervised learning framework named Dual Mean-Teacher (DMT), comprising two teacher-student structures to avoid the confirmation bias problem. Specifically, two teachers pre-trained on limited labeled data filter out noisy samples via the consensus between their predictions and then generate high-quality pseudo-labels. This unbiased framework's optimal use of both labeled and unlabeled data enables DMT to outperform current state-of-the-art methods by a large margin.
Results: Experimental results show that DMT achieves CIoU of 90.4% on Flickr-SoundNet and 48.8% on VGG-Sound Source, improvements of 8.9% and 9.6% over existing methods, respectively. Our code has been publicly released on GitHub.

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, the naive semi-supervised method is poor in effectively utilizing the abundance of unlabeled audio-visual pairs. In this paper, we propose a novel Semi-Supervised Learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality pseudo-labels by intersecting their confidence maps. The optimal utilization of both labeled and unlabeled data combined with this unbiased framework enables DMT to outperform current state-of-the-art methods by a large margin, with CIoU of $\textbf{90.4\%}$ and $\textbf{48.8\%}$ on Flickr-SoundNet and VGG-Sound Source, obtaining $\textbf{8.9\%}$ and $\textbf{9.6\%}$ improvements respectively, given only $3\%$ of the data positionally annotated. We also extend our framework to some existing AVSL methods and consistently boost their performance. Our code is publicly available at https://github.com/gyx-gloria/DMT.
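
The consensus-filtering and map-intersection steps can be sketched as follows for dense confidence maps; both thresholds below are illustrative assumptions, not values from the paper.

```python
import torch

def two_teacher_pseudo_labels(conf_map_a, conf_map_b, tau=0.6, agree_iou=0.5):
    """Keep a sample only when the two teachers' thresholded confidence maps
    sufficiently agree, and use the intersection of the maps as the pseudo-
    label (sketch of the filtering/intersection idea)."""
    m_a, m_b = conf_map_a > tau, conf_map_b > tau            # (B, H, W) binary maps
    inter = (m_a & m_b).flatten(1).sum(dim=1).float()
    union = (m_a | m_b).flatten(1).sum(dim=1).float().clamp(min=1.0)
    keep = inter / union > agree_iou                         # consensus filter
    pseudo = (m_a & m_b).float()                             # intersected label map
    return pseudo[keep], keep
```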

V-InFoR: A Robust Graph Neural Networks Explainer for Structurally Corrupted Graphs
Senzhang Wang Jun Yin Chaozhuo Li Xing Xie Jianxin Wang



Research question: Existing graph neural network explainers are not robust to structurally corrupted graphs, e.g., graphs with noisy or adversarial edges.
Motivation: Existing GNN explainers mostly build explanations on either raw graph features or learned latent representations, both of which are easily corrupted. Moreover, graph corruptions are irregular with respect to structural properties such as graph size or connectivity, which makes the rigorous constraints used by previous GNN explainers unfeasible.
Method: We propose a robust GNN explainer called V-InfoR. Specifically, a robust graph representation extractor based on variational inference infers the latent distribution of graph representations. Instead of directly using the corrupted raw features or representations of each individual graph, we sample graph representations from the inferred distribution for the downstream explanation generator, which effectively eliminates minor corruptions. We formulate explanation exploration as a graph information bottleneck (GIB) optimization problem; as a more general method, our GIB formulation requires no rigorous structural constraints and can adaptively capture both the regularity and irregularity of severely corrupted graphs for explanation.
Results: Extensive evaluations on synthetic and real-world datasets show that V-InfoR significantly improves GNN explanation performance for structurally corrupted graphs. Code and datasets are available at https://anonymous.4open.science/r/V-InfoR-EF88.

GNN explanation methods aim to identify an explanatory subgraph which contains the most informative components of the full graph. However, a major limitation of existing GNN explainers is that they are not robust to structurally corrupted graphs, e.g., graphs with noisy or adversarial edges. On the one hand, existing GNN explainers mostly explore explanations based on either the raw graph features or the learned latent representations, both of which can be easily corrupted. On the other hand, the corruptions in graphs are irregular in terms of the structural properties, e.g., the size or connectivity of graphs, which makes the rigorous constraints used by previous GNN explainers unfeasible. To address these issues, we propose a robust GNN explainer called V-InfoR. Specifically, a robust graph representation extractor, which takes insights from variational inference, is proposed to infer the latent distribution of graph representations. Instead of directly using the corrupted raw features or representations of each single graph, we sample the graph representations from the inferred distribution for the downstream explanation generator, which can effectively eliminate the minor corruption. We next formulate the explanation exploration as a graph information bottleneck (GIB) optimization problem. As a more general method that does not need any rigorous structural constraints, our GIB-based method can adaptively capture both the regularity and irregularity of the severely corrupted graphs for explanation. Extensive evaluations on both synthetic and real-world datasets indicate that V-InfoR significantly improves the GNN explanation performance for the structurally corrupted graphs. Code and dataset are available at https://anonymous.4open.science/r/V-InfoR-EF88.

DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
Bowen Gao Bo Qiang Haichuan Tan Yinjun Jia Minsi Ren Minsi Lu Jingjing Liu Wei-Ying Ma Yanyan Lan



Research question: How to effectively identify, from vast compound databases, potential drugs that can bind a particular protein pocket, so as to aid drug discovery.
Motivation: Traditional docking methods are time-consuming in real-life applications and can only search restricted libraries, while recent supervised learning methods have not yet surpassed docking because they depend on limited data with reliable binding-affinity labels.
Method: This paper proposes DrugCLIP, a novel contrastive learning framework that reformulates virtual screening as a dense retrieval task and uses contrastive learning to align representations of protein pockets and molecules from a large quantity of pairwise data without explicit binding-affinity scores. A biological-knowledge-inspired data augmentation strategy is also introduced to learn better protein-molecule representations.
Results: Experiments show that DrugCLIP significantly outperforms traditional docking and supervised learning methods on diverse virtual screening benchmarks with greatly reduced computation time, especially in the zero-shot setting.

Virtual screening, which identifies potential drugs from vast compound databases to bind with a particular protein pocket, is a critical step in AI-assisted drug discovery. Traditional docking methods are highly time-consuming, and can only work with a restricted search library in real-life applications. Recent supervised learning approaches using scoring functions for binding-affinity prediction, although promising, have not yet surpassed docking methods due to their strong dependency on limited data with reliable binding-affinity labels. In this paper, we propose a novel contrastive learning framework, DrugCLIP, by reformulating virtual screening as a dense retrieval task and employing contrastive learning to align representations of binding protein pockets and molecules from a large quantity of pairwise data without explicit binding-affinity scores. We also introduce a biological-knowledge inspired data augmentation strategy to learn better protein-molecule representations. Extensive experiments show that DrugCLIP significantly outperforms traditional docking and supervised learning methods on diverse virtual screening benchmarks with highly reduced computation time, especially in zero-shot setting.

Learning Sample Difficulty from Pre-trained Models for Reliable Prediction
Peng Cui Dan Zhang Zhijie Deng Yinpeng Dong Jun Zhu



Research question: How to leverage large-scale pre-trained models to improve the prediction reliability of downstream models and address the overconfident predictions of neural networks.
Motivation: Modern neural networks suffer from overconfident predictions, a problem that large-scale pre-trained models can effectively address.
Method: Measure each training sample's difficulty via feature-space Gaussian modeling and relative Mahalanobis distance computation with a large-scale pre-trained model, and then guide downstream model training with sample-difficulty-aware entropy regularization.
Results: The method significantly improves both accuracy and uncertainty calibration on several challenging benchmarks (e.g., ImageNet1k), consistently surpassing competitive baselines for reliable prediction.

Large-scale pre-trained models have achieved remarkable success in many applications, but how to leverage them to improve the prediction reliability of downstream models remains under-explored. Moreover, modern neural networks have been found to be poorly calibrated and make overconfident predictions regardless of inherent sample difficulty and data uncertainty. To address this issue, we propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization. Pre-trained models that have been exposed to large-scale datasets and do not overfit the downstream training classes enable us to measure each training sample’s difficulty via feature-space Gaussian modeling and relative Mahalanobis distance computation. Importantly, by adaptively penalizing overconfident prediction based on the sample difficulty, we simultaneously improve accuracy and uncertainty calibration across challenging benchmarks (e.g., +0.55% ACC and −3.7% ECE on ImageNet1k using ResNet34), consistently surpassing competitive baselines for reliable prediction. The improved uncertainty estimate further improves selective classification (abstaining from erroneous predictions) and out-of-distribution detection.
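
A sketch of the difficulty score, assuming frozen pre-trained features and a shared-covariance Gaussian fit; this is an assumed variant of the relative Mahalanobis distance, not the paper's exact recipe.

```python
import numpy as np

def relative_mahalanobis_difficulty(feats, labels, query):
    """Fit class-conditional Gaussians (shared covariance) and one
    class-agnostic 'background' Gaussian on pre-trained features, then score
    a query by the difference of Mahalanobis distances (sketch)."""
    classes = np.unique(labels)
    mus = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centered = feats - mus[np.searchsorted(classes, labels)]
    eye = 1e-6 * np.eye(feats.shape[1])                     # numerical jitter
    prec = np.linalg.inv(np.cov(centered, rowvar=False) + eye)

    mu_bg = feats.mean(axis=0)
    prec_bg = np.linalg.inv(np.cov(feats - mu_bg, rowvar=False) + eye)

    d_class = min(float((query - m) @ prec @ (query - m)) for m in mus)
    d_bg = float((query - mu_bg) @ prec_bg @ (query - mu_bg))
    return d_class - d_bg   # larger values suggest a harder, more atypical sample
```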

Revisiting Logistic-softmax Likelihood in Bayesian Meta-Learning for Few-Shot Classification
Tianjun Ke Haoqun Cao Zenan Ling Feng Zhou



Research question: Meta-learning has shown promising results in few-shot classification, but better exploiting prior knowledge to solve new problems remains a challenge.
Motivation: Bayesian methods can effectively represent uncertainty in few-shot classification, which is crucial in high-risk fields. However, the theoretical properties of the logistic-softmax likelihood are unclear, and its inherent uncertainty leads to suboptimal performance.
Method: We redesign the logistic-softmax likelihood so that the a priori confidence level is controlled by a temperature parameter. We further show that softmax can be viewed as a special case of logistic-softmax, and that logistic-softmax induces a larger family of data distributions than softmax.
Results: Our approach yields well-calibrated uncertainty estimates and achieves comparable or superior results on standard benchmark datasets.

Meta-learning has demonstrated promising results in few-shot classification (FSC) by learning to solve new problems using prior knowledge. Bayesian methods are effective at characterizing uncertainty in FSC, which is crucial in high-risk fields. In this context, the logistic-softmax likelihood is often employed as an alternative to the softmax likelihood in multi-class Gaussian process classification due to its conditional conjugacy property. However, the theoretical property of logistic-softmax is not clear and previous research indicated that the inherent uncertainty of logistic-softmax leads to suboptimal performance. To mitigate these issues, we revisit and redesign the logistic-softmax likelihood, which enables control of the \textit{a priori} confidence level through a temperature parameter. Furthermore, we theoretically and empirically show that softmax can be viewed as a special case of logistic-softmax and logistic-softmax induces a larger family of data distribution than softmax. Utilizing modified logistic-softmax, we integrate the data augmentation technique into the deep kernel based Gaussian process meta-learning framework, and derive an analytical mean-field approximation for task-specific updates. Our approach yields well-calibrated uncertainty estimates and achieves comparable or superior results on standard benchmark datasets. Code is publicly available at \url{https://github.com/keanson/revisit-logistic-softmax}.
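
For reference, the two likelihoods side by side; the placement of the temperature $\tau$ below is one plausible parameterization and may differ in detail from the paper's.

```latex
% Softmax vs. logistic-softmax likelihoods for logits f = (f_1, ..., f_K).
% The temperature placement is one plausible parameterization (assumption).
\mathrm{softmax}: \quad p(y = k \mid f) = \frac{\exp(f_k)}{\sum_{j=1}^{K} \exp(f_j)},
\qquad
\text{logistic-softmax}: \quad p(y = k \mid f) = \frac{\sigma(f_k / \tau)}{\sum_{j=1}^{K} \sigma(f_j / \tau)},
\quad \sigma(x) = \frac{1}{1 + e^{-x}}.
```

Since $\sigma(x) \approx e^{x}$ for $x \ll 0$, the logistic-softmax ratio approaches a softmax-like form in that regime, which gives some intuition for how softmax can arise as a special case.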

Joint Training of Deep Ensembles Fails Due to Learner Collusion
Alan Jeffares Tennison Liu Jonathan Crabbé Mihaela van der Schaar



Research question: Although ensembles of deep learning models have proven to be a powerful way of improving on single-model performance, most prior work trains individual models independently and ensembles them post hoc.
Motivation: The authors observe that directly minimizing the ensemble loss is rarely applied in practice because joint optimization leads to degenerate behavior.
Method: The authors decompose the ensemble objective into the strength of the base learners and the diversity between them, and comprehensively demonstrate the practical implications of this effect on a range of standard machine learning tasks and architectures by smoothly interpolating between independent training and joint optimization.
Results: Experimental results show that joint optimization causes base learners to collude and artificially inflate their apparent diversity; this pseudo-diversity fails to generalize beyond the training data, resulting in a larger generalization gap.

Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research trains individual models independently with ensembling performed _post hoc_. In this work, we show that this is for good reason - _joint optimization of ensemble loss results in degenerate behavior_. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to comprehensively demonstrate the practical implications of this effect on a range of standard machine learning tasks and architectures by smoothly interpolating between independent training and joint optimization.

Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery
Sarah Rastegar Hazel Doughty Cees G. M. Snoek



Research question: How to define and discover new categories at test time?
Motivation: Traditional supervised recognition models are restricted by a predefined category set and cannot effectively discover new categories.
Method: Conceptualize a category through the lens of optimization and propose a novel, efficient, and self-supervised method for discovering previously unknown categories at test time.
Results: The method handles fine-grained categories effectively and performs strongly in experimental evaluations.

In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. Our code is available at: https://github.com/SarahRastegar/InfoSieve.

Learning Invariant Representations of Graph Neural Networks via Cluster Generalization
Donglin Xia Xiao Wang Nian Liu Chuan Shi



Research question: How to keep graph neural networks (GNNs) performing well when the test graph structure differs from the training graph structure, i.e., under structure shift.
Motivation: The performance of existing GNNs drops significantly under structure shift, suggesting that the learned models may be biased towards specific structure patterns.
Method: Propose the Cluster Information Transfer (CIT) mechanism, which combines different cluster information with the nodes while preserving their cluster-independent information, learning invariant representations that improve GNNs' generalization to various unknown test graphs.
Results: Experimental results show that the CIT mechanism effectively improves GNN performance and can easily be integrated into existing GNNs.

Graph neural networks (GNNs) have become increasingly popular in modeling graph-structured data due to their ability to learn node representations by aggregating local structure information. However, it is widely acknowledged that the test graph structure may differ from the training graph structure, resulting in a structure shift. In this paper, we experimentally find that the performance of GNNs drops significantly when the structure shift happens, suggesting that the learned models may be biased towards specific structure patterns. To address this challenge, we propose the Cluster Information Transfer (\textbf{CIT}) mechanism, which can learn invariant representations for GNNs, thereby improving their generalization ability to various and unknown test graphs with structure shift. The CIT mechanism achieves this by combining different cluster information with the nodes while preserving their cluster-independent information. By generating nodes across different clusters, the mechanism significantly enhances the diversity of the nodes and helps GNNs learn the invariant representations. We provide a theoretical analysis of the CIT mechanism, showing that the impact of changing clusters during structure shift can be mitigated after transfer. Additionally, the proposed mechanism is a plug-in that can be easily used to improve existing GNNs. We comprehensively evaluate our proposed method on three typical structure shift scenarios, demonstrating its effectiveness in enhancing GNNs' performance.

SoTTA: Robust Test-Time Adaptation on Noisy Data Streams
Taesik Gong Yewon Kim Taeckyung Lee Sorn Chottananurak Sung-Ju Lee



Research question: Test-time adaptation (TTA) aims to handle distribution shifts between training and test data using unlabeled test streams, but most TTA methods assume benign test streams, whereas real-world test samples can be highly diverse.
Motivation: For example, unseen objects or noise can appear in autonomous driving, which threatens existing TTA algorithms: because they blindly adapt to incoming samples, they are affected by such noisy samples.
Method: We propose Screening-out Test-Time Adaptation (SoTTA), a new TTA algorithm that is robust to noisy samples. SoTTA's key enablers are two-fold: (i) input-wise robustness via high-confidence uniform-class sampling, which effectively filters out the influence of noisy samples, and (ii) parameter-wise robustness via entropy-sharpness minimization, which makes the model parameters robust to large gradients from noisy samples.
Results: Evaluations on standard TTA benchmarks with various noisy scenarios show that our method outperforms state-of-the-art TTA methods in the presence of noisy samples and achieves accuracy comparable to those methods when no noisy samples are present.

Test-time adaptation (TTA) aims to address distributional shifts between training and testing data using only unlabeled test data streams for continual model adaptation. However, most TTA methods assume benign test streams, while test samples could be unexpectedly diverse in the wild. For instance, an unseen object or noise could appear in autonomous driving. This leads to a new threat to existing TTA algorithms; we found that prior TTA algorithms suffer from those noisy test samples as they blindly adapt to incoming samples. To address this problem, we present Screening-out Test-Time Adaptation (SoTTA), a novel TTA algorithm that is robust to noisy samples. The key enabler of SoTTA is two-fold: (i) input-wise robustness via high-confidence uniform-class sampling that effectively filters out the impact of noisy samples and (ii) parameter-wise robustness via entropy-sharpness minimization that improves the robustness of model parameters against large gradients from noisy samples. Our evaluation with standard TTA benchmarks with various noisy scenarios shows that our method outperforms state-of-the-art TTA methods under the presence of noisy samples and achieves comparable accuracy to those methods without noisy samples. The source code is available at https://github.com/taeckyung/SoTTA.

3D Indoor Instance Segmentation in an Open-World
Mohamed El Amine Boudjoghra Salwa K. Al Khatib Jean Lahoud Hisham Cholakkal Rao Muhammad Anwer Salman Khan Fahad Khan



Research question: Existing 3D instance segmentation methods typically assume that all semantic categories to be segmented are available during training and segment only seen categories.
Motivation: This closed-world assumption is restrictive. We explore, for the first time, 3D indoor instance segmentation in an open-world setting, allowing the model to distinguish a set of known classes, identify unknown objects as unknown, and incrementally learn the semantic categories of unknowns once the corresponding category labels become available.
Method: We introduce an open-world 3D indoor instance segmentation method in which an auto-labeling scheme produces pseudo-labels during training and induces separation between known and unknown category labels. We further improve pseudo-label quality at inference by adjusting the unknown-class probability based on the objectness score distribution. We also introduce carefully curated open-world splits that leverage realistic scenarios based on the inherent object distribution, region-based indoor scene exploration, and the randomness of open-world classes.
Results: Extensive experiments demonstrate that the proposed method is effective, achieving promising open-world 3D instance segmentation performance. Code and splits are available at: https://github.com/aminebdj/3D-OWIS.

Existing 3D instance segmentation methods typically assume that all semantic classes to be segmented would be available during training and only seen categories are segmented at inference. We argue that such a closed-world assumption is restrictive and explore for the first time 3D indoor instance segmentation in an open-world setting, where the model is allowed to distinguish a set of known classes as well as identify an unknown object as unknown, and then incrementally learn the semantic category of the unknown when the corresponding category labels are available. To this end, we introduce an open-world 3D indoor instance segmentation method, where an auto-labeling scheme is employed to produce pseudo-labels during training and induce separation between known and unknown category labels. We further improve the pseudo-labels quality at inference by adjusting the unknown class probability based on the objectness score distribution. We also introduce carefully curated open-world splits leveraging realistic scenarios based on inherent object distribution, region-based indoor scene exploration and the randomness aspect of open-world classes. Extensive experiments reveal the efficacy of the proposed contributions leading to promising open-world 3D instance segmentation performance. Code and splits are available at: https://github.com/aminebdj/3D-OWIS.

NEO-KD: Knowledge-Distillation-Based Adversarial Training for Robust Multi-Exit Neural Networks
Seokil Ham Jungwuk Park Dong-Jun Han Jaekyun Moon



Research question: Multi-exit neural networks are promising for efficient inference, but defending against adversarial attacks remains a challenge.
Motivation: In multi-exit networks, due to the high dependency among different submodels, an adversarial example targeting a specific exit degrades not only that exit's performance but also the performance of all other exits simultaneously, making multi-exit networks vulnerable to simple adversarial attacks.
Method: This paper proposes NEO-KD, a knowledge-distillation-based adversarial training strategy. It uses neighbor knowledge distillation to guide the outputs of adversarial examples towards the ensemble outputs of neighbor exits on clean data, and exit-wise orthogonal knowledge distillation to reduce adversarial transferability across different submodels.
Results: Experimental results show that, compared with baselines relying on existing adversarial training or knowledge distillation techniques for multi-exit networks, the method achieves the best adversarial accuracy with reduced computation budgets on various datasets/models.

While multi-exit neural networks are regarded as a promising solution for making efficient inference via early exits, combating adversarial attacks remains a challenging problem. In multi-exit networks, due to the high dependency among different submodels, an adversarial example targeting a specific exit not only degrades the performance of the target exit but also reduces the performance of all other exits concurrently. This makes multi-exit networks highly vulnerable to simple adversarial attacks. In this paper, we propose NEO-KD, a knowledge-distillation-based adversarial training strategy that tackles this fundamental challenge based on two key contributions. NEO-KD first resorts to neighbor knowledge distillation to guide the output of the adversarial examples to tend to the ensemble outputs of neighbor exits of clean data. NEO-KD also employs exit-wise orthogonal knowledge distillation for reducing adversarial transferability across different submodels. The result is a significantly improved robustness against adversarial attacks. Experimental results on various datasets/models show that our method achieves the best adversarial accuracy with reduced computation budgets, compared to the baselines relying on existing adversarial training or knowledge distillation techniques for multi-exit networks.

Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
Qi Qian Yuanhong Xu Juhua Hu



Research question: How to learn vision proxies directly with the help of text proxies to achieve zero-shot transfer.
Motivation: Vision-language pre-training methods such as CLIP perform impressively on visual classification tasks, but the modality gap between the text and vision spaces can lead to suboptimal performance.
Method: Propose a new strategy that learns the vision proxy directly using the text proxy, and further refines the pseudo-labels obtained from the text proxy to facilitate intra-modal proxy learning (InMaP) in the vision modality.
Results: Experiments confirm the effectiveness and efficiency of the method on multiple downstream tasks. Concretely, InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy of ViT-L/14@336 on ImageNet from 77.02% to 80.21%.

Vision-language pre-training methods, e.g., CLIP, demonstrate an impressive zero-shot performance on visual categorizations with the class proxy from the text embedding of the class name. However, the modality gap between the text and vision space can result in a sub-optimal performance. We theoretically show that the gap cannot be reduced sufficiently by minimizing the contrastive loss in CLIP and the optimal proxy for vision tasks may reside only in the vision space. Therefore, given unlabeled target vision data, we propose to learn the vision proxy directly with the help from the text proxy for zero-shot transfer. Moreover, according to our theoretical analysis, strategies are developed to further refine the pseudo label obtained by the text proxy to facilitate the intra-modal proxy learning (InMaP) for vision. Experiments on extensive downstream tasks confirm the effectiveness and efficiency of our proposal. Concretely, InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy from $77.02\%$ to $80.21\%$ on ImageNet with ViT-L/14@336 pre-trained by CLIP.
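
A minimal sketch of the proxy-learning loop, assuming frozen, pre-extracted CLIP features; the pseudo-label refinement described in the abstract is omitted, and all hyper-parameters (steps, learning rate, temperature) are assumptions.

```python
import torch
import torch.nn.functional as F

def intra_modal_proxy_learning(img_feats, text_proxy, steps=100, lr=0.01, tau=0.07):
    """Pseudo-label unlabeled image features with the text proxy, then fit
    class proxies inside the vision space against those pseudo-labels
    (sketch of the intra-modal proxy-learning idea)."""
    img = F.normalize(img_feats, dim=1)
    with torch.no_grad():
        pseudo = (img @ F.normalize(text_proxy, dim=1).t()).argmax(dim=1)
    proxy = text_proxy.clone().requires_grad_(True)      # init from text proxy
    opt = torch.optim.SGD([proxy], lr=lr)
    for _ in range(steps):
        logits = img @ F.normalize(proxy, dim=1).t() / tau  # tau: CLIP-like temperature
        loss = F.cross_entropy(logits, pseudo)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return proxy.detach()    # a proxy that now lives in the vision space
```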

Effective Targeted Attacks for Adversarial Self-Supervised Learning
Minseon Kim Hyeonjeong Ha Sooel Son Sung Ju Hwang



Research question: How to achieve model robustness via unsupervised adversarial training (AT), particularly in the absence of label information.
Motivation: Current unsupervised adversarial training is mostly implemented within self-supervised learning frameworks, but merely maximizing the self-supervised training loss to generate untargeted adversarial examples often fails to improve model robustness effectively.
Method: Propose a novel positive mining method for targeted adversarial attacks that generates effective adversaries for adversarial self-supervised frameworks. Specifically, the most confusing yet similar target example is selected for a given instance based on entropy and similarity, and the given instance is then perturbed towards the selected target.
Results: The method shows significant robustness gains on non-contrastive self-supervised frameworks, and smaller but consistent robustness improvements on contrastive self-supervised frameworks.

Recently, unsupervised adversarial training (AT) has been highlighted as a means of achieving robustness in models without any label information. Previous studies in unsupervised AT have mostly focused on implementing self-supervised learning (SSL) frameworks, which maximize the instance-wise classification loss to generate adversarial examples. However, we observe that simply maximizing the self-supervised training loss with an untargeted adversarial attack often results in generating ineffective adversaries that may not help improve the robustness of the trained model, especially for non-contrastive SSL frameworks without negative examples. To tackle this problem, we propose a novel positive mining for targeted adversarial attack to generate effective adversaries for adversarial SSL frameworks. Specifically, we introduce an algorithm that selects the most confusing yet similar target example for a given instance based on entropy and similarity, and subsequently perturbs the given instance towards the selected target. Our method demonstrates significant enhancements in robustness when applied to non-contrastive SSL frameworks, and less but consistent robustness improvements with contrastive SSL frameworks, on the benchmark datasets.

Debiased and Denoised Entity Recognition from Distant Supervision
Haobo Wang Yiwen Dong Ruixuan Xiao Fei Huang Gang Chen Junbo Zhao



Research question: How to reduce the performance degradation caused by noisy, unsupervisedly tagged labels in distantly supervised named entity recognition (NER).
Motivation: Existing distant supervision methods suffer from two main biases: first, the noise in distant labels is highly structural rather than fully random; second, the self-training framework introduces an inherent bias into both sample selection and final prediction.
Method: Propose a novel self-training framework, DesERT, which adapts the sample-selection process to conform to the labels' innate distributional-bias structure and enhances token representations, and hence pseudo-label quality, through a debiasing module.
Results: Experimental results show that DesERT achieves a +2.22% average F1 score improvement on five standard benchmark datasets and demonstrates its effectiveness on a new DSNER benchmark where additional distant supervision comes from the ChatGPT model.

While distant supervision has been extensively explored and exploited in NLP tasks like named entity recognition, a major obstacle stems from the inevitable noisy distant labels tagged unsupervisedly. A few past works approach this problem by adopting a self-training framework with a sample-selection mechanism. In this work, we innovatively identify two types of biases that were omitted by prior work, and these biases lead to inferior performance of the distant-supervised NER setup. First, we characterize the noise concealed in the distant labels as highly structural rather than fully randomized. Second, the self-training framework would ubiquitously introduce an inherent bias that causes erroneous behavior in both sample selection and eventually prediction. To cope with these problems, we propose a novel self-training framework, dubbed DesERT. This framework augments the conventional NER predicative pathway to a dual form that effectively adapts the sample-selection process to conform to its innate distributional-bias structure. The other crucial component of DesERT is a debiasing module aiming to enhance the token representations, hence the quality of the pseudo-labels. Extensive experiments are conducted to validate the DesERT. The results show that our framework establishes a new state-of-the-art performance, achieving a +2.22% average F1 score improvement on five standardized benchmarking datasets. Lastly, DesERT demonstrates its effectiveness under a new DSNER benchmark where additional distant supervision comes from the ChatGPT model.

Emergent Communication for Rules Reasoning
Yuxuan Guo Yifan Hao Rui Zhang Enshuai Zhou Zidong Du Xishan Zhang Xinkai Song Yuanbo Wen Yongwei Zhao Xuehai Zhou Jiaming Guo Qi Yi Shaohui Peng Di Huang Ruizhi Chen Qi Guo Yunji Chen



Research question: Emergent communication between deep learning agents is inspiring for linguistics and artificial intelligence, but previous attempts have revolved around perception-oriented environment settings.
Motivation: Inspired by the classic human reasoning test (namely Raven's Progressive Matrices), we propose the Reasoning Game, a cognition-oriented environment that encourages agents to reason about and communicate high-level rules rather than describe low-level perceptual features.
Method: We propose 1) an unbiased dataset (namely rule-RAVEN) as a benchmark to avoid overfitting, and 2) a two-stage curriculum agent training method as a baseline for more stable convergence in the Reasoning Game, where contexts and semantics drift bilaterally.
Results: Experimental results show that a semantically stable and compositional language emerges in the Reasoning Game to solve reasoning problems. The emerged language helps agents generalize extracted rules to unseen context attributes and transfer across different context attributes and even tasks.

Research on emergent communication between deep-learning-based agents has received extensive attention due to its inspiration for linguistics and artificial intelligence. However, previous attempts have hovered around emergent communication under perception-oriented environmental settings, which force agents to describe low-level perceptual features within image or symbol contexts. In this work, inspired by the classic human reasoning test (namely Raven's Progressive Matrix), we propose the Reasoning Game, a cognition-oriented environment that encourages agents to reason and communicate high-level rules, rather than perceived low-level contexts. Moreover, we propose 1) an unbiased dataset (namely rule-RAVEN) as a benchmark to avoid overfitting, and 2) a two-stage curriculum agent training method as a baseline for more stable convergence in the Reasoning Game, where contexts and semantics are bilaterally drifting. Experimental results show that, in the Reasoning Game, a semantically stable and compositional language emerges to solve reasoning problems. The emerged language helps agents apply the extracted rules to the generalization of unseen context attributes, and to the transfer between different context attributes or even tasks.

Hyperbolic Graph Neural Networks at Scale: A Meta Learning Approach
Nurendra Choudhary Nikhil Rao Chandan K. Reddy



Research question: How to overcome the slow progress of hyperbolic neural networks caused by their lack of inductive-bias mechanisms, which are essential for generalizing to new tasks and enabling scalable learning on large datasets.
Motivation: Current hyperbolic neural networks lack inductive-bias mechanisms for generalizing to new tasks and for scalable learning, which limits their application to large datasets.
Method: This paper proposes the Hyperbolic GRAph Meta Learner (H-GRAM), which learns transferable information from nodes' local subgraphs and transfers it to new subgraphs with disjoint nodes, edges, and labels, enabling faster learning on new tasks.
Results: Experiments show that H-GRAM effectively learns and transfers information in multiple challenging few-shot settings, outperforming state-of-the-art baselines. Moreover, unlike standard hyperbolic neural networks, the method scales to large graph datasets and improves performance over its Euclidean counterparts.

The progress in hyperbolic neural networks (HNNs) research is hindered by their absence of inductive bias mechanisms, which are essential for generalizing to new tasks and facilitating scalable learning over large datasets. In this paper, we aim to alleviate these issues by learning generalizable inductive biases from the nodes' local subgraphs and transferring them for faster learning over new subgraphs with a disjoint set of nodes, edges, and labels in a few-shot setting. We introduce a novel method, Hyperbolic GRAph Meta Learner (H-GRAM), that, for the tasks of node classification and link prediction, learns transferable information from a set of support local subgraphs in the form of hyperbolic meta gradients and label hyperbolic protonets to enable faster learning over a query set of new tasks dealing with disjoint subgraphs. Furthermore, we show that an extension of our meta-learning framework also mitigates the scalability challenges seen in HNNs faced by existing approaches. Our comparative analysis shows that H-GRAM effectively learns and transfers information in multiple challenging few-shot settings compared to other state-of-the-art baselines. Additionally, we demonstrate that, unlike standard HNNs, our approach is able to scale over large graph datasets and improve performance over its Euclidean counterparts.

Adapting to Continuous Covariate Shift via Online Density Ratio Estimation
Yu-Jie Zhang Zhen-Yu Zhang Peng Zhao Masashi Sugiyama



Research question: This paper addresses a central challenge of distribution shift in modern machine learning, in particular continuous covariate shift.
Motivation: Under continuous covariate shift, test data arrive sequentially and their distribution may change continuously, which existing methods cannot handle effectively.
Method: We propose an online density ratio estimation method that appropriately reuses historical information to adapt to the continuously shifting test distribution.
Results: Theoretical analysis and experiments show that our method effectively reduces the prediction risk and also performs well empirically.

Dealing with distribution shifts is one of the central challenges for modern machine learning. One fundamental situation is the covariate shift, where the input distributions of data change from the training to testing stages while the input-conditional output distribution remains unchanged. In this paper, we initiate the study of a more challenging scenario --- continuous covariate shift --- in which the test data appear sequentially, and their distributions can shift continuously. Our goal is to adaptively train the predictor such that its prediction risk accumulated over time can be minimized. Starting with importance-weighted learning, we theoretically show the method works effectively if the time-varying density ratios of test and train inputs can be accurately estimated. However, existing density ratio estimation methods would fail due to data scarcity at each time step. To this end, we propose an online density ratio estimation method that can appropriately reuse historical information. Our method is proven to perform well by enjoying a dynamic regret bound, which finally leads to an excess risk guarantee for the predictor. Empirical results also validate the effectiveness of our approach.
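The importance-weighted learning step that the method builds on is easy to make concrete. Below is a minimal sketch, assuming the classical probabilistic-classifier estimate of the density ratio on a fixed batch of test inputs; the paper's actual contribution, an online estimator that reuses historical information, is not implemented here, and all function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio(x_train, x_test):
    """Classifier-based estimate of r(x) = p_test(x) / p_train(x):
    a discriminator trained to tell test (label 1) from train (label 0)
    gives r(x) = p(test|x) / p(train|x) * (n_train / n_test)."""
    X = np.vstack([x_train, x_test])
    y = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_test))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(x_train)[:, 1]              # p(test | x) at train points
    prior = len(x_train) / len(x_test)
    return np.clip(p / (1.0 - p) * prior, 0.0, 50.0)  # clipped for stability

def importance_weighted_risk(per_sample_losses, ratios):
    # Reweighted empirical training risk, targeting the test distribution.
    return float(np.mean(ratios * per_sample_losses))
```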

Towards Self-Interpretable Graph-Level Anomaly Detection
Yixin Liu Kaize Ding Qinghua Lu Fuyi Li Leo Yu Zhang Shirui Pan



Research question: This paper addresses graph-level anomaly detection (GLAD), i.e., identifying graphs that differ significantly from the majority in a collection.
Motivation: Current work focuses on scoring graph-level abnormality but fails to provide meaningful explanations for its predictions, which largely limits reliability and application scope.
Method: This paper formulates a new challenging problem, explainable GLAD, where the learning objective is to predict the abnormality of each graph sample together with the corresponding explanation, i.e., the vital subgraph that leads to the prediction. To this end, we propose a Self-Interpretable Graph aNomaly dETection model (SIGNET).
Results: Extensive experiments on 16 datasets demonstrate SIGNET's anomaly detection capability and self-interpretability.

Graph-level anomaly detection (GLAD) aims to identify graphs that exhibit notable dissimilarity compared to the majority in a collection. However, current works primarily focus on evaluating graph-level abnormality while failing to provide meaningful explanations for the predictions, which largely limits their reliability and application scope. In this paper, we investigate a new challenging problem, explainable GLAD, where the learning objective is to predict the abnormality of each graph sample with corresponding explanations, i.e., the vital subgraph that leads to the predictions. To address this challenging problem, we propose a Self-Interpretable Graph aNomaly dETection model (SIGNET for short) that detects anomalous graphs as well as generates informative explanations simultaneously. Specifically, we first introduce the multi-view subgraph information bottleneck (MSIB) framework, serving as the design basis of our self-interpretable GLAD approach. In this way, SIGNET is able to not only measure the abnormality of each graph based on cross-view mutual information but also provide informative graph rationales by extracting bottleneck subgraphs from the input graph and its dual hypergraph in a self-supervised way. Extensive experiments on 16 datasets demonstrate the anomaly detection capability and self-interpretability of SIGNET.

How Re-sampling Helps for Long-Tail Learning?
Jiang-Xin Shi Tong Wei Yuke Xiang Yu-Feng Li



Research question: Long-tail learning has received significant attention in recent years due to the challenge posed by extremely imbalanced datasets, in which only a few classes (the head classes) have adequate training samples while the remaining classes (the tail classes) are infrequent in the training data.
Motivation: Although re-sampling is widely used to address class imbalance, recent studies claim that it brings negligible performance improvements in modern long-tail learning tasks. This paper aims to investigate this phenomenon systematically.
Method: We design experiments on two homogeneous datasets, one containing irrelevant context and the other not. We also propose a new context-shift augmentation module that generates diverse training images for tail classes by maintaining a context bank extracted from head-class images.
Results: Experiments demonstrate that the proposed module boosts generalization and outperforms other approaches, including class-balanced re-sampling, decoupled classifier re-training, and data augmentation methods.

Long-tail learning has received significant attention in recent years due to the challenge it poses with extremely imbalanced datasets. In these datasets, only a few classes (known as the head classes) have an adequate number of training samples, while the rest of the classes (known as the tail classes) are infrequent in the training data. Re-sampling is a classical and widely used approach for addressing class imbalance issues. Unfortunately, recent studies claim that re-sampling brings negligible performance improvements in modern long-tail learning tasks. This paper aims to investigate this phenomenon systematically. Our research shows that re-sampling can considerably improve generalization when the training images do not contain semantically irrelevant contexts. In other scenarios, however, it can learn unexpected spurious correlations between irrelevant contexts and target labels. We design experiments on two homogeneous datasets, one containing irrelevant context and the other not, to confirm our findings. To prevent the learning of spurious correlations, we propose a new context shift augmentation module that generates diverse training images for the tail class by maintaining a context bank extracted from the head-class images. Experiments demonstrate that our proposed module can boost the generalization and outperform other approaches, including class-balanced re-sampling, decoupled classifier re-training, and data augmentation methods. The source code is available at https://www.lamda.nju.edu.cn/code_CSA.ashx.

Identifiable Contrastive Learning with Automatic Feature Importance Discovery
Qi Zhang Yifei Wang Yisen Wang



Research question: Existing contrastive learning methods learn data representations via pairwise sample contrast, but the learned features often lack human interpretability.
Motivation: Theoretically, such methods lack feature identifiability: different initializations may lead to totally different features.
Method: This paper proposes tri-factor contrastive learning (triCL), which uses a 3-factor contrast of the form $z_x^\top S z_{x'}$, where $S=\text{diag}(s_1,\dots,s_k)$ is a learnable diagonal matrix that automatically captures the importance of each feature.
Results: Experiments show that with this simple extension, triCL obtains identifiable features that eliminate randomness, as well as more interpretable features ordered by the importance matrix $S$. High-importance features are well interpretable, capturing common class-wise features, and yield superior performance in image retrieval with only a few features. The triCL objective is general and applies to different contrastive learning methods such as SimCLR and CLIP.

Existing contrastive learning methods rely on pairwise sample contrast $z_x^\top z_{x'}$ to learn data representations, but the learned features often lack clear interpretability from a human perspective. Theoretically, such methods lack feature identifiability, and different initializations may lead to totally different features. In this paper, we study a new method named tri-factor contrastive learning (triCL) that involves a 3-factor contrast in the form of $z_x^\top S z_{x'}$, where $S=\text{diag}(s_1,\dots,s_k)$ is a learnable diagonal matrix that automatically captures the importance of each feature. We show that by this simple extension, triCL can not only obtain identifiable features that eliminate randomness but also obtain more interpretable features that are ordered according to the importance matrix $S$. We show that features with high importance have nice interpretability by capturing common classwise features, and obtain superior performance when evaluated for image retrieval using a few features. The proposed triCL objective is general and can be applied to different contrastive learning methods like SimCLR and CLIP. We believe that it is a better alternative to existing 2-factor contrastive learning by improving its identifiability and interpretability with minimal overhead. Code is available at https://github.com/PKU-ML/Tri-factor-Contrastive-Learning.
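The tri-factor contrast is simple to express in code. The following is a minimal PyTorch sketch; the InfoNCE-style objective over the tri-factor scores and the positivity parametrization $s_i = \exp(\log s_i)$ are our assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

class TriFactorContrast(torch.nn.Module):
    """3-factor contrast z_x^T S z_{x'} with learnable S = diag(s_1, ..., s_k)."""

    def __init__(self, dim, temperature=0.5):
        super().__init__()
        self.log_s = torch.nn.Parameter(torch.zeros(dim))  # s_i = exp(log_s_i) > 0
        self.temperature = temperature

    def forward(self, z1, z2):
        # z1, z2: [batch, dim] embeddings of two views of the same batch.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        s = torch.exp(self.log_s)                          # per-feature importance
        logits = (z1 * s) @ z2.t() / self.temperature      # [i, j] = z_i^T S z_j
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)             # InfoNCE over tri-factor scores
```

After training, sorting the features by $s_i$ yields the importance ordering used, e.g., for retrieval with only a few features.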

Learning Topology-Agnostic EEG Representations with Geometry-Aware Modeling
Ke Yi Yansen Wang Kan Ren Dongsheng Li



Research question: How to pre-train on large-scale unlabeled scalp electroencephalogram (EEG) data to improve downstream task performance.
Motivation: Since unlabeled data is plentiful, developing similar pre-training techniques for scalp EEG is suitable. Meanwhile, the varied sampling-channel selections and the inherent structural and spatial information bring both challenges and opportunities for improving existing pre-training strategies.
Method: We propose mapping all kinds of channel selections to a unified topology, and on top of it introduce MMM, a pre-training framework with multi-dimensional position encoding, a multi-level channel hierarchy, and a multi-stage pre-training strategy, to obtain topology-agnostic representations.
Results: Experiments show that our approach yields significant improvements over previous state-of-the-art techniques on emotion recognition benchmark datasets.

Large-scale pre-training has shown great potential to enhance models on downstream tasks in vision and language. Developing similar techniques for scalp electroencephalogram (EEG) is suitable since unlabelled data is plentiful. Meanwhile, various sampling channel selections and inherent structural and spatial information bring challenges and avenues to improve existing pre-training strategies further. In order to break boundaries between different EEG resources and facilitate cross-dataset EEG pre-training, we propose to map all kinds of channel selections to a unified topology. We further introduce MMM, a pre-training framework with Multi-dimensional position encoding, Multi-level channel hierarchy, and Multi-stage pre-training strategy built on the unified topology to obtain topology-agnostic representations. Experiments demonstrate that our approach yields impressive improvements over previous state-of-the-art techniques on emotion recognition benchmark datasets.

Ambient Diffusion: Learning Clean Distributions from Corrupted Data
Giannis Daras Kulin Shah Yuval Dagan Aravind Gollakota Alex Dimakis Adam Klivans



Research question: How to learn an unknown distribution using only highly corrupted samples.
Motivation: In scientific applications, obtaining uncorrupted samples is impossible or expensive. Moreover, this approach can train generative models that are unlikely to memorize any individual training sample, since they never observe clean training data.
Method: We introduce additional measurement distortion during the diffusion process and require the model to predict the original corrupted image from the further corrupted image. We prove that this leads to models that learn the conditional expectation of the full uncorrupted image given the additional measurement corruption.
Results: We train models on standard benchmarks (CelebA, CIFAR-10, and AFHQ) and show that we can learn the distribution even when 90% of the pixels are missing in all training samples. We also show that we can fine-tune foundation models on small corrupted datasets (e.g., MRI scans with block corruptions) and learn the clean distribution without memorizing the training set.

We present the first diffusion-based framework that can learn an unknown distribution using only highly-corrupted samples. This problem arises in scientific applications where access to uncorrupted samples is impossible or expensive to acquire. Another benefit of our approach is the ability to train generative models that are less likely to memorize any individual training sample, since they never observe clean training data. Our main idea is to introduce additional measurement distortion during the diffusion process and require the model to predict the original corrupted image from the further corrupted image. We prove that our method leads to models that learn the conditional expectation of the full uncorrupted image given this additional measurement corruption. This holds for any corruption process that satisfies some technical conditions (and in particular includes inpainting and compressed sensing). We train models on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn the distribution even when all the training samples have 90% of their pixels missing. We also show that we can finetune foundation models on small corrupted datasets (e.g. MRI scans with block corruptions) and learn the clean distribution without memorizing the training set.
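For inpainting-type corruption, the core training step can be sketched as follows. This is a simplified reading of the idea, with a toy cosine noise schedule and an assumed model signature model(x_t, mask, t); the paper's exact parameterization, schedules, and loss weighting are omitted.

```python
import torch

def ambient_diffusion_step(model, x0, mask, t, extra_p=0.1):
    """One training step on corrupted data (x0 is only ever used through
    x0 * mask, matching the assumption that clean images are unavailable).

    x0:   [B, C, H, W] images; mask: [B, 1, H, W] binary observation mask
    t:    [B] diffusion times in [0, 1]; extra_p: extra fraction of pixels to hide
    """
    # Further corrupt: randomly hide an extra fraction of the observed pixels.
    extra = (torch.rand_like(mask) > extra_p).float()
    mask_hat = mask * extra

    noise = torch.randn_like(x0)
    a = torch.cos(t * torch.pi / 2).pow(2).view(-1, 1, 1, 1)   # toy schedule
    x_t = (a.sqrt() * x0 + (1 - a).sqrt() * noise) * mask_hat  # noisy, further-corrupted

    x0_pred = model(x_t, mask_hat, t)  # assumed signature: predicts the full image
    # Supervise only on the originally observed pixels, so the model must
    # see through the extra corruption it was shown.
    return (((x0_pred - x0) * mask) ** 2).mean()
```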

Diversify Your Vision Datasets with Automatic Diffusion-based Augmentation
Lisa Dunlap Alyssa Umino Han Zhang Jiezhi Yang Joseph E. Gonzalez Trevor Darrell



Research question: How to improve generalization on fine-grained classification tasks when training data is limited?
Motivation: Because of limited training data, models often fail to adapt to changes in environment or location. We therefore explore how natural language descriptions, together with large models pre-trained on diverse datasets, can be used to generate useful variations of the training data.
Method: We propose ALIA (Automated Language-guided Image Augmentation), which uses large vision and language models to automatically generate natural language descriptions of a dataset's domains and augments the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those that corrupt class-relevant information.
Results: Experiments show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias.

Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias. Code is available at https://github.com/lisadunlap/ALIA.

Better with Less: A Data-Active Perspective on Pre-Training Graph Neural Networks
Jiarong Xu Renhong Huang XIN JIANG Yuxuan Cao Carl Yang Chunping Wang Yang Yang



Research question: This paper addresses the "curse of big data" phenomenon in graph neural network pre-training: more training data do not necessarily lead to better downstream performance.
Motivation: Existing graph pre-training models usually rely on massive input data for success, but the authors find this is not a necessary condition, and propose a new framework that pre-trains with fewer but more carefully chosen data.
Method: The authors propose a data-active graph pre-training (APT) pipeline composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs and on predictive uncertainty. When fed the chosen data, the pre-training model grasps an initial understanding of the new, unseen data while attempting to remember the knowledge learned from previous data.
Results: Experiments show that the proposed APT obtains a more effective pre-training model with less training data and better downstream performance.

Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for downstream tasks with unlabeled data, and it has recently become an active research area. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training data do not necessarily lead to better downstream performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen data are fed into a GNN model to enhance pre-training. The proposed pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, the pre-training model in turn grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learned from previous data. Therefore, the integration and interaction between these two components form a unified framework (APT), in which graph pre-training is performed in a progressive and iterative way. Experimental results show that the proposed APT is able to obtain an efficient pre-training model with fewer training data and better downstream performance.

Doubly-Robust Self-Training
Banghua Zhu Mingyu Ding Philip Jacobson Ming Wu Wei Zhan Michael Jordan Jiantao Jiao



Research question: This paper addresses how to effectively exploit unlabeled data in semi-supervised learning.
Motivation: Existing semi-supervised methods rely heavily on the accuracy of the generated pseudo-labels and behave suboptimally at the extremes where pseudo-labels are entirely wrong or entirely accurate.
Method: This paper proposes doubly-robust self-training, a novel semi-supervised algorithm that reduces to training on labeled data alone when pseudo-labels are entirely incorrect, and to training on all pseudo-labeled and labeled data when pseudo-labels are completely accurate, thereby increasing the effective sample size.
Results: Empirical evaluations on the ImageNet image classification dataset and the nuScenes autonomous driving 3D object detection dataset verify the superiority of the doubly-robust loss over the self-training baseline.

Self-training is a well-established technique in semi-supervised learning, which leverages unlabeled data by generating pseudo-labels and incorporating them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly-robust self-training, an innovative semi-supervised algorithm that provably balances between two extremes. When pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly-robust loss over the self-training baseline.
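One way to read the "balance between two extremes" is the doubly-robust estimator sketched below. This is a minimal sketch of our reading of the construction, and the exact term weighting in the paper's loss may differ.

```python
import torch.nn.functional as F

def doubly_robust_loss(model, x_lab, y_lab, x_all, y_pseudo_all, y_pseudo_lab):
    """Sketch: pseudo-label loss over all data, corrected on the labeled subset.

    If pseudo-labels are perfect, the last two terms cancel, leaving the
    full-sample pseudo-label loss; if they are useless, the first two terms
    cancel in expectation, leaving only the labeled-data loss.
    """
    loss_all_pseudo = F.cross_entropy(model(x_all), y_pseudo_all)  # all pseudo-labels
    loss_lab_pseudo = F.cross_entropy(model(x_lab), y_pseudo_lab)  # pseudo on labeled set
    loss_lab_true = F.cross_entropy(model(x_lab), y_lab)           # true labels
    return loss_all_pseudo - loss_lab_pseudo + loss_lab_true
```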

Chasing Fairness Under Distribution Shift: A Model Weight Perturbation Approach
Zhimeng Jiang Xiaotian Han Hongye Jin Guanchu Wang Rui Chen Na Zou Xia Hu



Research question: Fairness in machine learning has attracted increasing attention in recent years; how can fairness be maintained under distribution shift?
Motivation: Fairness methods that improve algorithmic fairness for in-distribution data may not perform well under distribution shifts.
Method: This paper first theoretically demonstrates the inherent connection between distribution shift, data perturbation, and model weight perturbation, and then analyzes sufficient conditions that guarantee fairness for the target dataset. Motivated by these conditions, it proposes robust fairness regularization (RFR), which considers the worst case within a model-weight perturbation ball for each sensitive attribute group.
Results: Experiments on synthetic and real distribution shifts across various datasets show that RFR achieves a better fairness-accuracy trade-off than several baselines.

Fairness in machine learning has attracted increasing attention in recent years. The fairness methods improving algorithmic fairness for in-distribution data may not perform well under distribution shifts. In this paper, we first theoretically demonstrate the inherent connection between distribution shift, data perturbation, and model weight perturbation. Subsequently, we analyze the sufficient conditions to guarantee fairness (i.e., low demographic parity) for the target dataset, including fairness for the source dataset, and low prediction difference between the source and target datasets for each sensitive attribute group. Motivated by these sufficient conditions, we propose robust fairness regularization (RFR) by considering the worst case within the model weight perturbation ball for each sensitive attribute group. We evaluate the effectiveness of our proposed RFR algorithm on synthetic and real distribution shifts across various datasets. Experimental results demonstrate that RFR achieves better fairness-accuracy trade-off performance compared with several baselines. The source code is available at https://github.com/zhimengj0326/RFR_NeurIPS23.

Characterizing the Impacts of Semi-supervised Learning for Weak Supervision
Jeffrey Li Jieyu Zhang Ludwig Schmidt Alexander Ratner



Research question: How to label training data more efficiently to improve the accuracy of machine learning models?
Motivation: Labeling training data is a critical and expensive step in producing high-accuracy ML models, whether training from scratch or fine-tuning.
Method: This work defines a simple, modular design space to systematically study the use of semi-supervised learning (SSL) techniques for weak supervision (WS).
Results: Fairly simple methods from the design space match the performance of more complex state-of-the-art methods, averaging a 3 percentage-point gain in accuracy/F1-score across 8 standard WS benchmarks. The work also provides practical guidance on when different components are worth their added complexity and training cost. Contrary to current understanding, SSL is not necessary for the best performance on most WS benchmarks, but is more effective when (1) the end model is smaller and (2) WS provides labels for only a small portion of the training examples.

Labeling training data is a critical and expensive step in producing high accuracy ML models, whether training from scratch or fine-tuning. To make labeling more efficient, two major approaches are programmatic weak supervision (WS) and semi-supervised learning (SSL). More recent works have either explicitly or implicitly used techniques at their intersection, but in various complex and ad hoc ways. In this work, we define a simple, modular design space to study the use of SSL techniques for WS more systematically. Surprisingly, we find that fairly simple methods from our design space match the performance of more complex state-of-the-art methods, averaging a 3 p.p. increase in accuracy/F1-score across 8 standard WS benchmarks. Further, we provide practical guidance on when different components are worth their added complexity and training costs. Contrary to current understanding, we find using SSL is not necessary to obtain the best performance on most WS benchmarks but is more effective when: (1) end models are smaller, and (2) WS provides labels for only a small portion of training examples.

This Looks Like Those: Illuminating Prototypical Concepts Using Multiple Visualizations
Chiyu Ma Brandon Zhao Chaofan Chen Cynthia Rudin



Research question: How to perform interpretable image classification with prototypical parts by combining deep learning and case-based reasoning.
Motivation: Existing prototype-based image classifiers provide only one-to-one comparisons, making it difficult to identify the underlying concept being compared (e.g., color or shape).
Method: Modify the architecture of prototype-based networks to learn prototypical concepts that are visualized with multiple image patches, so that each prototype has multiple visualizations, yielding richer and more interpretable visual explanations.
Results: Experiments show that this "this looks like those" reasoning process can be applied as a modification to a wide range of existing prototypical image classification networks while achieving comparable accuracy on benchmark datasets.

We present ProtoConcepts, a method for interpretable image classification combining deep learning and case-based reasoning using prototypical parts. Existing work in prototype-based image classification uses a "this looks like that" reasoning process, which dissects a test image by finding prototypical parts and combining evidence from these prototypes to make a final classification. However, all of the existing prototypical part-based image classifiers provide only one-to-one comparisons, where a single training image patch serves as a prototype to compare with a part of our test image. With these single-image comparisons, it can often be difficult to identify the underlying concept being compared (e.g., "is it comparing the color or the shape?"). Our proposed method modifies the architecture of prototype-based networks to instead learn prototypical concepts which are visualized using multiple image patches. Having multiple visualizations of the same prototype allows us to more easily identify the concept captured by that prototype (e.g., "the test image and the related training patches are all the same shade of blue"), and allows our model to create richer, more interpretable visual explanations. Our experiments show that our "this looks like those" reasoning process can be applied as a modification to a wide range of existing prototypical image classification networks while achieving comparable accuracy on benchmark datasets.

Is Heterogeneity Notorious? Taming Heterogeneity to Handle Test-Time Shift in Federated Learning
Yue Tan Chen Chen Weiming Zhuang Xin Dong Lingjuan Lyu Guodong Long



Research question: How to handle feature-level test-time shift in federated learning.
Motivation: Existing federated learning methods handle inter-client heterogeneity during training well, but perform poorly on intra-client heterogeneity at test time.
Method: Propose FedICON, a contrastive-learning-based federated learning method that captures invariant knowledge across heterogeneous clients and consistently tunes the model to adapt to test data.
Results: Experiments show that FedICON effectively tackles test-time shift problems in federated learning.

Federated learning (FL) is an effective machine learning paradigm where multiple clients can train models based on heterogeneous data in a decentralized manner without accessing their private data. However, existing FL systems undergo performance deterioration due to feature-level test-time shifts, which are well investigated in centralized settings but rarely studied in FL. The common non-IID issue in FL usually refers to inter-client heterogeneity during the training phase, while the test-time shift refers to intra-client heterogeneity during the test phase. Although the former is always deemed notorious for FL, there is still a wealth of useful information delivered by heterogeneous data sources, which may potentially help alleviate the latter issue. To explore the possibility of using inter-client heterogeneity to handle intra-client heterogeneity, we first propose a contrastive learning-based FL framework, namely FedICON, to capture invariant knowledge among heterogeneous clients and consistently tune the model to adapt to test data. In FedICON, each client performs sample-wise supervised contrastive learning during the local training phase, which enhances sample-wise invariance encoding ability. Through global aggregation, the invariance extraction ability can be mutually boosted across heterogeneous clients. During the test phase, our test-time adaptation procedure leverages unsupervised contrastive learning to guide the model to smoothly generalize to test data under intra-client heterogeneity. Extensive experiments validate the effectiveness of the proposed FedICON in taming heterogeneity to handle test-time shift problems.

Disambiguated Attention Embedding for Multi-Instance Partial-Label Learning
Wei Tang Weijia Zhang Min-Ling Zhang



Research question: How to effectively handle multi-instance partial-label learning tasks, especially real-world tasks in which an object is represented as a multi-instance bag associated with a candidate label set containing one ground-truth label and several false positive labels.
Motivation: Existing multi-instance partial-label learning methods ignore global bag-level information, and the predicted bag labels are sensitive to predictions on negative instances, so a more effective approach is needed.
Method: Propose DEMIPL, Disambiguated attention Embedding for Multi-Instance Partial-Label learning. DEMIPL uses a disambiguation attention mechanism to aggregate a multi-instance bag into a single vector representation, followed by a momentum-based disambiguation strategy to identify the ground-truth label from the candidate label set.
Results: Experimental results on benchmark and real-world datasets validate the superiority of DEMIPL over compared multi-instance partial-label learning and partial-label learning methods.

In many real-world tasks, the concerned objects can be represented as a multi-instance bag associated with a candidate label set, which consists of one ground-truth label and several false positive labels. Multi-instance partial-label learning (MIPL) is a learning paradigm to deal with such tasks and has achieved favorable performances. Existing MIPL approaches follow the instance-space paradigm by assigning the augmented candidate label set of a bag to each of its instances and aggregating bag-level labels from instance-level labels. However, this scheme may be suboptimal as global bag-level information is ignored and the predicted labels of bags are sensitive to predictions of negative instances. In this paper, we study an alternative scheme where a multi-instance bag is embedded into a single vector representation. Accordingly, an intuitive algorithm named DEMIPL, i.e., Disambiguated attention Embedding for Multi-Instance Partial-Label learning, is proposed. DEMIPL employs a disambiguation attention mechanism to aggregate a multi-instance bag into a single vector representation, followed by a momentum-based disambiguation strategy to identify the ground-truth label from the candidate label set. Furthermore, we introduce a real-world MIPL dataset for colorectal cancer classification. Experimental results on benchmark and real-world datasets validate the superiority of DEMIPL against the compared MIPL and partial-label learning approaches.

Not All Out-of-Distribution Data Are Harmful to Open-Set Active Learning
Yang Yang Yuxuan Zhang XIN SONG Yi Xu



Research question: Existing active learning (AL) methods perform poorly on out-of-distribution (OOD) instances in real-world applications, because OOD instances are inevitably contained in unlabeled data and can lead to inefficient sampling.
Motivation: To address this, we propose a simple yet effective sampling scheme, Progressive Active Learning (PAL), which employs a progressive sampling mechanism to effectively select valuable OOD instances.
Method: PAL measures unlabeled instances by synergistically evaluating their informativeness and representativeness, thereby balancing pseudo-ID and pseudo-OOD instances in each round and enhancing both the ID classifier and the OOD detector.
Results: Extensive experiments demonstrate the effectiveness of PAL in various open-set AL scenarios compared with state-of-the-art methods. Code is available at https://github.com/njustkmg/PAL.

Active learning (AL) methods have been proven to be an effective way to reduce the labeling effort by intelligently selecting valuable instances for annotation. Despite their great success with in-distribution (ID) scenarios, AL methods suffer from performance degradation in many real-world applications because out-of-distribution (OOD) instances are always inevitably contained in unlabeled data, which may lead to inefficient sampling. Therefore, several attempts have explored open-set AL by strategically selecting pure ID instances while filtering OOD instances. However, concentrating solely on selecting pseudo-ID instances may constrain the training of both the ID classifier and the OOD detector. To address this issue, we propose a simple yet effective sampling scheme, Progressive Active Learning (PAL), which employs a progressive sampling mechanism to leverage the active selection of valuable OOD instances. The proposed PAL measures unlabeled instances by synergistically evaluating instances' informativeness and representativeness, and thus it can balance the pseudo-ID and pseudo-OOD instances in each round to enhance both the capacity of the ID classifier and the OOD detector. Extensive experiments on various open-set AL scenarios demonstrate the effectiveness of the proposed PAL, compared with the state-of-the-art methods. The code is available at https://github.com/njustkmg/PAL.

A case for reframing automated medical image classification as segmentation
Sarah Hooper Mayee F Chen Khaled Kamal Saab Kush Bhatia Curtis Langlotz Christopher Re



Research question: Re-examining the choice between training classification and segmentation models in medical image analysis.
Motivation: Although classification has historically been cheaper to label and more widely used, recent work has drastically reduced the cost of training segmentation models.
Method: Use an information-theoretic analysis of why segmentation and classification models may achieve different performance on the same dataset and overarching task, implement multiple methods for classifying medical images with segmentation models (termed "segmentation-for-classification"), and compare them against traditional classification on three retrospective datasets.
Results: The analysis and experiments summarize the benefits of switching from classification to segmentation, including improved sample efficiency, better performance with fewer labeled images (up to an order of magnitude fewer), on low-prevalence classes, and on certain rare subgroups (up to 161.1% improved recall), improved robustness to spurious correlations (up to 44.8% improved robust AUROC), and improved model interpretability, evaluation, and error analysis.

Image classification and segmentation are common applications of deep learning to radiology. While many tasks can be framed using either classification or segmentation, classification has historically been cheaper to label and more widely used. However, recent work has drastically reduced the cost of training segmentation networks. In light of this recent work, we reexamine the choice of training classification vs. segmentation models. First, we use an information theoretic approach to analyze why segmentation vs. classification models may achieve different performance on the same dataset and overarching task. We then implement multiple methods for using segmentation models to classify medical images, which we call segmentation-for-classification, and compare these methods against traditional classification on three retrospective datasets. We use our analysis and experiments to summarize the benefits of switching from classification to segmentation, including: improved sample efficiency, enabling improved performance with fewer labeled images (up to an order of magnitude fewer), on low-prevalence classes, and on certain rare subgroups (up to 161.1% improved recall); improved robustness to spurious correlations (up to 44.8% improved robust AUROC); and improved model interpretability, evaluation, and error analysis.

Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation
Tingliang Feng Hao Shi Xueyang Liu Wei Feng Liang Wan Yanlin Zhou Di Lin



Research question: In semantic image segmentation, how to compute more accurate pseudo annotations for target-domain images used to train the segmentation network.
Motivation: Existing methods globally adapt the scene style of images to minimize the style gap between source- and target-domain images, but the object styles of different categories or instances are adapted improperly.
Method: Propose Object Style Compensation, constructing an Object-Level Discrepancy Memory with multiple sets of discrepancy features. The discrepancy features in a set capture the style changes of the same category's object instances adapted from the target to the source domain. We learn discrepancy features from source- and target-domain images and store them in the memory. With this memory, we select appropriate discrepancy features to compensate the style information of object instances of various categories, adapting the object styles to a unified source-domain style.
Results: The method enables more accurate computation of pseudo annotations for target-domain images, yielding state-of-the-art results on different datasets.

Many methods of semantic image segmentation have borrowed the success of open compound domain adaptation. They minimize the style gap between the images of source and target domains, making it easier to predict accurate pseudo annotations for the target-domain images that train the segmentation network. The existing methods globally adapt the scene style of the images, whereas the object styles of different categories or instances are adapted improperly. This paper proposes Object Style Compensation, where we construct an Object-Level Discrepancy Memory with multiple sets of discrepancy features. The discrepancy features in a set capture the style changes of the same category's object instances adapted from the target to the source domain. We learn the discrepancy features from the images of the source and target domains, storing the discrepancy features in memory. With this memory, we select appropriate discrepancy features for compensating the style information of the object instances of various categories, adapting the object styles to a unified style of the source domain. Our method enables a more accurate computation of the pseudo annotations for the target domain's images, thus yielding state-of-the-art results on different datasets.

Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization
Adel Javanmard Vahab Mirrokni



Research question: How to improve the performance of personalized recommendation systems while keeping user data safe?
Motivation: As personalized recommendation systems become increasingly popular, protecting user data privacy is a top concern.
Method: This paper studies a natural technique called "look-alike clustering", which replaces individuals' sensitive features with the cluster's average values for model training.
Results: Theoretical analysis and experiments show that in certain high-dimensional regimes, training on anonymous cluster centers acts as a regularizer and improves the generalization of the trained models.

While personalized recommendation systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called "look-alike clustering", which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the feature dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components in the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves the generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments, where we observe a perfect match when the sample size is only on the order of a few hundred.
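The anonymization step itself is a one-liner once users are clustered. A minimal sketch, assuming k-means is run on the sensitive feature block; the cluster count and the sensitive/non-sensitive split are illustrative, not taken from the paper's setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def look_alike_anonymize(X_sensitive, n_clusters=50, seed=0):
    """Replace each individual's sensitive features with the average of their
    look-alike cluster, so the model never sees individual-level values."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_sensitive)
    return km.cluster_centers_[km.labels_]

# Training then proceeds on the anonymized block, e.g.
# X_anon = np.hstack([look_alike_anonymize(X_sens), X_nonsens])
```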

A Data-Free Approach to Mitigate Catastrophic Forgetting in Federated Class Incremental Learning for Vision Tasks
Sara Babakniya Zalan Fabian Chaoyang He Mahdi Soltanolkotabi Salman Avestimehr



Research question: Deep learning models often forget previously learned information when trained on new data, and this problem is exacerbated in federated learning.
Motivation: In federated learning, data is distributed and can change independently for each user, which makes mitigating forgetting a major challenge.
Method: This paper presents a federated class incremental learning framework that uses a generative model to synthesize samples from past distributions; these samples can be used alongside the training data to mitigate catastrophic forgetting. To preserve privacy, the generative model is trained on the server with data-free methods at the end of each task, without requesting data from clients.
Results: Extensive experiments on multiple datasets demonstrate significant improvements over existing baselines.

Deep learning models often suffer from forgetting previously learned information when trained on new data. This problem is exacerbated in federated learning (FL), where the data is distributed and can change independently for each user. Many solutions are proposed to resolve this catastrophic forgetting in a centralized setting. However, they do not apply directly to FL because of its unique complexities, such as privacy concerns and resource limitations. To overcome these challenges, this paper presents a framework for federated class incremental learning that utilizes a generative model to synthesize samples from past distributions. This data can be later exploited alongside the training data to mitigate catastrophic forgetting. To preserve privacy, the generative model is trained on the server using data-free methods at the end of each task without requesting data from clients. Moreover, our solution does not demand the users to store old data or models, which gives them the freedom to join/leave the training at any time. Additionally, we introduce SuperImageNet, a new regrouping of the ImageNet dataset specifically tailored for federated continual learning. We demonstrate significant improvements compared to existing baselines through extensive experiments on multiple datasets.

Navigating Data Heterogeneity in Federated Learning: A Semi-Supervised Approach for Object Detection
Taehyeon Kim Eric Lin Junu Lee Christian Lau Vaikkunth Mugunthan



Research question: How to train models over distributed data sources with federated learning while preserving data privacy, particularly in applications such as autonomous driving that face limited high-quality labels and non-IID client data.
Motivation: Existing federated learning methods struggle with non-IID client data and limited high-quality labels, especially in applications like autonomous driving.
Method: Propose a novel Semi-Supervised Federated Object Detection (SSFOD) framework for scenarios where only the server holds labeled data while clients hold unlabeled data. Notably, it is the first SSFOD implementation for clients with 0% labeled non-IID data.
Results: Extensive validation on prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) demonstrates the effectiveness of the approach with state-of-the-art results. Remarkably, FedSTO, using only 20-30% of the labels, performs nearly as well as fully supervised centralized training.

Federated Learning (FL) has emerged as a potent framework for training models across distributed data sources while maintaining data privacy. Nevertheless, it faces challenges with limited high-quality labels and non-IID client data, particularly in applications like autonomous driving. To address these hurdles, we navigate the uncharted waters of Semi-Supervised Federated Object Detection (SSFOD). We present a pioneering SSFOD framework, designed for scenarios where labeled data reside only at the server while clients possess unlabeled data. Notably, our method represents the inaugural implementation of SSFOD for clients with 0% labeled non-IID data, a stark contrast to previous studies that maintain some subset of labels at each client. We propose FedSTO, a two-stage strategy encompassing Selective Training followed by Orthogonally enhanced full-parameter training, to effectively address data shift (e.g. weather conditions) between server and clients. Our contributions include selectively refining the backbone of the detector to avert overfitting, orthogonality regularization to boost representation divergence, and local EMA-driven pseudo label assignment to yield high-quality pseudo labels. Extensive validation on prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) attests to the efficacy of our approach, demonstrating state-of-the-art results. Remarkably, FedSTO, using just 20-30% of labels, performs nearly as well as fully-supervised centralized training methods.

Slimmed Asymmetrical Contrastive Learning and Cross Distillation for Lightweight Model Training
Jian Meng Li Yang Kyungmin Lee Jinwoo Shin Deliang Fan Jae-sun Seo



Research question: Existing contrastive learning algorithms perform poorly on lightweight models and require massive compute, limiting their use in resource-constrained AI applications.
Motivation: To address these issues, we propose SACL-XD, a new self-supervised contrastive learning scheme.
Method: SACL-XD consists of two technical components, Slimmed Asymmetrical Contrastive Learning (SACL) and Cross-Distillation (XD). The method trains contrastive learning models from scratch, without requiring a pre-trained model as the teacher for unsupervised knowledge distillation.
Results: Compared with state-of-the-art lightweight contrastive learning (distillation) algorithms, SACL-XD achieves a 1.79% ImageNet-1K accuracy improvement on MobileNet-V3 with a 64$\times$ reduction in training FLOPs.

Contrastive learning (CL) has been widely investigated with various learning mechanisms and achieves strong capability in learning representations of data in a self-supervised manner using unlabeled data. A common fashion of contrastive learning on this line is employing mega-sized encoders to achieve comparable performance as the supervised learning counterpart. Despite the success of the label-less training, current contrastive learning algorithms fail to achieve good performance with lightweight (compact) models, e.g., MobileNet, while the requirements of the heavy encoders impede energy-efficient computation, especially for resource-constrained AI applications. Motivated by this, we propose a new self-supervised CL scheme, named SACL-XD, consisting of two technical components, Slimmed Asymmetrical Contrastive Learning (SACL) and Cross-Distillation (XD), which collectively enable efficient CL with compact models. While relevant prior works employed a strong pre-trained model as the teacher of unsupervised knowledge distillation to a lightweight encoder, our proposed method trains CL models from scratch and outperforms them even without such an expensive requirement. Compared to the SoTA lightweight CL training (distillation) algorithms, SACL-XD achieves a 1.79% ImageNet-1K accuracy improvement on MobileNet-V3 with a 64$\times$ training FLOPs reduction.

Enhancing Knowledge Transfer for Task Incremental Learning with Data-free Subnetwork
Qiang Gao Xiaojun Shan Yuchen Zhang Fan Zhou



Research question: How to exploit competitive subnetworks within a dense network, in concert with the Lottery Ticket Hypothesis, to achieve effective knowledge transfer?
Motivation: Address knowledge transfer across sequentially arriving tasks and relieve catastrophic forgetting, while accounting for the unavailability of past data and possible privacy concerns.
Method: Propose Data-free Subnetworks (DSN), a novel neuron-wise task incremental learning method that selects the affiliated weights of a small set of neurons to activate, reuses neurons from prior tasks via neuron-wise masks, and transfers possibly valuable knowledge to earlier tasks via data-free replay.
Results: Comprehensive experiments on four benchmark datasets demonstrate DSN's effectiveness in task-incremental learning compared with several state-of-the-art baselines. In particular, DSN enables knowledge transfer to earlier tasks, which prior work often overlooks.

As there exist competitive subnetworks within a dense network, in concert with the Lottery Ticket Hypothesis, we introduce a novel neuron-wise task incremental learning method, namely Data-free Subnetworks (DSN), which attempts to enhance elastic knowledge transfer across sequentially arriving tasks. Specifically, DSN primarily seeks to transfer knowledge to the new coming task from the learned tasks by selecting the affiliated weights of a small set of neurons to be activated, including the reused neurons from prior tasks via neuron-wise masks. It also transfers possibly valuable knowledge to earlier tasks via data-free replay. In particular, DSN inherently relieves catastrophic forgetting and copes with the unavailability of past data and possible privacy concerns. The comprehensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed DSN in the context of task-incremental learning by comparing it to several state-of-the-art baselines. In particular, DSN enables knowledge transfer to earlier tasks, which is often overlooked by prior efforts.

MIM4DD: Mutual Information Maximization for Dataset Distillation
Yuzhang Shang Zhihang Yuan Yan Yan



Research question: How to synthesize a small dataset whose test performance matches that of the full dataset under the same model.
Motivation: State-of-the-art methods optimize synthetic datasets mainly by matching heuristic indicators extracted from real and synthetic data, ignoring well-defined information-theoretic metrics that measure the shared information between variables.
Method: Introduce mutual information (MI) as the metric to quantify the shared information between synthetic and real datasets, and devise MIM4DD, which numerically maximizes the MI via a newly designed optimizable objective within a contrastive learning framework to update the synthetic dataset.
Results: Experiments show that MIM4DD can be implemented as an add-on module to existing state-of-the-art dataset distillation methods.

Dataset distillation (DD) aims to synthesize a small dataset whose test performance is comparable to that of a full dataset using the same model. State-of-the-art (SoTA) methods optimize synthetic datasets primarily by matching heuristic indicators extracted from two networks, one trained on real data and one on synthetic data (see Fig. 1, left), such as gradients and training trajectories. DD is essentially a compression problem that emphasizes maximizing the preservation of the information contained in the data. We argue that well-defined metrics which measure the amount of shared information between variables in information theory are necessary for success measurement, but have never been considered by previous works. Thus, we introduce mutual information (MI) as the metric to quantify the shared information between the synthetic and the real datasets, and devise MIM4DD, numerically maximizing the MI via a newly designed optimizable objective within a contrastive learning framework to update the synthetic dataset. Specifically, we designate the samples in the two datasets that share the same labels as positive pairs, and the rest as negative pairs. Then we pull together the samples in positive pairs and push apart those in negative pairs in the contrastive space by minimizing an NCE loss. As a result, the targeted MI can be transformed into a lower bound represented by feature maps of samples, which is numerically feasible. Experiment results show that MIM4DD can be implemented as an add-on module to existing SoTA DD methods.
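The contrastive MI lower bound can be sketched as a supervised-contrastive NCE between synthetic and real feature maps, where pairs sharing a label are positives and everything else is a negative. The shapes, temperature, and positive-averaging below are our illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mim_nce_loss(f_syn, f_real, y_syn, y_real, temperature=0.1):
    """NCE-style MI lower bound between synthetic and real features.

    f_syn:  [Bs, d] features of synthetic samples (gradients flow back to the
            synthetic images through these); f_real: [Br, d] real features.
    """
    f_syn = F.normalize(f_syn, dim=1)
    f_real = F.normalize(f_real, dim=1)
    logits = f_syn @ f_real.t() / temperature          # pairwise similarities
    pos = (y_syn[:, None] == y_real[None, :]).float()  # same label -> positive pair
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood of each synthetic sample's positives; samples
    # with no positive in the batch contribute zero.
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```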

DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets
Yash Jain Harkirat Behl Zsolt Kira Vibhav Vineet



Research question: How to effectively train a model on a large mixture of datasets?
Motivation: Existing methods use separate detection heads on a common backbone, which significantly increases the parameter count.
Method: We present Mixture-of-Experts as a solution, highlighting that MoE is much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts (DAMEX), which trains each expert to become an "expert" of a dataset by learning to route each dataset's tokens to its mapped expert.
Results: On the Universal Object-Detection Benchmark, our experiments outperform the existing state of the art by an average of +10.2 AP and improve over the non-MoE baseline by an average of +2.0 AP. We also observe consistent gains when mixing datasets with (1) limited availability, (2) disparate domains, and (3) divergent label sets, and qualitatively show that DAMEX is robust against expert representation collapse. Code is available at https://github.com/jinga-lala/DAMEX.

Construction of a universal detector poses a crucial question: How can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge, all within a single model. Previous methods achieve this by having separate detection heads on a common backbone, but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoE is much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts, DAMEX, where we train the experts to become an 'expert' of a dataset by learning to route each dataset's tokens to its mapped expert. Experiments on the Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by an average of +10.2 AP and improve over our non-MoE baseline by an average of +2.0 AP. We also observe consistent gains while mixing datasets with (1) limited availability, (2) disparate domains and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse. Code is available at https://github.com/jinga-lala/DAMEX
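The dataset-aware routing can be sketched as a standard top-1 MoE layer plus an auxiliary cross-entropy that pushes every token from a given dataset to its mapped expert. The hard dataset-to-expert mapping and the linear expert layers below are illustrative simplifications, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

class DatasetAwareMoE(torch.nn.Module):
    """Top-1 MoE layer with a dataset-aware routing loss."""

    def __init__(self, dim, n_experts):
        super().__init__()
        self.router = torch.nn.Linear(dim, n_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, tokens, dataset_id):
        # tokens: [n_tokens, dim]; dataset_id: int id of this batch's dataset.
        logits = self.router(tokens)                   # [n_tokens, n_experts]
        top1 = logits.argmax(dim=-1)                   # hard top-1 routing
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            sel = top1 == e
            if sel.any():
                out[sel] = expert(tokens[sel])
        # Auxiliary loss: tokens of this dataset should be routed to its
        # mapped expert (here simply dataset_id modulo the expert count).
        target = torch.full_like(top1, dataset_id % len(self.experts))
        return out, F.cross_entropy(logits, target)
```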

Decompose a Task into Generalizable Subtasks in Multi-Agent Reinforcement Learning
Zikang Tian Ruizhi Chen Xing Hu Ling Li Rui Zhang Fan Wu Shaohui Peng Jiaming Guo Zidong Du Qi Guo Yunji Chen



Research question: How to achieve model transfer across tasks in Multi-Agent Reinforcement Learning (MARL).
Motivation: Training a model from scratch for each task is time-consuming and expensive, especially in large-scale multi-agent systems, so methods that generalize models across tasks are needed.
Method: Propose DT2GS, a novel framework that decomposes a task into a series of generalizable subtasks using a scalable subtask encoder and an adaptive subtask semantic module.
Results: Experiments show that DT2GS possesses sound zero-shot generalization across tasks, exhibits sufficient transferability, and outperforms existing methods on both multi-task and single-task problems.

In recent years, Multi-Agent Reinforcement Learning (MARL) techniques have made significant strides in achieving high asymptotic performance in single tasks. However, there has been limited exploration of model transferability across tasks. Training a model from scratch for each task can be time-consuming and expensive, especially for large-scale Multi-Agent Systems. Therefore, it is crucial to develop methods for generalizing the model across tasks. Considering that there exist task-independent subtasks across MARL tasks, a model that can decompose such subtasks from the source task could generalize to target tasks. However, ensuring true task-independence of subtasks poses a challenge. In this paper, we propose to decompose a task into a series of generalizable subtasks (DT2GS), a novel framework that addresses this challenge by utilizing a scalable subtask encoder and an adaptive subtask semantic module. We show that these components endow subtasks with two properties critical for task-independence: avoiding overfitting to the source task and maintaining consistent yet scalable semantics across tasks. Empirical results demonstrate that DT2GS possesses sound zero-shot generalization capability across tasks, exhibits sufficient transferability, and outperforms existing methods in both multi-task and single-task problems.

Characterizing Out-of-Distribution Error via Optimal Transport
Yuzhe Lu Yilong Qin Runtian Zhai Andrew Shen Ketong Chen Zhenlin Wang Soheil Kolouri Simon Stepputtis Joseph Campbell Katia P. Sycara



Research question: How to accurately predict a model's performance on unlabeled out-of-distribution (OOD) data to improve machine learning safety.
Motivation: Existing methods often underestimate the actual error, which greatly impacts their applicability to real tasks.
Method: Identify "pseudo-label shift", the difference between the predicted and true OOD label distributions, as a key indicator of this underestimation, and leverage optimal transport theory to propose Confidence Optimal Transport (COT), a novel method for estimating model performance.
Results: Experiments on standard benchmarks covering various types of distribution shift (synthetic, novel subpopulation, and natural) show the approach significantly outperforms existing state-of-the-art methods, with up to 3x lower prediction error.

Out-of-distribution (OOD) data poses serious challenges in deployed machine learning models, so methods of predicting a model's performance on OOD data without labels are important for machine learning safety. While a number of methods have been proposed by prior work, they often underestimate the actual error, sometimes by a large margin, which greatly impacts their applicability to real tasks. In this work, we identify pseudo-label shift, or the difference between the predicted and true OOD label distributions, as a key indicator of this underestimation. Based on this observation, we introduce a novel method for estimating model performance by leveraging optimal transport theory, Confidence Optimal Transport (COT), and show that it provably provides more robust error estimates in the presence of pseudo-label shift. Additionally, we introduce an empirically-motivated variant of COT, Confidence Optimal Transport with Thresholding (COTT), which applies thresholding to the individual transport costs and further improves the accuracy of COT's error estimates. We evaluate COT and COTT on a variety of standard benchmarks that induce various types of distribution shift -- synthetic, novel subpopulation, and natural -- and show that our approaches significantly outperform existing state-of-the-art methods with up to 3x lower prediction errors.
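The COT estimate itself is compact: transport the model's softmax outputs on unlabeled target data onto one-hot label vertices weighted by an assumed label marginal (e.g., the source marginal). The sketch below uses the POT library and a total-variation ground cost $c(p, e_k) = \tfrac{1}{2}\lVert p - e_k\rVert_1 = 1 - p_k$, which is our reading of the construction; COTT additionally thresholds the per-pair costs.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def cot_error_estimate(softmax_preds, label_marginal):
    """OT cost between predictions and one-hot vertices as an error estimate.

    softmax_preds:  [n, k] softmax outputs on unlabeled OOD data
    label_marginal: [k] assumed target label distribution (sums to 1)
    """
    n, k = softmax_preds.shape
    cost = 1.0 - softmax_preds               # cost[i, j] = 1 - p_i[j]
    a = np.full(n, 1.0 / n)                  # uniform mass on predictions
    b = np.asarray(label_marginal, float)    # mass on the k one-hot vertices
    return ot.emd2(a, b, cost)               # exact OT cost ~ estimated error
```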

Generalized Semi-Supervised Learning via Self-Supervised Feature Adaptation
Jiachen Liang RuiBing Hou Hong Chang Bingpeng Ma Shiguang Shan Xilin CHEN



Research question: Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent, which rarely holds in realistic scenarios.
Motivation: This paper proposes a new SSL setting in which unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. In this setting, previous SSL methods tend to predict wrong pseudo-labels with the model fitted on labeled data, leading to noise accumulation.
Method: To tackle this, we propose Self-Supervised Feature Adaptation (SSFA), a generic framework for improving SSL performance when labeled and unlabeled data come from different distributions. SSFA decouples pseudo-label prediction from the current model, incorporates a self-supervised task into the SSL framework, and uses it to adapt the model's feature extractor to the unlabeled data, so that the extracted features better fit the unlabeled distribution and yield high-quality pseudo-labels.
Results: Extensive experiments show that SSFA is applicable to various pseudo-label-based SSL learners and significantly improves performance in labeled, unlabeled, and even unseen distributions.

Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent, which rarely holds in realistic scenarios. In this paper, we propose a novel SSL setting, where unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. Under this setting, previous SSL methods tend to predict wrong pseudo-labels with the model fitted on labeled data, resulting in noise accumulation. To tackle this issue, we propose Self-Supervised Feature Adaptation (SSFA), a generic framework for improving SSL performance when labeled and unlabeled data come from different distributions. SSFA decouples the prediction of pseudo-labels from the current model to improve the quality of pseudo-labels. Particularly, SSFA incorporates a self-supervised task into the SSL framework and uses it to adapt the feature extractor of the model to the unlabeled data. In this way, the extracted features better fit the distribution of unlabeled data, thereby generating high-quality pseudo-labels. Extensive experiments show that our proposed SSFA is applicable to various pseudo-label-based SSL learners and significantly improves performance in labeled, unlabeled, and even unseen distributions.

CoDrug: Conformal Drug Property Prediction with Density Estimation under Covariate Shift
Siddhartha Laghuvarapu Zhen Lin Jimeng Sun



Research question: In drug discovery, how to predict drug properties with computational models and quantify the reliability of those predictions.
Motivation: Since model predictions must be validated by costly wet-lab experiments, reliable uncertainty estimates are crucial for prioritizing drug molecules for subsequent experimental validation.
Method: Propose CoDrug, which uses an energy-based model and kernel density estimation to assess the density of a molecule set, then uses the estimated densities to weigh molecule samples while building prediction sets and rectifying for distribution shift.
Results: In extensive experiments involving realistic distribution drifts in various small-molecule drug discovery tasks, CoDrug provides valid prediction sets and addresses the distribution shift arising from de novo drug design models. On average, CoDrug reduces the coverage gap by over 35% compared with conformal prediction sets not adjusted for covariate shift.

In drug discovery, it is vital to confirm the predictions of pharmaceutical properties from computational models using costly wet-lab experiments. Hence, obtaining reliable uncertainty estimates is crucial for prioritizing drug molecules for subsequent experimental validation. Conformal Prediction (CP) is a promising tool for creating such prediction sets for molecular properties with a coverage guarantee. However, the exchangeability assumption of CP is often challenged with covariate shift in drug discovery tasks: Most datasets contain limited labeled data, which may not be representative of the vast chemical space from which molecules are drawn. To address this limitation, we propose a method called CoDrug that employs an energy-based model leveraging both training data and unlabelled data, and Kernel Density Estimation (KDE) to assess the densities of a molecule set. The estimated densities are then used to weigh the molecule samples while building prediction sets and rectifying for distribution shift. In extensive experiments involving realistic distribution drifts in various small-molecule drug discovery tasks, we demonstrate the ability of CoDrug to provide valid prediction sets and its utility in addressing the distribution shift arising from de novo drug design models. On average, using CoDrug can reduce the coverage gap by over 35% when compared to conformal prediction sets not adjusted for covariate shift.
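The covariate-shift correction follows the weighted conformal prediction recipe, with CoDrug supplying the weights from its EBM/KDE density estimates. A minimal sketch of the weighted calibration quantile (names are illustrative; the density-ratio estimation itself is omitted):

```python
import numpy as np

def weighted_conformal_quantile(scores_cal, weights_cal, weight_test, alpha=0.1):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    weights_cal: density ratios w(x_i) = p_test(x_i) / p_cal(x_i) on the
    calibration molecules; weight_test: the ratio at the test molecule.
    """
    w = np.append(np.asarray(weights_cal, float), weight_test)
    p = w / w.sum()                          # normalized weights (test included)
    order = np.argsort(scores_cal)
    cum = np.cumsum(p[:-1][order])           # weighted CDF over sorted cal scores
    idx = np.searchsorted(cum, 1.0 - alpha)  # first score covering mass 1 - alpha
    # If the calibration mass never reaches 1 - alpha the exact quantile is
    # infinite; we clamp to the largest score for this sketch.
    idx = min(idx, len(scores_cal) - 1)
    return np.asarray(scores_cal)[order][idx]

# A prediction set then contains every candidate value y whose nonconformity
# score s(x_test, y) falls below this quantile.
```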

An Iterative Self-Learning Framework for Medical Domain Generalization
Zhenbang Wu Huaxiu Yao David Liebovitz Jimeng Sun



Research question: Deep learning models are widely used in clinical decision-making, but face the challenge of differing data distributions, i.e., the domain shift problem.
Motivation: Existing domain generalization algorithms assume domain categories are known and train a single model to handle all domains, but in healthcare settings patients fall into many unknown latent domains with distinct clinical characteristics, making a single model for all domains sub-optimal.
Method: We propose SLGD, a self-learning framework that iteratively discovers decoupled domains and trains a personalized classifier for each decoupled domain.
Results: We evaluate SLGD's generalizability across spatial and temporal data distribution shifts on two real-world public EHR datasets, eICU and MIMIC-IV; SLGD achieves up to 11% improvement in the AUPRC score over the best baseline.

Deep learning models have been widely used to assist doctors with clinical decision-making. However, these models often encounter a significant performance drop when applied to data that differs from the distribution they were trained on. This challenge is known as the domain shift problem. Existing domain generalization algorithms attempt to address this problem by assuming the availability of domain IDs and training a single model to handle all domains. However, in healthcare settings, patients can be classified into numerous latent domains, where the actual domain categorizations are unknown. Furthermore, each patient domain exhibits distinct clinical characteristics, making it sub-optimal to train a single model for all domains. To overcome these limitations, we propose SLGD, a self-learning framework that iteratively discovers decoupled domains and trains personalized classifiers for each decoupled domain. We evaluate the generalizability of SLGD across spatial and temporal data distribution shifts on two real-world public EHR datasets: eICU and MIMIC-IV. Our results show that SLGD achieves up to 11% improvement in the AUPRC score over the best baseline.

Joint Learning of Label and Environment Causal Independence for Graph Out-of-Distribution Generalization
Shurui Gui Meng Liu Xiner Li Youzhi Luo Shuiwang Ji



Research question: Address graph out-of-distribution (OOD) generalization.
Motivation: Existing graph OOD algorithms either rely on restrictive assumptions or fail to exploit the environment information in training data.
Method: Propose to simultaneously incorporate label and environment causal independence (LECI) to fully exploit label and environment information, addressing the challenges prior methods face in identifying causal and invariant subgraphs. Further develop an adversarial training strategy that jointly optimizes both properties for causal subgraph discovery, with theoretical guarantees.
Results: Experiments and analysis show that LECI significantly outperforms existing methods on both synthetic and real-world datasets, establishing LECI as a practical and effective solution for graph OOD generalization.

We tackle the problem of graph out-of-distribution (OOD) generalization. Existing graph OOD algorithms either rely on restricted assumptions or fail to exploit environment information in training data. In this work, we propose to simultaneously incorporate label and environment causal independence (LECI) to fully make use of label and environment information, thereby addressing the challenges faced by prior methods on identifying causal and invariant subgraphs. We further develop an adversarial training strategy to jointly optimize these two properties for causal subgraph discovery with theoretical guarantees. Extensive experiments and analysis show that LECI significantly outperforms prior methods on both synthetic and real-world datasets, establishing LECI as a practical and effective solution for graph OOD generalization.

Improving Adversarial Transferability via Intermediate-level Perturbation Decay
Qizhang Li Yiwen Guo Wangmeng Zuo Hao Chen



Research question: Existing intermediate-level attack methods require two separate stages, and the resulting deviation in feature space can lead to sub-optimal attacks.
Motivation: To overcome these shortcomings, propose an intermediate-level attack formulated as a single stage of optimization.
Method: Develop intermediate-level perturbation decay (ILPD), a novel method that encourages the intermediate-level perturbation to lie in an effective adversarial direction and to possess a large magnitude simultaneously.
Results: Experiments show the method outperforms the best existing methods by large margins when attacking various victim models, by +10.07% on average on ImageNet and +3.88% on average on CIFAR-10.

Intermediate-level attacks that attempt to perturb feature representations following an adversarial direction drastically have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is required to be determined at first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation deviates from the guide inevitably in the feature space, and it is revealed in this paper that such a deviation may lead to sub-optimal attacks. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms the state of the art by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.

Retaining Beneficial Information from Detrimental Data for Neural Network Repair
Long-Kai Huang Peilin Zhao Junzhou Huang Sinno Pan



Research question: Deep learning model performance relies heavily on training data quality; flaws in the training data can cause generalization failures.
Motivation: Current repair methods identify the training samples that contribute to failures and remove their influence from the model, but this can mistakenly erase beneficial information and hurt model performance.
Method: We propose a new method that leverages knowledge from a retained clean set to identify harmful data, then separates the beneficial and detrimental information within the identified data, and finally exploits the extracted beneficial information to improve model performance.
Results: Experiments show our method outperforms baselines in both identifying harmful data and rectifying model failures. Particularly in scenarios where identification is difficult and a large amount of benign data is involved, our method improves performance while the baselines deteriorate due to erroneous removal of beneficial information.

The performance of deep learning models heavily relies on the quality of the training data. Inadequacies in the training data, such as corrupt input or noisy labels, can lead to the failure of model generalization. Recent studies propose repairing the model by identifying the training samples that contribute to the failure and removing their influence from the model. However, it is important to note that the identified data may contain both beneficial and detrimental information. Simply erasing the information of the identified data from the model can have a negative impact on its performance, especially when accurate data is mistakenly identified as detrimental and removed. To overcome this challenge, we propose a novel approach that leverages the knowledge obtained from a retained clean set. Our method first identifies harmful data by utilizing the clean set, then separates the beneficial and detrimental information within the identified data. Finally, we utilize the extracted beneficial information to enhance the model's performance. Through empirical evaluations, we demonstrate that our method outperforms baseline approaches in both identifying harmful data and rectifying model failures. Particularly in scenarios where identification is challenging and a significant amount of benign data is involved, our method improves performance while the baselines deteriorate due to the erroneous removal of beneficial information.

Truncated Affinity Maximization: One-class Homophily Modeling for Graph Anomaly Detection
Hezhe Qiao Guansong Pang



Research question: This paper addresses a prevalent phenomenon in real-world graph anomaly detection (GAD) datasets: normal nodes have strong connections/affinity with each other, while the homophily of abnormal nodes is significantly weaker than that of normal nodes.
Motivation: Existing GAD methods are typically built with conventional anomaly detection objectives (such as data reconstruction) and ignore this anomaly-discriminative property.
Method: This paper proposes a new unsupervised anomaly scoring measure for GAD, local node affinity, which assigns a larger anomaly score to nodes that are less affiliated with their neighbors, with affinity defined as similarity of node attributes/representations. It further proposes Truncated Affinity Maximization (TAM), which learns tailored node representations by maximizing the local affinity of nodes to their neighbors. Since non-homophily edges (edges connecting normal and abnormal nodes) bias optimization on the original graph structure, TAM iteratively removes non-homophily edges to mitigate this bias.
Results: Extensive empirical results on 10 real-world GAD datasets show that TAM substantially outperforms seven competing models, achieving over 10% improvement in AUROC/AUPRC over the best contenders on challenging datasets. Code is available at https://github.com/mala-lab/TAM-master/.

We reveal a one-class homophily phenomenon, which is one prevalent property we find empirically in real-world graph anomaly detection (GAD) datasets, i.e., normal nodes tend to have strong connection/affinity with each other, while the homophily in abnormal nodes is significantly weaker than normal nodes. However, this anomaly-discriminative property is ignored by existing GAD methods that are typically built using a conventional anomaly detection objective, such as data reconstruction. In this work, we explore this property to introduce a novel unsupervised anomaly scoring measure for GAD -- local node affinity -- that assigns a larger anomaly score to nodes that are less affiliated with their neighbors, with the affinity defined as similarity on node attributes/representations. We further propose Truncated Affinity Maximization (TAM) that learns tailored node representations for our anomaly measure by maximizing the local affinity of nodes to their neighbors. Optimizing on the original graph structure can be biased by non-homophily edges (i.e., edges connecting normal and abnormal nodes). Thus, TAM is instead optimized on truncated graphs where non-homophily edges are removed iteratively to mitigate this bias. The learned representations result in significantly stronger local affinity for normal nodes than abnormal nodes. Extensive empirical results on 10 real-world GAD datasets show that TAM substantially outperforms seven competing models, achieving over 10% increase in AUROC/AUPRC compared to the best contenders on challenging datasets. Our code is available at https://github.com/mala-lab/TAM-master/.
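
As a concrete illustration of the scoring measure, the following is a minimal sketch (not the authors' implementation) of local node affinity as the average cosine similarity between a node's representation and those of its neighbors, with the anomaly score taken as the negative affinity; the tensor shapes and edge-list format are assumptions.

import torch
import torch.nn.functional as F

def local_affinity_scores(z, edge_index):
    """z: [n, d] node representations; edge_index: [2, m] (src, dst) edges.
    Nodes with low average similarity to their neighbors get larger scores."""
    src, dst = edge_index
    sim = F.cosine_similarity(z[src], z[dst], dim=-1)        # similarity per edge
    affinity = torch.zeros(z.size(0)).index_add_(0, src, sim)
    deg = torch.zeros(z.size(0)).index_add_(0, src, torch.ones_like(sim))
    affinity = affinity / deg.clamp(min=1)                   # mean affinity per node
    return -affinity                                         # less affiliated => larger score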

Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang Jiahui Yu Ming-Hsuan Yang Matthew Brown Yin Cui Tuo Zhao Boqing Gong Tianyi Zhou



Research question: Pre-trained multimodal foundation models generalize remarkably well, but their large size poses challenges for deployment.
Motivation: Layerwise distillation is an effective way to reduce model size; in this process, we observe that certain architecture components, referred to as modules, contribute more to the student's performance than others.
Method: We track the contribution of each module by recording the loss decrement after distilling it and distill the modules with greater contributions more frequently. This can be naturally formulated as a multi-armed bandit (MAB) problem, for which we develop a modified Thompson sampling algorithm, OPTIMA, that handles the non-stationarity of module contributions as the model is updated.
Results: We evaluate OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model as the teacher.

Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to choose the modules with greater contributions to distill more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model \citep{yu2022coca} as the teacher model.
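
The bandit view can be sketched as below: a generic Thompson-sampling loop over modules with Gaussian posteriors on recent loss decrements. This is not the paper's OPTIMA algorithm; the module names, sliding-window size, and reward definition are illustrative assumptions.

import random
from collections import deque

class ModuleBandit:
    """Pick which module to distill next: arms = modules, rewards = loss decrements.
    A sliding window over recent rewards crudely handles non-stationarity."""
    def __init__(self, modules, window=50):
        self.history = {m: deque(maxlen=window) for m in modules}

    def select(self):
        def sample(m):
            h = self.history[m]
            if len(h) < 2:
                return float("inf")            # force exploration of untried arms
            mean = sum(h) / len(h)
            var = sum((r - mean) ** 2 for r in h) / (len(h) - 1)
            return random.gauss(mean, (var / len(h)) ** 0.5)
        return max(self.history, key=sample)

    def update(self, module, loss_decrement):
        self.history[module].append(loss_decrement)

# usage: pick a module, distill it, measure the loss drop, feed it back
# bandit = ModuleBandit(["attn", "ffn", "embed"])
# m = bandit.select(); ...distill m...; bandit.update(m, loss_before - loss_after)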

Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization
Swarnadeep Saha Peter Hase Mohit Bansal



Research question: This paper studies whether large language models (LLMs) can be good teachers for weaker agents.
Motivation: Although LLMs perform complex reasoning by generating explanations for their predictions, it is unclear whether they also make good teachers for weaker agents.
Method: This paper builds a student-teacher framework between two LLM agents and studies if, when, and how the teacher should intervene with natural-language explanations to improve the student's performance.
Results: Experiments show that teacher LLMs can indeed intervene on student reasoning to improve performance. Using two few-shot mental models of the student, the teacher can intervene when the utility of intervention is highest, improving student performance at lower budgets, and can personalize explanations for a particular student, outperforming unpersonalized teachers. In multi-turn interactions, teacher explanations generalize: learning from explained data improves student performance on future unexplained data. Finally, the authors verify that misaligned teachers can lower student performance to random chance by intentionally misleading the student.

A hallmark property of explainable AI models is the ability to teach other agents, communicating knowledge of how to perform a task. While Large Language Models (LLMs) perform complex reasoning by generating explanations for their predictions, it is unclear whether they also make good teachers for weaker agents. To address this, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) whether the teacher's test-time intervention improves student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) whether teacher explanations also improve student performance on future unexplained data. We first show that teacher LLMs can indeed intervene on student reasoning to improve their performance. Next, inspired by the Theory of Mind abilities of effective teachers, we propose building two few-shot mental models of the student. The first model defines an Intervention Function that simulates the utility of an intervention, allowing the teacher to intervene when this utility is the highest and improving student performance at lower budgets. The second model enables the teacher to personalize explanations for a particular student and outperform unpersonalized teachers. We also demonstrate that in multi-turn interactions, teacher explanations generalize and learning from explained data improves student performance on future unexplained data. Finally, we also verify that misaligned teachers can lower student performance to random chance by intentionally misleading them.

Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation
Zhongqi Yue Qianru Sun Hanwang Zhang



Research question: Domain adaptation (DA) is challenged by spurious correlations in the target domain, i.e., correlations between domain-specific features (e.g., environment) and domain-invariant features (e.g., class identity) that do not generalize to the target domain.
Motivation: Existing unsupervised domain adaptation (UDA) methods still suffer from this problem even with additional unlabeled target-domain data, because source-domain supervision treats target-domain samples only as auxiliary data (e.g., via pseudo-labeling) and disregards the inherent target-domain distribution in which valuable de-correlation clues hide.
Method: We propose Invariant CONsistency learning (ICON), which gives equal status to the two domains. Specifically, we learn an invariant classifier whose predictions are simultaneously consistent with the labels in the source domain and the clusters in the target domain, thereby removing the spurious correlations in the target domain.
Results: Extensive experiments show that ICON achieves state-of-the-art performance on the classic UDA benchmarks Office-Home and VisDA-2017 and outperforms all conventional methods on the challenging WILDS 2.0 benchmark.

Domain Adaptation (DA) is always challenged by the spurious correlation between the domain-invariant features (e.g., class identity) and the domain-specific ones (e.g., environment) that does not generalize to the target domain. Unfortunately, even enriched with additional unsupervised target domains, existing Unsupervised DA (UDA) methods still suffer from it. This is because the source domain supervision only considers the target domain samples as auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the target domain---where the valuable de-correlation clues hide---is disregarded. We propose to make the U in UDA matter by giving equal status to the two domains. Specifically, we learn an invariant classifier whose prediction is simultaneously consistent with the labels in the source domain and clusters in the target domain, hence the spurious correlation inconsistent in the target domain is removed. We dub our approach "Invariant CONsistency learning" (ICON). Extensive experiments show that ICON achieves the state-of-the-art performance on the classic UDA benchmarks: Office-Home and VisDA-2017, and outperforms all the conventional methods on the challenging WILDS 2.0 benchmark. Code is available at https://github.com/yue-zhongqi/ICON.

Simple and Asymmetric Graph Contrastive Learning without Augmentations
Teng Xiao Huaisheng Zhu Zhengyu Chen Suhang Wang



Research question: How to conduct contrastive learning on homophilic and heterophilic graphs.
Motivation: Existing graph contrastive learning methods rely on prefabricated graph augmentations and the homophily assumption, and fail to generalize well to heterophilic graphs where connected nodes may have different class labels and dissimilar features.
Method: By considering an asymmetric view of neighboring nodes, we propose a simple algorithm, Asymmetric Contrastive Learning for Graphs (GraphACL), that relies on neither graph augmentations nor the homophily assumption.
Results: Experiments show that GraphACL significantly outperforms existing methods for contrastive learning on both homophilic and heterophilic graphs.

Graph Contrastive Learning (GCL) has shown superior performance in representation learning in graph-structured data. Despite their success, most existing GCL methods rely on prefabricated graph augmentation and homophily assumptions. Thus, they fail to generalize well to heterophilic graphs where connected nodes may have different class labels and dissimilar features. In this paper, we study the problem of conducting contrastive learning on homophilic and heterophilic graphs. We find that we can achieve promising performance simply by considering an asymmetric view of the neighboring nodes. The resulting simple algorithm, Asymmetric Contrastive Learning for Graphs (GraphACL), is easy to implement and does not rely on graph augmentations and homophily assumptions. We provide theoretical and empirical evidence that GraphACL can capture one-hop local neighborhood information and two-hop monophily similarity, which are both important for modeling heterophilic graphs. Experimental results show that the simple GraphACL significantly outperforms state-of-the-art graph contrastive learning and self-supervised learning methods on homophilic and heterophilic graphs. The code of GraphACL is available at https://github.com/tengxiao1/GraphACL.
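
A minimal sketch of the asymmetric idea, under the assumption that each node's neighbor serves as the positive view and all in-batch nodes serve as negatives, with a predictor head applied to one side only; the encoder, predictor, and temperature are illustrative, not the paper's exact objective.

import torch
import torch.nn.functional as F

def graph_acl_loss(encoder, predictor, x, edge_index, tau=0.5):
    """Asymmetric, augmentation-free contrastive loss over a batch of edges.
    encoder: node features -> embeddings; predictor: small MLP applied to one
    side only, which is what makes the two views asymmetric."""
    z = F.normalize(encoder(x), dim=-1)            # [n, d] target representations
    src, dst = edge_index
    q = F.normalize(predictor(z[src]), dim=-1)     # predicted neighbor view
    logits = q @ z.T / tau                         # [m, n]: all nodes as candidates
    return F.cross_entropy(logits, dst)            # the actual neighbor is the positive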

Bounding the Invertibility of Privacy-preserving Instance Encoding using Fisher Information
Kiwan Maeng Chuan Guo Sanjay Kariyappa G. Edward Suh



Research question: How to encode raw data into feature vectors without revealing privacy-sensitive information.
Motivation: The vast majority of existing schemes do not theoretically justify that their encodings are non-invertible; their privacy-enhancing properties are only validated empirically against a limited set of attacks.
Method: Propose a theoretically principled measure of the invertibility of instance encoding based on Fisher information that is broadly applicable to a wide range of popular encoders.
Results: Show that dFIL can bound the invertibility of encodings both theoretically and empirically, providing an intuitive interpretation of the privacy of instance encoding.

Privacy-preserving instance encoding aims to encode raw data into feature vectors without revealing their privacy-sensitive information. When designed properly, these encodings can be used for downstream ML applications such as training and inference with limited privacy risk. However, the vast majority of existing schemes do not theoretically justify that their encoding is non-invertible, and their privacy-enhancing properties are only validated empirically against a limited set of attacks. In this paper, we propose a theoretically-principled measure for the invertibility of instance encoding based on Fisher information, dFIL, that is broadly applicable to a wide range of popular encoders. We show that dFIL can be used to bound the invertibility of encodings both theoretically and empirically, providing an intuitive interpretation of the privacy of instance encoding.
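
For intuition, here is a sketch of a Fisher-information-style invertibility score for an encoder with additive Gaussian noise: with output e = f(x) + N(0, sigma^2 I), the Fisher information of x given e is J^T J / sigma^2 where J is the Jacobian of f, and averaging its diagonal gives a per-feature leakage number. This is a simplified reading of the construction; the noise model and the averaging are assumptions, not the paper's exact definition.

import torch

def diag_fisher_leakage(encoder, x, sigma):
    """Average diagonal Fisher information of input x for an encoding
    e = encoder(x) + Gaussian noise with std sigma (assumed noise model)."""
    J = torch.autograd.functional.jacobian(encoder, x)   # [out_dim, in_dim]
    fisher_diag = (J ** 2).sum(dim=0) / sigma ** 2       # diag of J^T J / sigma^2
    return fisher_diag.mean().item()                      # higher => easier to invert

# usage (hypothetical encoder mapping R^10 -> R^4):
# enc = torch.nn.Linear(10, 4)
# score = diag_fisher_leakage(lambda v: enc(v), torch.randn(10), sigma=0.1)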

Partial Multi-Label Learning with Probabilistic Graphical Disambiguation
Jun-Yi Hang Min-Ling Zhang



Research question: In partial multi-label learning (PML), each training example is associated with a set of candidate labels among which only some are valid; existing methods mainly rely on heuristics or ad-hoc rules to disambiguate candidate labels.
Motivation: To provide a principled way of disambiguation, we make a first attempt to explore probabilistic graphical models for the PML problem, tailoring a directed graph to infer latent ground-truth labeling information from the generative process of partial multi-label data.
Method: Under the stochastic gradient variational Bayes framework, a unified variational lower bound is derived for this graphical model and further relaxed probabilistically so that the desired prediction model can be induced along with simultaneously identified ground-truth labeling information.
Results: Comprehensive experiments on multiple synthetic and real-world datasets show that our approach outperforms state-of-the-art counterparts.

In partial multi-label learning (PML), each training example is associated with a set of candidate labels, among which only some labels are valid. As a common strategy to tackle PML problem, disambiguation aims to recover the ground-truth labeling information from such inaccurate annotations. However, existing approaches mainly rely on heuristics or ad-hoc rules to disambiguate candidate labels, which may not be universal enough in complicated real-world scenarios. To provide a principled way for disambiguation, we make a first attempt to explore the probabilistic graphical model for PML problem, where a directed graph is tailored to infer latent ground-truth labeling information from the generative process of partial multi-label data. Under the framework of stochastic gradient variational Bayes, a unified variational lower bound is derived for this graphical model, which is further relaxed probabilistically so that the desired prediction model can be induced with simultaneously identified ground-truth labeling information. Comprehensive experiments on multiple synthetic and real-world data sets show that our approach outperforms the state-of-the-art counterparts.

Learning to Group Auxiliary Datasets for Molecule
Tinglin Huang Ziniu Hu Zhitao Ying



Research question: The limited availability of annotations in small-molecule datasets poses a challenge to machine learning models.
Motivation: Although adding data can mitigate this problem, more data does not always bring improvement: knowledge in auxiliary molecule datasets may differ from or contradict that of the target dataset, causing negative transfer.
Method: We propose MolGroup, which separates dataset affinity into task affinity and structure affinity, combining graph-structure similarity and task similarity to predict the potential benefit of each auxiliary molecule dataset.
Results: Experiments show that training GIN/Graphormer with the groups of molecule datasets selected by MolGroup improves average performance by 4.41%/3.47% on 11 target molecule datasets.

The limited availability of annotations in small molecule datasets presents a challenge to machine learning models. To address this, one common strategy is to collaborate with additional auxiliary datasets. However, having more data does not always guarantee improvements. Negative transfer can occur when the knowledge in the target dataset differs or contradicts that of the auxiliary molecule datasets. In light of this, identifying the auxiliary molecule datasets that can benefit the target dataset when jointly trained remains a critical and unresolved problem. Through an empirical analysis, we observe that combining graph structure similarity and task similarity can serve as a more reliable indicator for identifying high-affinity auxiliary datasets. Motivated by this insight, we propose MolGroup, which separates the dataset affinity into task and structure affinity to predict the potential benefits of each auxiliary molecule dataset. MolGroup achieves this by utilizing a routing mechanism optimized through a bi-level optimization framework. Empowered by the meta gradient, the routing mechanism is optimized toward maximizing the target dataset's performance and quantifies the affinity as the gating score. As a result, MolGroup is capable of predicting the optimal combination of auxiliary datasets for each target dataset. Our extensive experiments demonstrate the efficiency and effectiveness of MolGroup, showing an average improvement of 4.41%/3.47% for GIN/Graphormer trained with the group of molecule datasets selected by MolGroup on 11 target molecule datasets.

Towards Generic Semi-Supervised Framework for Volumetric Medical Image Segmentation
Haonan Wang Xiaomeng Li



Research question: How to use semi-supervised learning (SSL) to address the labeling problem for 3D medical images, particularly in complex settings such as unsupervised domain adaptation (UDA) and semi-supervised domain generalization (SemiDG).
Motivation: Since volume-wise labeling of 3D medical images requires expertise and is time-consuming, there is growing demand for training models with SSL; however, existing SSL methods face challenges and limited applicability in scenarios such as UDA and SemiDG.
Method: Propose a generic SSL framework consisting of an aggregating part and a decoupling part. The aggregating part uses a Diffusion encoder to extract distribution-invariant features from aggregated information across multiple distributions/domains, constructing a "common knowledge set". The decoupling part uses three decoders to decouple training on labeled and unlabeled data, thereby avoiding over-fitting to labeled data, specific domains, and classes.
Results: Evaluated on four benchmark datasets covering SSL, class-imbalanced SSL, UDA, and SemiDG, the framework shows notable improvements over existing methods, indicating its potential to handle more challenging SSL scenarios.

Volume-wise labeling in 3D medical images is a time-consuming task that requires expertise. As a result, there is growing interest in using semi-supervised learning (SSL) techniques to train models with limited labeled data. However, the challenges and practical applications extend beyond SSL to settings such as unsupervised domain adaptation (UDA) and semi-supervised domain generalization (SemiDG). This work aims to develop a generic SSL framework that can handle all three settings. We identify two main obstacles to achieving this goal in the existing SSL framework: 1) the weakness of capturing distribution-invariant features; and 2) the tendency for unlabeled data to be overwhelmed by labeled data, leading to over-fitting to the labeled data during training. To address these issues, we propose an Aggregating & Decoupling framework. The aggregating part consists of a Diffusion encoder that constructs a "common knowledge set" by extracting distribution-invariant features from aggregated information from multiple distributions/domains. The decoupling part consists of three decoders that decouple the training process with labeled and unlabeled data, thus avoiding over-fitting to labeled data, specific domains and classes. We evaluate our proposed framework on four benchmark datasets for SSL, Class-imbalanced SSL, UDA and SemiDG. The results showcase notable improvements compared to state-of-the-art methods across all four settings, indicating the potential of our framework to tackle more challenging SSL scenarios. Code and models are available at: https://github.com/xmed-lab/GenericSSL.

Loss Decoupling for Task-Agnostic Continual Learning
Yan-Shuo Liang Wu-Jun Li



Research question: This paper addresses the task-agnostic problem in continual learning, where task identities are unavailable at inference time and the model must learn to distinguish all classes across all tasks.
Motivation: Existing task-agnostic continual learning methods typically mix together the two new objectives involved in learning a new task, which prevents the model from achieving a good trade-off between stability and plasticity.
Method: This paper proposes a simple yet effective method, loss decoupling (LODE), which separates the two objectives of a new task, distinguishing new classes from old ones and distinguishing among the new classes, by decoupling the loss of the new task.
Results: Experiments show that LODE outperforms existing state-of-the-art replay-based methods on multiple continual learning datasets.

Continual learning requires the model to learn multiple tasks in a sequential order. To perform continual learning, the model must possess the abilities to maintain performance on old tasks (stability) and adapt itself to learn new tasks (plasticity). Task-agnostic problem in continual learning is a challenging problem, in which task identities are not available in the inference stage and hence the model must learn to distinguish all the classes in all the tasks. In task-agnostic problem, the model needs to learn two new objectives for learning a new task, including distinguishing new classes from old classes and distinguishing between different new classes. For task-agnostic problem, replay-based methods are commonly used. These methods update the model with both saved old samples and new samples for continual learning. Most existing replay-based methods mix the two objectives in task-agnostic problem together, inhibiting the models from achieving a good trade-off between stability and plasticity. In this paper, we propose a simple yet effective method, called loss decoupling (LODE), for task-agnostic continual learning. LODE separates the two objectives for the new task by decoupling the loss of the new task. As a result, LODE can assign different weights for different objectives, which provides a way to obtain a better trade-off between stability and plasticity than those methods with coupled loss. Experiments show that LODE can outperform existing state-of-the-art replay-based methods on multiple continual learning datasets.
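
The decoupling can be sketched as follows: for a batch of new-task samples, the loss is split into a term that discriminates among the new classes only and a term that separates new classes from old ones, each with its own weight. The exact decomposition and weighting in LODE may differ; this sketch only illustrates the separation.

import torch
import torch.nn.functional as F

def decoupled_new_task_loss(logits, y, n_old, w_new=1.0, w_sep=1.0):
    """logits: [b, C] over all classes seen so far; y: labels of new-task
    samples (new class indices assumed to start at n_old).
    Term 1: distinguish among new classes (logits restricted to new classes).
    Term 2: distinguish new vs. old (classes collapsed into two super-logits)."""
    loss_new = F.cross_entropy(logits[:, n_old:], y - n_old)
    group = torch.stack([logits[:, :n_old].logsumexp(-1),   # "old" super-logit
                         logits[:, n_old:].logsumexp(-1)],  # "new" super-logit
                        dim=-1)
    target = torch.ones(len(y), dtype=torch.long)           # all samples are "new"
    loss_sep = F.cross_entropy(group, target)
    return w_new * loss_new + w_sep * loss_sep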

Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models
Sungik Choi Hankook Lee Honglak Lee Moontae Lee



Research question: How to effectively detect abnormal (i.e., out-of-distribution) samples, especially when they share similar background information with in-distribution samples.
Motivation: Existing generative-model-based novelty detection methods mainly exploit the reconstruction property of in-distribution samples, but perform poorly at detecting out-of-distribution samples whose background information resembles the in-distribution data.
Method: Propose a new novelty detection method, Projection Regret (PR), which detects abnormality by computing the perceptual distance between a test image and its diffusion-based projection; to cancel out background bias, this perceptual distance is compared against recursive projections.
Results: Experiments show that PR outperforms existing generative-model-based novelty detection methods, with especially notable gains on out-of-distribution samples that share similar background information with the in-distribution data.

Novelty detection is a fundamental task of machine learning which aims to detect abnormal (*i.e.* out-of-distribution (OOD)) samples. Since diffusion models have recently emerged as the de facto standard generative framework with surprising generation results, novelty detection via diffusion models has also gained much attention. Recent methods have mainly utilized the reconstruction property of in-distribution samples. However, they often suffer from detecting OOD samples that share similar background information to the in-distribution data. Based on our observation that diffusion models can *project* any sample to an in-distribution sample with similar background information, we propose *Projection Regret (PR)*, an efficient novelty detection method that mitigates the bias of non-semantic information. To be specific, PR computes the perceptual distance between the test image and its diffusion-based projection to detect abnormality. Since the perceptual distance often fails to capture semantic changes when the background information is dominant, we cancel out the background bias by comparing it against recursive projections. Extensive experiments demonstrate that PR outperforms the prior art of generative-model-based novelty detection methods by a significant margin.

Reproducibility in Multiple Instance Learning: A Case For Algorithmic Unit Tests
Edward Raff James Holt



Research question: Multiple instance learning (MIL) is a classification setting with positive and negative labels over "bags" of inputs, where a bag is positive if and only if it contains a positive element. Training in this context requires associating the bag-level label with instance-level information, and implicitly carries a causal assumption and an asymmetry of the task.
Motivation: MIL problems occur in healthcare, cybersecurity, and many other tasks. However, we find that none of the five most prominent deep MIL models respects the standard MIL assumption: they can learn anti-correlated instances, defaulting to a "positive" label until seeing a negative counter-example, which risks learning incorrect models and operational failure.
Method: We propose "algorithmic unit tests": synthetic datasets that can be solved by a MIL-respecting model and that clearly reveal learning that violates the MIL assumption. Each of the five evaluated methods fails one or more of these tests.
Results: This approach provides a model-agnostic way to identify violations of modeling assumptions, which we hope will be useful for the future development and evaluation of MIL models.

Multiple Instance Learning (MIL) is a sub-domain of classification problems with positive and negative labels and a "bag" of inputs, where the label is positive if and only if a positive element is contained within the bag, and otherwise is negative. Training in this context requires associating the bag-wide label to instance-level information, and implicitly contains a causal assumption and asymmetry to the task (i.e., you can't swap the labels without changing the semantics). MIL problems occur in healthcare (one malignant cell indicates cancer), cyber security (one malicious executable makes an infected computer), and many other tasks. In this work, we examine five of the most prominent deep-MIL models and find that none of them respects the standard MIL assumption. They are able to learn anti-correlated instances, i.e., defaulting to "positive" labels until seeing a negative counter-example, which should not be possible for a correct MIL model. We suspect that enhancements and other works derived from these models will share the same issue. In any context in which these models are being used, this creates the potential for learning incorrect models, which creates risk of operational failure. We identify and demonstrate this problem via a proposed ``algorithmic unit test'', where we create synthetic datasets that can be solved by a MIL respecting model, and which clearly reveal learning that violates MIL assumptions. The five evaluated methods each fail one or more of these tests. This provides a model-agnostic way to identify violations of modeling assumptions, which we hope will be useful for future development and evaluation of MIL models.
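
The spirit of such a unit test can be emulated in a few lines: build bags where the label is positive iff a signal instance is present, then check whether a trained model also scores a bag containing only a strong positive instance as positive. The dimensions, signal strength, and threshold below are arbitrary illustrative choices, not the paper's exact tests.

import numpy as np

def make_mil_bags(n_bags=200, bag_size=10, dim=8, seed=0):
    """Synthetic MIL data: a bag is positive iff it contains >= 1 signal instance."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_bags, bag_size, dim))
    y = rng.integers(0, 2, size=n_bags)
    for i in np.where(y == 1)[0]:
        X[i, rng.integers(bag_size)] += 3.0 * np.ones(dim)  # plant one positive instance
    return X, y

def unit_test_single_positive(model_score, dim=8):
    """A MIL-respecting model must label a bag that consists of a single
    (strong) positive instance as positive -- no negative evidence needed."""
    bag = 3.0 * np.ones((1, dim))
    assert model_score(bag) > 0.5, "model violates the standard MIL assumption"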

Multi-task learning with summary statistics
Parker Knight Rui Duan



Research question: How to use multi-task learning to integrate data from multiple sources, especially under the data-sharing constraints common in healthcare settings.
Motivation: Existing multi-task learning methods are hindered in real-world applications by data-sharing constraints, especially in healthcare.
Method: Propose a flexible multi-task learning framework that trains on summary statistics from various sources, together with an adaptive parameter selection approach based on a variant of Lepski's method.
Results: Extensive simulations demonstrate the theoretical results and the performance of the method, offering a more flexible tool for training related models across domains, with practical implications for genetic risk prediction and many other fields.

Multi-task learning has emerged as a powerful machine learning paradigm for integrating data from multiple sources, leveraging similarities between tasks to improve overall model performance. However, the application of multi-task learning to real-world settings is hindered by data-sharing constraints, especially in healthcare settings. To address this challenge, we propose a flexible multi-task learning framework utilizing summary statistics from various sources. Additionally, we present an adaptive parameter selection approach based on a variant of Lepski's method, allowing for data-driven tuning parameter selection when only summary statistics are accessible. Our systematic non-asymptotic analysis characterizes the performance of the proposed methods under various regimes of the source datasets' sample complexity and overlap. We demonstrate our theoretical findings and the performance of the method through extensive simulations. This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction and many other fields.

Data Quality in Imitation Learning
Suneel Belkhale Yuchen Cui Dorsa Sadigh



Research question: This paper addresses the state distribution shift caused by action-prediction errors in imitation learning, and how to assess and curate datasets to improve data quality.
Motivation: Offline robot learning lacks internet-scale data, so high-quality datasets are a necessity; in imitation learning, compounding action-prediction errors cause state distribution shift at test time, leading to unseen states from which the policy cannot recover.
Method: This paper proposes a new way of assessing and curating datasets, defining metrics of "data quality" that encourage the policy to stay in distribution at test time. We propose two fundamental properties, action divergence and transition diversity, and analyze the effect of these two key properties in imitation learning both theoretically and empirically.
Results: Experiments show that state diversity is not always beneficial, and demonstrate how action divergence and transition diversity interact in practice.

In supervised learning, the question of data quality and curation has been sidelined in recent years in favor of increasingly more powerful and expressive models that can ingest internet-scale data. However, in offline learning for robotics, we simply lack internet scale data, and so high quality datasets are a necessity. This is especially true in imitation learning (IL), a sample efficient paradigm for robot learning using expert demonstrations. Policies learned through IL suffer from state distribution shift at test time due to compounding errors in action prediction, which leads to unseen states that the policy cannot recover from. Instead of designing new algorithms to address distribution shift, an alternative perspective is to develop new ways of assessing and curating datasets. There is growing evidence that the same IL algorithms can have substantially different performance across different datasets. This calls for a formalism for defining metrics of "data quality" that can further be leveraged for data curation. In this work, we take the first step toward formalizing data quality for imitation learning through the lens of distribution shift: a high quality dataset encourages the policy to stay in distribution at test time. We propose two fundamental properties that are necessary for a high-quality dataset: i) action divergence: the mismatch between the expert and learned policy at certain states; and ii) transition diversity: the noise present in the system for a given state and action. We investigate the combined effect of these two key properties in imitation learning theoretically, and we empirically analyze models trained on a variety of different data sources. We show that state diversity is not always beneficial, and we demonstrate how action divergence and transition diversity interact in practice.

Collaboratively Learning Linear Models with Structured Missing Data
Chen Cheng Gary Cheng John Duchi



Research question: This paper studies how to collaboratively learn least-squares estimates when multiple agents observe different subsets of the features.
Motivation: Each agent observes a different subset of features, e.g., data collected from sensors of varying resolution; the goal is to determine how to coordinate the agents so as to produce the best estimator for each agent.
Method: Propose Collab, a distributed, semi-supervised algorithm consisting of three steps: local training, aggregation, and distribution. The procedure does not require communicating labeled data, making it communication-efficient and useful in settings where labeled data is inaccessible.
Results: Despite this handicap, the method is nearly asymptotically local-minimax optimal, even among estimators allowed to communicate labeled data, such as imputation methods. We test the method on US Census data and discuss extensions to non-Gaussian feature settings, non-linear settings, and federated learning.

We study the problem of collaboratively learning least squares estimates for $m$ agents. Each agent observes a different subset of the features---e.g., containing data collected from sensors of varying resolution. Our goal is to determine how to coordinate the agents in order to produce the best estimator for each agent. We propose a distributed, semi-supervised algorithm Collab, consisting of three steps: local training, aggregation, and distribution. Our procedure does not require communicating the labeled data, making it communication efficient and useful in settings where the labeled data is inaccessible. Despite this handicap, our procedure is nearly asymptotically local-minimax optimal---even among estimators allowed to communicate the labeled data such as imputation methods. We test our method on US Census data. We also discuss generalizations of our method to non-Gaussian feature settings, non-linear settings, and Federated Learning.
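
A toy rendition of the three-step structure (local training, aggregation, distribution) for agents that observe different coordinate subsets of a shared feature space; the coordinate-wise averaging rule here is a naive placeholder, not the paper's estimator, and Gaussian features are assumed.

import numpy as np

def collab_toy(agents):
    """agents: list of (X, y, feat_idx) where feat_idx maps the local columns
    of X into a shared d-dimensional feature space. Returns one estimate per agent."""
    d = 1 + max(max(idx) for _, _, idx in agents)
    # 1) local training: each agent fits OLS on its own features
    local = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y, _ in agents]
    # 2) aggregation: embed local coefficients in R^d and average coordinate-wise
    theta, counts = np.zeros(d), np.zeros(d)
    for beta, (_, _, idx) in zip(local, agents):
        theta[list(idx)] += beta
        counts[list(idx)] += 1
    theta /= np.maximum(counts, 1)
    # 3) distribution: each agent takes the restriction to its own features
    return [theta[list(idx)] for _, _, idx in agents]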

Blurred-Dilated Method for Adversarial Attacks
Yang Deng Weibin Wu Jianping Zhang Zibin Zheng



Research question: Deep neural networks are vulnerable to adversarial attacks that cause incorrect predictions. In black-box settings, transfer attacks can conveniently generate adversarial examples, but such examples tend to overfit the specific architecture and feature representations of the source model, resulting in poor attack performance against other target models.
Motivation: To overcome this drawback, we propose a new model-modification-based transfer attack: the Blurred-Dilated method (BD).
Method: BD modifies the source model by reducing downsampling while introducing BlurPool and dilated convolutions, and then uses the modified source model to generate adversarial examples. We argue that BD preserves feature information more comprehensively than the original source model, enabling more thorough destruction of image features and improving the transferability of the generated adversarial examples.
Results: Extensive experiments on the ImageNet dataset show that adversarial examples generated by BD achieve significantly higher transferability than state-of-the-art baselines; moreover, BD can be conveniently combined with existing black-box attack techniques to further improve their performance.

Deep neural networks (DNNs) are vulnerable to adversarial attacks, which lead to incorrect predictions. In black-box settings, transfer attacks can be conveniently used to generate adversarial examples. However, such examples tend to overfit the specific architecture and feature representations of the source model, resulting in poor attack performance against other target models. To overcome this drawback, we propose a novel model modification-based transfer attack: Blurred-Dilated method (BD) in this paper. In summary, BD works by reducing downsampling while introducing BlurPool and dilated convolutions in the source model. Then BD employs the modified source model to generate adversarial samples. We think that BD can more comprehensively preserve the feature information than the original source model. It thus enables more thorough destruction of the image features, which can improve the transferability of the generated adversarial samples. Extensive experiments on the ImageNet dataset show that adversarial examples generated by BD achieve significantly higher transferability than the state-of-the-art baselines. Besides, BD can be conveniently combined with existing black-box attack techniques to further improve their performance.

CADet: Fully Self-Supervised Out-Of-Distribution Detection With Contrastive Learning
Charles Guille-Escuret Pau Rodriguez David Vazquez Ioannis Mitliagkas Joao Monteiro



Research question: This paper explores using self-supervised contrastive learning to simultaneously detect two types of OOD samples: unseen classes and adversarial perturbations.
Motivation: Handling out-of-distribution (OOD) samples has become a major stake in the real-world deployment of machine learning systems.
Method: Pair self-supervised contrastive learning with the maximum mean discrepancy (MMD) two-sample test to robustly test whether two independent sets of samples originate from the same distribution, and introduce CADet, a new single-sample OOD detection method that draws inspiration from MMD but leverages the similarity between contrastive transformations of the same sample.
Results: Experiments show that this approach discriminates CIFAR-10 from CIFAR-10.1 with higher confidence than previous work. CADet outperforms existing adversarial detection methods at identifying adversarially perturbed samples and achieves performance comparable to unseen-label detection methods on the challenging ImageNet-O and iNaturalist benchmarks. Importantly, CADet is fully self-supervised, requiring neither labels for in-distribution samples nor access to OOD examples.

Handling out-of-distribution (OOD) samples has become a major stake in the real-world deployment of machine learning systems. This work explores the use of self-supervised contrastive learning to the simultaneous detection of two types of OOD samples: unseen classes and adversarial perturbations. First, we pair self-supervised contrastive learning with the maximum mean discrepancy (MMD) two-sample test. This approach enables us to robustly test whether two independent sets of samples originate from the same distribution, and we demonstrate its effectiveness by discriminating between CIFAR-10 and CIFAR-10.1 with higher confidence than previous work. Motivated by this success, we introduce CADet (Contrastive Anomaly Detection), a novel method for OOD detection of single samples. CADet draws inspiration from MMD, but leverages the similarity between contrastive transformations of the same sample. CADet outperforms existing adversarial detection methods in identifying adversarially perturbed samples on ImageNet and achieves comparable performance to unseen label detection methods on two challenging benchmarks: ImageNet-O and iNaturalist. Significantly, CADet is fully self-supervised and requires neither labels for in-distribution samples nor access to OOD examples.
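
The first ingredient, an MMD two-sample test on contrastively learned features, can be sketched as follows with an RBF kernel; the median-distance bandwidth heuristic and the permutation test are standard choices assumed here, not details taken from the paper.

import torch

def mmd2(x, y, bandwidth):
    """Biased MMD^2 estimate between feature sets x: [n, d] and y: [m, d]."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def permutation_test(x, y, n_perm=500):
    """p-value for H0: x and y come from the same distribution."""
    z = torch.cat([x, y])
    bw = torch.cdist(z, z).median()           # median heuristic bandwidth
    obs = mmd2(x, y, bw)
    hits = 0
    for _ in range(n_perm):
        perm = z[torch.randperm(len(z))]
        hits += int(mmd2(perm[:len(x)], perm[len(x):], bw) >= obs)
    return (hits + 1) / (n_perm + 1)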

Neural Priming for Sample-Efficient Adaptation
Matthew Wallingford Vivek Ramanujan Alex Fang Aditya Kusupati Roozbeh Mottaghi Aniruddha Kembhavi Ludwig Schmidt Ali Farhadi



Research question: How to adapt large pretrained models to distribution shifts and downstream tasks given few or no labeled examples.
Motivation: Faced with limited labeled data and changing distributions, existing pretrained models often underperform.
Method: Propose Neural Priming, which performs lightweight test-time updates on relevant data seen during pretraining, enabling the model to adapt to new distributions.
Results: Neural Priming achieves notable accuracy improvements on ImageNet and several transfer-learning benchmarks, demonstrating its effectiveness in addressing limited labeled data and distribution shift.

We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks given few or no labeled examples. Presented with class names or unlabeled test samples, Neural Priming enables the model to recall relevant data seen throughout pretraining and condition its parameters on it, thereby priming it for the test distribution. Neural Priming can be performed at test time, even for pretraining datasets as large as LAION-2B. Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in accuracy on ImageNet and 3.81% accuracy improvement on average across standard transfer learning benchmarks. Further, using our test time inference scheme, we see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the effectiveness of Neural Priming in addressing the common challenge of limited labeled data and changing distributions. Code and models are open-sourced at [https://www.github.com/RAIVNLab/neural-priming](https://www.github.com/RAIVNLab/neural-priming).

Easy Learning from Label Proportions
Robert Istvan Busa-Fekete Heejin Choi Travis Dick Claudio Gentile Andres Munoz medina



Research question: Learning from label proportions (LLP) is a weakly supervised classification setting where instances are grouped into i.i.d. "bags" and only the frequency of class labels in each bag is available, while the learner's objective is to achieve low task loss at the individual-instance level.
Motivation: We propose EASYLLP, a flexible and simple-to-implement debiasing approach based on aggregate labels that works with arbitrary loss functions and can accurately estimate the expected loss of an arbitrary model at the individual level.
Method: We elucidate the differences between our method and standard approaches based on label-proportion matching in terms of applicability and optimality conditions, and apply our method to popular learning frameworks such as empirical risk minimization (ERM) and stochastic gradient descent (SGD), with guarantees on instance-level performance.
Results: Finally, we validate our theoretical results on multiple datasets, empirically illustrating the conditions under which our method is expected to perform better or worse than previous LLP approaches.

We consider the problem of Learning from Label Proportions (LLP), a weakly supervised classification setup where instances are grouped into i.i.d. “bags”, and only the frequency of class labels at each bag is available. Nevertheless, the objective of the learner is to achieve low task loss at an individual instance level. Here we propose EASYLLP, a flexible and simple-to-implement debiasing approach based on aggregate labels, which operates on arbitrary loss functions. Our technique allows us to accurately estimate the expected loss of an arbitrary model at an individual level. We elucidate the differences between our method and standard methods based on label proportion matching, in terms of applicability and optimality conditions. We showcase the flexibility of our approach compared to alternatives by applying our method to popular learning frameworks, like Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD) with provable guarantees on instance level performance. Finally, we validate our theoretical results on multiple datasets, empirically illustrating the conditions under which our algorithm is expected to perform better or worse than previous LLP approaches.
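
For reference, the standard proportion-matching baseline that the paper contrasts against looks like the sketch below: the bag-level loss matches the mean predicted positive probability to the observed bag proportion. EASYLLP itself replaces this with a debiased instance-level loss estimate; the sketch shows only the baseline, and the squared-error form is one common choice among several.

import torch

def proportion_matching_loss(model, bags, proportions):
    """bags: list of [k, d] instance tensors; proportions: iterable of observed
    positive-label frequencies per bag. The classic LLP surrogate: match the
    mean predicted probability of each bag to its label proportion."""
    losses = []
    for bag, alpha in zip(bags, proportions):
        p = torch.sigmoid(model(bag)).mean()   # mean predicted positive probability
        losses.append((p - alpha) ** 2)
    return torch.stack(losses).mean()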

Distributionally Robust Ensemble of Lottery Tickets Towards Calibrated Sparse Network Training
Hitesh Sapkota Dingrong Wang ZHIQIANG TAO Qi Yu



Research question: How to achieve calibrated network predictions, improving model reliability especially when addressing overconfidence and out-of-distribution cases.
Motivation: Recent sparse network training methods can find sparse sub-networks from a dense one, but focus mainly on matching the accuracy of the dense counterparts while neglecting network calibration.
Method: Propose a novel distributionally robust optimization (DRO) framework that, guided by uncertainty sets, learns multiple diverse and complementary sparse sub-networks (tickets) toward calibrated network sparsification.
Results: Experiments show that the proposed lottery ticket ensemble clearly improves calibration without sacrificing accuracy or increasing inference cost; experiments on OOD datasets further demonstrate the robustness of the approach in open-set environments.

The recently developed sparse network training methods, such as Lottery Ticket Hypothesis (LTH) and its variants, have shown impressive learning capacity by finding sparse sub-networks from a dense one. While these methods could largely sparsify deep networks, they generally focus more on realizing comparable accuracy to dense counterparts yet neglect network calibration. However, how to achieve calibrated network predictions lies at the core of improving model reliability, especially when it comes to addressing the overconfident issue and out-of-distribution cases. In this study, we propose a novel Distributionally Robust Optimization (DRO) framework to achieve an ensemble of lottery tickets towards calibrated network sparsification. Specifically, the proposed DRO ensemble aims to learn multiple diverse and complementary sparse sub-networks (tickets) with the guidance of uncertainty sets, which encourage tickets to gradually capture different data distributions from easy to hard and naturally complement each other. We theoretically justify the strong calibration performance by showing how the proposed robust training process guarantees to lower the confidence of incorrect predictions. Extensive experimental results on several benchmarks show that our proposed lottery ticket ensemble leads to a clear calibration improvement without sacrificing accuracy and burdening inference costs. Furthermore, experiments on OOD datasets demonstrate the robustness of our approach in the open-set environment.

Actively Testing Your Model While It Learns: Realizing Label-Efficient Learning in Practice
Dayou Yu Weishi Shi Qi Yu



Research question: This paper addresses the data-annotation cost of the testing phase in active learning, as well as the disconnect between active learning and active testing.
Motivation: Current active learning methods focus on reducing annotation cost for model training, but the testing phase (model evaluation) also requires annotations whose cost is under-explored; moreover, existing active testing or active evaluation methods treat the learning and testing ends separately.
Method: This paper proposes an integrated active testing while learning (ATL) framework that tests periodically during active learning, enabling fair model evaluation and effective early stopping to further save total annotation cost. ATL also introduces an "active feedback" mechanism inspired by human learning, where the teacher (active tester) provides immediate guidance based on the prior performance of the student (active learner).
Results: Theoretical analysis and experiments on real-world datasets show that the ATL framework effectively improves the annotation efficiency of both active learning and evaluation tasks, while maintaining the label complexity of the integrated learning-testing objective and improving the model's generalization capability.

In active learning (AL), we focus on reducing the data annotation cost from the model training perspective. However, "testing'', which often refers to the model evaluation process of using empirical risk to estimate the intractable true generalization risk, also requires data annotations. The annotation cost for "testing'' (model evaluation) is under-explored. Even in works that study active model evaluation or active testing (AT), the learning and testing ends are disconnected. In this paper, we propose a novel active testing while learning (ATL) framework that integrates active learning with active testing. ATL provides an unbiased sample-efficient estimation of the model risk during active learning. It leverages test samples annotated from different periods of a dynamic active learning process to achieve fair model evaluations based on a theoretically guaranteed optimal integration of different test samples. Periodic testing also enables effective early-stopping to further save the total annotation cost. ATL further integrates an "active feedback'' mechanism, which is inspired by human learning, where the teacher (active tester) provides immediate guidance given by the prior performance of the student (active learner). Our theoretical result reveals that active feedback maintains the label complexity of the integrated learning-testing objective, while improving the model's generalization capability. We study the realistic setting where we maximize the performance gain from choosing "testing'' samples for feedback without sacrificing the risk estimation accuracy. An agnostic-style analysis and empirical evaluations on real-world datasets demonstrate that the ATL framework can effectively improve the annotation efficiency of both active learning and evaluation tasks.

Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer
Jianwei Zhang Suren Jayasuriya Visar Berisha



Research question: How to improve the repeatability of embeddings in deep learning models.
Motivation: Current embedding methods are sensitive to changes in the task-specific label but are not invariant to other confounding factors.
Method: Leveraging the concept of repeatability from measurement theory, propose using the intra-class correlation coefficient (ICC) to evaluate the repeatability of embeddings, and design a new regularizer, the ICC regularizer, as a complement to contrastive losses that guides deep neural networks to produce embeddings with higher repeatability.
Results: Experiments on simulated data show that the ICC regularizer minimizes intra-class variance better than the contrastive loss alone; applied to three speech tasks, adding the ICC regularizer improves the repeatability of learned embeddings and the performance of these downstream tasks.

A good supervised embedding for a specific machine learning task is only sensitive to changes in the label of interest and is invariant to other confounding factors. We leverage the concept of repeatability from measurement theory to describe this property and propose to use the intra-class correlation coefficient (ICC) to evaluate the repeatability of embeddings. We then propose a novel regularizer, the ICC regularizer, as a complementary component for contrastive losses to guide deep neural networks to produce embeddings with higher repeatability. We use simulated data to explain why the ICC regularizer works better on minimizing the intra-class variance than the contrastive loss alone. We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice. The experimental results demonstrate that adding an ICC regularizer can improve the repeatability of learned embeddings compared to only using the contrastive loss; further, these embeddings lead to improved performance in these downstream tasks.
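
A sketch of the regularizer's core quantity: a one-way ICC computed from between-class and within-class mean squares over a batch of embeddings grouped by class, with 1 - ICC added to the contrastive loss. The grouping into equal-sized classes and the exact (1 - ICC) form are simplifying assumptions for illustration.

import torch

def icc_regularizer(emb):
    """emb: [n_classes, k, d] -- k embeddings per class in the batch.
    Returns 1 - ICC(1) averaged over dimensions (lower is more repeatable)."""
    n, k, _ = emb.shape
    class_mean = emb.mean(dim=1, keepdim=True)                  # [n, 1, d]
    grand_mean = emb.mean(dim=(0, 1), keepdim=True)             # [1, 1, d]
    msb = k * ((class_mean - grand_mean) ** 2).sum(0).squeeze(0) / (n - 1)
    msw = ((emb - class_mean) ** 2).sum(dim=(0, 1)) / (n * (k - 1))
    icc = (msb - msw) / (msb + (k - 1) * msw + 1e-8)            # ICC(1) per dimension
    return (1 - icc).mean()

# total = contrastive_loss + lam * icc_regularizer(emb)   # lam: assumed weight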

Feature Selection in the Contrastive Analysis Setting
Ethan Weinberger Ian Connick Covert Su-In Lee



Research question: How to perform feature selection in the contrastive analysis (CA) setting.
Motivation: In contrastive analysis, the goal is to find and exploit variations uniquely enriched in a target dataset relative to a background dataset, but the problem has so far received little attention from the machine learning community.
Method: Propose contrastive feature selection (CFS) for feature selection in the CA setting, motivating the approach with an information-theoretic analysis and validating it empirically on a semi-synthetic dataset and four real-world biomedical datasets.
Results: The method consistently outperforms previously proposed state-of-the-art supervised and unsupervised feature selection methods not designed for the CA setting.

Contrastive analysis (CA) refers to the exploration of variations uniquely enriched in a _target_ dataset as compared to a corresponding _background_ dataset generated from sources of variation that are irrelevant to a given task. For example, a biomedical data analyst may wish to find a small set of genes to use as a proxy for variations in genomic data only present among patients with a given disease (target) as opposed to healthy control subjects (background). However, as of yet the problem of feature selection in the CA setting has received little attention from the machine learning community. In this work we present contrastive feature selection (CFS), a method for performing feature selection in the CA setting. We motivate our approach with a novel information-theoretic analysis of representation learning in the CA setting, and we empirically validate CFS on a semi-synthetic dataset and four real-world biomedical datasets. We find that our method consistently outperforms previously proposed state-of-the-art supervised and fully unsupervised feature selection methods not designed for the CA setting. An open-source implementation of our method is available at https://github.com/suinleelab/CFS.

GradOrth: A Simple yet Efficient Out-of-Distribution Detection with Orthogonal Projection of Gradients
Sima Behpour Thang Doan Xin Li Wenbin He Liang Gou Liu Ren



Research question: How to effectively detect out-of-distribution (OOD) data for machine learning models in real-world applications.
Motivation: Existing OOD detection methods rely mainly on feature maps or the full gradient space and neglect the role of the parameters of the pre-trained network that are most important for in-distribution data.
Method: Propose a new method, GradOrth, which identifies OOD data by computing the norm of the gradient projection on the subspaces considered important for in-distribution data.
Results: This simple yet effective method performs strongly, reducing the average false positive rate at a 95% true positive rate by up to 8% compared with current state-of-the-art methods.

Detecting out-of-distribution (OOD) data is crucial for ensuring the safe deployment of machine learning models in real-world applications. However, existing OOD detection approaches primarily rely on the feature maps or the full gradient space information to derive OOD scores neglecting the role of \textbf{most important parameters} of the pre-trained network over In-Distribution data. In this study, we propose a novel approach called GradOrth to facilitate OOD detection based on one intriguing observation that the important features to identify OOD data lie in the lower-rank subspace of in-distribution (ID) data. In particular, we identify OOD data by computing the norm of gradient projection on \textit{the subspaces considered \textbf{important} for the in-distribution data}. A large orthogonal projection value (i.e., a small projection onto the important subspace) indicates the sample is OOD, as it captures only a weak correlation with the in-distribution (ID) data. This simple yet effective method exhibits outstanding performance, showcasing a notable reduction in the average false positive rate at a 95\% true positive rate (FPR95) of up to 8\% when compared to the current state-of-the-art methods.
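
The scoring rule can be sketched as below: extract a subspace basis from ID data (here via SVD of flattened last-layer gradients, an assumed choice), then score a test sample by the norm of its gradient's projection onto that subspace, with small projections flagged as OOD. The rank and the basis construction are illustrative assumptions.

import torch

def id_subspace(id_grads, rank=32):
    """id_grads: [n, p] flattened last-layer gradients collected on ID data.
    Returns an orthonormal basis U: [p, rank] of the dominant subspace."""
    _, _, Vh = torch.linalg.svd(id_grads, full_matrices=False)
    return Vh[:rank].T

def gradorth_score(grad, U):
    """Norm of the gradient's projection onto the ID-important subspace;
    a small value indicates weak correlation with ID data, i.e., OOD."""
    return (U.T @ grad).norm().item()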

Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels
Jian Chen Ruiyi Zhang Tong Yu Rohan Sharma zhiqiang xu Tong Sun Changyou Chen



Research question: How to learn from noisy labels, especially in real applications.
Motivation: Existing methods typically rely on strict assumptions and are limited to certain types of label noise.
Method: This paper reformulates the label-noise problem from a generative-model perspective and uses powerful diffusion models to learn the stochastic generative process. It further proposes the Label-Retrieval-Augmented (LRA) diffusion model, which leverages neighbor consistency to effectively construct pseudo-clean labels for diffusion training.
Results: Experiments show that the method achieves new state-of-the-art results on all the benchmark datasets; notably, by incorporating conditional information from the powerful CLIP model, it can boost the current state-of-the-art accuracy by 10-20 absolute points in many cases.

Learning from noisy labels is an important and long-standing problem in machine learning for real applications. One of the main research lines focuses on learning a label corrector to purify potential noisy labels. However, these methods typically rely on strict assumptions and are limited to certain types of label noise. In this paper, we reformulate the label-noise problem from a generative-model perspective, *i.e.*, labels are generated by gradually refining an initial random guess. This new perspective immediately enables existing powerful diffusion models to seamlessly learn the stochastic generative process. Once the generative uncertainty is modeled, we can perform classification inference using maximum likelihood estimation of labels. To mitigate the impact of noisy labels, we propose the **L**abel-**R**etrieval-**A**ugmented (LRA) diffusion model, which leverages neighbor consistency to effectively construct pseudo-clean labels for diffusion training. Our model is flexible and general, allowing easy incorporation of different types of conditional information, *e.g.*, use of pre-trained models, to further boost model performance. Extensive experiments are conducted for evaluation. Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets. Remarkably, by incorporating conditional information from the powerful CLIP model, our method can boost the current SOTA accuracy by 10-20 absolute points in many cases. Code is available: https://anonymous.4open.science/r/LRA-diffusion-5F2F

DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning
Wenxuan Bao Francesco Pittaluga Vijay Kumar b g Vincent Bindschaedler



Research question: How to improve the generalization of computer vision models while preserving data privacy.
Motivation: Data augmentation techniques such as simple image transformations and combinations effectively improve the generalization of vision models when training data is limited, but they are incompatible with differentially private learning.
Method: Propose two data augmentation techniques designed for the constraints of differentially private learning: DP-Mix_Self achieves strong classification performance by performing mixup on self-augmented data, and DP-Mix_Diff further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process.
Results: Both new methods achieve state-of-the-art classification performance across a range of datasets and settings; the source code is open-sourced on GitHub.

Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter’s built-in assumption that each training image’s contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.
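
A sketch of the self-augmentation mixup step as it might sit inside a DP-SGD iteration: each example is expanded into several augmented copies of itself, mixed with Dirichlet weights, and the resulting per-example gradient is then clipped and noised as usual, so sensitivity accounting still holds. The augmentation function, mixing distribution, and constants are illustrative assumptions, not necessarily the authors' implementation.

import torch

def dp_mix_self_view(x, augment, n_views=4, alpha=0.3):
    """Build one mixed view of a single example from its own augmentations,
    keeping the per-example structure that DP-SGD's privacy analysis needs."""
    views = torch.stack([augment(x) for _ in range(n_views)])   # [V, *x.shape]
    w = torch.distributions.Dirichlet(alpha * torch.ones(n_views)).sample()
    return (w.view(-1, *([1] * x.dim())) * views).sum(0)        # convex combination

# inside DP-SGD, per example: g = grad(loss(model(dp_mix_self_view(x, aug)), y))
# g = g * min(1, C / g.norm()); add Gaussian noise  -- standard clip + noise step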

Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback
Mihir Prabhudesai Tsung-Wei Ke Alexander Cong Li Deepak Pathak Katerina Fragkiadaki



Research question: How can diffusion models be effectively used for discriminative tasks?
Motivation: Generative models can serve as effective test-time adapters for discriminative models.
Method: Adapt pre-trained discriminative models (such as image classifiers, segmenters, and depth predictors) to each unlabeled example in the test set using generative feedback from a diffusion model: the diffusion model's conditioning is modulated by the output of the discriminative model, and the image-likelihood objective is maximized by backpropagating gradients to the discriminative model's parameters.
Results: Diffusion-TTA significantly improves the accuracy of various large-scale pre-trained discriminative models, such as ImageNet classifiers, CLIP models, image pixel labelers, and image depth predictors. In online adaptation setups, Diffusion-TTA outperforms existing test-time adaptation methods, including TTT-MAE and TENT.

The advancements in generative modeling, particularly the advent of diffusion models, have sparked a fundamental question: how can these models be effectively used for discriminative tasks? In this work, we find that generative models can be great test-time adapters for discriminative models. Our method, Diffusion-TTA, adapts pre-trained discriminative models such as image classifiers, segmenters and depth predictors, to each unlabelled example in the test set using generative feedback from a diffusion model. We achieve this by modulating the conditioning of the diffusion model using the output of the discriminative model. We then maximize the image likelihood objective by backpropagating the gradients to the discriminative model's parameters. We show Diffusion-TTA significantly enhances the accuracy of various large-scale pre-trained discriminative models, such as ImageNet classifiers, CLIP models, image pixel labellers and image depth predictors. Diffusion-TTA outperforms existing test-time adaptation methods, including TTT-MAE and TENT, and particularly shines in online adaptation setups, where the discriminative model is continually adapted to each example in the test set. We provide access to code, results, and visualizations on our website: https://diffusion-tta.github.io/

Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects
Zhuofan Ying Peter Hase Mohit Bansal



Research question: This paper investigates how vision models adaptively use context for out-of-distribution (OOD) generalization, and leverages the analysis results to improve model OOD generalization.
Motivation: Biological vision systems adaptively use context to recognize objects in new settings and to recognize occluded or blurry objects in familiar settings, whereas existing computer vision models struggle with OOD generalization.
Method: This paper first formulates two distinct OOD settings, beneficial Object-Disambiguation and irrelevant Background-Invariance, reflecting the diverse contextual challenges faced by biological vision. Analyzing model performance in these two settings shows that models that excel in one tend to struggle in the other. Using representational geometry analysis and probing methods, the study finds that models with more factorized representations and appropriate feature weighting are more successful on the Object-Disambiguation and Background-Invariance tests.
Results: Based on the analysis, the paper proposes new augmentation methods for enhancing model generalization and validates their effectiveness on in-distribution and OOD tests. The results indicate that, to replicate the generalization abilities of biological vision, computer vision models must factorize object vs. background representations and appropriately weigh both kinds of features.

Biological vision systems make adaptive use of context to recognize objects in new settings with novel contexts as well as occluded or blurry objects in familiar settings. In this paper, we investigate how vision models adaptively use context for out-of-distribution (OOD) generalization and leverage our analysis results to improve model OOD generalization. First, we formulate two distinct OOD settings where the contexts are either beneficial Object-Disambiguation or irrelevant Background-Invariance, reflecting the diverse contextual challenges faced in biological vision. We then analyze model performance in these two different OOD settings and demonstrate that models that excel in one setting tend to struggle in the other. Notably, prior works on learning causal features improve on one setting but hurt on the other. This underscores the importance of generalizing across both OOD settings, as this ability is crucial for both human cognition and robust AI systems. Next, to better understand the model properties contributing to OOD generalization, we use representational geometry analysis and our own probing methods to examine a population of models, and we discover that those with more factorized representations and appropriate feature weighting are more successful in handling Object-Disambiguation and Background-Invariance tests. We further validate these findings through causal intervention, manipulating representation factorization and feature weighting to demonstrate their causal effect on performance. Motivated by our analysis results, we propose new augmentation methods aimed at enhancing model generalization. The proposed methods outperform strong baselines, yielding improvements in both in-distribution and OOD tests. We conclude that, in order to replicate the generalization abilities of biological vision, computer vision models must have factorized object vs. background representations and appropriately weigh both kinds of features.

CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
Rakshith Sharma Srinivasa Jaejin Cho Chouchang Yang Yashas Malur Saidutta Ching-Hua Lee Yilin Shen Hongxia Jin



Research question: This paper considers contrastive training for cross-modal zero-shot transfer, where a pre-trained model in one modality is used for representation learning in another domain.
Motivation: Existing contrastive training aligns positive pairs and repels negative pairs, but similarity among training examples is continuous in nature, calling for a more "non-binary" treatment.
Method: Propose a new contrastive loss function, Continuously Weighted Contrastive Loss (CWCL), which uses a continuous measure of similarity to transfer the structure of the embedding space from one modality to another.
Results: Experiments show that models trained with CWCL outperform existing methods for zero-shot transfer across multiple models, datasets, and modalities, achieving 5-8% absolute improvement in zero-shot image classification and 20-30% absolute improvement in zero-shot speech-to-intent classification and keyword classification.

This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained model in one modality is used for representation learning in another domain using pairwise data. The learnt models in the latter domain can then be used for a diverse set of tasks in a 0-shot way, similar to Contrastive Language-Image Pre-training (CLIP) and Locked-image Tuning (LiT) that have recently gained considerable attention. Classical contrastive training employs sets of positive and negative examples to align similar and repel dissimilar training data samples. However, similarity amongst training examples has a more continuous nature, thus calling for a more `non-binary' treatment. To address this, we propose a new contrastive loss function called Continuously Weighted Contrastive Loss (CWCL) that employs a continuous measure of similarity. With CWCL, we seek to transfer the structure of the embedding space from one modality to another. Owing to the continuous nature of similarity in the proposed loss function, these models outperform existing methods for 0-shot transfer across multiple models, datasets and modalities. By using publicly available datasets, we achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification.
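
The loss can be sketched as a weighted version of the usual cross-modal InfoNCE: instead of a single 0/1 positive per row, every pair (i, j) receives a continuous weight, here taken from cosine similarity in the frozen pre-trained modality. The weight normalization and temperature below are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def cwcl(z_new, z_locked, tau=0.07):
    """z_new: [n, d] embeddings from the modality being trained;
    z_locked: [n, d] embeddings from the frozen pre-trained modality.
    Pairs are weighted by continuous similarity instead of a 0/1 target."""
    z_new = F.normalize(z_new, dim=-1)
    z_locked = F.normalize(z_locked, dim=-1)
    w = (z_locked @ z_locked.T).clamp(min=0)      # continuous pair weights
    w = w / w.sum(dim=1, keepdim=True)            # each row sums to one
    logp = F.log_softmax(z_new @ z_locked.T / tau, dim=1)
    return -(w * logp).sum(dim=1).mean()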

Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement
Hui Yuan Kaixuan Huang Chengzhuo Ni Minshuo Chen Mengdi Wang



Research question: This paper explores the methodology and theory of reward-directed generation via conditional diffusion models.
Motivation: Reward-directed generation aims to generate samples with desired properties as measured by a reward function, with broad applications in generative AI, reinforcement learning, and computational biology.
Method: We consider the common learning scenario of mostly unlabeled data plus a small set of data with noisy reward labels. Our approach learns a reward function on the smaller dataset and uses it as a pseudo-labeler for the unlabeled data; after pseudo-labeling, a conditional diffusion model (CDM) is trained on the data and samples are generated by setting a target value $a$ as the condition in the CDM.
Results: Theoretically, this directed generator can effectively learn and sample from the reward-conditioned data distribution: (1) the model recovers the data's latent subspace representation; (2) generated samples move closer to the user-specified target, with the reward improvement governed by the interplay between the strength of the reward signal, the distribution shift, and the cost of off-support extrapolation.

We explore the methodology and theory of reward-directed generation via conditional diffusion models. Directed generation aims to generate samples with desired properties as measured by a reward function, which has broad applications in generative AI, reinforcement learning, and computational biology. We consider the common learning scenario where the dataset consists of majorly unlabeled data and a small set of data with noisy reward labels. Our approach leverages a learned reward function on the smaller data set as a pseudolabeler to label the unlabelled data. After pseudo-labelling, a conditional diffusion model (CDM) is trained on the data and samples are generated by setting a target value $a$ as the condition in CDM. From a theoretical standpoint, we show that this directed generator can effectively learn and sample from the reward-conditioned data distribution: (1) our model is capable of recovering the data's latent subspace representation; (2) the model generates samples moving closer to the user-specified target. The improvement in rewards of samples is influenced by an interplay between the strength of the reward signal, the distribution shift, and the cost of off-support extrapolation. We provide empirical results to validate our theory and highlight the relationship between the strength of extrapolation and the quality of generated samples.

Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models
Andy Zhou Jindong Wang Yu-Xiong Wang Haohan Wang



Research question: This paper aims to improve the robustness of vision models by combining knowledge distillation and data augmentation.
Motivation: The authors address the conjecture that larger models do not make better teachers, showing strong out-of-distribution robustness gains when distilling from pretrained foundation models.
Method: Propose Discrete Adversarial Distillation (DAD), which uses a robust teacher to generate adversarial examples and a VQGAN to discretize them, creating samples more informative than standard data augmentation techniques.
Results: Experiments show notable gains in out-of-distribution robustness and clean accuracy across different student architectures; moreover, compared with similar techniques, the method adds only minor computational overhead and can easily be combined with other data augmentations for further improvements.

We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for better teachers by showing strong gains in out-of-distribution robustness when distilling from pretrained foundation models. Following this finding, we propose Discrete Adversarial Distillation (DAD), which leverages a robust teacher to generate adversarial examples and a VQGAN to discretize them, creating more informative samples than standard data augmentation techniques. We provide a theoretical framework for the use of a robust teacher in the knowledge distillation with data augmentation setting and demonstrate strong gains in out-of-distribution robustness and clean accuracy across different student architectures. Notably, our method adds minor computational overhead compared to similar techniques and can be easily combined with other data augmentations for further improvements.

Fairness Continual Learning Approach to Semantic Scene Understanding in Open-World Environments
Thanh-Dat Truong Hoang-Quan Nguyen Bhiksha Raj Khoa Luu



Research question: This paper addresses continual learning for semantic segmentation while attending to model fairness.
Motivation: Although existing continual semantic segmentation models have made notable progress in recent years, their fairness concerns remain insufficiently addressed; fairness is one of the most vital factors in deploying deep learning models, especially in human-related or safety applications.
Method: Propose a new fairness continual learning framework based on class distributions, together with a novel Prototypical Contrastive Clustering loss that addresses the major challenges of continual learning, i.e., catastrophic forgetting and background shift, and a Conditional Structural Consistency loss that further regularizes the structural constraints of predicted segmentations.
Results: The approach achieves state-of-the-art performance on three standard scene-understanding benchmarks (ADE20K, Cityscapes, and Pascal VOC) while promoting the fairness of the segmentation model.

Continual semantic segmentation aims to learn new classes while maintaining the information from the previous classes. Although prior studies have shown impressive progress in recent years, the fairness concern in the continual semantic segmentation needs to be better addressed. Meanwhile, fairness is one of the most vital factors in deploying the deep learning model, especially in human-related or safety applications. In this paper, we present a novel Fairness Continual Learning approach to the semantic segmentation problem. In particular, under the fairness objective, a new fairness continual learning framework is proposed based on class distributions. Then, a novel Prototypical Contrastive Clustering loss is proposed to address the significant challenges in continual learning, i.e., catastrophic forgetting and background shift. Our proposed loss has also been proven as a novel, generalized learning paradigm of knowledge distillation commonly used in continual learning. Moreover, the proposed Conditional Structural Consistency loss further regularizes the structural constraints of the predicted segmentation. Our proposed approach has achieved State-of-the-Art performance on three standard scene understanding benchmarks, i.e., ADE20K, Cityscapes, and Pascal VOC, and promoted the fairness of the segmentation model.

Robust Learning with Progressive Data Expansion Against Spurious Correlation
Yihe Deng Yu Yang Baharan Mirzasoleiman Quanquan Gu



Research question: Deep learning models are prone to learning spurious features rather than the core features genuinely correlated with the true label.
Motivation: Through a theoretical analysis of nonlinear convolutional neural networks trained in the presence of spurious features, the paper motivates a new training algorithm, PDE, to improve model robustness.
Method: PDE starts from a group-balanced subset of the training data and progressively expands it to promote learning of the core features.
Results: Experiments on synthetic and real-world benchmark datasets show superior performance on models such as ResNets and Transformers, with worst-group accuracy improved by 2.8% on average over the state of the art and up to 10x faster training.

While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable _spurious features_ rather than the core features that are genuinely correlated to the true label. In this paper, beyond existing analyses of linear models, we theoretically examine the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. In light of this, we propose a new training algorithm called **PDE** that efficiently enhances the model's robustness for a better worst-group performance. PDE begins with a group-balanced subset of training data and progressively expands it to facilitate the learning of the core features. Experiments on synthetic and real-world benchmark datasets confirm the superior performance of our method on models such as ResNets and Transformers. On average, our method achieves a $2.8$ \% improvement in worst-group accuracy compared with the state-of-the-art method, while enjoying up to $10\times$ faster training efficiency.
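
The progressive-expansion schedule at the heart of PDE can be sketched as below. The uniform-random expansion rule is a placeholder assumption (the paper expands based on the training dynamics), and `progressive_expansion_schedule` is our own illustrative name.

```python
import numpy as np

def progressive_expansion_schedule(groups, rounds=5, seed=0):
    """Hedged sketch of a PDE-style schedule: start from a group-balanced
    core set and progressively expand the training pool. The expansion
    rule here (random halves of the remainder) is a placeholder."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    n_min = min(int(np.sum(groups == g)) for g in uniq)
    active = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), n_min, replace=False)
        for g in uniq
    ])
    remaining = np.setdiff1d(np.arange(len(groups)), active)
    yield active                       # round 0: group-balanced core
    for _ in range(rounds - 1):
        if remaining.size == 0:
            break
        add = rng.choice(remaining, max(1, remaining.size // 2), replace=False)
        active = np.concatenate([active, add])
        remaining = np.setdiff1d(remaining, add)
        yield active                   # later rounds: expanded pool

groups = np.array([0] * 80 + [1] * 20)  # toy majority/minority groups
for idx in progressive_expansion_schedule(groups, rounds=3):
    print(len(idx), np.bincount(groups[idx]))
```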

Active representation learning for general task space with applications in robotics
Yifang Chen Yingbing Huang Simon Shaolei Du Kevin Jamieson Guanya Shi



Research question: How to actively choose which source tasks to sample from when learning representations for a general target task space.
Motivation: Multi-task pretraining is powerful, but passively sampling source tasks is sample-inefficient; the learner should optimally select source tasks in both task-aware and task-agnostic settings.
Method: Proposes a general algorithmic and theoretical framework for active representation learning with a tractable meta algorithm, instantiated for bilinear, feature-based nonlinear, and general nonlinear cases.
Results: Theory shows sample-complexity savings of up to a factor of $\frac{1}{d_W}$ over passive sampling; on synthetic datasets and robotics problems, from pendulum simulations to real-world drone flight data, the algorithms outperform baselines by 20%-70% on average.

Representation learning based on multi-task pretraining has become a powerful approach in many domains. In particular, task-aware representation learning aims to learn an optimal representation for a specific target task by sampling data from a set of source tasks, while task-agnostic representation learning seeks to learn a universal representation for a class of tasks. In this paper, we propose a general and versatile algorithmic and theoretic framework for \emph{active representation learning}, where the learner optimally chooses which source tasks to sample from. This framework, along with a tractable meta algorithm, accommodates nearly arbitrary target and source task spaces (from discrete to continuous), covers both task-aware and task-agnostic settings, and is compatible with deep representation learning practices. We provide several instantiations under this framework, from bilinear and feature-based nonlinear to general nonlinear cases. In the bilinear case, by leveraging the non-uniform spectrum of the task representation and the calibrated source-target relevance, we prove that the sample complexity to achieve $\varepsilon$-excess risk on target scales with $(k^*)^2 ||v^*||_2^2 \varepsilon^{-2}$ where $k^*$ is the effective dimension of the target and $||v^*||_2^2 \in (0,1]$ represents the connection between source and target space. Compared to the passive counterpart, this can cut the sample complexity by a factor of up to $\frac{1}{d_W}$, where $d_W$ is the task space dimension. Finally, we demonstrate different instantiations of our meta algorithm in synthetic datasets and robotics problems, from pendulum simulations to real-world drone flight datasets. On average, our algorithms outperform baselines by 20%-70%.

Reconciling Competing Sampling Strategies of Network Embedding
Yuchen Yan Baoyu Jing Lihui Liu Ruijie Wang Jinning Li Tarek Abdelzaher Hanghang Tong



Research question: How existing network embedding algorithms, which follow a sampling-based training procedure, should balance capturing network topology against optimizing the similarity measure.
Motivation: Different sampling strategies markedly affect embedding quality, yet no existing method satisfies both discrimination and monotonicity for all node pairs simultaneously.
Method: Proposes SENSEI, a new model that seamlessly fulfills the discrimination property and partial monotonicity within the top-$K$ ranking list.
Results: Experiments show SENSEI outperforms the state of the art in plain network embedding.

Network embedding plays a significant role in a variety of applications. To capture the topology of the network, most of the existing network embedding algorithms follow a sampling training procedure, which maximizes the similarity (e.g., embedding vectors' dot product) between positively sampled node pairs and minimizes the similarity between negatively sampled node pairs in the embedding space. Typically, close node pairs function as positive samples while distant node pairs are usually considered as negative samples. However, under different or even competing sampling strategies, some methods champion sampling distant node pairs as positive samples to encapsulate longer distance information in link prediction, whereas others advocate adding close nodes into the negative sample set to boost the performance of node recommendation. In this paper, we seek to understand the intrinsic relationships between these competing strategies. To this end, we identify two properties (discrimination and monotonicity) that given any node pair proximity distribution, node embeddings should embrace. Moreover, we quantify the empirical error of the trained similarity score w.r.t. the sampling strategy, which leads to an important finding that the discrimination property and the monotonicity property for all node pairs can not be satisfied simultaneously in real-world applications. Guided by such analysis, a simple yet novel model (SENSEI) is proposed, which seamlessly fulfills the discrimination property and the partial monotonicity within the top-$K$ ranking list. Extensive experiments show that SENSEI outperforms the state-of-the-arts in plain network embedding.

Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification
Kanishk Jain Shyamgopal Karthik Vineet Gandhi



Research question: This paper aims to reduce mistake severity in fine-grained classification.
Motivation: Fine-grained classification is challenging because accurate annotation requires domain expertise, whereas humans are particularly adept at coarse classification, which demands relatively little expertise.
Method: Proposes Hierarchical Ensembles (HiE), a novel post-hoc amendment method that exploits the label hierarchy, using coarse-grained predictions to improve fine-grained classification performance at test time.
Results: On iNaturalist-19 and tieredImageNet-H, the method markedly reduces average mistake severity while improving top-1 accuracy, achieving a new state of the art on both benchmarks; in the semi-supervised setting, it notably improves top-1 accuracy while substantially reducing mistake severity as training data decreases.

We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces avg. mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE renders it practical to be used with any off-the-shelf trained model to improve its predictions further.
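
The post-hoc amendment mechanism can be illustrated in a few lines: reweight each leaf class's probability by its parent's coarse probability under the label hierarchy. This shows the general idea only; the authors' exact combination rule may differ.

```python
import numpy as np

def amend_with_coarse(p_fine, p_coarse, parent_of):
    """p_fine: (n, F) leaf-class probabilities; p_coarse: (n, C) parent
    probabilities; parent_of[f] is the parent index of leaf f. Reweights
    each leaf by its parent's coarse probability, then renormalizes."""
    parent_of = np.asarray(parent_of)
    amended = p_fine * p_coarse[:, parent_of]
    return amended / amended.sum(axis=1, keepdims=True)

p_fine = np.array([[0.5, 0.3, 0.2]])    # three leaf classes
p_coarse = np.array([[0.2, 0.8]])       # two parent classes
parent_of = [0, 1, 1]                   # leaf -> parent map
print(amend_with_coarse(p_fine, p_coarse, parent_of).round(2))
# [[0.2  0.48 0.32]]: leaves under the likelier parent are boosted
```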

RegBN: Batch Normalization of Multimodal Data with Regularization
MORTEZA GHAHREMANI Christian Wachinger



Research question: How to effectively integrate high-dimensional data captured by multisource sensors, especially in the presence of confounding effects and dependencies.
Motivation: The success of neural networks at integrating multimodal data has spurred interest in fusing high-dimensional multisource data; however, integrating heterogeneous multimodal data is challenging because confounding effects and dependencies introduce unwanted variability and bias, degrading multimodal models.
Method: Introduces RegBN, a novel multimodal batch normalization method that uses the Frobenius norm as a regularization term to handle confounding effects and latent dependencies among data sources; it generalizes across modalities and removes the need for learnable parameters, simplifying training and inference.
Results: Validated on eight databases from five research areas spanning language, audio, image, video, depth, tabular, and 3D MRI modalities; the method applies broadly across architectures such as multilayer perceptrons, convolutional neural networks, and vision transformers, effectively normalizing both low- and high-level features in multimodal neural networks.

Recent years have witnessed a surge of interest in integrating high-dimensional data captured by multisource sensors, driven by the impressive success of neural networks in integrating multimodal data. However, the integration of heterogeneous multimodal data poses a significant challenge, as confounding effects and dependencies among such heterogeneous data sources introduce unwanted variability and bias, leading to suboptimal performance of multimodal models. Therefore, it becomes crucial to normalize the low- or high-level features extracted from data modalities before their fusion takes place. This paper introduces RegBN, a novel approach for multimodal Batch Normalization with REGularization. RegBN uses the Frobenius norm as a regularizer term to address the side effects of confounders and underlying dependencies among different data sources. The proposed method generalizes well across multiple modalities and eliminates the need for learnable parameters, simplifying training and inference. We validate the effectiveness of RegBN on eight databases from five research areas, encompassing diverse modalities such as language, audio, image, video, depth, tabular, and 3D MRI. The proposed method demonstrates broad applicability across different architectures such as multilayer perceptrons, convolutional neural networks, and vision transformers, enabling effective normalization of both low- and high-level features in multimodal neural networks. RegBN is available at https://mogvision.github.io/RegBN.
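
Our simplified reading of the core mechanism, as a hedged sketch: remove from one modality's features whatever is linearly predictable from another modality's features, with a Frobenius-norm (ridge) penalty on the projection. This is an illustration of the idea only, not the released RegBN implementation (see the project page for that).

```python
import numpy as np

def regularized_residual(f_a, f_b, lam=1.0):
    """Remove from modality-A features the component linearly predictable
    from modality-B features, with a Frobenius-norm (ridge) penalty:
        W* = argmin_W ||F_a - F_b W||_F^2 + lam * ||W||_F^2
    Returns F_a - F_b W*, i.e. A-features decorrelated from B. A sketch
    of the idea only, not the official RegBN implementation."""
    d_b = f_b.shape[1]
    w = np.linalg.solve(f_b.T @ f_b + lam * np.eye(d_b), f_b.T @ f_a)
    return f_a - f_b @ w

rng = np.random.default_rng(0)
f_b = rng.normal(size=(128, 8))                        # modality B
f_a = f_b @ rng.normal(size=(8, 4)) + 0.1 * rng.normal(size=(128, 4))
res = regularized_residual(f_a, f_b)
print(np.abs(f_b.T @ f_a).max(), np.abs(f_b.T @ res).max())
# the cross-correlation with modality B drops by orders of magnitude
```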

Algorithm Selection for Deep Active Learning with Imbalanced Datasets
Jifan Zhang Shuai Shao saurabh verma Robert D Nowak



Research question: How to reduce the number of labeled examples required in deep learning applications.
Motivation: Active learning aims to reduce the number of labeled examples needed to train deep networks, but its empirical performance can vary widely across datasets and applications.
Method: Proposes an adaptive algorithm selection strategy for deep active learning: for any unlabeled dataset, the (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms.
Results: Extensive experiments on multi-class and multi-label applications show TAILOR attains accuracy comparable to or better than the best candidate algorithm.

Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable or better than that of the best of the candidate algorithms. Our implementation of TAILOR is open-sourced at https://github.com/jifanz/TAILOR.
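
The bandit loop underlying algorithm selection can be sketched with Thompson sampling over candidates. The Beta-Bernoulli reward model below is a simplifying assumption (TAILOR's actual reward functions target class balance); it only shows how sampling from posteriors trades off exploring and exploiting candidate algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_algos, rounds = 4, 100
alpha = np.ones(n_algos)   # Beta posterior "successes" per candidate
beta = np.ones(n_algos)    # Beta posterior "failures" per candidate

true_quality = np.array([0.3, 0.5, 0.7, 0.4])   # hidden; simulation only

for t in range(rounds):
    theta = rng.beta(alpha, beta)        # one Thompson sample per candidate
    chosen = int(np.argmax(theta))       # run the sampled-best algorithm
    # TAILOR's reward reflects how class-balanced the newly labeled batch
    # is; we substitute a Bernoulli reward to keep the sketch minimal.
    reward = float(rng.random() < true_quality[chosen])
    alpha[chosen] += reward
    beta[chosen] += 1.0 - reward

print("posterior means:", (alpha / (alpha + beta)).round(2))
```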

Neural Image Compression: Generalization, Robustness, and Spectral Biases
Kelsey Lieberman James Diffenderfer Charles Godfrey Bhavya Kailkhura



Research question: Current research lacks comprehensive datasets and informative tools for evaluating and understanding neural image compression (NIC) performance in real-world settings.
Motivation: To bridge this critical gap, the paper presents a comprehensive benchmark suite for evaluating the out-of-distribution (OOD) performance of image compression methods.
Method: Creates CLIC-C and Kodak-C by introducing 15 corruptions to the popular CLIC and Kodak benchmarks, then proposes spectrally-inspired inspection tools to probe the errors introduced by image compression methods and their OOD performance.
Results: A detailed performance comparison of several classic codecs and NIC variants reveals intriguing findings that challenge current understanding of NIC's strengths and limitations; theoretical analysis corroborates the empirical findings, tying NIC's OOD performance to the spectral properties of the data.

Recent advances in neural image compression (NIC) have produced models that are starting to outperform classic codecs. While this has led to growing excitement about using NIC in real-world applications, the successful adoption of any machine learning system in the wild requires it to generalize (and be robust) to unseen distribution shifts at deployment. Unfortunately, current research lacks comprehensive datasets and informative tools to evaluate and understand NIC performance in real-world settings. To bridge this crucial gap, first, this paper presents a comprehensive benchmark suite to evaluate the out-of-distribution (OOD) performance of image compression methods. Specifically, we provide CLIC-C and Kodak-C by introducing 15 corruptions to the popular CLIC and Kodak benchmarks. Next, we propose spectrally-inspired inspection tools to gain deeper insight into errors introduced by image compression methods as well as their OOD performance. We then carry out a detailed performance comparison of several classic codecs and NIC variants, revealing intriguing findings that challenge our current understanding of the strengths and limitations of NIC. Finally, we corroborate our empirical findings with theoretical analysis, providing an in-depth view of the OOD performance of NIC and its dependence on the spectral properties of the data. Our benchmarks, spectral inspection tools, and findings provide a crucial bridge to the real-world adoption of NIC. We hope that our work will propel future efforts in designing robust and generalizable NIC methods. Code and data will be made available at https://github.com/klieberman/ood_nic.

In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer
Yuzhou Cao Hussein Mozannar Lei Feng Hongxin Wei Bo An



Research question: How to let machine learning classifiers defer decisions to an expert when the expert is more accurate, improving safety and performance.
Motivation: Existing softmax parameterizations in the learning-to-defer framework are miscalibrated; the goal is a softmax-based estimator that is both statistically consistent and a valid probability estimator.
Method: Analysis shows that the miscalibrated, unbounded estimators in prior literature stem from the symmetry of the surrogate losses rather than from softmax itself; the paper therefore proposes a novel statistically consistent asymmetric softmax-based surrogate loss that yields valid estimates without unboundedness.
Results: Non-asymptotic properties of the proposed method are analyzed, and its performance and calibration are validated empirically on benchmark datasets.

Enabling machine learning classifiers to defer their decision to a downstream expert when the expert is more accurate will ensure improved safety and performance. This objective can be achieved with the learning-to-defer framework which aims to jointly learn how to classify and how to defer to the expert. In recent studies, it has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring which makes them uncalibrated. However, it remains unknown whether this is due to the widely used softmax parameterization and if we can find a softmax-based estimator that is both statistically consistent and possesses a valid probability estimator. In this work, we first show that the cause of the miscalibrated and unbounded estimator in prior literature is due to the symmetric nature of the surrogate losses used and not due to softmax. We then propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness. We further analyze the non-asymptotic properties of our proposed method and empirically validate its performance and calibration on benchmark datasets.

Curriculum Learning for Graph Neural Networks: Which Edges Should We Learn First
Zheng Zhang Junxiang Wang Liang Zhao



Research question: Existing graph neural networks (GNNs) treat all edges equally, although edges in real-world graphs vary in difficulty, which can yield suboptimal learned representations.
Motivation: To address this, the paper proposes a new curriculum learning (CL) strategy that gradually increases the difficulty of the edges used during training, improving the learning ability and robustness of GNNs.
Method: The method measures each edge's expected difficulty given the model's training status and gradually incorporates more edges into training, from easy to hard.
Results: Experiments on nine synthetic and nine real-world datasets show significant improvements in the generalization ability and robustness of the learned representations.

Graph Neural Networks (GNNs) have achieved great success in representing data with dependencies by recursively propagating and aggregating messages along the edges. However, edges in real-world graphs often have varying degrees of difficulty, and some edges may even be noisy to the downstream tasks. Therefore, existing GNNs may lead to suboptimal learned representations because they usually treat every edge in the graph equally. On the other hand, Curriculum Learning (CL), which mimics the human learning principle of learning data samples in a meaningful order, has been shown to be effective in improving the generalization ability and robustness of representation learners by gradually proceeding from easy to more difficult samples during training. Unfortunately, existing CL strategies are designed for independent data samples and cannot trivially generalize to handle data dependencies. To address these issues, we propose a novel CL strategy to gradually incorporate more edges into training according to their difficulty from easy to hard, where the degree of difficulty is measured by how well the edges are expected given the model training status. We demonstrate the strength of our proposed method in improving the generalization ability and robustness of learned representations through extensive experiments on nine synthetic datasets and nine real-world datasets. The code for our proposed method is available at https://github.com/rollingstonezz/Curriculum_learning_for_GNNs
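
A hedged sketch of the edge-curriculum schedule: given per-edge difficulty scores (in the paper these are derived from how well an edge is expected under the current training status; here they are a plain input), admit edges into training easy-first over a growing schedule.

```python
import numpy as np

def edge_curriculum(edge_scores, n_stages=4):
    """edge_scores: per-edge difficulty, lower = easier. Yields
    easy-to-hard edge index sets, one per curriculum stage."""
    order = np.argsort(edge_scores)            # easiest edges first
    n = len(order)
    for stage in range(1, n_stages + 1):
        k = int(np.ceil(n * stage / n_stages)) # admit a growing fraction
        yield order[:k]

scores = np.array([0.9, 0.1, 0.4, 0.7, 0.2])  # toy difficulty scores
for edges in edge_curriculum(scores):
    print(sorted(edges.tolist()))              # training edge set per stage
```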

A Unified Approach to Count-Based Weakly Supervised Learning
Vinay Shukla Zhe Zeng Kareem Ahmed Guy Van den Broeck



Research question: How to learn from weakly labeled data.
Motivation: High-quality labels are often very scarce, whereas unlabeled data with inferred weak labels occurs much more commonly.
Method: Develops a unified approach called count-based weakly supervised learning, whose core is the ability to compute the probability that exactly $k$ outputs are set to true.
Results: A count loss between the model distribution and arithmetic constraints defined over label counts effectively penalizes the model's deviations.

High-quality labels are often very scarce, whereas unlabeled data with inferred weak labels occurs more naturally. In many cases, these weak labels dictate the frequency of each respective class over a set of instances. In this paper, we develop a unified approach to learning from such weakly-labeled data, which we call *count-based weakly-supervised learning*. At the heart of our approach is the ability to compute the probability of exactly $k$ out of $n$ outputs being set to true. This computation is differentiable, exact, and efficient. Building upon the previous computation, we derive a *count loss* penalizing the model for deviations in its distribution from an arithmetic constraint defined over label counts.
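
The central computation, the probability that exactly $k$ of $n$ independent outputs are true, admits an exact dynamic program over the running count that is differentiable end to end. A minimal sketch of that computation and of a count loss built on it, assuming independent per-output probabilities:

```python
import torch

def prob_exactly_k(p, k):
    """Exact probability that exactly k of the n independent Bernoulli
    variables with success probabilities p (shape (n,)) are true.
    O(n^2) dynamic program over the count distribution; differentiable."""
    dp = torch.zeros(p.shape[0] + 1, dtype=p.dtype)
    dp[0] = 1.0                       # zero successes before any variable
    for pi in p:
        shifted = torch.cat([torch.zeros(1, dtype=p.dtype), dp[:-1]])
        dp = dp * (1 - pi) + shifted * pi   # add one variable to the count
    return dp[k]

# A count loss: penalize the model when its marginals make the target
# count k unlikely (here k = 3 of 10 outputs).
logits = torch.randn(10, requires_grad=True)
p = torch.sigmoid(logits)
count_loss = -torch.log(prob_exactly_k(p, 3))
count_loss.backward()
print(float(count_loss), logits.grad.shape)
```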

Towards Personalized Federated Learning via Heterogeneous Model Reassembly
Jiaqi Wang Xingyi Yang Suhan Cui Liwei Che Lingjuan Lyu Dongkuan Xu Fenglong Ma



Research question: This paper addresses model heterogeneity in federated learning, where clients hold models with different network structures.
Motivation: Heterogeneous client models pose challenges for personalized federated learning; to address this, the paper proposes pFedHR, a new framework that leverages heterogeneous model reassembly for personalization.
Method: Treats heterogeneous model personalization as a model-matching optimization task on the server side; in addition, pFedHR automatically and dynamically generates informative and diverse personalized candidate models without human intervention.
Results: pFedHR outperforms baselines on three datasets under both IID and non-IID settings, mitigates the adverse impact of using different public data, and dynamically generates diverse personalized models in an automated manner.

This paper focuses on addressing the practical yet challenging problem of model heterogeneity in federated learning, where clients possess models with different network structures. To tackle this problem, we propose a novel framework called pFedHR, which leverages heterogeneous model reassembly to achieve personalized federated learning. In particular, we approach the problem of heterogeneous model personalization as a model-matching optimization task on the server side. Moreover, pFedHR automatically and dynamically generates informative and diverse personalized candidates with minimal human intervention. Furthermore, our proposed heterogeneous model reassembly technique mitigates the adverse impact introduced by using public data with different distributions from the client data to a certain extent. Experimental results demonstrate that pFedHR outperforms baselines on three datasets under both IID and Non-IID settings. Additionally, pFedHR effectively reduces the adverse impact of using different public data and dynamically generates diverse personalized models in an automated manner.

Towards Last-Layer Retraining for Group Robustness with Fewer Annotations
Tyler LaBonte Vidya Muthukumar Abhishek Kumar



Research question: Empirical risk minimization (ERM) of deep neural networks is prone to over-reliance on spurious correlations and poor performance on minority groups.
Motivation: The recent deep feature reweighting (DFR) technique achieves state-of-the-art group robustness via simple last-layer retraining, but it requires held-out group and class annotations to construct a group-balanced reweighting dataset, which is impractical.
Method: The paper examines this impractical requirement and finds that last-layer retraining can be highly effective without any group annotations (except for model selection), even when the reweighting dataset contains only a small proportion of worst-group data.
Results: Experiments show for the first time that last-layer retraining on a held-out subset can substantially outperform ERM on the full dataset with no extra data, annotations, or training computation; further, model disagreement upsamples worst-group data, letting SELF nearly match DFR on four well-established vision and language benchmarks with no group annotations and under 3% of the held-out class annotations.

Empirical risk minimization (ERM) of neural networks is prone to over-reliance on spurious correlations and poor generalization on minority groups. The recent deep feature reweighting (DFR) technique achieves state-of-the-art group robustness via simple last-layer retraining, but it requires held-out group and class annotations to construct a group-balanced reweighting dataset. In this work, we examine this impractical requirement and find that last-layer retraining can be surprisingly effective with no group annotations (other than for model selection) and only a handful of class annotations. We first show that last-layer retraining can greatly improve worst-group accuracy even when the reweighting dataset has only a small proportion of worst-group data. This implies a "free lunch" where holding out a subset of training data to retrain the last layer can substantially outperform ERM on the entire dataset with no additional data, annotations, or computation for training. To further improve group robustness, we introduce a lightweight method called selective last-layer finetuning (SELF), which constructs the reweighting dataset using misclassifications or disagreements. Our experiments present the first evidence that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations.
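
A hedged sketch of the SELF recipe: freeze the featurizer, collect held-out points where two checkpoints disagree (or where the model errs), and refit only the last linear layer on them. The function name and the logistic-regression head are our illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_last_layer(features, labels, preds_a, preds_b):
    """features/labels: a held-out set embedded by the frozen backbone.
    preds_a / preds_b: predictions from two checkpoints (e.g. an early
    and the final epoch). Disagreement points tend to upsample
    worst-group data; only the last layer is refit on them."""
    idx = np.flatnonzero(preds_a != preds_b)
    head = LogisticRegression(max_iter=1000)
    head.fit(features[idx], labels[idx])
    return head                          # backbone stays frozen

# Toy usage with random stand-ins for embeddings and predictions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
y = rng.integers(0, 2, size=500)
pa, pb = rng.integers(0, 2, size=500), rng.integers(0, 2, size=500)
new_head = self_last_layer(feats, y, pa, pb)
print(new_head.score(feats, y))
```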

Adversarial Examples Are Not Real Features
Ang Li Yifei Wang Yiwen Guo Yisen Wang



Research question: What causes adversarial examples, and what are their implications for machine learning.
Motivation: The existence of adversarial examples has long been a mystery and attracted wide attention; an existing theory explains adversarial vulnerability from a data perspective, but the explanation is quite counter-intuitive to humans.
Method: The paper re-examines this theory by incorporating multiple learning paradigms, finding that non-robust features attain poor usefulness in other self-supervised paradigms such as contrastive learning, masked image modeling, and diffusion models.
Results: Results indicate that non-robust features do not transfer as well as robust or natural features and behave more like paradigm-specific shortcuts; moreover, encoders naturally trained starting from robust features remain largely non-robust.

The existence of adversarial examples has been a mystery for years and attracted much interest. A well-known theory by \citet{ilyas2019adversarial} explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and these features alone are useful for classification. However, the explanation remains quite counter-intuitive since non-robust features are mostly noise features to humans. In this paper, we re-examine the theory from a larger context by incorporating multiple learning paradigms. Notably, we find that contrary to their good usefulness under supervised learning, non-robust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. It reveals that non-robust features are not really as useful as robust or natural features that enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained encoders from robust features are largely non-robust under AutoAttack. Our cross-paradigm examination suggests that the non-robust features are not really useful but more like paradigm-wise shortcuts, and robust features alone might be insufficient to attain reliable model robustness. Code is available at \url{https://github.com/PKU-ML/AdvNotRealFeatures}.

On the Importance of Feature Separability in Predicting Out-Of-Distribution Error
RENCHUNZI XIE Hongxin Wei Lei Feng Yuzhou Cao Bo An



Research question: How to accurately estimate a model's generalization performance on unlabeled out-of-distribution (OOD) data.
Motivation: Although prior methods stress the link between distribution difference and OOD accuracy, the authors find that a large domain gap does not necessarily imply low test accuracy.
Method: Proposes a dataset-level score based on feature dispersion to estimate test accuracy under distribution shift, inspired by desirable representation-learning properties: high inter-class dispersion and high intra-class compactness.
Results: Analysis shows that inter-class dispersion correlates strongly with model accuracy, whereas intra-class compactness does not reflect OOD generalization; extensive experiments demonstrate the method's superiority in both prediction performance and computational efficiency.

Estimating the generalization performance is practically challenging on out-of-distribution (OOD) data without ground-truth labels. While previous methods emphasize the connection between distribution difference and OOD accuracy, we show that a large domain gap does not necessarily lead to a low test accuracy. In this paper, we investigate this problem from the perspective of feature separability empirically and theoretically. Specifically, we propose a dataset-level score based upon feature dispersion to estimate the test accuracy under distribution shift. Our method is inspired by desirable properties of features in representation learning: high inter-class dispersion and high intra-class compactness. Our analysis shows that inter-class dispersion is strongly correlated with the model accuracy, while intra-class compactness does not reflect the generalization performance on OOD data. Extensive experiments demonstrate the superiority of our method in both prediction performance and computational efficiency.
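
The dataset-level quantity is simple to compute once features and (pseudo-)labels are available. Below is a minimal sketch of inter-class dispersion as the average pairwise distance between class-mean features; the paper's exact estimator may differ.

```python
import numpy as np

def inter_class_dispersion(features, labels):
    """Average pairwise distance between class-mean feature vectors.
    Higher dispersion is reported to correlate with OOD accuracy."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    iu = np.triu_indices(len(classes), k=1)
    return dists[iu].mean()

rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 1, (100, 16)),
                        rng.normal(3, 1, (100, 16))])
labs = np.array([0] * 100 + [1] * 100)
print(inter_class_dispersion(feats, labs))   # ~ 3 * sqrt(16) = 12
```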

Differentiable Clustering with Perturbed Spanning Forests
Lawrence Stewart Francis Bach Felipe Llinares-López Quentin Berthet



Research question: How to incorporate clustering into end-to-end trainable pipelines with efficiently computable gradients.
Motivation: Existing clustering methods cannot be directly embedded into trainable pipelines and perform poorly on datasets with high noise and challenging geometries.
Method: Proposes a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests, allowing clustering inside end-to-end trainable pipelines with efficient gradients.
Results: The method performs well on challenging datasets, with performance demonstrated on several datasets for supervised and semi-supervised tasks.

We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks.
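
A hedged perturb-and-average sketch of the mechanism: perturb pairwise edge weights with noise, take a minimum spanning tree, cut the $k-1$ heaviest tree edges to obtain a $k$-component forest, and average co-clustering indicators over perturbations. This illustrates the stochastic-smoothing idea; the paper derives the estimator and its gradients more carefully.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def perturbed_forest_coclustering(x, k=2, n_samples=20, sigma=0.1, seed=0):
    """Monte-Carlo co-clustering matrix: M[i, j] estimates the probability
    that i and j fall in the same tree of a perturbed min-weight spanning
    forest with k components."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = squareform(pdist(x)) + 1.0      # +1 keeps perturbed weights > 0
    coclust = np.zeros((n, n))          # (a constant shift preserves MSTs)
    for _ in range(n_samples):
        noise = np.triu(rng.normal(0.0, sigma, size=(n, n)), 1)
        wp = w + noise + noise.T
        np.fill_diagonal(wp, 0.0)       # zero = "no edge" for scipy
        mst = minimum_spanning_tree(wp).toarray()
        weights, edges = mst[mst > 0], np.argwhere(mst > 0)
        keep = np.argsort(weights)[: len(weights) - (k - 1)]
        adj = np.zeros((n, n))
        for i, j in edges[keep]:        # forest: drop the k-1 heaviest edges
            adj[i, j] = 1.0
        _, comp = connected_components(adj, directed=False)
        coclust += comp[:, None] == comp[None, :]
    return coclust / n_samples

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 0.2, (5, 2)), rng.normal(3, 0.2, (5, 2))])
print(perturbed_forest_coclustering(x, k=2).round(1))   # two clear blocks
```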

Retrieval-Augmented Multiple Instance Learning
Yufei CUI Ziquan Liu Yixin CHEN Yuchen Lu Xinyue Yu Xue Liu Tei-Wei Kuo Miguel R. D. Rodrigues Chun Jason Xue Antoni B. Chan



Research question: Multiple instance learning (MIL), an existing weakly supervised approach, performs well when training and test data come from the same domain but degrades on out-of-domain test sets.
Motivation: To address this, the paper proposes the Retrieval-Augmented MIL (RAM-MIL) framework, which integrates optimal transport (OT) as the distance metric for nearest-neighbor retrieval.
Method: RAM-MIL is driven by two key insights: first, a theoretical finding that reducing the input's intrinsic dimension minimizes the approximation error in attention-based MIL; second, prior studies link the input's intrinsic dimension to the feature-merging process with retrieved data.
Results: In empirical evaluations on whole-slide image classification, RAM-MIL achieves state-of-the-art performance in in-domain settings (training and retrieval data from the same domain) and, more importantly, out-of-domain settings (retrieval data from a different domain); moreover, the transportation matrix derived from OT makes the retrieval results interpretable at the instance level, in contrast to the plain $l_2$ distance, and supports visualization for human experts.

Multiple Instance Learning (MIL) is a crucial weakly supervised learning method applied across various domains, e.g., medical diagnosis based on whole slide images (WSIs). Recent advancements in MIL algorithms have yielded exceptional performance when the training and test data originate from the same domain, such as WSIs obtained from the same hospital. However, this paper reveals a performance deterioration of MIL models when tested on an out-of-domain test set, exemplified by WSIs sourced from a novel hospital. To address this challenge, this paper introduces the Retrieval-AugMented MIL (RAM-MIL) framework, which integrates Optimal Transport (OT) as the distance metric for nearest neighbor retrieval. The development of RAM-MIL is driven by two key insights. First, a theoretical discovery indicates that reducing the input's intrinsic dimension can minimize the approximation error in attention-based MIL. Second, previous studies highlight a link between input intrinsic dimension and the feature merging process with the retrieved data. Empirical evaluations conducted on WSI classification demonstrate that the proposed RAM-MIL framework achieves state-of-the-art performance in both in-domain scenarios, where the training and retrieval data are in the same domain, and more crucially, in out-of-domain scenarios, where the (unlabeled) retrieval data originates from a different domain. Furthermore, the use of the transportation matrix derived from OT renders the retrieval results interpretable at the instance level, in contrast to the vanilla $l_2$ distance, and allows for visualization for human experts. Code can be found at \url{https://github.com/ralphc1212/ram-mil}.

On the Exploration of Local Significant Differences For Two-Sample Test
Zhijian Zhou Jie Ni Jia-He Yao Wei Gao



Research question: This paper explores methods for discovering local significant differences in two-sample testing.
Motivation: Two-sample testing has attracted much attention and broad practical use in recent years, yet the exploration of local significant differences remains underdeveloped.
Method: Proposes ME$_\text{MaBiD}$, an effective two-sample test whose basic idea is to exploit local information via multiple Mahalanobis kernels and to introduce bi-directional hypotheses for testing; to explore local significant differences, the embedding space is first partitioned into rectangular regions via a new splitting criterion tied to test power and data correlation, and local differences are then explored with bi-directional masked $p$-values together with the ME$_\text{MaBiD}$ test.
Results: Theoretically, the paper presents the asymptotic distribution and lower bounds on test power for the ME$_\text{MaBiD}$ test and controls the familywise error rate for the exploration of local significant differences; extensive experiments validate the effectiveness of the methods on two-sample testing and local-difference exploration.

Recent years have witnessed increasing attention on two-sample testing with diverse real applications, while this work takes one more step on the exploration of local significant differences for two-sample testing. We propose the ME$_\text{MaBiD}$, an effective test for two-sample testing, and the basic idea is to exploit local information by multiple Mahalanobis kernels and introduce bi-directional hypothesis for testing. On the exploration of local significant differences, we first partition the embedding space into several rectangle regions via a new splitting criterion, which is relevant to test power and data correlation. We then explore local significant differences based on our bi-directional masked $p$-value together with the ME$_\text{MaBiD}$ test. Theoretically, we present the asymptotic distribution and lower bounds of test power for our ME$_\text{MaBiD}$ test, and control the familywise error rate on the exploration of local significant differences. We finally conduct extensive experiments to validate the effectiveness of our proposed methods on two-sample testing and the exploration of local significant differences.

Meta-Learning with Neural Bandit Scheduler
Yunzhe Qi Yikun Ban Tianxin Wei Jiaru Zou Huaxiu Yao Jingrui He



Research question: This paper aims to optimize task scheduling strategies in meta-learning to improve the generalization ability of the meta-model.
Motivation: Existing task scheduling strategies rely mainly on pre-defined sampling protocols or assumed task-model correlations, which can create performance bottlenecks for the meta-model.
Method: Proposes BASS, a novel task scheduling framework under the contextual bandits setting that directly optimizes the scheduling strategy based on the status of the meta-model.
Results: By balancing exploration and exploitation in meta-learning task scheduling, BASS copes with limited knowledge of the task distribution early in meta-training while adaptively exploring potential benefits for forthcoming meta-training iterations; theoretical analysis and extensive experiments demonstrate the framework's effectiveness.

Meta-learning has been proven an effective learning paradigm for training machine learning models with good generalization ability. Apart from the common practice of uniformly sampling the meta-training tasks, existing methods working on task scheduling strategies are mainly based on pre-defined sampling protocols or the assumed task-model correlations, and greedily make scheduling decisions, which can lead to sub-optimal performance bottlenecks of the meta-model. In this paper, we propose a novel task scheduling framework under Contextual Bandits settings, named BASS, which directly optimizes the task scheduling strategy based on the status of the meta-model. By balancing the exploitation and exploration in meta-learning task scheduling, BASS can help tackle the challenge of limited knowledge about the task distribution during the early stage of meta-training, while simultaneously exploring potential benefits for forthcoming meta-training iterations through an adaptive exploration strategy. Theoretical analysis and extensive experiments are presented to show the effectiveness of our proposed framework.

Minimax Forward and Backward Learning of Evolving Tasks with Performance Guarantees
Veronica Alvarez Santiago Mazuelas Jose A. Lozano



Research question: How to effectively handle sequences of classification tasks arriving over time, particularly when consecutive tasks become increasingly similar.
Motivation: For sequentially arriving classification tasks with growing similarity, existing continual learning and concept-drift adaptation techniques often fall short.
Method: Presents incremental minimax risk classifiers (IMRCs), which effectively exploit forward and backward learning and account for evolving tasks.
Results: Experiments show that IMRCs can deliver significant performance improvements, especially with reduced sample sizes.

For a sequence of classification tasks that arrive over time, it is common that tasks are evolving in the sense that consecutive tasks often have a higher similarity. The incremental learning of a growing sequence of tasks holds promise to enable accurate classification even with few samples per task by leveraging information from all the tasks in the sequence (forward and backward learning). However, existing techniques developed for continual learning and concept drift adaptation are either designed for tasks with time-independent similarities or only aim to learn the last task in the sequence. This paper presents incremental minimax risk classifiers (IMRCs) that effectively exploit forward and backward learning and account for evolving tasks. In addition, we analytically characterize the performance improvement provided by forward and backward learning in terms of the tasks’ expected quadratic change and the number of tasks. The experimental evaluation shows that IMRCs can result in a significant performance improvement, especially for reduced sample sizes.

Rethinking Semi-Supervised Imbalanced Node Classification from Bias-Variance Decomposition
Liang Yan Gengchen Wei Chen Yang Shengzhong Zhang Zengfeng Huang



Research question: This paper addresses class imbalance faced by graph neural networks (GNNs) when learning on graph-structured data.
Motivation: Existing methods handle class imbalance on graph data inadequately; this work integrates imbalanced node classification with bias-variance decomposition, establishing a theoretical framework that closely ties data imbalance to model variance.
Method: Uses graph augmentation to estimate the variance and designs a regularization term to mitigate the impact of imbalance.
Results: Exhaustive tests on multiple benchmarks, including naturally imbalanced datasets and public-split class-imbalanced datasets, show the approach outperforms state-of-the-art methods across imbalanced scenarios, providing a new theoretical perspective on imbalanced node classification in GNNs.

This paper introduces a new approach to address the issue of class imbalance in graph neural networks (GNNs) for learning on graph-structured data. Our approach integrates imbalanced node classification and Bias-Variance Decomposition, establishing a theoretical framework that closely relates data imbalance to model variance. We also leverage graph augmentation technique to estimate the variance and design a regularization term to alleviate the impact of imbalance. Exhaustive tests are conducted on multiple benchmarks, including naturally imbalanced datasets and public-split class-imbalanced datasets, demonstrating that our approach outperforms state-of-the-art methods in various imbalanced scenarios. This work provides a novel theoretical perspective for addressing the problem of imbalanced node classification in GNNs.

CLeAR: Continual Learning on Algorithmic Reasoning for Human-like Intelligence
Bong Gyun Kang HyunGi Kim Dahuin Jung Sungroh Yoon



Research question: This paper addresses the scarcity of continual learning (CL) research on abstract logical concepts such as counting, sorting, and arithmetic.
Motivation: Humans are excellent continual learners who gradually acquire such abstract concepts in the real world, yet most CL research targets structured data such as images, leaving continual learning of abstract logical concepts under-explored.
Method: Introduces CLeAR, the first algorithmic reasoning (AR) methodology for continual tasks over abstract concepts; it proposes a one-to-many mapping of input distributions to a shared mapping space, allowing the alignment of tasks of different dimensions that share semantics.
Results: In experiments spanning 15 tasks across levels of the Chomsky hierarchy, from in-hierarchy to inter-hierarchy scenarios, CLeAR achieves near-zero forgetting and even improves accuracy on subsequent tasks (backward transfer), whereas previous CL methods designed for image classification fare poorly.

Continual learning (CL) aims to incrementally learn multiple tasks that are presented sequentially. The significance of CL lies not only in the practical importance but also in studying the learning mechanisms of humans who are excellent continual learners. While most research on CL has been done on structured data such as images, there is a lack of research on CL for abstract logical concepts such as counting, sorting, and arithmetic, which humans learn gradually over time in the real world. In this work, for the first time, we introduce novel algorithmic reasoning (AR) methodology for continual tasks of abstract concepts: CLeAR. Our methodology proposes a one-to-many mapping of input distribution to a shared mapping space, which allows the alignment of various tasks of different dimensions and shared semantics. Our tasks of abstract logical concepts, in the form of formal language, can be classified into Chomsky hierarchies based on their difficulty. In this study, we conducted extensive experiments consisting of 15 tasks with various levels of Chomsky hierarchy, ranging from in-hierarchy to inter-hierarchy scenarios. CLeAR not only achieved near zero forgetting but also improved accuracy on subsequent tasks, a phenomenon known as backward transfer, while previous CL methods designed for image classification drastically failed.

Interpretable Prototype-based Graph Information Bottleneck
Sangwoo Seo Sungwon Kim Chanyoung Park



Research question: How to improve the explainability of graph neural networks (GNNs) and make their prediction process transparent.
Motivation: Existing model explanation methods often extract excessive information from the whole graph, leading to the exclusion of key substructures or the inclusion of irrelevant ones, which limits both interpretability and downstream performance.
Method: Proposes interpretable Prototype-based Graph Information Bottleneck (PGIB), a novel explainable GNN framework that incorporates prototype learning into the information bottleneck framework, providing prototypes with the key subgraph of the input graph relevant to the model's prediction.
Results: Experiments show PGIB outperforms existing methods in both prediction performance and explainability.

The success of Graph Neural Networks (GNNs) has led to a need for understanding their decision-making process and providing explanations for their predictions, which has given rise to explainable AI (XAI) that offers transparent explanations for black-box models. Recently, the use of prototypes has successfully improved the explainability of models by learning prototypes to imply training graphs that affect the prediction. However, these approaches tend to provide prototypes with excessive information from the entire graph, leading to the exclusion of key substructures or the inclusion of irrelevant substructures, which can limit both the interpretability and the performance of the model in downstream tasks. In this work, we propose a novel framework of explainable GNNs, called interpretable Prototype-based Graph Information Bottleneck (PGIB) that incorporates prototype learning within the information bottleneck framework to provide prototypes with the key subgraph from the input graph that is important for the model prediction. This is the first work that incorporates prototype learning into the process of identifying the key subgraphs that have a critical impact on the prediction performance. Extensive experiments, including qualitative analysis, demonstrate that PGIB outperforms state-of-the-art methods in terms of both prediction performance and explainability.

Back-Modality: Leveraging Modal Transformation for Data Augmentation
Zhi Li Yifan Liu Yin Zhang



Research question: How to perform cross-modal data augmentation with Back-Modality, a novel augmentation schema based on modal transformation.
Motivation: Existing data augmentation methods mostly target a single modality and lack effective handling of cross-modal data.
Method: Transforms data from the initial modality to an intermediate modality and then back; augmentation techniques suited to the intermediate modality can also be applied there to further enhance the initial modality's data.
Results: Comprehensive evaluations on image classification, sentiment classification, and textual entailment show the methods markedly improve performance under data-scarce conditions.

We introduce Back-Modality, a novel data augmentation schema predicated on modal transformation. Data from an initial modality undergoes transformation to an intermediate modality, followed by a reverse transformation. This framework serves dual roles. On one hand, it operates as a general data augmentation strategy. On the other hand, it allows for other augmentation techniques, suitable for the intermediate modality, to enhance the initial modality. For instance, data augmentation methods applicable to pure text can be employed to augment images, thereby facilitating the cross-modality of data augmentation techniques. To validate the viability and efficacy of our framework, we proffer three instantiations of Back-Modality: back-captioning, back-imagination, and back-speech. Comprehensive evaluations across tasks such as image classification, sentiment classification, and textual entailment demonstrate that our methods consistently enhance performance under data-scarce circumstances.
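
The schema reduces to a simple composition, sketched below for back-captioning. All three callables are placeholders for whatever captioner, text augmenter, and text-to-image generator one plugs in; none are real APIs.

```python
from typing import Any, Callable, List

def back_modality(x: Any,
                  to_intermediate: Callable[[Any], Any],
                  augment: Callable[[Any], List[Any]],
                  from_intermediate: Callable[[Any], Any]) -> List[Any]:
    """Generic Back-Modality augmentation: map to an intermediate
    modality, optionally augment there, then map back."""
    intermediate = to_intermediate(x)
    variants = augment(intermediate) or [intermediate]
    return [from_intermediate(v) for v in variants]

# Back-captioning instantiation; every callable below is a placeholder:
#   augmented_images = back_modality(
#       image,
#       to_intermediate=image_to_text,    # e.g. a captioning model
#       augment=synonym_replace,          # any pure-text augmentation
#       from_intermediate=text_to_image,  # e.g. a text-to-image generator
#   )
```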

Evolving Standardization for Continual Domain Generalization over Temporal Drift
Mixue Xie Shuang Li Longhui Yuan Chi Harold Liu Zehui Dai



Research question: How to train models that adapt to gradually shifting data distributions, especially as new domains continually emerge.
Motivation: Existing domain generalization methods mainly target offline, discrete scenarios, whereas real-world distributions may shift gradually for various reasons (e.g., the passage of time) with new domains continually emerging, calling for more efficient approaches.
Method: Formulates Continual Domain Generalization over Temporal Drift (CDGTD) and proposes an Evolving Standardization (EvoS) method, which learns the evolving pattern of feature distributions at multiple scales and standardizes features with generated statistics of the corresponding domain to mitigate distribution shift.
Results: Experiments on multiple real-world datasets validate the efficacy of EvoS.

The capability of generalizing to out-of-distribution data is crucial for the deployment of machine learning models in the real world. Existing domain generalization (DG) mainly embarks on offline and discrete scenarios, where multiple source domains are simultaneously accessible and the distribution shift among domains is abrupt and violent. Nevertheless, such setting may not be universally applicable to all real-world applications, as there are cases where the data distribution gradually changes over time due to various factors, e.g., the process of aging. Additionally, as the domain constantly evolves, new domains will continually emerge. Re-training and updating models with both new and previous domains using existing DG methods can be resource-intensive and inefficient. Therefore, in this paper, we present a problem formulation for Continual Domain Generalization over Temporal Drift (CDGTD). CDGTD addresses the challenge of gradually shifting data distributions over time, where domains arrive sequentially and models can only access the data of the current domain. The goal is to generalize to unseen domains that are not too far into the future. To this end, we propose an Evolving Standardization (EvoS) method, which characterizes the evolving pattern of feature distribution and mitigates the distribution shift by standardizing features with generated statistics of corresponding domain. Specifically, inspired by the powerful ability of transformers to model sequence relations, we design a multi-scale attention module (MSAM) to learn the evolving pattern under sliding time windows of different lengths. MSAM can generate statistics of current domain based on the statistics of previous domains and the learned evolving pattern. Experiments on multiple real-world datasets including images and texts validate the efficacy of our EvoS.

Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift
Saurabh Garg Amrith Setlur Zachary Chase Lipton Sivaraman Balakrishnan Virginia Smith Aditi Raghunathan



Research question: This study explores the effectiveness of combining self-training and contrastive learning on unlabeled data, both under distribution shift (unsupervised domain adaptation) and without it (semi-supervised learning).
Motivation: Although self-training and contrastive learning are both popular and compatible, the effectiveness of their combination remains underexplored.
Method: A systematic empirical investigation finds that in domain adaptation settings self-training and contrastive learning offer significant complementary gains, whereas in semi-supervised settings, surprisingly, the combination is not synergistic.
Results: Across eight distribution-shift datasets (e.g., BREEDs, WILDS), the combined method obtains 3-8% higher accuracy than either approach alone; theoretical analysis in a simplified model of distribution shift reveals scenarios where, even when either method alone would fail, the features produced by contrastive learning provide a good initialization for self-training, further amplifying gains and achieving optimal performance.

Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains surprisingly unexplored. In this paper, we first undertake a systematic empirical investigation of this combination, finding (i) that in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) that in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8\% higher accuracy than either approach independently. Finally, we theoretically analyze these techniques in a simplified model of distribution shift demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.

Cluster-aware Semi-supervised Learning: Relational Knowledge Distillation Provably Learns Clustering
Yijun Dong Kevin Miller Qi Lei Rachel Ward



Research question: Despite its empirical success and practical significance, relational knowledge distillation lacks theoretical explanation.
Motivation: This work takes an initial step toward a theoretical understanding of relational knowledge distillation (RKD), focusing on semi-supervised classification problems.
Method: Casts RKD as spectral clustering on a population-induced graph unveiled by the teacher model; via a notion of clustering error that quantifies the discrepancy between predicted and ground-truth clusterings, it shows that RKD over the population provably achieves low clustering error.
Results: For semi-supervised learning, the label efficiency of RKD is further demonstrated through a general cluster-aware semi-supervised learning framework that assumes low clustering error; finally, unifying data-augmentation consistency regularization into this cluster-aware framework shows that, despite the common effect of learning accurate clusterings, RKD promotes a "global" perspective via spectral clustering, whereas consistency regularization focuses on a "local" perspective via expansion.

Despite the empirical success and practical significance of (relational) knowledge distillation that matches (the relations of) features between teacher and student models, the corresponding theoretical interpretations remain limited for various knowledge distillation paradigms. In this work, we take an initial step toward a theoretical understanding of relational knowledge distillation (RKD), with a focus on semi-supervised classification problems. We start by casting RKD as spectral clustering on a population-induced graph unveiled by a teacher model. Via a notion of clustering error that quantifies the discrepancy between the predicted and ground truth clusterings, we illustrate that RKD over the population provably leads to low clustering error. Moreover, we provide a sample complexity bound for RKD with limited unlabeled samples. For semi-supervised learning, we further demonstrate the label efficiency of RKD through a general framework of cluster-aware semi-supervised learning that assumes low clustering errors. Finally, by unifying data augmentation consistency regularization into this cluster-aware framework, we show that despite the common effect of learning accurate clusterings, RKD facilitates a "global" perspective through spectral clustering, whereas consistency regularization focuses on a "local" perspective via expansion.

Fair Graph Distillation
Qizhang Feng Zhimeng Jiang Ruiquan Li Yicheng Wang Na Zou Jiang Bian Xia Hu



Research question: How to distill a large real graph into a small graph that remains both informative and fair for training graph neural networks (GNNs).
Motivation: GNNs trained on distilled graphs can exhibit more severe group fairness problems than those trained on real graphs, and nodes in distilled graphs lack sensitive attributes, making most debiasing methods (e.g., regularization and adversarial debiasing) intractable.
Method: Proposes fair graph distillation, built on a simple yet effective bias metric for distilled graphs, called coherence, and a bi-level optimization framework based on it.
Results: Extensive experiments show better prediction performance-fairness trade-offs across various datasets and GNN architectures.

As graph neural networks (GNNs) struggle with large-scale graphs due to high computational demands, data distillation for graph data promises to alleviate this issue by distilling a large real graph into a smaller distilled graph while maintaining comparable prediction performance for GNNs trained on both graphs. However, we observe that GNNs trained on distilled graphs may exhibit more severe group fairness problems than those trained on real graphs. Motivated by this observation, we propose \textit{fair graph distillation}, an approach for generating small distilled \textit{fair and informative} graphs based on the graph distillation method. The challenge lies in the deficiency of sensitive attributes for nodes in the distilled graph, making most debiasing methods (e.g., regularization and adversarial debiasing) intractable for distilled graphs. We develop a simple yet effective bias metric, called coherence, for distilled graphs. Based on the proposed coherence metric, we introduce a framework for fair graph distillation using a bi-level optimization algorithm. Extensive experiments demonstrate that the proposed algorithm can achieve better prediction performance-fairness trade-offs across various datasets and GNN architectures.

Direct Diffusion Bridge using Data Consistency for Inverse Problems
Hyungjin Chung Jeongsol Kim Jong Chul Ye



Research question: Diffusion-based inverse problem solvers are limited in speed, as they require reverse diffusion sampling starting from noise.
Motivation: To address this, several recent works build diffusion processes that directly bridge clean and corrupted data for specific inverse problems.
Method: The paper first unifies these existing works under the name Direct Diffusion Bridges (DDB), noting that despite different motivating theories the resulting algorithms differ only in parameter choices; it then proposes a modified inference procedure that enforces data consistency without fine-tuning.
Results: The resulting data-Consistent DDB (CDDB) outperforms its inconsistent counterpart on both perception and distortion metrics, effectively pushing the Pareto frontier toward the optimum, and achieves state-of-the-art results on both evaluation criteria.

Diffusion model-based inverse problem solvers have shown impressive performance, but are limited in speed, mostly as they require reverse diffusion sampling starting from noise. Several recent works have tried to alleviate this problem by building a diffusion process, directly bridging the clean and the corrupted for specific inverse problems. In this paper, we first unify these existing works under the name Direct Diffusion Bridges (DDB), showing that while motivated by different theories, the resulting algorithms only differ in the choice of parameters. Then, we highlight a critical limitation of the current DDB framework, namely that it does not ensure data consistency. To address this problem, we propose a modified inference procedure that imposes data consistency without the need for fine-tuning. We term the resulting method data Consistent DDB (CDDB), which outperforms its inconsistent counterpart in terms of both perception and distortion metrics, thereby effectively pushing the Pareto-frontier toward the optimum. Our proposed method achieves state-of-the-art results on both evaluation criteria, showcasing its superiority over existing methods. Code is open-sourced [here](https://github.com/HJ-harry/CDDB).
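
A generic data-consistency correction of the kind such samplers interleave is a gradient step on the measurement error; the toy sketch below shows that mechanism only and is not the specific CDDB correction.

```python
import torch

def data_consistency_step(x, y, forward_op, step=1.0):
    """Generic data-consistency correction used in inverse-problem
    samplers: one gradient step on ||y - A(x)||^2. CDDB's correction
    is more specific; this only illustrates the generic mechanism."""
    x = x.detach().requires_grad_(True)
    residual = (forward_op(x) - y).pow(2).sum()
    grad, = torch.autograd.grad(residual, x)
    return (x - step * grad).detach()

# Toy inverse problem: A is a fixed linear degradation operator.
A = torch.randn(8, 8) * 0.3
forward_op = lambda z: z @ A.T
x_true = torch.randn(4, 8)
y = forward_op(x_true)                    # observed measurements
x = torch.zeros(4, 8)                     # stand-in for a sampler state
for _ in range(200):
    x = data_consistency_step(x, y, forward_op, step=0.05)
print(float((forward_op(x) - y).abs().max()))   # measurement error shrinks
```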

Error Discovery By Clustering Influence Embeddings
Fulton Wang Julius Adebayo Sarah Tan Diego Garcia-Olano Narine Kokhlikyan



Research question: How to discover groups of test examples, called slices, on which a model under-performs.
Motivation: Finding the examples on which a model under-performs is needed to improve its predictive accuracy and interpretability.
Method: Proposes InfEmbed, a new method that satisfies a coherence requirement by applying K-Means clustering to a novel representation called influence embeddings.
Results: Experiments show InfEmbed outperforms the current state of the art on two benchmarks and supports effective model debugging across several case studies.

We present a method for identifying groups of test examples---slices---on which a model under-performs, a task now known as slice discovery. We formalize coherence---a requirement that erroneous predictions, within a slice, should be wrong for the same reason---as a key property that any slice discovery method should satisfy. We then use influence functions to derive a new slice discovery method, InfEmbed, which satisfies coherence by returning slices whose examples are influenced similarly by the training data. InfEmbed is simple, and consists of applying K-Means clustering to a novel representation we deem influence embeddings. We show InfEmbed outperforms current state-of-the-art methods on 2 benchmarks, and is effective for model debugging across several case studies.
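
The method is easy to approximate with a last-layer surrogate: per-example gradients at the final linear layer factor as (softmax minus one-hot) outer-product penultimate features, and K-Means on those vectors yields candidate slices. A sketch under those simplifying assumptions, as a cheap stand-in for true influence embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

def influence_embedding_slices(feats, probs, labels, n_slices=5, seed=0):
    """feats: (n, d) penultimate features; probs: (n, C) softmax outputs.
    Last-layer per-example loss gradients factor as outer(err, feat) with
    err = probs - onehot(labels); we flatten that factorized form as a
    surrogate for influence embeddings, then cluster it with K-Means."""
    n, c = probs.shape
    err = probs - np.eye(c)[labels]                 # (n, C)
    emb = np.einsum('nc,nd->ncd', err, feats).reshape(n, -1)
    km = KMeans(n_clusters=n_slices, n_init=10, random_state=seed)
    return km.fit_predict(emb)                      # slice id per example

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 8))
logits = rng.normal(size=(200, 3))
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
labels = rng.integers(0, 3, size=200)
print(np.bincount(influence_embedding_slices(feats, probs, labels)))
```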

Understanding Few-Shot Learning: Measuring Task Relatedness and Adaptation Difficulty via Attributes
Minyang Hu Hong Chang Zong Guo Bingpeng Ma Shiguang Shan Xilin CHEN



Research question: This paper seeks to understand few-shot learning (FSL) by exploring two key questions: (1) how to quantify the relationship between training tasks and novel tasks, and (2) how that relationship affects the adaptation difficulty of different models on novel tasks.
Motivation: FSL aims to learn novel tasks from very few labeled samples by leveraging experience from related training tasks, but how to quantify task relatedness and understand its effect on model adaptation remains open.
Method: Proposes Task Attribute Distance (TAD), a metric that quantifies task relatedness via attributes, and establishes a theoretical connection between task relatedness and task adaptation difficulty by deriving a generalization error bound on novel tasks, revealing how TAD measures adaptation difficulty for different models.
Results: Experiments confirm that the TAD metric effectively quantifies task relatedness and reflects the adaptation difficulty of various FSL methods on novel tasks; code is available at https://github.com/hu-my/TaskAttributeDistance.

Few-shot learning (FSL) aims to learn novel tasks with very few labeled samples by leveraging experience from \emph{related} training tasks. In this paper, we try to understand FSL by exploring two key questions: (1) How to quantify the relationship between \emph{ training} and \emph{novel} tasks? (2) How does the relationship affect the \emph{adaptation difficulty} on novel tasks for different models? To answer the first question, we propose Task Attribute Distance (TAD) as a metric to quantify the task relatedness via attributes. Unlike other metrics, TAD is independent of models, making it applicable to different FSL models. To address the second question, we utilize TAD metric to establish a theoretical connection between task relatedness and task adaptation difficulty. By deriving the generalization error bound on a novel task, we discover how TAD measures the adaptation difficulty on novel tasks for different models. To validate our theoretical results, we conduct experiments on three benchmarks. Our experimental results confirm that TAD metric effectively quantifies the task relatedness and reflects the adaptation difficulty on novel tasks for various FSL methods, even if some of them do not learn attributes explicitly or human-annotated attributes are not provided. Our code is available at \href{https://github.com/hu-my/TaskAttributeDistance}{https://github.com/hu-my/TaskAttributeDistance}.
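
A toy computation in the spirit of TAD: represent each task by its empirical attribute frequencies and average per-attribute distances between two tasks. The paper's definition works with class-conditional attribute distributions; treat this as an illustrative simplification.

```python
import numpy as np

def task_attribute_distance(attrs_a, attrs_b):
    """attrs_a, attrs_b: (n_samples, n_attributes) binary attribute
    matrices for two tasks. Compares per-attribute frequencies via total
    variation distance, averaged over attributes."""
    return np.abs(attrs_a.mean(axis=0) - attrs_b.mean(axis=0)).mean()

rng = np.random.default_rng(0)
task_a = (rng.random((50, 10)) < 0.3).astype(float)
task_b = (rng.random((50, 10)) < 0.7).astype(float)   # dissimilar task
task_c = (rng.random((50, 10)) < 0.32).astype(float)  # similar task
print(task_attribute_distance(task_a, task_b))  # larger -> harder to adapt
print(task_attribute_distance(task_a, task_c))  # smaller -> easier to adapt
```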

Federated Learning with Bilateral Curation for Partially Class-Disjoint Data
Ziqing Fan Ruipeng Zhang Jiangchao Yao Bo Han Ya Zhang Yanfeng Wang



Research question: This paper tackles partially class-disjoint data (PCDD) in federated learning, where each client contributes samples from only a subset of classes, severely challenging the performance of federated algorithms.
Motivation: Without full classes, local objectives contradict the global objective, causing an angle collapse problem for locally missing classes and a space waste problem for locally existing classes; no existing method intrinsically resolves PCDD with holistic improvements in both the global and local views of federated learning.
Method: Inspired by the strong generalization of the simplex Equiangular Tight Frame (ETF) on imbalanced data, proposes FedGELA, in which the classifier is globally fixed as a simplex ETF while locally adapted to personal distributions; globally, FedGELA provides fair and equal discrimination for all classes and avoids inaccurate classifier updates, while locally it reuses the space of locally missing classes for locally existing ones.
Results: Extensive experiments on a range of datasets show promising performance (average improvements of 3.9% over FedAvg and 1.5% over the best baselines) together with both local and global convergence guarantees.

Partially class-disjoint data (PCDD), a common yet under-explored data formation where each client contributes a part of classes (instead of all classes) of samples, severely challenges the performance of federated algorithms. Without full classes, the local objective will contradict the global objective, yielding the angle collapse problem for locally missing classes and the space waste problem for locally existing classes. As far as we know, none of the existing methods can intrinsically mitigate PCDD challenges to achieve holistic improvement in the bilateral views (both global view and local view) of federated learning. To address this dilemma, we are inspired by the strong generalization of simplex Equiangular Tight Frame (ETF) on the imbalanced data, and propose a novel approach called FedGELA where the classifier is globally fixed as a simplex ETF while locally adapted to the personal distributions. Globally, FedGELA provides fair and equal discrimination for all classes and avoids inaccurate updates of the classifier, while locally it utilizes the space of locally missing classes for locally existing classes. We conduct extensive experiments on a range of datasets to demonstrate that our FedGELA achieves promising performance (averaged improvement of 3.9% to FedAvg and 1.5% to best baselines) and provide both local and global convergence guarantees.
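
The globally fixed classifier has a closed form: a simplex equiangular tight frame is a centered, scaled identity rotated into feature space. A minimal construction, where the orthonormal-column matrix $U$ is an arbitrary choice:

```python
import numpy as np

def simplex_etf(n_classes, feat_dim, seed=0):
    """Return a (feat_dim, n_classes) simplex ETF classifier
        W = sqrt(C / (C - 1)) * U @ (I_C - (1/C) * 1 1^T),
    with U having orthonormal columns (requires feat_dim >= n_classes).
    All class vectors get equal norm and pairwise cosine -1/(C-1)."""
    c = n_classes
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.normal(size=(feat_dim, c)))   # orthonormal cols
    center = np.eye(c) - np.ones((c, c)) / c
    return np.sqrt(c / (c - 1)) * u @ center

w = simplex_etf(n_classes=4, feat_dim=16)
gram = w.T @ w
print(np.diag(gram).round(3))   # all ones: equal class-vector norms
print(gram[0, 1].round(3))      # -1/(C-1) = -0.333: equiangular spread
```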

Learning From Biased Soft Labels
Hua Yuan Yu Shi Ning Xu Xu Yang Xin Geng Yong Rui



Research question: This paper studies the effectiveness of biased soft labels, i.e., whether soft labels remain effective when the teacher-generated soft labels deviate from the ground-truth labels.
Motivation: Knowledge distillation sparked interest in the "dark knowledge" hidden in teacher-generated soft labels, but existing theories implicitly require the soft labels to be close to the ground truth; this paper instead investigates whether biased soft labels are still effective.
Method: Presents two indicators to measure soft-label effectiveness and, based on them, proposes moderate conditions ensuring the biased soft label learning problem is both classifier-consistent and Empirical Risk Minimization (ERM) learnable, applicable even to soft labels with large bias; also designs a heuristic to train Skillful but Bad Teachers (SBTs), whose accuracy is below 30% yet whose students reach over 90% accuracy on CIFAR-10, comparable to models trained on the original data.
Results: Experiments show the proposed indicators adequately measure the effectiveness of the biased soft labels generated in this process; moreover, the theoretical framework extends to explain soft-label effectiveness in weakly supervised paradigms, including incomplete supervision, partial label learning, and learning with noise.

Since the advent of knowledge distillation, many researchers have been intrigued by the $\textit{dark knowledge}$ hidden in the soft labels generated by the teacher model. This prompts us to scrutinize the circumstances under which these soft labels are effective. Predominant existing theories implicitly require that the soft labels are close to the ground-truth labels. In this paper, however, we investigate whether biased soft labels are still effective. Here, bias refers to the discrepancy between the soft labels and the ground-truth labels. We present two indicators to measure the effectiveness of the soft labels. Based on the two indicators, we propose moderate conditions to ensure that, the biased soft label learning problem is both $\textit{classifier-consistent}$ and $\textit{Empirical Risk Minimization}$ (ERM) $\textit{learnable}$, which can be applicable even for large-biased soft labels. We further design a heuristic method to train Skillful but Bad Teachers (SBTs), and these teachers with accuracy less than 30\% can teach students to achieve accuracy over 90\% on CIFAR-10, which is comparable to models trained on the original data. The proposed indicators adequately measure the effectiveness of the soft labels generated in this process. Moreover, our theoretical framework can be adapted to elucidate the effectiveness of soft labels in three weakly-supervised learning paradigms, namely incomplete supervision, partial label learning and learning with additive noise. Experimental results demonstrate that our indicators can measure the effectiveness of biased soft labels generated by teachers or in these weakly-supervised learning paradigms.

Does Invariant Graph Learning via Environment Augmentation Learn Invariance?
Yongqiang Chen Yatao Bian Kaiwen Zhou Binghui Xie Bo Han James Cheng



Research question: How to learn invariant graph representations via environment augmentation for out-of-distribution generalization on graphs.
Motivation: Graph environment partitions are usually expensive to obtain, so augmenting environment information has become the de facto approach, yet the usefulness of the augmented environment information has never been verified.
Method: Develops a set of minimal assumptions, including variation sufficiency and variation consistency, for feasible invariant graph learning, and proposes a new framework, Graph invAriant Learning Assistant (GALA), which introduces an assistant model required to be sensitive to graph environment changes or distribution shifts; the correctness of the assistant's proxy predictions can then differentiate variations in spurious subgraphs.
Results: Extensive experiments on datasets with various graph distribution shifts, including DrugOOD, confirm GALA's effectiveness.

Invariant graph representation learning aims to learn the invariance among data from different environments for out-of-distribution generalization on graphs. As the graph environment partitions are usually expensive to obtain, augmenting the environment information has become the de facto approach. However, the usefulness of the augmented environment information has never been verified. In this work, we find that it is fundamentally impossible to learn invariant graph representations via environment augmentation without additional assumptions. Therefore, we develop a set of minimal assumptions, including variation sufficiency and variation consistency, for feasible invariant graph learning. We then propose a new framework Graph invAriant Learning Assistant (GALA). GALA incorporates an assistant model that needs to be sensitive to graph environment changes or distribution shifts. The correctness of the proxy predictions by the assistant model hence can differentiate the variations in spurious subgraphs. We show that extracting the maximally invariant subgraph to the proxy predictions provably identifies the underlying invariant subgraph for successful OOD generalization under the established minimal assumptions. Extensive experiments on datasets including DrugOOD with various graph distribution shifts confirm the effectiveness of GALA.

Understanding and Improving Feature Learning for Out-of-Distribution Generalization
Yongqiang Chen Wei Huang Kaiwen Zhou Yatao Bian Bo Han James Cheng



Research question: This paper addresses poor generalization of models on out-of-distribution (OOD) data.
Motivation: While some work holds that models trained with empirical risk minimization (ERM) may learn spurious, non-invariant features, recent studies challenge this, suggesting deep networks may already learn sufficiently good features for OOD generalization.
Method: Theoretical analysis shows that ERM actually learns both spurious and invariant features, and tends to learn spurious features faster when the spurious correlation is stronger; the authors therefore propose Feature Augmented Training (FeAT), which iteratively augments the model to learn new features while retaining those already learned, improving OOD generalization.
Results: Experiments show that FeAT effectively learns richer features, boosting the performance of various OOD objectives.

A common explanation for the failure of out-of-distribution (OOD) generalization is that the model trained with empirical risk minimization (ERM) learns spurious features instead of invariant features. However, several recent studies challenged this explanation and found that deep networks may have already learned sufficiently good features for OOD generalization. Despite the contradictions at first glance, we theoretically show that ERM essentially learns both spurious and invariant features, while ERM tends to learn spurious features faster if the spurious correlation is stronger. Moreover, when fed the ERM learned features to the OOD objectives, the invariant feature learning quality significantly affects the final OOD performance, as OOD objectives rarely learn new features. Therefore, ERM feature learning can be a bottleneck to OOD generalization. To alleviate the reliance, we propose Feature Augmented Training (FeAT), to enforce the model to learn richer features ready for OOD generalization. FeAT iteratively augments the model to learn new features while retaining the already learned features. In each round, the retention and augmentation operations are performed on different subsets of the training data that capture distinct features. Extensive experiments show that FeAT effectively learns richer features thus boosting the performance of various OOD objectives.

Label Correction of Crowdsourced Noisy Annotations with an Instance-Dependent Noise Transition Model
Hui Guo Boyu Wang Grace Yi



Research question: How to effectively aggregate crowdsourced annotations from annotators with diverse expertise to improve the predictive ability of supervised learning algorithms.
Motivation: Existing methods typically use annotator-specific, instance-independent noise transition matrices to characterize each annotator's labeling skill, which cannot accurately capture instance-dependent noise.
Method: The noise transition model is formulated in a Bayesian framework and a new label correction algorithm is designed; specifically, the instance-dependent noise transition matrices are approximated by a Bayesian network with a hierarchical spike-and-slab prior.
Results: Experiments on benchmark and real-world datasets validate the effectiveness of the method.

The predictive ability of supervised learning algorithms hinges on the quality of annotated examples, whose labels often come from multiple crowdsourced annotators with diverse expertise. To aggregate noisy crowdsourced annotations, many existing methods employ an annotator-specific instance-independent noise transition matrix to characterize the labeling skills of each annotator. Learning an instance-dependent noise transition model, however, is challenging and remains relatively less explored. To address this problem, in this paper, we formulate the noise transition model in a Bayesian framework and subsequently design a new label correction algorithm. Specifically, we approximate the instance-dependent noise transition matrices using a Bayesian network with a hierarchical spike and slab prior. To theoretically characterize the distance between the noise transition model and the true instance-dependent noise transition matrix, we provide a posterior-concentration theorem that ensures the posterior consistency in terms of the Hellinger distance. We further formulate the label correction process as a hypothesis testing problem and propose a novel algorithm to infer the true label from the noisy annotations based on the pairwise likelihood ratio test. Moreover, we establish an information-theoretic bound on the Bayes error for the proposed method. We validate the effectiveness of our approach through experiments on benchmark and real-world datasets.

Joint Data-Task Generation for Auxiliary Learning
Hong Chen Xin Wang Yuwei Zhou Yijian Qin Chaoyu Guan Wenwu Zhu



Research question: Existing auxiliary learning methods mainly reweigh losses over manually collected auxiliary data and tasks, which heavily relies on domain knowledge during data collection that may be hard to obtain in practice.
Motivation: Current methods can become ineffective, or even harm the primary task, when unhelpful auxiliary data and tasks are employed. To tackle this, a joint data-task generation framework for auxiliary learning (DTG-AuxL) is proposed.
Method: DTG-AuxL contains a joint generator and a bi-level optimization strategy. The joint generator consists of a feature generator and a label generator, designed to be applicable and expressive for various auxiliary learning scenarios. The bi-level strategy optimizes the joint generator in the upper level via the implicit gradient from the primary loss and the explicit gradient of the proposed instance regularization, while the task learning model is optimized in the lower level on the generated data and tasks.
Results: Extensive experiments show that DTG-AuxL consistently outperforms existing methods across auxiliary learning scenarios, particularly when the manually collected auxiliary data and tasks are unhelpful.

Current auxiliary learning methods mainly adopt the methodology of reweighing losses for the manually collected auxiliary data and tasks. However, these methods heavily rely on domain knowledge during data collection, which may be hardly available in reality. Therefore, current methods will become less effective and even do harm to the primary task when unhelpful auxiliary data and tasks are employed. To tackle the problem, we propose a joint data-task generation framework for auxiliary learning (DTG-AuxL), which can bring benefits to the primary task by generating the new auxiliary data and task in a joint manner. The proposed DTG-AuxL framework contains a joint generator and a bi-level optimization strategy. Specifically, the joint generator contains a feature generator and a label generator, which are designed to be applicable and expressive for various auxiliary learning scenarios. The bi-level optimization strategy optimizes the joint generator and the task learning model, where the joint generator is effectively optimized in the upper level via the implicit gradient from the primary loss and the explicit gradient of our proposed instance regularization, while the task learning model is optimized in the lower level by the generated data and task. Extensive experiments show that our proposed DTG-AuxL framework consistently outperforms existing methods in various auxiliary learning scenarios, particularly when the manually collected auxiliary data and tasks are unhelpful.

Domain Adaptive Imitation Learning with Visual Observation
Sungho Choi Seungyul Han Woojun Kim Jongseong Chae Whiyoung Jung Youngchul Sung



Research question: How to perform domain-adaptive imitation learning with visual observation, where an agent in a target domain learns to perform a task by observing expert demonstrations in a source domain.
Motivation: In practice, robots receiving visual sensory data need to mimic movements by observing other robots from different angles or by observing robots of different shapes, which calls for domain-adaptive imitation learning.
Method: A novel framework, based on dual feature extraction and image reconstruction, extracts domain-independent behavioral features from input observations to train the learner.
Results: Empirical results show that the approach outperforms previous algorithms for imitation learning from visual observation with domain shift.

In this paper, we consider domain-adaptive imitation learning with visual observation, where an agent in a target domain learns to perform a task by observing expert demonstrations in a source domain. Domain adaptive imitation learning arises in practical scenarios where a robot, receiving visual sensory data, needs to mimic movements by visually observing other robots from different angles or observing robots of different shapes. To overcome the domain shift in cross-domain imitation learning with visual observation, we propose a novel framework for extracting domain-independent behavioral features from input observations that can be used to train the learner, based on dual feature extraction and image reconstruction. Empirical results demonstrate that our approach outperforms previous algorithms for imitation learning from visual observation with domain shift.

Navigating the Pitfalls of Active Learning Evaluation: A Systematic Framework for Meaningful Performance Assessment
Carsten Tim Lüth Till J. Bungert Lukas Klein Paul F Jaeger



Research question: Current active learning (AL) research presents contradictory results and lacks systematic, realistic evaluation, leaving practitioners uncertain about whether and how to use AL in their tasks.
Motivation: To resolve this, the paper proposes an evaluation framework and conducts a large-scale empirical study.
Method: Five key pitfalls in the current literature are identified, and an evaluation framework that overcomes them is designed; a large-scale image classification study spans various datasets, query methods, AL settings, and training paradigms.
Results: The empirical results clarify the inconsistent picture in the literature and yield hands-on recommendations for practitioners.

Active Learning (AL) aims to reduce the labeling burden by interactively selecting the most informative samples from a pool of unlabeled data. While there has been extensive research on improving AL query methods in recent years, some studies have questioned the effectiveness of AL compared to emerging paradigms such as semi-supervised (Semi-SL) and self-supervised learning (Self-SL), or a simple optimization of classifier configurations. Thus, today’s AL literature presents an inconsistent and contradictory landscape, leaving practitioners uncertain about whether and how to use AL in their tasks. In this work, we make the case that this inconsistency arises from a lack of systematic and realistic evaluation of AL methods. Specifically, we identify five key pitfalls in the current literature that reflect the delicate considerations required for AL evaluation. Further, we present an evaluation framework that overcomes these pitfalls and thus enables meaningful statements about the performance of AL methods. To demonstrate the relevance of our protocol, we present a large-scale empirical study and benchmark for image classification spanning various data sets, query methods, AL settings, and training paradigms. Our findings clarify the inconsistent picture in the literature and enable us to give hands-on recommendations for practitioners. The benchmark is hosted at https://github.com/IML-DKFZ/realistic-al.

MADG: Margin-based Adversarial Learning for Domain Generalization
Aveen Dayal Vimal K B Linga Reddy Cenkeramaddi C Krishna Mohan Abhinav Kumar Vineeth N. Balasubramanian



Research question: This paper addresses the challenge of domain shift in deep learning, i.e., how to make models perform well on target domains unseen during training.
Motivation: Existing adversarial domain generalization methods mostly use the 0-1-loss-based $\mathcal{H}\Delta\mathcal{H}$ divergence, whereas a margin-loss-based discrepancy metric is more informative, tighter, practical, and efficiently optimizable.
Method: The paper proposes $\textbf{MADG}$, a novel adversarial-learning domain generalization algorithm that uses a margin-loss-based discrepancy metric to learn domain-invariant features across all source domains and applies adversarial training to generalize well to unseen target domains.
Results: $\textbf{MADG}$ is evaluated extensively on the popular real-world domain generalization datasets VLCS, PACS, OfficeHome, DomainNet, and TerraIncognita, and shows consistent performance across all datasets on the DomainBed benchmark.

Domain Generalization (DG) techniques have emerged as a popular approach to address the challenges of domain shift in Deep Learning (DL), with the goal of generalizing well to the target domain unseen during the training. In recent years, numerous methods have been proposed to address the DG setting, among which one popular approach is the adversarial learning-based methodology. The main idea behind adversarial DG methods is to learn domain-invariant features by minimizing a discrepancy metric. However, most adversarial DG methods use 0-1 loss based $\mathcal{H}\Delta\mathcal{H}$ divergence metric. In contrast, the margin loss-based discrepancy metric has the following advantages: more informative, tighter, practical, and efficiently optimizable. To mitigate this gap, this work proposes a novel adversarial learning DG algorithm, $\textbf{MADG}$, motivated by a margin loss-based discrepancy metric. The proposed $\textbf{MADG}$ model learns domain-invariant features across all source domains and uses adversarial training to generalize well to the unseen target domain. We also provide a theoretical analysis of the proposed $\textbf{MADG}$ model based on the unseen target error bound. Specifically, we construct the link between the source and unseen domains in the real-valued hypothesis space and derive the generalization bound using margin loss and Rademacher complexity. We extensively experiment with the $\textbf{MADG}$ model on popular real-world DG datasets, VLCS, PACS, OfficeHome, DomainNet, and TerraIncognita. We evaluate the proposed algorithm on DomainBed's benchmark and observe consistent performance across all the datasets.

Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation
Haoran Chen Xintong Han Zuxuan Wu Yu-Gang Jiang



Research question: Most existing unsupervised domain adaptation (UDA) methods rely on a shared network to extract domain-invariant features; with multiple source domains, optimizing such a network requires updating the parameters of the entire network, which is computationally expensive and challenging.
Motivation: Inspired by recent advances in prompt learning, the work introduces Multi-Prompt Alignment (MPA), a simple yet efficient framework for multi-source UDA.
Method: For each source-target domain pair, MPA first trains an individual prompt to minimize the domain gap via a contrastive loss; it then denoises the learned prompts through an auto-encoding process and aligns them by maximizing the agreement of all reconstructed prompts. The subspace acquired from the auto-encoding process also generalizes easily to a streamlined set of target domains, making the method more practical.
Results: Extensive experiments show that MPA achieves state-of-the-art results on three popular datasets, with an impressive average accuracy of 54.1% on DomainNet.

Most existing methods for unsupervised domain adaptation (UDA) rely on a shared network to extract domain-invariant features. However, when facing multiple source domains, optimizing such a network involves updating the parameters of the entire network, making it both computationally expensive and challenging, particularly when coupled with min-max objectives. Inspired by recent advances in prompt learning that adapts high-capacity models for downstream tasks in a computationally economic way, we introduce Multi-Prompt Alignment (MPA), a simple yet efficient framework for multi-source UDA. Given a source and target domain pair, MPA first trains an individual prompt to minimize the domain gap through a contrastive loss. Then, MPA denoises the learned prompts through an auto-encoding process and aligns them by maximizing the agreement of all the reconstructed prompts. Moreover, we show that the resulting subspace acquired from the auto-encoding process can easily generalize to a streamlined set of target domains, making our method more efficient for practical usage. Extensive experiments show that MPA achieves state-of-the-art results on three popular datasets with an impressive average accuracy of 54.1% on DomainNet.

Implicit Contrastive Representation Learning with Guided Stop-gradient
Byeongchan Lee Sehyun Lee



Research question: Addressing the tendency of Siamese networks to collapse in self-supervised learning.
Motivation: Existing contrastive methods are not robust to a reduction in the number of negative samples, while positive-only algorithms prevent collapse through asymmetric network architectures.
Method: A new method, guided stop-gradient, implicitly incorporates the idea of contrastive learning by exploiting the asymmetric architecture; it is applied to the benchmark algorithms SimSiam and BYOL.
Results: The method stabilizes training and boosts performance, works with small batch sizes, and prevents collapse even without a predictor.

In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. However, they are prone to collapsing into a degenerate solution. To address this issue, contrastive learning uses a contrastive loss that prevents collapse by pushing representations of negative pairs away from each other. However, algorithms with negative sampling are known not to be robust to a reduction in the number of negative samples. There are, on the other hand, algorithms that do not use negative pairs. Many positive-only algorithms adopt an asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting this asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method, guided stop-gradient. We apply our method to the benchmark algorithms SimSiam and BYOL and show that it stabilizes training and boosts performance. We also show that algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available in the supplementary material.
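
A minimal PyTorch sketch of the symmetrized stop-gradient loss that SimSiam-style methods build on may help; the guidance rule that decides how gradients flow in the authors' method is not reproduced here, and the toy `encoder`/`predictor` modules and the input width 32 are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def negative_cosine(p, z):
    # Negative cosine similarity with the gradient blocked on the target branch z.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

class SimSiamLike(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(32, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        z1, z2 = self.encoder(x1), self.encoder(x2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetrized loss; the stop-gradient on each target branch prevents collapse.
        return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

model = SimSiamLike()
loss = model(torch.randn(8, 32), torch.randn(8, 32))  # two augmented views
loss.backward()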

Knowledge Distillation Performs Partial Variance Reduction
Mher Safaryan Alexandra Peste Dan Alistarh



Research question: This paper studies the inner workings of knowledge distillation from an optimization perspective.
Motivation: Although knowledge distillation is a popular way to boost performance, the mechanics behind it are not fully understood.
Method: Using linear and deep linear models, knowledge distillation is interpreted as a novel form of stochastic variance reduction, and a detailed convergence analysis of the resulting dynamics is provided.
Results: Knowledge distillation acts as partial variance reduction: it can reduce stochastic gradient noise but may not eliminate it completely, depending on the properties of the teacher model. The analysis highlights the need for careful parametrization of distillation, in particular the weighting of the distillation loss, and is validated empirically on linear models and deep neural networks.

Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity, the underlying mechanics behind knowledge distillation (KD) are not yet fully understood. In this work, we shed new light on the inner workings of this method, by examining it from an optimization perspective. Specifically, we show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which hold under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of \emph{partial variance reduction}, which can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the ``teacher'' model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
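
The weighting the analysis emphasizes is the distillation-loss coefficient. As a reference point, a standard KD objective (a common formulation, not necessarily the paper's exact parametrization) with weight `lam` and temperature `T` looks like this:

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, lam=0.5, T=2.0):
    # (1 - lam) * cross-entropy on true labels + lam * T^2 * KL(teacher || student).
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - lam) * ce + lam * kl

s, t = torch.randn(8, 10), torch.randn(8, 10)
print(kd_loss(s, t, torch.randint(0, 10, (8,))))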

Addressing Negative Transfer in Diffusion Models
Hyojun Go Jinyoung Kim Yunsung Lee Seunghyun Lee Shinhyeok Oh Hyeongdon Moon Seungtaek Choi



Research question: Diffusion models, viewed as multi-task learners, can suffer negative transfer, degrading performance on certain denoising tasks.
Motivation: Mitigate negative transfer in multi-task diffusion training to improve model performance.
Method: The denoising tasks are clustered and multi-task learning methods are applied to the clusters; interval clustering enforces temporal proximity among denoising tasks within each cluster.
Results: Experiments show the approach effectively improves the sample quality of diffusion models.

Diffusion-based generative models have achieved remarkable success in various domains. They train a shared model on denoising tasks that encompass different noise levels simultaneously, a form of multi-task learning (MTL). However, analyzing and improving diffusion models from an MTL perspective remains under-explored. In particular, MTL can sometimes lead to the well-known phenomenon of \textit{negative transfer}, which results in the performance degradation of certain tasks due to conflicts between tasks. In this paper, we first aim to analyze diffusion training from an MTL standpoint, presenting two key observations: \textbf{(O1)} the task affinity between denoising tasks diminishes as the gap between noise levels widens, and \textbf{(O2)} negative transfer can arise even in diffusion training. Building upon these observations, we aim to enhance diffusion training by mitigating negative transfer. To achieve this, we propose leveraging existing MTL methods, but the presence of a huge number of denoising tasks makes this computationally expensive to calculate the necessary per-task loss or gradient. To address this challenge, we propose clustering the denoising tasks into small task clusters and applying MTL methods to them. Specifically, based on \textbf{(O2)}, we employ interval clustering to enforce temporal proximity among denoising tasks within clusters. We show that interval clustering can be solved using dynamic programming, utilizing signal-to-noise ratio, timestep, and task affinity for clustering objectives. Through this, our approach addresses the issue of negative transfer in diffusion models by allowing for efficient computation of MTL methods. We validate the proposed clustering and its integration with MTL methods through various experiments, demonstrating improved sample quality of diffusion models. Our project page is available at https://gohyojun15.github.io/ANT_diffusion.
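
Interval clustering over timesteps is straightforward to sketch with dynamic programming. In the sketch below, `cost[i][j]` is assumed to score how well timesteps i..j sit in one cluster (here, the spread of a toy log-SNR curve); that choice is illustrative rather than the paper's exact objective.

import numpy as np

def interval_clustering(cost, k):
    # Partition timesteps 0..T-1 into k contiguous intervals minimizing total cost.
    T = len(cost)
    dp = np.full((k + 1, T + 1), np.inf)
    cut = np.zeros((k + 1, T + 1), dtype=int)
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, T + 1):
            for i in range(c - 1, j):  # i = number of timesteps in the first c-1 intervals
                v = dp[c - 1][i] + cost[i][j - 1]
                if v < dp[c][j]:
                    dp[c][j], cut[c][j] = v, i
    bounds, j = [], T               # backtrack the interval boundaries
    for c in range(k, 0, -1):
        i = cut[c][j]
        bounds.append((i, j - 1))   # inclusive interval of timesteps
        j = i
    return bounds[::-1]

T = 10
snr = np.linspace(5, -5, T)  # toy per-timestep log-SNR
cost = np.array([[snr[i:j + 1].var() if j >= i else 0.0 for j in range(T)] for i in range(T)])
print(interval_clustering(cost, k=3))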

Towards a Unified Framework of Contrastive Learning for Disentangled Representations
Stefan Matthes Zhiwei Han Hao Shen



Research question: How contrastive learning can discover and disentangle the explanatory factors of the data to obtain better representations.
Motivation: Contrastive learning is a promising approach for learning data representations that discover and disentangle explanatory factors.
Method: The paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, relaxing the assumptions about the data distribution, and proves identifiability of the true latent factors for four contrastive losses.
Results: The theoretical findings are validated on several benchmark datasets, and the practical limitations of these methods are investigated.

Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.

Convolution Monge Mapping Normalization for learning on sleep data
Theo Gnassounou Rémi Flamary Alexandre Gramfort



Research question: In many machine learning applications on signals and biomedical data, especially electroencephalography (EEG), a major challenge is the variability of the data across subjects, sessions, and hardware devices.
Motivation: To address this, a new method called Convolutional Monge Mapping Normalization (CMMN) is proposed, which filters signals so that their power spectral density (PSD) is adapted to a Wasserstein barycenter estimated on training data.
Method: CMMN relies on novel closed-form solutions for optimal transport mappings and barycenters, and provides individual test-time adaptation to new data without retraining the prediction model.
Results: Numerical experiments on sleep EEG data show significant and consistent performance gains, independent of the neural network architecture, when adapting across subjects, sessions, and even datasets collected with different hardware. Notably, the gains are on par with far more computationally intensive domain adaptation (DA) methods and can be combined with them for even better performance.

In many machine learning applications on signals and biomedical data, especially electroencephalogram (EEG), one major challenge is the variability of the data across subjects, sessions, and hardware devices. In this work, we propose a new method called Convolutional Monge Mapping Normalization ($\texttt{CMMN}$), which consists in filtering the signals in order to adapt their power spectrum density (PSD) to a Wasserstein barycenter estimated on training data. $\texttt{CMMN}$ relies on novel closed-form solutions for optimal transport mappings and barycenters and provides individual test time adaptation to new data without needing to retrain a prediction model. Numerical experiments on sleep EEG data show that $\texttt{CMMN}$ leads to significant and consistent performance gains independent from the neural network architecture when adapting between subjects, sessions, and even datasets collected with different hardware. Notably our performance gain is on par with much more numerically intensive Domain Adaptation (DA) methods and can be used in conjunction with those for even better performances.
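
A minimal numpy/scipy sketch of the spectral-mapping idea, under a stationary-Gaussian reading of the abstract: the barycenter PSD is taken as the squared average of the square-root PSDs, and the mapping is a linear filter with frequency response sqrt(psd_bar / psd_source). The function name, Welch settings, and filter construction are illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy import signal

def cmmn_filter(train_signals, x_new, nperseg=256):
    # Estimate per-signal PSDs and their (Gaussian-OT) barycenter.
    psds = np.stack([signal.welch(x, nperseg=nperseg)[1] for x in train_signals])
    psd_bar = np.mean(np.sqrt(psds), axis=0) ** 2
    psd_new = signal.welch(x_new, nperseg=nperseg)[1]
    # Linear filter mapping the new signal's spectrum onto the barycenter.
    h_freq = np.sqrt(psd_bar / np.maximum(psd_new, 1e-12))
    h = np.fft.fftshift(np.fft.irfft(h_freq))  # symmetric impulse response
    return np.convolve(x_new, h, mode="same")

rng = np.random.default_rng(0)
train = [rng.standard_normal(3000) for _ in range(5)]
adapted = rng.standard_normal(3000)
print(cmmn_filter(train, adapted).shape)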

Cross-modal Active Complementary Learning with Self-refining Correspondence
Yang Qin Yuan Sun Dezhong Peng Joey Tianyi Zhou Xi Peng Peng Hu



Research question: Image-text matching is attracting growing attention from academia and industry as a foundation for understanding the latent correspondence between visual and textual modalities, but most existing methods assume well-aligned training pairs and ignore ubiquitous annotation noise, i.e., noisy correspondence (NC), which inevitably degrades performance.
Motivation: Although some methods attempt to address such noise, they still face two challenges, excessive memorization/overfitting and unreliable correction of NC, especially under high noise.
Method: A generalized Cross-modal Robust Complementary Learning framework (CRCL) is proposed, which leverages a novel Active Complementary Loss (ACL) and an effective Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods.
Results: Experiments on three image-text benchmarks, Flickr30K, MS-COCO, and CC152K, verify the superior robustness of CRCL against synthetic and real-world noisy correspondences.

Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.

Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning
Siming Lan Rui Zhang Qi Yi Jiaming Guo Shaohui Peng Yunkai Gao Fan Wu Ruizhi Chen Zidong Du Xing Hu Xishan Zhang Ling Li Yunji Chen



Research question: In multi-task reinforcement learning, the modular principle is widely adopted to prevent performance degradation caused by conflicts between tasks, but most existing methods only combine shared modules at the task level, ignoring conflicts that may exist within a task.
Motivation: This paper proposes Contrastive Modules with Temporal Attention (CMTA), which addresses these limitations through contrastive learning and by combining shared modules at a finer granularity than the task level.
Method: CMTA constrains the modules to be different from each other via contrastive learning and uses temporal attention to combine shared modules at a finer granularity than the task level, alleviating negative transfer within tasks and improving the performance and generalization of multi-task RL.
Results: Experiments on the Meta-World multi-task RL benchmark show that CMTA outperforms learning each task individually for the first time and achieves substantial improvements over the baselines.

In the field of multi-task reinforcement learning, the modular principle, which involves specializing functionalities into different modules and combining them appropriately, has been widely adopted as a promising approach to prevent the negative transfer problem, i.e., performance degradation due to conflicts between tasks. However, most existing multi-task RL methods only combine shared modules at the task level, ignoring that there may be conflicts within a task. In addition, these methods do not take into account that, without constraints, some modules may learn similar functions, restricting the expressiveness and generalization capability of modular methods. In this paper, we propose the Contrastive Modules with Temporal Attention (CMTA) method to address these limitations. CMTA constrains the modules to be different from each other by contrastive learning and combines shared modules at a finer granularity than the task level with temporal attention, alleviating negative transfer within the task and improving the generalization ability and performance of multi-task RL. We conducted experiments on Meta-World, a multi-task RL benchmark containing various robotics manipulation tasks. Experimental results show that CMTA outperforms learning each task individually for the first time and achieves substantial performance improvements over the baselines.

Class-Distribution-Aware Pseudo-Labeling for Semi-Supervised Multi-Label Learning
Ming-Kun Xie Jia-Hao Xiao Hao-Zhe Liu Gang Niu Masashi Sugiyama Sheng-Jun Huang



Research question: In semi-supervised multi-label learning (SSMLL), conventional pseudo-labeling methods struggle with instances associated with multiple labels and an unknown label count.
Motivation: To overcome these challenges, this paper proposes Class-Aware Pseudo-labeling (CAP), which performs pseudo-labeling in a class-aware manner.
Method: The approach introduces a regularized learning framework with class-aware thresholds that effectively control the assignment of positive and negative pseudo-labels for each class; a class-distribution-aware thresholding strategy aligns the pseudo-label distribution with the true distribution.
Results: Experiments confirm that, even with a small proportion of labeled examples, the estimated class distribution serves as a reliable approximation, and validate the efficacy of CAP on SSMLL problems.

Pseudo-labeling has emerged as a popular and effective approach for utilizing unlabeled data. However, in the context of semi-supervised multi-label learning (SSMLL), conventional pseudo-labeling methods encounter difficulties when dealing with instances associated with multiple labels and an unknown label count. These limitations often result in the introduction of false positive labels or the neglect of true positive ones. To overcome these challenges, this paper proposes a novel solution called Class-Aware Pseudo-Labeling (CAP) that performs pseudo-labeling in a class-aware manner. The proposed approach introduces a regularized learning framework incorporating class-aware thresholds, which effectively control the assignment of positive and negative pseudo-labels for each class. Notably, even with a small proportion of labeled examples, our observations demonstrate that the estimated class distribution serves as a reliable approximation. Motivated by this finding, we develop a class-distribution-aware thresholding strategy to ensure the alignment of pseudo-label distribution with the true distribution. The correctness of the estimated class distribution is theoretically verified, and a generalization error bound is provided for our proposed method. Extensive experiments on multiple benchmark datasets confirm the efficacy of CAP in addressing the challenges of SSMLL problems.
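
One concrete reading of class-distribution-aware thresholding: pick each class's threshold as the score quantile that makes the positive pseudo-label rate match an estimated class prior. This is a sketch of the idea rather than the paper's regularized framework; `class_priors` is assumed to be estimated from the labeled subset.

import numpy as np

def class_distribution_thresholds(scores, class_priors):
    # scores: (n, C) predicted probabilities; class_priors: (C,) estimated positive rates.
    n, C = scores.shape
    thresholds = np.empty(C)
    for c in range(C):
        # The (1 - prior)-quantile marks off the top `prior` fraction of instances.
        thresholds[c] = np.quantile(scores[:, c], 1.0 - class_priors[c])
    return thresholds

scores = np.random.rand(1000, 5)
priors = np.array([0.30, 0.10, 0.05, 0.20, 0.15])
tau = class_distribution_thresholds(scores, priors)
pseudo_labels = (scores >= tau).astype(int)  # per-class positive rates ~ priors
print(pseudo_labels.mean(axis=0))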

Context Shift Reduction for Offline Meta-Reinforcement Learning
Yunkai Gao Rui Zhang Jiaming Guo Fan Wu Qi Yi Shaohui Peng Siming Lan Ruizhi Chen Zidong Du Xing Hu Qi Guo Ling Li Yunji Chen



Research question: This paper addresses the context shift problem in offline meta-reinforcement learning (OMRL), caused by the distribution mismatch between training contexts (from the behavior policy) and test contexts (from the exploration policy).
Motivation: Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information, leading to incorrect task inference and further degrading the generalization ability of the meta-policy.
Method: A novel approach, Context Shift Reduction for OMRL (CSRO), addresses context shift using only offline datasets. Its key insight is to minimize the influence of the policy on the context during both the meta-training and meta-test phases.
Results: Experiments show that CSRO significantly reduces context shift and improves generalization, surpassing previous methods across various challenging domains.

Offline meta-reinforcement learning (OMRL) utilizes pre-collected offline datasets to enhance the agent's generalization ability on unseen tasks. However, the context shift problem arises due to the distribution discrepancy between the contexts used for training (from the behavior policy) and testing (from the exploration policy). The context shift problem leads to incorrect task inference and further deteriorates the generalization ability of the meta-policy. Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information. In this paper, we propose a novel approach called Context Shift Reduction for OMRL (CSRO) to address the context shift problem with only offline datasets. The key insight of CSRO is to minimize the influence of policy in context during both the meta-training and meta-test phases. During meta-training, we design a max-min mutual information representation learning mechanism to diminish the impact of the behavior policy on task representation. In the meta-test phase, we introduce the non-prior context collection strategy to reduce the effect of the exploration policy. Experimental results demonstrate that CSRO significantly reduces the context shift and improves the generalization ability, surpassing previous methods across various challenging domains.

InstanT: Semi-supervised Learning with Instance-dependent Thresholds
Muyang Li Runze Wu Haoyu Liu Jun Yu Xun Yang Bo Han Tongliang Liu



Research question: Semi-supervised learning (SSL) has been a fundamental challenge in machine learning; selecting informative unlabeled instances to pseudo-label and add to the training set is key.
Motivation: Current SSL methods typically apply the same threshold to all samples or class-dependent thresholds to instances of a given class, neglecting instance-level information.
Method: This paper studies instance-dependent thresholds, which have the highest degree of freedom among existing schemes: a novel threshold function for all unlabeled instances is devised from their instance-level ambiguity and the instance-dependent error rates of pseudo-labels.
Results: The instance-dependent threshold function is shown to provide a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns.

Semi-supervised learning (SSL) has been a fundamental challenge in machine learning for decades. The primary family of SSL algorithms, known as pseudo-labeling, involves assigning pseudo-labels to confident unlabeled instances and incorporating them into the training set. Therefore, the selection criteria of confident instances are crucial to the success of SSL. Recently, there has been growing interest in the development of SSL methods that use dynamic or adaptive thresholds. Yet, these methods typically apply the same threshold to all samples, or use class-dependent thresholds for instances belonging to a certain class, while neglecting instance-level information. In this paper, we propose the study of instance-dependent thresholds, which has the highest degree of freedom compared with existing methods. Specifically, we devise a novel instance-dependent threshold function for all unlabeled instances by utilizing their instance-level ambiguity and the instance-dependent error rates of pseudo-labels, so instances that are more likely to have incorrect pseudo-labels will have higher thresholds. Furthermore, we demonstrate that our instance-dependent threshold function provides a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns.

How to Select Which Active Learning Strategy is Best Suited for Your Specific Problem and Budget
Guy Hacohen Daphna Weinshall



Research question: In active learning, determining the most appropriate query strategy for a given situation remains an open problem.
Motivation: Distinct query strategies are better suited to different conditions and budget constraints, so a method is needed that can dynamically identify the best strategy.
Method: A derivative-based method is proposed that dynamically identifies the best active learning strategy for a given budget.
Results: Theoretical analysis and empirical results demonstrate the effectiveness of the method across diverse budgets and computer vision tasks.

In the domain of Active Learning (AL), a learner actively selects which unlabeled examples to seek labels from an oracle, while operating within predefined budget constraints. Importantly, it has been recently shown that distinct query strategies are better suited for different conditions and budgetary constraints. In practice, the determination of the most appropriate AL strategy for a given situation remains an open problem. To tackle this challenge, we propose a practical derivative-based method that dynamically identifies the best strategy for a given budget. Intuitive motivation for our approach is provided by the theoretical analysis of a simplified scenario. We then introduce a method to dynamically select an AL strategy, which takes into account the unique characteristics of the problem and the available budget. Empirical results showcase the effectiveness of our approach across diverse budgets and computer vision tasks.
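
A minimal reading of the derivative-based selector: track each strategy's accuracy as the budget grows and pick the one with the steepest current slope. The finite-difference estimator and switching rule below are illustrative assumptions; the paper's estimator and schedule may differ.

import numpy as np

def pick_strategy(acc_history, budgets):
    # acc_history: dict name -> accuracies measured at the spent budgets.
    slopes = {}
    for name, accs in acc_history.items():
        # Finite-difference estimate of d(accuracy)/d(budget) at the latest budget.
        slopes[name] = (accs[-1] - accs[-2]) / (budgets[-1] - budgets[-2])
    return max(slopes, key=slopes.get)

history = {"uncertainty": [0.61, 0.66, 0.70], "random": [0.63, 0.65, 0.66]}
print(pick_strategy(history, budgets=[100, 200, 300]))  # -> "uncertainty"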

Few-Shot Class-Incremental Learning via Training-Free Prototype Calibration
Qi-Wei Wang Da-Wei Zhou Yi-Kai Zhang De-Chuan Zhan Han-Jia Ye



Research question: In the real world, new classes continually appear with few labeled samples, requiring machine learning models to incrementally learn new classes while maintaining knowledge of base classes.
Motivation: Existing Few-Shot Class-Incremental Learning (FSCIL) methods tend to misclassify samples of new classes into base classes, leading to poor performance on the new classes.
Method: A simple yet effective Training-frEE calibratioN (TEEN) strategy is proposed, which enhances the separability of new classes by fusing the new prototypes (i.e., the mean features of a class) with weighted base prototypes.
Results: Experiments show that TEEN not only performs well on standard FSCIL benchmarks but also shows significant and consistent improvements over baselines in the few-shot learning scenario.

Real-world scenarios are usually accompanied by continuously appearing classes with scarce labeled samples, which require the machine learning model to incrementally learn new classes and maintain the knowledge of base classes. In this Few-Shot Class-Incremental Learning (FSCIL) scenario, existing methods either introduce extra learnable components or rely on a frozen feature extractor to mitigate catastrophic forgetting and overfitting problems. However, we find a tendency for existing methods to misclassify the samples of new classes into base classes, which leads to the poor performance of new classes. In other words, the strong discriminability of base classes distracts the classification of new classes. To figure out this intriguing phenomenon, we observe that although the feature extractor is only trained on base classes, it can surprisingly represent the *semantic similarity* between the base and *unseen* new classes. Building upon these analyses, we propose a *simple yet effective* Training-frEE calibratioN (TEEN) strategy to enhance the discriminability of new classes by fusing the new prototypes (i.e., mean features of a class) with weighted base prototypes. In addition to standard benchmarks in FSCIL, TEEN demonstrates remarkable performance and consistent improvements over baseline methods in the few-shot learning scenario. Code is available at: https://github.com/wangkiw/TEEN
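
A sketch of training-free prototype calibration as the abstract describes it: fuse a new-class prototype with base prototypes weighted by semantic similarity. The cosine-softmax weighting and the `alpha`/`tau` values are assumptions for illustration, not the paper's exact recipe.

import numpy as np

def softmax(x, tau=1.0):
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def calibrate_prototype(new_proto, base_protos, alpha=0.5, tau=0.1):
    # Weight each base prototype by its cosine similarity to the new prototype,
    # then blend the weighted base mixture into the new prototype.
    n = new_proto / np.linalg.norm(new_proto)
    b = base_protos / np.linalg.norm(base_protos, axis=1, keepdims=True)
    w = softmax(b @ n, tau)  # semantic-similarity weights over base classes
    return (1 - alpha) * new_proto + alpha * (w @ base_protos)

base = np.random.randn(10, 64)   # 10 base-class prototypes
new = np.random.randn(64)        # mean feature of a few-shot new class
print(calibrate_prototype(new, base).shape)  # (64,)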

PPi: Pretraining Brain Signal Model for Patient-independent Seizure Detection
Zhizhang Yuan Daoze Zhang Yang Yang Junru Chen Yafeng Li



Research question: How to perform automated seizure detection effectively for epilepsy diagnosis and treatment.
Motivation: The emerging stereoelectroencephalography (SEEG) method provides detailed and stereoscopic brainwave information, but modeling SEEG in clinical scenarios faces challenges such as huge domain shift between patients and dramatic pattern evolution across brain areas.
Method: A Pretraining-based model for Patient-independent seizure detection (PPi) is proposed. Two novel self-supervised tasks extract rich information from abundant SEEG data while preserving the unique characteristics of brain signals recorded from different brain areas; two techniques, channel background subtraction and brain region enhancement, are then proposed to effectively tackle the domain shift problem.
Results: Extensive experiments show that PPi outperforms state-of-the-art baselines on two public datasets and a self-collected real-world clinical dataset, demonstrating its effectiveness and practicability; visualization analysis illustrates the rationality of the two domain generalization techniques.

Automated seizure detection is of great importance to epilepsy diagnosis and treatment. An emerging method used in seizure detection, stereoelectroencephalography (SEEG), can provide detailed and stereoscopic brainwave information. However, modeling SEEG in clinical scenarios will face challenges like huge domain shift between different patients and dramatic pattern evolution among different brain areas. In this study, we propose a Pretraining-based model for Patient-independent seizure detection (PPi) to address these challenges. Firstly, we design two novel self-supervised tasks which can extract rich information from abundant SEEG data while preserving the unique characteristics between brain signals recorded from different brain areas. Then two techniques channel background subtraction and brain region enhancement are proposed to effectively tackle the domain shift problem. Extensive experiments show that PPi outperforms the SOTA baselines on two public datasets and a real-world clinical dataset collected by ourselves, which demonstrates the effectiveness and practicability of PPi. Finally, visualization analysis illustrates the rationality of the two domain generalization techniques.

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
Yuchao Gu Xintao Wang Jay Zhangjie Wu Yujun Shi Yunpeng Chen Zihan Fan WUYOU XIAO Rui Zhao Shuning Chang Weijia Wu Yixiao Ge Ying Shan Mike Zheng Shou



Research question: How to effectively use large-scale text-to-image diffusion models for decentralized multi-concept customization.
Motivation: Existing single-client low-rank adaptations (LoRAs) suffer from concept conflicts and identity loss when handling multiple concepts, calling for a new approach that resolves these issues.
Method: A new framework, Mix-of-Show, adopts embedding-decomposed LoRA (ED-LoRA) for single-client tuning and gradient fusion at the center node, preserving the in-domain essence of single concepts and supporting theoretically limitless concept fusion.
Results: Experiments show that Mix-of-Show composes multiple customized concepts with high fidelity, including characters, objects, and scenes.

Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community. These models can be easily customized for new concepts using low-rank adaptations (LoRAs). However, the utilization of multiple-concept LoRAs to jointly support multiple customized concepts presents a challenge. We refer to this scenario as decentralized multi-concept customization, which involves single-client concept tuning and center-node concept fusion. In this paper, we propose a new framework called Mix-of-Show that addresses the challenges of decentralized multi-concept customization, including concept conflicts resulting from existing single-client LoRA tuning and identity loss during model fusion. Mix-of-Show adopts an embedding-decomposed LoRA (ED-LoRA) for single-client tuning and gradient fusion for the center node to preserve the in-domain essence of single concepts and support theoretically limitless concept fusion. Additionally, we introduce regionally controllable sampling, which extends spatially controllable sampling (e.g., ControlNet and T2I-Adapter) to address attribute binding and missing object problems in multi-concept sampling. Extensive experiments demonstrate that Mix-of-Show is capable of composing multiple customized concepts with high fidelity, including characters, objects, and scenes.

A Unified Framework for Rank-based Loss Minimization
Rufeng Xiao Yuze Ge Rujun Jiang Yifan Yan



Research question: This paper proposes a unified framework for optimizing rank-based losses to meet the diverse performance requirements of machine learning models.
Motivation: Although the average loss is widely used for training machine learning models, rank-based losses are increasingly prevalent, replacing the empirical loss in many cases.
Method: A unified framework for optimizing rank-based losses is developed using a proximal alternating direction method of multipliers.
Results: The algorithm is shown to converge, with a convergence rate, under mild conditions, and experiments on synthetic and real datasets demonstrate its effectiveness and efficiency.

The empirical loss, commonly referred to as the average loss, is extensively utilized for training machine learning models. However, in order to address the diverse performance requirements of machine learning models, the use of the rank-based loss is prevalent, replacing the empirical loss in many cases. The rank-based loss comprises a weighted sum of sorted individual losses, encompassing both convex losses like the spectral risk, which includes the empirical risk and conditional value-at-risk, and nonconvex losses such as the human-aligned risk and the sum of the ranked range loss. In this paper, we introduce a unified framework for the optimization of the rank-based loss through the utilization of a proximal alternating direction method of multipliers. We demonstrate the convergence and convergence rate of the proposed algorithm under mild conditions. Experiments conducted on synthetic and real datasets illustrate the effectiveness and efficiency of the proposed algorithm.
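
The rank-based losses in question are weighted sums of sorted per-sample losses; for instance, CVaR places uniform weight on the worst tail. A small numpy illustration of evaluating such a loss (the optimization itself, via proximal ADMM, is not shown):

import numpy as np

def rank_based_loss(losses, weights):
    # Weighted sum of sorted individual losses; nondecreasing weights give a
    # spectral risk (empirical risk and CVaR are special cases).
    return np.sort(losses) @ weights

def cvar_weights(n, q=0.9):
    # Weights realizing CVaR_q: uniform mass on the largest (1 - q) * n losses.
    k = int(np.ceil((1 - q) * n))
    w = np.zeros(n)
    w[-k:] = 1.0 / k
    return w

losses = np.random.rand(100)
print(rank_based_loss(losses, cvar_weights(100, q=0.9)))  # mean of the worst 10%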

Active Learning-Based Species Range Estimation
Christian Lange Elijah Cole Grant Van Horn Oisin Mac Aodha



Research question: How to efficiently estimate the geographic range of a species from a limited number of on-the-ground observations.
Motivation: To address the inaccuracy of traditional range estimates when field observations are scarce.
Method: A new active learning approach models the range of an unmapped species as a weighted combination of estimated ranges of different species. The candidate set of ranges is generated by models trained on large, weakly supervised, community-collected observation data; a new active querying method then sequentially selects geographic locations to visit that best reduce the uncertainty over the unmapped species' range.
Results: A detailed evaluation against existing active learning methods, on a dataset of expert-derived ranges for one thousand species, shows that the method outperforms the alternatives and approaches the performance of end-to-end trained models even when using only a fraction of the data. This highlights the utility of active learning via transfer-learned spatial representations for species range estimation, and the value of emerging large-scale crowdsourced datasets, not only for modeling species' ranges but also for actively discovering them.

We propose a new active learning approach for efficiently estimating the geographic range of a species from a limited number of on-the-ground observations. We model the range of an unmapped species of interest as the weighted combination of estimated ranges obtained from a set of different species. We show that it is possible to generate this candidate set of ranges by using models that have been trained on large weakly supervised community collected observation data. From this, we develop a new active querying approach that sequentially selects geographic locations to visit that best reduce our uncertainty over an unmapped species’ range. We conduct a detailed evaluation of our approach and compare it to existing active learning methods using an evaluation dataset containing expert-derived ranges for one thousand species. Our results demonstrate that our method outperforms alternative active learning methods and approaches the performance of end-to-end trained models, even when only using a fraction of the data. This highlights the utility of active learning via transfer learned spatial representations for species range estimation. It also emphasizes the value of leveraging emerging large-scale crowdsourced datasets, not only for modeling species' ranges, but also for actively discovering them.

Removing Hidden Confounding in Recommendation: A Unified Multi-Task Learning Approach
Haoxuan Li Kunhan Wu Chunyuan Zheng Yanghao Xiao Hao Wang Zhi Geng Fuli Feng Xiangnan He Peng Wu



Research question: The training data collected in recommender systems is subject to selection bias, which poses a great challenge for unbiased learning.
Motivation: Previous studies proposed debiasing methods based on observed user and item features, but ignored the effect of hidden confounding.
Method: The paper first performs a theoretical analysis showing that previous approaches, including propensity-based, multi-task learning, and bi-level optimization methods, may fail to achieve unbiased learning when hidden confounding is present. It then proposes a unified multi-task learning approach to remove hidden confounding, which uses a few unbiased ratings to calibrate the nominal propensities and nominal error imputations learned from biased data.
Results: Extensive experiments on three publicly available benchmark datasets, including a fully exposed large-scale industrial dataset, validate the effectiveness of the proposed methods in removing hidden confounding.

In recommender systems, the collected data used for training is always subject to selection bias, which poses a great challenge for unbiased learning. Previous studies proposed various debiasing methods based on observed user and item features, but ignored the effect of hidden confounding. To address this problem, recent works suggest the use of sensitivity analysis for worst-case control of the unknown true propensity, but only valid when the true propensity is near to the nominal propensity within a finite bound. In this paper, we first perform theoretical analysis to reveal the possible failure of previous approaches, including propensity-based, multi-task learning, and bi-level optimization methods, in achieving unbiased learning when hidden confounding is present. Then, we propose a unified multi-task learning approach to remove hidden confounding, which uses a few unbiased ratings to calibrate the learned nominal propensities and nominal error imputations from biased data. We conduct extensive experiments on three publicly available benchmark datasets containing a fully exposed large-scale industrial dataset, validating the effectiveness of the proposed methods in removing hidden confounding.

Task-Robust Pre-Training for Worst-Case Downstream Adaptation
Jianghui Wang Yang Chen Xingyu Xie Cong Fang Zhouchen Lin



Research question: How to pre-train a model that performs uniformly well across a series of related downstream tasks.
Motivation: Current pre-trained models may not behave consistently across a series of related downstream tasks.
Method: The upstream task is separated into several representative sub-tasks and a simple minimax loss is applied for pre-training; an efficient algorithm is designed to solve the minimax loss, with a convergence proof in the convex setting.
Results: Experiments on large-scale natural language processing and computer vision datasets show that the method improves worst-case downstream-task metrics.

Pre-training has achieved remarkable success when transferred to downstream tasks. In machine learning, we care about not only the good performance of a model but also its behavior under reasonable shifts of conditions. The same philosophy holds when pre-training a foundation model. However, the foundation model may not behave uniformly well on a series of related downstream tasks. This happens, for example, in masked-recovery pre-training when the required abilities or training instances diverge: pattern features are extracted dominantly during pre-training, but semantic features are also required by a downstream task. This paper considers pre-training a model that guarantees uniformly good performance over the downstream tasks. We call this goal *downstream-task robustness*. Our method first separates the upstream task into several representative ones and applies a simple minimax loss for pre-training. We then design an efficient algorithm to solve the minimax loss and prove its convergence in the convex setting. In the experiments, we show on both large-scale natural language processing and computer vision datasets that our method increases the metrics on worst-case downstream tasks. Additionally, some theoretical explanations for why our loss is beneficial are provided. Specifically, we show that in some cases fewer samples are inherently required for the most challenging downstream task.
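
A minimal sketch of one minimax step over representative sub-tasks: the task-weight vector chases the worst-case task while the model descends the weighted loss. Here `model`, `task_batches`, `loss_fn`, and `opt` are placeholders, and the exponentiated-gradient inner update is one standard way to realize the max player, which may differ from the authors' algorithm.

import torch

def minimax_step(model, task_batches, loss_fn, opt, w, eta=0.1):
    # One per-task loss per representative sub-task.
    losses = torch.stack([loss_fn(model(x), y) for x, y in task_batches])
    # Inner max: exponentiated-gradient ascent on the task weights (a simplex point).
    w = w * torch.exp(eta * losses.detach())
    w = w / w.sum()
    # Outer min: gradient descent on the weighted loss.
    opt.zero_grad()
    (w * losses).sum().backward()
    opt.step()
    return w

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(3)]
w = torch.ones(3) / 3
w = minimax_step(model, batches, torch.nn.functional.mse_loss, opt, w)
print(w)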

Unleashing the Power of Graph Data Augmentation on Covariate Distribution Shift
Yongduo Sui Qitian Wu Jiancan Wu Qing Cui Longfei Li JUN ZHOU Xiang Wang Xiangnan He



Research question: Distribution shift is an increasingly critical concern in graph representation learning.
Motivation: Existing strategies such as invariant learning and graph augmentation have limitations in handling the covariate shift problem.
Method: A data augmentation strategy, Adversarial Invariant Augmentation (AIA), handles covariate shift on graphs by extrapolating and generating new environments from the training data while preserving the original stable features during augmentation.
Results: Extensive experiments with in-depth empirical analysis demonstrate the superiority of the approach.

The issue of distribution shifts is emerging as a critical concern in graph representation learning. From the perspective of invariant learning and stable learning, a recently well-established paradigm for out-of-distribution generalization, stable features of the graph are assumed to causally determine labels, while environmental features tend to be unstable and can lead to the two primary types of distribution shifts. The correlation shift is often caused by the spurious correlation between environmental features and labels that differs between the training and test data; the covariate shift often stems from the presence of new environmental features in test data. However, most strategies, such as invariant learning or graph augmentation, typically struggle with limited training environments or perturbed stable features, thus exposing limitations in handling the problem of covariate shift. To address this challenge, we propose a simple-yet-effective data augmentation strategy, Adversarial Invariant Augmentation (AIA), to handle the covariate shift on graphs. Specifically, given the training data, AIA aims to extrapolate and generate new environments, while concurrently preserving the original stable features during the augmentation process. Such a design equips the graph classification model with an enhanced capability to identify stable features in new environments, thereby effectively tackling the covariate shift in data. Extensive experiments with in-depth empirical analysis demonstrate the superiority of our approach. The implementation codes are publicly available at https://github.com/yongduosui/AIA.

Eliminating Catastrophic Overfitting Via Abnormal Adversarial Examples Regularization
Runqi Lin Chaojian Yu Tongliang Liu



Research question: Single-step adversarial training (SSAT) suffers from catastrophic overfitting (CO), leaving the classifier vulnerable to multi-step adversarial attacks.
Motivation: Among the adversarial examples generated on SSAT-trained networks, some abnormal adversarial examples (AAEs) exhibit decreasing loss during training, a phenomenon closely tied to classifier distortion.
Method: A new method, Abnormal Adversarial Examples Regularization (AAER), explicitly regularizes the variation of AAEs to hinder the classifier from becoming distorted, thereby eliminating CO.
Results: Experiments show that the method effectively eliminates CO and further boosts adversarial robustness with negligible additional computational overhead.

Single-step adversarial training (SSAT) has demonstrated the potential to achieve both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO), a phenomenon that leads to a severely distorted classifier, making it vulnerable to multi-step adversarial attacks. In this work, we observe that some adversarial examples generated on the SSAT-trained network exhibit anomalous behaviour, that is, although these training samples are generated by the inner maximization process, their associated loss decreases instead, which we named abnormal adversarial examples (AAEs). Upon further analysis, we discover a close relationship between AAEs and classifier distortion, as both the number and outputs of AAEs undergo a significant variation with the onset of CO. Given this observation, we re-examine the SSAT process and uncover that before the occurrence of CO, the classifier already displayed a slight distortion, indicated by the presence of few AAEs. Furthermore, the classifier directly optimizing these AAEs will accelerate its distortion, and correspondingly, the variation of AAEs will sharply increase as a result. In such a vicious circle, the classifier rapidly becomes highly distorted and manifests as CO within a few iterations. These observations motivate us to eliminate CO by hindering the generation of AAEs. Specifically, we design a novel method, termed Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the variation of AAEs to hinder the classifier from becoming distorted. Extensive experiments demonstrate that our method can effectively eliminate CO and further boost adversarial robustness with negligible additional computational overhead. Our implementation can be found at https://github.com/tmllab/2023_NeurIPS_AAER.

Balance, Imbalance, and Rebalance: Understanding Robust Overfitting from a Minimax Game Perspective
Yifei Wang Liangchen Li Jiansheng Yang Zhouchen Lin Yisen Wang



Research question: Adversarial training is arguably the state-of-the-art algorithm for extracting robust features, but it suffers from severe robust overfitting, especially after learning rate decay.
Motivation: The phenomenon is explained by viewing adversarial training as a dynamic minimax game between the model trainer and the attacker.
Method: The paper analyzes how learning rate decay breaks the balance of the minimax game and shows that this imbalance induces robust overfitting through memorization of non-robust features. It further proposes rebalancing the two players by regularizing the trainer's capacity or improving the attack strength.
Results: Extensive experiments validate this understanding and provide a holistic view of robust overfitting from the dynamics of the two game players. The proposed ReBalanced Adversarial Training (ReBAT) attains good robustness and does not suffer from robust overfitting even after very long training.

Adversarial Training (AT) has become arguably the state-of-the-art algorithm for extracting robust features. However, researchers recently notice that AT suffers from severe robust overfitting problems, particularly after learning rate (LR) decay. In this paper, we explain this phenomenon by viewing adversarial training as a dynamic minimax game between the model trainer and the attacker. Specifically, we analyze how LR decay breaks the balance between the minimax game by empowering the trainer with a stronger memorization ability, and show such imbalance induces robust overfitting as a result of memorizing non-robust features. We validate this understanding with extensive experiments, and provide a holistic view of robust overfitting from the dynamics of both the two game players. This understanding further inspires us to alleviate robust overfitting by rebalancing the two players by either regularizing the trainer's capacity or improving the attack strength. Experiments show that the proposed ReBalanced Adversarial Training (ReBAT) can attain good robustness and does not suffer from robust overfitting even after very long training. Code is available at https://github.com/PKU-ML/ReBAT.

Contextually Affinitive Neighborhood Refinery for Deep Clustering
Chunlin Yu Ye Shi Jingya Wang



Research question: Existing self-supervised methods are limited in grouping semantically similar instances: samples in a local neighborhood may be few and may not provide substantial and diverse supervision signals.
Motivation: Inspired by the versatile re-ranking methods in image retrieval, the work proposes an efficient online re-ranking process to mine more informative neighbors and encourages cross-view neighborhood consistency.
Method: A progressively relaxed boundary filtering strategy is introduced to mitigate the intrinsic neighborhood noise near cluster boundaries; the method can be easily integrated into generic self-supervised frameworks.
Results: Experiments show that the method outperforms state-of-the-art methods on several popular benchmarks.

Previous endeavors in self-supervised learning have enlightened the research of deep clustering from an instance discrimination perspective. Built upon this foundation, recent studies further highlight the importance of grouping semantically similar instances. One effective method to achieve this is by promoting the semantic structure preserved by neighborhood consistency. However, the samples in the local neighborhood may be limited due to their close proximity to each other, which may not provide substantial and diverse supervision signals. Inspired by the versatile re-ranking methods in the context of image retrieval, we propose to employ an efficient online re-ranking process to mine more informative neighbors in a Contextually Affinitive (ConAff) Neighborhood, and then encourage the cross-view neighborhood consistency. To further mitigate the intrinsic neighborhood noises near cluster boundaries, we propose a progressively relaxed boundary filtering strategy to circumvent the issues brought by noisy neighbors. Our method can be easily integrated into the generic self-supervised frameworks and outperforms the state-of-the-art methods on several popular benchmarks.

Reining Generalization in Offline Reinforcement Learning via Representation Distinction
Yi Ma Hongyao Tang Dong Li Zhaopeng Meng



Research question: This paper addresses distribution shift in offline reinforcement learning, where the value of out-of-distribution (OOD) data may be erroneously estimated due to the discrepancy between the dataset and the learned policy.
Motivation: A considerable portion of the benefit of the conservative terms designed by existing offline RL methods comes from their impact on the learned representation, prompting an attempt to improve offline RL from the representation perspective.
Method: Representation Distinction (RD), a plug-in method, improves offline RL algorithms by explicitly differentiating the representations of in-sample and OOD state-action pairs generated by the learning policy. Since the learning policy may mirror the behavior policy and similar samples may be erroneously distinguished, a dynamic adjustment mechanism based on an OOD data generator is suggested to prevent data representation collapse and further enhance policy performance.
Results: Applying RD to specially designed backbone algorithms and widely used offline RL algorithms yields significant improvements across various continuous control tasks on D4RL datasets, surpassing several state-of-the-art offline RL methods.

Offline Reinforcement Learning (RL) aims to address the challenge of distribution shift between the dataset and the learned policy, where the value of out-of-distribution (OOD) data may be erroneously estimated due to overgeneralization. It has been observed that a considerable portion of the benefits derived from the conservative terms designed by existing offline RL approaches originates from their impact on the learned representation. This observation prompts us to scrutinize the learning dynamics of offline RL, formalize the process of generalization, and delve into the prevalent overgeneralization issue in offline RL. We then investigate the potential to rein the generalization from the representation perspective to enhance offline RL. Finally, we present Representation Distinction (RD), an innovative plug-in method for improving offline RL algorithm performance by explicitly differentiating between the representations of in-sample and OOD state-action pairs generated by the learning policy. Considering scenarios in which the learning policy mirrors the behavioral policy and similar samples may be erroneously distinguished, we suggest a dynamic adjustment mechanism for RD based on an OOD data generator to prevent data representation collapse and further enhance policy performance. We demonstrate the efficacy of our approach by applying RD to specially-designed backbone algorithms and widely-used offline RL algorithms. The proposed RD method significantly improves their performance across various continuous control tasks on D4RL datasets, surpassing several state-of-the-art offline RL algorithms.

Evaluating Neuron Interpretation Methods of NLP Models
Yimin Fan Fahim Dalvi Nadir Durrani Hassan Sajjad



Research question: This paper addresses the lack of a comparison standard for existing neuron interpretation methods, to advance the study of how knowledge is structured within neural network models.
Motivation: Although many neuron interpretation methods have been proposed, the field lacks a comprehensive comparison among them; the absence of standardized metrics and benchmarks hampers progress.
Method: An evaluation framework based on voting theory is proposed, under the hypothesis that neurons consistently identified by different methods carry more significant information. The framework is rigorously assessed across a diverse array of neuron interpretation methods.
Results: Notable findings include: i) despite theoretical differences among the methods, neuron ranking methods share over 60% of their rankings when identifying salient neurons; ii) neuron interpretation methods are most sensitive to last-layer representations; iii) Probeless neuron ranking emerges as the most consistent method.

Neuron interpretation offers valuable insights into how knowledge is structured within a deep neural network model. While a number of neuron interpretation methods have been proposed in the literature, the field lacks a comprehensive comparison among these methods. This gap hampers progress due to the absence of standardized metrics and benchmarks. The commonly used evaluation metric has limitations, and creating ground truth annotations for neurons is impractical. Addressing these challenges, we propose an evaluation framework based on voting theory. Our hypothesis posits that neurons consistently identified by different methods carry more significant information. We rigorously assess our framework across a diverse array of neuron interpretation methods. Notable findings include: i) despite the theoretical differences among the methods, neuron ranking methods share over 60% of their rankings when identifying salient neurons, ii) the neuron interpretation methods are most sensitive to the last layer representations, iii) Probeless neuron ranking emerges as the most consistent method.
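
The voting idea can be made concrete with any classical aggregation rule; below, a Borda count over neuron rankings plus a top-k overlap measure of the kind behind the 60% finding. The choice of Borda is illustrative, since the abstract does not specify the exact rule used.

import numpy as np

def topk_overlap(rank_a, rank_b, k=100):
    # Fraction of shared neurons in the top-k of two saliency rankings.
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

def borda_vote(rankings):
    # Borda count: a neuron ranked r-th out of n gets n - r points from that method;
    # neurons that many methods rank highly accumulate the most points.
    n = len(rankings[0])
    scores = np.zeros(n)
    for r in rankings:
        for pos, neuron in enumerate(r):
            scores[neuron] += n - pos
    return np.argsort(-scores)  # consensus ranking, most salient first

rng = np.random.default_rng(0)
rankings = [rng.permutation(500).tolist() for _ in range(3)]  # three methods' rankings
consensus = borda_vote(rankings)
print(topk_overlap(rankings[0], consensus.tolist(), k=50))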

Empowering Collaborative Filtering with Principled Adversarial Contrastive Loss
An Zhang Leheng Sheng Zhibo Cai Xiang Wang Tat-Seng Chua



Research question: How to adopt contrastive learning (CL) in collaborative filtering (CF) while addressing the shortcomings of existing methods regarding out-of-distribution data, false negatives, and top-K evaluation.
Motivation: Although CL has achieved impressive results in self-supervised learning, its adoption in CF for recommendation leaves room for improvement, e.g., in handling out-of-distribution data, false negatives, and top-K evaluation.
Method: An adversarial InfoNCE loss tailored to CF (AdvInfoNCE) is proposed, which adaptively explores and assigns a hardness to each negative instance in an adversarial fashion and uses a fine-grained hardness-aware ranking criterion to strengthen the recommender's generalization ability.
Results: Training CF models with AdvInfoNCE on synthetic and real-world benchmark datasets validates its effectiveness in mitigating out-of-distribution problems. Given its theoretical guarantees and empirical superiority over most contrastive losses, the authors advocate its adoption as a standard loss in recommender systems, particularly for out-of-distribution tasks.

Contrastive Learning (CL) has achieved impressive performance in self-supervised learning tasks, showing superior generalization ability. Inspired by the success, adopting CL into collaborative filtering (CF) is prevailing in semi-supervised top-K recommendation. The basic idea is to routinely conduct heuristic-based data augmentation and apply contrastive losses (e.g., InfoNCE) on the augmented views. Yet, some CF-tailored challenges make this adoption suboptimal, such as the issue of out-of-distribution, the risk of false negatives, and the nature of top-K evaluation. They necessitate the CL-based CF scheme to focus more on mining hard negatives and distinguishing false negatives from the vast unlabeled user-item interactions, for informative contrast signals. Worse still, there is limited understanding of contrastive loss in CF methods, especially w.r.t. its generalization ability. To bridge the gap, we delve into the reasons underpinning the success of contrastive loss in CF, and propose a principled Adversarial InfoNCE loss (AdvInfoNCE), which is a variant of InfoNCE, specially tailored for CF methods. AdvInfoNCE adaptively explores and assigns hardness to each negative instance in an adversarial fashion and further utilizes a fine-grained hardness-aware ranking criterion to empower the recommender’s generalization ability. Training CF models with AdvInfoNCE, we validate the effectiveness of AdvInfoNCE on both synthetic and real-world benchmark datasets, thus showing its generalization ability to mitigate out-of-distribution problems. Given the theoretical guarantees and empirical superiority of AdvInfoNCE over most contrastive loss functions, we advocate its adoption as a standard loss in recommender systems, particularly for the out-of-distribution tasks. Codes are available at https://github.com/LehengTHU/AdvInfoNCE.
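
A sketch of the core idea of a hardness-weighted InfoNCE, assuming per-negative hardness logits that would be trained adversarially elsewhere; this is an illustrative reading of the abstract, not the paper's exact formulation, and all names here are placeholders.

import torch
import torch.nn.functional as F

def hardness_infonce(user_emb, pos_emb, neg_emb, hardness, tau=0.1):
    # user_emb, pos_emb: (B, D); neg_emb: (B, N, D); hardness: (B, N) logits.
    pos = F.cosine_similarity(user_emb, pos_emb, dim=-1) / tau              # (B,)
    neg = F.cosine_similarity(user_emb.unsqueeze(1), neg_emb, dim=-1) / tau  # (B, N)
    w = torch.softmax(hardness, dim=-1)
    # A hardness-weighted log-sum-exp over negatives replaces the uniform one.
    denom = torch.logsumexp(neg + torch.log(w + 1e-12), dim=-1)
    return (-pos + torch.logaddexp(pos, denom)).mean()

B, N, D = 4, 16, 32
u, p = torch.randn(B, D), torch.randn(B, D)
negs = torch.randn(B, N, D)
h = torch.zeros(B, N, requires_grad=True)  # adversarially learned in the paper
print(hardness_infonce(u, p, negs, h))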

Enhancing Minority Classes by Mixing: An Adaptative Optimal Transport Approach for Long-tailed Classification
Jintong Gao He Zhao Zhuo Li Dan dan Guo



Research question: How to mitigate severe class imbalance in real-world data, where majority classes have a significantly larger presence in the training set than minority classes.
Motivation: Previous mixup-based methods mix background images from majority classes and foreground images from minority classes at random, ignoring sample-level semantic similarity and possibly producing less reasonable or less useful images.
Method: An adaptive image-mixing method based on optimal transport (OT) incorporates both class-level and sample-level information to generate semantically reasonable and meaningful mixed images for minority classes; thanks to its flexibility, it can be combined with existing long-tailed classification methods and also serves as a general data augmentation method for balanced datasets.
Results: Extensive experiments indicate that the method achieves effective performance on long-tailed classification tasks.

Real-world data usually confronts severe class-imbalance problems, where several majority classes have a significantly larger presence in the training set than minority classes. One effective solution is using mixup-based methods to generate synthetic samples to enhance the presence of minority classes. Previous approaches mix the background images from the majority classes and foreground images from the minority classes in a random manner, which ignores the sample-level semantic similarity, possibly resulting in less reasonable or less useful images. In this work, we propose an adaptive image-mixing method based on optimal transport (OT) to incorporate both class-level and sample-level information, which is able to generate semantically reasonable and meaningful mixed images for minority classes. Due to its flexibility, our method can be combined with existing long-tailed classification methods to enhance their performance and it can also serve as a general data augmentation method for balanced datasets. Extensive experiments indicate that our method achieves effective performance for long-tailed classification tasks. The code is available at https://github.com/JintongGao/Enhancing-Minority-Classes-by-Mixing.

Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion
Zhengyi Yang Jiancan Wu Zhicai Wang Xiang Wang Yancheng Yuan Xiangnan He



Research question: This paper addresses two inherent limitations of the learning-to-classify paradigm in sequential recommendation: a user may imagine an oracle item in mind and select potential items matching it, and the classification is confined to a candidate pool with noisy or easy supervision from negative samples.
Motivation: Current sequential recommenders learn to classify the user's preference for the next item, which may differ from human behavior and is limited by noisy or easy negative-sample supervision.
Method: Sequential recommendation is reshaped into a learning-to-generate paradigm via a guided diffusion model, DreamRec. For a sequence of historical items, a Transformer encoder creates guidance representations; noising target items explores the underlying distribution of the item space, and the denoising process, guided by historical interactions, generates an oracle item to recover the positive item, casting off negative sampling and directly depicting the user's true preference.
Results: Extensive experiments and comparisons with existing methods validate the effectiveness of DreamRec.

Sequential recommendation aims to recommend the next item that matches a user’s interest, based on the sequence of items he/she interacted with before. Scrutinizing previous studies, we can summarize a common learning-to-classify paradigm— given a positive item, a recommender model performs negative sampling to add negative items and learns to classify whether the user prefers them or not, based on his/her historical interaction sequence. Although effective, we reveal two inherent limitations: (1) it may differ from human behavior in that a user could imagine an oracle item in mind and select potential items matching the oracle; and (2) the classification is limited in the candidate pool with noisy or easy supervision from negative samples, which dilutes the preference signals towards the oracle item. Yet, generating the oracle item from the historical interaction sequence is mostly unexplored. To bridge the gap, we reshape sequential recommendation as a learning-to-generate paradigm, which is achieved via a guided diffusion model, termed DreamRec. Specifically, for a sequence of historical items, it applies a Transformer encoder to create guidance representations. Noising target items explores the underlying distribution of item space; then, with the guidance of historical interactions, the denoising process generates an oracle item to recover the positive item, so as to cast off negative sampling and depict the true preference of the user directly. We evaluate the effectiveness of DreamRec through extensive experiments and comparisons with existing methods. Codes and data are open-sourced at https://github.com/YangZhengyi98/DreamRec.

Adaptive Normalization for Non-stationary Time Series Forecasting: A Temporal Slice Perspective
Zhiding Liu Mingyue Cheng Zhi Li Zhenya Huang Qi Liu Yanhu Xie Enhong Chen



Research question: Although deep learning models keep improving at capturing sequence dependence, accurate forecasting remains challenging because real-world data are non-stationary, i.e., the data distribution changes rapidly over time.
Motivation: Existing approaches to reducing non-stationarity typically overlook the distribution discrepancy between the input series and the horizon series, and assume that all time points within the same instance share the same statistical properties, which may lead to suboptimal relative improvements.
Method: A slice-level adaptive normalization (SAN) scheme that removes non-stationarity per local temporal slice (sub-series) rather than per global instance, and employs a light network module to independently model the evolving trends of the raw series' statistical properties, thereby strengthening time series forecasting.
Results: SAN is instantiated on four widely used forecasting models and evaluated on benchmark datasets; the paper also reports insightful findings that analyze and explain the proposed SAN in depth.

Deep learning models have progressively advanced time series forecasting due to their powerful capacity in capturing sequence dependence. Nevertheless, it is still challenging to make accurate predictions due to the existence of non-stationarity in real-world data, meaning that the data distribution rapidly changes over time. To mitigate such a dilemma, several efforts have been conducted by reducing the non-stationarity with a normalization operation. However, these methods typically overlook the distribution discrepancy between the input series and the horizon series, and assume that all time points within the same instance share the same statistical properties, which is too ideal and may lead to suboptimal relative improvements. To this end, we propose a novel slice-level adaptive normalization scheme, referred to as \textbf{SAN}, for empowering time series forecasting with more flexible normalization and denormalization. SAN includes two crucial designs. First, SAN tries to eliminate the non-stationarity of time series in units of a local temporal slice (i.e., sub-series) rather than a global instance. Second, SAN employs a slight network module to independently model the evolving trends of statistical properties of raw time series. Consequently, SAN could serve as a general model-agnostic plugin and better alleviate the impact of the non-stationary nature of time series data. We instantiate the proposed SAN on four widely used forecasting models and test their prediction results on benchmark datasets to evaluate its effectiveness. Also, we report some insightful findings to deeply analyze and understand our proposed SAN. We make our codes publicly available.
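
The slice-level idea can be sketched compactly. The following is a minimal illustration in the spirit of SAN, not the paper's implementation: each local temporal slice is normalized by its own statistics, and the statistics-prediction module is replaced here by naive persistence of the last slice's mean and standard deviation.

```python
import numpy as np

def slice_normalize(x, slice_len):
    """x: (T,) series; returns normalized slices plus per-slice (mean, std)."""
    slices = x.reshape(-1, slice_len)                  # (num_slices, slice_len)
    mu = slices.mean(axis=1, keepdims=True)
    sigma = slices.std(axis=1, keepdims=True) + 1e-8
    return (slices - mu) / sigma, mu, sigma

def slice_denormalize(y_norm, mu_pred, sigma_pred):
    return y_norm * sigma_pred + mu_pred

T, slice_len = 96, 12
t = np.arange(T)
x = np.sin(0.3 * t) + 0.02 * t                         # non-stationary toy series

x_norm, mu, sigma = slice_normalize(x, slice_len)
# A forecaster would operate on x_norm; SAN predicts the future slices'
# statistics with a small network. Here we simply persist the last slice's.
mu_pred, sigma_pred = mu[-1], sigma[-1]
y_norm = x_norm[-1]                                    # dummy "forecast"
y = slice_denormalize(y_norm, mu_pred, sigma_pred)
print(y.shape)  # (12,) denormalized forecast for the next slice
```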

CSOT: Curriculum and Structure-Aware Optimal Transport for Learning with Noisy Labels
Wanxing Chang Ye Shi Jingya Wang



Research question: How can a well-generalized model be trained while avoiding overfitting to corrupted labels?
Motivation: Existing methods rely heavily on model predictions and evaluate each sample independently, ignoring the global and local structure of the sample distribution; this typically yields suboptimal identification and correction and ultimately overfitting to incorrect labels.
Method: A novel optimal transport (OT) formulation, Curriculum and Structure-aware Optimal Transport (CSOT), which concurrently considers the inter- and intra-distribution structure of the samples to build a robust denoising and relabeling allocator; during training, the allocator incrementally assigns the most reliable labels to the samples with the highest confidence.
Results: Extensive experiments show the method outperforms state-of-the-art LNL approaches.

Learning with noisy labels (LNL) poses a significant challenge in training a well-generalized model while avoiding overfitting to corrupted labels. Recent advances have achieved impressive performance by identifying clean labels and correcting corrupted labels for training. However, the current approaches rely heavily on the model’s predictions and evaluate each sample independently without considering either the global or local structure of the sample distribution. These limitations typically result in a suboptimal solution for the identification and correction processes, which eventually leads to models overfitting to incorrect labels. In this paper, we propose a novel optimal transport (OT) formulation, called Curriculum and Structure-aware Optimal Transport (CSOT). CSOT concurrently considers the inter- and intra-distribution structure of the samples to construct a robust denoising and relabeling allocator. During the training process, the allocator incrementally assigns reliable labels to a fraction of the samples with the highest confidence. These labels have both global discriminability and local coherence. Notably, CSOT is a new OT formulation with a nonconvex objective function and curriculum constraints, so it is not directly compatible with classical OT solvers. Here, we develop a lightspeed computational method that involves a scaling iteration within a generalized conditional gradient framework to solve CSOT efficiently. Extensive experiments demonstrate the superiority of our method over the current state-of-the-arts in LNL.

Energy-Based Models for Anomaly Detection: A Manifold Diffusion Recovery Approach
Sangwoong Yoon Young-Uk Jin Yung-Kyun Noh Frank C. Park



Research question: How to exploit low-dimensional structure in data when training energy-based models for anomaly detection.
Motivation: The anomaly-detection performance of existing energy-based models leaves room for improvement; low-dimensional structural information in the data should be used more effectively.
Method: Manifold Projection-Diffusion Recovery (MPDR) first perturbs a data point along a low-dimensional manifold that approximates the training set, then trains the energy-based model to maximize the probability of recovering the original data. Negative samples are still generated via MCMC during training, but from a distribution concentrated near the manifold, yielding near-manifold negatives that closely reflect data-relevant modes of variation.
Results: Experiments show MPDR performs strongly on a variety of anomaly detection tasks involving diverse data types, including images, vectors, and acoustic signals.

We present a new method of training energy-based models (EBMs) for anomaly detection that leverages low-dimensional structures within data. The proposed algorithm, Manifold Projection-Diffusion Recovery (MPDR), first perturbs a data point along a low-dimensional manifold that approximates the training dataset. Then, an EBM is trained to maximize the probability of recovering the original data. The training involves the generation of negative samples via MCMC, as in conventional EBM training, but from a different distribution concentrated near the manifold. The resulting near-manifold negative samples are highly informative, reflecting relevant modes of variation in data. The energy function of MPDR effectively learns accurate boundaries of the training data distribution and excels at detecting out-of-distribution samples. Experimental results show that MPDR exhibits strong performance across various anomaly detection tasks involving diverse data types, such as images, vectors, and acoustic signals.

Optimal Transport Model Distributional Robustness
Van-Anh Nguyen Trung Le Anh Tuan Bui Thanh-Toan Do Dinh Phung



Research question: How to train deep learning models that resist adversarial examples and data distribution shifts.
Motivation: Existing deep learning models are vulnerable to adversarial examples and to shifts in the data distribution.
Method: An optimal transport-based distributional robustness framework over model spaces, which learns the optimal robust center model distribution by maximizing the loss within a Wasserstein ball.
Results: The theory is validated across settings (single models, ensembles, Bayesian neural networks), and experiments show remarkable improvements over the baselines.

Distributional robustness is a promising framework for training deep learning models that are less vulnerable to adversarial examples and data distribution shifts. Previous works have mainly focused on exploiting distributional robustness in the data space. In this work, we explore an optimal transport-based distributional robustness framework in model spaces. Specifically, we examine a model distribution within a Wasserstein ball centered on a given model distribution that maximizes the loss. We have developed theories that enable us to learn the optimal robust center model distribution. Interestingly, our developed theories allow us to flexibly incorporate the concept of sharpness awareness into training, whether it's a single model, ensemble models, or Bayesian Neural Networks, by considering specific forms of the center model distribution. These forms include a Dirac delta distribution over a single model, a uniform distribution over several models, and a general Bayesian Neural Network. Furthermore, we demonstrate that Sharpness-Aware Minimization (SAM) is a specific case of our framework when using a Dirac delta distribution over a single model, while our framework can be seen as a probabilistic extension of SAM. To validate the effectiveness of our framework in the aforementioned settings, we conducted extensive experiments, and the results reveal remarkable improvements compared to the baselines.
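
Since the abstract identifies Sharpness-Aware Minimization (SAM) as the Dirac-delta special case of the framework, a minimal SAM loop is a useful reference point. The sketch below is illustrative only, using a toy loss with an analytic gradient; `rho` is the perturbation radius.

```python
import numpy as np

def loss(w):
    # Toy non-convex loss with many sharp local structures.
    return 0.5 * np.sum(w ** 2) + np.sin(5 * w).sum()

def grad(w):
    return w + 5 * np.cos(5 * w)

w = np.array([2.0, -1.5])
lr, rho = 0.05, 0.1
for step in range(100):
    g = grad(w)
    # Ascent step to the (approximate) worst-case point in the rho-ball ...
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # ... then descend using the gradient evaluated at the perturbed point.
    w = w - lr * grad(w + eps)
print(w, loss(w))
```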

SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models
Xiaosong Ma Jie ZHANG Song Guo Wenchao Xu



Research question: How to adapt pre-trained vision-language models at test time on unlabeled target domains.
Motivation: Existing methods focus only on entropy-based optimization, and their performance is far below that of supervised prompt adaptation methods such as CoOp.
Method: The SwapPrompt framework, which adopts a dual-prompt paradigm and a swapped prediction mechanism, enhancing the online prompt via contrastive learning.
Results: Experiments show SwapPrompt achieves state-of-the-art test-time adaptation performance on ImageNet and nine other datasets, and can even match supervised prompt adaptation methods.

Test-time adaptation (TTA) is a special and practical setting in unsupervised domain adaptation, which allows a pre-trained model in a source domain to adapt to unlabeled test data in another target domain. To avoid the computation-intensive backbone fine-tuning process, the zero-shot generalization potentials of the emerging pre-trained vision-language models (e.g., CLIP, CoOp) are leveraged to only tune the run-time prompt for unseen test domains. However, existing solutions have yet to fully exploit the representation capabilities of pre-trained models as they only focus on entropy-based optimization, and their performance is far below that of supervised prompt adaptation methods, e.g., CoOp. In this paper, we propose SwapPrompt, a novel framework that can effectively leverage self-supervised contrastive learning to facilitate test-time prompt adaptation. SwapPrompt employs a dual prompts paradigm, i.e., an online prompt and a target prompt that is averaged from the online prompt to retain historical information. In addition, SwapPrompt applies a swapped prediction mechanism, which takes advantage of the representation capabilities of pre-trained models to enhance the online prompt via contrastive learning. Specifically, we use the online prompt together with an augmented view of the input image to predict the class assignment generated by the target prompt together with an alternative augmented view of the same image. The proposed SwapPrompt can be easily deployed on vision-language models without additional requirements, and experimental results show that it achieves state-of-the-art test-time adaptation performance on ImageNet and nine other datasets. It is also shown that SwapPrompt can even achieve comparable performance with supervised prompt adaptation methods.

Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models
Beier Zhu Kaihua Tang Qianru Sun Hanwang Zhang



Research question: This paper addresses the inherent biases of pre-trained foundation models.
Motivation: Because of the extreme imbalance of pre-training datasets, foundation models are skewed toward frequent semantics, so subsequent fine-tuning and ensembling remain biased.
Method: A Generalized Logit Adjustment (GLA) method that estimates the bias by optimization in order to debias foundation models.
Results: GLA shows significant improvements across a range of tasks, including ImageNet, 11 few-shot datasets, and long-tailed classification.

Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data. Yet, the zero-shot performance is less competitive than a fully supervised one. Thus, to enhance the performance, fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks. However, we argue that such prior work has overlooked the inherent biases in foundation models. Due to the highly imbalanced Web-scale training set, these foundation models are inevitably skewed toward frequent semantics, and thus the subsequent fine-tuning or ensembling is still biased. In this study, we systematically examine the biases in foundation models and demonstrate the efficacy of our proposed Generalized Logit Adjustment (GLA) method. Note that bias estimation in foundation models is challenging, as most pre-train data cannot be explicitly assessed like in traditional long-tailed classification tasks. To this end, GLA has an optimization-based bias estimation approach for debiasing foundation models. As our work resolves a fundamental flaw in the pre-training, the proposed GLA demonstrates significant improvements across a diverse range of tasks: it achieves 1.5 pp accuracy gains on ImageNet, a large average improvement (1.4-4.6 pp) on 11 few-shot datasets, and 2.4 pp gains on long-tailed classification. Code is available at https://github.com/BeierZhu/GLA.
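
GLA's bias estimator is optimization-based, but the adjustment it generalizes is simple to show. The sketch below is a hypothetical illustration of classical logit adjustment, assuming the class prior has already been estimated (which is the hard part GLA addresses): debiased logits are obtained by subtracting the log-prior.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, n = 5, 4
logits = rng.normal(size=(n, num_classes))   # e.g., zero-shot logits

# Assumed (already estimated) pre-training label prior; GLA obtains such an
# estimate by optimization, since pre-training data cannot be inspected.
prior = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

adjusted = logits - np.log(prior)            # debiased logits
pred = adjusted.argmax(axis=1)
print(pred)
```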

Optimal Parameter and Neuron Pruning for Out-of-Distribution Detection
Chao Chen Zhihang Fu Kai Liu Ze Chen Mingyuan Tao Jieping Ye



Research question: For machine learning models deployed in real-world scenarios, the ability to detect out-of-distribution (OOD) samples is indispensable and challenging.
Motivation: Most existing OOD detection methods explore advanced training skills or training-free tricks to keep models from producing overconfident confidence scores on unknown samples. Training-based methods incur expensive training costs and rely on OOD samples that are not always available, while most training-free methods cannot effectively exploit the prior information in the training data.
Method: An Optimal Parameter and Neuron Pruning (OPNP) approach that identifies and removes parameters and neurons causing overfitting, in two steps: first, evaluate the sensitivity of model parameters and neurons by averaging gradients over all training samples; second, remove parameters and neurons with exceptionally large or near-zero sensitivity before prediction.
Results: Extensive experiments on multiple OOD detection tasks and model architectures show that OPNP consistently outperforms existing methods by a clear margin.

For a machine learning model deployed in real-world scenarios, the ability to detect out-of-distribution (OOD) samples is indispensable and challenging. Most existing OOD detection methods focus on exploring advanced training skills or training-free tricks to prevent the model from yielding overconfident confidence scores for unknown samples. The training-based methods require expensive training costs and rely on OOD samples which are not always available, while most training-free methods cannot efficiently utilize the prior information from the training data. In this work, we propose an \textbf{O}ptimal \textbf{P}arameter and \textbf{N}euron \textbf{P}runing (\textbf{OPNP}) approach, which aims to identify and remove those parameters and neurons that lead to over-fitting. The main method is divided into two steps. In the first step, we evaluate the sensitivity of the model parameters and neurons by averaging gradients over all training samples. In the second step, the parameters and neurons with exceptionally large or close-to-zero sensitivities are removed for prediction. Our proposal is training-free, compatible with other post-hoc methods, and explores the information from all training data. Extensive experiments are performed on multiple OOD detection tasks and model architectures, showing that our proposed OPNP consistently outperforms the existing methods by a large margin.
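
The two OPNP steps map directly onto a few lines of code. The following is a minimal sketch on a toy logistic-regression layer, not the paper's implementation; the pruning quantiles are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 20))                     # training features
y = (X[:, 0] > 0).astype(float)                    # toy binary labels
W = rng.normal(size=(20,)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: parameter sensitivity = magnitude of the gradient averaged over
# all training samples (logistic loss gradient is (p - y) * x).
p = sigmoid(X @ W)
grad_per_sample = (p - y)[:, None] * X
sensitivity = np.abs(grad_per_sample.mean(axis=0))

# Step 2: prune parameters with near-zero or exceptionally large sensitivity.
lo, hi = np.quantile(sensitivity, [0.10, 0.95])
mask = (sensitivity > lo) & (sensitivity < hi)
W_pruned = W * mask                                # used for prediction/OOD scoring

print(f"kept {mask.sum()} of {mask.size} parameters")
```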

Nonparametric Teaching for Multiple Learners
Chen Zhang Xiaofeng Cao Weiyang Liu Ivor Tsang James Kwok



Research question: This paper studies teaching multiple learners simultaneously in the nonparametric iterative teaching setting, where a teacher iteratively provides examples to accelerate the learners' acquisition of a target concept.
Motivation: There is a gap between the current single-learner teaching setting and real-world human instruction, where a teacher typically imparts knowledge to multiple students.
Method: A novel framework, Multi-learner Nonparametric Teaching (MINT), in which the teacher instructs multiple learners, each focusing on learning a scalar-valued target model. The problem is framed as teaching a vector-valued target model, extending the target model space from the scalar-valued reproducing kernel Hilbert space of the single-learner scenario to a vector-valued space.
Results: Experiments show MINT offers significant teaching speed-ups over repeated single-learner teaching, particularly when the learners can communicate with each other; extensive experiments further validate MINT's practicality and efficiency.

We study the problem of teaching multiple learners simultaneously in the nonparametric iterative teaching setting, where the teacher iteratively provides examples to the learner for accelerating the acquisition of a target concept. This problem is motivated by the gap between current single-learner teaching setting and the real-world scenario of human instruction where a teacher typically imparts knowledge to multiple students. Under the new problem formulation, we introduce a novel framework -- Multi-learner Nonparametric Teaching (MINT). In MINT, the teacher aims to instruct multiple learners, with each learner focusing on learning a scalar-valued target model. To achieve this, we frame the problem as teaching a vector-valued target model and extend the target model space from a scalar-valued reproducing kernel Hilbert space used in single-learner scenarios to a vector-valued space. Furthermore, we demonstrate that MINT offers significant teaching speed-up over repeated single-learner teaching, particularly when the multiple learners can communicate with each other. Lastly, we conduct extensive experiments to validate the practicality and efficiency of MINT.

Towards Accelerated Model Training via Bayesian Data Selection
Zhijie Deng Peng Cui Jun Zhu



Research question: Mislabeled, duplicated, or biased real-world data can prolong training and even hinder model convergence.
Motivation: Traditional solutions that prioritize easy or hard samples lack the flexibility to handle such a variety of data issues simultaneously.
Method: These problems are addressed by a lightweight Bayesian treatment combined with off-the-shelf zero-shot predictors built on large-scale pre-trained models.
Results: Extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario show superior training efficiency over competitive baselines; notably, on the challenging WebVision benchmark the method matches the predictive performance of leading data selection methods with significantly fewer training iterations.

Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.

ATTA: Anomaly-aware Test-Time Adaptation for Out-of-Distribution Detection in Segmentation
Zhitong Gao Shipeng Yan Xuming He



Research question: Existing OOD detection models mostly target scenarios where training and test data share a similar domain, but in the real world domain shift is common and severely degrades their accuracy.
Motivation: In practice, domain shift and semantic shift coexist, and both affect the accuracy of OOD detection models.
Method: A dual-level OOD detection framework that handles domain shift and semantic shift jointly: the first level uses global low-level features to decide whether an image exhibits domain shift, and the second level uses dense high-level feature maps to identify pixels with semantic shift.
Results: The method is validated on several OOD segmentation benchmarks, with and without significant domain shift, showing consistent performance improvements across various baseline models.

Recent advancements in dense out-of-distribution (OOD) detection have primarily focused on scenarios where the training and testing datasets share a similar domain, with the assumption that no domain shift exists between them. However, in real-world situations, domain shift often exists and significantly affects the accuracy of existing OOD detection models. In this work, we propose a dual-level OOD detection framework to handle domain shift and semantic shift jointly. The first level distinguishes whether domain shift exists in the image by leveraging global low-level features, while the second level identifies pixels with semantic shift by utilizing dense high-level feature maps. In this way, we can selectively adapt the model to unseen domains as well as enhance the model's capacity in detecting novel classes. We validate the efficacy of our proposed method on several OOD segmentation benchmarks, including those with significant domain shifts and those without, observing consistent performance improvements across various baseline models. Code is available at https://github.com/gaozhitong/ATTA.

MoVie: Visual Model-Based Policy Adaptation for View Generalization
Sizhe Yang Yanjie Ze Huazhe Xu



Research question: Visual reinforcement learning agents trained on limited views face significant challenges in generalizing their learned abilities to unseen views, i.e., the view generalization problem.
Motivation: Solving view generalization has great potential for real-world robotics applications.
Method: A simple yet effective approach that adapts visual model-based policies to view generalization at test time, without any explicit reward signal or modification at training time.
Results: The method achieves substantial gains across all four scenarios, covering 18 tasks from DMControl, xArm, and Adroit, with relative improvements of 33%, 86%, and 152%, respectively.

Visual Reinforcement Learning (RL) agents trained on limited views face significant challenges in generalizing their learned abilities to unseen views. This inherent difficulty is known as the problem of $\textit{view generalization}$. In this work, we systematically categorize this fundamental problem into four distinct and highly challenging scenarios that closely resemble real-world situations. Subsequently, we propose a straightforward yet effective approach to enable successful adaptation of visual $\textbf{Mo}$del-based policies for $\textbf{Vie}$w generalization ($\textbf{MoVie}$) during test time, without any need for explicit reward signals and any modification during training time. Our method demonstrates substantial advancements across all four scenarios encompassing a total of $\textbf{18}$ tasks sourced from DMControl, xArm, and Adroit, with a relative improvement of $\mathbf{33}$%, $\mathbf{86}$%, and $\mathbf{152}$% respectively. The superior results highlight the immense potential of our approach for real-world robotics applications. Code and videos are available at https://yangsizhe.github.io/MoVie/.

Partial Label Learning with Dissimilarity Propagation guided Candidate Label Shrinkage
Yuheng Jia Fuchao Yang Yongqiang Dong



Research question: In partial label learning (PLL), how to find the correct label within each sample's set of candidate labels.
Motivation: Existing PLL methods cannot effectively disambiguate the candidate label set to identify the ground-truth label.
Method: Build a constrained regression model to capture candidate-label confidence and multiply the confidence matrix by its transpose to obtain a second-order similarity matrix; develop a semantic dissimilarity matrix from the complement of the intersection of candidate label sets and propagate the initial dissimilarity relationships to the whole dataset via the local geometric structure of samples; finally, extend the model to a kernel version to exploit the non-linear structure of samples and solve it with the inexact augmented Lagrange multiplier method.
Results: The method outperforms state-of-the-art PLL algorithms on 10 artificial and 7 real-world partial label datasets, with theoretical guarantees of its effectiveness.

In partial label learning (PLL), each sample is associated with a group of candidate labels, among which only one label is correct. The key of PLL is to disambiguate the candidate label set to find the ground-truth label. To this end, we first construct a constrained regression model to capture the confidence of the candidate labels, and multiply the label confidence matrix by its transpose to build a second-order similarity matrix, whose elements indicate the pairwise similarity relationships of samples globally. Then we develop a semantic dissimilarity matrix by considering the complement of the intersection of the candidate label set, and further propagate the initial dissimilarity relationships to the whole data set by leveraging the local geometric structure of samples. The similarity and dissimilarity matrices form an adversarial relationship, which is further utilized to shrink the solution space of the label confidence matrix and promote the dissimilarity matrix. We finally extend the proposed model to a kernel version to exploit the non-linear structure of samples and solve the proposed model by the inexact augmented Lagrange multiplier method. By exploiting the adversarial prior, the proposed method can significantly outperform state-of-the-art PLL algorithms when evaluated on 10 artificial and 7 real-world partial label data sets. We also prove the effectiveness of our method with some theoretical guarantees. The code is publicly available at https://github.com/Yangfc-ML/DPCLS.
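
Two building blocks of this pipeline are easy to make concrete. The sketch below is a hypothetical toy example: the second-order similarity matrix is the label-confidence matrix times its transpose, and the initial semantic dissimilarity marks sample pairs whose candidate label sets are disjoint (in the paper, the confidence matrix comes from a constrained regression model, and the dissimilarity is then propagated via local geometry).

```python
import numpy as np

# Toy candidate label sets for 4 samples over 3 classes.
candidates = [{0, 1}, {1, 2}, {0}, {2}]

# Label-confidence matrix C (rows sum to 1, supported on the candidates);
# in the paper C is produced by a constrained regression model.
C = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

S = C @ C.T  # second-order similarity between samples

# Initial dissimilarity: 1 when two candidate sets do not intersect,
# i.e., the two samples certainly have different ground-truth labels.
n = len(candidates)
D = np.array([[float(not (candidates[i] & candidates[j])) for j in range(n)]
              for i in range(n)])
print(S.round(2))
print(D)
```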

NPCL: Neural Processes for Uncertainty-Aware Continual Learning
Saurav Jha Dong Gong He Zhao Lina Yao



Research question: This paper targets the efficiency of continual learning (CL) on streaming data, the forgetting caused by inter-task interference, and the inability of existing CL models to accurately measure predictive uncertainty.
Motivation: CL must train deep neural networks efficiently on streaming data while limiting the forgetting caused by new tasks; yet learning transferable knowledge with low inter-task interference is difficult, and real-world deployment of CL models is limited by their inability to quantify predictive uncertainty.
Method: Handle CL tasks with neural processes (NPs), meta-learners that encode different tasks into probabilistic distributions over functions while providing reliable uncertainty estimates. Concretely, an NP-based CL approach (NPCL) arranges task-specific modules in a hierarchical latent variable model and tailors regularizers on the learned posterior distributions to alleviate forgetting; NPCL's uncertainty estimates can also handle the task head/module inference challenge in CL.
Results: Experiments show NPCL outperforms previous CL approaches, and validate the effectiveness of its uncertainty estimation for identifying novel data and evaluating instance-level model confidence. Code is available at https://github.com/srvCodes/NPCL.

Continual learning (CL) aims to train deep neural networks efficiently on streaming data while limiting the forgetting caused by new tasks. However, learning transferable knowledge with less interference between tasks is difficult, and real-world deployment of CL models is limited by their inability to measure predictive uncertainties. To address these issues, we propose handling CL tasks with neural processes (NPs), a class of meta-learners that encode different tasks into probabilistic distributions over functions all while providing reliable uncertainty estimates. Specifically, we propose an NP-based CL approach (NPCL) with task-specific modules arranged in a hierarchical latent variable model. We tailor regularizers on the learned latent distributions to alleviate forgetting. The uncertainty estimation capabilities of the NPCL can also be used to handle the task head/module inference challenge in CL. Our experiments show that the NPCL outperforms previous CL approaches. We validate the effectiveness of uncertainty estimation in the NPCL for identifying novel data and evaluating instance-level model confidence. Code is available at https://github.com/srvCodes/NPCL.

Model and Feature Diversity for Bayesian Neural Networks in Mutual Learning
Cuong Pham Cuong C. Nguyen Trung Le Dinh Phung Gustavo Carneiro Thanh-Toan Do



Research question: This paper aims to improve the performance of Bayesian neural networks (BNNs) through deep mutual learning.
Motivation: Although BNNs provide probability distributions over model parameters and enable uncertainty quantification in predictions, they often underperform deterministic networks; mutual learning can effectively enhance the performance of peer BNNs.
Method: A novel deep mutual learning approach that increases diversity in both network parameter distributions and feature distributions, encouraging peer networks to acquire distinct features that capture different characteristics of the input and thereby strengthening mutual learning.
Results: Experiments show significant improvements in classification accuracy, negative log-likelihood, and expected calibration error compared with traditional mutual learning for BNNs.

Bayesian Neural Networks (BNNs) offer probability distributions for model parameters, enabling uncertainty quantification in predictions. However, they often underperform compared to deterministic neural networks. Utilizing mutual learning can effectively enhance the performance of peer BNNs. In this paper, we propose a novel approach to improve BNN performance through deep mutual learning. The proposed approach aims to increase diversity in both network parameter distributions and feature distributions, encouraging peer networks to acquire distinct features that capture different characteristics of the input, which enhances the effectiveness of mutual learning. Experimental results demonstrate significant improvements in classification accuracy, negative log-likelihood, and expected calibration error when compared to traditional mutual learning for BNNs.

On the Stability-Plasticity Dilemma in Continual Meta-Learning: Theory and Algorithm
Qi CHEN Changjian Shui Ligong Han Mario Marchand



Research question: This paper addresses the stability-plasticity balance in continual meta-learning (CML): learning generalizable concepts from new tasks while avoiding catastrophic forgetting of previous tasks.
Motivation: The primary challenge of CML is how to effectively accumulate and exploit meta-knowledge over a sequence of non-i.i.d. tasks.
Method: Formulate the CML objective as controlling the average excess risk upper bound of the task sequence, reflecting the trade-off between forgetting and generalization; based on this objective, introduce a unified theoretical framework for CML in both static and shifting environments, with guarantees for various task-specific learning algorithms.
Results: Empirical evaluations on synthetic and real datasets show the proposed theory and algorithm are effective.

We focus on Continual Meta-Learning (CML), which targets accumulating and exploiting meta-knowledge on a sequence of non-i.i.d. tasks. The primary challenge is to strike a balance between stability and plasticity, where a model should be stable to avoid catastrophic forgetting of previous tasks and plastic enough to learn generalizable concepts from new tasks. To address this, we formulate the CML objective as controlling the average excess risk upper bound of the task sequence, which reflects the trade-off between forgetting and generalization. Based on the objective, we introduce a unified theoretical framework for CML in both static and shifting environments, providing guarantees for various task-specific learning algorithms. Moreover, we present the first rigorous analysis of a bi-level trade-off in shifting environments. To approach the optimal trade-off, we propose a novel algorithm that dynamically adjusts the meta-parameter and its learning rate with respect to environment change. Empirical evaluations on synthetic and real datasets illustrate the effectiveness of the proposed theory and algorithm.

Adaptive Test-Time Personalization for Federated Learning
Wenxuan Bao Tianxin Wei Haohan Wang Jingrui He



Research question: How to personalize federated learning at test time without labeled data.
Motivation: Most existing personalized federated learning methods require labeled data on the testing clients, which is usually unavailable in real-world scenarios.
Method: A new setting, test-time personalized federated learning (TTPFL), in which clients locally adapt the global model at test time without relying on any labeled data.
Results: Experiments show that ATP outperforms existing TTA methods in handling various distribution shifts, including label shift, image corruption, and domain shift, and performs strongly across multiple datasets and model architectures.

Personalized federated learning algorithms have shown promising results in adapting models to various distribution shifts. However, most of these methods require labeled data on testing clients for personalization, which is usually unavailable in real-world scenarios. In this paper, we introduce a novel setting called test-time personalized federated learning (TTPFL), where clients locally adapt a global model in an unsupervised way without relying on any labeled data during test-time. While traditional test-time adaptation (TTA) methods can be used in this scenario, most of them inherently assume that training data come from a single domain, whereas in federated learning they come from multiple clients (source domains) with different distributions. Overlooking these domain interrelationships can result in suboptimal generalization. Moreover, most TTA algorithms are designed for a specific kind of distribution shift and lack the flexibility to handle multiple kinds of distribution shifts in FL. In this paper, we find that this lack of flexibility partially results from pre-defining which modules to adapt in the model. To tackle this challenge, we propose a novel algorithm called ATP that adaptively learns the adaptation rates for each module in the model from distribution shifts among source domains. Theoretical analysis proves the strong generalization of ATP. Extensive experiments demonstrate its superiority in handling various distribution shifts including label shift, image corruptions, and domain shift, outperforming existing TTA methods across multiple datasets and model architectures. Our code is available at https://github.com/baowenxuan/ATP.

Subclass-Dominant Label Noise: A Counterexample for the Success of Early Stopping
Yingbin Bai Zhongyi Han Erkun Yang Jun Yu Bo Han Dadong Wang Tongliang Liu



Research question: This paper studies an overlooked yet widespread type of label noise, subclass-dominant label noise (SDN), and its effect on deep neural network training.
Motivation: Early in training, deep neural networks rapidly memorize the mislabeled examples in SDN, which makes it challenging to select confident examples with conventional early-stopping techniques.
Method: Based on the observation that long-trained representations better capture the high-level semantics of mislabeled examples, clustering similar examples together, the paper proposes NoiseCluster, a method that leverages the geometric structure of long-trained representations to identify and correct SDN.
Results: Experiments show NoiseCluster outperforms state-of-the-art baselines on both synthetic and real-world datasets, highlighting the importance of addressing SDN in learning with noisy labels.

In this paper, we empirically investigate a previously overlooked and widespread type of label noise, subclass-dominant label noise (SDN). Our findings reveal that, during the early stages of training, deep neural networks can rapidly memorize mislabeled examples in SDN. This phenomenon poses challenges in effectively selecting confident examples using conventional early stopping techniques. To address this issue, we delve into the properties of SDN and observe that long-trained representations are superior at capturing the high-level semantics of mislabeled examples, leading to a clustering effect where similar examples are grouped together. Based on this observation, we propose a novel method called NoiseCluster that leverages the geometric structures of long-trained representations to identify and correct SDN. Our experiments demonstrate that NoiseCluster outperforms state-of-the-art baselines on both synthetic and real-world datasets, highlighting the importance of addressing SDN in learning with noisy labels. The code is available at https://github.com/tmllab/2023_NeurIPS_SDN.
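
The clustering-and-relabeling intuition can be sketched in a few lines. The following toy example is in the spirit of NoiseCluster but is not the authors' algorithm: k-means on (stand-in) long-trained representations, then majority-vote relabeling within each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy "long-trained" representations: 3 well-separated groups of 100 points.
reps = np.concatenate([rng.normal(c, 0.3, size=(100, 8)) for c in (0.0, 3.0, 6.0)])
true = np.repeat([0, 1, 2], 100)
noisy = true.copy()
flip = rng.choice(len(noisy), size=60, replace=False)
noisy[flip] = rng.integers(0, 3, size=60)          # inject label noise

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reps)
corrected = noisy.copy()
for c in range(3):
    members = clusters == c
    # The majority label within a cluster becomes the corrected label.
    corrected[members] = np.bincount(noisy[members]).argmax()

print("noise before:", (noisy != true).mean(), "after:", (corrected != true).mean())
```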

Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Changsheng Lv Shuai Zhang Yapeng Tian Mengshi Qi Huadong Ma



Research question: How to imitate human reasoning and infer objects' physical commonsense from video and audio inputs.
Motivation: Most current methods do not take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in models impedes progress in inferring implicit physical knowledge.
Method: A Disentangled Counterfactual Learning (DCL) approach that decouples videos into static (time-invariant) and dynamic (time-varying) factors via a disentangled sequential encoder and introduces a counterfactual learning module to augment the model's reasoning ability.
Results: Experiments show the method improves baseline methods and achieves state-of-the-art performance.

In this paper, we propose a Disentangled Counterfactual Learning (DCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physics commonsense based on both video and audio input, with the main challenge being how to imitate the reasoning ability of humans. Most of the current methods fail to take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in models impedes progress in inferring implicit physical knowledge. To address these issues, our proposed DCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space by the disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. Our proposed method is a plug-and-play module that can be incorporated into any baseline. In experiments, we show that our proposed method improves baseline methods and achieves state-of-the-art performance. Our source code is available at https://github.com/Andy20178/DCL.

Self-Weighted Contrastive Learning among Multiple Views for Mitigating Representation Degeneration
Jie Xu Shuo Chen Yazhou Ren Xiaoshuang Shi Heng Tao Shen Gang Niu Xiaofeng Zhu



Research question: How to mitigate the representation degeneration that contrastive learning may cause in multi-view scenarios.
Motivation: In multi-view scenarios, contrastive learning can degenerate representations when the collected views carry inconsistent semantic information or when their representations fail to capture sufficient discriminative information.
Method: A framework called SEM: self-weighted multi-view contrastive learning with reconstruction regularization. It first measures the discrepancy between pairwise representations and then minimizes the corresponding self-weighted contrastive loss, allowing SEM to adaptively strengthen useful pairwise views and weaken unreliable ones; a self-supervised reconstruction term additionally regularizes the encoders' hidden features so contrastive learning can access sufficient discriminative information.
Results: Experiments show SEM mitigates representation degeneration in existing contrastive learning methods and helps them achieve significant performance gains; ablation studies confirm its effectiveness under different weighting strategies and reconstruction terms.

Recently, numerous studies have demonstrated the effectiveness of contrastive learning (CL), which learns feature representations by pulling in positive samples while pushing away negative samples. Many successes of CL lie in that there exists semantic consistency between data augmentations of the same instance. In multi-view scenarios, however, CL might cause representation degeneration when the collected multiple views inherently have inconsistent semantic information or their representations subsequently do not capture sufficient discriminative information. To address this issue, we propose a novel framework called SEM: SElf-weighted Multi-view contrastive learning with reconstruction regularization. Specifically, SEM is a general framework where we propose to first measure the discrepancy between pairwise representations and then minimize the corresponding self-weighted contrastive loss, thus making SEM adaptively strengthen the useful pairwise views and also weaken the unreliable pairwise views. Meanwhile, we impose a self-supervised reconstruction term to regularize the hidden features of encoders, to assist CL in accessing sufficient discriminative information of data. Experiments on public multi-view datasets verified that SEM can mitigate representation degeneration in existing CL methods and help them achieve significant performance improvements. Ablation studies also demonstrated the effectiveness of SEM with different options of weighting strategies and reconstruction terms.

Geometry-Aware Adaptation for Pretrained Models
Nicholas Roberts Xintong Li Dyah Adila Sonia Cromp Tzu-Heng Huang Jitian Zhao Frederic Sala



Research question: How to adapt a trained machine learning model, using the distances between labels in a large label space, to accurately predict new classes or improve zero-shot prediction.
Motivation: Machine learning models are typically trained on labels covering only a small proportion of the full label space; exploiting relationships between labels can improve their predictive performance.
Method: A simple approach that replaces the standard prediction rule with the Fréchet mean, enabling reliable prediction of new classes or improved zero-shot performance without any additional training.
Results: Experiments show the method, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes; when no external metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.

Machine learning models---including prominent zero-shot models---are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes---or, in the case of zero-shot prediction, to improve its performance---without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping $\text{argmax}$ with the Fréchet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
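
The drop-in rule itself is tiny. Below is a minimal sketch of replacing argmax with the Fréchet mean over a label metric; the one-dimensional label coordinates are a toy assumption. Note that the rule can select a label the model assigns low (even zero) probability, which is what enables predicting unobserved classes.

```python
import numpy as np

def frechet_mean_predict(probs, dist):
    """probs: (C,) class probabilities; dist: (C, C) label metric."""
    weighted = (dist ** 2) @ probs   # expected squared distance per label
    return int(np.argmin(weighted))

# Toy label space: 4 classes on a line, so classes 0-2 are mutually close
# and class 3 is far away.
coords = np.array([0.0, 1.0, 2.0, 10.0])
dist = np.abs(coords[:, None] - coords[None, :])

probs = np.array([0.30, 0.25, 0.25, 0.20])
print("argmax:", int(probs.argmax()))              # 0
print("Frechet mean:", frechet_mean_predict(probs, dist))  # 2
```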

D-Separation for Causal Self-Explanation
Wei Liu Jun Wang Haozhao Wang Ruixuan Li Zhiying Deng YuanKai Zhang Yang Qiu



Research question: This paper aims to strengthen the interpretability of NLP models by extracting human-intelligible pieces of their input texts.
Motivation: The conventional maximum mutual information (MMI) criterion can be influenced by spurious features that correlate with the causal rationale or the target label; a new criterion, Minimum Conditional Dependence (MCD), is therefore proposed to uncover the causal rationale.
Method: Minimize the dependence between the non-selected parts of the input and the target label conditioned on the selected rationale candidate (using KL-divergence as a simple measure), which compels all causes of the label to be selected.
Results: Experiments show MCD improves the F1 score by up to 13.7% over previous state-of-the-art MMI-based methods.

Rationalization aims to strengthen the interpretability of NLP models by extracting a subset of human-intelligible pieces of their inputting texts. Conventional works generally employ the maximum mutual information (MMI) criterion to find the rationale that is most indicative of the target label. However, this criterion can be influenced by spurious features that correlate with the causal rationale or the target label. Instead of attempting to rectify the issues of the MMI criterion, we propose a novel criterion to uncover the causal rationale, termed the Minimum Conditional Dependence (MCD) criterion, which is grounded on our finding that the non-causal features and the target label are \emph{d-separated} by the causal rationale. By minimizing the dependence between the non-selected parts of the input and the target label conditioned on the selected rationale candidate, all the causes of the label are compelled to be selected. In this study, we employ a simple and practical measure for dependence, specifically the KL-divergence, to validate our proposed MCD criterion. Empirically, we demonstrate that MCD improves the F1 score by up to 13.7% compared to previous state-of-the-art MMI-based methods. Our code is in an anonymous repository: https://anonymous.4open.science/r/MCD-CE88.

Augmentation-Aware Self-Supervision for Data-Efficient GAN Training
Liang Hou Qi Cao Yige Yuan Songtao Zhao Chongyang Ma Siyuan Pan Pengfei Wan Zhongyuan Wang Huawei Shen Xueqi Cheng



Research question: Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
Motivation: Although previously proposed differentiable augmentation improves the data efficiency of GAN training, it ignores the semantic changes in label space caused by data transformation, introducing undesired invariance to augmentation in the discriminator; this may limit the discriminator's representation learning ability and ultimately hurt the generator's generative modeling performance.
Method: A novel augmentation-aware self-supervised discriminator that predicts the augmentation parameters of augmented data, with the prediction targets of real and generated data kept distinct during training; the generator is further encouraged to learn adversarially from this self-supervised discriminator by producing augmentation-predictable real and not fake data.
Results: Comparisons with state-of-the-art methods using BigGAN and StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100, FFHQ, LSUN-Cat, and five low-shot datasets show significant improvements in training data-efficient GANs.

Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting. Previously proposed differentiable augmentation demonstrates improved data efficiency of training GANs. However, the augmentation implicitly introduces undesired invariance to augmentation for the discriminator since it ignores the change of semantics in the label space caused by data transformation, which may limit the representation learning ability of the discriminator and ultimately affect the generative modeling performance of the generator. To mitigate the negative impact of invariance while inheriting the benefits of data augmentation, we propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data. Particularly, the prediction targets of real data and generated data are required to be distinguished since they are different during training. We further encourage the generator to adversarially learn from the self-supervised discriminator by generating augmentation-predictable real and not fake data. This formulation connects the learning objective of the generator and the arithmetic-harmonic mean divergence under certain assumptions. We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100, FFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate significant improvements of our method over SOTA methods in training data-efficient GANs.

Hierarchical Gaussian Mixture based Task Generative Model for Robust Meta-Learning
Yizhou Zhang Jingchao Ni Wei Cheng Zhengzhang Chen Liang Tong Haifeng Chen Yan Liu



Research question: This paper addresses the assumption in meta-learning that training and testing tasks come from the same distribution, and the possibility that novel tasks come from distributions unseen during meta-training.
Motivation: Most existing meta-learning methods overlook the diversity of real-world task sources and the emergence of novel task distributions.
Method: A meta-training framework based on a Hierarchical Gaussian Mixture based Task Generative Model (HTGM), which learns task embeddings, fits the mixture distribution of tasks, and enables density-based scoring of novel tasks.
Results: Extensive experiments on benchmark datasets show the method is effective for both sample classification and novel task detection.

Meta-learning enables quick adaptation of machine learning models to new tasks with limited data. While tasks could come from varying distributions in reality, most of the existing meta-learning methods consider both training and testing tasks as from the same uni-component distribution, overlooking two critical needs of a practical solution: (1) the various sources of tasks may compose a multi-component mixture distribution, and (2) novel tasks may come from a distribution that is unseen during meta-training. In this paper, we demonstrate these two challenges can be solved jointly by modeling the density of task instances. We develop a meta-training framework underlain by a novel Hierarchical Gaussian Mixture based Task Generative Model (HTGM). HTGM extends the widely used empirical process of sampling tasks to a theoretical model, which learns task embeddings, fits the mixture distribution of tasks, and enables density-based scoring of novel tasks. The framework is agnostic to the encoder and scales well with large backbone networks. The model parameters are learned end-to-end by maximum likelihood estimation via an Expectation-Maximization (EM) algorithm. Extensive experiments on benchmark datasets indicate the effectiveness of our method for both sample classification and novel task detection.

Nominality Score Conditioned Time Series Anomaly Detection by Point/Sequential Reconstruction
Chih-Yu Lai Fan-Keng Sun Zhengqi Gao Jeffrey Lang Duane S Boning



Research question: Time series anomaly detection is challenging due to the complexity and variety of patterns that can occur.
Motivation: The main difficulty lies in modeling time-dependent relationships to find contextual anomalies while maintaining detection accuracy for point anomalies.
Method: A framework for unsupervised time series anomaly detection that uses point-based and sequence-based reconstruction models: the point-based model quantifies point anomalies, while the sequence-based model quantifies both point and contextual anomalies.
Results: Under the formulation that an observed time point is a two-stage deviated value from a nominal time point, a nominality score is computed from the ratio of a combined value of the reconstruction errors; integrating the nominality score with the anomaly score yields an induced anomaly score, which is theoretically proven superior to the original anomaly score under certain conditions. Extensive studies on several public datasets show the framework outperforms most state-of-the-art baselines for time series anomaly detection.

Time series anomaly detection is challenging due to the complexity and variety of patterns that can occur. One major difficulty arises from modeling time-dependent relationships to find contextual anomalies while maintaining detection accuracy for point anomalies. In this paper, we propose a framework for unsupervised time series anomaly detection that utilizes point-based and sequence-based reconstruction models. The point-based model attempts to quantify point anomalies, and the sequence-based model attempts to quantify both point and contextual anomalies. Under the formulation that the observed time point is a two-stage deviated value from a nominal time point, we introduce a nominality score calculated from the ratio of a combined value of the reconstruction errors. We derive an induced anomaly score by further integrating the nominality score and anomaly score, then theoretically prove the superiority of the induced anomaly score over the original anomaly score under certain conditions. Extensive studies conducted on several public datasets show that the proposed framework outperforms most state-of-the-art baselines for time series anomaly detection.
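
The scoring pipeline can be illustrated with stand-in reconstruction models. The sketch below is deliberately simplified and does not reproduce the paper's exact formulas: two smoothers play the roles of the point-based and sequence-based models, a nominality score is formed from the ratio of their errors, and an induced score combines it with the anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
x = np.sin(0.1 * np.arange(T)) + 0.05 * rng.normal(size=T)
x[120:125] += 1.5                                   # injected anomaly segment

# Stand-ins for trained models: a short and a long moving-average smoother.
point_recon = np.convolve(x, np.ones(3) / 3, mode="same")   # "point-based"
seq_recon = np.convolve(x, np.ones(15) / 15, mode="same")   # "sequence-based"

err_point = (x - point_recon) ** 2
err_seq = (x - seq_recon) ** 2

anomaly = err_seq                                   # base anomaly score
nominality = (err_point + 1e-8) / (err_seq + 1e-8)  # ratio of reconstruction errors
induced = anomaly / nominality                      # one simple integration
print(int(np.argmax(induced)))                      # likely inside the 120-124 segment
```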

On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection
Sangha Park Jisoo Mok Dahuin Jung Saehyung Lee Sungroh Yoon



Research question: How to successfully detect out-of-distribution (OoD) data to ensure safe deployment of neural networks.
Motivation: Neural networks output overconfident predictions on OoD data, making it difficult to determine the OoD-ness of data from predictions alone; this is one of the main challenges in OoD detection.
Method: A novel textual outlier exposure approach: drawing on recent advances in vision-language pre-training, real or virtual outliers in the image domain are replaced with textual equivalents, and various ways of generating preferable textual outliers are proposed.
Results: Experiments show generated textual outliers achieve competitive performance on large-scale OoD and hard OoD benchmarks; empirical analyses further yield primary criteria for designing advantageous textual outliers: near-distribution, descriptiveness, and inclusion of visual semantics.

Successful detection of Out-of-Distribution (OoD) data is becoming increasingly important to ensure safe deployment of neural networks. One of the main challenges in OoD detection is that neural networks output overconfident predictions on OoD data, making it difficult to determine the OoD-ness of data solely based on their predictions. Outlier exposure addresses this issue by introducing an additional loss that encourages low-confidence predictions on OoD data during training. While outlier exposure has shown promising potential in improving OoD detection performance, all previous studies on outlier exposure have been limited to utilizing visual outliers. Drawing inspiration from the recent advancements in vision-language pre-training, this paper ventures out into the uncharted territory of textual outlier exposure. First, we uncover the benefits of using textual outliers by replacing real or virtual outliers in the image domain with textual equivalents. Then, we propose various ways of generating preferable textual outliers. Our extensive experiments demonstrate that generated textual outliers achieve competitive performance on large-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical analyses of textual outliers to provide primary criteria for designing advantageous textual outliers: near-distribution, descriptiveness, and inclusion of visual semantics.

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Sangwoo Mo Minkyu Kim Kyungmin Lee Jinwoo Shin



Research question: How to improve vision-language models such as CLIP in specialist domains like remote sensing imagery.
Motivation: Existing vision-language models struggle in specialist domains because of the limited number of image-text pairs available for training.
Method: S-CLIP, a semi-supervised learning method that trains with additional unpaired images using two pseudo-labeling strategies: a caption-level pseudo-label formed from a combination of captions of paired images, obtained by solving an optimal transport problem between unpaired and paired images; and a keyword-level pseudo-label trained through partial label learning, which assumes a candidate set of labels for supervision instead of the exact one.
Results: Experiments show S-CLIP significantly improves model performance in specialist domains with only a few image-text pairs, such as remote sensing, fashion, scientific figures, and comics; on the remote sensing benchmark, it improves zero-shot classification by 10% and image-text retrieval by 4%.

Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and adapting to such domains is challenging due to the limited number of image-text pairs available for training. To address this, we propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images. S-CLIP employs two pseudo-labeling strategies specifically designed for contrastive learning and the language modality. The caption-level pseudo-label is given by a combination of captions of paired images, obtained by solving an optimal transport problem between unpaired and paired images. The keyword-level pseudo-label is given by a keyword in the caption of the nearest paired image, trained through partial label learning that assumes a candidate set of labels for supervision instead of the exact one. By combining these objectives, S-CLIP significantly enhances the training of CLIP using only a few image-text pairs, as demonstrated in various specialist domains, including remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark, matching the performance of supervised CLIP while using three times fewer image-text pairs.

Towards Semi-Structured Automatic ICD Coding via Tree-based Contrastive Learning
Chang Lu Chandan K. Reddy Ping Wang Yue Ning



Research question: How to perform automatic International Classification of Diseases (ICD) coding given limited, privacy-constrained medical data and the high variability of clinical notes caused by different writing habits of medical professionals and diverse pathological features of patients.
Motivation: Despite state-of-the-art natural language processing techniques, existing ICD coding models still face challenges from limited data and the high variability of clinical notes.
Method: Investigate the semi-structured nature of clinical notes and propose an automatic algorithm to segment them into sections; to address the variability issues of existing ICD coding models with limited data, introduce a contrastive pre-training approach on sections using a soft multi-label similarity metric based on tree edit distance; and design a masked section training strategy that enables ICD coding models to locate sections related to ICD codes.
Results: Extensive experiments show the proposed training strategies effectively enhance the performance of existing ICD coding methods.

Automatic coding of International Classification of Diseases (ICD) is a multi-label text categorization task that involves extracting disease or procedure codes from clinical notes. Despite the application of state-of-the-art natural language processing (NLP) techniques, there are still challenges including limited availability of data due to privacy constraints and the high variability of clinical notes caused by different writing habits of medical professionals and various pathological features of patients. In this work, we investigate the semi-structured nature of clinical notes and propose an automatic algorithm to segment them into sections. To address the variability issues in existing ICD coding models with limited data, we introduce a contrastive pre-training approach on sections using a soft multi-label similarity metric based on tree edit distance. Additionally, we design a masked section training strategy to enable ICD coding models to locate sections related to ICD codes. Extensive experimental results demonstrate that our proposed training strategies effectively enhance the performance of existing ICD coding methods.

A Novel Approach for Effective Multi-View Clustering with Information-Theoretic Perspective
Chenhang Cui Yazhou Ren Jingyu Pu Jiawei Li Xiaorong Pu Tianyi Wu Yutao Shi Lifang He



Research question: Existing multi-view clustering methods focus mainly on acquiring consistent information and often neglect the redundancy across multiple views.
Motivation: To address this issue, this paper proposes a new approach called Sufficient Multi-View Clustering (SUMVC).
Method: The method has two parts: first, a simple and reliable multi-view clustering method, SCMVC (simple consistent multi-view clustering), which uses variational analysis to generate consistent information; second, a sufficient representation lower bound that enhances consistent information and reduces unnecessary information among views.
Results: A theoretical analysis based on the Bayes Error Rate and experiments on multiple multi-view datasets demonstrate the superior performance of SUMVC.

Multi-view clustering (MVC) is a popular technique for improving clustering performance using various data sources. However, existing methods primarily focus on acquiring consistent information while often neglecting the issue of redundancy across multiple views. This study presents a new approach called Sufficient Multi-View Clustering (SUMVC) that examines the multi-view clustering framework from an information-theoretic standpoint. Our proposed method consists of two parts. Firstly, we develop a simple and reliable multi-view clustering method SCMVC (simple consistent multi-view clustering) that employs variational analysis to generate consistent information. Secondly, we propose a sufficient representation lower bound to enhance consistent information and minimise unnecessary information among views. The proposed SUMVC method offers a promising solution to the problem of multi-view clustering and provides a new perspective for analyzing multi-view data. To verify the effectiveness of our model, we conducted a theoretical analysis based on the Bayes Error Rate, and experiments on multiple multi-view datasets demonstrate the superior performance of SUMVC.

Contrast Everything: A Hierarchical Contrastive Framework for Medical Time-Series
Yihe Wang Yu Han Haishuai Wang Xiang Zhang



Research question: Existing contrastive learning methods focus primarily on a single data level and fail to fully exploit the intricate nature of medical time series.
Motivation: To address this issue, the paper proposes COMET, an innovative hierarchical framework that leverages data consistencies at all inherent levels of medical time series.
Method: The carefully designed model systematically captures data consistency at four potential levels: observation, sample, trial, and patient. By developing contrastive losses at multiple levels, it learns effective representations that preserve comprehensive data consistency in a self-supervised manner, maximizing information utilization.
Results: In the challenging patient-independent setting, COMET is compared against six baselines on three diverse datasets, including ECG signals for myocardial infarction and EEG signals for Alzheimer's and Parkinson's diseases. COMET outperforms all baselines on all datasets, particularly in setups with 10% and 1% labeled-data fractions, underscoring the framework's impact on contrastive representation learning for medical time series.

Contrastive representation learning is crucial in medical time series analysis as it alleviates dependency on labor-intensive, domain-specific, and scarce expert annotations. However, existing contrastive learning methods primarily focus on one single data level, which fails to fully exploit the intricate nature of medical time series. To address this issue, we present COMET, an innovative hierarchical framework that leverages data consistencies at all inherent levels in medical time series. Our meticulously designed model systematically captures data consistency from four potential levels: observation, sample, trial, and patient levels. By developing contrastive loss at multiple levels, we can learn effective representations that preserve comprehensive data consistency, maximizing information utilization in a self-supervised manner. We conduct experiments in the challenging patient-independent setting. We compare COMET against six baselines using three diverse datasets, which include ECG signals for myocardial infarction and EEG signals for Alzheimer’s and Parkinson’s diseases. The results demonstrate that COMET consistently outperforms all baselines, particularly in setups with 10% and 1% labeled data fractions across all datasets. These results underscore the significant impact of our framework in advancing contrastive representation learning techniques for medical time series. The source code is available at https://github.com/DL4mHealth/COMET.

TabMT: Generating tabular data with masked transformers
Manbir S Gulati Paul F Roysdon



Research question: Exploring transformer-based models for synthetic data generation across diverse application domains.
Motivation: Autoregressive and masked transformers are highly effective as generative models and classifiers in NLP, and also perform strongly in other domains such as vision.
Method: TabMT, a novel masked transformer design for generating synthetic tabular data, which effectively addresses the unique challenges posed by heterogeneous data fields and natively handles missing data.
Results: Using improved masking techniques for generation, TabMT achieves state-of-the-art performance on tabular datasets ranging from extremely small to extremely large; in privacy-focused applications, it generates high-quality data with superior privacy trade-offs.

Autoregressive and Masked Transformers are incredibly effective as generative models and classifiers. While these models are most prevalent in NLP, they also exhibit strong performance in other domains, such as vision. This work contributes to the exploration of transformer-based models in synthetic data generation for diverse application domains. In this paper, we present TabMT, a novel Masked Transformer design for generating synthetic tabular data. TabMT effectively addresses the unique challenges posed by heterogeneous data fields and is natively able to handle missing data. Our design leverages improved masking techniques to allow for generation and demonstrates state-of-the-art performance from extremely small to extremely large tabular datasets. We evaluate TabMT for privacy-focused applications and find that it is able to generate high quality data with superior privacy tradeoffs.

Tools for Verifying Neural Models' Training Data
Dami Choi Yonadav G Shavit David Duvenaud



Research question: How to verify the training-data provenance of large neural network models in order to assess their capabilities and risks.
Motivation: Consumers and regulators need to be able to verify the provenance of large neural models to evaluate their capabilities and risks.
Method: Introduce the concept of a "Proof-of-Training-Data": any protocol that allows a model trainer to convince a Verifier of the training data that produced a set of model weights. Such protocols can verify the amount and kind of data used to train the model, including whether it was trained on specific harmful or beneficial data sources.
Results: Experiments show the verification procedures can catch a wide variety of attacks, including all known attacks from the Proof-of-Learning literature.

It is important that consumers and regulators can verify the provenance of large neural models to evaluate their capabilities and risks. We introduce the concept of a "Proof-of-Training-Data": any protocol that allows a model trainer to convince a Verifier of the training data that produced a set of model weights. Such protocols could verify the amount and kind of data and compute used to train the model, including whether it was trained on specific harmful or beneficial data sources. We explore efficient verification strategies for Proof-of-Training-Data that are compatible with most current large-model training procedures. These include a method for the model-trainer to verifiably pre-commit to a random seed used in training, and a method that exploits models' tendency to temporarily overfit to training data in order to detect whether a given data-point was included in training. We show experimentally that our verification procedures can catch a wide variety of attacks, including all known attacks from the Proof-of-Learning literature.
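
One ingredient named in the abstract, verifiably pre-committing to the training seed, reduces to a standard hash commitment. The sketch below is a minimal commit-reveal illustration, not the paper's protocol; real schemes additionally bind the commitment to the training transcript and use signatures.

```python
import hashlib
import os

def commit(seed: int, salt: bytes) -> str:
    """Binding, hiding commitment to a seed via a salted SHA-256 hash."""
    return hashlib.sha256(salt + seed.to_bytes(8, "big")).hexdigest()

# Before training: the trainer publishes the commitment, keeping seed + salt.
seed, salt = 123456789, os.urandom(16)
commitment = commit(seed, salt)
print("published commitment:", commitment)

# After training: the trainer reveals (seed, salt); the Verifier checks the
# hash and can then re-derive any seed-dependent randomness (initialization,
# data ordering) when auditing the training run.
assert commit(seed, salt) == commitment
print("seed verified:", seed)
```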

Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training
Rie Johnson Tong Zhang



Research question: How to reduce the generalization gap, i.e., the difference between a deep learning model's performance on training data and on unseen data.
Motivation: Because deep neural networks are highly expressive, finding solutions with a small generalization gap is important.
Method: A theoretical analysis that bounds the generalization gap in terms of the inconsistency and instability of model outputs, both of which can be estimated on unlabeled data, together with an empirical study validating the theory.
Results: Inconsistency is found to be a more reliable predictor of the generalization gap than the sharpness of the loss landscape, and algorithmically reducing output inconsistency significantly improves performance; the results also provide a theoretical basis for existing methods such as co-distillation and ensembles.

As deep neural networks are highly expressive, it is important to find solutions with small generalization gap (the difference between the performance on the training data and unseen data). Focusing on the stochastic nature of training, we first present a theoretical analysis in which the bound of generalization gap depends on what we call inconsistency and instability of model outputs, which can be estimated on unlabeled data. Our empirical study based on this analysis shows that instability and inconsistency are strongly predictive of generalization gap in various settings. In particular, our finding indicates that inconsistency is a more reliable indicator of generalization gap than the sharpness of the loss landscape. Furthermore, we show that algorithmic reduction of inconsistency leads to superior performance. The results also provide a theoretical basis for existing methods such as co-distillation and ensemble.
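
Because inconsistency is defined on model outputs, it can be estimated from unlabeled data alone. The sketch below is an illustrative estimator, not the paper's exact definition: the average KL divergence between the predictive distributions of two runs trained with different seeds, simulated here with perturbed logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins for two independently trained models' logits on unlabeled data.
logits_run1 = rng.normal(size=(1000, 10))
logits_run2 = logits_run1 + 0.5 * rng.normal(size=(1000, 10))

p, q = softmax(logits_run1), softmax(logits_run2)
# Mean KL(p || q) over the unlabeled set as an inconsistency estimate.
inconsistency = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1))
print(f"estimated inconsistency: {inconsistency:.4f}")
```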

GNNEvaluator: Evaluating GNN Performance On Unseen Graphs Without Labels
Xin Zheng Miao Zhang Chunyang Chen Soheila Molaei Chuan Zhou Shirui Pan



Research question: Evaluating the performance of graph neural networks (GNNs) is essential for practical GNN model deployment and serving, because deployed GNNs face significant performance uncertainty when inferring on unseen, unlabeled test graphs due to mismatched training-test graph distributions.
Motivation: This paper studies a new problem, GNN model evaluation: assessing a GNN trained on labeled, observed graphs by precisely estimating its performance (e.g., node classification accuracy) on unseen graphs without labels.
Method: A two-stage GNN model evaluation framework comprising (1) DiscGraph set construction and (2) GNNEvaluator training and inference. The DiscGraph set captures wide-ranging and diverse graph-distribution discrepancies through a discrepancy measurement function over the GNN's latent node embeddings and node class predictions; under effective training supervision from the DiscGraph set, GNNEvaluator learns to precisely estimate the node classification accuracy of the model under evaluation and performs accurate inference for performance assessment.
Results: Extensive experiments on real-world unseen, unlabeled test graphs demonstrate the effectiveness of the proposed method for GNN model evaluation.

Evaluating the performance of graph neural networks (GNNs) is an essential task for practical GNN model deployment and serving, as deployed GNNs face significant performance uncertainty when inferring on unseen and unlabeled test graphs, due to mismatched training-test graph distributions. In this paper, we study a *new* problem, **GNN model evaluation**, that aims to assess the performance of a specific GNN model trained on labeled and observed graphs, by precisely estimating its performance (e.g., node classification accuracy) on unseen graphs without labels. Concretely, we propose a two-stage GNN model evaluation framework, including (1) DiscGraph set construction and (2) GNNEvaluator training and inference. The DiscGraph set captures wide-range and diverse graph data distribution discrepancies through a discrepancy measurement function, which exploits the GNN outputs of latent node embeddings and node class predictions. Under the effective training supervision from the DiscGraph set, GNNEvaluator learns to precisely estimate node classification accuracy of the to-be-evaluated GNN model and makes an accurate inference for evaluating GNN model performance. Extensive experiments on real-world unseen and unlabeled test graphs demonstrate the effectiveness of our proposed method for GNN model evaluation.

Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
Dami Choi Derrick Xin Hamid Dadkhahi Justin Gilmer Ankush Garg Orhan Firat Chih-Kuan Yeh Andrew M. Dai Behrooz Ghorbani



Research question: The optimization dynamics of multi-task learning, particularly for collections of tasks with significant data imbalance.
Motivation: For task collections with significant data imbalance, a simple yet effective recipe is proposed: pre-train on high-resource tasks, then fine-tune on a mixture of high- and low-resource tasks.
Method: Optimize multi-task learning by pre-training on high-resource tasks followed by fine-tuning on the high/low-resource mixture.
Results: Experiments show consistent improvements over the performance trade-off profile of standard static weighting, with empirical gains in neural machine translation (NMT) and multilingual language modeling.

In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's benefits showing that it achieves consistent improvements relative to the performance trade-off profile of standard static weighting. We analyze under what data regimes this method is applicable and show its improvements empirically in neural machine translation (NMT) and multi-lingual language modeling.

Invariant Anomaly Detection under Distribution Shifts: A Causal Perspective
João B. S. Carvalho Mengtao Zhang Robin Geyer Carlos Cotrini Joachim M. Buhmann



Research question: Use tools from causal inference to make anomaly detection models more robust to different kinds of distribution shifts.
Motivation: Under distribution shift, the assumption that training and test samples are drawn from the same distribution breaks down, which challenges anomaly detection models.
Method: First elucidate the statistical property necessary to ensure invariant representations, which is critical for robust AD under domain and covariate shifts; then derive from this property a regularization term which, when minimized, achieves partial distribution invariance across environments.
Results: Extensive experimental evaluation on synthetic and real-world tasks covering six different AD methods shows significantly improved out-of-distribution performance; under both covariate and domain shift, models regularized with the proposed term show markedly increased robustness.

Anomaly detection (AD) is the machine learning task of identifying highly discrepant abnormal samples by solely relying on the consistency of the normal training samples. Under the constraints of a distribution shift, the assumption that training samples and test samples are drawn from the same distribution breaks down. In this work, by leveraging tools from causal inference we attempt to increase the resilience of anomaly detection models to different kinds of distribution shifts. We begin by elucidating a simple yet necessary statistical property that ensures invariant representations, which is critical for robust AD under both domain and covariate shifts. From this property, we derive a regularization term which, when minimized, leads to partial distribution invariance across environments. Through extensive experimental evaluation on both synthetic and real-world tasks, covering a range of six different AD methods, we demonstrate significant improvements in out-of-distribution performance. Under both covariate and domain shift, models regularized with our proposed term show markedly increased robustness. Code is available at: https://github.com/JoaoCarv/invariant-anomaly-detection

A Unified Detection Framework for Inference-Stage Backdoor Defenses
Xun Xian Ganghua Wang Jayanth Srinivasa Ashish Kundu Xuan Bi Mingyi Hong Jie Ding



Research question: Develop a unified inference-stage detection framework to defend against backdoor attacks.
Motivation: Backdoor attacks insert poisoned samples during training so that the model contains a hidden backdoor that triggers specific behaviors without affecting performance on normal samples; such attacks are hard to detect because a backdoored model appears normal until activated by the trigger, making it particularly stealthy.
Method: Design a detection framework with provable guarantees on the false positive rate, i.e., the probability of misclassifying a clean sample.
Results: Extensive evaluation on 14 different backdoor attacks over computer vision and NLP benchmark datasets; the experimental findings align with the theory and significantly surpass state-of-the-art defenses, e.g., up to a 300% improvement in detection power against advanced adaptive backdoor attacks.

Backdoor attacks involve inserting poisoned samples during training, resulting in a model containing a hidden backdoor that can trigger specific behaviors without impacting performance on normal samples. These attacks are challenging to detect, as the backdoored model appears normal until activated by the backdoor trigger, rendering them particularly stealthy. In this study, we devise a unified inference-stage detection framework to defend against backdoor attacks. We first rigorously formulate the inference-stage backdoor detection problem, encompassing various existing methods, and discuss several challenges and limitations. We then propose a framework with provable guarantees on the false positive rate or the probability of misclassifying a clean sample. Further, we derive the most powerful detection rule to maximize the detection power, namely the rate of accurately identifying a backdoor sample, given a false positive rate under classical learning scenarios. Based on the theoretically optimal detection rule, we suggest a practical and effective approach for real-world applications based on the latent representations of backdoored deep nets. We extensively evaluate our method on 14 different backdoor attacks using Computer Vision (CV) and Natural Language Processing (NLP) benchmark datasets. The experimental findings align with our theoretical results. We significantly surpass the state-of-the-art methods, e.g., up to 300\% improvement on the detection power as evaluated by AUCROC, over the state-of-the-art defense against advanced adaptive backdoor attacks.
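
One standard way to obtain a provable false-positive-rate guarantee, sketched here as an illustration rather than the paper's specific rule, is to calibrate the detection threshold as an empirical quantile of the anomaly score on held-out clean samples:

```python
import numpy as np

def calibrate_threshold(clean_scores, alpha=0.05):
    """Pick a threshold so that at most an alpha fraction of clean samples is
    flagged: with n clean calibration scores, the ceil((n+1)(1-alpha))-th order
    statistic yields a distribution-free false-positive-rate guarantee."""
    n = len(clean_scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # guard tiny calibration sets
    return np.sort(clean_scores)[k - 1]

def detect(scores, threshold):
    """Flag inputs whose score exceeds the calibrated threshold."""
    return scores > threshold

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, 1000)    # hypothetical scores of clean samples
suspect = rng.normal(3.0, 1.0, 50)    # hypothetical scores of backdoored samples
tau = calibrate_threshold(clean, alpha=0.05)
print(f"threshold={tau:.2f}, flagged={detect(suspect, tau).mean():.0%}")
```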

Passive learning of active causal strategies in agents and language models
Andrew Kyle Lampinen Stephanie C.Y. Chan Ishita Dasgupta Andrew Joo Hun Nam Jane X Wang



Research question: What can be learned about causality and experimentation from passive data?
Motivation: Despite the inherent limitations of passive learning, recent work shows that passively trained language models succeed in interactive domains such as tool use.
Method: Agents trained via imitation on expert data can, at test time, infer and use causal links never present in the training data, and can generalize experimentation strategies to novel variable sets never observed in training.
Results: Even in complex environments with high-dimensional observations, natural-language explanations support generalizing strategies for causal intervention and exploitation from passive data; moreover, language models trained only on passive next-word prediction can generalize causal intervention strategies from a few-shot prompt containing explanations and reasoning. These results highlight the surprising power of passively learning active causal strategies and have implications for understanding the behaviors and capabilities of language models.

What can be learned about causality and experimentation from passive data? This question is salient given recent successes of passively-trained language models in interactive domains such as tool use. Passive learning is inherently limited. However, we show that purely passive learning can in fact allow an agent to learn generalizable strategies for determining and using causal structures, as long as the agent can intervene at test time. We formally illustrate that learning a strategy of first experimenting, then seeking goals, can allow generalization from passive learning in principle. We then show empirically that agents trained via imitation on expert data can indeed generalize at test time to infer and use causal links which are never present in the training data; these agents can also generalize experimentation strategies to novel variable sets never observed in training. We then show that strategies for causal intervention and exploitation can be generalized from passive data even in a more complex environment with high-dimensional observations, with the support of natural language explanations. Explanations can even allow passive learners to generalize out-of-distribution from perfectly-confounded training data. Finally, we show that language models, trained only on passive next-word prediction, can generalize causal intervention strategies from a few-shot prompt containing explanations and reasoning. These results highlight the surprising power of passive learning of active causal strategies, and have implications for understanding the behaviors and capabilities of language models.

Revisiting Scalarization in Multi-Task Learning: A Theoretical Perspective
Yuzheng Hu Ruicheng Xian Qilong Wu Qiuling Fan Lang Yin Han Zhao



Research question: From a theoretical perspective, whether scalarization has fundamental limitations in multi-task learning, especially for finding Pareto-optimal solutions.
Motivation: Specialized multi-task optimizers (SMTOs) have recently attracted attention for treating MTL as multi-objective optimization, but whether they hold an advantage over scalarization remains contested.
Method: Theoretically study linear multi-task learning models and analyze whether scalarization is capable of fully exploring the Pareto front.
Results: In contrast to recent claims of empirical advantages for scalarization, scalarization is inherently incapable of full exploration, particularly for Pareto-optimal solutions that strike balanced trade-offs among tasks. Experiments corroborate the theory and reveal the potential of SMTOs for finding balanced solutions that scalarization cannot reach.

Linear scalarization, i.e., combining all loss functions by a weighted sum, has been the default choice in the literature of multi-task learning (MTL) since its inception. In recent years, there is a surge of interest in developing Specialized Multi-Task Optimizers (SMTOs) that treat MTL as a multi-objective optimization problem. However, it remains open whether there is a fundamental advantage of SMTOs over scalarization. In fact, heated debates exist in the community comparing these two types of algorithms, mostly from an empirical perspective. To approach the above question, in this paper, we revisit scalarization from a theoretical perspective. We focus on linear MTL models and study whether scalarization is capable of fully exploring the Pareto front. Our findings reveal that, in contrast to recent works that claimed empirical advantages of scalarization, scalarization is inherently incapable of full exploration, especially for those Pareto optimal solutions that strike the balanced trade-offs between multiple tasks. More concretely, when the model is under-parametrized, we reveal a multi-surface structure of the feasible region and identify necessary and sufficient conditions for full exploration. This leads to the conclusion that scalarization is in general incapable of tracing out the Pareto front. Our theoretical results partially answer the open questions in Xin et al. (2021), and provide a more intuitive explanation on why scalarization fails beyond non-convexity. We additionally perform experiments on a real-world dataset using both scalarization and state-of-the-art SMTOs. The experimental results not only corroborate our theoretical findings, but also unveil the potential of SMTOs in finding balanced solutions, which cannot be achieved by scalarization.
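
Linear scalarization itself is a one-liner; sweeping the weight simplex traces exactly the trade-offs it can reach, which the paper proves need not cover the whole Pareto front. A minimal runnable sketch with two synthetic regression tasks and a shared linear model (all data and hyperparameters here are illustrative):

```python
import torch

torch.manual_seed(0)
X1, y1 = torch.randn(128, 5), torch.randn(128, 1)  # task 1 data
X2, y2 = torch.randn(128, 5), torch.randn(128, 1)  # task 2 data

def train_scalarized(w1, w2, steps=200, lr=0.1):
    """Minimize the fixed weighted sum w1*L1 + w2*L2 with a shared linear model."""
    model = torch.nn.Linear(5, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = w1 * torch.nn.functional.mse_loss(model(X1), y1) \
             + w2 * torch.nn.functional.mse_loss(model(X2), y2)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return (torch.nn.functional.mse_loss(model(X1), y1).item(),
                torch.nn.functional.mse_loss(model(X2), y2).item())

# Each weight setting yields one point in (L1, L2) space; the swept set is the
# region scalarization can explore.
for w1 in (0.1, 0.5, 0.9):
    print(w1, train_scalarized(w1, 1.0 - w1))
```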

TaskMet: Task-driven Metric Learning for Model Learning
Dishank Bansal Ricky T. Q. Chen Mustafa Mukadam Brandon Amos



Research question: How can a deep model better serve a downstream task while retaining its predictive performance?
Motivation: Models trained solely for prediction accuracy may underperform on downstream tasks; we propose using the task loss to learn a parameterized loss function for training the model.
Method: Validate the approach in two main settings: (1) decision-focused model learning involving portfolio optimization and budget allocation, and (2) reinforcement learning in noisy environments with distracting states.
Results: The approach performs well in both settings, demonstrating its effectiveness.

Deep learning models are often used with some downstream task. Models solely trained to achieve accurate predictions may struggle to perform well on the desired downstream tasks. We propose using the task loss to learn a metric which parameterizes a loss to train the model. This approach does not alter the optimal prediction model itself, but rather changes the model learning to emphasize the information important for the downstream task. This enables us to achieve the best of both worlds: a prediction model trained in the original prediction space while also being valuable for the desired downstream task. We validate our approach through experiments conducted in two main settings: 1) decision-focused model learning scenarios involving portfolio optimization and budget allocation, and 2) reinforcement learning in noisy environments with distracting states.
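
One concrete instance of "a metric which parameterizes a loss" is a Mahalanobis-style prediction loss with a learnable positive semi-definite matrix; in TaskMet-style training this metric would be meta-learned against the downstream task loss. A minimal sketch of just the inner, metric-weighted fit (this is my simplification, not the authors' exact formulation):

```python
import torch

class MetricLoss(torch.nn.Module):
    """Prediction loss (y_hat - y)^T M (y_hat - y) with learnable M = L L^T,
    which is PSD by construction. Emphasizing some output directions over
    others lets the prediction model focus on task-relevant information."""
    def __init__(self, dim):
        super().__init__()
        self.L = torch.nn.Parameter(torch.eye(dim))  # Cholesky-like factor

    def forward(self, y_hat, y):
        M = self.L @ self.L.T
        r = y_hat - y
        return torch.einsum("bi,ij,bj->b", r, M, r).mean()

loss_fn = MetricLoss(dim=3)
print(loss_fn(torch.randn(8, 3), torch.randn(8, 3)))
```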

Cal-DETR: Calibrated Detection Transformer
Muhammad Akhtar Munir Salman Khan Muhammad Haris Khan Mohsen Ali Fahad Khan



Research question: Deep neural networks deliver strong predictive performance on computer vision tasks, but their overconfident predictions limit adoption in many safety-critical applications.
Motivation: Recent calibration efforts focus almost exclusively on classification; surprisingly little attention has been paid to calibrating modern DNN-based object detectors, especially detection transformers.
Method: Propose a calibration mechanism for detection transformers (Cal-DETR), specifically Deformable-DETR, UP-DETR, and DINO, following the train-time calibration route with three contributions: a simple yet effective approach to quantify uncertainty in transformer-based object detectors; an uncertainty-guided logit modulation mechanism that uses the uncertainty to modulate class logits; and a logit mixing approach that acts as a regularizer alongside detection-specific losses and complements the logit modulation to further improve calibration.
Results: Extensive experiments across three in-domain and four out-domain scenarios confirm the effectiveness of Cal-DETR against competing train-time methods for calibrating in-domain and out-domain detections, while maintaining or even improving detection performance. Codebase and pre-trained models: \url{https://github.com/akhtarvision/cal-detr}.

Albeit revealing impressive predictive performance for several computer vision tasks, deep neural networks (DNNs) are prone to making overconfident predictions. This limits the adoption and wider utilization of DNNs in many safety-critical applications. There have been recent efforts toward calibrating DNNs, however, almost all of them focus on the classification task. Surprisingly, very little attention has been devoted to calibrating modern DNN-based object detectors, especially detection transformers, which have recently demonstrated promising detection performance and are influential in many decision-making systems. In this work, we address the problem by proposing a mechanism for calibrated detection transformers (Cal-DETR), particularly for Deformable-DETR, UP-DETR, and DINO. We pursue the train-time calibration route and make the following contributions. First, we propose a simple yet effective approach for quantifying uncertainty in transformer-based object detectors. Second, we develop an uncertainty-guided logit modulation mechanism that leverages the uncertainty to modulate the class logits. Third, we develop a logit mixing approach that acts as a regularizer with detection-specific losses and is also complementary to the uncertainty-guided logit modulation technique to further improve the calibration performance. Lastly, we conduct extensive experiments across three in-domain and four out-domain scenarios. Results corroborate the effectiveness of Cal-DETR against the competing train-time methods in calibrating both in-domain and out-domain detections while maintaining or even improving the detection performance. Our codebase and pre-trained models can be accessed at \url{https://github.com/akhtarvision/cal-detr}.

Active Learning for Semantic Segmentation with Multi-class Label Query
Sehyun Hwang Sohyun Lee Hoyoung Kim Minhyeon Oh Jungseul Ok Suha Kwak



Research question: Propose a new active learning method for semantic segmentation.
Motivation: Existing labeling schemes are inefficient in annotation time, so a new strategy queries informative local image regions to annotate more efficiently.
Method: Design an annotation strategy that asks for a multi-class label on each local image region, then resolves the resulting class ambiguity with a two-stage training process.
Results: The method outperforms previous work on Cityscapes and PASCAL VOC 2012 while reducing annotation cost.

This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions ($\textit{e.g.}$, superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training as it assigns partial labels ($\textit{i.e.}$, a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperforms previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost. Our code and results are available at [https://github.com/sehyun03/MulActSeg](https://github.com/sehyun03/MulActSeg).
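
The class-ambiguity issue arises because every pixel in a queried region inherits the region's multi-hot candidate set. A common partial-label objective, which I sketch here as a plausible form of the paper's first-stage loss rather than its exact definition, maximizes the probability mass a pixel assigns to its candidates:

```python
import torch

def partial_label_loss(logits, candidate_mask, eps=1e-8):
    """Partial-label objective: each pixel carries a set of candidate classes
    (the multi-hot answer for its region); minimize -log of the total
    probability assigned to the candidate classes.

    logits:         (num_pixels, num_classes)
    candidate_mask: (num_pixels, num_classes), multi-hot {0, 1}
    """
    probs = torch.softmax(logits, dim=-1)
    p_candidates = (probs * candidate_mask).sum(dim=-1)
    return -(p_candidates + eps).log().mean()

logits = torch.randn(4, 5, requires_grad=True)
mask = torch.tensor([[1, 0, 1, 0, 0],
                     [0, 1, 0, 0, 0],
                     [1, 1, 0, 0, 1],
                     [0, 0, 0, 1, 0]], dtype=torch.float)
partial_label_loss(logits, mask).backward()
```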

Language Semantic Graph Guided Data-Efficient Learning
Wenxuan Ma Shuang Li Lincan Cai Jingxuan Kang



Research question: How to exploit the semantic information in labels more effectively to improve data efficiency?
Motivation: Although existing deep models already perform well with few labeled examples, the additional knowledge contained in labels remains underused.
Method: Propose a data-efficient learning method that constructs a Language Semantic Graph (LSG) from labels expressed as natural-language descriptions, trains an auxiliary graph neural network on this graph to extract high-level semantic relations, and uses these relations to guide training of the primary model.
Results: Across image, video, and audio modalities, the method performs strongly in both transfer learning and semi-supervised learning scenarios, and it also speeds up training.

Developing generalizable models that can effectively learn from limited data and with minimal reliance on human supervision is a significant objective within the machine learning community, particularly in the era of deep neural networks. Therefore, to achieve data-efficient learning, researchers typically explore approaches that can leverage more related or unlabeled data without necessitating additional manual labeling efforts, such as Semi-Supervised Learning (SSL), Transfer Learning (TL), and Data Augmentation (DA). SSL leverages unlabeled data in the training process, while TL enables the transfer of expertise from related data distributions. DA broadens the dataset by synthesizing new data from existing examples. However, the significance of additional knowledge contained within labels has been largely overlooked in research. In this paper, we propose a novel perspective on data efficiency that involves exploiting the semantic information contained in the labels of the available data. Specifically, we introduce a Language Semantic Graph (LSG) which is constructed from labels manifest as natural language descriptions. Upon this graph, an auxiliary graph neural network is trained to extract high-level semantic relations and then used to guide the training of the primary model, enabling more adequate utilization of label knowledge. Across image, video, and audio modalities, we utilize the LSG method in both TL and SSL scenarios and illustrate its versatility in significantly enhancing performance compared to other data-efficient learning approaches. Additionally, our in-depth analysis shows that the LSG method also expedites the training process.

ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion
Yingjun Du Zehao Xiao Shengcai Liao Cees G. M. Snoek



Research question: Prototype-based meta-learning techniques for few-shot learning challenges.
Motivation: Estimating a deterministic prototype from a limited number of examples with a simple average function is a fragile process.
Method: Propose ProtoDiff, a framework that leverages a task-guided diffusion model during meta-training to gradually generate prototypes, providing effective class representations.
Results: ProtoDiff achieves new state-of-the-art performance on few-shot classification, demonstrating its ability to capture the underlying prototype distribution and improve generalization.

Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, so that the overfitted prototypes for individual tasks can be obtained accurately. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. In addition, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.

Transfer learning for atomistic simulations using GNNs and kernel mean embeddings
John Isak Texas Falk Luigi Bonati Pietro Novelli Michele Parrinello massimiliano pontil



Research question: How to learn interatomic potentials via a transfer learning algorithm built on graph neural networks (GNNs) and pre-trained models.
Motivation: Accurate models require large training datasets, while generating reference calculations is computationally demanding. To bypass this, we propose a transfer learning approach that combines GNN representations of chemical environments with kernel mean embeddings.
Method: Extract a feature map from GNNs pre-trained on the OC20 dataset and use it to learn potential energy surfaces from system-specific datasets of catalytic processes; incorporating chemical-species information into the kernel further improves performance and interpretability.
Results: Tested on a series of realistic datasets of increasing complexity, the approach shows excellent generalization and transferability, improving on methods that rely on GNNs or ridge regression alone, as well as on similar fine-tuning approaches.

Interatomic potentials learned using machine learning methods have been successfully applied to atomistic simulations. However, accurate models require large training datasets, while generating reference calculations is computationally demanding. To bypass this difficulty, we propose a transfer learning algorithm that leverages the ability of graph neural networks (GNNs) to represent chemical environments together with kernel mean embeddings. We extract a feature map from GNNs pre-trained on the OC20 dataset and use it to learn the potential energy surface from system-specific datasets of catalytic processes. Our method is further enhanced by incorporating into the kernel the chemical species information, resulting in improved performance and interpretability. We test our approach on a series of realistic datasets of increasing complexity, showing excellent generalization and transferability performance, and improving on methods that rely on GNNs or ridge regression alone, as well as similar fine-tuning approaches.

A Metadata-Driven Approach to Understand Graph Neural Networks
Ting Wei Li Qiaozhu Mei Jiaqi Ma



Research question: Use metadata analysis to study the sensitivity of graph neural networks (GNNs) to properties of graph datasets.
Motivation: Existing work on understanding GNN limitations mostly takes a model-driven approach, relying on heuristics and domain knowledge from network science or graph theory to model GNN behavior, which is time-consuming and highly subjective. This study proposes a metadata-driven approach to analyze GNN sensitivity to graph data properties instead.
Method: Perform a multivariate sparse regression analysis on metadata from benchmarking GNN performance across diverse datasets, yielding a set of salient data properties; then focus on one identified property, the degree distribution, and study how it influences GNN performance through theoretical analysis and controlled experiments.
Results: The theory shows that datasets with more balanced degree distributions exhibit better linear separability of node representations and hence better GNN performance; controlled experiments on synthetic datasets with varying degree distributions agree with the theory. Together, both validate the metadata-driven approach for identifying critical data properties of GNNs.

Graph Neural Networks (GNNs) have achieved remarkable success in various applications, but their performance can be sensitive to specific data properties of the graph datasets they operate on. Current literature on understanding the limitations of GNNs has primarily employed a \emph{model-driven} approach that leverages heuristics and domain knowledge from network science or graph theory to model the GNN behaviors, which is time-consuming and highly subjective. In this work, we propose a \emph{metadata-driven} approach to analyze the sensitivity of GNNs to graph data properties, motivated by the increasing availability of graph learning benchmarks. We perform a multivariate sparse regression analysis on the metadata derived from benchmarking GNN performance across diverse datasets, yielding a set of salient data properties. To validate the effectiveness of our data-driven approach, we focus on one identified data property, the degree distribution, and investigate how this property influences GNN performance through theoretical analysis and controlled experiments. Our theoretical findings reveal that datasets with more balanced degree distribution exhibit better linear separability of node representations, thus leading to better GNN performance. We also conduct controlled experiments using synthetic datasets with varying degree distributions, and the results align well with our theoretical findings. Collectively, both the theoretical analysis and controlled experiments verify that the proposed metadata-driven approach is effective in identifying critical data properties for GNNs.
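
The multivariate sparse regression step can be illustrated with an off-the-shelf Lasso fit on a metadata matrix; everything below (property names, data, the alpha value) is hypothetical and only shows the mechanics of selecting salient properties:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical metadata: rows = benchmark datasets, columns = graph properties
# (e.g., average degree, degree-distribution balance, homophily, density, ...).
rng = np.random.default_rng(0)
properties = rng.normal(size=(40, 6))
gnn_accuracy = 0.7 + 0.1 * properties[:, 1] + 0.02 * rng.normal(size=40)

X = StandardScaler().fit_transform(properties)
model = Lasso(alpha=0.05).fit(X, gnn_accuracy)
# Non-zero coefficients point at the salient data properties.
print(model.coef_)
```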

Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning
Guozheng Ma Linrui Zhang Haoyu Wang Lu Li Zilin Wang Zhen Wang Li Shen Xueqian Wang Dacheng Tao



Research question: Improve the sample efficiency of visual reinforcement learning algorithms and investigate what makes data augmentation (DA) effective.
Motivation: Simple observation transformations alone can markedly improve performance, but it remains unclear which attributes of DA account for its effectiveness in sample-efficient visual RL.
Method: Comprehensively evaluate how DA attributes affect its efficacy, and propose a new DA operation and a multi-type DA fusion scheme.
Results: The proposed method and operation achieve superior sample efficiency over the prior state of the art on the DeepMind Control suite and the CARLA driving simulator.

Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. Notably, employing simple observation transformations alone can yield outstanding performance without extra auxiliary representation tasks or pre-trained encoders. However, it remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL. To investigate this issue and further explore the potential of DA, this work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy and provides the following insights and improvements: (1) For individual DA operations, we reveal that both ample spatial diversity and slight hardness are indispensable. Building on this finding, we introduce Random PadResize (Rand PR), a new DA operation that offers abundant spatial diversity with minimal hardness. (2) For multi-type DA fusion schemes, the increased DA hardness and unstable data distribution result in the current fusion schemes being unable to achieve higher sample efficiency than their corresponding individual operations. Taking the non-stationary nature of RL into account, we propose a RL-tailored multi-type DA fusion scheme called Cycling Augmentation (CycAug), which performs periodic cycles of different DA operations to increase type diversity while maintaining data distribution consistency. Extensive evaluations on the DeepMind Control suite and CARLA driving simulator demonstrate that our methods achieve superior sample efficiency compared with the prior state-of-the-art methods.
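
The CycAug idea of cycling augmentation types while keeping the data distribution consistent within each period can be sketched as a simple scheduler; the operation names and period below are placeholders, not the paper's settings:

```python
import itertools

def make_cycaug(ops, period):
    """CycAug-style schedule: switch to the next augmentation operation every
    `period` environment steps, so type diversity grows across periods while
    the distribution within each period stays consistent."""
    cycle = itertools.cycle(ops)
    state = {"op": next(cycle), "next_switch": period}

    def augment(obs, step):
        if step >= state["next_switch"]:
            state["op"] = next(cycle)
            state["next_switch"] += period
        return state["op"](obs)

    return augment

# Usage with hypothetical operations:
# augment = make_cycaug([random_shift, random_pad_resize, random_crop], period=10_000)
# obs_aug = augment(obs, global_step)
```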

SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization
Hao Dong Ismail Nejjar Han Sun Eleni Chatzi Olga Fink



Research question: Achieve domain generalization (DG) in multi-modal scenarios, where models must generalize to unknown target distributions.
Motivation: Generalizing to unseen multi-modal distributions is especially difficult because different modalities exhibit distinct properties, and mapping features from different modalities into the same embedding space impedes model generalization.
Method: Propose SimMMDG, a simple yet effective multi-modal DG framework: split the features within each modality into modality-specific and modality-shared components, apply supervised contrastive learning on the shared features so they possess joint properties, impose distance constraints on the specific features to promote diversity, and add a cross-modal translation module to regularize the learned features, which also supports missing-modality generalization.
Results: The framework is theoretically well-supported and achieves strong multi-modal DG performance on the EPIC-Kitchens dataset and the new Human-Animal-Cartoon (HAC) dataset introduced in the paper.

In real-world scenarios, achieving domain generalization (DG) presents significant challenges as models are required to generalize to unknown target distributions. Generalizing to unseen multi-modal distributions poses even greater difficulties due to the distinct properties exhibited by different modalities. To overcome the challenges of achieving domain generalization in multi-modal scenarios, we propose SimMMDG, a simple yet effective multi-modal DG framework. We argue that mapping features from different modalities into the same embedding space impedes model generalization. To address this, we propose splitting the features within each modality into modality-specific and modality-shared components. We employ supervised contrastive learning on the modality-shared features to ensure they possess joint properties and impose distance constraints on modality-specific features to promote diversity. In addition, we introduce a cross-modal translation module to regularize the learned features, which can also be used for missing-modality generalization. We demonstrate that our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon (HAC) dataset introduced in this paper. Our source code and HAC dataset are available at https://github.com/donghao51/SimMMDG.

ALIM: Adjusting Label Importance Mechanism for Noisy Partial Label Learning
Mingyu Xu Zheng Lian Lei Feng Bin Liu Jianhua Tao



Research question: Address noisy partial label learning (noisy PLL), an important branch of weakly supervised learning.
Motivation: Unlike standard PLL, noisy PLL allows the ground-truth label to fall outside the candidate label set. Most existing methods try to detect noisy samples and estimate the ground-truth label of each, but detection errors are unavoidable and accumulate during training, continually affecting model optimization.
Method: Propose a new framework, "Adjusting Label Importance Mechanism (ALIM)", which reduces the negative impact of detection errors by trading off the initial candidate set against the model outputs. ALIM is a plug-in strategy that can be integrated with existing PLL methods.
Results: Experiments on multiple benchmark datasets show state-of-the-art performance on noisy PLL.

Noisy partial label learning (noisy PLL) is an important branch of weakly supervised learning. Unlike PLL, where the ground-truth label must be concealed in the candidate label set, noisy PLL relaxes this constraint and allows the ground-truth label to fall outside the candidate label set. To address this challenging problem, most of the existing works attempt to detect noisy samples and estimate the ground-truth label for each noisy sample. However, detection errors are unavoidable. These errors can accumulate during training and continuously affect model optimization. To this end, we propose a novel framework for noisy PLL with theoretical interpretations, called ``Adjusting Label Importance Mechanism (ALIM)''. It aims to reduce the negative impact of detection errors by trading off the initial candidate set and model outputs. ALIM is a plug-in strategy that can be integrated with existing PLL approaches. Experimental results on multiple benchmark datasets demonstrate that our method can achieve state-of-the-art performance on noisy PLL. Our code is available at: https://github.com/zeroQiaoba/ALIM.
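
The trade-off between the candidate set and the model outputs can be pictured as follows: instead of zeroing classes outside the candidate set, keep them with a small importance weight so a sample whose true label fell outside the candidates can still be recovered. This is a simplified ALIM-style sketch under my own parameterization, not the paper's exact update:

```python
import torch

def alim_targets(probs, candidate_mask, lam=0.2):
    """Adjusted soft targets: classes inside the candidate set keep weight 1,
    classes outside keep a small importance `lam`; the weighted model
    predictions are then renormalized into a target distribution.

    probs:          (batch, num_classes) current model predictions
    candidate_mask: (batch, num_classes) multi-hot candidate sets
    """
    importance = candidate_mask + lam * (1.0 - candidate_mask)
    targets = importance * probs
    return targets / targets.sum(dim=-1, keepdim=True)

probs = torch.softmax(torch.randn(4, 6), dim=-1)
mask = torch.zeros(4, 6); mask[:, :3] = 1.0
print(alim_targets(probs, mask))
```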

Post-processing Private Synthetic Data for Improving Utility on Selected Measures
Hao Wang Shivchander Sudalairaj John Henning Kristjan Greenewald Akash Srivastava



Research question: Existing private synthetic data generation algorithms are agnostic to downstream tasks, yet end users may have specific requirements; failing to meet them can greatly reduce the data's utility downstream.
Motivation: We propose a post-processing method that improves synthetic data utility on measures selected by the end user, while preserving strong privacy guarantees and dataset quality.
Method: Resample from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights.
Results: Comprehensive numerical experiments show the approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generators.

Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
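
To make the resampling step concrete, here is a toy first-order sketch: find weights over synthetic rows (a distribution) so that a weighted utility statistic matches a target, then resample by those weights. The exponentiated-gradient update and the scalar-statistic setup are my simplifications of the paper's stochastic first-order algorithm:

```python
import numpy as np

def fit_resampling_weights(stat_synth, stat_target, steps=500, lr=0.5):
    """Exponentiated-gradient sketch: minimize the squared gap between the
    weighted statistic and its target over the probability simplex; rows that
    hurt the selected utility measure are downweighted."""
    n = len(stat_synth)
    w = np.full(n, 1.0 / n)
    for _ in range(steps):
        gap = w @ stat_synth - stat_target   # scalar constraint violation
        w *= np.exp(-lr * gap * stat_synth)  # multiplicative (mirror) update
        w /= w.sum()
    return w

rng = np.random.default_rng(0)
stat = rng.normal(size=2000)                 # per-row utility statistic
w = fit_resampling_weights(stat, stat_target=0.3)
resampled = rng.choice(2000, size=2000, p=w)  # indices of the filtered dataset
```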

GEX: A flexible method for approximating influence via Geometric Ensemble
SungYub Kim Kyungsu Kim Eunho Yang



Research question: Existing influence function (IF) approximations suffer performance degradation in practice because their bilinear form oversimplifies the influence distribution, suppressing the expressive power of samples with relatively strong influence.
Motivation: To address this, we propose a new IF approximation that removes linearization to relax the bilinear constraint and employs a Geometric Ensemble (GE) tailored to non-linear losses.
Method: First reinterpret existing IF approximations as an average relationship between two linearized losses over parameters sampled from the Laplace approximation (LA); then improve on both identified limitations separately, i.e., removing the linearization and adopting the Geometric Ensemble designed for non-linear losses.
Results: The method outperforms existing IF approximations on downstream tasks with lighter computation, opening up new feasibility for low-complexity, non-linear IF designs.

Through a deeper understanding of predictions of neural networks, Influence Function (IF) has been applied to various tasks such as detecting and relabeling mislabeled samples, dataset pruning, and separation of data sources in practice. However, we found standard approximations of IF suffer from performance degradation due to oversimplified influence distributions caused by their bilinear approximation, suppressing the expressive power of samples with a relatively strong influence. To address this issue, we propose a new interpretation of existing IF approximations as an average relationship between two linearized losses over parameters sampled from the Laplace approximation (LA). In doing so, we highlight two significant limitations of current IF approximations: the linearity of gradients and the singularity of Hessian. Accordingly, by improving each point, we introduce a new IF approximation method with the following features: i) the removal of linearization to alleviate the bilinear constraint and ii) the utilization of Geometric Ensemble (GE) tailored for non-linear losses. Empirically, our approach outperforms existing IF approximations for downstream tasks with lighter computation, thereby providing new feasibility of low-complexity/nonlinear-based IF design.

Parameterizing Context: Unleashing the Power of Parameter-Efficient Fine-Tuning and In-Context Tuning for Continual Table Semantic Parsing
Yongrui Chen Shenyu Zhang Guilin Qi Xinnan Guo



Research question: Continual table semantic parsing: train a parser that translates natural language into SQL on a sequence of tasks, where each task provides only limited training examples.
Motivation: Conventional methods tend to overfit under limited supervision and suffer catastrophic forgetting from parameter updates. Recent advances using semi-supervised data augmentation and retention of a few past examples only partially alleviate these issues, and performance remains limited by the volume of unsupervised data and stored examples.
Method: Propose a method integrating parameter-efficient fine-tuning (PEFT) and in-context tuning (ICT). First, a task-adaptive PEFT framework fully circumvents catastrophic forgetting by freezing the pre-trained backbone and fine-tuning small-scale prompts. On top of this, a teacher-student solution is proposed: the teacher handles the few-shot problem via ICT, acquiring contextual information by demonstrating a few training examples; the student uses the PEFT framework to learn from the teacher's output distribution, compressing and saving the contextual information into the prompts, which eliminates the need to store any training examples.
Results: Evaluations on two benchmarks confirm the method's superiority over common few-shot and continual learning baselines across various metrics.

Continual table semantic parsing aims to train a parser on a sequence of tasks, where each task requires the parser to translate natural language into SQL based on task-specific tables but only offers limited training examples. Conventional methods tend to suffer from overfitting with limited supervision, as well as catastrophic forgetting due to parameter updates. Despite recent advancements that partially alleviate these issues through semi-supervised data augmentation and retention of a few past examples, the performance is still limited by the volume of unsupervised data and stored examples. To overcome these challenges, this paper introduces a novel method integrating parameter-efficient fine-tuning (PEFT) and in-context tuning (ICT) for training a continual table semantic parser. Initially, we present a task-adaptive PEFT framework capable of fully circumventing catastrophic forgetting, which is achieved by freezing the pre-trained model backbone and fine-tuning small-scale prompts. Building on this, we propose a teacher-student framework-based solution. The teacher addresses the few-shot problem using ICT, which procures contextual information by demonstrating a few training examples. In turn, the student leverages the proposed PEFT framework to learn from the teacher's output distribution, and subsequently compresses and saves the contextual information to the prompts, eliminating the need to store any training examples. Experimental evaluations on two benchmarks affirm the superiority of our method over prevalent few-shot and continual learning baselines across various metrics.

Unsupervised Anomaly Detection with Rejection
Lorenzo Perini Jesse Davis



Research question: Anomaly detection aims to detect unexpected behaviors in data, but traditional detectors learn decision boundaries with intuition-based heuristics that are hard to verify in practice, which may reduce user trust in the detector's predictions.
Motivation: To address this, allow the detector to reject predictions with high uncertainty (Learning to Reject). This requires a confidence metric that captures the distance to the decision boundary and a rejection threshold for discarding low-confidence predictions.
Method: Solve these challenges by setting a constant rejection threshold on the stability metric computed by ExCeeD, an insight grounded in a theoretical study of this metric. Moreover, a constant threshold yields strong guarantees: an estimate of the test rejection rate and theoretical upper bounds on both the rejection rate and the expected prediction cost.
Results: Experiments show the method outperforms several metric-based baselines.

Anomaly detection aims at detecting unexpected behaviours in the data. Because anomaly detection is usually an unsupervised task, traditional anomaly detectors learn a decision boundary by employing heuristics based on intuitions, which are hard to verify in practice. This introduces some uncertainty, especially close to the decision boundary, that may reduce the user trust in the detector's predictions. A way to combat this is by allowing the detector to reject predictions with high uncertainty (Learning to Reject). This requires employing a confidence metric that captures the distance to the decision boundary and setting a rejection threshold to reject low-confidence predictions. However, selecting a proper metric and setting the rejection threshold without labels are challenging tasks. In this paper, we solve these challenges by setting a constant rejection threshold on the stability metric computed by ExCeeD. Our insight relies on a theoretical analysis of such a metric. Moreover, setting a constant threshold results in strong guarantees: we estimate the test rejection rate, and derive a theoretical upper bound for both the rejection rate and the expected prediction cost. Experimentally, we show that our method outperforms some metric-based methods.
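
The resulting decision rule is compact: abstain whenever the confidence metric falls below the constant threshold, otherwise return the detector's prediction. A minimal sketch, assuming scores and an ExCeeD-style confidence value are already computed (both inputs here are placeholders):

```python
import numpy as np

def predict_with_reject(anomaly_scores, confidence, tau):
    """Learning-to-reject wrapper: abstain whenever the confidence metric
    (e.g., a stability score) is below a constant threshold tau; otherwise
    return the detector's normal/anomaly decision."""
    labels = np.where(anomaly_scores > 0.5, "anomaly", "normal")
    return np.where(confidence >= tau, labels, "reject")

scores = np.array([0.1, 0.9, 0.6, 0.4])
conf = np.array([0.99, 0.95, 0.55, 0.80])
print(predict_with_reject(scores, conf, tau=0.75))
# The empirical rejection rate is simply (conf < tau).mean().
```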

Anchor Data Augmentation
Nora Schneider Shirin Goshtasbpour Fernando Perez-Cruz



Research question: Propose a new data augmentation algorithm for nonlinear over-parameterized regression.
Motivation: Current state-of-the-art solutions rely on modifications of the Mixup algorithm, whereas our data augmentation algorithm borrows from the causality literature.
Method: Extend the recently proposed distributionally robust Anchor Regression (AR) method for data augmentation. Anchor Data Augmentation (ADA) uses several replicas of the AR-modified samples to provide more training examples, yielding more robust regression predictions.
Results: Applied to linear and nonlinear regression problems with neural networks, ADA is competitive with state-of-the-art C-Mixup solutions.

We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our data augmentation algorithm borrows from the literature on causality. Contrary to the current state-of-the-art solutions that rely on modifications of Mixup algorithm, we extend the recently proposed distributionally robust Anchor regression (AR) method for data augmentation. Our Anchor Data Augmentation (ADA) uses several replicas of the modified samples in AR to provide more training examples, leading to more robust regression predictions. We apply ADA to linear and nonlinear regression problems using neural networks. ADA is competitive with state-of-the-art C-Mixup solutions.
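
Anchor regression transforms the data with W = I + (sqrt(gamma) - 1) P_A, where P_A projects onto the span of the anchor variables; ADA's replicas correspond to applying this transform for several gamma values. A minimal sketch under that reading (the gamma grid and anchor construction are illustrative):

```python
import numpy as np

def anchor_augment(X, y, A, gammas):
    """Apply the anchor-regression transform W = I + (sqrt(g) - 1) * P_A to
    features and targets, once per gamma, and stack the replicas as extra
    training data (an ADA-style augmentation)."""
    P = A @ np.linalg.pinv(A)  # projection onto span(A)
    Xs, ys = [], []
    for g in gammas:
        W = np.eye(len(X)) + (np.sqrt(g) - 1.0) * P
        Xs.append(W @ X)
        ys.append(W @ y)
    return np.vstack(Xs), np.concatenate(ys)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
A = rng.integers(0, 2, size=(100, 1)).astype(float)  # a simple binary anchor
X_aug, y_aug = anchor_augment(X, y, A, gammas=[0.25, 1.0, 4.0])
```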

Meta-AdaM: A Meta-Learned Adaptive Optimizer with Momentum for Few-Shot Learning
Siyuan Sun Hongyang Gao



Research question: Address the challenge deep models face on tasks with few labeled examples by designing a meta-learned optimizer, Meta-AdaM.
Motivation: With only a few labeled samples, deep models struggle on few-shot learning tasks; meta-learning has been successfully applied to these problems by transferring meta-learned prior knowledge to new tasks.
Method: Propose a meta-learned adaptive learning-rate learner that takes the weight-update history as input to predict more appropriate learning rates for rapid convergence. Moreover, for the first time, momentum is incorporated into the optimization process of few-shot learning via a double look-ahead mechanism, enabling fast convergence similar to many-shot settings.
Results: Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed Meta-AdaM.

We introduce Meta-AdaM, a meta-learned adaptive optimizer with momentum, designed for few-shot learning tasks that pose significant challenges to deep learning models due to the limited number of labeled examples. Meta-learning has been successfully employed to address these challenges by transferring meta-learned prior knowledge to new tasks. Most existing works focus on meta-learning an optimal model initialization or an adaptive learning rate learner for rapid convergence. However, these approaches either neglect to consider weight-update history for the adaptive learning rate learner or fail to effectively integrate momentum for fast convergence, as seen in many-shot learning settings. To tackle these limitations, we propose a meta-learned learning rate learner that utilizes weight-update history as input to predict more appropriate learning rates for rapid convergence. Furthermore, for the first time, our approach incorporates momentum into the optimization process of few-shot learning via a double look-ahead mechanism, enabling rapid convergence similar to many-shot settings. Extensive experimental results on benchmark datasets demonstrate the effectiveness of the proposed Meta-AdaM.

How a Student becomes a Teacher: learning and forgetting through Spectral methods
Lorenzo Giambagli Lorenzo Buffoni Lorenzo Chicchi Duccio Fanelli



Research question: How to train a student network that matches a teacher network's ability and to identify the corresponding stable substructure within the student.
Motivation: When the student network is overparameterized, conventional learning methods cannot identify such a stable substructure.
Method: Propose a new optimization scheme that computes gradients in a spectral representation of the linear transfer of information between layers, isolating a student substructure whose complexity matches that of the teacher.
Results: After optimization, pruning unimportant nodes of the substructure causes no performance degradation above a threshold corresponding to the effective teacher size, exhibiting a universal second-order phase transition.

In theoretical Machine Learning, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. A student network is trained on data generated by a fixed teacher network until it matches the instructor’s ability to cope with the assigned task. The above scheme proves particularly relevant when the student network is overparameterized (namely, when larger layer sizes are employed) as compared to the underlying teacher network. Under these operating conditions, it is tempting to speculate that the student ability to handle the given task could be eventually stored in a sub-portion of the whole network. This latter should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the student candidate network. Unfortunately, state-of-the-art conventional learning techniques could not help in identifying the existence of such an invariant subnetwork, due to the inherent degree of non-convexity that characterizes the examined problem. In this work, we take a decisive leap forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors with negligible increase in terms of computational and complexity load, as compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure, that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When pruning unimportant nodes of the trained student, following a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.

Binary Classification with Confidence Difference
Wei Wang Lei Feng Yuchen Jiang Gang Niu Min-Ling Zhang Masashi Sugiyama



Research question: How to perform weakly supervised binary classification from confidence differences, without exact labels.
Motivation: Traditional weakly supervised methods require pointwise labeling confidence, which can be hard to collect and compute in the real world. This paper proposes a new weakly supervised method: classification from confidence differences.
Method: Propose a risk-consistent approach to this problem and prove that the estimation error bound achieves the optimal convergence rate; also introduce a risk-correction approach to mitigate overfitting, whose consistency and convergence rate are likewise proven.
Results: Extensive experiments on benchmark datasets and a real-world recommender-system dataset validate the effectiveness of the proposed approaches in exploiting the supervision carried by confidence differences.

Recently, learning with soft labels has been shown to achieve better performance than learning with hard labels in terms of model generalization, calibration, and robustness. However, collecting pointwise labeling confidence for all training examples can be challenging and time-consuming in real-world scenarios. This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification. Instead of pointwise labeling confidence, we are given only unlabeled data pairs with confidence difference that specifies the difference in the probabilities of being positive. We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate. We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven. Extensive experiments on benchmark data sets and a real-world recommender system data set validate the effectiveness of our proposed approaches in exploiting the supervision information of the confidence difference.

Dynamically Masked Discriminator for GANs
Wentian Zhang Haozhe Liu Bing Li Jinheng Xie Yawen Huang Yuexiang Li Yefeng Zheng Bernard Ghanem



Research question: Training generative adversarial networks (GANs) remains challenging, especially the discriminator's adaptation to changes in newly generated data.
Motivation: Because the distribution of generated data shifts throughout training, it is difficult for the discriminator to keep learning it.
Method: Approach GANs from the viewpoint of online continual learning: treat the generated data during training as a stream, detect whether the discriminator has slowed its learning of new generated data, and explicitly force it to learn new knowledge fast. In particular, propose a new discriminator that automatically detects its own retardation and dynamically masks its features, so that it can adaptively learn the time-varying distribution of generated data.
Results: Experiments show the method outperforms state-of-the-art approaches.

Training Generative Adversarial Networks (GANs) remains a challenging problem. The discriminator trains the generator by learning the distribution of real/generated data. However, the distribution of generated data changes throughout the training process, which is difficult for the discriminator to learn. In this paper, we propose a novel method for GANs from the viewpoint of online continual learning. We observe that the discriminator model, trained on historically generated data, often slows down its adaptation to the changes in newly arriving generated data, which accordingly decreases the quality of generated results. By treating the generated data in training as a stream, we propose to detect whether the discriminator slows down the learning of new knowledge in generated data. Therefore, we can explicitly enforce the discriminator to learn new knowledge fast. Particularly, we propose a new discriminator, which automatically detects its retardation and then dynamically masks its features, such that the discriminator can adaptively learn the time-varying distribution of generated data. Experimental results show our method outperforms the state-of-the-art approaches.

Combating Bilateral Edge Noise for Robust Link Prediction
Zhanke Zhou Jiangchao Yao Jiaxu Liu Xiawei Guo quanming yao LI He Liang Wang Bo Zheng Bo Han



Research question: Although link prediction on graphs has achieved great success with the development of graph neural networks (GNNs), its robustness under edge noise remains little studied.
Motivation: An empirical study first reveals that edge noise bilaterally perturbs both the input topology and the target labels, causing performance degradation and representation collapse. To address this, we propose an information-theoretic principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse.
Method: Unlike the basic information bottleneck, RGIB further decouples and balances the mutual dependence among graph topology, target labels, and representation, establishing new learning objectives that are robust to bilateral noise. Two instantiations, RGIB-SSL and RGIB-REP, leverage self-supervised learning and data reparameterization for implicit and explicit data denoising, respectively.
Results: Extensive experiments on six datasets and three GNNs under diverse noise scenarios verify the effectiveness of the RGIB instantiations. Code is publicly released at: https://github.com/tmlr-group/RGIB.

Although link prediction on graphs has achieved great success with the development of graph neural networks (GNNs), the potential robustness under the edge noise is still less investigated. To close this gap, we first conduct an empirical study to disclose that the edge noise bilaterally perturbs both input topology and target label, yielding severe performance degradation and representation collapse. To address this dilemma, we propose an information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse. Different from the basic information bottleneck, RGIB further decouples and balances the mutual dependence among graph topology, target labels, and representation, building new learning objectives for robust representation against the bilateral noise. Two instantiations, RGIB-SSL and RGIB-REP, are explored to leverage the merits of different methodologies, i.e., self-supervised learning and data reparameterization, for implicit and explicit data denoising, respectively. Extensive experiments on six datasets and three GNNs with diverse noisy scenarios verify the effectiveness of our RGIB instantiations. The code is publicly available at: https://github.com/tmlr-group/RGIB.

Boosting Spectral Clustering on Incomplete Data via Kernel Correction and Affinity Learning
Fangchen Yu Runze Zhao Zhan Shi Yiwen Lu Jicong Fan Yicheng Zeng Jianfeng Mao Wenye Li



Research question: How to improve spectral clustering on incomplete data.
Motivation: Incomplete data leads to inaccurate affinity measures, degrading clustering performance.
Method: Propose an imputation-free framework comprising a new kernel correction method and a family of affinity learning methods. The kernel correction improves the quality of kernel matrices estimated on incomplete data, while the affinity learning methods equip the self-expressive framework with an $\ell_p$-norm and an adaptive extension to construct an intrinsic affinity matrix.
Results: On benchmark datasets, the method outperforms existing data imputation and distance calibration techniques, offering a promising solution for spectral clustering on incomplete data in various real-world applications.

Spectral clustering has gained popularity for clustering non-convex data due to its simplicity and effectiveness. It is essential to construct a similarity graph using a high-quality affinity measure that models the local neighborhood relations among the data samples. However, incomplete data can lead to inaccurate affinity measures, resulting in degraded clustering performance. To address these issues, we propose an imputation-free framework with two novel approaches to improve spectral clustering on incomplete data. Firstly, we introduce a new kernel correction method that enhances the quality of the kernel matrix estimated on incomplete data with a theoretical guarantee, benefiting classical spectral clustering on pre-defined kernels. Secondly, we develop a series of affinity learning methods that equip the self-expressive framework with $\ell_p$-norm to construct an intrinsic affinity matrix with an adaptive extension. Our methods outperform existing data imputation and distance calibration techniques on benchmark datasets, offering a promising solution to spectral clustering on incomplete data in various real-world applications.
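
A kernel matrix estimated pairwise on incomplete data is typically indefinite, so some correction back to a valid (PSD) kernel is needed before spectral clustering. As one simple illustration of such a correction (the paper's method carries additional theoretical guarantees beyond this), symmetrize and clip negative eigenvalues:

```python
import numpy as np

def correct_kernel(K):
    """Project an estimated kernel matrix onto the PSD cone: enforce symmetry,
    then clip negative eigenvalues to zero. The result is a valid kernel that
    classical spectral clustering can consume."""
    K = 0.5 * (K + K.T)          # enforce symmetry
    w, V = np.linalg.eigh(K)
    return (V * np.clip(w, 0.0, None)) @ V.T

rng = np.random.default_rng(0)
K_noisy = rng.normal(size=(5, 5))        # stand-in for a pairwise estimate
K_fixed = correct_kernel(K_noisy)
print(np.linalg.eigvalsh(K_fixed).min() >= -1e-10)  # True: PSD
```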

Understanding Contrastive Learning via Distributionally Robust Optimization
Junkang Wu Jiawei Chen Jiancan Wu Wentao Shi Xiang Wang Xiangnan He



Research question: Reveal and explain contrastive learning's inherent tolerance to sampling bias.
Motivation: Existing theories cannot adequately explain why contrastive learning tolerates negative samples that may share similar semantics (e.g., labels).
Method: Analyze contrastive learning through the lens of distributionally robust optimization (DRO), yielding several key insights: (1) contrastive learning essentially performs DRO over the negative sampling distribution, enabling robust performance across a variety of potential distributions and hence robustness to sampling bias; (2) the temperature $\tau$ is not merely a heuristic but acts as a Lagrange coefficient regulating the size of the potential distribution set; (3) a theoretical connection between DRO and mutual information provides fresh evidence for "InfoNCE as an estimate of MI" and a new estimation approach for $\phi$-divergence-based generalized mutual information.
Results: We also identify potential shortcomings of contrastive learning, including over-conservatism and sensitivity to outliers, and introduce a novel Adjusted InfoNCE loss (ADNCE) to mitigate them. It refines the potential distribution, improving performance and accelerating convergence; extensive experiments across domains (images, sentences, and graphs) validate the proposal.

This study reveals the inherent tolerance of contrastive learning (CL) towards sampling bias, wherein negative samples may encompass similar semantics (\eg labels). However, existing theories fall short in providing explanations for this phenomenon. We bridge this research gap by analyzing CL through the lens of distributionally robust optimization (DRO), yielding several key insights: (1) CL essentially conducts DRO over the negative sampling distribution, thus enabling robust performance across a variety of potential distributions and demonstrating robustness to sampling bias; (2) The design of the temperature $\tau$ is not merely heuristic but acts as a Lagrange Coefficient, regulating the size of the potential distribution set; (3) A theoretical connection is established between DRO and mutual information, thus presenting fresh evidence for ``InfoNCE as an estimate of MI'' and a new estimation approach for $\phi$-divergence-based generalized mutual information. We also identify CL's potential shortcomings, including over-conservatism and sensitivity to outliers, and introduce a novel Adjusted InfoNCE loss (ADNCE) to mitigate these issues. It refines potential distribution, improving performance and accelerating convergence. Extensive experiments on various domains (image, sentence, and graph) validate the effectiveness of the proposal.
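
For reference, here is a standard InfoNCE implementation; under the paper's DRO reading, the temperature tau acts as a Lagrange coefficient, with smaller tau corresponding to robustness against a larger set of potential negative distributions. This sketch shows the vanilla loss, not the paper's ADNCE variant:

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, z_negatives, tau=0.1):
    """Vanilla InfoNCE: cross-entropy of the positive pair against K negatives,
    with cosine similarities scaled by the temperature tau."""
    pos = F.cosine_similarity(z_anchor, z_positive, dim=-1) / tau        # (B,)
    neg = F.cosine_similarity(z_anchor.unsqueeze(1), z_negatives, dim=-1) / tau  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # positive is class 0
    labels = torch.zeros(len(logits), dtype=torch.long)
    return F.cross_entropy(logits, labels)

z = torch.randn(8, 32)
zp = torch.randn(8, 32)
zn = torch.randn(8, 16, 32)
print(info_nce(z, zp, zn).item())
```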

MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection
Junho Song Keonwoo Kim Jeonglyul Oh Sungzoon Cho



Research question: Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations.
Motivation: Reconstruction-based deep models are widely used for this problem, yet they still suffer from over-generalization and fail to deliver consistently high performance.
Method: Propose MEMTO, a memory-guided Transformer using a reconstruction-based approach. It introduces a novel memory module that learns the degree to which each memory item should be updated in response to the input data, uses a two-phase training paradigm with K-means clustering to initialize memory items and stabilize training, and introduces a bi-dimensional deviation-based detection criterion that computes anomaly scores from both the input space and the latent space.
Results: Evaluated on five real-world datasets from diverse domains, the method achieves an average anomaly detection F1-score of 95.74%, significantly outperforming previous state-of-the-art methods; extensive experiments further validate its key components.

Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose the MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.
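
The learned update degree can be pictured as a gate per memory item: attend over the incoming encoded windows, then let a gate decide how strongly each item moves toward what it retrieved. A simplified, parameter-free sketch of this MEMTO-style update (the paper's gate is learned; the dot-product gate here is my stand-in):

```python
import torch

def update_memory(memory, queries):
    """Gated memory update: each of the M memory items attends over the N
    query vectors, and a sigmoid gate psi controls how far the item moves
    toward its attended retrieval."""
    attn = torch.softmax(memory @ queries.T, dim=-1)   # (M, N) attention weights
    retrieved = attn @ queries                          # (M, D) attended queries
    psi = torch.sigmoid((memory * retrieved).sum(-1, keepdim=True))  # update degree
    return (1.0 - psi) * memory + psi * retrieved

memory = torch.randn(10, 64)   # M items (K-means initialized in the paper)
queries = torch.randn(32, 64)  # encoded time-series windows
memory = update_memory(memory, queries)
```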

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset
Jiakang Yuan Bo Zhang Xiangchao Yan Botian Shi Tao Chen Yikang LI Yu Qiao



Research question: How to train perception models on a large-scale point cloud dataset so that their unified representations achieve promising results across different tasks and benchmarks.
Motivation: Prior work mainly focuses on self-supervised pre-training pipelines that pre-train and fine-tune on the same benchmark, which makes it difficult to attain performance scalability and cross-dataset applicability for the pre-trained checkpoint.
Method: For the first time, build a large-scale pre-training point cloud dataset with diverse data distribution and learn generalizable representations from it. Formulate point-cloud pre-training as a semi-supervised problem that leverages few labeled and massive unlabeled point clouds to produce unified backbone representations directly applicable to many baseline models and benchmarks, decoupling AD-related pre-training from downstream fine-tuning.
Results: During backbone pre-training, by enhancing scene- and instance-level distribution diversity and exploiting the backbone's ability to learn from unknown instances, significant performance gains are achieved on downstream perception benchmarks including Waymo, nuScenes, and KITTI, across baseline models such as PV-RCNN++, SECOND, and CenterPoint.

It is a long-term vision for the Autonomous Driving (AD) community that perception models can learn from a large-scale point cloud dataset, to obtain unified representations that can achieve promising results on different tasks or benchmarks. Previous works mainly focus on the self-supervised pre-training pipeline, meaning that they perform the pre-training and fine-tuning on the same benchmark, which makes it difficult to attain performance scalability and cross-dataset applicability for the pre-training checkpoint. In this paper, for the first time, we are committed to building a large-scale pre-training point-cloud dataset with diverse data distribution, and meanwhile learning generalizable representations from such a diverse pre-training dataset. We formulate the point-cloud pre-training task as a semi-supervised problem, which leverages the few-shot labeled and massive unlabeled point-cloud data to generate the unified backbone representations that can be directly applied to many baseline models and benchmarks, decoupling the AD-related pre-training process and downstream fine-tuning task. During backbone pre-training, by enhancing the scene- and instance-level distribution diversity and exploiting the backbone's ability to learn from unknown instances, we achieve significant performance gains on a series of downstream perception benchmarks including Waymo, nuScenes, and KITTI, under different baseline models like PV-RCNN++, SECOND, CenterPoint.

Fused Gromov-Wasserstein Graph Mixup for Graph-level Classifications
Xinyu Ma Xu Chu Yasha Wang Yang Lin Junfeng Zhao Liantao Ma Wenwu Zhu



Research question: Existing graph data augmentation methods mainly augment the graph signal space and the graph structure space independently, neglecting the interaction between them.
Motivation: To address this, we propose FGWMixup, a new graph mixup algorithm that seeks a "midpoint" of source graphs in the Fused Gromov-Wasserstein (FGW) metric space, optimizing the inter-graph node matching strategy.
Method: Formulate the problem as an optimal transport problem that accounts for the interactions between graph structures and signals, and introduce a relaxed FGW solver that improves the scalability of FGWMixup and accelerates convergence.
Results: Extensive experiments on five datasets with classic (MPNNs) and advanced (Graphormers) GNN backbones show that FGWMixup effectively improves the generalizability and robustness of GNNs.

Graph data augmentation has shown superiority in enhancing generalizability and robustness of GNNs in graph-level classifications. However, existing methods primarily focus on the augmentation in the graph signal space and the graph structure space independently, neglecting the joint interaction between them. In this paper, we address this limitation by formulating the problem as an optimal transport problem that aims to find an optimal inter-graph node matching strategy considering the interactions between graph structures and signals. To solve this problem, we propose a novel graph mixup algorithm called FGWMixup, which seeks a "midpoint" of source graphs in the Fused Gromov-Wasserstein (FGW) metric space. To enhance the scalability of our method, we introduce a relaxed FGW solver that accelerates FGWMixup by improving the convergence rate from $\mathcal{O}(t^{-1})$ to $\mathcal{O}(t^{-2})$. Extensive experiments conducted on five datasets using both classic (MPNNs) and advanced (Graphormers) GNN backbones demonstrate that FGWMixup effectively improves the generalizability and robustness of GNNs. Codes are available at https://github.com/ArthurLeoM/FGWMixup.
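
To picture an FGW-based mixup, one can compute a fused Gromov-Wasserstein coupling between two graphs and interpolate structure and features through it. This is a simplified stand-in for FGWMixup's barycenter ("midpoint") computation, assuming the POT (`ot`) library's `fused_gromov_wasserstein` solver; all sizes and parameters are illustrative:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def fgw_mixup(C1, F1, C2, F2, lam=0.5, alpha=0.5):
    """Mix two graphs (structure matrices C, node features F): solve for an
    FGW coupling, align graph 2 onto graph 1's nodes through the coupling,
    then linearly interpolate structures and features with weight lam."""
    p, q = ot.unif(len(C1)), ot.unif(len(C2))
    M = ot.dist(F1, F2)                                   # node-feature cost
    T = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q, alpha=alpha)
    T_norm = T / T.sum(axis=1, keepdims=True)             # row-normalized matching
    C2_aligned = T_norm @ C2 @ T_norm.T
    F2_aligned = T_norm @ F2
    return (1 - lam) * C1 + lam * C2_aligned, (1 - lam) * F1 + lam * F2_aligned

rng = np.random.default_rng(0)
C1 = (rng.random((6, 6)) < 0.4).astype(float); C1 = np.maximum(C1, C1.T)
C2 = (rng.random((8, 8)) < 0.4).astype(float); C2 = np.maximum(C2, C2.T)
F1, F2 = rng.normal(size=(6, 3)), rng.normal(size=(8, 3))
C_mix, F_mix = fgw_mixup(C1, F1, C2, F2)
```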

Revisiting Adversarial Robustness Distillation from the Perspective of Robust Fairness
Xinli Yue Ningping Mou Qian Wang Lingchen Zhao



Research question: Existing adversarial robustness distillation (ARD) methods focus mainly on the overall robustness of student models, overlooking the crucial issue of robust fairness.
Motivation: A student model may exhibit strong robustness on some classes of data while being highly vulnerable on others, the so-called "buckets effect".
Method: Propose Fair Adversarial Robustness Distillation (Fair-ARD), which improves the robust fairness of student models by increasing the weights of difficult classes.
Results: Experiments show Fair-ARD surpasses state-of-the-art ARD methods and existing robust-fairness algorithms in robust fairness, while also slightly improving overall robustness.

Adversarial Robustness Distillation (ARD) aims to transfer the robustness of large teacher models to small student models, facilitating the attainment of robust performance on resource-limited devices. However, existing research on ARD primarily focuses on the overall robustness of student models, overlooking the crucial aspect of $\textit{robust fairness}$. Specifically, these models may demonstrate strong robustness on some classes of data while exhibiting high vulnerability on other classes. Unfortunately, the "buckets effect" implies that the robustness of the deployed model depends on the classes with the lowest level of robustness. In this paper, we first investigate the inheritance of robust fairness during ARD and reveal that student models only partially inherit robust fairness from teacher models. We further validate this issue through fine-grained experiments with various model capacities and find that it may arise due to the gap in capacity between teacher and student models, as well as the existing methods treating each class equally during distillation. Based on these observations, we propose $\textbf{Fair}$ $\textbf{A}$dversarial $\textbf{R}$obustness $\textbf{D}$istillation (Fair-ARD), a novel framework for enhancing the robust fairness of student models by increasing the weights of difficult classes, and design a geometric perspective-based method to quantify the difficulty of different classes for determining the weights. Extensive experiments show that Fair-ARD surpasses both state-of-the-art ARD methods and existing robust fairness algorithms in terms of robust fairness (e.g., the worst-class robustness under AutoAttack is improved by at most 12.3\% and 5.3\% using ResNet18 on CIFAR10, respectively), while also slightly improving overall robustness. Our code is available at: [https://github.com/NISP-official/Fair-ARD](https://github.com/NISP-official/Fair-ARD).
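
Up-weighting difficult classes in distillation can be expressed as a per-sample reweighting of the KL term by the weight of that sample's class. A minimal sketch under the assumption that class weights (derived in the paper from a geometric difficulty measure, not computed here) are already available:

```python
import torch
import torch.nn.functional as F

def fair_ard_loss(student_logits, teacher_logits, labels, class_weights, T=4.0):
    """Class-reweighted distillation: the per-sample temperature-scaled KL
    between student and teacher is rescaled by the weight of the sample's
    class, so harder classes contribute more to the objective."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(dim=-1) * (T * T)
    return (class_weights[labels] * kl).mean()

s = torch.randn(16, 10); t = torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
w = torch.ones(10); w[3] = 2.0  # hypothetical: class 3 is a hard class
print(fair_ard_loss(s, t, y, w))
```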

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Yung-Hsuan Lai Yen-Chun Chen Yu-Chiang Frank Wang



Research question: Explore the under-explored unaligned setting of audio-visual learning, i.e., recognizing audio and visual events in a video given only weak labels.
Motivation: Current research mostly focuses on the modality-aligned setting of audio-visual learning, with little work on the unaligned setting.
Method: Propose a simple, effective, and generic method, Visual-Audio Label Elaboration (VALOR), to harvest modality labels for the training events.
Results: Experiments show that using large-scale contrastively pre-trained models as modality teachers significantly improves an attentional baseline's average F-score (Type@AV); moreover, the best model achieves a new state of the art on all LLP metrics.

Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its $\textit{modality-aligned}$ setting, $\textit{i.e.}$, the audio and visual modality are $\textit{both}$ assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored $\textit{unaligned}$ setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen without knowing the modality they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed $\textbf{V}$isual-$\textbf{A}$udio $\textbf{L}$abel Elab$\textbf{or}$ation (VALOR), is innovated to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by $\textbf{8.0}$ in average F-score (Type@AV). Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are noise-proof from the other potentially unaligned modality. Moreover, our best model achieves the new state-of-the-art on all metrics of LLP by a substantial margin ($\textbf{+5.4}$ F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves the new state-of-the-art as well.

SODA: Robust Training of Test-Time Data Adaptors
Zige Wang Yonggang Zhang Zhen Fang Long Lan Wenjing Yang Bo Han



Research question: how to mitigate the performance degradation caused by distribution shifts when privacy concerns render model parameters inaccessible.
Motivation: existing approaches that use zeroth-order optimization (ZOO) to train a data adaptor for a deployed model yield limited improvements, because the adaptor can corrupt data features.
Method: we propose pseudo-label-robust data adaptation (SODA). Specifically, SODA treats high-confidence predicted labels as reliable labels and uses them to optimize the data adaptor with ZOO for label prediction; for data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption.
Results: experiments show that SODA can significantly improve the performance of deployed models under distribution shift without requiring access to model parameters.

Adapting models deployed to test distributions can mitigate the performance degradation caused by distribution shifts. However, privacy concerns may render model parameters inaccessible. One promising approach involves utilizing zeroth-order optimization (ZOO) to train a data adaptor to adapt the test data to fit the deployed models. Nevertheless, the data adaptor trained with ZOO typically brings restricted improvements due to the potential corruption of data features caused by the data adaptor. To address this issue, we revisit ZOO in the context of test-time data adaptation. We find that the issue directly stems from the unreliable estimation of the gradients used to optimize the data adaptor, which is inherently due to the unreliable nature of the pseudo-labels assigned to the test data. Based on this observation, we propose pseudo-label-robust data adaptation (SODA) to improve the performance of data adaptation. Specifically, SODA leverages high-confidence predicted labels as reliable labels to optimize the data adaptor with ZOO for label prediction. For data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption. Empirical results indicate that SODA can significantly enhance the performance of deployed models in the presence of distribution shifts without requiring access to model parameters.
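
Since ZOO is central to this setup, the sketch below shows the generic two-point zeroth-order gradient estimator such methods rely on; the function name is illustrative, and SODA's pseudo-label-dependent objective and confidence splitting are omitted.

```python
import torch

def zoo_gradient(loss_fn, theta, mu=1e-3, n_samples=20):
    """Two-point zeroth-order gradient estimate of loss_fn at theta.
    Only loss values are queried, never true gradients, which is what
    makes the approach usable when model parameters are inaccessible."""
    grad = torch.zeros_like(theta)
    for _ in range(n_samples):
        u = torch.randn_like(theta)  # random probe direction
        delta = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
        grad += delta * u
    return grad / n_samples
```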

H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation
Yanjie Ze Yuyao Liu Ruizhe Shi Jiaxin Qin Zhecheng Yuan Jiashun Wang Huazhe Xu



Research question: how to solve difficult dexterous manipulation tasks with reinforcement learning.
Motivation: the dexterity of the human hand has long inspired robotic manipulation; we propose a human-hand-informed visual representation learning framework for dexterous manipulation.
Method: the framework has three stages: (i) pre-training representations with 3D human hand pose estimation; (ii) offline adaptation of the representations with self-supervised keypoint detection; (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages modify only 0.36% of the pre-trained representation's parameters, keeping the pre-trained knowledge intact.
Results: an empirical study on 12 challenging dexterous manipulation tasks shows that H-InDex largely surpasses strong baselines and recent visual foundation models for motor control.

Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and-$\textbf{In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: $\textit{(i)}$ pre-training representations with 3D human hand pose estimation, $\textit{(ii)}$ offline adapting representations with self-supervised keypoint detection, and $\textit{(iii)}$ reinforcement learning with exponential moving average BatchNorm. The last two stages only modify $0.36$% parameters of the pre-trained representation in total, ensuring the knowledge from pre-training is maintained to the full extent. We empirically study $\textbf{12}$ challenging dexterous manipulation tasks and find that $\textbf{H-InDex}$ largely surpasses strong baseline methods and the recent visual foundation models for motor control. Code and videos are available at https://yanjieze.com/H-InDex .

Transfer Learning with Affine Model Transformation
Shunya Minami Kenji Fukumizu Yoshihiro Hayashi Ryo Yoshida



Research question: how to use supervised transfer learning to boost predictive power when data are scarce.
Motivation: although supervised transfer learning has succeeded in many practical applications, the lack of a theoretical basis has hindered further development.
Method: this paper presents a general class of transfer learning regression, called affine model transfer, following the principle of expected-square loss minimization.
Results: several case studies demonstrate the practical benefit of modeling and estimating cross-domain commonality and domain-specific factors separately with affine-type transfer models.

Supervised transfer learning has received considerable attention due to its potential to boost the predictive power of machine learning in scenarios where data are scarce. Generally, a given set of source models and a dataset from a target domain are used to adapt the pre-trained models to a target domain by statistically learning domain shift and domain-specific factors. While such procedurally and intuitively plausible methods have achieved great success in a wide range of real-world applications, the lack of a theoretical basis hinders further methodological development. This paper presents a general class of transfer learning regression called affine model transfer, following the principle of expected-square loss minimization. It is shown that the affine model transfer broadly encompasses various existing methods, including the most common procedure based on neural feature extractors. Furthermore, the current paper clarifies theoretical properties of the affine model transfer such as generalization error and excess risk. Through several case studies, we demonstrate the practical benefits of modeling and estimating inter-domain commonality and domain-specific factors separately with the affine-type transfer models.
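
A minimal sketch of an affine-type transfer model, assuming a scalar regression output and linear domain-specific terms; the paper's class is more general, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class AffineModelTransfer(nn.Module):
    """Affine-type transfer: prediction = g1(x) + g2(x) * f_s(x), where f_s
    is a frozen source model (cross-domain commonality) and g1, g2 are
    learned domain-specific shift and scale terms."""
    def __init__(self, source_model, in_dim):
        super().__init__()
        self.source = source_model
        for p in self.source.parameters():
            p.requires_grad_(False)      # keep the source model fixed
        self.g1 = nn.Linear(in_dim, 1)   # domain-specific shift
        self.g2 = nn.Linear(in_dim, 1)   # domain-specific scale

    def forward(self, x):
        return self.g1(x) + self.g2(x) * self.source(x)
```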

PUe: Biased Positive-Unlabeled Learning Enhancement by Causal Inference
Xutao Wang Hanting Chen Tianyu Guo Yunhe Wang



Research question: this paper addresses positive-unlabeled learning, i.e., high-accuracy binary classification given abundant unlabeled data and only a few labeled positive examples.
Motivation: existing cost-sensitive methods often assume that observed positive labels are selected entirely at random; in reality, the distribution of labels is uneven and subject to selection bias.
Method: we propose a PU learning enhancement (PUe) algorithm based on causal inference theory, which uses normalized propensity scores and normalized inverse probability weighting (NIPW) to reconstruct the loss function and obtain a consistent, unbiased classifier estimate; when the labeling mechanism is unknown, we further propose and study regularization techniques for estimating propensity scores in deep learning.
Results: experiments show that, compared with advanced cost-sensitive PU methods, PUe significantly improves classifier accuracy on datasets with non-uniform label distributions.

Positive-Unlabeled (PU) learning aims to achieve high-accuracy binary classification with limited labeled positive examples and numerous unlabeled ones. Existing cost-sensitive-based methods often rely on strong assumptions that examples with an observed positive label were selected entirely at random. In fact, the uneven distribution of labels is prevalent in real-world PU problems, indicating that most actual positive and unlabeled data are subject to selection bias. In this paper, we propose a PU learning enhancement (PUe) algorithm based on causal inference theory, which employs normalized propensity scores and normalized inverse probability weighting (NIPW) techniques to reconstruct the loss function, thus obtaining a consistent, unbiased estimate of the classifier and enhancing the model's performance. Moreover, we investigate and propose a method for estimating propensity scores in deep learning using regularization techniques when the labeling mechanism is unknown. Our experiments on three benchmark datasets demonstrate the proposed PUe algorithm significantly improves the accuracy of classifiers on non-uniform label distribution datasets compared to advanced cost-sensitive PU methods. Codes are available at https://github.com/huawei-noah/Noah-research/tree/master/PUe and https://gitee.com/mindspore/models/tree/master/research/cv/PUe.
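
The snippet below sketches the self-normalized inverse-propensity weights suggested by the NIPW idea; the names and the way the weights enter the final loss are assumptions, not the paper's implementation.

```python
import torch

def nipw_weights(labeled, propensity, eps=1e-6):
    """Self-normalized inverse-propensity weights for observed positives.
    labeled:    float 0/1 indicator that a positive label was observed.
    propensity: estimated P(label observed | x, y = +1) per example."""
    w = labeled / propensity.clamp_min(eps)  # inverse probability weighting
    return w / w.sum().clamp_min(eps)        # normalization step of NIPW
```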

Topological RANSAC for instance verification and retrieval without fine-tuning
Guoyuan An Ju-hyeong Seon Inkyu An Yuchi Huo Sung-eui Yoon



Research question: this paper addresses the poor explainability of existing image retrieval methods, such as SPatial verification (SP), when no fine-tuning set is available.
Motivation: despite its wide use, SP relies on a spatial model, assumes planar structures, and neglects topological relations among features, which limits its performance.
Method: we propose an innovative technique that replaces the spatial model in the RANSAC process with a topological one, and introduce bio-inspired saccade and fovea functions to verify the topological consistency among features.
Results: experiments show the method significantly outperforms SP and achieves state-of-the-art non-fine-tuned retrieval performance; combined with fine-tuned features it improves performance further, while remaining highly explainable and lightweight, offering a practical and flexible solution for a variety of real-world applications.

This paper presents an innovative approach to enhancing explainable image retrieval, particularly in situations where a fine-tuning set is unavailable. The widely-used SPatial verification (SP) method, despite its efficacy, relies on a spatial model and the hypothesis-testing strategy for instance recognition, leading to inherent limitations, including the assumption of planar structures and neglect of topological relations among features. To address these shortcomings, we introduce a pioneering technique that replaces the spatial model with a topological one within the RANSAC process. We propose bio-inspired saccade and fovea functions to verify the topological consistency among features, effectively circumventing the issues associated with SP's spatial model. Our experimental results demonstrate that our method significantly outperforms SP, achieving state-of-the-art performance in non-fine-tuning retrieval. Furthermore, our approach can enhance performance when used in conjunction with fine-tuned features. Importantly, our method retains high explainability and is lightweight, offering a practical and adaptable solution for a variety of real-world applications.

Regularizing Neural Networks with Meta-Learning Generative Models
Shin'ya Yamaguchi Daiki Chijiwa Sekitoshi Kanai Atsutoshi Kumagai Hisashi Kashima



Research question: this paper aims to improve generative data augmentation for deep learning.
Motivation: generative data augmentation uses synthetic samples produced by generative models as additional data for classification in small-dataset settings, but a key challenge is that the synthetic data contain uninformative samples that degrade accuracy.
Method: we present a novel generative data augmentation strategy called meta generative regularization (MGR). To avoid performance degradation, MGR uses synthetic samples to regularize the feature extractor rather than to train the classifier; the synthetic samples are determined dynamically via meta-learning to minimize the validation loss.
Results: experiments show MGR is particularly effective when datasets are small, stably outperforming baselines by up to 7 percentage points in test accuracy.

This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification with small dataset settings. A key challenge of generative data augmentation is that the synthetic data contain uninformative samples that degrade accuracy. This can be caused by the synthetic samples not perfectly representing class categories in real data and uniform sampling not necessarily providing useful samples for tasks. In this paper, we present a novel strategy for generative data augmentation called *meta generative regularization* (MGR). To avoid the degradation of generative data augmentation, MGR utilizes synthetic samples for regularizing feature extractors instead of training classifiers. These synthetic samples are dynamically determined to minimize the validation losses through meta-learning. We observed that MGR can avoid the performance degradation of naive generative data augmentation and boost the baselines. Experiments on six datasets showed that MGR is effective particularly when datasets are smaller and stably outperforms baselines by up to 7 percentage points on test accuracy.

StableFDG: Style and Attention Based Learning for Federated Domain Generalization
Jungwuk Park Dong-Jun Han Jinho Kim Shiqiang Wang Christopher Brinton Jaekyun Moon



Research question: existing federated learning algorithms assume that the training (source-domain) and testing (target-domain) data distributions are the same, yet domain shift frequently arises in practice.
Motivation: because each client's local dataset lacks samples/domains, existing domain generalization algorithms face fundamental challenges in the federated setting.
Method: we propose StableFDG, a style- and attention-based learning strategy for federated domain generalization. First, style-based learning lets each client explore novel styles beyond the original source domains in its local dataset, improving domain diversity through the proposed style sharing, shifting, and exploration strategies. Second, an attention-based feature highlighter captures the similarities among features of same-class data samples and emphasizes their important/common characteristics, to better learn the domain-invariant characteristics of each class in data-poor federated settings.
Results: experiments show StableFDG outperforms existing baselines on various domain generalization benchmark datasets, demonstrating its effectiveness.

Traditional federated learning (FL) algorithms operate under the assumption that the data distributions at training (source domains) and testing (target domain) are the same. The fact that domain shifts often occur in practice necessitates equipping FL methods with a domain generalization (DG) capability. However, existing DG algorithms face fundamental challenges in FL setups due to the lack of samples/domains in each client’s local dataset. In this paper, we propose StableFDG, a style and attention based learning strategy for accomplishing federated domain generalization, introducing two key contributions. The first is style-based learning, which enables each client to explore novel styles beyond the original source domains in its local dataset, improving domain diversity based on the proposed style sharing, shifting, and exploration strategies. Our second contribution is an attention-based feature highlighter, which captures the similarities between the features of data samples in the same class, and emphasizes the important/common characteristics to better learn the domain-invariant characteristics of each class in data-poor FL scenarios. Experimental results show that StableFDG outperforms existing baselines on various DG benchmark datasets, demonstrating its efficacy.

Synthetic-to-Real Pose Estimation with Geometric Reconstruction
Qiuxia Lin Kerui Gu Linlin Yang Angela Yao



Research question: how to adapt models trained on synthetic data to real-world target domains when only unlabeled data is available.
Motivation: obtaining annotated data, especially for new deployments, is costly and time-consuming.
Method: we propose a reconstruction-based strategy as a complement to pseudo-labelling for synthetic-to-real domain adaptation: a driving image is generated by geometrically transforming a base image according to the predicted keypoints, and a reconstruction loss is imposed to refine the predictions.
Results: on four large-scale hand and human real-world datasets, the method improves PCK by 8% over the previous state of the art, with gains of 7.2% and 29.9% on endpoints such as fingertips and the head.

Pose estimation is remarkably successful under supervised learning, but obtaining annotations, especially for new deployments, is costly and time-consuming. This work tackles adapting models trained on synthetic data to real-world target domains with only unlabelled data. A common approach is model fine-tuning with pseudo-labels from the target domain; yet many pseudo-labelling strategies cannot provide sufficient high-quality pose labels. This work proposes a reconstruction-based strategy as a complement to pseudo-labelling for synthetic-to-real domain adaptation. We generate the driving image by geometrically transforming a base image according to the predicted keypoints and enforce a reconstruction loss to refine the predictions. It provides a novel solution to effectively correct confident yet inaccurate keypoint locations through image reconstruction in domain adaptation. Our approach outperforms the previous state-of-the-arts by 8% for PCK on four large-scale hand and human real-world datasets. In particular, we excel on endpoints such as fingertips and head, with 7.2% and 29.9% improvements in PCK.

Recasting Continual Learning as Sequence Modeling
Soochan Lee Jaehyeon Son Gunhee Kim



Research question: this paper aims to establish a strong connection between two important areas of machine learning: continual learning and sequence modeling.
Motivation: formulating continual learning as a sequence modeling problem allows advanced sequence models to be used for continual learning.
Method: we adopt the meta-continual learning (MCL) framework and train sequence models at the meta-level over multiple continual learning episodes.
Results: experiments show that sequence models can be an attractive solution for general MCL.

In this work, we aim to establish a strong connection between two significant bodies of machine learning research: continual learning and sequence modeling. That is, we propose to formulate continual learning as a sequence modeling problem, allowing advanced sequence models to be utilized for continual learning. Under this formulation, the continual learning process becomes the forward pass of a sequence model. By adopting the meta-continual learning (MCL) framework, we can train the sequence model at the meta-level, on multiple continual learning episodes. As a specific example of our new formulation, we demonstrate the application of Transformers and their efficient variants as MCL methods. Our experiments on seven benchmarks, covering both classification and regression, show that sequence models can be an attractive solution for general MCL.

NICE: NoIse-modulated Consistency rEgularization for Data-Efficient GANs
Yao Ni Piotr Koniusz



Research question: generative adversarial networks (GANs) are powerful for image synthesis but require vast amounts of training data, which is often costly and hard to obtain.
Motivation: limited data hurts GANs, leading to discriminator overfitting and training instability.
Method: we present Noise-modulated Consistency rEgularization (NICE) to overcome these challenges, introducing adaptive multiplicative noise into the discriminator to modulate its latent features.
Results: experiments show this modulation effectively prevents discriminator overfitting and improves GAN stability; NICE achieves state-of-the-art results on CIFAR-10, CIFAR-100, ImageNet, and FFHQ for limited-data training and low-shot generation tasks.

Generative Adversarial Networks (GANs) are powerful tools for image synthesis. However, they require access to vast amounts of training data, which is often costly and prohibitive. Limited data affects GANs, leading to discriminator overfitting and training instability. In this paper, we present a novel approach called NoIse-modulated Consistency rEgularization (NICE) to overcome these challenges. To this end, we introduce an adaptive multiplicative noise into the discriminator to modulate its latent features. We demonstrate the effectiveness of such a modulation in preventing discriminator overfitting by adaptively reducing the Rademacher complexity of the discriminator. However, this modulation leads to an unintended consequence of increased gradient norm, which can undermine the stability of GAN training. To mitigate this undesirable effect, we impose a constraint on the discriminator, ensuring its consistency for the same inputs under different noise modulations. The constraint effectively penalizes the first and second-order gradients of latent features, enhancing GAN stability. Experimental evidence aligns with our theoretical analysis, demonstrating the reduction of generalization error and gradient penalization of NICE. This substantiates the efficacy of NICE in reducing discriminator overfitting and improving stability of GAN training. NICE achieves state-of-the-art results on CIFAR-10, CIFAR-100, ImageNet and FFHQ datasets when trained with limited data, as well as in low-shot generation tasks.
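
A minimal sketch of the consistency constraint under multiplicative noise modulation: the same latent features are run through the discriminator head under two independent noise draws and the disagreement is penalized. NICE's noise is adaptive and its full objective differs, so `disc_head` and the fixed `noise_std` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nice_consistency(disc_head, features, noise_std=0.1):
    """Penalize the discriminator head's disagreement on the same latent
    features under two independent multiplicative-noise modulations."""
    h1 = features * (1.0 + noise_std * torch.randn_like(features))
    h2 = features * (1.0 + noise_std * torch.randn_like(features))
    return F.mse_loss(disc_head(h1), disc_head(h2))
```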

Data-Informed Geometric Space Selection
Shuai Zhang Wenqi Jiang



Research question: this paper addresses the core challenge of geometric representation learning: aligning the inherent geometric bias with the underlying structure of the data.
Motivation: existing methods rely heavily on heuristic assumptions about the data structure to decide which geometry to adopt, which often leads to suboptimal performance.
Method: the alignment process is automated via a data-informed strategy; specifically, a sparse gating mechanism lets each input data point select K geometric spaces from a pool of N distinct geometries (K < N).
Results: experiments show this approach can effectively align data and spaces without human intervention and further improves performance on real-world tasks, demonstrating its potential to unlock the expressiveness and practicality of geometric representations.

Geometric representation learning (e.g., hyperbolic and spherical geometry) has proven to be efficacious in solving many intricate machine learning tasks. The fundamental challenge of geometric representation learning lies in aligning the inherent geometric bias with the underlying structure of the data, which is a rarely explored topic in the literature. Existing methods heavily rely on heuristic assumptions on the data structure to decide the type of geometry to be adopted, which often leads to suboptimal performance. This work aims to automate the alignment process via a data-informed strategy such that we optimize model performance with minimal overhead. Specifically, a sparse gating mechanism is employed to enable each input data point $\mathit{p}$ to select $K$ geometric spaces from a given candidate geometric space pool with $N$ ($K < N$) different geometries. Experiments show that the proposed approach can effectively align data with the appropriate geometric spaces without human intervention and further improves performance on real-world tasks.
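
A minimal sketch of a per-example top-K sparse gate over a pool of geometry-specific encoders, in the spirit described above; for clarity it evaluates every encoder rather than routing sparsely, and all names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseGeometryGate(nn.Module):
    """Per-example top-K gate over a pool of N geometric-space encoders,
    each mapping (B, in_dim) -> (B, out_dim)."""
    def __init__(self, encoders, in_dim, k=2):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)  # one encoder per geometry
        self.gate = nn.Linear(in_dim, len(encoders))
        self.k = k

    def forward(self, x):                               # x: (B, in_dim)
        logits = self.gate(x)                           # (B, N)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)      # mix only K chosen spaces
        outs = torch.stack([enc(x) for enc in self.encoders], dim=1)  # (B, N, D)
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, outs.size(-1))
        return (weights.unsqueeze(-1) * outs.gather(1, idx)).sum(dim=1)
```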

Augmentation-free Dense Contrastive Distillation for Efficient Semantic Segmentation
Jiawei Fan Chao Li Xiaolong Liu Meina Song Anbang Yao



Research question: knowledge distillation methods based on contrastive learning have achieved strong results on image classification and object detection in recent years, but semantic segmentation has received far less attention.
Motivation: existing methods rely heavily on data augmentation and memory buffers, which demand substantial computational resources for semantic segmentation, where high-resolution feature maps must be preserved for dense pixel-wise prediction.
Method: we propose Augmentation-free Dense Contrastive Knowledge Distillation (Af-DCD), a new contrastive distillation paradigm that exploits tactful feature partitions and a novel contrastive loss to transfer the dense, structured local knowledge learned by the teacher to the student while maintaining training efficiency.
Results: extensive experiments on five mainstream benchmarks demonstrate its effectiveness; for example, DeepLabV3-Res18|DeepLabV3-MBV2 models trained with Af-DCD reach 77.03%|76.38% mIOU on Cityscapes with DeepLabV3-Res101 as the teacher, setting new records, and Af-DCD gains 3.26%|3.04%|2.75%|2.30%|1.42% absolute mIOU over individually trained counterparts on Cityscapes|Pascal VOC|Camvid|ADE20K|COCO-Stuff-164K. Code is available at https://github.com/OSVAI/Af-DCD.

In recent years, knowledge distillation methods based on contrastive learning have achieved promising results on image classification and object detection tasks. However, in this line of research, we note that less attention is paid to semantic segmentation. Existing methods heavily rely on data augmentation and memory buffer, which entail high computational resource demands when applying them to handle semantic segmentation that requires to preserve high-resolution feature maps for making dense pixel-wise predictions. In order to address this problem, we present Augmentation-free Dense Contrastive Knowledge Distillation (Af-DCD), a new contrastive distillation learning paradigm to train compact and accurate deep neural networks for semantic segmentation applications. Af-DCD leverages a masked feature mimicking strategy, and formulates a novel contrastive learning loss via taking advantage of tactful feature partitions across both channel and spatial dimensions, allowing to effectively transfer dense and structured local knowledge learnt by the teacher model to a target student model while maintaining training efficiency. Extensive experiments on five mainstream benchmarks with various teacher-student network pairs demonstrate the effectiveness of our approach. For instance, DeepLabV3-Res18|DeepLabV3-MBV2 model trained by Af-DCD reaches 77.03\%|76.38\% mIOU on Cityscapes dataset when choosing DeepLabV3-Res101 as the teacher, setting new performance records. Besides that, Af-DCD achieves an absolute mIOU improvement of 3.26\%|3.04\%|2.75\%|2.30\%|1.42\% compared with individually trained counterpart on Cityscapes|Pascal VOC|Camvid|ADE20K|COCO-Stuff-164K. Code is available at https://github.com/OSVAI/Af-DCD.

Towards Free Data Selection with General-Purpose Models
Yichen Xie Mingyu Ding Masayoshi Tomizuka Wei Zhan



Research question: how to efficiently select the most informative samples so as to maximize the utility of a limited annotation budget.
Motivation: existing data selection algorithms, such as active learning methods, typically repeat time-consuming model training and batch data selection, which is inefficient.
Method: this paper designs a distinct data selection pipeline that uses existing general-purpose models to select data from various datasets with single-pass inference, without additional training or supervision, and proposes a novel free data selection (FreeSel) method under this pipeline.
Results: experiments show FreeSel is effective on various computer vision tasks and is 530x faster than existing active learning methods.

A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a distinct data selection pipeline that utilizes existing general-purpose models to select data from various datasets with a single-pass inference without the need for additional training or supervision. A novel free data selection (FreeSel) method is proposed following this new pipeline. Specifically, we define semantic patterns extracted from intermediate features of the general-purpose model to capture subtle local information in each image. We then enable the selection of all data samples in a single pass through distance-based sampling at the fine-grained semantic pattern level. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods. Extensive experiments verify the effectiveness of FreeSel on various computer vision tasks.

Hierarchical Vector Quantized Transformer for Multi-class Unsupervised Anomaly Detection
Ruiying Lu YuJie Wu Long Tian Dongsheng Wang Bo Chen Xiyang Liu Ruimin Hu



Research question: this paper addresses unsupervised image anomaly detection, i.e., distinguishing normal from abnormal samples.
Motivation: separate reconstruction networks per class are computationally expensive and generalize poorly in multi-class settings; moreover, such networks often suffer the "identical shortcut" problem, where both normal and abnormal samples are reconstructed well and become hard to distinguish.
Method: we propose a hierarchical vector-quantized prototype-oriented Transformer. First, typical normal patterns are preserved as discrete iconic prototypes, with vector quantization preventing the model from falling into the shortcut; the quantized prototypes are integrated into the Transformer for reconstruction, so that abnormal data points are flipped to normal ones. Second, a refined hierarchical framework relieves the codebook collapse issue and replenishes frail normal patterns. Finally, a prototype-oriented optimal transport method better regulates the prototypes and hierarchically evaluates anomaly scores.
Results: evaluations on the MVTec-AD and VisA datasets show the model surpasses state-of-the-art alternatives and possesses good interpretability.

Unsupervised image Anomaly Detection (UAD) aims to learn robust and discriminative representations of normal samples. While separate solutions per class endow expensive computation and limited generalizability, this paper focuses on building a unified framework for multiple classes. Under such a challenging setting, popular reconstruction-based networks with continuous latent representation assumption always suffer from the "identical shortcut" issue, where both normal and abnormal samples can be well recovered and difficult to distinguish. To address this pivotal issue, we propose a hierarchical vector quantized prototype-oriented Transformer under a probabilistic framework. First, instead of learning the continuous representations, we preserve the typical normal patterns as discrete iconic prototypes, and confirm the importance of Vector Quantization in preventing the model from falling into the shortcut. The vector quantized iconic prototypes are integrated into the Transformer for reconstruction, such that the abnormal data point is flipped to a normal data point. Second, we investigate an exquisite hierarchical framework to relieve the codebook collapse issue and replenish frail normal patterns. Third, a prototype-oriented optimal transport method is proposed to better regulate the prototypes and hierarchically evaluate the abnormal score. By evaluating on MVTec-AD and VisA datasets, our model surpasses the state-of-the-art alternatives and possesses good interpretability. The code is available at https://github.com/RuiyingLu/HVQ-Trans.
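
The vector-quantization step at the heart of this design can be sketched as a nearest-prototype lookup with a straight-through estimator; the snippet below is a generic VQ module, not the paper's hierarchical variant, and `IconicPrototypes` is an illustrative name.

```python
import torch
import torch.nn as nn

class IconicPrototypes(nn.Module):
    """Vector quantization against a codebook of 'iconic' normal prototypes."""
    def __init__(self, num_prototypes, dim):
        super().__init__()
        self.codebook = nn.Embedding(num_prototypes, dim)

    def forward(self, z):                              # z: (N, dim)
        dists = torch.cdist(z, self.codebook.weight)   # distance to each prototype
        idx = dists.argmin(dim=1)                      # nearest prototype per feature
        z_q = self.codebook(idx)
        # Straight-through estimator: quantized values in the forward pass,
        # gradients flow back to the encoder as if quantization were identity.
        return z + (z_q - z).detach(), idx
```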

Architecture Matters: Uncovering Implicit Mechanisms in Graph Contrastive Learning
Xiaojun Guo Yifei Wang Zeming Wei Yisen Wang



Research question: this paper studies a range of graph contrastive learning (GCL) methods and how they differ from the original visual contrastive learning (VCL) methods.
Motivation: a systematic study of GCL methods reveals several phenomena that differ from VCL: positive samples are not a must; negative samples are unnecessary for graph classification, and for node classification under specific normalization modules; and data augmentation has much less influence on GCL.
Method: we uncover the implicit inductive bias of graph neural networks in contrastive learning, providing theoretical insight into the intriguing properties above, and advocate paying more attention to the unique architecture of graph learning and its implicit influence when designing GCL methods.
Results: this in-depth study offers a new architecture-centered perspective on graph contrastive learning and theoretical guidance for future research and applications.

With the prosperity of contrastive learning for visual representation learning (VCL), it is also adapted to the graph domain and yields promising performance. However, through a systematic study of various graph contrastive learning (GCL) methods, we observe that some common phenomena among existing GCL methods that are quite different from the original VCL methods, including 1) positive samples are not a must for GCL; 2) negative samples are not necessary for graph classification, neither for node classification when adopting specific normalization modules; 3) data augmentations have much less influence on GCL, as simple domain-agnostic augmentations (e.g., Gaussian noise) can also attain fairly good performance. By uncovering how the implicit inductive bias of GNNs works in contrastive learning, we theoretically provide insights into the above intriguing properties of GCL. Rather than directly porting existing VCL methods to GCL, we advocate for more attention toward the unique architecture of graph learning and consider its implicit influence when designing GCL methods. Code is available at https://github.com/PKU-ML/ArchitectureMattersGCL.

Effective Robustness against Natural Distribution Shifts for Models with Different Training Data
Zhouxing Shi Nicholas Carlini Ananth Balashankar Ludwig Schmidt Cho-Jui Hsieh Alex Beutel Yao Qin



Research question: how to evaluate and compare the effective robustness of models trained on different data.
Motivation: existing effective-robustness evaluations typically use a single test set, such as ImageNet, to measure in-distribution (ID) accuracy, which is problematic when evaluating models trained on different data.
Method: this paper proposes a new evaluation metric that controls for accuracy on multiple ID test sets covering the training distributions of all evaluated models.
Results: the new metric provides a better estimate of effective robustness for models with different training data, and may explain the surprising effective-robustness gains of zero-shot CLIP-like models reported in prior work that used ImageNet as the only ID test set, gains which diminish under the new evaluation.

``Effective robustness'' measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when there are models with different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models exhibited in prior works that used ImageNet as the only ID test set, while the gains diminish under our new evaluation. Additional artifacts including interactive visualizations are provided at https://shizhouxing.github.io/effective-robustness.
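
One way to read "controlling for accuracy on multiple ID test sets" is a regression of OOD accuracy on several ID accuracies, with effective robustness as the residual above the fit. The sketch below illustrates that idea on raw accuracies; the paper's actual metric involves a specific fitting procedure (commonly on transformed accuracies), so treat every name and modeling choice here as an assumption.

```python
import numpy as np

def effective_robustness(model_ood, model_ids, base_ood, base_ids):
    """Residual OOD accuracy above a linear fit on multiple ID accuracies.
    base_ids:  (n_models, n_id_sets) ID accuracies of baseline models.
    base_ood:  (n_models,) their OOD accuracies.
    model_ids: (n_id_sets,) ID accuracies of the model under evaluation."""
    X = np.column_stack([base_ids, np.ones(len(base_ood))])  # add intercept
    beta, *_ = np.linalg.lstsq(X, base_ood, rcond=None)      # fit the baseline trend
    predicted = np.append(model_ids, 1.0) @ beta
    return model_ood - predicted                             # robustness beyond the trend
```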

Rethinking Conditional Diffusion Sampling with Progressive Guidance
Anh-Dung Dinh Daochang Liu Chang Xu



Research question: this paper tackles two key challenges of classifier guidance for diffusion generative models: the lack of diversity and the presence of adversarial effects.
Motivation: these issues typically result in a scarcity of diverse samples or the generation of non-robust features; the root cause lies in the classifier-guidance mechanism, where discriminative gradients aggressively push samples to be recognized as the condition.
Method: we propose a generalized classifier-guidance method called Progressive Guidance, which mitigates these problems by letting the gradients of relevant classes contribute to shared-information construction in the early sampling steps; in the later sampling stage, the gradients are progressively strengthened to refine image details toward the primary condition.
Results: experiments show the proposed method further improves image quality while offering significant diversity and robust features.

This paper tackles two critical challenges encountered in classifier guidance for diffusion generative models, i.e., the lack of diversity and the presence of adversarial effects. These issues often result in a scarcity of diverse samples or the generation of non-robust features. The underlying cause lies in the mechanism of classifier guidance, where discriminative gradients push samples to be recognized as conditions aggressively. This inadvertently suppresses information with common features among relevant classes, resulting in a limited pool of features with less diversity or the absence of robust features for image construction. We propose a generalized classifier guidance method called Progressive Guidance, which mitigates the problems by allowing relevant classes' gradients to contribute to shared information construction when the image is noisy in early sampling steps. In the later sampling stage, we progressively enhance gradients to refine the details in the image toward the primary condition. This helps to attain a high level of diversity and robustness compared to the vanilla classifier guidance. Experimental results demonstrate that our proposed method further improves the image quality while offering a significant level of diversity as well as robust features.
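
A minimal sketch of guidance whose strength grows over sampling, which is the progressive idea in caricature: weak guidance while the image is still noisy, strong guidance when refining details. It omits the noise-schedule factor of standard classifier guidance and the sharing of relevant-class gradients, and all names and the linear schedule are illustrative assumptions.

```python
import torch

def guided_eps(eps_model, cls_grad, x_t, t, y, total_steps, s_min=0.5, s_max=3.0):
    """Classifier-guided noise prediction with a scale that increases as
    sampling proceeds (t counts down from total_steps to 0)."""
    progress = 1.0 - float(t) / total_steps      # 0 at the noisiest step, -> 1 at the end
    scale = s_min + (s_max - s_min) * progress   # weak early, strong late
    return eps_model(x_t, t) - scale * cls_grad(x_t, t, y)
```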

Zero-Shot Anomaly Detection via Batch Normalization
Aodong Li Chen Qiu Marius Kloft Padhraic Smyth Maja Rudolph Stephan Mandt



Research question: how to adapt anomaly detectors to drift in the normal data distribution, particularly when no "new normal" training data is available.
Motivation: existing anomaly detection techniques struggle under such drift, motivating the development of zero-shot anomaly detection techniques.
Method: we propose Adaptive Centered Representations (ACR), a simple yet effective method for zero-shot batch-level anomaly detection. It combines batch normalization with meta-training to train off-the-shelf deep anomaly detectors (such as deep SVDD) on a set of inter-related training distributions, enabling automatic zero-shot generalization to unseen anomaly detection tasks.
Results: experiments demonstrate the first zero-shot anomaly detection results on tabular data and outperform existing zero-shot detection and segmentation methods on image data from specialized domains.

Anomaly detection (AD) plays a crucial role in many safety-critical application domains. The challenge of adapting an anomaly detector to drift in the normal data distribution, especially when no training data is available for the "new normal," has led to the development of zero-shot AD techniques. In this paper, we propose a simple yet effective method called Adaptive Centered Representations (ACR) for zero-shot batch-level AD. Our approach trains off-the-shelf deep anomaly detectors (such as deep SVDD) to adapt to a set of inter-related training data distributions in combination with batch normalization, enabling automatic zero-shot generalization for unseen AD tasks. This simple recipe, batch normalization plus meta-training, is a highly effective and versatile tool. Our results demonstrate the first zero-shot AD results for tabular data and outperform existing methods in zero-shot anomaly detection and segmentation on image data from specialized domains.
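
A sketch of the batch-level scoring step, assuming a deep-SVDD-style encoder with BatchNorm layers: keeping BN in batch-statistics mode re-centers each test batch's dominant ("normal") mode, which is what makes zero-shot adaptation possible. The meta-training stage is not shown, and the function name is illustrative.

```python
import torch
import torch.nn as nn

def acr_scores(encoder: nn.Module, center: torch.Tensor, x_batch: torch.Tensor):
    """Batch-level zero-shot anomaly scores for one test batch."""
    encoder.train()             # BN uses current-batch statistics, not stale running stats
    with torch.no_grad():
        z = encoder(x_batch)
    # Deep-SVDD-style score: squared distance to the fixed hypersphere center.
    return ((z - center) ** 2).sum(dim=1)
```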

Dream the Impossible: Outlier Imagination with Diffusion Models
Xuefeng Du Yiyou Sun Jerry Zhu Yixuan Li



Research question: how to regularize machine learning models with auxiliary outlier datasets for out-of-distribution (OOD) detection and safe prediction.
Motivation: because data collection and cleaning are labor-intensive, automatically generating outlier data has long been a desired alternative; despite the appeal, generating realistic outliers in high-dimensional pixel space remains an open challenge in the field.
Method: this paper proposes a new framework, Dream-OOD, which imagines photo-realistic outliers via diffusion models given only in-distribution (ID) data and classes. Specifically, Dream-OOD learns a text-conditioned latent space from the ID data, samples outliers in low-likelihood regions of that latent space, and decodes them into images with the diffusion model. Unlike prior works [16, 95], Dream-OOD allows the imagined outliers to be visualized and understood directly in pixel space.
Results: comprehensive quantitative and qualitative studies show that training with samples generated by Dream-OOD significantly improves OOD detection performance.

Utilizing auxiliary outlier datasets to regularize the machine learning model has demonstrated promise for out-of-distribution (OOD) detection and safe prediction. Due to the labor intensity in data collection and cleaning, automating outlier data generation has been a long-desired alternative. Despite the appeal, generating photo-realistic outliers in the high dimensional pixel space has been an open challenge for the field. To tackle the problem, this paper proposes a new framework Dream-OOD, which enables imagining photo-realistic outliers by way of diffusion models, provided with only the in-distribution (ID) data and classes. Specifically, Dream-OOD learns a text-conditioned latent space based on ID data, and then samples outliers in the low-likelihood region via the latent, which can be decoded into images by the diffusion model. Different from prior works [16, 95], Dream-OOD enables visualizing and understanding the imagined outliers, directly in the pixel space. We conduct comprehensive quantitative and qualitative studies to understand the efficacy of Dream-OOD, and show that training with the samples generated by Dream-OOD can significantly benefit OOD detection performance.

SANFlow: Semantic-Aware Normalizing Flow for Anomaly Detection
Daehyun Kim Sungyong Baik Tae Hyun Kim



Research question: image anomaly detection is challenging because anomalies are rare and unpredictable.
Motivation: existing normalizing-flow (NF) based methods rely on NF's density estimation ability but forcibly transform the distributions of all features into a single distribution (e.g., the unit normal), which can limit the network's capacity to discriminate normal from abnormal data.
Method: we propose transforming the feature distribution at each location of a given image into a different distribution. Concretely, we train the NF to map the normal data distribution to per-location distributions with the same mean but different variances; to enhance discriminability, we also train it to map the abnormal data distribution, synthesized via data augmentation, to a distribution whose mean differs from that of normal data.
Results: experiments show the proposed framework effectively improves density modeling and thereby anomaly detection performance.

Visual anomaly detection, the task of detecting abnormal characteristics in images, is challenging due to the rarity and unpredictability of anomalies. In order to reliably model the distribution of normality and detect anomalies, a few works have attempted to exploit the density estimation ability of normalizing flow (NF). However, previous NF-based methods have relied solely on the capability of NF and forcibly transformed the distribution of all features to a single distribution (e.g., unit normal distribution), when features can have different semantic information and thus follow different distributions. We claim that forcibly learning to transform such diverse distributions to a single distribution with a single network will cause the learning difficulty, limiting the capacity of a network to discriminate normal and abnormal data. As such, we propose to transform the distribution of features at each location of a given image to different distributions. In particular, we train NF to map normal data distribution to distributions with the same mean but different variances at each location of the given image. To enhance the discriminability, we also train NF to map abnormal data distribution to a distribution with a mean that is different from that of normal data, where abnormal data is synthesized with data augmentation. The experimental results outline the effectiveness of the proposed framework in improving the density modeling and thus anomaly detection performance.

C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder
Xiaoyu Liu Jiaxin Yuan Bang An Yuancheng Xu Yifan Yang Furong Huang



Research question: this paper addresses the fact that existing representation learning models discover generative factors without accounting for their common causes, i.e., confounders.
Motivation: most existing work assumes unconfoundedness during discovery, yet the presence of confounders matters greatly for discovering semantically meaningful generative factors.
Method: we propose Confounded-Disentanglement (C-Disentanglement), the first framework to explicitly introduce the inductive bias of the confounder, together with an approach to sufficiently identify the causally disentangled factors under any inductive bias of the confounder.
Results: extensive experiments on synthetic and real-world datasets show our method is competitive with various state-of-the-art baselines in obtaining causally disentangled features and on downstream tasks.

Representation learning assumes that real-world data is generated by a few semantically meaningful generative factors (i.e., sources of variation) and aims to discover them in the latent space. These factors are expected to be causally disentangled, meaning that distinct factors are encoded into separate latent variables, and changes in one factor will not affect the values of the others. Compared to statistical independence, causal disentanglement allows more controllable data generation, improved robustness, and better generalization. However, most existing work assumes unconfoundedness in the discovery process, that there are no common causes to the generative factors and thus obtain only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without proper inductive bias. We fill the gap by introducing a framework entitled Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces the inductive bias of confounder via labels from domain expertise. In addition, we accordingly propose an approach to sufficiently identify the causally-disentangled factors under any inductive bias of the confounder. We conduct extensive experiments on both synthetic and real-world datasets. Our method demonstrates competitive results compared to various SOTA baselines in obtaining causally disentangled features and downstream tasks under domain shifts.

Label-Only Model Inversion Attacks via Knowledge Transfer
Ngoc-Bao Nguyen Keshigeyan Chandrasegaran Milad Abdollahzadeh Ngai-man Cheung



Research question: in label-only model-inversion (MI) attacks, the adversary can access only the model's predicted label (hard label), with no confidence scores or any other model information.
Motivation: remarkable progress has been made in the white-box and black-box settings, but the label-only setting, the most challenging yet practically important one, has received very limited study.
Method: we propose LOKT, a novel label-only MI attack based on transferring knowledge from the opaque target model to surrogate models; with these surrogate models, advanced white-box attacks can be harnessed.
Results: experiments show our method outperforms the existing state-of-the-art label-only MI attack by more than 15% across all MI benchmarks, and it also compares favorably in terms of query budget; this study highlights rising privacy threats to ML models even when minimal information (i.e., hard labels) is exposed.

In a model inversion (MI) attack, an adversary abuses access to a machine learning (ML) model to infer and reconstruct private training data. Remarkable progress has been made in the white-box and black-box setups, where the adversary has access to the complete model or the model's soft output respectively. However, there is very limited study in the most challenging but practically important setup: Label-only MI attacks, where the adversary only has access to the model's predicted label (hard label) without confidence scores nor any other model information. In this work, we propose LOKT, a novel approach for label-only MI attacks. Our idea is based on transfer of knowledge from the opaque target model to surrogate models. Subsequently, using these surrogate models, our approach can harness advanced white-box attacks. We propose knowledge transfer based on generative modelling, and introduce a new model, Target model-assisted ACGAN (T-ACGAN), for effective knowledge transfer. Our method casts the challenging label-only MI into the more tractable white-box setup. We provide analysis to support that surrogate models based on our approach serve as effective proxies for the target model for MI. Our experiments show that our method significantly outperforms existing SOTA Label-only MI attack by more than 15% across all MI benchmarks. Furthermore, our method compares favorably in terms of query budget. Our study highlights rising privacy threats for ML models even when minimal information (i.e., hard labels) is exposed. Our code, demo, models and reconstructed data are available at our project page: https://ngoc-nguyen-0.github.io/lokt/

Better Correlation and Robustness: A Distribution-Balanced Self-Supervised Learning Framework for Automatic Dialogue Evaluation
Peiwen Yuan Xinglin Wang Jiayi Shi Bin Sun Yiwei Li Kan Li



Research question: how to improve the correlation and robustness of dialogue evaluation models.
Motivation: existing self-supervised learning frameworks yield training data with an unbalanced coherence distribution, leading to low correlation with humans on medium-coherence samples, and they produce nonuniform score distributions that may weaken model robustness.
Method: we propose the Better Correlation and Robustness (BCR) framework, which reconstructs the training set to provide coherence-balanced training signals and thereby fosters balanced evaluation abilities; we also propose a novel loss function that adapts to the uniformity of the score distribution as estimated by kernel density estimation.
Results: comprehensive experiments on 17 benchmark datasets show that vanilla BERT-base with BCR outperforms state-of-the-art methods by 11.3% on average; BCR also generalizes well, leading multiple state-of-the-art methods to better correlation and robustness.

Turn-level dialogue evaluation models (TDEMs), using self-supervised learning (SSL) framework, have achieved state-of-the-art performance in open-domain dialogue evaluation. However, these models inevitably face two potential problems. First, they have low correlations with humans on medium coherence samples as the SSL framework often brings training data with unbalanced coherence distribution. Second, the SSL framework leads TDEM to nonuniform score distribution. There is a danger that the nonuniform score distribution will weaken the robustness of TDEM through our theoretical analysis. To tackle these problems, we propose Better Correlation and Robustness (BCR), a distribution-balanced self-supervised learning framework for TDEM. Given a dialogue dataset, BCR offers an effective training set reconstructing method to provide coherence-balanced training signals and further facilitate balanced evaluating abilities of TDEM. To get a uniform score distribution, a novel loss function is proposed, which can adjust adaptively according to the uniformity of score distribution estimated by kernel density estimation. Comprehensive experiments on 17 benchmark datasets show that vanilla BERT-base using BCR outperforms SOTA methods significantly by 11.3% on average. BCR also demonstrates strong generalization ability as it can lead multiple SOTA methods to attain better correlation and robustness.
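
A sketch of measuring score-distribution uniformity with a Gaussian KDE, which an adaptive loss of the kind described above could use as its modulating signal; the exact form of BCR's loss is not reproduced here, and the bandwidth and grid are illustrative assumptions.

```python
import math
import torch

def score_uniformity(scores, bandwidth=0.05, grid_size=50):
    """Gaussian-KDE estimate of the score density on [0, 1] and its mean
    squared deviation from the uniform density (lower = more uniform)."""
    grid = torch.linspace(0.0, 1.0, grid_size, device=scores.device)
    z = (grid[:, None] - scores[None, :]) / bandwidth
    density = torch.exp(-0.5 * z ** 2).mean(dim=1) / (bandwidth * math.sqrt(2 * math.pi))
    return ((density - 1.0) ** 2).mean()   # uniform density on [0, 1] equals 1
```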

Representation Learning via Consistent Assignment of Views over Random Partitions
Thalles Santos Silva Adín Ramírez Rivera



Research question: how to learn visual feature representations effectively.
Motivation: existing self-supervised clustering methods require additional non-differentiable modules to solve the cluster-assignment problem, and their training is unstable and prone to collapsed solutions.
Method: we propose Consistent Assignment of Views over Random Partitions (CARP), which learns prototypes end-to-end via online gradient descent without extra modules for the cluster-assignment problem, optimizing a new pretext task based on random partitions of prototypes that regularizes the model and enforces consistency between view assignments.
Results: experiments show CARP's representations are suitable for learning downstream tasks, evaluated extensively on 17 datasets; in transfer learning tasks, CARP outperforms many self-supervised methods trained for longer on average.

We present Consistent Assignment of Views over Random Partitions (CARP), a self-supervised clustering method for representation learning of visual features. CARP learns prototypes in an end-to-end online fashion using gradient descent without additional non-differentiable modules to solve the cluster assignment problem. CARP optimizes a new pretext task based on random partitions of prototypes that regularizes the model and enforces consistency between views' assignments. Additionally, our method improves training stability and prevents collapsed solutions in joint-embedding training. Through an extensive evaluation, we demonstrate that CARP's representations are suitable for learning downstream tasks. We evaluate CARP's representations capabilities in 17 datasets across many standard protocols, including linear evaluation, few-shot classification, $k$-NN, $k$-means, image retrieval, and copy detection. We compare CARP performance to 11 existing self-supervised methods. We extensively ablate our method and demonstrate that our proposed random partition pretext task improves the quality of the learned representations by devising multiple random classification tasks. In transfer learning tasks, CARP achieves the best performance on average against many SSL methods trained for a longer time.

Drift doesn't Matter: Dynamic Decomposition with Diffusion Reconstruction for Unstable Multivariate Time Series Anomaly Detection
Chengsen Wang Zirui Zhuang Qi Qi Jingyu Wang Xingyu Wang Haifeng Sun Jianxin Liao



Research question: existing unsupervised methods focus mainly on stable data and often ignore the drift produced by non-stationary environments, which can cause numerous false alarms.
Motivation: we propose D$^3$R, a novel anomaly detection network for real-world unstable data, to fill this gap.
Method: D$^3$R tackles drift via decomposition and reconstruction. In the decomposition procedure, data-time mix-attention dynamically decomposes long-period multivariate time series, overcoming the limitation of local sliding windows. In the reconstruction procedure, the information bottleneck is controlled externally by noise diffusion and the polluted data is reconstructed directly, avoiding retraining when the bottleneck changes. The whole model can be trained end-to-end.
Results: extensive experiments on various real-world datasets show D$^3$R significantly outperforms existing methods, with an 11% average improvement over the previous state-of-the-art models.

Many unsupervised methods have recently been proposed for multivariate time series anomaly detection. However, existing works mainly focus on stable data yet often omit the drift generated from non-stationary environments, which may lead to numerous false alarms. We propose **D**ynamic **D**ecomposition with **D**iffusion **R**econstruction (D$^3$R), a novel anomaly detection network for real-world unstable data to fill the gap. D$^3$R tackles the drift via decomposition and reconstruction. In the decomposition procedure, we utilize data-time mix-attention to dynamically decompose long-period multivariate time series, overcoming the limitation of the local sliding window. The information bottleneck is critical yet difficult to determine in the reconstruction procedure. To avoid retraining once the bottleneck changes, we control it externally by noise diffusion and directly reconstruct the polluted data. The whole model can be trained end-to-end. Extensive experiments on various real-world datasets demonstrate that D$^3$R significantly outperforms existing methods, with an 11% average relative improvement over the previous SOTA models.

Test-time Training for Matching-based Video Object Segmentation
Juliette Bertrand Giorgos Kordopatis-Zilos Yannis Kalantidis Giorgos Tolias



Research question: how to cope with test-time distribution shifts in the video object segmentation (VOS) task.
Motivation: current state-of-the-art methods rely on matching to estimate segmentation masks for subsequent frames; lacking any adaptation mechanism, they are vulnerable to test-time distribution shifts.
Method: we adopt a matching-based VOS approach and explore test-time training strategies tailored to VOS, including a variant based on mask cycle consistency.
Results: experiments show the proposed test-time training yields significant performance improvements, particularly in the sim-to-real scenario, where even with a single test video it recovers much of the gain obtained by training on real videos. We also introduce DAVIS-C, an augmented version of the popular DAVIS test set featuring extreme distribution shifts such as image/video-level corruptions and stylizations.

The video object segmentation (VOS) task involves the segmentation of an object over time based on a single initial mask. Current state-of-the-art approaches use a memory of previously processed frames and rely on matching to estimate segmentation masks of subsequent frames. Lacking any adaptation mechanism, such methods are prone to test-time distribution shifts. This work focuses on matching-based VOS under distribution shifts such as video corruptions, stylization, and sim-to-real transfer. We explore test-time training strategies that are agnostic to the specific task as well as strategies that are designed specifically for VOS. This includes a variant based on mask cycle consistency tailored to matching-based VOS methods. The experimental results on common benchmarks demonstrate that the proposed test-time training yields significant improvements in performance. In particular for the sim-to-real scenario and despite using only a single test video, our approach manages to recover a substantial portion of the performance gain achieved through training on real videos. Additionally, we introduce DAVIS-C, an augmented version of the popular DAVIS test set, featuring extreme distribution shifts like image-/video-level corruptions and stylizations. Our results illustrate that test-time training enhances performance even in these challenging cases.

Causal-structure Driven Augmentations for Text OOD Generalization
Amir Feder Yoav Wald Claudia Shi Suchi Saria David Blei



Research question: text classifiers' reliance on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare.
Motivation: we propose counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and learn more robust text classifiers.
Method: examples are matched using auxiliary data following diff-in-diff methodology, and a large language model is used to represent a conditional probability of text.
Results: experiments demonstrate that the method improves out-of-distribution (OOD) accuracy over baseline invariant-learning algorithms on prediction problems.

The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.

Learning Invariant Molecular Representation in Latent Discrete Space
Xiang Zhuang Qiang Zhang Keyan Ding Yatao Bian Xiao Wang Jingsong Lv Hongyang Chen Huajun Chen



Research question: existing molecular representation learning methods generalize poorly out of distribution, particularly when training and testing data come from different environments.
Motivation: to address this, we propose a new framework for learning molecular representations that are invariant and robust to distribution shifts.
Method: we propose a "first-encoding-then-separation" strategy to identify invariant molecular features in the latent space, and introduce a residual vector quantization module that mitigates overfitting to the training data distribution while preserving the expressivity of the encoder.
Results: extensive experiments on 18 real-world molecular datasets show our model generalizes more strongly than state-of-the-art baselines under various distribution shifts.

Molecular representation learning lays the foundation for drug discovery. However, existing methods suffer from poor out-of-distribution (OOD) generalization, particularly when data for training and testing originate from different environments. To address this issue, we propose a new framework for learning molecular representations that exhibit invariance and robustness against distribution shifts. Specifically, we propose a strategy called ``first-encoding-then-separation'' to identify invariant molecule features in the latent space, which deviates from conventional practices. Prior to the separation step, we introduce a residual vector quantization module that mitigates the over-fitting to training data distributions while preserving the expressivity of encoders. Furthermore, we design a task-agnostic self-supervised learning objective to encourage precise invariance identification, which enables our method widely applicable to a variety of tasks, such as regression and multi-label classification. Extensive experiments on 18 real-world molecular datasets demonstrate that our model achieves stronger generalization against state-of-the-art baselines in the presence of various distribution shifts. Our code is available at https://github.com/HICAI-ZJU/iMoLD.

How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization
Hai Zhang Hang Yu Junqiao Zhao Di Zhang Chang Huang Hongtu Zhou Xiao Zhang Chen Ye



Research question: designing and deriving effective model-based reinforcement learning (MBRL) algorithms with a performance-improvement guarantee is challenging, mainly due to the tight coupling between model learning and policy optimization.
Motivation: many methods that rely on return discrepancy to guide model learning ignore the impact of model shift, which can degrade performance through excessive model updates; other methods use performance difference bounds to consider model shift explicitly, but they constrain it with a fixed threshold, causing heavy dependence on the threshold and a lack of adaptability during training.
Method: this paper theoretically derives an optimization objective that unifies model shift and model bias, and formulates a fine-tuning process that adaptively adjusts model updates to obtain a performance-improvement guarantee while avoiding model overfitting; on this basis, we develop the straightforward algorithm USB-PO (Unified model Shift and model Bias Policy Optimization).
Results: empirical results show USB-PO achieves state-of-the-art performance on several challenging benchmark tasks.

Designing and deriving effective model-based reinforcement learning (MBRL) algorithms with a performance improvement guarantee is challenging, mainly attributed to the high coupling between model learning and policy optimization. Many prior methods that rely on return discrepancy to guide model learning ignore the impacts of model shift, which can lead to performance deterioration due to excessive model updates. Other methods use performance difference bound to explicitly consider model shift. However, these methods rely on a fixed threshold to constrain model shift, resulting in a heavy dependence on the threshold and a lack of adaptability during the training process. In this paper, we theoretically derive an optimization objective that can unify model shift and model bias and then formulate a fine-tuning process. This process adaptively adjusts the model updates to get a performance improvement guarantee while avoiding model overfitting. Based on these, we develop a straightforward algorithm USB-PO (Unified model Shift and model Bias Policy Optimization). Empirical results show that USB-PO achieves state-of-the-art performance on several challenging benchmark tasks.

Trade-off Between Efficiency and Consistency for Removal-based Explanations
Yifan Zhang Haowei He Zhiquan Tan Yang Yuan



Research question: current explanation methods mainly adopt removal-based techniques to evaluate the influence of individual features, but these methods suffer an inherent tension between efficiency and consistency.
Motivation: to address this, we propose interpretation error as a metric that gauges the inefficiency and inconsistency of an explanation.
Method: we establish the Impossible Trinity Theorem, which posits that interpretability, efficiency, and consistency cannot hold simultaneously; recognizing that an ideal explanation is unattainable, we then present two novel algorithms founded on the standard polynomial basis to minimize interpretation error.
Results: experiments show the proposed methods substantially reduce interpretation error, by up to 31.8x compared with alternative techniques.

In the current landscape of explanation methodologies, most predominant approaches, such as SHAP and LIME, employ removal-based techniques to evaluate the impact of individual features by simulating various scenarios with specific features omitted. Nonetheless, these methods primarily emphasize efficiency in the original context, often resulting in general inconsistencies. In this paper, we demonstrate that such inconsistency is an inherent aspect of these approaches by establishing the Impossible Trinity Theorem, which posits that interpretability, efficiency, and consistency cannot hold simultaneously. Recognizing that the attainment of an ideal explanation remains elusive, we propose the utilization of interpretation error as a metric to gauge inefficiencies and inconsistencies. To this end, we present two novel algorithms founded on the standard polynomial basis, aimed at minimizing interpretation error. Our empirical findings indicate that the proposed methods achieve a substantial reduction in interpretation error, up to 31.8 times lower when compared to alternative techniques.

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection
Cheng-Ju Ho Chen-Hsuan Tai Yen-Yu Lin Ming-Hsuan Yang Yi-Hsuan Tsai



Research question: how to improve the quality of pseudo-labels for semi-supervised 3D object detection.
Motivation: existing semi-supervised 3D object detection methods typically use a teacher-student framework with pseudo-labeling to exploit unlabeled point clouds, but producing reliable pseudo-labels in a diverse 3D space remains challenging.
Method: we propose Diffusion-SS3D, a new approach that enhances pseudo-label quality via a diffusion model. Specifically, we inject noise to produce corrupted 3D object size and class-label distributions, and then use the diffusion model as a denoising process to obtain bounding-box outputs. We further integrate the diffusion model into the teacher-student framework, so that the denoised bounding boxes improve pseudo-label generation and the entire semi-supervised learning process.
Results: experiments on the ScanNet and SUN RGB-D benchmarks show our method achieves state-of-the-art performance over existing methods; we also provide extensive analysis of how the diffusion-model design affects semi-supervised learning performance.

Semi-supervised object detection is crucial for 3D scene understanding, efficiently addressing the limitation of acquiring large-scale 3D bounding box annotations. Existing methods typically employ a teacher-student framework with pseudo-labeling to leverage unlabeled point clouds. However, producing reliable pseudo-labels in a diverse 3D space still remains challenging. In this work, we propose Diffusion-SS3D, a new perspective of enhancing the quality of pseudo-labels via the diffusion model for semi-supervised 3D object detection. Specifically, we include noises to produce corrupted 3D object size and class label distributions, and then utilize the diffusion model as a denoising process to obtain bounding box outputs. Moreover, we integrate the diffusion model into the teacher-student framework, so that the denoised bounding boxes can be used to improve pseudo-label generation, as well as the entire semi-supervised learning process. We conduct experiments on the ScanNet and SUN RGB-D benchmark datasets to demonstrate that our approach achieves state-of-the-art performance against existing methods. We also present extensive analysis to understand how our diffusion model design affects performance in semi-supervised learning. The source code will be available at https://github.com/luluho1208/Diffusion-SS3D.

Fed-CO$_{2}$: Cooperation of Online and Offline Models for Severe Data Heterogeneity in Federated Learning
Zhongyi Cai Ye Shi Wei Huang Jingya Wang



Research question: federated learning (FL) is a distributed learning paradigm, but its effectiveness depends heavily on data quality, particularly data heterogeneity issues such as label distribution skew and feature skew.
Motivation: prior work focuses mainly on label distribution skew, with relatively little attention to feature skew; moreover, the two forms of heterogeneity have not been well handled within a unified federated learning framework.
Method: we propose Fed-CO$_2$, a universal FL framework that handles both label distribution skew and feature skew through a cooperation mechanism between online and offline models. We also design two knowledge transfer mechanisms: an intra-client mechanism that reinforces mutual learning between the online and offline models, and an inter-client mechanism that increases the models' domain generalization ability.
Results: experiments show Fed-CO$_2$ outperforms a wide range of existing personalized federated learning algorithms in handling label distribution skew and feature skew, both individually and jointly.

Federated Learning (FL) has emerged as a promising distributed learning paradigm that enables multiple clients to learn a global model collaboratively without sharing their private data. However, the effectiveness of FL is highly dependent on the quality of the data that is being used for training. In particular, data heterogeneity issues, such as label distribution skew and feature skew, can significantly impact the performance of FL. Previous studies in FL have primarily focused on addressing label distribution skew data heterogeneity, while only a few recent works have made initial progress in tackling feature skew issues. Notably, these two forms of data heterogeneity have been studied separately and have not been well explored within a unified FL framework. To address this gap, we propose Fed-CO$_2$, a universal FL framework that handles both label distribution skew and feature skew within a Cooperation mechanism between the Online and Offline models. Specifically, the online model learns general knowledge that is shared among all clients, while the offline model is trained locally to learn the specialized knowledge of each individual client. To further enhance model cooperation in the presence of feature shifts, we design an intra-client knowledge transfer mechanism that reinforces mutual learning between the online and offline models, and an inter-client knowledge transfer mechanism to increase the models’ domain generalization ability. Extensive experiments show that our Fed-CO$_2$ outperforms a wide range of existing personalized federated learning algorithms in terms of handling label distribution skew and feature skew, both individually and collectively. The empirical results are supported by our convergence analyses in a simplified setting.

Fast Model DeBias with Machine Unlearning
Ruizhe Chen Jianfei Yang Huimin Xiong Jianhong Bai Tianxiang Hu Jin Hao YANG FENG Joey Tianyi Zhou Jian Wu Zuozhu Liu



Research question: deep neural networks may behave in a biased manner in real-world scenarios, e.g., with respect to gender or race.
Motivation: such biases not only jeopardize model robustness but also perpetuate and amplify social biases, threatening automated decision-making processes in domains such as healthcare and recruitment.
Method: we propose a fast model debiasing method (FMD) that identifies biased attributes through explicit counterfactual concepts, quantifies the influence of data samples with influence functions, and uses a machine-unlearning-based strategy to efficiently remove bias from a trained model with a small counterfactual dataset.
Results: experiments on the Colored MNIST, CelebA, and Adult Income datasets show the method achieves accuracy matching or exceeding state-of-the-art retraining-based methods while attaining significantly fewer biases at a much lower debiasing cost.

Recent discoveries have revealed that deep neural networks might behave in a biased manner in many real-world scenarios. For instance, deep networks trained on a large-scale face recognition dataset CelebA tend to predict blonde hair for females and black hair for males. Such biases not only jeopardize the robustness of models but also perpetuate and amplify social biases, which is especially concerning for automated decision-making processes in healthcare, recruitment, etc., as they could exacerbate unfair economic and social inequalities among different groups. Existing debiasing methods suffer from high costs in bias labeling or model re-training, while also exhibiting a deficiency in terms of elucidating the origins of biases within the model. To this respect, we propose a fast model debiasing method (FMD) which offers an efficient approach to identify, evaluate and remove biases inherent in trained models. The FMD identifies biased attributes through an explicit counterfactual concept and quantifies the influence of data samples with influence functions. Moreover, we design a machine unlearning-based strategy to efficiently and effectively remove the bias in a trained model with a small counterfactual dataset. Experiments on the Colored MNIST, CelebA, and Adult Income datasets demonstrate that our method achieves superior or competing classification accuracies compared with state-of-the-art retraining-based methods while attaining significantly fewer biases and requiring much less debiasing cost. Notably, our method requires only a small external dataset and updating a minimal amount of model parameters, without the requirement of access to training data that may be too large or unavailable in practice.

Robust Knowledge Transfer in Tiered Reinforcement Learning
Jiawei Huang Niao He



Research question: In the tiered reinforcement learning setting, how to transfer knowledge from the low-tier (source) task to the high-tier (target) task so as to reduce the latter's exploration risk while solving both tasks in parallel.
Motivation: Unlike prior work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and instead focus on robust knowledge transfer without prior knowledge of task similarity.
Method: We propose novel online learning algorithms: for the high-tier task, they achieve constant regret on partial states depending on task similarity and retain near-optimal regret when the two tasks are dissimilar; for the low-tier task, they remain near-optimal without sacrifice. We also study the setting with multiple low-tier tasks and propose a transfer source selection mechanism that ensembles information from all low-tier tasks.
Results: Under the identified "Optimal Value Dominance" condition, the proposed algorithms enjoy the above regret guarantees for both tiers, and the multi-source selection mechanism yields provable benefits on a much larger state-action space.

In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge transfer without prior knowledge of the task similarity. We identify a natural and necessary condition called the ``Optimal Value Dominance'' for our objective. Under this condition, we propose novel online learning algorithms such that, for the high-tier task, they can achieve constant regret on partial states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, they remain near-optimal without making sacrifices. Moreover, we further study the setting with multiple low-tier tasks, and propose a novel transfer source selection mechanism, which can ensemble the information from all low-tier tasks and allow provable benefits on a much larger state-action space.

All Points Matter: Entropy-Regularized Distribution Alignment for Weakly-supervised 3D Segmentation
Liyao Tang Zhe Chen Shanshan Zhao Chaoyue Wang Dacheng Tao



Research question: Pseudo-labels are widely used in weakly supervised 3D segmentation, where only sparse ground-truth labels are available, but existing methods may hinder the full exploitation of unlabeled data points.
Motivation: Pseudo-labels generated on unlabeled data are noisy, which can create significant discrepancies between pseudo-labels and model predictions and severely disturb model training.
Method: We propose a learning strategy that regularizes the generated pseudo-labels and effectively narrows the gap between pseudo-labels and model predictions; concretely, it introduces an entropy regularization loss and a distribution alignment loss, forming the ERDA learning strategy.
Results: Extensive experiments across various baselines and large-scale datasets show that ERDA effectively exploits all unlabeled data points for learning and achieves state-of-the-art performance under different settings.

Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning. Existing methods often rely on empirical label selection strategies, such as confidence thresholding, to generate beneficial pseudo-labels for model training. This approach may, however, hinder the comprehensive exploitation of unlabeled data points. We hypothesize that this selective usage arises from the noise in pseudo-labels generated on unlabeled data. The noise in pseudo-labels may result in significant discrepancies between pseudo-labels and model predictions, thus confusing and affecting the model training greatly. To address this issue, we propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for weakly supervised learning in 3D segmentation tasks, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, it reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation network and the 3D segmentation network simultaneously. Despite the simplicity, our method promisingly improves the performance. We validate the effectiveness through extensive experiments on various baselines and large-scale datasets. Results show that ERDA enables the effective usage of all unlabeled data points for learning and achieves state-of-the-art performance under different settings. Remarkably, our method can outperform fully-supervised baselines using only 1\% of true annotations. Code and model will be made publicly available at https://github.com/LiyaoTang/ERDA.
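
Since the KL-based alignment reduces to a cross-entropy-style loss, the core objective is easy to sketch. The following is a minimal PyTorch rendering assuming pseudo-labels and predictions are given as logits; the weight `lam` and the exact term placement are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def erda_loss(pseudo_logits, pred_logits, lam=1.0):
    """Sketch of the ERDA objective: entropy regularization on the pseudo-label
    distribution plus a KL-style alignment that reduces to cross-entropy
    between pseudo-labels and predictions (both optimized jointly)."""
    p = F.softmax(pseudo_logits, dim=-1)        # pseudo-label distribution
    log_q = F.log_softmax(pred_logits, dim=-1)  # model predictions
    entropy_reg = -(p * torch.log(p.clamp_min(1e-8))).sum(-1).mean()
    align = -(p * log_q).sum(-1).mean()         # cross-entropy form of the alignment
    return align + lam * entropy_reg
```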

FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space
Shengzhong Liu Tomoyoshi Kimura Dongxin Liu Ruijie Wang Jinyang Li Suhas Diggavi Mani Srivastava Tarek Abdelzaher



Research question: This paper proposes a new contrastive learning framework for extracting comprehensive features from multimodal time-series sensing signals via self-supervised training.
Motivation: Existing multimodal contrastive frameworks mostly rely on information shared across sensory modalities, without explicitly considering the modality-exclusive information that can be critical for understanding the underlying sensing physics; contrastive frameworks for time series also fail to handle temporal information locality properly.
Method: FOCAL addresses these challenges as follows. First, for multimodal time series, it encodes each modality into a factorized latent space consisting of mutually orthogonal shared and private features: the shared space emphasizes feature patterns consistent across sensory modalities via a modality-matching objective, while the private space extracts modality-exclusive information via a transformation-invariant objective. Second, we propose a temporal structural constraint on modality features so that the distance between temporally neighboring samples is no larger than that of temporally distant samples.
Results: Extensive evaluations on four multimodal sensing datasets with two backbone encoders and two classifiers demonstrate FOCAL's advantages: it consistently outperforms state-of-the-art baselines on downstream tasks with a clear margin under different ratios of available labels. The code and self-collected dataset are available at https://github.com/tomoyoshki/focal.

This paper proposes a novel contrastive learning framework, called FOCAL, for extracting comprehensive features from multimodal time-series sensing signals through self-supervised training. Existing multimodal contrastive frameworks mostly rely on the shared information between sensory modalities, but do not explicitly consider the exclusive modality information that could be critical to understanding the underlying sensing physics. Besides, contrastive frameworks for time series have not handled the temporal information locality appropriately. FOCAL solves these challenges by making the following contributions: First, given multimodal time series, it encodes each modality into a factorized latent space consisting of shared features and private features that are orthogonal to each other. The shared space emphasizes feature patterns consistent across sensory modalities through a modal-matching objective. In contrast, the private space extracts modality-exclusive information through a transformation-invariant objective. Second, we propose a temporal structural constraint for modality features, such that the average distance between temporally neighboring samples is no larger than that of temporally distant samples. Extensive evaluations are performed on four multimodal sensing datasets with two backbone encoders and two classifiers to demonstrate the superiority of FOCAL. It consistently outperforms the state-of-the-art baselines in downstream tasks with a clear margin, under different ratios of available labels. The code and self-collected dataset are available at https://github.com/tomoyoshki/focal.
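
Two of the ingredients, the orthogonality between shared and private features and the temporal structural constraint, can be sketched directly. A minimal PyTorch version assuming per-sample embedding vectors; the actual FOCAL objective also includes the modality-matching and transformation-invariant contrastive terms, which are omitted here.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(shared, private):
    """Push shared and private embeddings of the same sample toward orthogonality."""
    cos = (F.normalize(shared, dim=-1) * F.normalize(private, dim=-1)).sum(-1)
    return cos.pow(2).mean()

def temporal_rank_loss(z_t, z_near, z_far, margin=0.0):
    """Keep temporally neighboring samples no farther apart than distant ones."""
    d_near = (z_t - z_near).pow(2).sum(-1)
    d_far = (z_t - z_far).pow(2).sum(-1)
    return F.relu(d_near - d_far + margin).mean()
```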

A Bounded Ability Estimation for Computerized Adaptive Testing
Yan Zhuang Qi Liu GuanHao Zhao Zhenya Huang Weizhe Huang Zachary Pardos Enhong Chen Jinze Wu Xin Li



Research question: How to improve the accuracy of ability estimation in computerized adaptive testing (CAT).
Motivation: Existing CAT methods do not explicitly target ability estimation accuracy, as there are not enough responses to guarantee that the estimate converges to the true value.
Method: By analyzing the statistical properties of the estimate, we derive a theoretical approximation of the true ability based on full responses to the question bank. Building on this, we propose BECAT, a bounded ability estimation CAT framework in a data-summary manner, which selects a question subset that closely matches the gradient of the full responses; we design a simple greedy selection algorithm based on an expected gradient difference approximation, with rigorous theoretical and error upper-bound guarantees on its ability estimate.
Results: Experiments show the method reaches the same estimation accuracy with 15% fewer questions on average, significantly shortening test length.

Computerized adaptive testing (CAT), as a tool that can efficiently measure a student's ability, has been widely used in various standardized tests (e.g., GMAT and GRE). The adaptivity of CAT refers to the selection of the most informative questions for each student, reducing test length. Existing CAT methods do not explicitly target ability estimation accuracy since there is no student's true ability as ground truth; therefore, these methods cannot be guaranteed to make the estimate converge to the true value with such limited responses. In this paper, we analyze the statistical properties of estimation and find a theoretical approximation of the true ability: the ability estimated by full responses to the question bank. Based on this, a Bounded Ability Estimation framework for CAT (BECAT) is proposed in a data-summary manner, which selects a question subset that closely matches the gradient of the full responses. Thus, we develop an expected gradient difference approximation to design a simple greedy selection algorithm, and show the rigorous theoretical and error upper-bound guarantees of its ability estimate. Experiments on both real-world and synthetic datasets show that it can reach the same estimation accuracy using 15\% fewer questions on average, significantly reducing test length.
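
The data-summary selection can be illustrated with a small greedy routine. This sketch assumes each question contributes a per-item gradient vector and rescales the subset sum to approximate the full-bank gradient; the paper's algorithm uses an expected gradient difference approximation rather than this simplified distance.

```python
import numpy as np

def greedy_gradient_match(question_grads, k):
    """Greedily pick k questions whose rescaled gradient sum best approximates
    the gradient of the full question bank (a data-summary heuristic).
    question_grads: array of shape [n_questions, grad_dim]."""
    n = len(question_grads)
    full = question_grads.sum(axis=0)
    chosen, acc = [], np.zeros_like(full)
    remaining = list(range(n))
    for step in range(1, k + 1):
        scale = n / step  # rescale the subset sum to full-bank size
        best = min(remaining,
                   key=lambda i: np.linalg.norm(full - scale * (acc + question_grads[i])))
        chosen.append(best)
        acc += question_grads[best]
        remaining.remove(best)
    return chosen
```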

One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation
Zhiwei Hao Jianyuan Guo Kai Han Yehui Tang Han Hu Yunhe Wang Chang Xu



Research question: Most existing knowledge distillation methods assume homogeneous teacher and student models, and distillation between heterogeneous models works poorly.
Motivation: Comparing the features learned by heterogeneous teacher and student models via centered kernel alignment (CKA) reveals significant feature divergence, indicating that previous hint-based methods are ineffective for cross-architecture distillation.
Method: We propose OFA-KD, a simple yet effective one-for-all knowledge distillation framework that projects intermediate features into an aligned latent space (e.g., the logits space) where architecture-specific information is discarded, and introduce an adaptive target enhancement scheme to prevent the student from being disturbed by irrelevant information.
Results: Extensive experiments across various architectures (CNN, Transformer, MLP) demonstrate the superiority of the OFA-KD framework for distillation between heterogeneous architectures; student models equipped with OFA-KD improve markedly, with a maximum gain of 8.0% on CIFAR-100 and 0.7% on ImageNet-1K.

Knowledge distillation (KD) has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. However, most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family, particularly the hint-based approaches. By using centered kernel alignment (CKA) to compare the learned features between heterogeneous teacher and student models, we observe significant feature divergence. This divergence illustrates the ineffectiveness of previous hint-based methods in cross-architecture distillation. To tackle the challenge in distilling heterogeneous models, we propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures. Specifically, we project intermediate features into an aligned latent space such as the logits space, where architecture-specific information is discarded. Additionally, we introduce an adaptive target enhancement scheme to prevent the student from being disturbed by irrelevant information. Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD.
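
The core idea, matching in an architecture-agnostic logits space, is easy to sketch. A minimal PyTorch version follows; the projector architecture and temperature are illustrative choices, and the adaptive target enhancement scheme is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitsProjector(nn.Module):
    """Maps an intermediate student feature into the logits space, where
    architecture-specific structure is discarded, so it can be matched
    against the teacher's logits."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                  nn.Linear(feat_dim, num_classes))

    def forward(self, feat):  # feat: [batch, feat_dim], e.g. pooled
        return self.proj(feat)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Standard temperature-scaled KL distillation on projected logits."""
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau
```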

Fed-GraB: Federated Long-tailed Learning with Self-Adjusting Gradient Balancer
Zikai Xiao Zihan Chen Songshang Liu Hualiang Wang YANG FENG Jin Hao Joey Tianyi Zhou Jian Wu Howard Hao Yang Zuozhu Liu



Research question: How to perform federated learning over heterogeneous client-held datasets while preserving data privacy and handling long-tailed distributions.
Motivation: In many real-world tasks, data privacy and long-tailed distributions are the norm rather than the exception; when the datasets could be aggregated globally, they jointly exhibit a long-tailed distribution, but existing federated optimization and/or centralized long-tailed learning methods hardly apply due to the challenges of characterizing the global long-tailed distribution under privacy constraints and adjusting local learning strategies to cope with head-tail imbalance.
Method: We propose Fed-GraB, comprising a Self-adjusting Gradient Balancer (SGB) module and a Direct Prior Analyzer (DPA) module, which re-weights clients' gradients in a closed-loop manner based on feedback about the global long-tailed distribution.
Results: With Fed-GraB, clients can effectively alleviate the distribution drift caused by data heterogeneity during model training and obtain a global model that performs better on minority classes while maintaining majority-class performance. Extensive experiments show that Fed-GraB achieves state-of-the-art performance on representative datasets such as CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist.

Data privacy and long-tailed distribution are the norms rather than the exception in many real-world tasks. This paper investigates a federated long-tailed learning (Fed-LT) task in which each client holds a locally heterogeneous dataset; if the datasets can be globally aggregated, they jointly exhibit a long-tailed distribution. Under such a setting, existing federated optimization and/or centralized long-tailed learning methods hardly apply due to challenges in (a) characterizing the global long-tailed distribution under privacy constraints and (b) adjusting the local learning strategy to cope with the head-tail imbalance. In response, we propose a method termed $\texttt{Fed-GraB}$, comprised of a Self-adjusting Gradient Balancer (SGB) module that re-weights clients' gradients in a closed-loop manner, based on the feedback of global long-tailed distribution evaluated by a Direct Prior Analyzer (DPA) module. Using $\texttt{Fed-GraB}$, clients can effectively alleviate the distribution drift caused by data heterogeneity during the model training process and obtain a global model with better performance on the minority classes while maintaining the performance of the majority classes. Extensive experiments demonstrate that $\texttt{Fed-GraB}$ achieves state-of-the-art performance on representative datasets such as CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist.

GALOPA: Graph Transport Learning with Optimal Plan Alignment
Yejiang Wang Yuhai Zhao Daniel Zhengkui Wang Ling Li



Research question: This paper addresses two challenges in graph contrastive learning: finding label-invariant augmented graphs and determining the exact degree of similarity to enforce between sample pairs.
Motivation: Existing graph contrastive learning methods struggle with both of these challenges.
Method: We propose an alternative self-supervised solution that requires no distinction between positive and negative samples, calibrates the encoder to preserve both the structural information within a graph and the matching information between different graphs, and learns isometric embeddings that preserve inter-graph distances.
Results: Experiments show the scheme significantly outperforms the counterpart strategy based on transport distance, maintains robust results even under high perturbation rates, and is validated across various benchmarks.

Self-supervised learning on graphs aims to learn graph representations in an unsupervised manner. While graph contrastive learning (GCL - relying on graph augmentation for creating perturbation views of anchor graphs and maximizing/minimizing similarity for positive/negative pairs) is a popular self-supervised method, it faces challenges in finding label-invariant augmented graphs and determining the exact extent of similarity between sample pairs to be achieved. In this work, we propose an alternative self-supervised solution that (i) goes beyond the label invariance assumption without distinguishing between positive/negative samples, (ii) can calibrate the encoder for preserving not only the structural information inside the graph, but also the matching information between different graphs, (iii) learns isometric embeddings that preserve the distance between graphs, a by-product of our objective. Motivated by optimal transport theory, this scheme relies on an observation that the optimal transport plans between node representations at the output space, which measure the matching probability between two distributions, should be consistent with the plans between the corresponding graphs at the input space. The experimental findings include: (i) The plan alignment strategy significantly outperforms the counterpart using the transport distance; (ii) The proposed model shows superior performance using only node attributes as calibration signals, without relying on edge information; (iii) Our model maintains robust results even under high perturbation rates; (iv) Extensive experiments on various benchmarks validate the effectiveness of the proposed method.

ResMem: Learn what you can and memorize the rest
Zitong Yang Michal Lukasik Vaishnavh Nagarajan Zonglin Li Ankit Singh Rawat Manzil Zaheer Aditya Krishna Menon Sanjiv Kumar



Research question: How to improve model generalization through explicit memorization.
Motivation: The impressive generalization of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns; inspired by this, we explore a new route to improving generalization via explicit memorization.
Method: We propose the residual-memorization (ResMem) algorithm, which augments an existing prediction model (e.g., a neural network) by fitting its residuals with a nearest-neighbor-based regressor; the final prediction is the sum of the original model and the fitted residual regressor.
Results: Experiments show ResMem consistently improves the test-set generalization of the original prediction model on standard vision and natural language processing benchmarks.

The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g., a neural network) by fitting the model's residuals with a nearest-neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. We start by formulating a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over a base linear neural network. Then, we empirically show that ResMem consistently improves the test set generalization of the original prediction model across standard vision and natural language processing benchmarks.
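
The algorithm is simple enough to sketch end to end. Below is a minimal, self-contained version on synthetic data, with an ordinary least-squares model standing in for the neural network.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# ResMem-style sketch: fit a kNN regressor to the base model's residuals
# and add its output back at prediction time.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + 0.5 * np.sin(3 * X[:, 0])  # nonlinear residual structure

w, *_ = np.linalg.lstsq(X, y, rcond=None)          # stand-in "base model" (linear)
residuals = y - X @ w
memory = KNeighborsRegressor(n_neighbors=5).fit(X, residuals)

def resmem_predict(X_new):
    """Final prediction = base model output + memorized residual correction."""
    return X_new @ w + memory.predict(X_new)
```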

Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation
Fei Zhang Tianfei Zhou Boyang Li Hao He Chaofan Ma Tianjiao Zhang Jiangchao Yao Ya Zhang Yanfeng Wang



Research question: This paper studies weakly open-vocabulary semantic segmentation (WOVSS): segmenting objects of arbitrary classes using only image-text pairs.
Motivation: Existing methods enhance the vanilla vision transformer with explicit grouping recognition, but they suffer from a granularity inconsistency in the use of group tokens, which are aligned all-to-one during training yet one-to-one during inference.
Method: To resolve this granularity inconsistency, we explicitly supervise the group tokens with prototypical knowledge; specifically, we propose non-learnable prototypical regularization (NPR), where non-learnable prototypes estimated from source features serve as supervision and enable contrastive matching of the group tokens.
Results: Experiments show the proposed method achieves state-of-the-art performance on several benchmark datasets.

This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS), which learns to segment objects of arbitrary classes using mere image-text pairs. Existing works turn to enhance the vanilla vision transformer by introducing explicit grouping recognition, i.e., employing several group tokens/centroids to cluster the image tokens and perform the group-text alignment. Nevertheless, these methods suffer from a granularity inconsistency regarding the usage of group tokens, which are aligned in the all-to-one v.s. one-to-one manners during the training and inference phases, respectively. We argue that this discrepancy arises from the lack of elaborate supervision for each group token. To bridge this granularity gap, this paper explores explicit supervision for the group tokens from the prototypical knowledge. To this end, this paper proposes the non-learnable prototypical regularization (NPR) where non-learnable prototypes are estimated from source features to serve as supervision and enable contrastive matching of the group tokens. This regularization encourages the group tokens to segment objects with less redundancy and capture more comprehensive semantic regions, leading to increased compactness and richness. Based on NPR, we propose the prototypical guidance segmentation network (PGSeg) that incorporates multi-modal regularization by leveraging prototypical sources from both images and texts at different levels, progressively enhancing the segmentation capability with diverse prototypical patterns. Experimental results show that our proposed method achieves state-of-the-art performance on several benchmark datasets.

A Unified Approach to Domain Incremental Learning with Memory: Theory and Algorithm
Haizhou Shi Hao Wang



Research question: This paper addresses domain incremental learning: adapting to a sequence of domains with access to only a small subset of data (i.e., memory) from previous domains.
Motivation: Although many methods have been proposed for this problem, it remains unclear how they relate to one another and which one practitioners should choose.
Method: We propose a unified framework, Unified Domain Incremental Learning (UDIL), for domain incremental learning with memory; UDIL unifies various existing methods, and our theoretical analysis shows that UDIL always achieves a tighter generalization error bound than these methods.
Results: Experiments show that UDIL outperforms state-of-the-art domain incremental learning methods on both synthetic and real-world datasets.

Domain incremental learning aims to adapt to a sequence of domains with access to only a small subset of data (i.e., memory) from previous domains. Various methods have been proposed for this problem, but it is still unclear how they are related and when practitioners should choose one method over another. In response, we propose a unified framework, dubbed Unified Domain Incremental Learning (UDIL), for domain incremental learning with memory. Our UDIL **unifies** various existing methods, and our theoretical analysis shows that UDIL always achieves a tighter generalization error bound compared to these methods. The key insight is that different existing methods correspond to our bound with different **fixed** coefficients; based on insights from this unification, our UDIL allows **adaptive** coefficients during training, thereby always achieving the tightest bound. Empirical results show that our UDIL outperforms the state-of-the-art domain incremental learning methods on both synthetic and real-world datasets. Code will be available at https://github.com/Wang-ML-Lab/unified-continual-learning.

UniT: A Unified Look at Certified Robust Training against Text Adversarial Perturbation
Muchao Ye Ziyi Yin Tianrong Zhang Tianyu Du Jinghui Chen Ting Wang Fenglong Ma



Research question: Certified robust training pipelines against text adversarial perturbations (e.g., synonym substitutions) have surged in recent years, but existing methods provide prediction certificates either in the discrete word space or in the continuous latent space, with a structural gap between the two.
Motivation: Existing training frameworks need unification to provide stronger robustness guarantees; moreover, they mainly focus on building the certification process while neglecting to improve the robustness of the base model.
Method: To address these issues, we propose UniT, a unified framework that can train flexibly in either fashion by working in the word embedding space, obtaining a stronger robustness guarantee directly from that space without extra modules. We also introduce a decoupled regularization (DR) loss to improve the robustness of the base model, comprising two separate robustness regularization terms for the feature extraction and classifier modules.
Results: Experiments on widely used text classification datasets demonstrate the effectiveness of the unified framework and the proposed DR loss in improving certified robust accuracy.

Recent years have witnessed a surge of certified robust training pipelines against text adversarial perturbation constructed by synonym substitutions. Given a base model, existing pipelines provide prediction certificates either in the discrete word space or the continuous latent space. However, they are isolated from each other with a structural gap. We observe that existing training frameworks need unification to provide stronger certified robustness. Additionally, they mainly focus on building the certification process but neglect to improve the robustness of the base model. To mitigate the aforementioned limitations, we propose a unified framework named UniT that enables us to train flexibly in either fashion by working in the word embedding space. It can provide a stronger robustness guarantee obtained directly from the word embedding space without extra modules. In addition, we introduce the decoupled regularization (DR) loss to improve the robustness of the base model, which includes two separate robustness regularization terms for the feature extraction and classifier modules. Experimental results on widely used text classification datasets further demonstrate the effectiveness of the designed unified framework and the proposed DR loss for improving the certified robust accuracy.

Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes
Yishi Xu Jianqiao Sun Yudi Su Xinyang Liu Zhibin Duan Bo Chen Mingyuan Zhou



Research question: Current embedding-based neural topic models excel at low-resource topic modeling, but they usually ignore the dynamically changing nature of word meanings across contexts, resulting in poor adaptation to new tasks with unfamiliar contexts.
Motivation: To address this, the paper proposes an effective method that adaptively generates semantically tailored word embeddings for each task by fully exploiting contextual information.
Method: We first condense the contextual syntactic dependencies of words into a semantic graph for each task, which is then modeled by a variational graph auto-encoder to produce task-specific word representations. On this basis, we impose a learnable Gaussian mixture prior on the latent word space to efficiently learn topic representations from a clustering perspective, aiding diverse topic discovery and fast adaptation to novel tasks.
Results: Extensive quantitative and qualitative experiments show that the method comprehensively outperforms established topic models.

Embedding-based neural topic models have turned out to be a superior option for low-resourced topic modeling. However, current approaches consider static word embeddings learnt from source tasks as general knowledge that can be transferred directly to the target task, discounting the dynamically changing nature of word meanings in different contexts, thus typically leading to sub-optimal results when adapting to new tasks with unfamiliar contexts. To settle this issue, we provide an effective method that centers on adaptively generating semantically tailored word embeddings for each task by fully exploiting contextual information. Specifically, we first condense the contextual syntactic dependencies of words into a semantic graph for each task, which is then modeled by a Variational Graph Auto-Encoder to produce task-specific word representations. On this basis, we further impose a learnable Gaussian mixture prior on the latent space of words to efficiently learn topic representations from a clustering perspective, which contributes to diverse topic discovery and fast adaptation to novel tasks. We have conducted a wealth of quantitative and qualitative experiments, and the results show that our approach comprehensively outperforms established topic models.

Towards Efficient Pre-Trained Language Model via Feature Correlation Distillation
Kun Huang Xin Guo Meng Wang



Research question: How to effectively transfer knowledge from large pre-trained language models to student models.
Motivation: Existing knowledge distillation methods mainly focus on directly aligning the output features of transformer blocks, which may impose overly strict constraints on the student's learning process and complicate training with extra parameters and computational cost.
Method: We propose a new approach that builds relations directly from output features, introducing token-level and sequence-level relations concurrently to fully exploit the teacher's knowledge; we further propose a correlation-based distillation loss to alleviate the exact-match property inherent in traditional KL-divergence or MSE losses.
Results: Extensive experiments show that our distilled, smaller language models significantly surpass existing knowledge distillation methods across various NLP tasks.

Knowledge Distillation (KD) has emerged as a promising approach for compressing large Pre-trained Language Models (PLMs). The performance of KD relies on how to effectively formulate and transfer the knowledge from the teacher model to the student model. Prior arts mainly focus on directly aligning output features from the transformer block, which may impose overly strict constraints on the student model's learning process and complicate the training process by introducing extra parameters and computational cost. Moreover, our analysis indicates that the different relations within self-attention, as adopted in other works, involve higher computational complexity and can easily be constrained by the number of heads, potentially leading to suboptimal solutions. To address these issues, we propose a novel approach that builds relationships directly from output features. Specifically, we introduce token-level and sequence-level relations concurrently to fully exploit the knowledge from the teacher model. Furthermore, we propose a correlation-based distillation loss to alleviate the exact-match property inherent in traditional KL divergence or MSE loss functions. Our method, dubbed FCD, presents a simple yet effective method to compress various architectures (BERT, RoBERTa, and GPT) and model sizes (base-size and large-size). Extensive experimental results demonstrate that our distilled, smaller language models significantly surpass existing KD methods across various NLP tasks.

Towards Hybrid-grained Feature Interaction Selection for Deep Sparse Network
Fuyuan Lyu Xing Tang Dugang Liu Chen Ma Weihong Luo Liang Chen xiuqiang He Xue Liu



Research question: How to select feature interactions effectively in deep sparse networks, particularly at a fine granularity.
Motivation: Existing methods mainly search for feature interactions in a coarse-grained space, paying little attention to fine-grained feature interaction selection.
Method: We propose a hybrid-grained feature interaction selection approach that targets both feature fields and feature values, exploring this expansive space via a decomposed space computed on the fly, and develop a selection algorithm named OptFeature that efficiently selects interactions from feature fields and feature values simultaneously.
Results: Experiments on three large real-world benchmark datasets show that OptFeature performs well in terms of accuracy and efficiency.

Deep sparse networks are widely investigated as a neural network architecture for prediction tasks with high-dimensional sparse features, with which feature interaction selection is a critical component. While previous methods primarily focus on how to search feature interaction in a coarse-grained space, less attention has been given to a finer granularity. In this work, we introduce a hybrid-grained feature interaction selection approach that targets both feature field and feature value for deep sparse networks. To explore such an expansive space, we propose a decomposed space which is calculated on the fly. We then develop a selection algorithm called OptFeature, which efficiently selects the feature interaction from both the feature field and the feature value simultaneously. Results from experiments on three large real-world benchmark datasets demonstrate that OptFeature performs well in terms of accuracy and efficiency. Additional studies support the feasibility of our method. All source code is publicly available\footnote{https://anonymous.4open.science/r/OptFeature-Anonymous}.

Test-Time Distribution Normalization for Contrastively Learned Visual-language Models
Yifei Zhou Juntao Ren Fengyu Li Ramin Zabih Ser-Nam Lim



Research question: In visual-language contrastive learning, performing downstream applications with only the dot product between image and text representations can lose information.
Motivation: To address this, the paper proposes Distribution Normalization (DN), a new test-time augmentation method.
Method: DN approximates the mean representation of a batch of test samples and treats it as an analogue of the negative samples in the InfoNCE loss; it requires no retraining or fine-tuning and can be applied effortlessly at inference.
Results: Extensive experiments show that DN clearly outperforms other existing test-time augmentation methods across a wide variety of downstream tasks.

Advances in the field of visual-language contrastive learning have made it possible for many downstream applications to be carried out efficiently and accurately by simply taking the dot product between image and text representations. One of the most representative approaches proposed recently known as CLIP has quickly garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper however reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information during test-time. Intuitively, since the model has been optimized based on the InfoNCE loss, test-time procedures should ideally also be in alignment. The question lies in how one can retrieve any semblance of negative samples information during inference in a computationally efficient way. We propose Distribution Normalization (DN), where we approximate the mean representation of a batch of test samples and use such a mean to represent what would be analogous to negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be effortlessly applied during inference. Extensive experiments on a wide variety of downstream tasks exhibit a clear advantage of DN over the dot product on top of other existing test-time augmentation methods.
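
The test-time procedure can be sketched in a few lines. This assumes embeddings are already extracted and centers each modality by a fraction of its test-batch mean before the dot product; the exact centering scheme and the coefficient `lam` are illustrative assumptions, not the paper's precise formula.

```python
import torch

def dn_similarity(image_emb, text_emb, lam=0.5):
    """Distribution Normalization sketch: subtract (a fraction of) the mean
    test-batch embedding from each modality before the dot product, so the
    score carries a first-order stand-in for InfoNCE negatives."""
    img = image_emb - lam * image_emb.mean(dim=0, keepdim=True)
    txt = text_emb - lam * text_emb.mean(dim=0, keepdim=True)
    return img @ txt.T  # [num_images, num_texts] similarity matrix
```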

RanPAC: Random Projections and Pre-trained Models for Continual Learning
Mark McDonnell Dong Gong Amin Parvaneh Ehsan Abbasnejad Anton van den Hengel



Research question: This paper addresses forgetting in continual learning (CL), especially when doing incremental learning with pre-trained models.
Motivation: Most existing CL methods focus on the learning-from-scratch paradigm and overlook the potential of pre-trained models across tasks; existing pre-trained-model-based CL methods either face large feature distribution gaps or are prone to forgetting.
Method: We propose a concise and effective CL approach based on pre-trained models: a frozen random projection layer with nonlinear activation is inserted between the pre-trained model's feature representations and the output head, capturing feature interactions with expanded dimensionality and improving the linear separability of class-prototype-based CL. We also demonstrate the importance of class-prototype decorrelation for reducing the distribution disparity when using pre-trained representations.
Results: Compared with previous CL methods applied to pre-trained ViT-B/16 models, the approach reduces final error rates by 10% to 62% on seven class-incremental benchmark datasets despite using no rehearsal memory, indicating that the potential of pre-trained models for simple, effective, and fast continual learning has not yet been fully tapped.

Continual learning (CL) aims to incrementally learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones. Most CL works focus on tackling catastrophic forgetting under a learning-from-scratch paradigm. However, with the increasing prominence of foundation models, pre-trained models equipped with informative representations have become available for various downstream requirements. Several CL methods based on pre-trained models have been explored, either utilizing pre-extracted features directly (which makes bridging distribution gaps challenging) or incorporating adaptors (which may be subject to forgetting). In this paper, we propose a concise and effective approach for CL with pre-trained models. Given that forgetting occurs during parameter updating, we contemplate an alternative approach that exploits training-free random projectors and class-prototype accumulation, which thus bypasses the issue. Specifically, we inject a frozen Random Projection layer with nonlinear activation between the pre-trained model's feature representations and output head, which captures interactions between features with expanded dimensionality, providing enhanced linear separability for class-prototype-based CL. We also demonstrate the importance of decorrelating the class-prototypes to reduce the distribution disparity when using pre-trained representations. These techniques prove to be effective and circumvent the problem of forgetting for both class- and domain-incremental continual learning. Compared to previous methods applied to pre-trained ViT-B/16 models, we reduce final error rates by between 10% and 62% on seven class-incremental benchmark datasets, despite not using any rehearsal memory. We conclude that the full potential of pre-trained models for simple, effective, and fast continual learning has not hitherto been fully tapped. Code is available at https://github.com/RanPAC/RanPAC.
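
The random-projection-plus-prototypes pipeline can be sketched as follows. A minimal PyTorch version that omits the prototype decorrelation step the paper also uses; dimensions are illustrative.

```python
import torch

torch.manual_seed(0)
feat_dim, proj_dim, num_classes = 768, 10000, 100

W = torch.randn(feat_dim, proj_dim)  # frozen random projection, never trained

def project(features):
    """Nonlinearly expand frozen pre-trained features."""
    return torch.relu(features @ W)

# Class prototypes accumulated incrementally; no rehearsal memory needed.
prototypes = torch.zeros(num_classes, proj_dim)
counts = torch.zeros(num_classes)

def update_prototypes(features, labels):
    h = project(features)
    for c in labels.unique():
        mask = labels == c
        counts[c] += mask.sum()
        prototypes[c] += h[mask].sum(dim=0)

def classify(features):
    h = project(features)
    means = prototypes / counts.clamp_min(1).unsqueeze(1)
    return (h @ means.T).argmax(dim=1)  # nearest class prototype by inner product
```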

An Empirical Study Towards Prompt-Tuning for Graph Contrastive Pre-Training in Recommendations
Haoran Yang Xiangyu Zhao Yicong Li Hongxu Chen Guandong Xu



Research question: This paper addresses the inconsistency between pre-training and downstream objectives when applying graph contrastive learning (GCL) to recommender systems.
Motivation: Current GCL methods for recommendation usually combine the contrastive loss with the downstream recommendation objective into an overall objective function, which is inconsistent with the original GCL paradigm, where pre-training involves no downstream training objective.
Method: We propose CPTPP, a prompt-enhanced framework for GCL-based recommender systems that fully leverages the advantages of the original GCL protocol through prompt tuning. Specifically, user profiles are first summarized to automatically generate personalized user prompts, which are then combined with pre-trained user embeddings for prompt-tuning on downstream tasks, narrowing the gap between pre-training and downstream objectives.
Results: Extensive experiments on three benchmark datasets demonstrate the effectiveness of CPTPP against state-of-the-art baselines; further visualization shows that user embeddings generated by CPTPP have a more uniform distribution, indicating a better capacity to model the diversity of user preferences.

Graph contrastive learning (GCL) has emerged as a potent technology for numerous graph learning tasks. It has been successfully applied to real-world recommender systems, where the contrastive loss and the downstream recommendation objectives are always combined to form the overall objective function. Such a strategy is inconsistent with the original GCL paradigm, where graph embeddings are pre-trained without involving downstream training objectives. In this paper, we innovatively propose a prompt-enhanced framework for GCL-based recommender systems, namely CPTPP, which can fully leverage the advantages of the original GCL protocol through prompt tuning. Specifically, we first summarise user profiles in graph recommender systems to automatically generate personalized user prompts. These prompts will then be combined with pre-trained user embeddings to conduct prompt-tuning in downstream tasks, thereby narrowing the distinct targets between pre-training and downstream tasks. Extensive experiments on three benchmark datasets validate the effectiveness of CPTPP against state-of-the-art baselines. A further visualization experiment demonstrates that user embeddings generated by CPTPP have a more uniform distribution, indicating a better capacity to model the diversity of user preferences. The implementation code is available online to ease reproducibility: https://anonymous.4open.science/r/CPTPP-F8F4

Synthetic Experience Replay
Cong Lu Philip J. Ball Yee Whye Teh Jack Parker-Holder



Research question: How to make full use of limited data when training deep reinforcement learning agents.
Motivation: Deep reinforcement learning requires collecting large amounts of data, but data acquisition is difficult, which limits its development.
Method: We propose Synthetic Experience Replay (SynthER), which flexibly upsamples an agent's collected experience with a generative model.
Results: In both offline and online settings, and in both proprioceptive and pixel-based environments, SynthER significantly improves training performance and sample efficiency.

A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to flexibly upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings, in both proprioceptive and pixel-based environments. In offline settings, we observe drastic improvements when upsampling small offline datasets and see that additional synthetic data also allows us to effectively train larger networks. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a significant increase in sample efficiency, without any algorithmic changes. We believe that synthetic training data could open the door to realizing the full potential of deep learning for replay-based RL algorithms from limited data. Finally, we open-source our code at https://github.com/conglu1997/SynthER.

Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
Zhu Wang Sourav Medya Sathya N. Ravi



Research question: How to effectively combine pre-trained deep networks with external semantic knowledge while reducing model size and computational cost.
Motivation: Current deep learning models may fail to capture important semantic information and implicit dependencies within datasets when handling unseen data.
Method: We propose a simplified approach that combines features from pre-trained deep networks with freely available explicit semantic knowledge, and introduce a differentiable out-of-distribution (OOD) detection layer to remove irrelevant explicit knowledge that does not correspond well to the image.
Results: Experiments show the approach can match state-of-the-art results with significantly fewer samples and less training time.

Deep network models are often purely inductive during both training and inference on unseen data. When these models are used for prediction, they may fail to capture important semantic information and implicit dependencies within datasets. Recent advancements have shown that combining multiple modalities in large-scale vision and language settings can improve understanding and generalization performance. However, as the model size increases, fine-tuning and deployment become computationally expensive, even for a small number of downstream tasks. Moreover, it is still unclear how domain or prior modal knowledge can be specified in a backpropagation friendly manner, especially in large-scale and noisy settings. To address these challenges, we propose a simplified alternative of combining features from pretrained deep networks and freely available semantic explicit knowledge. In order to remove irrelevant explicit knowledge that does not correspond well to the images, we introduce an implicit Differentiable Out-of-Distribution (OOD) detection layer. This layer addresses outlier detection by solving for fixed points of a differentiable function and using the last iterate of the fixed-point solver to backpropagate. In practice, we apply our model on several vision and language downstream tasks including visual question answering, visual reasoning, and image-text retrieval on different datasets. Our experiments show that it is possible to design models that perform similarly to state-of-the-art results but with significantly fewer samples and less training time. Our models and code are available here: https://github.com/ellenzhuwang/implicit_vkood
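
The "backpropagate through the last iterate" trick can be sketched generically. A minimal PyTorch version with a toy contraction, where only the final application of `f` is on the autograd tape:

```python
import torch

def fixed_point_layer(f, z0, n_iters=50):
    """Differentiable fixed-point sketch: iterate z = f(z) without tracking
    gradients, then take one final tracked step so backprop flows through
    the last iterate only."""
    z = z0
    with torch.no_grad():
        for _ in range(n_iters - 1):
            z = f(z)
    return f(z)  # only this application is on the autograd tape

# Toy usage: f is a contraction with fixed point z* = 2x.
x = torch.randn(4, requires_grad=True)
f = lambda z: 0.5 * z + x
z_star = fixed_point_layer(f, torch.zeros_like(x))
z_star.sum().backward()  # gradients reach x through the last iterate
```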

Structured Federated Learning through Clustered Additive Modeling
Jie Ma Tianyi Zhou Guodong Long Jing Jiang Chengqi Zhang



Research question: Heterogeneous federated learning without assuming any structure is challenging due to conflicts among the non-identical data distributions of clients.
Motivation: In practice, clients often comprise near-homogeneous clusters, so training one server-side model per cluster mitigates the conflicts. However, FL with client clustering often suffers from "clustering collapse", where one cluster's model excels on an increasing number of clients and the setup reduces to single-model FL; moreover, cluster-wise models hinder knowledge sharing between clusters, and each model depends on fewer clients.
Method: We propose Clustered Additive Modeling (CAM), which applies a globally shared model $\Theta_g$ on top of the cluster-wise models $\Theta_{1:K}$, i.e., $y=h(x;\Theta_g)+f(x;\Theta_k)$ for clients of cluster $k$. The global model captures the features shared by all clusters, so $\Theta_{1:K}$ are enforced to focus on the differences among clusters. To train CAM, we develop a novel Fed-CAM algorithm that alternates between client clustering and training the global/cluster models to predict each other's residuals.
Results: Any existing clustered FL method can easily be modified with CAM, significantly improving performance without clustering collapse in different non-IID settings; we also provide a convergence analysis of the Fed-CAM algorithm.

Heterogeneous federated learning without assuming any structure is challenging due to the conflicts among non-identical data distributions of clients. In practice, clients often comprise near-homogeneous clusters so training a server-side model per cluster mitigates the conflicts. However, FL with client clustering often suffers from ``clustering collapse'', i.e., one cluster's model excels on increasing clients, and reduces to single-model FL. Moreover, cluster-wise models hinder knowledge sharing between clusters and each model depends on fewer clients. Furthermore, the static clustering assumption on data may not hold for dynamically changing models, which are sensitive to cluster imbalance/initialization or outliers. To address these challenges, we propose ``Clustered Additive Modeling (CAM)'', which applies a globally shared model $\Theta_g$ on top of the cluster-wise models $\Theta_{1:K}$, i.e., $y=h(x;\Theta_g)+f(x;\Theta_k)$ for clients of cluster-$k$. The global model captures the features shared by all clusters so $\Theta_{1:K}$ are enforced to focus on the difference among clusters. To train CAM, we develop a novel Fed-CAM algorithm that alternates between client clustering and training global/cluster models to predict the residual of each other. We can easily modify any existing clustered FL methods by CAM and significantly improve their performance without ``clustering collapse'' in different non-IID settings. We also provide a convergence analysis of the Fed-CAM algorithm.
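
The additive decomposition itself is a one-liner in code. A minimal PyTorch sketch with linear heads standing in for $h$ and $f$:

```python
import torch
import torch.nn as nn

class ClusteredAdditiveModel(nn.Module):
    """CAM sketch: y = h(x; theta_g) + f(x; theta_k) for a client in cluster k.
    One global network shared by all clients plus one residual head per cluster."""
    def __init__(self, in_dim, out_dim, num_clusters):
        super().__init__()
        self.global_net = nn.Linear(in_dim, out_dim)
        self.cluster_nets = nn.ModuleList(
            nn.Linear(in_dim, out_dim) for _ in range(num_clusters))

    def forward(self, x, k):
        return self.global_net(x) + self.cluster_nets[k](x)
```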

IDEA: An Invariant Perspective for Efficient Domain Adaptive Image Retrieval
Haixin Wang Hao Wu Jinan Sun Shikun Zhang Chong Chen Xian-Sheng Hua Xiao Luo



Research question: This paper addresses unsupervised domain adaptive hashing: leveraging knowledge from a label-rich source domain to expedite learning to hash on a label-scarce target domain.
Motivation: Although existing approaches attempt to incorporate transfer learning techniques into deep hashing frameworks, they often neglect the essential invariance needed for adequate alignment between the two domains; worse yet, they fail to distinguish the causal and non-causal effects embedded in images, rendering cross-domain retrieval ineffective.
Method: We propose the Invariance-acquired Domain AdaptivE HAshing (IDEA) model. Each image is first decomposed into a causal feature representing label information and a non-causal feature indicating domain information; discriminative hash codes are then generated using consistency learning on both source and target domains. More importantly, a generative model produces synthetic samples to simulate interventions of various non-causal effects, ultimately minimizing their impact on hash codes to achieve domain invariance.
Results: Comprehensive experiments on benchmark datasets validate the superior performance of IDEA compared with a variety of competitive baselines.

In this paper, we investigate the problem of unsupervised domain adaptive hashing, which leverages knowledge from a label-rich source domain to expedite learning to hash on a label-scarce target domain. Although numerous existing approaches attempt to incorporate transfer learning techniques into deep hashing frameworks, they often neglect the essential invariance for adequate alignment between these two domains. Worse yet, these methods fail to distinguish between causal and non-causal effects embedded in images, rendering cross-domain retrieval ineffective. To address these challenges, we propose an Invariance-acquired Domain AdaptivE HAshing (IDEA) model. Our IDEA first decomposes each image into a causal feature representing label information, and a non-causal feature indicating domain information. Subsequently, we generate discriminative hash codes using causal features with consistency learning on both source and target domains. More importantly, we employ a generative model for synthetic samples to simulate the intervention of various non-causal effects, ultimately minimizing their impact on hash codes for domain invariance. Comprehensive experiments conducted on benchmark datasets validate the superior performance of our IDEA compared to a variety of competitive baselines.

Beyond probability partitions: Calibrating neural networks with semantic aware grouping
Jia-Qi Yang De-Chuan Zhan Le Gan



Research question: Deep networks tend to be overly optimistic about their predictions, leading to an underestimation of prediction errors.
Motivation: Due to the limited nature of data, existing studies have proposed various methods based on model prediction probabilities to bin the data and evaluate calibration error.
Method: We propose a more general definition of calibration error, Partitioned Calibration Error (PCE), revealing that the key difference among calibration error metrics lies in how the data space is partitioned; through semantically related partition functions, we show that the relationship between model accuracy and calibration lies in the granularity of the partition function.
Results: We jointly learn a semantic-aware grouping function on deep model features and logits to partition the data space into subsets, then learn a separate calibration function for each subset. Experiments show significant performance improvements across multiple datasets and network architectures, highlighting the importance of the partition function for calibration.

Research has shown that deep networks tend to be overly optimistic about their predictions, leading to an underestimation of prediction errors. Due to the limited nature of data, existing studies have proposed various methods based on model prediction probabilities to bin the data and evaluate calibration error. We propose a more generalized definition of calibration error called Partitioned Calibration Error (PCE), revealing that the key difference among these calibration error metrics lies in how the data space is partitioned. We put forth an intuitive proposition that an accurate model should be calibrated across any partition, suggesting that the input space partitioning can extend beyond just the partitioning of prediction probabilities, and include partitions directly related to the input. Through semantic-related partitioning functions, we demonstrate that the relationship between model accuracy and calibration lies in the granularity of the partitioning function. This highlights the importance of partitioning criteria for training a calibrated and accurate model. To validate the aforementioned analysis, we propose a method that involves jointly learning a semantic aware grouping function based on deep model features and logits to partition the data space into subsets. Subsequently, a separate calibration function is learned for each subset. Experimental results demonstrate that our approach achieves significant performance improvements across multiple datasets and network architectures, thus highlighting the importance of the partitioning function for calibration.
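
PCE is straightforward to compute once a partition is fixed. A minimal NumPy sketch follows; passing confidence bins as `groups` recovers the usual binned calibration error, while any semantic grouping can be plugged in instead.

```python
import numpy as np

def partitioned_calibration_error(confidences, correct, groups):
    """PCE sketch: within each subset induced by a partition function, compare
    mean confidence with empirical accuracy; average the gaps weighted by
    subset size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    groups = np.asarray(groups)
    pce, n = 0.0, len(confidences)
    for g in np.unique(groups):
        mask = groups == g
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        pce += mask.sum() / n * gap
    return pce
```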

IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers
Zhenglin Huang Xiaoan Bao Na Zhang Qingqi Zhang Xiao mei Tu Biao Wu Xi Yang



Research question: How to improve robustness under data distribution shift while maintaining accuracy on clean data.
Motivation: Although data augmentation has proven effective at preventing overfitting and improving the accuracy of convolutional neural network classifiers, building deep neural networks for real-world scenarios requires not only high accuracy on clean data but also robustness when data distributions shift.
Method: We propose IPMix, a data augmentation method that integrates image-level, patch-level, and pixel-level augmentations into a coherent, label-preserving technique to increase training data diversity with limited computational overhead. To further improve robustness, IPMix introduces structural complexity at different levels to generate more diverse images and adopts random mixing for multi-scale information fusion.
Results: Experiments show IPMix outperforms the state of the art in corruption robustness on CIFAR-C and ImageNet-C; it also significantly improves other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on benchmarks including ImageNet-R, ImageNet-A, and ImageNet-O.

Data augmentation has been proven effective for training high-accuracy convolutional neural network classifiers by preventing overfitting. However, building deep neural networks in real-world scenarios requires not only high accuracy on clean data but also robustness when data distributions shift. While prior methods have proposed that there is a trade-off between accuracy and robustness, we propose IPMix, a simple data augmentation approach to improve robustness without hurting clean accuracy. IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent and label-preserving technique to increase the diversity of training data with limited computational overhead. To further improve the robustness, IPMix introduces structural complexity at different levels to generate more diverse images and adopts the random mixing method for multi-scale information fusion. Experiments demonstrate that IPMix outperforms state-of-the-art corruption robustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also significantly improves the other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.

Eliminating Domain Bias for Federated Learning in Representation Space
Jianqing Zhang Yang Hua Jian Cao Hao Wang Tao Song Zhengui XUE Ruhui Ma Haibing Guan



Research question: Under statistically heterogeneous scenarios, biased data domains on clients cause a representation bias phenomenon that further degenerates generic representations during local training.
Motivation: To address these issues, we propose Domain Bias Eliminator (DBE), a general framework for federated learning.
Method: By reducing the domain discrepancy between server and clients in representation space, DBE promotes bi-directional knowledge transfer between server and clients.
Results: Experiments show that DBE greatly improves existing FL methods in both generalization and personalization abilities; DBE-equipped FL methods can outperform ten state-of-the-art personalized FL methods by a large margin.

Recently, federated learning (FL) has become popular for its privacy-preserving and collaborative learning abilities. However, under statistically heterogeneous scenarios, we observe that biased data domains on clients cause a representation bias phenomenon and further degenerate generic representations during local training, i.e., the representation degeneration phenomenon. To address these issues, we propose a general framework Domain Bias Eliminator (DBE) for FL. Our theoretical analysis reveals that DBE can promote bi-directional knowledge transfer between server and client, as it reduces the domain discrepancy between server and client in representation space. Besides, extensive experiments on four datasets show that DBE can greatly improve existing FL methods in both generalization and personalization abilities. The DBE-equipped FL method can outperform ten state-of-the-art personalized FL methods by a large margin. Our code is public at https://github.com/TsingZ0/DBE.

Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning
Jialong Wu Haoyu Ma Chaoyi Deng Mingsheng Long



Research question: This paper studies pre-training world models on in-the-wild videos for efficient learning of downstream visual control tasks.
Motivation: The complexity and diversity of in-the-wild videos make it difficult for a world model to extract shared world knowledge that generalizes well.
Method: We introduce Contextualized World Models (ContextWM), which combine a context encoder with the latent dynamics model to separate context and dynamics modeling, overcoming the complexity and diversity of in-the-wild videos and facilitating knowledge transfer between distinct scenes.
Results: Experiments show that in-the-wild video pre-training with ContextWM significantly improves sample efficiency across domains, including robotic manipulation, locomotion, and autonomous driving.

Unsupervised pre-training methods utilizing large and diverse datasets have achieved tremendous success across a range of domains. Recent work has investigated such unsupervised pre-training methods for model-based reinforcement learning (MBRL) but is limited to domain-specific or simulated data. In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of downstream visual control tasks. However, in-the-wild videos are complicated with various contextual factors, such as intricate backgrounds and textured appearance, which precludes a world model from extracting shared world knowledge to generalize better. To tackle this issue, we introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling to overcome the complexity and diversity of in-the-wild videos and facilitate knowledge transfer between distinct scenes. Specifically, a contextualized extension of the latent dynamics model is elaborately realized by incorporating a context encoder to retain contextual information and empower the image decoder, which encourages the latent dynamics model to concentrate on essential temporal variations. Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of MBRL in various domains, including robotic manipulation, locomotion, and autonomous driving. Code is available at this repository: https://github.com/thuml/ContextWM.

ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning
Junguang Jiang Baixu Chen Junwei Pan Ximei Wang Dapeng Liu jie jiang Mingsheng Long



Research question: This work addresses negative transfer in multi-task learning, where simultaneously learning multiple related tasks degrades target-task performance.
Motivation: Existing optimization methods mainly coordinate task gradients to address negative transfer, but overlook the generalization ability of the auxiliary and target tasks.
Method: We propose ForkMerge, a new method that periodically forks the model into multiple branches, automatically searches the varying task weights by minimizing target validation error, and dynamically merges all branches to filter out harmful task-parameter updates.
Results: On a series of auxiliary-task learning benchmarks, ForkMerge outperforms existing methods and effectively mitigates negative transfer.

Auxiliary-Task Learning (ATL) aims to improve the performance of the target task by leveraging the knowledge obtained from related tasks. Occasionally, learning multiple tasks simultaneously results in lower accuracy than learning only the target task, which is known as negative transfer. This problem is often attributed to the gradient conflicts among tasks, and is frequently tackled by coordinating the task gradients in previous works. However, these optimization-based methods largely overlook the auxiliary-target generalization capability. To better understand the root cause of negative transfer, we experimentally investigate it from both optimization and generalization perspectives. Based on our findings, we introduce ForkMerge, a novel approach that periodically forks the model into multiple branches, automatically searches the varying task weights by minimizing target validation errors, and dynamically merges all branches to filter out detrimental task-parameter updates. On a series of auxiliary-task learning benchmarks, ForkMerge outperforms existing methods and effectively mitigates negative transfer.
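
The fork-score-merge loop can be sketched at a high level. In this illustrative version, `train_branch` and `val_error` are hypothetical callables, and the softmax merge weighting is an assumption; the paper searches task weights by minimizing target validation error.

```python
import copy
import torch

def fork_merge_step(model, task_weight_grid, train_branch, val_error):
    """Illustrative ForkMerge-style step: fork the model, train each branch
    with a different auxiliary-task weighting, score branches on the target
    validation set, and merge parameters favoring the better branches."""
    branches = [train_branch(copy.deepcopy(model), w) for w in task_weight_grid]
    errs = torch.tensor([val_error(b) for b in branches])
    merge_w = torch.softmax(-errs, dim=0)  # lower val error -> larger weight
    states = [b.state_dict() for b in branches]
    merged = copy.deepcopy(model)
    merged.load_state_dict({name: sum(w * s[name] for w, s in zip(merge_w, states))
                            for name in states[0]})
    return merged
```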

ReContrast: Domain-Specific Anomaly Detection via Contrastive Reconstruction
Jia Guo shuai lu LIze JIa Weihang Zhang Huiqi Li



Research question: Current state-of-the-art unsupervised anomaly detection (UAD) methods mainly rely on feature representations from frozen encoder networks pre-trained on large-scale datasets, but these features differ substantially from those required in target UAD domains such as industrial inspection and medical imaging.
Motivation: To address this, we propose ReContrast, a novel epistemic UAD method that optimizes the entire network to reduce bias toward the pre-trained image domain and orient the network to the target domain.
Method: We start from a feature reconstruction approach that detects anomalies from reconstruction errors. In essence, elements of contrastive learning are carefully embedded into feature reconstruction to prevent training instability, pattern collapse, and identical shortcuts, while simultaneously optimizing the encoder and decoder on the target domain.
Results: Extensive experiments on two popular industrial defect detection benchmarks and three medical image UAD tasks demonstrate transferability across image domains and superiority over current state-of-the-art methods.

Most advanced unsupervised anomaly detection (UAD) methods rely on modeling feature representations of frozen encoder networks pre-trained on large-scale datasets, e.g. ImageNet. However, the features extracted from the encoders that are borrowed from natural image domains coincide little with the features required in the target UAD domain, such as industrial inspection and medical imaging. In this paper, we propose a novel epistemic UAD method, namely ReContrast, which optimizes the entire network to reduce biases towards the pre-trained image domain and orients the network in the target domain. We start with a feature reconstruction approach that detects anomalies from errors. Essentially, the elements of contrastive learning are elegantly embedded in feature reconstruction to prevent the network from training instability, pattern collapse, and identical shortcut, while simultaneously optimizing both the encoder and decoder on the target domain. To demonstrate our transfer ability on various image domains, we conduct extensive experiments across two popular industrial defect detection benchmarks and three medical image UAD tasks, which shows our superiority over current state-of-the-art methods.

Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning
Alex Tamkin Margalit Glasgow Xiluo He Noah Goodman



Research question: What role do augmentations play in contrastive learning?
Motivation: Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task; we complicate this picture by showing that label-destroying augmentations can be effective in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks.
Method: We run contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g., predicting the digits superimposed on photographs) and find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks.
Results: Although these augmentations do not preserve label information, they are often interpretable (e.g., altering shapes, or the digits and letters added to images) and frequently outperform expert-designed augmentations; our theoretical analysis of a simple contrastive learning setting with a linear model shows that label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of another useful set.

What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.

Towards Distribution-Agnostic Generalized Category Discovery
Jianhong Bai Zuozhu Liu Hualiang Wang Ruizhe Chen Lianrui Mu Xiaomeng Li Joey Tianyi Zhou YANG FENG Jian Wu Haoji Hu



Research question: This paper tackles data imbalance and open-ended distribution, two intrinsic characteristics of the real visual world, and in particular the challenge of classifying both close-set and open-set samples in a long-tailed open world.
Motivation: Despite encouraging progress on each challenge separately, few works have been dedicated to combining them for real-world scenarios.
Method: We formalize a more realistic task, distribution-agnostic generalized category discovery (DA-GCD), and propose a Self-Balanced Co-Advice contrastive framework (BaCon) consisting of a contrastive-learning branch and a pseudo-labeling branch that collaborate to provide interactive supervision for the DA-GCD task.
Results: BaCon shows superior performance over all baselines, with comprehensive analysis across various datasets.

Data imbalance and open-ended distribution are two intrinsic characteristics of the real visual world. Though encouraging progress has been made in tackling each challenge separately, few works have been dedicated to combining them for real-world scenarios. While several previous works have focused on classifying close-set samples and detecting open-set samples during testing, it is still essential to be able to classify unknown subjects, as humans can. In this paper, we formally define a more realistic task as distribution-agnostic generalized category discovery (DA-GCD): generating fine-grained predictions for both close- and open-set classes in a long-tailed open-world setting. To tackle the challenging problem, we propose a Self-**Ba**lanced **Co**-Advice co**n**trastive framework (BaCon), which consists of a contrastive-learning branch and a pseudo-labeling branch, working collaboratively to provide interactive supervision to resolve the DA-GCD task. In particular, the contrastive-learning branch provides reliable distribution estimation to regularize the predictions of the pseudo-labeling branch, which in turn guides contrastive learning through self-balanced knowledge transfer and a proposed novel contrastive loss. We compare BaCon with state-of-the-art methods from two closely related fields: imbalanced semi-supervised learning and generalized category discovery. The effectiveness of BaCon is demonstrated with superior performance over all baselines and comprehensive analysis across various datasets. Our code is publicly available.

Where2Explore: Few-shot Affordance Learning for Unseen Novel Categories of Articulated Objects
Chuanruo Ning Ruihai Wu Haoran Lu Kaichun Mo Hao Dong



Research question: This paper addresses the challenges robots face in manipulating diverse object categories, particularly generalization to unseen categories.
Motivation: Due to significant geometric and semantic variations across object categories, previous manipulation models struggle to generalize to novel categories. Few-shot learning is a promising remedy that allows robots to perform a few interactions with unseen objects, but existing approaches often require costly and inefficient test-time interactions with each unseen instance.
Method: We propose 'Where2Explore', a new affordance learning framework that exploits local geometric similarities shared across categories (e.g., pullable handles and graspable edges) for effective exploration. The framework estimates geometric similarity across categories, identifies local areas that differ from shapes in the training categories for efficient exploration, and concurrently transfers affordance knowledge to similar parts of the objects.
Results: Extensive experiments in simulated and real-world environments demonstrate the framework's capacity for efficient few-shot exploration and generalization.

Articulated object manipulation is a fundamental yet challenging task in robotics. Due to significant geometric and semantic variations across object categories, previous manipulation models struggle to generalize to novel categories. Few-shot learning is a promising solution for alleviating this issue by allowing robots to perform a few interactions with unseen objects. However, extant approaches often necessitate costly and inefficient test-time interactions with each unseen instance. Recognizing this limitation, we observe that despite their distinct shapes, different categories often share similar local geometries essential for manipulation, such as pullable handles and graspable edges - a factor typically underutilized in previous few-shot learning works. To harness this commonality, we introduce 'Where2Explore', an affordance learning framework that effectively explores novel categories with minimal interactions on a limited number of instances. Our framework explicitly estimates the geometric similarity across different categories, identifying local areas that differ from shapes in the training categories for efficient exploration while concurrently transferring affordance knowledge to similar parts of the objects. Extensive experiments in simulated and real-world environments demonstrate our framework's capacity for efficient few-shot exploration and generalization.

Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand
Junfeng Guo Yiming Li Lixu Wang Shu-Tao Xia Heng Huang Cong Liu Bo Li



Research question: This paper revisits backdoor-based dataset ownership verification (DOV) for protecting the copyright of open-source datasets.
Motivation: Current DOV methods can introduce malicious misclassification behaviors that harm watermarked DNNs.
Method: We design a new DOV approach by making watermarked models correctly classify some 'hard' samples that benign models misclassify. Inspired by the generalization property of DNNs, we find a hardly-generalized domain for the original dataset (as its 'domain watermark'), which can be easily learned with a protected dataset containing modified samples.
Results: Extensive experiments on three benchmark datasets verify the method's effectiveness and its resistance to potential adaptive attacks.

The prosperity of deep neural networks (DNNs) has benefited greatly from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful, given that adversaries could exploit them to introduce malicious misclassification behaviors into watermarked DNNs. In this work, we instead design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some `hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods.

Data Selection for Language Models via Importance Resampling
Sang Michael Xie Shibani Santurkar Tengyu Ma Percy Liang



Research question: How to select, from a large unlabeled corpus, a subset suitable for pretraining language models.
Motivation: Existing methods mainly rely on simple heuristics or expert manual curation, which lack efficiency and scalability.
Method: Propose Data Selection with Importance Resampling (DSIR), which estimates importance weights in a reduced feature space and selects data by importance resampling according to these weights.
Results: Experiments show that DSIR performs comparably to expert curation for continued domain-specific pretraining, and for general-domain pretraining (targeting Wikipedia + books) it improves over random selection and heuristic filtering baselines by 2-2.5%.

Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics or use experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. To determine an appropriate feature space, we show that KL reduction, a data metric that measures the proximity between selected pretraining data and the target in a feature space, has high correlation with average downstream accuracy (r=0.89) when computed with simple n-gram features. This motivates our instantiation of DSIR using n-gram features. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark.
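
A minimal sketch of the resampling step, assuming hashed unigram/bigram features as the reduced feature space; the bucket count, featurization details, and smoothing are illustrative assumptions, not the authors' exact implementation:

```python
# Importance resampling over hashed n-gram features, in the spirit of DSIR.
import zlib
import numpy as np
from collections import Counter

NUM_BUCKETS = 10_000  # hashed feature space size (illustrative assumption)

def ngram_counts(text, n=2):
    """Hash unigrams and bigrams of a whitespace-tokenized text into buckets."""
    toks = text.lower().split()
    grams = toks + [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return Counter(zlib.crc32(g.encode()) % NUM_BUCKETS for g in grams)

def fit_bucket_probs(texts):
    """Estimate an add-one-smoothed categorical distribution over buckets."""
    total = Counter()
    for t in texts:
        total.update(ngram_counts(t))
    probs = np.full(NUM_BUCKETS, 1.0)
    for b, c in total.items():
        probs[b] += c
    return probs / probs.sum()

def select(raw_texts, target_texts, k, rng=np.random.default_rng(0)):
    """Select k raw examples by importance resampling toward the target."""
    p_tgt, p_raw = fit_bucket_probs(target_texts), fit_bucket_probs(raw_texts)
    log_w = []
    for t in raw_texts:
        c = ngram_counts(t)
        idx = np.fromiter(c.keys(), dtype=int)
        cnt = np.fromiter(c.values(), dtype=float)
        log_w.append(float(cnt @ (np.log(p_tgt[idx]) - np.log(p_raw[idx]))))
    w = np.exp(np.array(log_w) - np.max(log_w))  # stabilized importance weights
    return rng.choice(len(raw_texts), size=k, replace=False, p=w / w.sum())
```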

Learning Invariant Representations with a Nonparametric Nadaraya-Watson Head
Alan Q. Wang Minh Nguyen Mert R. Sabuncu



Research question: Machine learning models may fail when the deployment data distribution differs from the training distribution. When multiple environments are available during training, how can we learn representations that are invariant across distributions?
Motivation: To address distribution shift at deployment, we propose a nonparametric strategy for learning invariant representations based on the recently proposed Nadaraya-Watson (NW) head.
Method: By manipulating the support set (which consists of labeled data), the model can encode different causal assumptions; in particular, restricting the support set to a single environment encourages the model to learn invariant features that do not depend on the environment.
Results: Validation on three challenging computer vision domain generalization tasks demonstrates the effectiveness of the method.

Machine learning models will often fail when deployed in an environment with a data distribution that is different than the training distribution. When multiple environments are available during training, many methods exist that learn representations which are invariant across the different distributions, with the hope that these representations will be transportable to unseen domains. In this work, we present a nonparametric strategy for learning invariant representations based on the recently-proposed Nadaraya-Watson (NW) head. The NW head makes a prediction by comparing the learned representations of the query to the elements of a support set that consists of labeled data. We demonstrate that by manipulating the support set, one can encode different causal assumptions. In particular, restricting the support set to a single environment encourages the model to learn invariant features that do not depend on the environment. We present a causally-motivated setup for our modeling and training strategy and validate on three challenging real-world domain generalization tasks in computer vision.
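
A minimal sketch of a Nadaraya-Watson style prediction head, where a query's class probabilities are a kernel-weighted average of support-set labels; the dot-product kernel and temperature are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn.functional as F

def nw_head(query_feats, support_feats, support_labels, num_classes, tau=1.0):
    """query_feats: (B, D); support_feats: (N, D); support_labels: (N,)."""
    # Kernel weights from query-support similarity.
    sims = query_feats @ support_feats.t() / tau              # (B, N)
    weights = F.softmax(sims, dim=-1)
    onehot = F.one_hot(support_labels, num_classes).float()   # (N, C)
    return weights @ onehot                                   # (B, C) probabilities

# Restricting the support set to a single environment, as the paper proposes,
# amounts to passing only that environment's (features, labels) here.
q = torch.randn(4, 16)
s, y = torch.randn(20, 16), torch.randint(0, 5, (20,))
print(nw_head(q, s, y, num_classes=5).sum(dim=-1))  # rows sum to 1
```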

Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL
Yang Yue Rui Lu Bingyi Kang Shiji Song Gao Huang



Research question: Divergence of Q-value estimation is a prominent problem in offline reinforcement learning; although it can be mitigated with policy constraints or conservative Q estimation, a theoretical understanding of its root cause has been missing.
Motivation: This work aims to understand the fundamental mechanism behind Q-value divergence in offline RL and to derive an improved solution.
Method: We first identify "self-excitation" as the primary cause of Q-value estimation divergence in offline RL. We then propose the Self-Excite Eigenvalue Measure (SEEM), a metric based on the Neural Tangent Kernel (NTK), to measure the evolving property of the Q-network during training, which offers an intriguing explanation for the emergence of divergence.
Results: Experiments show that our theory can reliably decide at an early stage whether training will diverge, and can even predict the order of growth of the estimated Q-values, the model's norm, and the crashing step when an SGD optimizer is used. We further find that LayerNorm effectively avoids divergence without introducing detrimental bias, leading to superior performance.

The divergence of the Q-value estimation has been a prominent issue in offline reinforcement learning (offline RL), where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, \emph{self-excitation}, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on the Neural Tangent Kernel (NTK) to measure the evolving property of the Q-network during training, which provides an intriguing explanation of the emergence of divergence. For the first time, our theory can reliably decide whether the training will diverge at an early stage, and even predict the order of the growth for the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretical analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolating behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some of the most challenging settings, i.e., using only 1\% of the transitions in the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieve SOTA results on many challenging tasks. We also give unique insights into its effectiveness.

Discover and Align Taxonomic Context Priors for Open-world Semi-Supervised Learning
Yu Wang Zhun Zhong Pengchong Qiao Xuxin Cheng Xiawu Zheng Chang Liu Nicu Sebe Rongrong Ji Jie Chen



Research question: How to classify unseen classes using partially labeled samples, particularly under multi-granularity labels.
Motivation: Existing methods mainly exploit relationships under single-granularity labels, ignoring the hierarchical relations between classes and the deeper supervision they provide.
Method: Propose a unified framework named Taxonomic context prIors Discovering and Aligning (TIDA), which constructs a set of hierarchical prototypes in the latent space to discover underlying taxonomic context priors (i.e., sub-class, target-class, and super-class), and then collaboratively leverages them to enhance representation learning and improve the quality of pseudo labels.
Results: Experiments show that the two components are mutually beneficial for an effective open-world semi-supervised learning framework, significantly improving performance on seven commonly used datasets and achieving a new state of the art.

Open-world Semi-Supervised Learning (OSSL) is a realistic and challenging task, aiming to classify unlabeled samples from both seen and novel classes using partially labeled samples from the seen classes. Previous works typically explore the relationship of samples as priors on the pre-defined single-granularity labels to help novel class recognition. In fact, classes follow a taxonomy and samples can be classified at multiple levels of granularity, which contains more underlying relationships for supervision. We thus argue that learning with single-granularity labels results in sub-optimal representation learning and inaccurate pseudo labels, especially with unknown classes. In this paper, we take the initiative to explore and propose a unified framework, called Taxonomic context prIors Discovering and Aligning (TIDA), which exploits the relationship of samples at various granularities. It allows us to discover multi-granularity semantic concepts as taxonomic context priors (i.e., sub-class, target-class, and super-class), and then collaboratively leverage them to enhance representation learning and improve the quality of pseudo labels. Specifically, TIDA comprises two components: i) A taxonomic context discovery module that constructs a set of hierarchical prototypes in the latent space to discover the underlying taxonomic context priors; ii) A taxonomic context-based prediction alignment module that enforces consistency across hierarchical predictions to build reliable relationships between classes across granularities and provide additional supervision. We demonstrate that these two components are mutually beneficial for an effective OSSL framework, which is theoretically explained from the perspective of the EM algorithm. Extensive experiments on seven commonly used datasets show that TIDA can significantly improve the performance and achieve a new state of the art. The source codes are publicly available at https://github.com/rain305f/TIDA.

Saving 100x Storage: Prototype Replay for Reconstructing Training Sample Distribution in Class-Incremental Semantic Segmentation
Jinpeng Chen Runmin Cong Yuxuan LUO Horace Ip Sam Kwong



Research question: Existing class-incremental semantic segmentation (CISS) methods mainly tackle catastrophic forgetting and background shift, but often overlook another crucial issue.
Motivation: In CISS, each step focuses on different foreground classes, and the training set of a single step contains only images with pixels of the current foreground classes, excluding images without them. These foreground classes are therefore over-represented in the single-step training set, biasing classification towards them.
Method: We propose STAR, which preserves the main characteristics of each past class by storing a compact prototype and the necessary statistics, and aligns the class distribution of single-step training samples with the complete dataset by replaying these prototypes and repeating background pixels at appropriate frequencies.
Results: Compared with previous works that replay raw images, our method saves over 100x storage while achieving better performance. STAR also introduces an old-class features maintaining (OCFM) loss, which keeps old-class features unchanged while preserving sufficient plasticity for learning new classes, and a similarity-aware discriminative (SAD) loss, which specifically enhances feature diversity between similar old-new class pairs. Experiments on two public datasets, Pascal VOC 2012 and ADE20K, show that our model surpasses all previous state-of-the-art methods.

Existing class-incremental semantic segmentation (CISS) methods mainly tackle catastrophic forgetting and background shift, but often overlook another crucial issue. In CISS, each step focuses on different foreground classes, and the training set for a single step only includes images containing pixels of the current foreground classes, excluding images without them. This leads to an overrepresentation of these foreground classes in the single-step training set, biasing the classification towards these classes. To address this issue, we present STAR, which preserves the main characteristics of each past class by storing a compact prototype and necessary statistical data, and aligns the class distribution of single-step training samples with the complete dataset by replaying these prototypes and repeating background pixels with appropriate frequency. Compared to the previous works that replay raw images, our method saves over 100 times the storage while achieving better performance. Moreover, STAR incorporates an old-class features maintaining (OCFM) loss, keeping old-class features unchanged while preserving sufficient plasticity for learning new classes. Furthermore, a similarity-aware discriminative (SAD) loss is employed to specifically enhance the feature diversity between similar old-new class pairs. Experiments on two public datasets, Pascal VOC 2012 and ADE20K, reveal that our model surpasses all previous state-of-the-art methods.

Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective
Huayang Li Tian Lan Zihao Fu Deng Cai Lemao Liu Nigel Collier Taro Watanabe Yixuan Su



Research question: This paper aims to understand the neural text degeneration problem, i.e., the generation of repetitive and dull loops.
Motivation: There are a number of diverging hypotheses about neural text degeneration, which makes the problem both interesting and confusing; this work seeks a straightforward, fundamental explanation from the data perspective.
Method: From the data perspective, we find a strong correlation between the degeneration issue and the presence of repetitions in the training data. Subsequent experiments show that selectively dropping the attention to repetitive words in training data significantly reduces degeneration. Empirical analysis further shows that prior works addressing degeneration from various standpoints, such as high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can all be interpreted by this one simple explanation.
Results: Experiments show that penalizing repetitions in training data remains effective even with larger model sizes and instruction tuning.

There are a number of diverging hypotheses about the neural text degeneration problem, i.e., generating repetitive and dull loops, which makes this problem both interesting and confusing. In this work, we aim to advance our understanding by presenting a straightforward and fundamental explanation from the data perspective. Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data. Subsequent experiments also demonstrate that by selectively dropping out the attention to repetitive words in training data, degeneration can be significantly minimized. Furthermore, our empirical analysis illustrates that prior works addressing the degeneration issue from various standpoints, such as the high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be interpreted by one simple explanation. That is, penalizing the repetitions in training data is a common and fundamental factor for their effectiveness. Moreover, our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
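
A hedged sketch of the attention-dropping idea, assuming a simple rule that masks any key whose token id already occurred earlier in the sequence; the paper's exact masking rule may differ:

```python
import torch

def repeated_token_mask(token_ids):
    """token_ids: (B, T). True where a token repeats an earlier occurrence."""
    B, T = token_ids.shape
    eq = token_ids.unsqueeze(2) == token_ids.unsqueeze(1)            # (B, T, T)
    earlier = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)
    return (eq & earlier).any(dim=2)                                 # (B, T)

def attention_ignoring_repeats(q, k, v, token_ids):
    """Scaled dot-product attention that drops attention to repeated keys."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5            # (B, T, T)
    repeat = repeated_token_mask(token_ids)                          # (B, T)
    # The first occurrence of each token is never masked, so every row
    # retains at least one finite score before the softmax.
    scores = scores.masked_fill(repeat.unsqueeze(1), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```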

Overcoming Recency Bias of Normalization Statistics in Continual Learning: Balance and Adaptation
Yilin Lyu Liyuan Wang Xingxing Zhang Zicheng Sun Hang Su Jun Zhu Liping Jing



Research question: This paper addresses continual learning in deep networks, specifically the sub-optimality of Batch Normalization (BN) in balancing old and new tasks.
Motivation: With limited access to old training samples, deep models in continual learning tend to severely forget old tasks. BN is updated by the gradients and statistics of the currently observed training samples, which biases it towards new tasks and harms training stability and generalization.
Method: We propose Adaptive Balance of BN (AdaB$^2$N), which incorporates a Bayesian-based strategy to adapt task-wise contributions and a modified momentum to balance BN statistics, corresponding to the training and testing stages.
Results: Experiments show that AdaB$^2$N achieves significant performance gains across a range of benchmarks, particularly in challenging online scenarios (e.g., gains of 7.68%, 6.86%, and 4.26% on Split CIFAR-10, Split CIFAR-100, and Split Mini-ImageNet, respectively).

Continual learning entails learning a sequence of tasks and balancing their knowledge appropriately. With limited access to old training samples, much of the current work in deep neural networks has focused on overcoming catastrophic forgetting of old tasks in gradient-based optimization. However, the normalization layers provide an exception, as they are updated interdependently by the gradient and statistics of currently observed training samples, which require specialized strategies to mitigate recency bias. In this work, we focus on the most popular Batch Normalization (BN) and provide an in-depth theoretical analysis of its sub-optimality in continual learning. Our analysis demonstrates the dilemma between balance and adaptation of BN statistics for incremental tasks, which potentially affects training stability and generalization. Targeting these particular challenges, we propose Adaptive Balance of BN (AdaB$^2$N), which appropriately incorporates a Bayesian-based strategy to adapt task-wise contributions and a modified momentum to balance BN statistics, corresponding to the training and testing stages. By implementing BN in a continual learning fashion, our approach achieves significant performance gains across a wide range of benchmarks, particularly for the challenging yet realistic online scenarios (e.g., up to 7.68\%, 6.86\% and 4.26\% on Split CIFAR-10, Split CIFAR-100 and Split Mini-ImageNet, respectively). Our code is available at https://github.com/lvyilin/AdaB2N.

Decompose Novel into Known: Part Concept Learning For 3D Novel Class Discovery
Tingyu Weng Jun Xiao Haiyong Jiang



Research question: This paper addresses 3D novel class discovery (NCD), i.e., discovering novel classes from an unlabeled dataset by leveraging the knowledge of known classes.
Motivation: The key challenge of 3D NCD is that features learned through known-class recognition are heavily biased, hindering generalization to novel classes. Since geometric parts generalize better across categories, we propose to decompose novel classes into known parts, coined DNIK, to mitigate this problem.
Method: DNIK learns a part concept bank that encodes rich part geometric patterns from known classes, so that novel shapes can be represented as part concept compositions, facilitating cross-category generalization. Three part-concept constraints are formulated to ensure diverse part concepts without collapse, and a part relation encoding module (PRE) leverages part-wise spatial relations for better recognition.
Results: We construct three 3D NCD tasks for evaluation; experiments show significantly superior results over state-of-the-art baselines (average improvements of +11.7%, +14.1%, and +16.3% on the three tasks). Code and data will be released.

In this work, we address 3D novel class discovery (NCD) that discovers novel classes from an unlabeled dataset by leveraging the knowledge of disjoint known classes. The key challenge of 3D NCD is that learned features by known class recognition are heavily biased and hinder generalization to novel classes. Since geometric parts are more generalizable across different classes, we propose to decompose novel into known parts, coined DNIK, to mitigate the above problems. DNIK learns a part concept bank encoding rich part geometric patterns from known classes so that novel 3D shapes can be represented as part concept compositions to facilitate cross-category generalization. Moreover, we formulate three constraints on part concepts to ensure diverse part concepts without collapsing. A part relation encoding module (PRE) is also developed to leverage part-wise spatial relations for better recognition. We construct three 3D NCD tasks for evaluation and extensive experiments show that our method achieves significantly superior results to SOTA baselines (+11.7%, +14.1%, and +16.3% improvements on average for the three tasks, respectively). Code and data will be released.

Certifiably Robust Graph Contrastive Learning
Minhua Lin Teng Xiao Enyan Dai Xiang Zhang Suhang Wang



Research question: This paper addresses the vulnerability of graph contrastive learning (GCL) to adversarial attacks on graph structure and node attributes.
Motivation: Although empirical approaches have been proposed to enhance the robustness of GCL, its certifiable robustness remains unexplored.
Method: We develop the first certifiably robust GCL framework. We first propose a unified criterion to evaluate and certify the robustness of GCL, then introduce RES (Randomized Edgedrop Smoothing) to ensure certifiable robustness for any GCL model, with the certified robustness provably preserved in downstream tasks. We also propose an effective training method for robust GCL.
Results: Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method in providing certifiable robustness and enhancing the robustness of any GCL model.

Graph Contrastive Learning (GCL) has emerged as a popular unsupervised graph representation learning method. However, it has been shown that GCL is vulnerable to adversarial attacks on both the graph structure and node attributes. Although empirical approaches have been proposed to enhance the robustness of GCL, its certifiable robustness remains unexplored. In this paper, we develop the first certifiably robust framework in GCL. Specifically, we first propose a unified criterion to evaluate and certify the robustness of GCL. We then introduce a novel technique, RES (Randomized Edgedrop Smoothing), to ensure certifiable robustness for any GCL model, and this certified robustness can be provably preserved in downstream tasks. Furthermore, an effective training method is proposed for robust GCL. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed method in providing effective certifiable robustness and enhancing the robustness of any GCL model. The source code of RES is available at https://github.com/ventr1c/RES-GCL.

Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data
Cheng-Hao Tu Hong-You Chen Zheda Mai Jike Zhong Vardaan Pahuja Tanya Berger-Wolf Song Gao Charles Stewart Yu Su Wei-Lun Chao



Research question: We pose a learning problem: adapting a pre-trained source model to a target domain to classify all classes that appeared in the source data, using target data that covers only a partial label space.
Motivation: The problem is of practical significance, since it is unrealistic for target end-users to collect data for all classes before adaptation, yet it has received limited attention in the literature.
Method: We construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We find a dilemma: on the one hand, adapting to the new target domain is important for better performance; on the other hand, preserving the classification accuracy of classes missing from the target adaptation data is highly challenging, let alone improving it. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of missing classes and enhance overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.
Results: Experiments verify that our approach effectively addresses the challenges of adapting pre-trained models to the target domain, especially when the target data covers only part of the label space.

We propose a learning problem involving adapting a pre-trained source model to the target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. This problem is practical, as it is unrealistic for the target end-users to collect data for all classes prior to adaptation. However, it has received limited attention in the literature. To shed light on this issue, we construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We found a dilemma --- on the one hand, adapting to the new target domain is important for achieving better performance; on the other hand, we observe that preserving the classification accuracy of classes missing in the target adaptation data is highly challenging, let alone improving them. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of the missing classes and enhance the overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.

Latent Graph Inference with Limited Supervision
Jianglin Lu Yi Xu Huan Wang Yue Bai Yun Fu



Research question: Existing latent graph inference methods learn massive edge weights without semantic supervision, so the predictions for test samples cannot be semantically optimal, harming generalization.
Motivation: The graph sparsification operation in latent graph inference severely destroys the important connections between pivotal nodes and labeled ones, causing supervision starvation.
Method: We propose to restore the corrupted affinities and replenish the missed supervision for better latent graph inference. We first define the pivotal nodes as k-hop starved nodes, then eliminate the starved nodes by reconstructing the destroyed connections.
Results: Experiments show that reducing the starved nodes consistently improves the performance of existing latent graph inference methods, especially under extremely limited supervision (a 6.12% improvement on Pubmed with a labeling rate of only 0.3%).

Latent graph inference (LGI) aims to jointly learn the underlying graph structure and node representations from data features. However, existing LGI methods commonly suffer from the issue of supervision starvation, where massive edge weights are learned without semantic supervision and do not contribute to the training loss. Consequently, these supervision-starved weights, which determine the predictions of testing samples, cannot be semantically optimal, resulting in poor generalization. In this paper, we observe that this issue is actually caused by the graph sparsification operation, which severely destroys the important connections established between pivotal nodes and labeled ones. To address this, we propose to restore the corrupted affinities and replenish the missed supervision for better LGI. The key challenge then lies in identifying the critical nodes and recovering the corrupted affinities. We begin by defining the pivotal nodes as k-hop starved nodes, which can be identified based on a given adjacency matrix. Considering the high computational burden, we further present a more efficient alternative inspired by CUR matrix decomposition. Subsequently, we eliminate the starved nodes by reconstructing the destroyed connections. Extensive experiments on representative benchmarks demonstrate that reducing the starved nodes consistently improves the performance of state-of-the-art LGI methods, especially under extremely limited supervision (6.12% improvement on Pubmed with a labeling rate of only 0.3%).
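
A minimal sketch of one reading of the k-hop starved node definition: a node is starved if no labeled node lies within k hops of it in the given adjacency matrix. This is an illustrative rendering, not the authors' code:

```python
import numpy as np

def k_hop_starved(adj, labeled_idx, k=2):
    """adj: (N, N) 0/1 adjacency matrix; labeled_idx: indices of labeled nodes."""
    n = adj.shape[0]
    reach = np.eye(n, dtype=int)             # nodes reachable within 0..k hops
    hop = np.eye(n, dtype=int)
    for _ in range(k):
        hop = ((hop @ adj) > 0).astype(int)  # expand the frontier by one hop
        reach |= hop
    labeled = np.zeros(n, dtype=bool)
    labeled[labeled_idx] = True
    # Starved: the k-hop neighborhood (including the node itself) has no label.
    return np.where(~reach[:, labeled].any(axis=1))[0]

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]])
print(k_hop_starved(adj, labeled_idx=[0], k=1))  # [2 3]: node 2 and isolated node 3
```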

Interactive Multi-fidelity Learning for Cost-effective Adaptation of Language Model with Sparse Human Supervision
Jiaxin Zhang Zhuohang Li Kamalika Das Sricharan Kumar



Research question: Large language models perform remarkably well across tasks, but their suitability for domain-specific tasks is limited by their immense scale at deployment, susceptibility to misinformation, and, more importantly, high data annotation costs.
Motivation: For domain-specific tasks with limited annotation budgets, we propose a novel Interactive Multi-Fidelity Learning (IMFL) framework to reduce development costs.
Method: We formulate domain-specific fine-tuning as a multi-fidelity learning problem, focusing on identifying the optimal acquisition strategy that balances low-fidelity automatic LLM annotations against high-fidelity human annotations to maximize model performance. We further propose an exploration-exploitation query strategy that enhances annotation diversity and informativeness, with two innovative designs: 1) prompt retrieval, which selects in-context examples from human-annotated samples to improve LLM annotation, and 2) variable batch size, which controls the order in which each fidelity is chosen to facilitate knowledge distillation and ultimately improve annotation quality.
Results: Extensive experiments on financial and medical tasks show that IMFL outperforms single-fidelity annotation. Given a limited human annotation budget, IMFL significantly outperforms the 3x human annotation baselines on all four tasks and achieves performance close to 5x human annotation on two of them. These encouraging results suggest that with IMFL, comparable performance can be achieved using fewer human annotations supplemented by cheaper and faster LLM (e.g., GPT-3.5) annotations, substantially reducing the high human annotation costs of domain-specific tasks.

Large language models (LLMs) have demonstrated remarkable capabilities in various tasks. However, their suitability for domain-specific tasks is limited due to their immense scale at deployment, susceptibility to misinformation, and more importantly, high data annotation costs. We propose a novel Interactive Multi-Fidelity Learning (IMFL) framework for cost-effective development of small domain-specific LMs under limited annotation budgets. Our approach formulates the domain-specific fine-tuning process as a multi-fidelity learning problem, focusing on identifying the optimal acquisition strategy that balances between low-fidelity automatic LLM annotations and high-fidelity human annotations to maximize model performance. We further propose an exploration-exploitation query strategy that enhances annotation diversity and informativeness, incorporating two innovative designs: 1) prompt retrieval that selects in-context examples from human-annotated samples to improve LLM annotation, and 2) variable batch size that controls the order for choosing each fidelity to facilitate knowledge distillation, ultimately enhancing annotation quality. Extensive experiments on financial and medical tasks demonstrate that IMFL achieves superior performance compared with single fidelity annotations. Given a limited budget of human annotation, IMFL significantly outperforms the $\bf 3\times$ human annotation baselines in all four tasks and achieves performance very close to that of $\bf 5\times$ human annotation on two of the tasks. These promising results suggest that the high human annotation costs in domain-specific tasks can be significantly reduced by employing IMFL, which utilizes fewer human annotations, supplemented with cheaper and faster LLM (e.g., GPT-3.5) annotations to achieve comparable performance.

Domain Re-Modulation for Few-Shot Generative Domain Adaptation
Yi Wu Ziqiang Li Chaoyue Wang Heliang Zheng Shanshan Zhao Bin Li Dacheng Tao



Research question: This paper addresses few-shot generative domain adaptation (GDA), i.e., transferring a pre-trained generator from one domain to a new domain using only a few reference images.
Motivation: Inspired by the way human brains acquire knowledge in new domains, we present an innovative generator structure called Domain Re-Modulation (DoRM).
Method: DoRM not only meets the criteria of high quality, large synthesis diversity, and cross-domain consistency achieved by previous GDA research, but also incorporates memory and domain association, akin to how human brains operate. Concretely, DoRM freezes the source generator and introduces new mapping and affine modules (M&A modules) to capture the attributes of the target domain during GDA, a process resembling the formation of new synapses in human brains; as a result, a linearly combinable domain shift occurs in the style space. By incorporating multiple new M&A modules, the generator gains the capability to perform high-fidelity multi-domain and hybrid-domain generation. Moreover, to maintain cross-domain consistency more effectively, we introduce a similarity-based structure loss that aligns the auto-correlation map of the target image with that of the corresponding source image during training.
Results: Extensive experiments demonstrate the superior performance of DoRM and the similarity-based structure loss in few-shot GDA, both quantitatively and qualitatively.

In this study, we delve into the task of few-shot Generative Domain Adaptation (GDA), which involves transferring a pre-trained generator from one domain to a new domain using only a few reference images. Inspired by the way human brains acquire knowledge in new domains, we present an innovative generator structure called $\textbf{Domain Re-Modulation (DoRM)}$. DoRM not only meets the criteria of $\textit{high quality}$, $\textit{large synthesis diversity}$, and $\textit{cross-domain consistency}$, which were achieved by previous research in GDA, but also incorporates $\textit{memory}$ and $\textit{domain association}$, akin to how human brains operate. Specifically, DoRM freezes the source generator and introduces new mapping and affine modules (M\&A modules) to capture the attributes of the target domain during GDA. This process resembles the formation of new synapses in human brains. Consequently, a linearly combinable domain shift occurs in the style space. By incorporating multiple new M\&A modules, the generator gains the capability to perform high-fidelity multi-domain and hybrid-domain generation. Moreover, to maintain cross-domain consistency more effectively, we introduce a similarity-based structure loss. This loss aligns the auto-correlation map of the target image with its corresponding auto-correlation map of the source image during training. Through extensive experiments, we demonstrate the superior performance of our DoRM and similarity-based structure loss in few-shot GDA, both quantitatively and qualitatively. Code will be available at https://github.com/wuyi2020/DoRM.

Scale-teaching: Robust Multi-scale Training for Time Series Classification with Noisy Labels
Zhen Liu Peitian Ma Dongliang Chen Wenbin Pei Qianli Ma



Research question: How to improve the robustness of deep neural networks to noisy labels on time series data.
Motivation: Existing deep learning methods for image data treat samples with small training losses as correctly labeled, but the discriminative patterns of time series are easily distorted by external noises during recording, so the training losses of some samples do not satisfy the small-loss criterion.
Method: We propose a deep learning paradigm called Scale-teaching, which trains multiple deep networks simultaneously on time series at different scales and designs a fine-to-coarse cross-scale fusion mechanism for learning discriminative patterns. Each network is trained in a cross-teaching manner, using complementary information from different scales to select small-loss samples as clean labels. For unselected large-loss samples, multi-scale embedding graph learning with label propagation is introduced to correct their labels using the selected clean samples.
Results: Experiments on multiple benchmark time series datasets demonstrate the superiority of Scale-teaching over state-of-the-art methods in terms of effectiveness and robustness.

Deep Neural Networks (DNNs) have been criticized because they easily overfit noisy (incorrect) labels. To improve the robustness of DNNs, existing methods for image data regard samples with small training losses as correctly labeled data (small-loss criterion). Nevertheless, time series' discriminative patterns are easily distorted by external noises (i.e., frequency perturbations) during the recording process. This results in training losses of some time series samples that do not meet the small-loss criterion. Therefore, this paper proposes a deep learning paradigm called Scale-teaching to cope with time series noisy labels. Specifically, we design a fine-to-coarse cross-scale fusion mechanism for learning discriminative patterns by utilizing time series at different scales to train multiple DNNs simultaneously. Meanwhile, each network is trained in a cross-teaching manner by using complementary information from different scales to select small-loss samples as clean labels. For unselected large-loss samples, we introduce multi-scale embedding graph learning via label propagation to correct their labels by using selected clean samples. Experiments on multiple benchmark time series datasets demonstrate the superiority of the proposed Scale-teaching paradigm over state-of-the-art methods in terms of effectiveness and robustness.
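
A hedged sketch of the small-loss cross-teaching step, simplified to two networks and omitting the cross-scale fusion and label propagation; the keep ratio and update scheme are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def select_small_loss(logits, labels, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    losses = F.cross_entropy(logits, labels, reduction="none")
    k = max(1, int(keep_ratio * len(labels)))
    return torch.topk(-losses, k).indices

def cross_teaching_step(net_a, net_b, x_a, x_b, y, opt_a, opt_b, keep_ratio=0.7):
    """x_a / x_b: the same batch viewed at two different time-series scales."""
    logits_a, logits_b = net_a(x_a), net_b(x_b)
    # Each network advises its peer on which samples look cleanly labeled.
    idx_for_b = select_small_loss(logits_a.detach(), y, keep_ratio)
    idx_for_a = select_small_loss(logits_b.detach(), y, keep_ratio)
    loss_a = F.cross_entropy(logits_a[idx_for_a], y[idx_for_a])
    loss_b = F.cross_entropy(logits_b[idx_for_b], y[idx_for_b])
    for opt, loss in ((opt_a, loss_a), (opt_b, loss_b)):
        opt.zero_grad(); loss.backward(); opt.step()
    return loss_a.item(), loss_b.item()
```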

Harnessing Hard Mixed Samples with Decoupled Regularizer
Zicheng Liu Siyuan Li Ge Wang Lirong Wu Cheng Tan Stan Z. Li



Research question: This paper addresses the extra computational overhead that dynamic mixup methods incur when optimizing mixed samples.
Motivation: Although current dynamic mixup methods improve the generalization of neural networks, optimizing the mixed samples adds extra time cost.
Method: We propose an efficient mixup objective function named decoupled mixup (DM); through a decoupled regularizer, it enables static mixup methods to mine discriminative features while preserving the original smoothness of mixup.
Results: Experiments show that DM enables static mixup methods to match or even exceed the performance of dynamic methods without extra computation.

Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data. Recently, dynamic mixup methods have improved previous \textit{static} policies effectively (e.g., linear interpolation) by maximizing target-related salient regions in mixed samples, but they incur excessive additional time costs. These additional computational overheads mainly come from optimizing the mixed samples according to the mixed labels. However, we found that the extra optimizing step may be redundant because label-mismatched mixed samples are informative hard mixed samples for deep models to localize discriminative features. In this paper, we therefore propose not a more complicated dynamic mixup policy but an efficient mixup objective function with a decoupled regularizer, named decoupled mixup (DM). The primary effect is that DM can adaptively utilize those hard mixed samples to mine discriminative features without losing the original smoothness of mixup. As a result, DM enables static mixup methods to achieve comparable or even exceed the performance of dynamic methods without any extra computation. This also leads to an interesting objective design problem for mixup training that we need to focus on both smoothing the decision boundaries and identifying discriminative features. Extensive experiments on supervised and semi-supervised learning benchmarks across seven datasets validate the effectiveness of DM.
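
For context, a minimal sketch of the static mixup objective that DM builds on; DM's decoupled regularizer itself is omitted here, since its exact form is not given in this summary:

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=1.0):
    """Linear-interpolation mixup of a batch with a Beta-sampled mixing ratio."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    # Coupled objective: the mixing ratio weighs both label terms, which is
    # what makes label-mismatched mixed samples "soft" rather than hard.
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
```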

FedFed: Feature Distillation against Data Heterogeneity in Federated Learning
Zhiqin Yang Yonggang Zhang Yu Zheng Xinmei Tian Hao Peng Tongliang Liu Bo Han



Research question: In federated learning, how can we balance privacy protection against model performance?
Motivation: Data heterogeneity is a challenge in federated learning. Sharing clients' information can mitigate data heterogeneity, but may violate privacy and affect model performance.
Method: We propose FedFed, which partitions data into performance-sensitive features (contributing greatly to model performance) and performance-robust features (contributing little). The performance-sensitive features are globally shared to mitigate data heterogeneity, while the performance-robust features are kept locally.
Results: Experiments demonstrate that FedFed improves model performance while protecting privacy.

Federated learning (FL) typically faces data heterogeneity, i.e., distribution shifting among clients. Sharing clients' information has shown great potential in mitigating data heterogeneity, yet creates a dilemma between preserving privacy and promoting model performance. To alleviate the dilemma, we raise a fundamental question: Is it possible to share partial features in the data to tackle data heterogeneity? In this work, we give an affirmative answer to this question by proposing a novel approach called **Fed**erated **Fe**ature **d**istillation (FedFed). Specifically, FedFed partitions data into performance-sensitive features (i.e., greatly contributing to model performance) and performance-robust features (i.e., limitedly contributing to model performance). The performance-sensitive features are globally shared to mitigate data heterogeneity, while the performance-robust features are kept locally. FedFed enables clients to train models over local and shared data. Comprehensive experiments demonstrate the efficacy of FedFed in promoting model performance.

Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation
Binhui Xie Shuang Li qingju guo Chi Harold Liu Xinjing Cheng



Research question: How to effectively apply active learning to LiDAR semantic segmentation, where the sheer volume of point clouds makes annotation costly.
Motivation: Traditional annotation of massive LiDAR point clouds requires extensive, expensive manual labeling.
Method: We propose Annotator, an active learning baseline with a voxel-centric online selection strategy that efficiently probes and annotates the salient and exemplar voxel grids within each LiDAR scan, even under distribution shift.
Results: Annotator excels in diverse settings, especially active learning, active source-free domain adaptation, and active domain adaptation. It consistently delivers strong performance across LiDAR semantic segmentation benchmarks, in both simulation-to-real and real-to-real scenarios. Surprisingly, Annotator is remarkably efficient: labeling just five voxels per scan in the SynLiDAR → SemanticKITTI task achieves 87.8% of the fully-supervised performance.

Active learning, a label-efficient paradigm, empowers models to interactively query an oracle for labeling new data. In the realm of LiDAR semantic segmentation, the challenges stem from the sheer volume of point clouds, rendering annotation labor-intensive and cost-prohibitive. This paper presents Annotator, a general and efficient active learning baseline, in which a voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel grids within each LiDAR scan, even under distribution shift. Concretely, we first execute an in-depth analysis of several common selection strategies such as Random, Entropy, Margin, and then develop voxel confusion degree (VCD) to exploit the local topology relations and structures of point clouds. Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA). It consistently delivers exceptional performance across LiDAR semantic segmentation benchmarks, spanning both simulation-to-real and real-to-real scenarios. Surprisingly, Annotator exhibits remarkable efficiency, requiring significantly fewer annotations, e.g., just labeling five voxels per scan in the SynLiDAR → SemanticKITTI task. This results in impressive performance, achieving 87.8% of the fully-supervised performance under AL, 88.5% under ASFDA, and 94.4% under ADA. We envision that Annotator will offer a simple, general, and efficient solution for label-efficient 3D applications.

Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
Ruixiang Tang Jiayi Yuan Yiming Li Zirui Liu Rui Chen Xia Hu



Research question: Pretrained language models (PLMs) are widely used in natural language processing but are vulnerable to backdoor attacks.
Motivation: To counter backdoor attacks, we develop a backdoor-resistant fine-tuning procedure that yields a backdoor-free model regardless of whether the fine-tuning dataset contains poisoned samples.
Method: We integrate a "honeypot module" into the original PLM, designed to absorb backdoor information exclusively, and penalize the information acquired by the honeypot module to inhibit backdoor creation during fine-tuning.
Results: Comprehensive experiments on benchmark datasets show that the method reduces the attack success rate by 10% to 40% compared with prior state-of-the-art methods, demonstrating strong effectiveness and robustness.

In the field of natural language processing, the prevalent approach involves fine-tuning pretrained language models (PLMs) using local samples. Recent research has exposed the susceptibility of PLMs to backdoor attacks, wherein the adversaries can embed malicious prediction behaviors by manipulating a few training samples. In this study, our objective is to develop a backdoor-resistant tuning procedure that yields a backdoor-free model, no matter whether the fine-tuning dataset contains poisoned samples. To this end, we propose and integrate an \emph{honeypot module} into the original PLM, specifically designed to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features while carrying minimal information about the original tasks. Consequently, we can impose penalties on the information acquired by the honeypot module to inhibit backdoor creation during the fine-tuning process of the stem network. Comprehensive experiments conducted on benchmark datasets substantiate the effectiveness and robustness of our defensive strategy. Notably, these results indicate a substantial reduction in the attack success rate ranging from 10\% to 40\% when compared to prior state-of-the-art methods.

Sequential Subset Matching for Dataset Distillation
Jiawei Du Qin Shi Joey Tianyi Zhou



Research question: How to synthesize a small dataset for training deep neural networks that yields performance similar to the full real-world dataset, reducing data storage and model training costs.
Motivation: Current state-of-the-art distillation methods treat the entire synthetic dataset as a unified entity and optimize each synthetic instance equally. This static optimization induces a coupling issue within the synthetic data, especially when larger amounts of synthetic data are optimized, preventing the distilled dataset from capturing the high-level features that deep networks learn in later epochs.
Method: We propose Sequential Subset Matching (SeqMatch), a dataset distillation strategy that adaptively optimizes the synthetic data to encourage the sequential acquisition of knowledge, generating the synthetic instances sequentially to address the coupling issue.
Results: SeqMatch outperforms state-of-the-art methods on various datasets, including SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet.

Dataset distillation is a newly emerging task that synthesizes a small-size dataset used in training deep neural networks (DNNs) for reducing data storage and model training costs. The synthetic datasets are expected to capture the essence of the knowledge contained in real-world datasets such that the former yields a similar performance as the latter. Recent advancements in distillation methods have produced notable improvements in generating synthetic datasets. However, current state-of-the-art methods treat the entire synthetic dataset as a unified entity and optimize each synthetic instance equally. This static optimization approach may lead to performance degradation in dataset distillation. Specifically, we argue that static optimization can give rise to a coupling issue within the synthetic data, particularly when a larger amount of synthetic data is being optimized. This coupling issue, in turn, leads to the failure of the distilled dataset to extract the high-level features learned by the deep neural network (DNN) in the latter epochs. In this study, we propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch), which tackles this problem by adaptively optimizing the synthetic data to encourage sequential acquisition of knowledge during dataset distillation. Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances, thereby enhancing its performance significantly. Our proposed SeqMatch outperforms state-of-the-art methods in various datasets, including SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet.

SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation
Mengcheng Lan Xinjiang Wang Yiping Ke Jiaxing Xu Litong Feng Wayne Zhang



Research question: How to achieve unsupervised semantic segmentation, i.e., segmenting images into semantic groups without manual annotation.
Motivation: Existing methods rely mainly on prior knowledge of semantic consistency or priori concepts from self-supervised learning, often overlooking the coherence property of image segments.
Method: We propose SmooSeg, which harnesses self-supervised learning to model the closeness relationships among observations as smoothness signals, and introduces a novel smoothness loss that promotes piecewise smoothness within segments while preserving discontinuities across different segments.
Results: Thanks to the rich supervision cues of the smoothness prior, SmooSeg significantly outperforms STEGO on three datasets: COCOStuff (+14.9%), Cityscapes (+13.0%), and Potsdam-3 (+5.7%).

Unsupervised semantic segmentation is a challenging task that segments images into semantic groups without manual annotation. Prior works have primarily focused on leveraging prior knowledge of semantic consistency or priori concepts from self-supervised learning methods, which often overlook the coherence property of image segments. In this paper, we demonstrate that the smoothness prior, asserting that close features in a metric space share the same semantics, can significantly simplify segmentation by casting unsupervised semantic segmentation as an energy minimization problem. Under this paradigm, we propose a novel approach called SmooSeg that harnesses self-supervised learning methods to model the closeness relationships among observations as smoothness signals. To effectively discover coherent semantic segments, we introduce a novel smoothness loss that promotes piecewise smoothness within segments while preserving discontinuities across different segments. Additionally, to further enhance segmentation quality, we design an asymmetric teacher-student style predictor that generates smoothly updated pseudo labels, facilitating an optimal fit between observations and labeling outputs. Thanks to the rich supervision cues of the smoothness prior, our SmooSeg significantly outperforms STEGO in terms of pixel accuracy on three datasets: COCOStuff (+14.9\%), Cityscapes (+13.0\%), and Potsdam-3 (+5.7\%).

Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
Feiyang Kang Hoang Anh Just Anit Kumar Sahu Ruoxi Jia



Research question: In practical data exchange scenarios, data providers often reveal only a limited set of samples before an acquisition decision is made.
Motivation: Existing scaling functions for predicting model performance are usually black-box, computationally expensive to fit, prone to overfitting, or difficult to optimize for data selection.
Method: This paper proposes a framework named projektor, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources.
Results: Experiments show that projektor significantly outperforms existing performance scaling approaches in both the accuracy of performance prediction and the computational cost of constructing the performance predictor, and substantially surpasses off-the-shelf solutions in data selection effectiveness.

Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling functions that predict model performance at any *size and data source composition* using the limited available samples. However, these scaling functions are usually black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework called projektor, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications (e.g., vision, text, fine-tuning, noisy data sources, etc.) demonstrates that projektor significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor. Also, projektor outperforms a range of other off-the-shelf solutions by a wide margin in data selection effectiveness. We provide projektor as an open-source toolkit.

FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning
Zhuo Huang Li Shen Jun Yu Bo Han Tongliang Liu



Research question: Existing semi-supervised learning methods are typically based on instance-wise consistency between different data transformations, so the label guidance on labeled data is hard to propagate to unlabeled data, harming the learning process and generalization performance.
Motivation: To address this, we propose a new method, FlatMatch, which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets.
Method: Specifically, we first increase the empirical risk on labeled data to obtain a worst-case model, then leverage the richness of unlabeled data to penalize the prediction difference (i.e., cross-sharpness) between the worst-case model and the original model, so that the learning direction benefits generalization on unlabeled data.
Results: Through comprehensive validation, we show that FlatMatch achieves state-of-the-art results in many semi-supervised learning settings, effectively exploiting unlabeled data and improving SSL performance.

Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data. However, most SSL methods are commonly based on instance-wise consistency between different data transformations. Therefore, the label guidance on labeled data is hard to be propagated to unlabeled data. Consequently, the learning process on labeled data is much faster than on unlabeled data, so the model is likely to fall into a local minimum that does not favor unlabeled data, leading to sub-optimal generalization performance. In this paper, we propose FlatMatch which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets. Specifically, we increase the empirical risk on labeled data to obtain a worst-case model which is a failure case needing to be enhanced. Then, by leveraging the richness of unlabeled data, we penalize the prediction difference (i.e., cross-sharpness) between the worst-case model and the original model so that the learning direction is beneficial to generalization on unlabeled data. Therefore, we can calibrate the learning process without being limited to insufficient label information. As a result, the mismatched learning performance can be mitigated, further enabling the effective exploitation of unlabeled data and improving SSL performance. Through comprehensive validation, we show FlatMatch achieves state-of-the-art results in many SSL settings.
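
A hedged sketch of the cross-sharpness idea: perturb the weights along the labeled-risk gradient to obtain a worst-case model, penalize its prediction gap to the original model on unlabeled data, and restore the weights. The SAM-style ascent step and KL penalty are illustrative assumptions, not the authors' exact procedure:

```python
import torch
import torch.nn.functional as F

def flatmatch_step(model, optimizer, x_l, y_l, x_u, rho=0.05, lam=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    # 1) Gradient of the labeled risk, used to build the worst-case model.
    loss_l = F.cross_entropy(model(x_l), y_l)
    grads = torch.autograd.grad(loss_l, params)
    scale = (rho / (torch.cat([g.flatten() for g in grads]).norm() + 1e-12)).item()
    with torch.no_grad():
        p_orig = F.softmax(model(x_u), dim=-1)      # original model's predictions
        for p, g in zip(params, grads):
            p.add_(g, alpha=scale)                  # ascend to the worst case
    # 2) Cross-sharpness: prediction gap on unlabeled data at the worst case.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_l), y_l) + lam * F.kl_div(
        F.log_softmax(model(x_u), dim=-1), p_orig, reduction="batchmean")
    loss.backward()                                 # gradients at perturbed point
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=scale)                  # restore the original weights
    optimizer.step()
    return loss.item()
```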

Augmented Memory Replay-based Continual Learning Approaches for Network Intrusion Detection
Suresh kumar Amalapuram Sumohana S. Channappayya Bheemarjuna Tamma



Research question: This paper improves continual-learning-based intrusion detection to address class imbalance and scalability.
Motivation: Intrusion detection is a form of anomalous activity detection in communication network traffic; continual learning (CL) methods can accumulate old knowledge while adapting to the latest threat knowledge.
Method: First, we extend class balancing reservoir sampling (CBRS), a memory-based CL method, to address severe class imbalance in large datasets. Second, we propose perturbation assistance for parameter approximation (PAPA), a novel approach based on the Gaussian mixture model, to reduce the number of virtual stochastic gradient descent (SGD) parameter computations needed to discover maximally interfering samples.
Results: Experiments show that the proposed methods perform markedly better than the baselines on standard intrusion detection benchmarks (KDDCUP'99, NSL-KDD, CICIDS-2017/2018, UNSW-NB15, and CTU-13) as well as over a longer period with distribution shift (AnoShift). We also validate them on standard continual learning benchmarks (SVHN, CIFAR-10/100, and CLEAR-10/100) and anomaly detection benchmarks (SMAP, SMD, and MSL). Moreover, PAPA significantly reduces the number of virtual SGD update operations, saving 12% to 40% of training time compared with the maximally interfered samples retrieval algorithm.

Intrusion detection is a form of anomalous activity detection in communication network traffic. Continual learning (CL) approaches to the intrusion detection task accumulate old knowledge while adapting to the latest threat knowledge. Previous works have shown the effectiveness of memory replay-based CL approaches for this task. In this work, we present two novel contributions to improve the performance of CL-based network intrusion detection in the context of class imbalance and scalability. First, we extend class balancing reservoir sampling (CBRS), a memory-based CL method, to address the problems of severe class imbalance for large datasets. Second, we propose a novel approach titled perturbation assistance for parameter approximation (PAPA) based on the Gaussian mixture model to reduce the number of \textit{virtual stochastic gradient descent (SGD) parameter} computations needed to discover maximally interfering samples for CL. We demonstrate that the proposed approaches perform remarkably better than the baselines on standard intrusion detection benchmarks created over shorter periods (KDDCUP'99, NSL-KDD, CICIDS-2017/2018, UNSW-NB15, and CTU-13) and a longer period with distribution shift (AnoShift). We also validated proposed approaches on standard continual learning benchmarks (SVHN, CIFAR-10/100, and CLEAR-10/100) and anomaly detection benchmarks (SMAP, SMD, and MSL). Further, the proposed PAPA approach significantly lowers the number of virtual SGD update operations, thus resulting in training time savings in the range of 12 to 40\% compared to the maximally interfered samples retrieval algorithm.

AdaptSSR: Pre-training User Model with Augmentation-Adaptive Self-Supervised Ranking
Yang Yu Qi Liu Kai Zhang Yuren Zhang Chao Song Min Hou Yuqing Yuan ZHIhao Ye ZAIXI ZHANG Sanshi Lei Yu



Research question: User model training relies on task-specific labeled data and suffers from the data sparsity issue.
Motivation: Because user interests are diverse and user behaviors are noisy, existing data augmentation methods may lose certain characteristics of the user or introduce noisy behaviors, causing negative transfer in pre-trained user models.
Method: We propose a new pretext task, Augmentation-Adaptive Self-Supervised Ranking (AdaptSSR), which relaxes the requirement of semantic consistency between augmented views while pre-training a discriminative user model. Specifically, a multiple pairwise ranking loss trains the user model to capture the similarity order among the implicitly augmented view, the explicitly augmented view, and views from other users.
Results: Extensive experiments on public and industrial datasets with six downstream tasks verify the effectiveness of AdaptSSR.

User modeling, which aims to capture users' characteristics or interests, heavily relies on task-specific labeled data and suffers from the data sparsity issue. Several recent studies tackled this problem by pre-training the user model on massive user behavior sequences with a contrastive learning task. Generally, these methods assume different views of the same behavior sequence constructed via data augmentation are semantically consistent, i.e., reflecting similar characteristics or interests of the user, and thus maximizing their agreement in the feature space. However, due to the diverse interests and heavy noise in user behaviors, existing augmentation methods tend to lose certain characteristics of the user or introduce noisy behaviors. Thus, forcing the user model to directly maximize the similarity between the augmented views may result in a negative transfer. To this end, we propose to replace the contrastive learning task with a new pretext task: Augmentation-Adaptive Self-Supervised Ranking (AdaptSSR), which alleviates the requirement of semantic consistency between the augmented views while pre-training a discriminative user model. Specifically, we adopt a multiple pairwise ranking loss which trains the user model to capture the similarity orders between the implicitly augmented view, the explicitly augmented view, and views from other users. We further employ an in-batch hard negative sampling strategy to facilitate model training. Moreover, considering the distinct impacts of data augmentation on different behavior sequences, we design an augmentation-adaptive fusion mechanism to automatically adjust the similarity order constraint applied to each sample based on the estimated similarity between the augmented views. Extensive experiments on both public and industrial datasets with six downstream tasks verify the effectiveness of AdaptSSR.
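
A hedged sketch of a multiple pairwise ranking loss enforcing the similarity order implicit view >= explicit view >= other users' views; the cosine similarity and softplus surrogate are illustrative choices, not the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def multiple_pairwise_ranking_loss(anchor, implicit, explicit, negatives):
    """anchor/implicit/explicit: (B, D); negatives: (B, K, D)."""
    s_imp = F.cosine_similarity(anchor, implicit, dim=-1)                # (B,)
    s_exp = F.cosine_similarity(anchor, explicit, dim=-1)                # (B,)
    s_neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # (B, K)
    # Penalize any violation of the order s_imp >= s_exp >= s_neg.
    loss_order = F.softplus(s_exp - s_imp).mean()
    loss_neg = F.softplus(s_neg - s_exp.unsqueeze(1)).mean()
    return loss_order + loss_neg
```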

Diffusion-Based Probabilistic Uncertainty Estimation for Active Domain Adaptation
Zhekai Du Jingjing Li



Research question: How to assist domain adaptation by actively annotating a small number of target samples, addressing the domain shift that traditional active learning cannot handle.
Motivation: Most existing active domain adaptation (ADA) methods focus on measuring the representativeness of target samples while neglecting uncertainty estimation.
Method: We propose a probabilistic framework that captures both data-level and prediction-level uncertainties, using variational inference to approximate the joint posterior distribution of latent representations and model predictions.
Results: Experiments show that the method outperforms previous ADA methods in both the ADA and source-free ADA settings, provides better-calibrated predictions, and achieves favorable performance on three domain adaptation datasets.

Active Domain Adaptation (ADA) has emerged as an attractive technique for assisting domain adaptation by actively annotating a small subset of target samples. Most ADA methods focus on measuring the target representativeness beyond traditional active learning criteria to handle the domain shift problem, while leaving the uncertainty estimation to be performed by an uncalibrated deterministic model. In this work, we introduce a probabilistic framework that captures both data-level and prediction-level uncertainties beyond a point estimate. Specifically, we use variational inference to approximate the joint posterior distribution of latent representation and model prediction. The variational objective of labeled data can be formulated by a variational autoencoder and a latent diffusion classifier, and the objective of unlabeled data can be implemented in a knowledge distillation framework. We utilize adversarial learning to ensure an invariant latent space. The resulting diffusion classifier enables efficient sampling of all possible predictions for each individual sample to recover the predictive distribution. We then apply a t-test-based criterion to the sampled predictions and select informative unlabeled target samples based on the p-value, which encodes both prediction variability and cross-category ambiguity. Experiments on both ADA and Source-Free ADA settings show that our method provides more calibrated predictions than previous ADA methods and achieves favorable performance on three domain adaptation datasets.
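
A hedged sketch of one reading of the t-test-based criterion: draw several stochastic predictions per target sample, test whether the top class reliably beats the runner-up, and query the samples with the largest p-values. Illustrative only, not the authors' exact statistic:

```python
import numpy as np
from scipy import stats

def selection_pvalues(sampled_probs):
    """sampled_probs: (S, N, C), S stochastic predictions for N samples."""
    mean = sampled_probs.mean(axis=0)                 # (N, C)
    order = np.argsort(-mean, axis=1)
    top, second = order[:, 0], order[:, 1]
    n = np.arange(mean.shape[0])
    a = sampled_probs[:, n, top]                      # (S, N) top-class draws
    b = sampled_probs[:, n, second]                   # (S, N) runner-up draws
    # Paired t-test per sample; a high p-value means an ambiguous prediction.
    return stats.ttest_rel(a, b, axis=0).pvalue

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(20, 10))      # S=20 draws, N=10, C=4
query_idx = np.argsort(-selection_pvalues(probs))[:3] # annotate 3 most ambiguous
```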

Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval
Xiaoshuai Hao Wanqian Zhang



Research question: This paper addresses the challenge of Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), where training (source) data and testing (target) data come from different domains.
Motivation: Existing methods are mostly derived from classification-based domain adaptation approaches, which are neither multi-modal nor suitable for retrieval. Moreover, for the pairwise misalignment problem in the target domain, i.e., the absence of pairwise annotations between target videos and texts, existing methods assume that one video corresponds to one text, whereas in practice one text usually corresponds to multiple videos and vice versa.
Method: We propose a novel method named Uncertainty-aware Alignment Network (UAN). Specifically, we first introduce a multimodal mutual information module to smoothly minimize the domain shift. To tackle the multimodal, uncertain pairwise misalignment in the target domain, we propose the Uncertainty-aware Alignment Mechanism (UAM) to fully exploit the semantic information of both modalities in the target domain.
Results: Extensive experiments on domain-adaptive video-text retrieval show that the proposed method consistently outperforms multiple baselines, demonstrating superior generalization to target data.

Video-text retrieval is an important but challenging research task in the multimedia community. In this paper, we address the challenge task of Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), assuming that training (source) data and testing (target) data are from different domains. Previous approaches are mostly derived from classification based domain adaptation methods, which are neither multi-modal nor suitable for retrieval task. In addition, regarding the pairwise misalignment issue in the target domain, i.e., the absence of pairwise annotations between target videos and texts, existing methods assume that one video corresponds to one text. Yet we empirically find that in the real scene, one text usually corresponds to multiple videos and vice versa. To tackle this one-to-many issue, we propose a novel method named Uncertainty-aware Alignment Network (UAN). Specifically, we first introduce the multimodal mutual information module to balance the minimization of domain shift in a smooth manner. To tackle the multimodal, uncertain pairwise misalignment in the target domain, we propose the Uncertainty-aware Alignment Mechanism (UAM) to fully exploit the semantic information of both modalities in the target domain. Extensive experiments in the context of domain-adaptive video-text retrieval demonstrate that our proposed method consistently outperforms multiple baselines, showing a superior generalization ability for target data.

Collaborative Learning via Prediction Consensus
Dongyang Fan Celestine Mendler-Dünner Martin Jaggi



Research question: How to improve an individual model's performance by leveraging the expertise of other models.
Motivation: In a collaborative learning setting, each model seeks to improve its own performance by drawing on the knowledge of its collaborators.
Method: We propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to the method is a trust weighting scheme that adaptively weighs each collaborator's influence on the pseudo-labels until a consensus on how to label the auxiliary data is reached.
Results: Experiments show that this collaboration scheme significantly boosts individual models' performance in the target domain while provably mitigating the negative impact of bad models on the collective. The method also accommodates heterogeneity in model architectures and substantially reduces communication overhead compared with typical collaborative learning methods.

We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme that serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate empirically that our collaboration scheme is able to significantly boost individual models’ performance in the target domain from which the auxiliary data is sampled. At the same time, it can provably mitigate the negative impact of bad models on the collective. By design, our method adeptly accommodates heterogeneity in model architectures and substantially reduces communication overhead compared to typical collaborative learning methods.
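
A minimal sketch of trust-weighted pseudo-label consensus, iterating between a weighted-average consensus and agreement-based trust updates; the exponential re-weighting rule is an illustrative assumption, not the authors' scheme:

```python
import numpy as np

def consensus_pseudo_labels(probs, n_iters=10, temp=5.0):
    """probs: (M, N, C) class probabilities from M collaborators on N samples."""
    M = probs.shape[0]
    trust = np.full(M, 1.0 / M)
    for _ in range(n_iters):
        consensus = np.einsum("m,mnc->nc", trust, probs)   # trust-weighted average
        # How often each collaborator agrees with the current consensus labels.
        agree = (probs.argmax(-1) == consensus.argmax(-1)).mean(axis=1)  # (M,)
        trust = np.exp(temp * agree)
        trust /= trust.sum()
    return consensus.argmax(-1), trust

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(4, 8))   # 4 models, 8 samples, 3 classes
pseudo_labels, trust = consensus_pseudo_labels(probs)
```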

Revisit the Power of Vanilla Knowledge Distillation: from Small Scale to Large Scale
Zhiwei Hao Jianyuan Guo Kai Han Han Hu Chang Xu Yunhe Wang



Research question: This paper examines the rationality of designing knowledge distillation (KD) methods based solely on small-scale datasets.
Motivation: Existing KD methods underestimate the power of vanilla KD on large-scale datasets such as ImageNet-1K.
Method: By employing stronger data augmentation techniques and larger datasets, the gap between vanilla KD and other meticulously designed KD variants can be directly reduced.
Results: Without additional complex designs, ResNet-50, ViT-S, and ConvNeXtV2-T models achieve 83.1%, 84.3%, and 85.0% top-1 accuracy on ImageNet, respectively.

The tremendous success of large models trained on extensive datasets demonstrates that scale is a key ingredient in achieving superior results. Therefore, the reflection on the rationality of designing knowledge distillation (KD) approaches for limited-capacity architectures solely based on small-scale datasets is now deemed imperative. In this paper, we identify the small data pitfall that presents in previous KD methods, which results in the underestimation of the power of vanilla KD framework on large-scale datasets such as ImageNet-1K. Specifically, we show that employing stronger data augmentation techniques and using larger datasets can directly decrease the gap between vanilla KD and other meticulously designed KD variants. This highlights the necessity of designing and evaluating KD approaches in the context of practical scenarios, casting off the limitations of small-scale datasets. Our investigation of the vanilla KD and its variants in more complex schemes, including stronger training strategies and different model capacities, demonstrates that vanilla KD is elegantly simple but astonishingly effective in large-scale scenarios. Without bells and whistles, we obtain state-of-the-art ResNet-50, ViT-S, and ConvNeXtV2-T models for ImageNet, which achieve 83.1%, 84.3%, and 85.0% top-1 accuracy, respectively. PyTorch code and checkpoints can be found at https://github.com/Hao840/vanillaKD.

Evaluating Robustness and Uncertainty of Graph Models Under Structural Distributional Shifts
Gleb Bazhenov Denis Kuznedelev Andrey Malinin Artem Babenko Liudmila Prokhorenkova



Research question: In reliable decision-making systems based on machine learning, models must be robust to distributional shifts or provide the uncertainty of their predictions.
Motivation: In node-level problems of graph learning, distributional shifts can be especially complex because the samples are interdependent; evaluating graph models therefore requires testing on diverse and meaningful distributional shifts.
Method: We propose a general approach for inducing diverse distributional shifts based on graph structure, and use it to create data splits according to several structural node properties: popularity, locality, and density.
Results: Experiments show that the proposed distributional shifts are quite challenging for existing graph models, and that simple models often outperform more sophisticated methods on the considered structural shifts. Finally, the experiments provide evidence of a trade-off between the quality of representations learned for the base classification task under structural distributional shift and the ability to separate nodes from different distributions using those representations.

In reliable decision-making systems based on machine learning, models have to be robust to distributional shifts or provide the uncertainty of their predictions. In node-level problems of graph learning, distributional shifts can be especially complex since the samples are interdependent. To evaluate the performance of graph models, it is important to test them on diverse and meaningful distributional shifts. However, most graph benchmarks considering distributional shifts for node-level problems focus mainly on node features, while structural properties are also essential for graph problems. In this work, we propose a general approach for inducing diverse distributional shifts based on graph structure. We use this approach to create data splits according to several structural node properties: popularity, locality, and density. In our experiments, we thoroughly evaluate the proposed distributional shifts and show that they can be quite challenging for existing graph models. We also reveal that simple models often outperform more sophisticated methods on the considered structural shifts. Finally, our experiments provide evidence that there is a trade-off between the quality of learned representations for the base classification task under structural distributional shift and the ability to separate the nodes from different distributions using these representations.

R-divergence for Estimating Model-oriented Distribution Discrepancy
Zhilin Zhao Longbing Cao



Research question: Real-life data are often non-IID due to complex distributions and interactions, and learning models differ in their sensitivity to the sample distribution; a key question for any supervised or unsupervised model is therefore whether the probability distributions of two given datasets can be considered identical.
Motivation: To address this, we introduce R-divergence for assessing model-oriented distribution discrepancy; the core insight is that two distributions are likely identical if their optimal hypothesis yields the same expected risk on each.
Method: R-divergence learns a minimum hypothesis on the mixed data and then gauges the empirical risk difference between the two datasets to estimate their distribution discrepancy.
Results: Evaluations of test power across various unsupervised and supervised tasks show that R-divergence achieves state-of-the-art performance; to demonstrate its practicality, R-divergence is used to train robust neural networks on samples with noisy labels.

Real-life data are often non-IID due to complex distributions and interactions, and the sensitivity to the distribution of samples can differ among learning models. Accordingly, a key question for any supervised or unsupervised model is whether the probability distributions of two given datasets can be considered identical. To address this question, we introduce R-divergence, designed to assess model-oriented distribution discrepancies. The core insight is that two distributions are likely identical if their optimal hypothesis yields the same expected risk for each distribution. To estimate the distribution discrepancy between two datasets, R-divergence learns a minimum hypothesis on the mixed data and then gauges the empirical risk difference between them. We evaluate the test power across various unsupervised and supervised tasks and find that R-divergence achieves state-of-the-art performance. To demonstrate the practicality of R-divergence, we employ R-divergence to train robust neural networks on samples with noisy labels.
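The abstract's recipe (fit one minimum hypothesis on the pooled data, then compare per-dataset empirical risks) is simple enough to sketch. In this hedged toy version, logistic regression and log-loss stand in for whatever model class and risk the task at hand actually uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def r_divergence_estimate(X1, y1, X2, y2):
    """Sketch of the R-divergence recipe: fit one 'minimum hypothesis' on
    the pooled data, then compare the empirical risks it attains on each
    dataset separately."""
    X_mix = np.vstack([X1, X2])
    y_mix = np.concatenate([y1, y2])
    h = LogisticRegression(max_iter=1000).fit(X_mix, y_mix)
    risk1 = log_loss(y1, h.predict_proba(X1), labels=h.classes_)
    risk2 = log_loss(y2, h.predict_proba(X2), labels=h.classes_)
    return abs(risk1 - risk2)

# Two samples from the same distribution should give a small value.
rng = np.random.default_rng(0)
Xa, ya = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
Xb, yb = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
print(r_divergence_estimate(Xa, ya, Xb, yb))
```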

DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable Kendall's Rank Correlation
Kaipeng Zheng Huishuai Zhang Weiran Huang



Research question: How to determine feature-channel importance on novel tasks more accurately, especially in few-shot learning.
Motivation: Standard few-shot methods rely on geometric similarity metrics (such as cosine similarity and negative Euclidean distance) to gauge the semantic relatedness between two features, but features with high geometric similarity may carry distinct semantics.
Method: Use Kendall's rank correlation over the importance ranking of feature channels in place of geometric similarity metrics; to address the non-differentiability of Kendall's rank correlation during training, propose a carefully designed differentiable loss for meta-training.
Results: The rank-correlation-based approach yields significant improvements across methods and datasets, can be integrated into many existing few-shot methods, and is ready to combine with future state-of-the-art methods that rely on geometric similarity metrics.

Few-shot learning aims to adapt models trained on the base dataset to novel tasks where the categories were not seen by the model before. This often leads to a relatively uniform distribution of feature values across channels on novel classes, posing challenges in determining channel importance for novel tasks. Standard few-shot learning methods employ geometric similarity metrics such as cosine similarity and negative Euclidean distance to gauge the semantic relatedness between two features. However, features with high geometric similarities may carry distinct semantics, especially in the context of few-shot learning. In this paper, we demonstrate that the importance ranking of feature channels is a more reliable indicator for few-shot learning than geometric similarity metrics. We observe that replacing the geometric similarity metric with Kendall’s rank correlation only during inference is able to improve the performance of few-shot learning across a wide range of methods and datasets with different domains. Furthermore, we propose a carefully designed differentiable loss for meta-training to address the non-differentiability issue of Kendall’s rank correlation. By replacing geometric similarity with differentiable Kendall’s rank correlation, our method can integrate with numerous existing few-shot approaches and is ready for integrating with future state-of-the-art methods that rely on geometric similarity metrics. Extensive experiments validate the efficacy of the rank-correlation-based approach, showcasing a significant improvement in few-shot learning.
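A common way to make Kendall's rank correlation differentiable, as the meta-training loss here requires, is to relax the hard sign over pairwise differences with a tanh. The sketch below illustrates that idea only; it should not be read as the paper's exact loss, and the sharpness parameter `alpha` is a hypothetical default.

```python
import torch

def soft_kendall_tau(x, y, alpha=10.0):
    """Smooth surrogate for Kendall's rank correlation between two feature
    vectors: the hard sign() over pairwise differences is relaxed with
    tanh(alpha * diff) so the statistic is differentiable."""
    dx = x.unsqueeze(0) - x.unsqueeze(1)      # pairwise differences of x
    dy = y.unsqueeze(0) - y.unsqueeze(1)      # pairwise differences of y
    concordance = torch.tanh(alpha * dx) * torch.tanh(alpha * dy)
    n = x.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool)    # exclude i == j pairs
    return concordance[mask].mean()

# Perfectly concordant channel rankings give a value near 1.
x = torch.arange(5.0, requires_grad=True)
tau = soft_kendall_tau(x, torch.arange(5.0) * 2 + 1)
tau.backward()  # gradients flow, unlike with the hard Kendall statistic
```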

DRAUC: An Instance-wise Distributionally Robust AUC Optimization Framework
Siran Dai Qianqian Xu Zhiyong Yang Xiaochun Cao Qingming Huang



Research question: How to optimize the AUC metric under distribution shift.
Motivation: Existing methods mostly assume that training and test samples are drawn i.i.d. from the same distribution, which is often unachievable in practice.
Method: Propose an instance-wise surrogate loss for Distributionally Robust AUC (DRAUC) and build an optimization framework on top of it; noting that conventional DRAUC may induce label bias, further propose distribution-aware DRAUC as a more suitable metric for robust AUC learning.
Results: Theoretically, the gap between training loss and testing error diminishes when the training set is sufficiently large; experiments on corrupted benchmark datasets demonstrate the effectiveness of the method.

The Area Under the ROC Curve (AUC) is a widely employed metric in long-tailed classification scenarios. Nevertheless, most existing methods primarily assume that training and testing examples are drawn i.i.d. from the same distribution, which is often unachievable in practice. Distributionally Robust Optimization (DRO) enhances model performance by optimizing it for the local worst-case scenario, but directly integrating AUC optimization with DRO results in an intractable optimization problem. To tackle this challenge, methodically we propose an instance-wise surrogate loss of Distributionally Robust AUC (DRAUC) and build our optimization framework on top of it. Moreover, we highlight that conventional DRAUC may induce label bias, hence introducing distribution-aware DRAUC as a more suitable metric for robust AUC learning. Theoretically, we affirm that the generalization gap between the training loss and testing error diminishes if the training set is sufficiently large. Empirically, experiments on corrupted benchmark datasets demonstrate the effectiveness of our proposed method. Code is available at: https://github.com/EldercatSAM/DRAUC.

Improving Adversarial Robustness via Information Bottleneck Distillation
Huafeng Kuang Hong Liu YONGJIAN WU Shin'ichi Satoh Rongrong Ji



Research question: Optimizing the information bottleneck to improve the robustness of deep neural networks.
Motivation: Exploit prior knowledge from a robust pre-trained model to boost the information bottleneck.
Method: Propose an Information Bottleneck Distillation approach with two strategies: robust soft-label distillation to increase the mutual information between latent features and output predictions, and adaptive feature distillation that automatically transfers relevant knowledge from the teacher to the student, thereby reducing the mutual information between inputs and latent features.
Results: Extensive experiments show significantly improved adversarial robustness against state-of-the-art attackers such as PGD-attack and AutoAttack.

Previous studies have shown that optimizing the information bottleneck can significantly improve the robustness of deep neural networks. Our study closely examines the information bottleneck principle and proposes an Information Bottleneck Distillation approach. This specially designed, robust distillation technique utilizes prior knowledge obtained from a robust pre-trained model to boost information bottlenecks. Specifically, we propose two distillation strategies that align with the two optimization processes of the information bottleneck. Firstly, we use a robust soft-label distillation method to increase the mutual information between latent features and output prediction. Secondly, we introduce an adaptive feature distillation method that automatically transfers relevant knowledge from the teacher model to the student model, thereby reducing the mutual information between the input and latent features. We conduct extensive experiments to evaluate our approach's robustness against state-of-the-art adversarial attackers such as PGD-attack and AutoAttack. Our experimental results demonstrate the effectiveness of our approach in significantly improving adversarial robustness. Our code is available at https://github.com/SkyKuang/IBD.

Imbalanced Mixed Linear Regression
Pini Zilber Boaz Nadler



Research question: This paper studies mixed linear regression (MLR), where each observed sample belongs to one of K unknown linear models.
Motivation: In practice, the mixture of the K models may be imbalanced, with significantly different sample counts per model; most MLR methods perform poorly in such settings, calling for a new approach.
Method: Propose Mix-IRLS, a novel, simple, and fast algorithm that handles both balanced and imbalanced mixtures; unlike popular approaches that recover the K models simultaneously, Mix-IRLS recovers them sequentially using tools from robust regression.
Results: Beyond imbalanced mixtures, Mix-IRLS succeeds in several additional settings where other methods fail, including small sample sizes, the presence of outliers, and an unknown number of models K; it outperforms competing methods on several real-world datasets, sometimes by a large margin, and a derived recovery guarantee further highlights its advantage on imbalanced mixtures.

We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the mixture of the $K$ models may be imbalanced with a significantly different number of samples from each model. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, in this work we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS does it sequentially using tools from robust regression. Empirically, beyond imbalanced mixtures, Mix-IRLS succeeds in a broad range of additional settings where other methods fail, including small sample sizes, presence of outliers, and an unknown number of models $K$. Furthermore, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.
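The sequential recover-then-peel idea can be illustrated with a Huber-weighted IRLS inner loop. The weight rule and the peeling quantile below are illustrative choices, not the paper's (Mix-IRLS adapts these and comes with recovery guarantees):

```python
import numpy as np

def irls_robust_fit(X, y, iters=20, delta=1.0):
    """Iteratively reweighted least squares with Huber-type weights, a
    standard robust-regression workhorse (not the paper's exact rule)."""
    w = np.ones(len(y))
    for _ in range(iters):
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
        r = np.abs(y - X @ beta)
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
    return beta

def sequential_mixture_fit(X, y, K, peel_quantile=0.9):
    """Sketch of sequential recovery: robustly fit one model, peel off its
    best-fitting samples, and repeat on the remainder."""
    X_rem, y_rem, models = X.copy(), y.copy(), []
    for _ in range(K):
        beta = irls_robust_fit(X_rem, y_rem)
        models.append(beta)
        r = np.abs(y_rem - X_rem @ beta)
        keep = r > np.quantile(r, peel_quantile)  # drop the inliers
        X_rem, y_rem = X_rem[keep], y_rem[keep]
    return models

# Imbalanced toy mixture: 90% of samples follow slope 2, 10% follow slope -3.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
y = np.where(rng.random(300) < 0.9, 2.0, -3.0) * X[:, 0] + 0.01 * rng.normal(size=300)
print(sequential_mixture_fit(X, y, K=2))  # approximately [2.] then [-3.]
```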

BIOT: Biosignal Transformer for Cross-data Learning in the Wild
Chaoqi Yang M Brandon Westover Jimeng Sun



Research question: This paper aims to develop a flexible biosignal encoder architecture that can be pre-trained across datasets and fine-tuned on downstream biosignal tasks with different formats.
Motivation: Current deep learning models for biosignals (based on CNNs, RNNs, and Transformers) are typically specialized for specific datasets and clinical settings, limiting their broader applicability.
Method: Propose the Biosignal Transformer (BIOT), which tokenizes different biosignals into a unified "sentence" structure, enabling the model to handle mismatched channels, variable lengths, and missing values.
Results: BIOT performs strongly across EEG, ECG, and human activity sensing settings, demonstrating its effectiveness across diverse data formats.

Biological signals, such as electroencephalograms (EEG), play a crucial role in numerous clinical applications, exhibiting diverse data formats and quality profiles. Current deep learning models for biosignals (based on CNN, RNN, and Transformers) are typically specialized for specific datasets and clinical settings, limiting their broader applicability. This paper explores the development of a flexible biosignal encoder architecture that can enable pre-training on multiple datasets and fine-tuning on downstream biosignal tasks with different formats. To overcome the unique challenges associated with biosignals of various formats, such as mismatched channels, variable sample lengths, and prevalent missing values, we propose Biosignal Transformer (BIOT). The proposed BIOT model can enable cross-data learning with mismatched channels, variable lengths, and missing values by tokenizing different biosignals into a unified "sentence" structure. Specifically, we tokenize each channel separately into fixed-length segments containing local signal features and then rearrange the segments to form a long "sentence". Channel embeddings and relative position embeddings are added to each segment (viewed as a "token") to preserve spatio-temporal features. The BIOT model is versatile and applicable to various biosignal learning settings across different datasets, including joint pre-training for larger models. Comprehensive evaluations on EEG, electrocardiogram (ECG), and human activity sensory signals demonstrate that BIOT outperforms robust baselines in common settings and facilitates learning across multiple datasets with different formats. Using the CHB-MIT seizure detection task as an example, our vanilla BIOT model shows a 3% improvement over baselines in balanced accuracy, and the pre-trained BIOT models (optimized from other data sources) can further bring up to 4% improvements. Our repository is public at https://github.com/ycq091044/BIOT.
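The tokenization step that makes cross-format learning possible is easy to picture: cut each channel into fixed-length segments and flatten them into one token sequence. A minimal sketch, with the embedding layers omitted and a hypothetical segment length:

```python
import torch

def biosignal_to_tokens(signal, seg_len=256):
    """Sketch of channel-wise tokenization: each channel is cut into
    fixed-length segments ('tokens'), then all segments are concatenated
    into one long 'sentence', regardless of how many channels or samples
    the recording has. Channel/position embeddings are omitted."""
    tokens = []
    for ch in range(signal.shape[0]):                 # channel count may vary
        segs = signal[ch].unfold(0, seg_len, seg_len)  # (n_segments, seg_len)
        tokens.append(segs)
    return torch.cat(tokens, dim=0)                   # flat token sequence

# Recordings with different channel counts and lengths both become
# ordinary token sequences of shape (n_tokens, seg_len).
tok_a = biosignal_to_tokens(torch.randn(16, 2000))   # 16-channel EEG-like
tok_b = biosignal_to_tokens(torch.randn(2, 5120))    # 2-channel ECG-like
```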

What Knowledge Gets Distilled in Knowledge Distillation?
Utkarsh Ojha Yuheng Li Anirudh Sundara Rajan Yingyu Liang Yong Jae Lee



Research question: What knowledge gets distilled in knowledge distillation, and in what ways does the student become similar to the teacher?
Motivation: Despite continual improvements to knowledge distillation techniques, there remains a glaring gap in the fundamental understanding of the process.
Method: A comprehensive study of existing methods to probe how knowledge distillation operates.
Results: Existing methods can indeed indirectly distill properties beyond task performance; the work studies why this happens and shows that the findings have practical implications.

Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions. We show that existing methods can indeed indirectly distill these properties beyond improving task performance. We further study why knowledge distillation might work this way, and show that our findings have practical implications as well.

FouriDown: Factoring Down-Sampling into Shuffling and Superposing
Qi Zhu Man Zhou Jie Huang Naishan Zheng Hongzhi Gao Chongyi Li Yuan Xu Feng Zhao



Research question: This study revisits the working mechanism of spatial down-sampling techniques and analyzes the biased effects caused by the static weighting strategy employed in previous approaches.
Motivation: To overcome this limitation, we propose FouriDown, a new down-sampling paradigm that unifies existing down-sampling techniques in the Fourier domain.
Method: Drawing inspiration from the signal sampling theorem, we parameterize the non-parametric, statically weighted down-sampling operator as a learnable, context-adaptive operator within a unified Fourier function.
Results: Extensive experiments on image de-blurring and low-light image enhancement consistently show that FouriDown delivers significant performance gains; the code will be released to facilitate further exploration and application of FouriDown.

Spatial down-sampling techniques, such as strided convolution, Gaussian, and Nearest down-sampling, are essential in deep neural networks. In this study, we revisit the working mechanism of the spatial down-sampling family and analyze the biased effects caused by the static weighting strategy employed in previous approaches. To overcome this limitation, we propose a novel down-sampling paradigm in the Fourier domain, abbreviated as FouriDown, which unifies existing down-sampling techniques. Drawing inspiration from the signal sampling theorem, we parameterize the non-parametric static weighting down-sampling operator as a learnable and context-adaptive operator within a unified Fourier function. Specifically, we organize the corresponding frequency positions of the 2D plane in a physically-closed manner within a single channel dimension. We then perform point-wise channel shuffling based on an indicator that determines whether a channel's signal frequency bin is susceptible to aliasing, ensuring the consistency of the weighting parameter learning. FouriDown, as a generic operator, comprises four key components: 2D discrete Fourier transform, context shuffling rules, Fourier weighting-adaptively superposing rules, and 2D inverse Fourier transform. These components can be easily integrated into existing image restoration networks. To demonstrate the efficacy of FouriDown, we conduct extensive experiments on image de-blurring and low-light image enhancement. The results consistently show that FouriDown can provide significant performance improvements. We will make the code publicly available to facilitate further exploration and application of FouriDown.

A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
Weijie Tu Weijian Deng Tom Gedeon



Research question: This paper explores the robustness of CLIP models to variations of specific visual factors, as well as their effectiveness on safety-related objectives such as predictive uncertainty and anomalous-input detection.
Motivation: CLIP models generalize remarkably across many challenging distribution shifts, but their robustness to specific visual factor variations remains under-explored; reliable systems also require safety measures beyond classification accuracy, such as predictive uncertainty, where the effectiveness of CLIP models has not been fully studied.
Method: A comprehensive study of 83 CLIP models and 127 ImageNet classifiers, covering 10 visual factors (e.g., shape and pattern), 5 types of out-of-distribution data, and 8 natural and challenging test conditions, including texture, style, and perturbation shifts.
Results: CLIP models are not consistently better calibrated than other ImageNet models, contradicting existing findings; the analysis also underscores the importance of training-source design, showing its profound influence on the three key properties, and should help guide the development of more robust and reliable CLIP models.

Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization capabilities across multiple challenging distribution shifts. However, there is still much to be explored in terms of their robustness to the variations of specific visual factors. In real-world applications, reliable and safe systems must consider other safety measures beyond classification accuracy, such as predictive uncertainty. Yet, the effectiveness of CLIP models on such safety-related objectives is less-explored. Driven by the above, this work comprehensively investigates the safety measures of CLIP models, specifically focusing on three key properties: resilience to visual factor variations, calibrated uncertainty estimations, and the ability to detect anomalous inputs. To this end, we study $83$ CLIP models and $127$ ImageNet classifiers. They are diverse in architecture, (pre)training distribution, and training strategies. We consider $10$ visual factors (\emph{e.g.}, shape and pattern), $5$ types of out-of-distribution data, and $8$ natural and challenging test conditions with different shift types, such as texture, style, and perturbation shifts. Our study has unveiled several previously unknown insights into CLIP models. For instance, they are not consistently more calibrated than other ImageNet models, which contradicts existing findings. Additionally, our analysis underscores the significance of training source design by showcasing its profound influence on the three key properties. We believe our comprehensive study can shed light on and help guide the development of more robust and reliable CLIP models.

FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning
Dipam Goswami Yuyang Liu Bartłomiej Twardowski Joost van de Weijer



Research question: This paper addresses the challenges of exemplar-free class-incremental learning (CIL), where rehearsing data from previous tasks is prohibited and catastrophic forgetting ensues.
Motivation: With no rehearsal data available, exemplar-free CIL faces many challenges, including catastrophic forgetting; approaches that freeze the feature extractor after the first task and incrementally learn the classifier have recently drawn much attention.
Method: Explore prototypical networks for CIL, which generate new class prototypes with the frozen feature extractor and classify features by Euclidean distance; an analysis of class feature distributions shows that Euclidean-metric classification succeeds for jointly trained features, but when learning from non-stationary data the Euclidean metric is suboptimal and feature distributions are heterogeneous; the paper therefore revisits the anisotropic Mahalanobis distance for CIL, and empirically shows that modeling feature covariance relations outperforms earlier attempts that sample features from normal distributions and train a linear classifier.
Results: Unlike existing methods, the approach generalizes to many- and few-shot CIL settings as well as domain-incremental settings; interestingly, without updating the backbone network, it achieves state-of-the-art results on several standard continual learning benchmarks.

Exemplar-free class-incremental learning (CIL) poses several challenges since it prohibits the rehearsal of data from previous tasks and thus suffers from catastrophic forgetting. Recent approaches to incrementally learning the classifier by freezing the feature extractor after the first task have gained much attention. In this paper, we explore prototypical networks for CIL, which generate new class prototypes using the frozen feature extractor and classify the features based on the Euclidean distance to the prototypes. In an analysis of the feature distributions of classes, we show that classification based on Euclidean metrics is successful for jointly trained features. However, when learning from non-stationary data, we observe that the Euclidean metric is suboptimal and that feature distributions are heterogeneous. To address this challenge, we revisit the anisotropic Mahalanobis distance for CIL. In addition, we empirically show that modeling the feature covariance relations is better than previous attempts at sampling features from normal distributions and training a linear classifier. Unlike existing methods, our approach generalizes to both many- and few-shot CIL settings, as well as to domain-incremental settings. Interestingly, without updating the backbone network, our method obtains state-of-the-art results on several standard continual learning benchmarks. Code is available at https://github.com/dipamgoswami/FeCAM.
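The core classification rule here is nearest prototype under a per-class anisotropic Mahalanobis metric. A minimal NumPy sketch, where the ridge term `shrink` is an illustrative stabilizer rather than the paper's covariance treatment:

```python
import numpy as np

def mahalanobis_classify(feats, prototypes, covariances, shrink=1e-2):
    """Prototype classification with a per-class anisotropic Mahalanobis
    metric: each class keeps its own prototype and feature covariance
    (estimated with a frozen extractor); `shrink` keeps covariances
    invertible."""
    scores = []
    for mu, cov in zip(prototypes, covariances):
        prec = np.linalg.inv(cov + shrink * np.eye(cov.shape[0]))
        diff = feats - mu                               # (n, d)
        d2 = np.einsum("nd,de,ne->n", diff, prec, diff)  # squared distances
        scores.append(d2)
    return np.argmin(np.stack(scores, axis=1), axis=1)  # nearest class

# Toy usage: two classes with very different spreads, where plain
# Euclidean nearest-prototype classification would misbehave.
rng = np.random.default_rng(0)
c0 = rng.normal(0.0, 0.5, size=(100, 8))
c1 = rng.normal(1.0, 3.0, size=(100, 8))
protos = [c0.mean(0), c1.mean(0)]
covs = [np.cov(c0.T), np.cov(c1.T)]
preds = mahalanobis_classify(np.vstack([c0, c1]), protos, covs)
```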

Mixed Samples as Probes for Unsupervised Model Selection in Domain Adaptation
Dapeng Hu Jian Liang Jun Hao Liew Chuhui Xue Song Bai Xinchao Wang



Research question: How to accurately select unsupervised domain adaptation (UDA) models so as to improve model generalization on unlabeled target data.
Motivation: Selecting the best UDA model is challenging due to the absence of labeled target data and the presence of domain distribution shifts.
Method: Propose MixVal, an innovative model selection method that uses only unlabeled target data during inference; MixVal probes the target structure learned by each UDA model using mixed target samples carrying pseudo labels.
Results: MixVal achieves state-of-the-art performance across 11 UDA methods and 4 adaptation settings while maintaining exceptional stability in model selection.

Unsupervised domain adaptation (UDA) has been widely applied in improving model generalization on unlabeled target data. However, accurately selecting the best UDA model for the target domain is challenging due to the absence of labeled target data and domain distribution shifts. Traditional model selection approaches involve training extra models with source data to estimate the target validation risk. Recent studies propose practical methods that are based on measuring various properties of model predictions on target data. Although effective for some UDA models, these methods often lack stability and may lead to poor selections for other UDA models. In this paper, we present MixVal, an innovative model selection method that operates solely with unlabeled target data during inference. MixVal leverages mixed target samples with pseudo labels to directly probe the learned target structure by each UDA model. Specifically, MixVal employs two distinct types of probes: the intra-cluster mixed samples for evaluating neighborhood density and the inter-cluster mixed samples for investigating the classification boundary. With this comprehensive probing strategy, MixVal elegantly combines the strengths of two state-of-the-art model selection methods, Entropy and SND. We extensively evaluate MixVal on 11 UDA methods across 4 adaptation settings, including classification and segmentation tasks. Experimental results consistently demonstrate that MixVal achieves state-of-the-art performance and maintains exceptional stability in model selection. Code is available at \url{https://github.com/LHXXHB/MixVal}.
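The probing idea can be sketched compactly: mix pairs of unlabeled target samples, and score a candidate model by how consistently its prediction on the mixture follows the pseudo-label of the dominant component. The version below uses random pairs and a single mixing ratio, whereas the paper distinguishes intra-cluster and inter-cluster pairs:

```python
import torch

@torch.no_grad()
def mixval_score(model, x_target, lam=0.55):
    """Sketch of mixup-based probing for unsupervised model selection:
    check whether the model's prediction on a mixture of two target
    samples agrees with the pseudo-label of the dominant component.
    lam is an illustrative choice."""
    pseudo = model(x_target).argmax(dim=1)            # pseudo-labels
    perm = torch.randperm(x_target.size(0))
    x_mix = lam * x_target + (1 - lam) * x_target[perm]
    pred_mix = model(x_mix).argmax(dim=1)
    return (pred_mix == pseudo).float().mean().item()  # higher = better

# Toy usage: rank candidate checkpoints by this score on target data.
model = torch.nn.Linear(32, 10).eval()
score = mixval_score(model, torch.randn(64, 32))
```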

LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition
Haoxuan Qu Xiaofei Hui Yujun Cai Jun Liu



Research question: How to perform open-set object recognition accurately while reducing the reliance on spurious-discriminative features.
Motivation: Different pre-trained large models possess rich yet distinct implicit knowledge; collaborating these models can address the above challenge.
Method: Propose Large Model Collaboration (LMC), a framework that coordinates different off-the-shelf large models in a training-free manner, together with several novel designs for effectively extracting implicit knowledge from the large models.
Results: Extensive experiments demonstrate the effectiveness of the framework.

Open-set object recognition aims to identify if an object is from a class that has been encountered during training or not. To perform open-set object recognition accurately, a key challenge is how to reduce the reliance on spurious-discriminative features. In this paper, motivated by the observation that different large models pre-trained through different paradigms can possess very rich while distinct implicit knowledge, we propose a novel framework named Large Model Collaboration (LMC) to tackle the above challenge via collaborating different off-the-shelf large models in a training-free manner. Moreover, we also incorporate the proposed framework with several novel designs to effectively extract implicit knowledge from large models. Extensive experiments demonstrate the efficacy of our proposed framework. Code is available at \url{https://github.com/Harryqu123/LMC}.

CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation
Xiuzhe Wu Peng Dai Weipeng DENG Handi Chen Yang Wu Yan-Pei Cao Ying Shan XIAOJUAN QI



Research question: How to efficiently adapt Neural Radiance Fields (NeRFs) to real-world scene changes.
Motivation: Existing methods require extensive data capture and model retraining, which is both time-consuming and labor-intensive.
Method: Propose CL-NeRF, which consists of two key components: a lightweight expert adaptor for adapting to new changes and a conflict-aware knowledge distillation objective for memorizing unchanged parts.
Results: Experiments show that CL-NeRF efficiently synthesizes high-quality novel views of both changed and unchanged regions, reducing forgetting while adapting to changes and surpassing existing methods.

Existing methods for adapting Neural Radiance Fields (NeRFs) to scene changes require extensive data capture and model retraining, which is both time-consuming and labor-intensive. In this paper, we tackle the challenge of efficiently adapting NeRFs to real-world scene changes over time using a few new images while retaining the memory of unaltered areas, focusing on the continual learning aspect of NeRFs. To this end, we propose CL-NeRF, which consists of two key components: a lightweight expert adaptor for adapting to new changes and evolving scene representations and a conflict-aware knowledge distillation learning objective for memorizing unchanged parts. We also present a new benchmark for evaluating Continual Learning of NeRFs with comprehensive metrics. Our extensive experiments demonstrate that CL-NeRF can synthesize high-quality novel views of both changed and unchanged regions with high training efficiency, surpassing existing methods in terms of reducing forgetting and adapting to changes. Code and benchmark will be made available.

Leave No Stone Unturned: Mine Extra Knowledge for Imbalanced Facial Expression Recognition
Yuhang Zhang Yaqi Li lixiong Qin Xuannan Liu Weihong Deng



Research question: This paper addresses the severe imbalance in facial expression recognition (FER), where most collected data show happy or neutral expressions while instances of fear or disgust are scarce.
Motivation: Existing FER methods learn minor-class knowledge only from minor-class samples and perform poorly in terms of mean accuracy across all expression classes; the authors instead propose extracting extra minor-class knowledge from both major- and minor-class samples.
Method: Propose regularizing the model with re-balanced attention maps so that it extracts transformation-invariant information about the minor classes from all training samples; additionally, introduce re-balanced smooth labels that regulate the cross-entropy loss, guiding the model to attend more to the minor classes by exploiting the label distribution of the imbalanced training data.
Results: Extensive experiments on different datasets and backbones show that the two proposed modules jointly regularize the model and achieve state-of-the-art performance on the imbalanced FER task.

Facial expression data is characterized by a significant imbalance, with most collected data showing happy or neutral expressions and fewer instances of fear or disgust. This imbalance poses challenges to facial expression recognition (FER) models, hindering their ability to fully understand various human emotional states. Existing FER methods typically report overall accuracy on highly imbalanced test sets but exhibit low performance in terms of the mean accuracy across all expression classes. In this paper, our aim is to address the imbalanced FER problem. Existing methods primarily focus on learning knowledge of minor classes solely from minor-class samples. However, we propose a novel approach to extract extra knowledge related to the minor classes from both major and minor class samples. Our motivation stems from the belief that FER resembles a distribution learning task, wherein a sample may contain information about multiple classes. For instance, a sample from the major class surprise might also contain useful features of the minor class fear. Inspired by that, we propose a novel method that leverages re-balanced attention maps to regularize the model, enabling it to extract transformation invariant information about the minor classes from all training samples. Additionally, we introduce re-balanced smooth labels to regulate the cross-entropy loss, guiding the model to pay more attention to the minor classes by utilizing the extra information regarding the label distribution of the imbalanced training data. Extensive experiments on different datasets and backbones show that the two proposed modules work together to regularize the model and achieve state-of-the-art performance under the imbalanced FER task. Code is available at https://github.com/zyh-uaiaaaa.
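One of the two modules, re-balanced smooth labels, admits a compact sketch: distribute the label-smoothing mass in inverse proportion to class frequency so that rare expressions receive more of it. This is an illustrative reading of the idea, not the paper's exact formula:

```python
import torch

def rebalanced_smooth_labels(targets, class_counts, eps=0.1):
    """Spread the eps smoothing mass over classes in inverse proportion to
    their training frequency, so minor classes receive more of it than a
    uniform label smoother would give. eps is an illustrative default."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    inv = 1.0 / counts
    spread = eps * inv / inv.sum()               # per-class smoothing mass
    labels = spread.expand(len(targets), -1).clone()
    labels[torch.arange(len(targets)), targets] += 1.0 - eps
    return labels  # rows sum to 1, usable with a soft cross-entropy loss

# Toy usage: class 2 is rare, so it receives the largest share of eps.
y = rebalanced_smooth_labels(torch.tensor([0, 1]), class_counts=[500, 300, 20])
print(y)
```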

On the Adversarial Robustness of Out-of-distribution Generalization Models
Xin Zou Weiwei Liu



Research question: Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, owing to its promising experimental results in real-world applications.
Motivation: We find that existing OOD generalization methods are vulnerable to adversarial attacks, which motivates the study of OOD adversarial robustness.
Method: First present theoretical analyses of OOD adversarial robustness in two complementary settings; motivated by the theoretical results, design two algorithms to improve OOD adversarial robustness; finally, conduct experiments to validate the proposed algorithms.
Results: Experiments show that the proposed methods effectively improve OOD adversarial robustness.

Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. Interestingly, we find that existing OOD generalization methods are vulnerable to adversarial attacks. This motivates us to study OOD adversarial robustness. We first present theoretical analyses of OOD adversarial robustness in two different complementary settings. Motivated by the theoretical results, we design two algorithms to improve the OOD adversarial robustness. Finally, we conduct experiments to validate the effectiveness of our proposed algorithms.

Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
Chenyu You Weicheng Dai Yifei Min Fenglin Liu David A. Clifton S Kevin Zhou Lawrence Hamilton Staib James s Duncan



Research question: How to improve the quality of medical image segmentation, especially when labels are scarce.
Motivation: Contrastive learning is the dominant practice for improving visual representation quality, but in practice models may struggle to distinguish minority tail-class samples, leading to misclassification and model collapse.
Method: Propose ARCO, a semi-supervised contrastive learning framework with stratified group theory for medical image segmentation; in particular, build ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks with extremely limited labels.
Results: Validated on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, under different label settings, consistently outperforming state-of-the-art semi-supervised methods; augmenting CL frameworks with these sampling techniques also yields significant gains over previous methods.

For medical image segmentation, contrastive learning is the dominant practice to improve the quality of visual representations by contrasting semantically similar and dissimilar pairs of samples. This is enabled by the observation that without accessing ground truth labels, negative examples with truly dissimilar anatomical features, if sampled, can significantly improve the performance. In reality, however, these samples may come from similar anatomical features and the models may struggle to distinguish the minority tail-class samples, making the tail classes more prone to misclassification, both of which typically lead to model collapse. In this paper, we propose $\texttt{ARCO}$, a semi-supervised contrastive learning (CL) framework with stratified group theory for medical image segmentation. In particular, we first propose building $\texttt{ARCO}$ through the concept of variance-reduced estimation, and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks with extremely limited labels. Furthermore, we theoretically prove these sampling techniques are universal in variance reduction. Finally, we experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings, and our methods consistently outperform state-of-the-art semi-supervised methods. Additionally, we augment the CL frameworks with these sampling techniques and demonstrate significant gains over previous methods. We believe our work is an important step towards semi-supervised medical image segmentation by quantifying the limitation of current self-supervision objectives for accomplishing such challenging safety-critical tasks.

On the Constrained Time-Series Generation Problem
Andrea Coletta Sriram Gopalakrishnan Daniel Borrajo Svitlana Vyetrenko



Research question: How to efficiently generate constrained time series while ensuring their realism and the satisfaction of specific numerical constraints.
Motivation: Existing methods for constrained time-series generation require retraining or computationally expensive rejection sampling to accommodate new constraints, and can be impractical for complex constraints.
Method: Propose a novel set of methods for the constrained time-series generation problem, framing it with a constrained optimization framework and introducing a family of generative methods, including 'GuidedDiffTime', a guided diffusion model.
Results: Empirical evaluation on several financial and energy datasets shows the approach outperforms existing work both qualitatively and quantitatively; 'GuidedDiffTime' requires no retraining for new constraints, yielding a significant carbon footprint reduction of up to 92% relative to existing deep learning methods.

Synthetic time series are often used in practical applications to augment the historical time series dataset, amplify the occurrence of rare events and also create counterfactual scenarios. Distributional-similarity (which we refer to as realism) as well as the satisfaction of certain numerical constraints are common requirements for counterfactual time series generation. For instance, the US Federal Reserve publishes synthetic market stress scenarios given by the constrained time series for financial institutions to assess their performance in hypothetical recessions. Existing approaches for generating constrained time series usually penalize training loss to enforce constraints, and reject non-conforming samples. However, these approaches would require re-training if we change constraints, and rejection sampling can be computationally expensive, or impractical for complex constraints. In this paper, we propose a novel set of methods to tackle the constrained time series generation problem and provide efficient sampling while ensuring the realism of generated time series. In particular, we frame the problem using a constrained optimization framework and then we propose a set of generative methods including 'GuidedDiffTime', a guided diffusion model. We empirically evaluate our work on several datasets for financial and energy data, where incorporating constraints is critical. We show that our approaches outperform existing work both qualitatively and quantitatively, and that 'GuidedDiffTime' does not require re-training for new constraints, resulting in a significant carbon footprint reduction, up to 92% w.r.t. existing deep learning methods.

Weighted ROC Curve in Cost Space: Extending AUC to Cost-Sensitive Learning
Huiyang Shao Qianqian Xu Zhiyong Yang Peisong Wen Gao Peifeng Qingming Huang



Research question: This paper tackles flexible cost requirements for long-tailed datasets, which call for a learning framework that is (a) cost-sensitive and (b) robust to class-distribution shift.
Motivation: Misclassification cost and the area under the ROC curve (AUC) are the popular metrics for (a) and (b) respectively, but, limited by their formulations, models trained with AUC cannot be applied to cost-sensitive decision problems, while models trained with fixed costs are sensitive to class-distribution shift.
Method: Propose a new setting that treats costs like a dataset in order to handle arbitrarily unknown cost distributions, together with a novel weighted version of AUC that integrates the cost distribution into its computation through decision thresholds.
Results: Experiments show that the proposed algorithm outperforms existing cost-sensitive learning methods and two-stage AUC decision approaches.

In this paper, we aim to tackle flexible cost requirements for long-tail datasets, where we need to construct a (a) cost-sensitive and (b) class-distribution robust learning framework. The misclassification cost and the area under the ROC curve (AUC) are popular metrics for (a) and (b), respectively. However, limited by their formulations, models trained with AUC cannot be applied to cost-sensitive decision problems, and models trained with fixed costs are sensitive to the class distribution shift. To address this issue, we present a new setting where costs are treated like a dataset to deal with arbitrarily unknown cost distributions. Moreover, we propose a novel weighted version of AUC where the cost distribution can be integrated into its calculation through decision thresholds. To formulate this setting, we propose a novel bilevel paradigm to bridge weighted AUC (WAUC) and cost. The inner-level problem approximates the optimal threshold from sampling costs, and the outer-level problem minimizes the WAUC loss over the optimal threshold distribution. To optimize this bilevel paradigm, we develop a stochastic optimization algorithm (SACCL). Finally, experiment results show that our algorithm performs better than existing cost-sensitive learning methods and two-stage AUC decision approaches.

Theoretically Guaranteed Bidirectional Data Rectification for Robust Sequential Recommendation
yatong sun Bin Wang Zhu Sun Xiaochun Yang Yan Wang



Research question: During the training of sequential recommender systems, users can be induced to click on items inconsistent with their true preferences, yielding unreliable input-target pairs.
Motivation: Current approaches cannot handle unreliable inputs and targets simultaneously; most methods address only one of the two.
Method: Propose a model-agnostic Bidirectional Data Rectification (BirDRec) framework that can be flexibly combined with most existing sequential recommenders for robust training against unreliable data.
Results: Extensive experiments on four real-world datasets verify the generality, effectiveness, and efficiency of BirDRec.

Sequential recommender systems (SRSs) are typically trained to predict the next item as the target given its preceding (and succeeding) items as the input. Such a paradigm assumes that every input-target pair is reliable for training. However, users can be induced to click on items that are inconsistent with their true preferences, resulting in unreliable instances, i.e., mismatched input-target pairs. Current studies on mitigating this issue suffer from two limitations: (i) they discriminate instance reliability according to models trained with unreliable data, yet without theoretical guarantees that such a seemingly contradictory solution can be effective; and (ii) most methods can only tackle either unreliable input or targets but fail to handle both simultaneously. To fill the gap, we theoretically unveil the relationship between SRS predictions and instance reliability, whereby two error-bounded strategies are proposed to rectify unreliable targets and input, respectively. On this basis, we devise a model-agnostic Bidirectional Data Rectification (BirDRec) framework, which can be flexibly implemented with most existing SRSs for robust training against unreliable data. Additionally, a rectification sampling strategy is devised and a self-ensemble mechanism is adopted to reduce the (time and space) complexity of BirDRec. Extensive experiments on four real-world datasets verify the generality, effectiveness, and efficiency of our proposed BirDRec.

What Truly Matters in Trajectory Prediction for Autonomous Driving?
Tran Phong Haoran Wu Cunjun Yu Panpan Cai Sifa Zheng David Hsu



Research question: Trajectory prediction plays a vital role in autonomous driving systems, yet prediction accuracy on fixed datasets diverges markedly from downstream driving performance when the predictor is used for vehicle control, a discrepancy known as the dynamics gap.
Motivation: The prediction algorithm influences the ego vehicle's behavior, which in turn influences the behaviors of nearby vehicles; this interaction produces predictor-specific dynamics that directly affect prediction results, but in fixed datasets, where other vehicles' responses are predetermined, the interaction effect is lost, creating a significant dynamics gap.
Method: This paper studies the overlooked significance of this dynamics gap and examines several other factors that contribute to the disparity between prediction performance and driving performance.
Results: The trade-off between a predictor's computational efficiency and its prediction accuracy determines real-world driving performance; overall, an interactive, task-driven evaluation protocol for trajectory prediction is crucial to capture its effectiveness for autonomous driving.

Trajectory prediction plays a vital role in the performance of autonomous driving systems, and prediction accuracy, such as average displacement error (ADE) or final displacement error (FDE), is widely used as a performance metric. However, a significant disparity exists between the accuracy of predictors on fixed datasets and driving performance when the predictors are used downstream for vehicle control, because of a dynamics gap. In the real world, the prediction algorithm influences the behavior of the ego vehicle, which, in turn, influences the behaviors of other vehicles nearby. This interaction results in predictor-specific dynamics that directly impacts prediction results. In fixed datasets, since other vehicles' responses are predetermined, this interaction effect is lost, leading to a significant dynamics gap. This paper studies the overlooked significance of this dynamics gap. We also examine several other factors contributing to the disparity between prediction performance and driving performance. The findings highlight the trade-off between the predictor's computational efficiency and prediction accuracy in determining real-world driving performance. In summary, an interactive, task-driven evaluation protocol for trajectory prediction is crucial to capture its effectiveness for autonomous driving. Source code along with experimental settings is available online (https://whatmatters23.github.io/).

Semi-Supervised Contrastive Learning for Deep Regression with Ordinal Rankings from Spectral Seriation
Weihang Dai Yao DU Hanru Bai Kwang-Ting Cheng Xiaomeng Li



Research question: How to apply contrastive learning to deep regression, particularly by exploiting unlabeled data in a semi-supervised setting.
Motivation: Existing contrastive regression methods are limited to labeled data, whereas classification tasks can use unlabeled data for contrastive pretraining.
Method: Extend contrastive regression methods to use unlabeled data in a semi-supervised setting, reducing reliance on manual annotation; ordinal relationships among unlabeled samples are recovered (via spectral seriation) and used for contrastive learning, so that more data contribute to feature representation learning.
Results: Experiments show the method surpasses existing state-of-the-art semi-supervised deep regression methods; this is the first work to explore contrastive learning with unlabeled data for regression.

Contrastive learning methods can be applied to deep regression by enforcing label distance relationships in feature space. However, these methods are limited to labeled data only, unlike in classification, where unlabeled data can be used for contrastive pretraining. In this work, we extend contrastive regression methods to allow unlabeled data to be used in a semi-supervised setting, thereby reducing the reliance on manual annotations. We observe that the feature similarity matrix between unlabeled samples still reflects inter-sample relationships, and that an accurate ordinal relationship can be recovered through spectral seriation algorithms if the level of error is within certain bounds. By using the recovered ordinal relationship for contrastive learning on unlabeled samples, we can allow more data to be used for feature representation learning, thereby achieving more robust results. The ordinal rankings can also be used to supervise predictions on unlabeled samples, which can serve as an additional training signal. We provide theoretical guarantees and empirical support through experiments on different datasets, demonstrating that our method can surpass existing state-of-the-art semi-supervised deep regression methods. To the best of our knowledge, this work is the first to explore using unlabeled data to perform contrastive learning for regression.
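The spectral seriation step the method relies on is classical: build a similarity matrix over unlabeled samples and sort them by the Fiedler vector of its graph Laplacian. A minimal sketch (orderings recovered this way are only defined up to reversal):

```python
import numpy as np

def spectral_seriation(similarity):
    """Recover an ordering of samples from a pairwise similarity matrix by
    sorting the Fiedler vector (eigenvector of the graph Laplacian with the
    second-smallest eigenvalue)."""
    S = (similarity + similarity.T) / 2      # symmetrize
    L = np.diag(S.sum(axis=1)) - S           # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalues
    fiedler = eigvecs[:, 1]                  # second-smallest eigenvalue
    return np.argsort(fiedler)               # recovered ordering

# Toy check: samples generated along a 1-D trend are put back in order
# (possibly reversed, since seriation is sign-ambiguous).
t = np.random.default_rng(0).permutation(20).astype(float)
S = np.exp(-np.abs(t[:, None] - t[None, :]))
order = spectral_seriation(S)
print(t[order])  # monotone, up to reversal
```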

Adaptive Uncertainty Estimation via High-Dimensional Testing on Latent Representations
Tsai Hor Chan Kin Wai Lau Jiajun Shen Guosheng Yin Lequan Yu



Research question: Existing uncertainty estimation methods rely on low-dimensional distributional assumptions and handle high-dimensional latent features poorly; they also focus on uncertainty over discrete classification probabilities, which limits their generalizability.
Motivation: To overcome these limitations on high-dimensional features and unseen data, propose a new framework that performs uncertainty estimation via data-adaptive high-dimensional hypothesis testing.
Method: The method operates directly on latent representations and thus requires no retraining of the feature encoder under a modified objective; the test statistic relaxes the feature-distribution assumptions to high dimensionality and is more discriminative to uncertainty in the latent representations.
Results: Experiments show that encoding features with Bayesian neural networks improves testing performance and yields more accurate uncertainty estimation; the method also performs satisfactorily on OOD detection when OOD data are unseen during training.

Uncertainty estimation aims to evaluate the confidence of a trained deep neural network. However, existing uncertainty estimation approaches rely on low-dimensional distributional assumptions and thus suffer from the high dimensionality of latent features. Existing approaches tend to focus on uncertainty on discrete classification probabilities, which leads to poor generalizability to uncertainty estimation for other tasks. Moreover, most of the literature requires seeing the out-of-distribution (OOD) data in the training for better estimation of uncertainty, which limits the uncertainty estimation performance in practice because the OOD data are typically unseen. To overcome these limitations, we propose a new framework using data-adaptive high-dimensional hypothesis testing for uncertainty estimation, which leverages the statistical properties of the feature representations. Our method directly operates on latent representations and thus does not require retraining the feature encoder under a modified objective. The test statistic relaxes the feature distribution assumptions to high dimensionality, and it is more discriminative to uncertainties in the latent representations. We demonstrate that encoding features with Bayesian neural networks can enhance testing performance and lead to more accurate uncertainty estimation. We further introduce a family-wise testing procedure to determine the optimal threshold of OOD detection, which minimizes the false discovery rate (FDR). Extensive experiments validate the satisfactory performance of our framework on uncertainty estimation and task-specific prediction over a variety of competitors. The experiments on the OOD detection task also show satisfactory performance of our method when the OOD data are unseen in the training. Codes are available at https://github.com/HKU-MedAI/bnn_uncertainty.

On student-teacher deviations in distillation: does it pay to disobey?
Vaishnavh Nagarajan Aditya Krishna Menon Srinadh Bhojanapalli Hossein Mobahi Sanjiv Kumar



Research question: In knowledge distillation (KD), a "student" network trained to mimic the soft probabilities of a trained "teacher" may not only deviate significantly from the teacher's probabilities but also surpass the teacher's performance; this study aims to reconcile these seemingly contradictory observations.
Motivation: Through experiments and theory, characterize the nature of the student-teacher probability deviations and explain how they can co-occur with better generalization.
Method: Experiments on image and language data reveal that the student systematically exaggerates the teacher's confidence levels; in simple settings, another form of exaggeration is established theoretically and empirically: KD exaggerates the implicit bias of gradient descent toward faster convergence along the top eigendirections of the data; tying the two observations together shows that the exaggerated bias can simultaneously yield (a) exaggerated confidence and (b) improved student generalization, resolving the apparent paradox.
Results: The analysis brings existing theory and practice closer by accounting for the role of gradient descent in KD and by demonstrating the exaggerated-bias effect in both theoretical and empirical settings.

Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. Finally, we tie these two observations together: we demonstrate that the exaggerated bias of KD can simultaneously result in both (a) the exaggeration of confidence and (b) the improved generalization of the student, thus offering a resolution to the apparent paradox. Our analysis brings existing theory and practice closer by considering the role of gradient descent in KD and by demonstrating the exaggerated bias effect in both theoretical and empirical settings.

On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training
Jieyu Zhang Bohan Wang Zhengyu Hu Pang Wei Koh Alexander Ratner



Research question: This work studies how the trade-off between the intra-class diversity (number of samples per class) and the inter-class diversity (number of classes) of a supervised pre-training dataset affects downstream tasks.
Motivation: Pre-training datasets are critical for building state-of-the-art machine learning models, which motivates rigorous study of their impact on downstream tasks.
Method: Empirically, with the pre-training dataset size fixed, the best downstream performance arises from a balance between intra- and inter-class diversity; theoretical analysis further shows that downstream performance depends monotonically on both types of diversity.
Results: The theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, motivating an application that predicts the optimal number of pre-training classes; with ImageNet as the pre-training dataset, this application improves downstream performance by around 2 points.

Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application of predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application by an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset.

Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Alon Albalak Colin Raffel William Yang Wang



Research question: How to learn a generalizable model without overfitting to a small number of labeled datapoints, a challenge central to few-shot learning in real-world applications.
Motivation: Although few-shot learning is valuable in many practical applications, learning a generalizable model without overfitting to the few labeled datapoints is difficult.
Method: This paper focuses on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in order to improve generalization; two algorithms, EXP3-FLAD and UCB1-FLAD, are proposed, and extensive experiments show that combining exploration and exploitation is crucial.
Results: Through extensive experimentation, the two methods outperform all existing FLAD methods by 4% and lead to the first 3-billion-parameter language models that outperform the 175-billion-parameter GPT-3; overall, the work suggests that better and more efficient FLAD mixing strategies may offer a viable path toward substantially improving few-shot generalization.

Few-shot learning is valuable in many real-world applications, but learning a generalizable model without overfitting to the few labeled datapoints is challenging. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Previous works have proposed automated methods for mixing auxiliary and target data, but these methods typically scale linearly (or worse) with the number of auxiliary datasets, limiting their practicality. In this work we relate FLAD to the explore-exploit dilemma that is central to the multi-armed bandit setting and derive algorithms whose computational complexity is independent of the number of auxiliary datasets, allowing us to scale to 100x more auxiliary datasets than prior methods. We propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with prior FLAD methods that either explore or exploit, finding that the combination of exploration and exploitation is crucial. Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3. Overall, our work suggests that the discovery of better, more efficient mixing strategies for FLAD may provide a viable path towards substantially improving generalization in few-shot learning.
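EXP3-FLAD builds on the textbook EXP3 bandit, with auxiliary datasets as arms. The sketch below shows that building block, with a synthetic reward standing in for whatever training signal (e.g., gradient alignment with the target task) the full algorithm actually observes:

```python
import numpy as np

class EXP3:
    """Textbook EXP3 bandit: each arm is an auxiliary dataset, and the
    reward is observed only for the arm that was sampled. gamma is the
    exploration rate; rewards are assumed to lie in [0, 1]."""
    def __init__(self, n_arms, gamma=0.1):
        self.w = np.ones(n_arms)
        self.gamma = gamma

    def probs(self):
        p = (1 - self.gamma) * self.w / self.w.sum()
        return p + self.gamma / len(self.w)

    def sample(self, rng):
        return rng.choice(len(self.w), p=self.probs())

    def update(self, arm, reward):
        est = reward / self.probs()[arm]              # importance-weighted
        self.w[arm] *= np.exp(self.gamma * est / len(self.w))

# Toy run: arm 2 pays off most and ends up sampled most often.
rng = np.random.default_rng(0)
bandit = EXP3(n_arms=4)
for _ in range(500):
    a = bandit.sample(rng)
    bandit.update(a, rng.random() * (0.2 if a != 2 else 1.0))
print(bandit.probs())
```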

Counterfactual Generation with Identifiability Guarantees
Hanqi Yan Lingjing Kong Lin Gui Yuejie Chi Eric Xing Yulan He Kun Zhang



Research question: This paper addresses how to identify and handle the varying dependence between content and style variables in the counterfactual generation task, in the absence of paired data and label information.
Motivation: Existing disentanglement methods rely on oversimplified assumptions, such as independent content and style variables, which may not hold for complex data distributions; in particular, when samples come from multiple domains, the dependence between content and style can vary significantly.
Method: Propose MATTE, a domain-adaptive counterfactual generation model, and provide identification guarantees for such latent-variable models by leveraging the relative sparsity of the influences from different latent variables.
Results: The theoretically grounded framework achieves state-of-the-art performance on unsupervised style transfer tasks across four large-scale datasets, using neither paired data nor style labels.

Counterfactual generation lies at the core of various machine learning tasks, including image translation and controllable text generation. This generation process usually requires the identification of the disentangled latent representations, such as content and style, that underlie the observed data. However, it becomes more challenging when faced with a scarcity of paired data and labelling information. Existing disentangled methods crucially rely on oversimplified assumptions, such as assuming independent content and style variables, to identify the latent variables, even though such assumptions may not hold for complex data distributions. For instance, food reviews tend to involve words like “tasty”, whereas movie reviews commonly contain words such as “thrilling” for the same positive sentiment. This problem is exacerbated when data are sampled from multiple domains since the dependence between content and style may vary significantly over domains. In this work, we tackle the domain-varying dependence between the content and the style variables inherent in the counterfactual generation task. We provide identification guarantees for such latent-variable models by leveraging the relative sparsity of the influences from different latent variables. Our theoretical insights enable the development of a doMain AdapTive counTerfactual gEneration model, called MATTE. Our theoretically grounded framework achieves state-of-the-art performance in unsupervised style transfer tasks, where neither paired data nor style labels are utilized, across four large-scale datasets.

CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
Yunyao Mao Jiajun Deng Wengang Zhou Li Li Yao Fang Houqiang Li



Research question: Zero-shot human-object interaction (HOI) detection aims to identify both seen and unseen HOI categories, yet top-performing methods handle positional distribution discrepancies poorly.
Motivation: Address the positional distribution discrepancy between seen and unseen categories, and prevent the model from overfitting to the joint positional distribution of seen human-object pairs.
Method: Propose CLIP4HOI: humans and objects are identified independently, and all feasible human-object pairs are processed by a Human-Object interactor for pairwise proposal generation; to avoid data-sensitive knowledge distillation, the CLIP model is carefully adapted into a fine-grained HOI classifier for proposal discrimination.
Results: Experiments show that CLIP4HOI outperforms previous methods on both rare and unseen categories and sets a series of state-of-the-art records under a variety of zero-shot settings.

Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories. A strong zero-shot HOI detector is supposed to be not only capable of discriminating novel interactions but also robust to positional distribution discrepancy between seen and unseen categories when locating human-object pairs. However, top-performing zero-shot HOI detectors rely on seen and predefined unseen categories to distill knowledge from CLIP and jointly locate human-object pairs without considering the potential positional distribution discrepancy, leading to impaired transferability. In this paper, we introduce CLIP4HOI, a novel framework for zero-shot HOI detection. CLIP4HOI is developed on the vision-language model CLIP and ameliorates the above issues in the following two aspects. First, to avoid the model from overfitting to the joint positional distribution of seen human-object pairs, we seek to tackle the problem of zero-shot HOI detection in a disentangled two-stage paradigm. To be specific, humans and objects are independently identified and all feasible human-object pairs are processed by Human-Object interactor for pairwise proposal generation. Second, to facilitate better transferability, the CLIP model is elaborately adapted into a fine-grained HOI classifier for proposal discrimination, avoiding data-sensitive knowledge distillation. Finally, experiments on prevalent benchmarks show that our CLIP4HOI outperforms previous approaches on both rare and unseen categories, and sets a series of state-of-the-art records under a variety of zero-shot settings.

Does Graph Distillation See Like Vision Dataset Counterpart?
Beining Yang Kai Wang Qingyun Sun Cheng Ji Xingcheng Fu Hao Tang Yang You Jianxin Li



Research question: Training on large-scale graphs achieves remarkable results in graph representation learning, but its cost and storage have attracted increasing concern.
Motivation: Existing graph condensation methods primarily focus on optimizing the feature matrices of condensed graphs while overlooking the structure information of the original graphs.
Method: Propose a Structure-broadcasting Graph Dataset Distillation (SGDD) scheme that broadcasts the original structure information into the generation of the synthetic graph, explicitly preventing the original structure information from being overlooked.
Results: Validated on 9 datasets with state-of-the-art results on all of them; for example, on the YelpChi dataset, the method retains 98.6% of the test accuracy of training on the original graph while shrinking the graph scale 1,000-fold; empirically, LED shifts are reduced by 17.6% to 31.4% across the 9 datasets; extensive experiments and analysis verify the effectiveness and necessity of the proposed designs.

Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have attracted increasing concerns. Existing graph condensation methods primarily focus on optimizing the feature matrices of condensed graphs while overlooking the impact of the structure information from the original graphs. To investigate the impact of the structure information, we conduct analysis from the spectral domain and empirically identify substantial Laplacian Energy Distribution (LED) shifts in previous works. Such shifts lead to poor performance in cross-architecture generalization and specific tasks, including anomaly detection and link prediction. In this paper, we propose a novel Structure-broadcasting Graph Dataset Distillation (\textbf{SGDD}) scheme for broadcasting the original structure information to the generation of the synthetic one, which explicitly prevents overlooking the original structure information. Theoretically, the synthetic graphs generated by SGDD are expected to have smaller LED shifts than previous works, leading to superior performance in both cross-architecture settings and specific tasks. We validate the proposed SGDD across 9 datasets and achieve state-of-the-art results on all of them: for example, on the YelpChi dataset, our approach maintains 98.6\% of the test accuracy of training on the original graph dataset with a 1,000-fold saving on the scale of the graph. Moreover, we empirically observe 17.6\% $\sim$ 31.4\% reductions in LED shift across the 9 datasets. Extensive experiments and analysis verify the effectiveness and necessity of the proposed designs. The code will be made public.

Environment-Aware Dynamic Graph Learning for Out-of-Distribution Generalization
Haonan Yuan Qingyun Sun Xingcheng Fu Ziwei Zhang Cheng Ji Hao Peng Jianxin Li



Research question: The distribution-shift problem of dynamic graph neural networks in real-world scenarios, i.e., how to generalize to unseen environments.
Motivation: Existing work cannot handle distribution shifts on dynamic graphs; since the generation of dynamic graphs is heavily influenced by latent environments, investigating their impact on out-of-distribution (OOD) generalization is critical.
Method: Propose EAGLE, a novel Environment-Aware dynamic Graph LEarning framework for OOD generalization that models complex coupled environments and exploits spatio-temporal invariant patterns; specifically, design an environment-aware EA-DGNN that models environments via multi-channel environment disentangling, propose an environment instantiation mechanism for environment diversification under inferred distributions, and perform OOD prediction via an invariant pattern recognition mechanism together with node-wise fine-grained causal interventions on mixtures of instantiated environment samples.
Results: Experiments on real-world and synthetic dynamic graph datasets show the method outperforms state-of-the-art baselines under distribution shifts; to the authors' knowledge, this is the first study of OOD generalization on dynamic graphs from an environment-learning perspective.

Dynamic graph neural networks (DGNNs) are increasingly pervasive in exploiting spatio-temporal patterns on dynamic graphs. However, existing works fail to generalize under distribution shifts, which are common in real-world scenarios. As the generation of dynamic graphs is heavily influenced by latent environments, investigating their impacts on the out-of-distribution (OOD) generalization is critical. However, it remains unexplored with the following two major challenges: (1) How to properly model and infer the complex environments on dynamic graphs with distribution shifts? (2) How to discover invariant patterns given inferred spatio-temporal environments? To solve these challenges, we propose a novel Environment-Aware dynamic Graph LEarning (EAGLE) framework for OOD generalization by modeling complex coupled environments and exploiting spatio-temporal invariant patterns. Specifically, we first design the environment-aware EA-DGNN to model environments by multi-channel environments disentangling. Then, we propose an environment instantiation mechanism for environment diversification with inferred distributions. Finally, we discriminate spatio-temporal invariant patterns for out-of-distribution prediction by the invariant pattern recognition mechanism and perform fine-grained causal interventions node-wisely with a mixture of instantiated environment samples. Experiments on real-world and synthetic dynamic graph datasets demonstrate the superiority of our method against state-of-the-art baselines under distribution shifts. To the best of our knowledge, we are the first to study OOD generalization on dynamic graphs from the environment learning perspective.

MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data
Tianyu Liu Yuge Wang Zhitao Ying Hongyu Zhao



Research question: How to discover genes with similar functions across diverse biomedical contexts, which data heterogeneity makes a major challenge for gene representation learning.
Motivation: To solve this problem, we introduce a novel model called Multimodal Similarity Learning Graph Neural Network, which combines multimodal machine learning and deep graph neural networks to learn gene representations from single-cell sequencing and spatial transcriptomic data.
Method: Leveraging 82 training datasets from 10 tissues, three sequencing techniques, and three species, we construct informative graph structures for model training and gene representation generation, incorporating weighted similarity learning and contrastive learning as regularization to learn cross-data gene-gene relationships.
Results: Comprehensive benchmarking shows the model effectively captures gene function similarity across modalities, outperforming state-of-the-art gene representation learning methods by up to 100.4%; combined with bioinformatics tools, the gene representations uncover pathway enrichment, regulatory causal networks, and the functions of disease-associated genes, efficiently yielding unified gene representations for analyzing gene functions, tissue functions, diseases, and species evolution.

Discovering genes with similar functions across diverse biomedical contexts poses a significant challenge in gene representation learning due to data heterogeneity. In this study, we resolve this problem by introducing a novel model called Multimodal Similarity Learning Graph Neural Network, which combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data. Leveraging 82 training datasets from 10 tissues, three sequencing techniques, and three species, we create informative graph structures for model training and gene representations generation, while incorporating regularization with weighted similarity learning and contrastive learning to learn cross-data gene-gene relationships. This novel design ensures that we can offer gene representations containing functional similarity across different contexts in a joint space. Comprehensive benchmarking analysis shows our model's capacity to effectively capture gene function similarity across multiple modalities, outperforming state-of-the-art methods in gene representation learning by up to $\textbf{100.4}$%. Moreover, we employ bioinformatics tools in conjunction with gene representations to uncover pathway enrichment, regulation causal networks, and functions of disease-associated genes. Therefore, our model efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution.

Diversifying Spatial-Temporal Perception for Video Domain Generalization
Kun-Yu Lin Jia-Run Du Yipeng Gao Jiaming Zhou Wei-Shi Zheng



Research question: Video domain generalization aims to learn generalizable video classification models for unseen target domains by training in a source domain.
Motivation: A key challenge of video domain generalization is to avoid over-reliance on domain-specific cues extracted from the source domain when recognizing target videos; we therefore propose perceiving diverse spatial-temporal cues in videos, aiming to discover potential domain-invariant cues in addition to domain-specific ones.
Method: We contribute a novel model named Spatial-Temporal Diversification Network (STDN), which improves diversity along both the space and time dimensions of video data: STDN first discovers various types of spatial cues within individual frames via spatial grouping, then explicitly models spatial-temporal dependencies between video contents at multiple space-time scales via spatial-temporal relation modeling.
Results: Extensive experiments on three benchmarks of different types demonstrate the effectiveness and versatility of the approach.

Video domain generalization aims to learn generalizable video classification models for unseen target domains by training in a source domain. A critical challenge of video domain generalization is to defend against the heavy reliance on domain-specific cues extracted from the source domain when recognizing target videos. To this end, we propose to perceive diverse spatial-temporal cues in videos, aiming to discover potential domain-invariant cues in addition to domain-specific cues. We contribute a novel model named Spatial-Temporal Diversification Network (STDN), which improves the diversity from both space and time dimensions of video data. First, our STDN proposes to discover various types of spatial cues within individual frames by spatial grouping. Then, our STDN proposes to explicitly model spatial-temporal dependencies between video contents at multiple space-time scales by spatial-temporal relation modeling. Extensive experiments on three benchmarks of different types demonstrate the effectiveness and versatility of our approach.

Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data
Jang-Hyun Kim Sangdoo Yun Hyun Oh Song



Research question: How to diagnose and clean complex problems, such as label errors, under-representation, and outliers, in large-scale real-world datasets.
Motivation: Because large-scale real-world datasets suffer from such complex issues, identifying and resolving them is a key step toward building robust machine learning systems.
Method: We propose a unified approach to identifying problematic data that exploits a largely overlooked source of information: the relational structure of data in the feature embedding space. We present scalable and effective algorithms based on the relational graph structure of the data to detect label errors and outliers, and introduce a visualization tool that provides contextual information about a data point in the feature embedding space, serving as an effective tool for interactively diagnosing data.
Results: We evaluate the approach on large-scale tasks in the image, speech, and language domains, including ImageNet, ESC-50, and SST2. It achieves state-of-the-art detection performance on all tasks considered and proves effective for debugging large real-world datasets across domains.

Diagnosing and cleaning data is a crucial step for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is challenging due to the presence of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying the problematic data by utilizing a largely ignored source of information: a relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of data. We further introduce a visualization tool that provides contextual information of a data point in the feature-embedded space, serving as an effective tool for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performances of our approach on the large-scale image, speech, and language domain tasks, including ImageNet, ESC-50, and SST2. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains. We release codes at https://github.com/snu-mllab/Neural-Relation-Graph.
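
The relation-based scoring can be pictured with a small sketch. The following is a minimal, hypothetical neighbour-agreement score in the feature-embedded space, not the paper's actual relation-graph algorithm; `feats` and `labels` are assumed NumPy arrays of embeddings and assigned labels.

```python
import numpy as np

def label_error_scores(feats, labels, k=10):
    """Score each sample by how strongly its k nearest neighbours in the
    feature-embedded space disagree with its assigned label. A minimal
    illustration of relation-based label-error detection; the paper's
    relation-graph scoring differs in its exact form."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                      # cosine similarity between all pairs
    np.fill_diagonal(sim, -np.inf)     # exclude self-matches
    scores = np.empty(len(labels))
    for i in range(len(labels)):
        nn = np.argsort(sim[i])[-k:]             # k most similar samples
        disagree = labels[nn] != labels[i]       # neighbours with another label
        w = np.clip(sim[i, nn], 0, None)         # similarity-weighted votes
        scores[i] = (w * disagree).sum() / (w.sum() + 1e-12)
    return scores  # higher = more likely mislabeled
```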

Data-Centric Learning from Unlabeled Graphs with Diffusion Model
Gang Liu Eric Inae Tong Zhao Jiaxin Xu Tengfei Luo Meng Jiang



Research question: How to exploit large amounts of unlabeled graph data for property prediction tasks.
Motivation: While each property prediction task offers only a few labeled examples, large quantities of unlabeled graphs have been collected from various sources. The conventional approach trains a model on self-supervised tasks and then fine-tunes it on the prediction tasks, but the knowledge from the self-supervised tasks may be misaligned or even conflict with what the prediction tasks need.
Method: This paper proposes to extract the latent knowledge in the large set of unlabeled graphs as a set of useful data points to augment each property prediction model. We use a diffusion model to fully exploit the unlabeled graphs and design two new objectives that guide the model's denoising process with each task's labeled data to generate task-specific graph examples and their labels.
Results: Experiments show that, compared with conventional self-supervised learning, our data-centric approach delivers significant improvements over fifteen existing methods on fifteen tasks. The gains from unlabeled data are evident, as the generated labeled examples differ from those produced by self-supervised learning.

Graph property prediction tasks are important and numerous. While each task offers only a small number of labeled examples, unlabeled graphs have been collected from various sources and at a large scale. A conventional approach is to train a model on the unlabeled graphs with self-supervised tasks and then fine-tune it on the prediction tasks. However, the knowledge learned from the self-supervised tasks may be misaligned with, or even conflict with, what the prediction tasks need. In this paper, we propose to extract the knowledge underlying the large set of unlabeled graphs as a specific set of useful data points to augment each property prediction model. We use a diffusion model to fully utilize the unlabeled graphs and design two new objectives to guide the model's denoising process with each task's labeled data to generate task-specific graph examples and their labels. Experiments demonstrate that our data-centric approach performs significantly better than fifteen existing methods on fifteen tasks. The performance improvement brought by unlabeled data is evident, as the generated labeled examples differ from those produced by self-supervised learning.

Label-efficient Segmentation via Affinity Propagation
Wentong Li Yuqian Yuan Song Wang Wenyu Liu Dongqi Tang Jian liu Jianke Zhu Lei Zhang



Research question: How to reduce the cost of the laborious pixel-wise labeling process while performing effective weakly supervised segmentation.
Motivation: Existing methods mainly use local appearance kernels to model neighboring pairwise potentials, which fails to capture long-range dependencies and ignores the topology of objects.
Method: We formulate affinity modeling as an affinity propagation process and propose local and global pairwise affinity terms to generate accurate soft pseudo labels, together with an efficient algorithm that significantly reduces the computational cost.
Results: Experiments on three typical weakly supervised segmentation tasks demonstrate the superior performance of the approach.

Weakly-supervised segmentation with label-efficient sparse annotations has attracted increasing research attention to reduce the cost of the laborious pixel-wise labeling process, while pairwise affinity modeling techniques play an essential role in this task. Most of the existing approaches focus on using the local appearance kernel to model the neighboring pairwise potentials. However, such a local operation fails to capture the long-range dependencies and ignores the topology of objects. In this work, we formulate affinity modeling as an affinity propagation process, and propose local and global pairwise affinity terms to generate accurate soft pseudo labels. An efficient algorithm is also developed to significantly reduce the computational cost. The proposed approach can be conveniently plugged into existing segmentation networks. Experiments on three typical label-efficient segmentation tasks, i.e. box-supervised instance segmentation, point/scribble-supervised semantic segmentation and CLIP-guided semantic segmentation, demonstrate the superior performance of the proposed approach.
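
As a rough illustration of the affinity propagation idea, the sketch below spreads sparse seed annotations through a row-normalised affinity matrix; the paper's local/global affinity terms and accelerated algorithm are not reproduced here.

```python
import numpy as np

def propagate_soft_labels(affinity, seed_labels, alpha=0.9, iters=50):
    """Spread sparse seed labels over pixels via a row-normalised affinity
    matrix, a generic affinity-propagation sketch in the spirit of the
    method described above.
    affinity: (N, N) nonnegative pairwise affinities between pixels.
    seed_labels: (N, C) one-hot rows for annotated pixels, zeros elsewhere."""
    A = affinity / (affinity.sum(axis=1, keepdims=True) + 1e-12)
    Y = seed_labels.astype(float).copy()
    for _ in range(iters):
        # Each pixel absorbs its neighbours' label mass, anchored to the seeds.
        Y = alpha * (A @ Y) + (1.0 - alpha) * seed_labels
    return Y / (Y.sum(axis=1, keepdims=True) + 1e-12)  # soft pseudo labels
```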

Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective
Pengfei Wei Lingdong Kong Xinghua Qu Yi Ren zhiqiang xu Jing Jiang Xiang Yin



Research question: This paper addresses the practical yet challenging task of unsupervised video domain adaptation.
Motivation: We tackle the problem for the first time from a disentanglement perspective, handling the spatial and temporal domain divergences separately.
Method: We develop a Transfer Sequential VAE (TranSVAE) framework to model the generative process and impose several objectives to constrain the latent factors.
Results: Extensive experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE over several state-of-the-art methods.

Unsupervised video domain adaptation is a practical yet challenging task. In this work, for the first time, we tackle it from a disentanglement view. Our key idea is to handle the spatial and temporal domain divergence separately through disentanglement. Specifically, we consider the generation of cross-domain videos from two sets of latent factors, one encoding the static information and another encoding the dynamic information. A Transfer Sequential VAE (TranSVAE) framework is then developed to model such generation. To better serve for adaptation, we propose several objectives to constrain the latent factors. With these constraints, the spatial divergence can be readily removed by disentangling the static domain-specific information out, and the temporal divergence is further reduced from both frame- and video-levels through adversarial learning. Extensive experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE compared with several state-of-the-art approaches.

Joint Attribute and Model Generalization Learning for Privacy-Preserving Action Recognition
Duo Peng Li Xu Qiuhong Ke Ping Hu Jun Liu



Research question: How to recognize actions from raw videos while protecting privacy and preventing privacy leakage.
Motivation: Privacy-preserving action recognition is an increasingly important problem in intelligent vision applications. Despite recent efforts, handling novel privacy attributes and novel privacy attack models that are unavailable during training remains challenging.
Method: From a meta-learning (learning to learn) perspective, we propose a novel Meta Privacy-Preserving Action Recognition (MPPAR) framework that improves generalization to both novel privacy attributes and novel privacy attack models in a unified manner. Concretely, we simulate train/test task shifts by constructing disjoint support/query sets with respect to privacy attributes or attack models, then apply a virtual training and testing scheme over these sets to provide feedback that steers the model toward better generalization.
Results: Extensive experiments demonstrate the effectiveness and generalization ability of the proposed framework compared with the state of the art.

Privacy-Preserving Action Recognition (PPAR) aims to transform raw videos into anonymous ones to prevent privacy leakage while maintaining action clues, which is an increasingly important problem in intelligent vision applications. Despite recent efforts in this task, it is still challenging to deal with novel privacy attributes and novel privacy attack models that are unavailable during the training phase. In this paper, from the perspective of meta-learning (learning to learn), we propose a novel Meta Privacy-Preserving Action Recognition (MPPAR) framework to improve both generalization abilities above (i.e., generalize to *novel privacy attributes* and *novel privacy attack models*) in a unified manner. Concretely, we simulate train/test task shifts by constructing disjoint support/query sets w.r.t. privacy attributes or attack models. Then, a virtual training and testing scheme is applied based on support/query sets to provide feedback to optimize the model's learning toward better generalization. Extensive experiments demonstrate the effectiveness and generalization of the proposed framework compared to state-of-the-arts.
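
The virtual training and testing scheme can be pictured with a generic MAML-style step. The sketch below is a simplification under stated assumptions (cross-entropy stands in for the actual MPPAR objective), using PyTorch's `torch.func.functional_call`.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def meta_step(model, support, query, inner_lr=1e-2):
    """One virtual train/test iteration over disjoint support/query sets
    (split w.r.t. privacy attributes or attack models). A generic
    MAML-style sketch of the scheme described above, not the exact MPPAR
    objective. support/query are (inputs, labels) batches."""
    x_s, y_s = support
    x_q, y_q = query
    params = dict(model.named_parameters())
    # Virtual training: one inner gradient step on the support task.
    loss_s = F.cross_entropy(functional_call(model, params, (x_s,)), y_s)
    grads = torch.autograd.grad(loss_s, tuple(params.values()), create_graph=True)
    fast = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
    # Virtual testing: the adapted model is scored on the held-out query task;
    # backpropagating this loss rewards updates that generalize across the shift.
    loss_q = F.cross_entropy(functional_call(model, fast, (x_q,)), y_q)
    return loss_s + loss_q  # meta-objective to backpropagate
```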

A Simple Yet Effective Strategy to Robustify the Meta Learning Paradigm
Cheems Wang Yiqin Lv Yanghe Feng Zheng Xie Jincai Huang



Research question: How to improve the robustness of meta learning to task distributions and reduce the worst-case fast-adaptation risk.
Motivation: Existing meta learning methods mostly optimize under the empirical risk minimization principle, but in risk-sensitive scenarios the worst fast adaptation can be catastrophic.
Method: This paper optimizes the meta learning pipeline from a distributionally robust perspective and meta-trains models with a tail task risk measure, adopting a two-stage strategy as a heuristic to solve the robust meta learning problem and control the worst fast-adaptation cases at a certain probability level.
Results: Experimental results show that this simple approach improves the robustness of meta learning to task distributions and reduces the conditional expectation of the worst fast-adaptation risk.

Meta learning is a promising paradigm to enable skill transfer across tasks. Most previous methods employ the empirical risk minimization principle in optimization. However, the resulting worst fast adaptation to a subset of tasks can be catastrophic in risk-sensitive scenarios. To robustify fast adaptation, this paper optimizes meta learning pipelines from a distributionally robust perspective and meta-trains models with a measure of tail task risk. We adopt a two-stage strategy as a heuristic to solve the robust meta learning problem, controlling the worst fast adaptation cases at a certain probabilistic level. Experimental results show that our simple method can improve the robustness of meta learning to task distributions and reduce the conditional expectation of the worst fast adaptation risk.
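
A tail task risk measure of the kind described can be sketched as a CVaR-style average over the hardest tasks in a meta-batch; this is a minimal reading, not the paper's exact two-stage procedure.

```python
import torch

def tail_risk_meta_loss(task_losses, alpha=0.9):
    """Average the worst (1 - alpha) fraction of per-task fast-adaptation
    losses, a CVaR-style tail measure sketching meta-training with tail
    task risk. task_losses: 1-D tensor of post-adaptation losses, one per
    sampled task in the meta-batch."""
    k = max(1, int(round((1.0 - alpha) * task_losses.numel())))
    worst, _ = torch.topk(task_losses, k)  # the k hardest tasks in the batch
    return worst.mean()                    # optimize the tail, not the mean
```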

Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels
Zifu Wang Xuefei Ning Matthew B. Blaschko



Research question: Design a loss function for semantic segmentation that supports key techniques such as label smoothing, knowledge distillation, and semi-supervised learning.
Motivation: Existing Intersection over Union (IoU) losses lack the flexibility to handle soft labels, which limits their use with these training techniques.
Method: We propose Jaccard Metric Losses (JMLs), identical to the standard soft Jaccard loss with hard labels but fully compatible with soft labels, and apply them to three prominent use cases: label smoothing, knowledge distillation, and semi-supervised learning.
Results: Experiments show that JMLs outperform the cross-entropy loss on 4 semantic segmentation datasets (Cityscapes, PASCAL VOC, ADE20K, DeepGlobe Land) and 13 architectures, markedly improving accuracy and calibration and surpassing state-of-the-art knowledge distillation and semi-supervised learning methods.

Intersection over Union (IoU) losses are surrogates that directly optimize the Jaccard index. Leveraging IoU losses as part of the loss function has demonstrated superior performance in semantic segmentation tasks compared to optimizing pixel-wise losses, such as the cross-entropy loss, alone. However, we identify a lack of flexibility in these losses to support vital training techniques like label smoothing, knowledge distillation, and semi-supervised learning, mainly due to their inability to process soft labels. To address this, we introduce Jaccard Metric Losses (JMLs), which are identical to the soft Jaccard loss in standard settings with hard labels but are fully compatible with soft labels. We apply JMLs to three prominent use cases of soft labels: label smoothing, knowledge distillation and semi-supervised learning, and demonstrate their potential to enhance model accuracy and calibration. Our experiments show consistent improvements over the cross-entropy loss across 4 semantic segmentation datasets (Cityscapes, PASCAL VOC, ADE20K, DeepGlobe Land) and 13 architectures, including classic CNNs and recent vision transformers. Remarkably, our straightforward approach significantly outperforms state-of-the-art knowledge distillation and semi-supervised learning methods. The code is available at \href{https://github.com/zifuwanggg/JDTLosses}{https://github.com/zifuwanggg/JDTLosses}.
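
The compatibility with soft labels can be seen in a minimal soft Jaccard loss, sketched below under the assumption of dense per-pixel class probabilities; the published JML variants may differ in detail.

```python
import torch

def soft_jaccard_loss(probs, soft_labels, eps=1e-7):
    """Jaccard (IoU) loss written so that targets may be soft, a minimal
    sketch consistent with the description above. probs, soft_labels:
    (N, C, H, W) tensors of class probabilities in [0, 1]. With hard
    one-hot targets this reduces to the usual soft Jaccard loss."""
    dims = (0, 2, 3)  # aggregate per class over batch and space
    intersection = (probs * soft_labels).sum(dims)
    union = (probs + soft_labels - probs * soft_labels).sum(dims)
    return (1.0 - (intersection + eps) / (union + eps)).mean()
```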

TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models
Pum Jun Kim Yoojin Jang Jisu Kim Jaejun Yoo



Research question: This paper proposes a robust and reliable evaluation metric for generative models, called Topological Precision and Recall (TopP&R), which systematically estimates supports.
Motivation: Existing metrics such as Inception Score (IS), Frechet Inception Distance (FID), and the various Precision and Recall (P&R) variants rely heavily on support estimates derived from sample features but ignore the reliability of those estimates, which undermines evaluation accuracy.
Method: TopP&R systematically estimates supports by retaining only features that are topologically and statistically significant at a given confidence level.
Results: Experiments show that current evaluation methods fail to assess sample quality accurately when support estimation is unreliable and yield inconsistent results, whereas TopP&R evaluates sample quality reliably and guarantees statistical consistency. Even under outliers and non-IID perturbations, where other methods produce inaccurate support estimates, TopP&R accurately captures the true trend of change in the samples. To our knowledge, TopP&R is the first evaluation metric focused specifically on robust support estimation with statistical consistency under noise.

We propose a robust and reliable evaluation metric for generative models called Topological Precision and Recall (TopP&R, pronounced “topper”), which systematically estimates supports by retaining only topologically and statistically significant features with a certain level of confidence. Existing metrics, such as Inception Score (IS), Frechet Inception Distance (FID), and various Precision and Recall (P&R) variants, rely heavily on support estimates derived from sample features. However, the reliability of these estimates has been overlooked, even though the quality of the evaluation hinges entirely on their accuracy. In this paper, we demonstrate that current methods not only fail to accurately assess sample quality when support estimation is unreliable, but also yield inconsistent results. In contrast, TopP&R reliably evaluates the sample quality and ensures statistical consistency in its results. Our theoretical and experimental findings reveal that TopP&R provides a robust evaluation, accurately capturing the true trend of change in samples, even in the presence of outliers and non-independent and identically distributed (Non-IID) perturbations where other methods result in inaccurate support estimations. To our knowledge, TopP&R is the first evaluation metric specifically focused on the robust estimation of supports, offering statistical consistency under noise conditions.

Aligning Language Models with Human Preferences via a Bayesian Approach
Jiashuo WANG Haozhao Wang Shichao Sun Wenjie Li



Research question: How to ensure that natural language generation (NLG) systems align with human preferences so as to improve their performance.
Motivation: The current mainstream approach of reinforcement learning (RL) with a reward model trained on human feedback struggles because the subjectivity of human preferences makes reward-model training difficult, which in turn degrades NLG performance.
Method: This paper proposes a new approach that uses a Bayesian framework to model the distribution of disagreements among human preferences and trains a preference model named d-PM. To improve training efficiency, a contrastive learning strategy trains the NLG model with the preference scores produced by the d-PM model.
Results: Extensive experiments on two human-centric NLG tasks, emotional support conversation and integrity "Rule-of-Thumb" generation, show that the method exceeds previous SOTA models in both automatic and human evaluations.

In the quest to advance human-centric natural language generation (NLG) systems, ensuring alignment between NLG models and human preferences is crucial. For this alignment, current popular methods leverage a reinforcement learning (RL) approach with a reward model trained on feedback from humans. However, inherent disagreements due to the subjective nature of human preferences pose a significant challenge for training the reward model, resulting in a deterioration of the NLG performance. To tackle this issue, previous approaches typically rely on majority voting or averaging to consolidate multiple inconsistent preferences into a merged one. Although straightforward to understand and execute, such methods suffer from an inability to capture the nuanced degrees of disagreement among humans and may only represent a specialized subset of individuals, thereby lacking the ability to quantitatively disclose the universality of human preferences. To address this challenge, this paper proposes a novel approach, which employs a Bayesian framework to account for the distribution of disagreements among human preferences when training a preference model, named $\textbf{d-PM}$. Besides, given the inefficient and complex training process of the RL strategy, we further propose utilizing the contrastive learning strategy to train the NLG model with the preference scores derived from the d-PM model. Extensive experiments on two human-centric NLG tasks, i.e., emotional support conversation and integrity ``Rule-of-Thumb'' generation, show that our method consistently exceeds previous SOTA models in both automatic and human evaluations.

Data Pruning via Moving-one-Sample-out
Haoru Tan Sitong Wu Fei Du Yukang Chen Zhibin Wang Fan Wang XIAOJUAN QI



Research question: How to effectively identify and remove the least informative samples from the training set.
Motivation: Removing the least informative samples reduces the computational burden and improves training efficiency.
Method: We propose a data pruning method called Moving-one-Sample-out (MoSo), which determines each sample's importance by assessing its impact on the optimal empirical risk and then removes the samples with the smallest impact.
Results: Experiments show that MoSo effectively mitigates performance degradation at high pruning ratios and outperforms state-of-the-art methods across various settings.

In this paper, we propose a novel data-pruning approach called moving-one-sample-out (MoSo), which aims to identify and remove the least informative samples from the training set. The core insight behind MoSo is to determine the importance of each sample by assessing its impact on the optimal empirical risk. This is achieved by measuring the extent to which the empirical risk changes when a particular sample is excluded from the training set. Instead of using the computationally expensive leaving-one-out-retraining procedure, we propose an efficient first-order approximator that only requires gradient information from different training stages. The key idea behind our approximation is that samples with gradients that are consistently aligned with the average gradient of the training set are more informative and should receive higher scores, which could be intuitively understood as follows: if the gradient from a specific sample is consistent with the average gradient vector, it implies that optimizing the network using the sample will yield a similar effect on all remaining samples. Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios and outperforms state-of-the-art methods by a large margin across various settings.
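
The gradient-alignment intuition admits a compact sketch: score each sample by the cosine between its gradient and the training-set average. This single-checkpoint version is a simplification; MoSo aggregates gradient information across several training stages.

```python
import numpy as np

def moso_style_scores(sample_grads, eps=1e-12):
    """First-order importance in the spirit of MoSo: samples whose
    gradients align with the average gradient of the training set receive
    higher scores. sample_grads: (N, D) per-sample flattened gradients
    from one checkpoint (an assumed simplification)."""
    mean_grad = sample_grads.mean(axis=0)
    norms = np.linalg.norm(sample_grads, axis=1) * np.linalg.norm(mean_grad)
    scores = sample_grads @ mean_grad / (norms + eps)  # cosine alignment
    return scores  # prune the lowest-scoring samples first
```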

Generator Born from Classifier
Runpeng Yu Xinchao Wang



Research question: This paper aims to reconstruct an image generator from a pre-trained classifier without relying on any data samples.
Motivation: Since the task involves identifying the inverse function of the classifier, which is by nature an information extraction process, the challenge appears intractable from a black-box perspective.
Method: Grounded in the theory of the maximum-margin bias of gradient descent, we propose a novel learning paradigm that trains the generator by enforcing the convergence conditions of the network parameters.
Results: Empirical validation on various image generation tasks substantiates the efficacy of our strategy.

In this paper, we make a bold attempt toward an ambitious task: given a pre-trained classifier, we aim to reconstruct an image generator, without relying on any data samples. From a black-box perspective, this challenge seems intractable, since it inevitably involves identifying the inverse function for a classifier, which is, by nature, an information extraction process. As such, we resort to leveraging the knowledge encapsulated within the parameters of the neural network. Grounded on the theory of Maximum-Margin Bias of gradient descent, we propose a novel learning paradigm, in which the generator is trained to ensure that the convergence conditions of the network parameters are satisfied over the generated distribution of the samples. Empirical validation from various image generation tasks substantiates the efficacy of our strategy.

DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models
Tao Yang Yuwang Wang Yan Lu Nanning Zheng



Research question: How to understand the interpretable factors behind observations and model the conditional generation process on these factors.
Motivation: Connecting disentangled representation learning with diffusion probabilistic models (DPMs) exploits the strong modeling ability of DPMs.
Method: We propose a new task, disentanglement of DPMs: without any factor annotations, automatically discover the inherent factors behind the observations and disentangle the DPM's gradient fields into sub-gradient fields, each conditioned on the representation of a discovered factor.
Results: We devise an unsupervised approach named DisDiff, achieving disentangled representation learning within the DPM framework for the first time. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of DisDiff.

Targeting to understand the underlying explainable factors behind observations and modeling the conditional generation process on these factors, we connect disentangled representation learning to diffusion probabilistic models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We propose a new task, disentanglement of DPMs: given a pre-trained DPM, without any annotations of the factors, the task is to automatically discover the inherent factors behind the observations and disentangle the gradient fields of the DPM into sub-gradient fields, each conditioned on the representation of a discovered factor. With disentangled DPMs, those inherent factors can be automatically discovered, explicitly represented, and clearly injected into the diffusion process via the sub-gradient fields. To tackle this task, we devise an unsupervised approach named DisDiff, achieving, for the first time, disentangled representation learning in the framework of DPMs. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of DisDiff.

Networks are Slacking Off: Understanding Generalization Problem in Image Deraining
Jinjin Gu Xianzheng Ma Xiangtao Kong Yu Qiao Chao Dong



Research question: Deep deraining networks perform well on laboratory benchmarks but often suffer severe generalization problems in real-world applications.
Motivation: Although deep learning encourages training on complex data in the hope that richer background content helps overcome the generalization problem, our comprehensive and systematic experiments show this strategy does not improve generalization; instead, it aggravates the networks' tendency to overfit specific degradations.
Method: We improve the generalization of deraining networks by simplifying the complexity of the training background images. Specifically, when the background images are less complex than the rain streaks, the network prioritizes background reconstruction, which suppresses overfitting to rain patterns and improves generalization.
Results: Our findings offer a valuable perspective and methodology for understanding the generalization problem in low-level vision tasks and show promising potential for practical applications.

Deep deraining networks consistently encounter substantial generalization issues when deployed in real-world applications, although they are successful in laboratory benchmarks. A prevailing perspective in deep learning encourages using highly complex data for training, with the expectation that richer image background content will facilitate overcoming the generalization problem. However, through comprehensive and systematic experimentation, we discover that this strategy does not enhance the generalization capability of these networks. On the contrary, it exacerbates the tendency of networks to overfit specific degradations. Our experiments reveal that better generalization in a deraining network can be achieved by simplifying the complexity of the training background images. This is because the networks are ``slacking off'' during training, that is, learning the least complex elements in the image background and degradation to minimize training loss. When the background images are less complex than the rain streaks, the network will prioritize the background reconstruction, thereby suppressing overfitting the rain patterns and leading to improved generalization performance. Our research offers a valuable perspective and methodology for better understanding the generalization problem in low-level vision tasks and displays promising potential for practical application.

When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Segmentation
Xinhong Ma Yiming Wang Hao Liu Tianyu Guo Yunhe Wang



Research question: How to adapt a pre-trained source model to an unlabeled target domain while avoiding access to private source data.
Motivation: Existing methods usually fine-tune the entire network, which incurs expensive parameter tuning. To address this, we propose parameter-efficient adaptation via visual prompt tuning.
Method: We propose a universal unsupervised visual prompt tuning (Uni-UVPT) framework applicable to various transformer-based backbones. Specifically, we divide the frozen pre-trained source backbone into multiple stages and propose a lightweight prompt adapter that progressively encodes informative knowledge into prompts and enhances the generalization of target features between adjacent backbone stages.
Results: Extensive experiments show that Uni-UVPT achieves state-of-the-art performance on the GTA5 to Cityscapes and SYNTHIA to Cityscapes tasks and can serve as a universal, parameter-efficient framework for large-model unsupervised knowledge transfer.

Source-free domain adaptive semantic segmentation aims to adapt a pre-trained source model to the unlabeled target domain without accessing the private source data. Previous methods usually fine-tune the entire network, which suffers from expensive parameter tuning. To avoid this problem, we propose to utilize visual prompt tuning for parameter-efficient adaptation. However, the existing visual prompt tuning methods are unsuitable for source-free domain adaptive semantic segmentation due to the following two reasons: (1) Commonly used visual prompts like input tokens or pixel-level perturbations cannot reliably learn informative knowledge beneficial for semantic segmentation. (2) Visual prompts require sufficient labeled data to fill the gap between the pre-trained model and downstream tasks. To alleviate these problems, we propose a universal unsupervised visual prompt tuning (Uni-UVPT) framework, which is applicable to various transformer-based backbones. Specifically, we first divide the source pre-trained backbone with frozen parameters into multiple stages, and propose a lightweight prompt adapter for progressively encoding informative knowledge into prompts and enhancing the generalization of target features between adjacent backbone stages. Cooperatively, a novel adaptive pseudo-label correction strategy with a multiscale consistency loss is designed to alleviate the negative effect of target samples with noisy pseudo labels and raise the capacity of visual prompts to spatial perturbations. Extensive experiments demonstrate that Uni-UVPT achieves state-of-the-art performance on GTA5 $\to$ Cityscapes and SYNTHIA $\to$ Cityscapes tasks and can serve as a universal and parameter-efficient framework for large-model unsupervised knowledge transfer. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/uni-uvpt and https://github.com/huawei-noah/noah-research/tree/master/uni-uvpt.

Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation
Jianing Zhu Geng Yu Jiangchao Yao Tongliang Liu Gang Niu Masashi Sugiyama Bo Han



Research question: How to perform effective out-of-distribution (OOD) detection so that reliable machine learning models can be deployed in real-world applications.
Motivation: Existing OOD detection methods require an outlier set large and representative enough to cover the boundary between ID and OOD data, which can be impractical and challenging to collect.
Method: We propose a novel framework, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on given auxiliary outliers. Specifically, DivOE introduces a new learning objective that diversifies the auxiliary distribution by explicitly synthesizing more informative outliers during training, using a multi-step optimization method to generate novel outliers beyond the original ones; the scheme is compatible with many outlier exposure variants.
Results: Extensive experiments and analyses demonstrate the effectiveness of the proposed DivOE. The code is publicly available.

Out-of-distribution (OOD) detection is important for deploying reliable machine learning models on real-world applications. Recent advances in outlier exposure have shown promising results on OOD detection via fine-tuning model with informatively sampled auxiliary outliers. However, previous methods assume that the collected outliers can be sufficiently large and representative to cover the boundary between ID and OOD data, which might be impractical and challenging. In this work, we propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers. Specifically, DivOE introduces a new learning objective, which diversifies the auxiliary distribution by explicitly synthesizing more informative outliers for extrapolation during training. It leverages a multi-step optimization method to generate novel outliers beyond the original ones, which is compatible with many variants of outlier exposure. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed DivOE. The code is publicly available at: https://github.com/tmlr-group/DivOE.
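
One plausible form of the multi-step extrapolation is a sign-gradient walk that makes auxiliary outliers look more ID-like, sketched below with the maximum softmax probability as an assumed OOD-ness signal; DivOE's precise objective and schedule differ.

```python
import torch

def extrapolate_outliers(model, x_aux, steps=4, step_size=1.0 / 255):
    """Multi-step optimization that perturbs given auxiliary outliers toward
    more informative variants near the ID/OOD boundary, a hedged sketch of
    the extrapolation idea. x_aux: batch of auxiliary outlier images."""
    x = x_aux.clone().detach().requires_grad_(True)
    for _ in range(steps):
        msp = model(x).softmax(dim=1).amax(dim=1)   # high MSP = looks more ID-like
        grad = torch.autograd.grad(msp.sum(), x)[0]
        # Step toward the boundary to diversify the auxiliary distribution.
        x = (x + step_size * grad.sign()).detach().requires_grad_(True)
    return x.detach()  # mix with the originals in the outlier-exposure loss
```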

Mitigating Source Bias for Fairer Weak Supervision
Changho Shin Sonia Cromp Dyah Adila Frederic Sala



Research question: Weak supervision reduces the need for ground-truth labels and speeds up training-set development, but the pseudolabels it produces can be highly biased, and it has not been studied from a fairness perspective.
Motivation: Even when a fair model can be built from a dataset with ground-truth labels, the corresponding dataset labeled via weak supervision can be arbitrarily unfair.
Method: We propose and empirically validate a model of source unfairness in weak supervision, then introduce a counterfactual fairness-based technique to mitigate these biases.
Results: Theoretically, our approach can improve accuracy and fairness simultaneously, in contrast to standard fairness approaches that suffer from tradeoffs. Empirically, it improves accuracy over weak supervision baselines by as much as 32% while reducing the demographic parity gap by 82.5%.

Weak supervision enables efficient development of training sets by reducing the need for ground truth labels. However, the techniques that make weak supervision attractive---such as integrating any source of signal to estimate unknown labels---also entail the danger that the produced pseudolabels are highly biased. Surprisingly, given everyday use and the potential for increased bias, weak supervision has not been studied from the point of view of fairness. We begin such a study, starting with the observation that even when a fair model can be built from a dataset with access to ground-truth labels, the corresponding dataset labeled via weak supervision can be arbitrarily unfair. To address this, we propose and empirically validate a model for source unfairness in weak supervision, then introduce a simple counterfactual fairness-based technique that can mitigate these biases. Theoretically, we show that it is possible for our approach to simultaneously improve both accuracy and fairness---in contrast to standard fairness approaches that suffer from tradeoffs. Empirically, we show that our technique improves accuracy on weak supervision baselines by as much as 32\% while reducing demographic parity gap by 82.5\%. A simple extension of our method aimed at maximizing performance produces state-of-the-art performance in five out of ten datasets in the WRENCH benchmark.

Learning to Augment Distributions for Out-of-distribution Detection
Qizhou Wang Zhen Fang Yonggang Zhang Feng Liu Yixuan Li Bo Han



Research question: How auxiliary OOD data can help open-world classifiers discern unseen out-of-distribution (OOD) data whose labels deviate from in-distribution (ID) cases.
Motivation: Advanced OOD detection methods may still fail in the open world because knowledge about unseen OOD data is lacking in advance; how auxiliary OOD data (distinct from the unseen data) works in the open world remains to be analyzed.
Method: From a learning theory perspective, we find that the distribution discrepancy between the auxiliary and the unseen real OOD data is key to open-world detection performance, and propose Distributional-Augmented OOD Learning (DAOL), which crafts an OOD distribution set containing all distributions in a Wasserstein ball centered on the auxiliary OOD distribution and trains the predictor on the worst OOD data in the ball.
Results: Extensive evaluations across representative OOD detection setups demonstrate the superiority of DAOL over its advanced counterparts.

Open-world classification systems should discern out-of-distribution (OOD) data whose labels deviate from those of in-distribution (ID) cases, motivating recent studies in OOD detection. Advanced works, despite their promising progress, may still fail in the open world, owing to the lack of knowledge about unseen OOD data in advance. Although one can access auxiliary OOD data (distinct from unseen ones) for model training, it remains to analyze how such auxiliary data will work in the open world. To this end, we delve into such a problem from a learning theory perspective, finding that the distribution discrepancy between the auxiliary and the unseen real OOD data is the key to affecting the open-world detection performance. Accordingly, we propose Distributional-Augmented OOD Learning (DAOL), alleviating the OOD distribution discrepancy by crafting an OOD distribution set that contains all distributions in a Wasserstein ball centered on the auxiliary OOD distribution. We justify that the predictor trained over the worst OOD data in the ball can shrink the OOD distribution discrepancy, thus improving the open-world detection performance given only the auxiliary OOD data. We conduct extensive evaluations across representative OOD detection setups, demonstrating the superiority of our DAOL over its advanced counterparts.
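
The worst-case training inside a Wasserstein ball can be approximated in the style of distributionally robust optimization with a transport penalty. The sketch below is one such reading, where `oe_loss` is an assumed outlier-exposure loss on auxiliary data, not DAOL's exact formulation.

```python
import torch

def worst_case_outliers(model, oe_loss, x_aux, gamma=10.0, steps=5, lr=0.1):
    """Approximate the worst OOD distribution inside a Wasserstein ball
    centered on the auxiliary outliers: gradient ascent on the
    outlier-exposure loss with a transport penalty keeping perturbed
    samples near the originals. oe_loss(logits) -> scalar, e.g. KL to the
    uniform prediction (an assumed choice)."""
    x = x_aux.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # Penalized objective: large OE loss, small transport cost.
        obj = oe_loss(model(x)) - gamma * ((x - x_aux) ** 2).mean()
        grad = torch.autograd.grad(obj, x)[0]
        x = (x + lr * grad).detach().requires_grad_(True)  # ascent step
    return x.detach()  # train the predictor on these worst-case outliers
```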

Real-World Image Super-Resolution as Multi-Task Learning
Wenlong Zhang Xiaohui Li Guangyuan SHI Xiangyu Chen Yu Qiao Xiaoyun Zhang Xiao-Ming Wu Chao Dong



Research question: How to address real-world image super-resolution (real-SR), viewed from a multi-task learning perspective.
Motivation: The conventional formulation of real-SR amounts to solving multiple distinct degradation tasks with a single shared model, which induces task competition, where certain tasks dominate learning while others suffer; this problem is exacerbated by the many degradation tasks involved in real-SR.
Method: We propose a task grouping approach that efficiently identifies the degradation tasks where a real-SR model falls short and groups these unsatisfactory tasks into multiple task groups for further training, mitigating task competition and facilitating knowledge transfer.
Results: Extensive experiments demonstrate significantly enhanced performance across a wide range of degradation scenarios.

In this paper, we take a new look at real-world image super-resolution (real-SR) from a multi-task learning perspective. We demonstrate that the conventional formulation of real-SR can be viewed as solving multiple distinct degradation tasks using a single shared model. This poses a challenge known as task competition or task conflict in multi-task learning, where certain tasks dominate the learning process, resulting in poor performance on other tasks. This problem is exacerbated in the case of real-SR, due to the involvement of numerous degradation tasks. To address the issue of task competition in real-SR, we propose a task grouping approach. Our approach efficiently identifies the degradation tasks where a real-SR model falls short and groups these unsatisfactory tasks into multiple task groups for further training. By grouping similar tasks together, our approach mitigates task competition and facilitates effective knowledge transfer. Extensive experiments demonstrate our method achieves significantly enhanced performance across a wide range of degradation scenarios.

Unsupervised Graph Neural Architecture Search with Disentangled Self-Supervision
Zeyang Zhang Xin Wang Ziwei Zhang Guangyao Shen Shiqi Shen Wenwu Zhu



Research question: How to perform graph neural architecture search without the supervised labels that existing GNAS methods rely on.
Motivation: The key is to discover the latent graph factors that drive the formation of graph data and their relations to the optimal architectures, which is challenging because the latent factors and architectures are highly entangled.
Method: We propose a Disentangled Self-supervised Graph Neural Architecture Search (DSGAS) model comprising a disentangled graph super-network, self-supervised training with joint architecture-graph disentanglement, and a contrastive search with architecture augmentations to discover architectures with factor-specific expertise.
Results: Extensive experiments on 11 real-world datasets show state-of-the-art performance against several baselines in an unsupervised manner.

The existing graph neural architecture search (GNAS) methods heavily rely on supervised labels during the search process, failing to handle ubiquitous scenarios where supervisions are not available. In this paper, we study the problem of unsupervised graph neural architecture search, which remains unexplored in the literature. The key problem is to discover the latent graph factors that drive the formation of graph data as well as the underlying relations between the factors and the optimal neural architectures. Handling this problem is challenging given that the latent graph factors together with architectures are highly entangled due to the nature of the graph and the complexity of the neural architecture search process. To address the challenge, we propose a novel Disentangled Self-supervised Graph Neural Architecture Search (DSGAS) model, which is able to discover the optimal architectures capturing various latent graph factors in a self-supervised fashion based on unlabeled graph data. Specifically, we first design a disentangled graph super-network capable of incorporating multiple architectures with factor-wise disentanglement, which are optimized simultaneously. Then, we estimate the performance of architectures under different factors by our proposed self-supervised training with joint architecture-graph disentanglement. Finally, we propose a contrastive search with architecture augmentations to discover architectures with factor-specific expertise. Extensive experiments on 11 real-world datasets demonstrate that the proposed model is able to achieve state-of-the-art performance against several baseline methods in an unsupervised manner.

Expanding Small-Scale Datasets with Guided Imagination
Yifan Zhang Daquan Zhou Bryan Hooi Kai Wang Jiashi Feng



Research question: How to expand a ready-to-use small dataset by automatically creating new labeled samples, a task termed dataset expansion.
Motivation: The power of DNNs relies heavily on the quantity and quality of training data, yet collecting and annotating data at scale is expensive and time-consuming.
Method: We present a Guided Imagination Framework (GIF) that leverages generative models such as DALL-E2 and Stable Diffusion to "imagine" informative new data from seed data, optimizing latent features in the prior model's semantic space under two criteria: class-maintained information boosting and sample diversity promotion.
Results: GIF expands small datasets in various scenarios, boosting model accuracy by 36.9% on average over six natural image datasets and by 13.5% on average over three medical datasets.

The power of DNNs relies heavily on the quantity and quality of training data. However, collecting and annotating data on a large scale is often expensive and time-consuming. To address this issue, we explore a new task, termed dataset expansion, aimed at expanding a ready-to-use small dataset by automatically creating new labeled samples. To this end, we present a Guided Imagination Framework (GIF) that leverages cutting-edge generative models like DALL-E2 and Stable Diffusion (SD) to "imagine" and create informative new data from the input seed data. Specifically, GIF conducts data imagination by optimizing the latent features of the seed data in the semantically meaningful space of the prior model, resulting in the creation of photo-realistic images with new content. To guide the imagination towards creating informative samples for model training, we introduce two key criteria, i.e., class-maintained information boosting and sample diversity promotion. These criteria are verified to be essential for effective dataset expansion: GIF-SD obtains 13.5% higher model accuracy on natural image datasets than unguided expansion with SD. With these essential criteria, GIF successfully expands small datasets in various scenarios, boosting model accuracy by 36.9% on average over six natural image datasets and by 13.5% on average over three medical datasets. The source code is available at https://github.com/Vanint/DatasetExpansion.

Mitigating Test-Time Bias for Fair Image Retrieval
Fanjie Kong Shuai Yuan Weituo Hao Ricardo Henao



Research question: How to generate fair and unbiased image retrieval results for neutral textual queries while maintaining the utility of the underlying vision-language model.
Motivation: Previous methods disentangle learned representations from gender and racial characteristics, but this is inadequate for equal representation because test-time bias usually exists in the target retrieval set.
Method: We introduce a straightforward technique, Post-hoc Bias Mitigation (PBM), which post-processes the outputs of the pre-trained vision-language model.
Results: On the Occupation 1 and 2 image search datasets and the large-scale MS-COCO and Flickr30k datasets, PBM achieves the lowest bias among existing bias-mitigation methods while maintaining satisfactory retrieval performance.

We address the challenge of generating fair and unbiased image retrieval results given neutral textual queries (with no explicit gender or race connotations), while maintaining the utility (performance) of the underlying vision-language (VL) model. Previous methods aim to disentangle learned representations of images and text queries from gender and racial characteristics. However, we show these are inadequate at alleviating bias for the desired equal representation result, as there usually exists test-time bias in the target retrieval set. So motivated, we introduce a straightforward technique, Post-hoc Bias Mitigation (PBM), that post-processes the outputs from the pre-trained vision-language model. We evaluate our algorithm on real-world image search datasets, Occupation 1 and 2, as well as two large-scale image-text datasets, MS-COCO and Flickr30k. Our approach achieves the lowest bias, compared with various existing bias-mitigation methods, in text-based image retrieval result while maintaining satisfactory retrieval performance. The source code is publicly available at \url{https://github.com/timqqt/Fair_Text_based_Image_Retrieval}.
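
A post-hoc mitigation step of this flavour can be as simple as re-ranking the retrieval list under balanced group quotas. The sketch below is hypothetical (the `attr` field stands for a sensitive attribute predicted at test time) and is not necessarily PBM's exact procedure.

```python
def rebalance_topk(candidates, k):
    """Greedily fill the top-k retrieval list while keeping the counts of a
    predicted sensitive attribute balanced, a hypothetical post-processing
    sketch. candidates: list of (score, attr) sorted by score descending."""
    picked, counts = [], {}
    for score, attr in candidates:
        if len(picked) == k:
            break
        # Admit an item only if its group is not already over-represented.
        if counts.get(attr, 0) < (k + 1) // 2:
            picked.append((score, attr))
            counts[attr] = counts.get(attr, 0) + 1
    # Backfill with the best remaining items if one group is exhausted.
    if len(picked) < k:
        remaining = [c for c in candidates if c not in picked]
        picked.extend(remaining[: k - len(picked)])
    return picked
```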

Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning
Yihua Zhang Yimeng Zhang Aochuan Chen Jinghan Jia Jiancheng Liu Gaowen Liu Mingyi Hong Shiyu Chang Sijia Liu



Research question: How to prune a source dataset for improved pretraining efficiency and lossless fine-tuning accuracy on downstream target tasks.
Motivation: Massive data incurs significant computational and infrastructural costs; dataset pruning (DP) improves data efficiency, but DP for transfer learning remains open because prior studies treat DP and transfer learning separately, and existing DP methods do not suit the transfer learning paradigm.
Method: Revisiting DP through the lens of source-target domain mapping, we propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings respectively.
Results: Source data classes can be pruned by 40% to 80% without sacrificing downstream performance, yielding a 2 to 5 times speed-up during pretraining; the approach also improves other computationally intensive transfer learning techniques such as adversarial pretraining.

Massive data is often considered essential for deep learning applications, but it also incurs significant computational and infrastructural costs. Therefore, dataset pruning (DP) has emerged as an effective way to improve data efficiency by identifying and removing redundant training samples without sacrificing performance. In this work, we aim to address the problem of DP for transfer learning, i.e., how to prune a source dataset for improved pretraining efficiency and lossless finetuning accuracy on downstream target tasks. To our best knowledge, the problem of DP for transfer learning remains open, as previous studies have primarily addressed DP and transfer learning as separate problems. By contrast, we establish a unified viewpoint to integrate DP with transfer learning and find that existing DP methods are not suitable for the transfer learning paradigm. We then propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings respectively, by revisiting the DP problem through the lens of source-target domain mapping. Furthermore, we demonstrate the effectiveness of our approach on numerous transfer learning tasks. We show that source data classes can be pruned by up to $40\%\sim 80\%$ without sacrificing the downstream performance, resulting in a significant $2\sim 5\times$ speed-up during the pretraining stage. Besides, our proposal exhibits broad applicability and can improve other computationally intensive transfer learning techniques, such as adversarial pretraining.

Out-of-distribution Detection Learning with Unreliable Out-of-distribution Sources
Haotian Zheng Qizhou Wang Zhen Fang Xiaobo Xia Feng Liu Tongliang Liu Bo Han



Research question: How to train an out-of-distribution (OOD) detector when real OOD data are hard to collect, by synthesizing OOD data with generators.
Motivation: Generation-based methods pre-train a generator on ID data and select likely OOD cases, but the generated data may still coincide with ID semantics; such mistaken OOD generation confuses the predictor between ID and OOD data.
Method: We suggest using the generated data (with mistaken OOD generation) to devise an auxiliary OOD detection task that facilitates real OOD detection; learning from this auxiliary task is guaranteed beneficial when the ID and OOD parts have disjoint supports, given a well-designed training procedure, yielding Auxiliary Task-based OOD Learning (ATOL).
Results: Extensive experiments under various OOD detection setups demonstrate the effectiveness of the method against advanced counterparts.

Out-of-distribution (OOD) detection discerns OOD data, on which the predictor cannot make valid predictions, from in-distribution (ID) data, thereby increasing the reliability of open-world classification. However, it is typically hard to collect real OOD data for training a predictor capable of discerning ID and OOD patterns. This obstacle gives rise to *data generation-based learning methods*, synthesizing OOD data via data generators for predictor training without requiring any real OOD data. Related methods typically pre-train a generator on ID data and adopt various selection procedures to find those data likely to be the OOD cases. However, generated data may still coincide with ID semantics, i.e., mistaken OOD generation remains, confusing the predictor between ID and OOD data. To this end, we suggest that generated data (with mistaken OOD generation) can be used to devise an *auxiliary OOD detection task* to facilitate real OOD detection. Specifically, we can ensure that learning from such an auxiliary task is beneficial if the ID and the OOD parts have disjoint supports, with the help of a well-designed training procedure for the predictor. Accordingly, we propose a powerful data generation-based learning method named *Auxiliary Task-based OOD Learning* (ATOL) that can relieve the mistaken OOD generation. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts.

Learning Domain-Aware Detection Head with Prompt Tuning
Haochen Li Rui Zhang Hantao Yao Xinkai Song Yifan Hao Yongwei Zhao Ling Li Yunji Chen



Research question: How to generalize detectors trained on an annotated source domain to an unlabelled target domain while addressing domain bias in the detection head.
Motivation: Existing domain adaptive object detection (DAOD) methods reduce the domain bias of the detection backbone but ignore the bias in the detection head; given the strong generalization of vision-language models (VLMs), a domain-aware detection head atop a VLM backbone can learn a discriminative detector for each domain.
Method: We propose Domain-Aware detection head with Prompt tuning (DA-Pro), which uses a learnable domain-adaptive prompt (domain-invariant tokens, domain-specific tokens, and a domain-related textual description with the class label) to generate a dynamic detection head per domain, with two cross-domain constraints to capture domain-shared and domain-specific knowledge and a prompt ensemble strategy to reduce prompt disturbance.
Results: Comprehensive experiments over multiple cross-domain adaptation tasks demonstrate that the domain-adaptive prompt produces an effective domain-related detection head that boosts domain-adaptive object detection.

Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. However, existing methods focus on reducing the domain bias of the detection backbone by inferring a discriminative visual encoder, while ignoring the domain bias in the detection head. Inspired by the high generalization of vision-language models (VLMs), applying a VLM as the robust detection backbone following a domain-aware detection head is a reasonable way to learn the discriminative detector for each domain, rather than reducing the domain bias in traditional methods. To address the above issue, we thus propose a novel DAOD framework named Domain-Aware detection head with Prompt tuning (DA-Pro), which applies the learnable domain-adaptive prompt to generate the dynamic detection head for each domain. Formally, the domain-adaptive prompt consists of the domain-invariant tokens, domain-specific tokens, and the domain-related textual description along with the class label. Furthermore, two constraints between the source and target domains are applied to ensure that the domain-adaptive prompt can capture the domain-shared and domain-specific knowledge. A prompt ensemble strategy is also proposed to reduce the effect of prompt disturbance. Comprehensive experiments over multiple cross-domain adaptation tasks demonstrate that using the domain-adaptive prompt can produce an effective domain-related detection head for boosting domain-adaptive object detection. Our code is available at https://github.com/Therock90421/DA-Pro.

Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias
Zhongwei Wan Che Liu Mi Zhang Jie Fu Benyou Wang Sibo Cheng Lei Ma César Quilodrán-Casas Rossella Arcucci



Research question: How to combine medical vision-language pre-training (VLP) data from different language communities, given that data scarcity critically limits VLP efficacy.
Motivation: Integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge introduces community bias caused by different languages.
Method: We present Med-UniC, a framework unifying multi-modal medical data in English and Spanish via Cross-lingual Text Alignment Regularization (CTR), optimized through latent language disentanglement so that the objective does not depend on negative samples and the cross-lingual representation is not biased toward any specific language community.
Results: Med-UniC achieves superior performance across 5 medical image tasks and 10 datasets covering over 30 diseases; reducing community bias improves not only vision-language tasks but also uni-modal visual tasks.

The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language communities. Nevertheless, the main challenge stems from the complexity of integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge. Therefore, one crucial aspect to consider is the presence of community bias caused by different languages. This paper presents a novel framework named Unifying Cross-Lingual Medical Vision-Language Pre-Training (\textbf{Med-UniC}), designed to integrate multi-modal medical data from the two most prevalent languages, English and Spanish. Specifically, we propose \textbf{C}ross-lingual \textbf{T}ext Alignment \textbf{R}egularization (\textbf{CTR}) to explicitly unify cross-lingual semantic representations of medical reports originating from diverse language communities. \textbf{CTR} is optimized through latent language disentanglement, rendering our optimization objective to not depend on negative samples, thereby significantly mitigating the bias from determining positive-negative sample pairs within analogous medical reports. Furthermore, it ensures that the cross-lingual representation is not biased toward any specific language community. \textbf{Med-UniC} reaches superior performance across 5 medical image tasks and 10 datasets encompassing over 30 diseases, offering a versatile framework for unifying multi-modal medical data within diverse linguistic communities. The experimental outcomes highlight the presence of community bias in cross-lingual VLP. Reducing this bias enhances the performance not only in vision-language tasks but also in uni-modal visual tasks.

What Do Deep Saliency Models Learn about Visual Attention?
Shi Chen Ming Jiang Qi Zhao



Research question: What do deep saliency models learn about visual attention, given that the mechanisms behind their success remain largely unexplained?
Motivation: The opaque nature of deep neural networks obscures the implicit features these models learn and their contributions to saliency prediction.
Method: We present a novel analytic framework that decomposes the implicit features into interpretable bases explicitly aligned with semantic attributes and reformulates saliency prediction as a weighted combination of probability maps connecting the bases and saliency.
Results: We analyze the positive and negative weights of semantics, the impact of training data and architectural designs, the progressive influence of fine-tuning, and common error patterns of state-of-the-art models, and explore attention characteristics in applications such as the atypical attention of people with autism spectrum disorder, attention to emotion-eliciting stimuli, and attention evolution over time.

In recent years, deep saliency models have made significant progress in predicting human visual attention. However, the mechanisms behind their success remain largely unexplained due to the opaque nature of deep neural networks. In this paper, we present a novel analytic framework that sheds light on the implicit features learned by saliency models and provides principled interpretation and quantification of their contributions to saliency prediction. Our approach decomposes these implicit features into interpretable bases that are explicitly aligned with semantic attributes and reformulates saliency prediction as a weighted combination of probability maps connecting the bases and saliency. By applying our framework, we conduct extensive analyses from various perspectives, including the positive and negative weights of semantics, the impact of training data and architectural designs, the progressive influences of fine-tuning, and common error patterns of state-of-the-art deep saliency models. Additionally, we demonstrate the effectiveness of our framework by exploring visual attention characteristics in various application scenarios, such as the atypical attention of people with autism spectrum disorder, attention to emotion-eliciting stimuli, and attention evolution over time. Our code is publicly available at \url{https://github.com/szzexpoi/saliency_analysis}.

The Rise of AI Language Pathologists: Exploring Two-level Prompt Learning for Few-shot Weakly-supervised Whole Slide Image Classification
Linhao Qu xiaoyuan Luo Kexue Fu Manning Wang Zhijian Song



Research question: Few-shot weakly supervised learning for pathology Whole Slide Image (WSI) classification (FSWC), where both bags and instances must be classified with only a limited number of labeled bags.
Motivation: WSI classification is usually cast as Multiple Instance Learning (MIL), and the weak bag labels make FSWC harder than conventional few-shot learning; the success of vision-language models in few-shot classification suggests incorporating language prior knowledge.
Method: We propose a two-level prompt learning MIL framework tailored for pathology: CLIP extracts instance features for each patch, a prompt-guided pooling strategy aggregates them into a bag feature, and GPT-4 in a question-and-answer mode provides instance- and bag-level language prior knowledge, with a learnable prompt component trained on the few-shot labeled data.
Results: Extensive experiments on three real WSI datasets covering breast, lung, and cervical cancer demonstrate notable bag and instance classification performance.

This paper introduces the novel concept of few-shot weakly supervised learning for pathology Whole Slide Image (WSI) classification, denoted as FSWC. A solution is proposed based on prompt learning and the utilization of a large language model, GPT-4. Since a WSI is too large and needs to be divided into patches for processing, WSI classification is commonly approached as a Multiple Instance Learning (MIL) problem. In this context, each WSI is considered a bag, and the obtained patches are treated as instances. The objective of FSWC is to classify both bags and instances with only a limited number of labeled bags. Unlike conventional few-shot learning problems, FSWC poses additional challenges due to its weak bag labels within the MIL framework. Drawing inspiration from the recent achievements of vision-language models (V-L models) in downstream few-shot classification tasks, we propose a two-level prompt learning MIL framework tailored for pathology, incorporating language prior knowledge. Specifically, we leverage CLIP to extract instance features for each patch, and introduce a prompt-guided pooling strategy to aggregate these instance features into a bag feature. Subsequently, we employ a small number of labeled bags to facilitate few-shot prompt learning based on the bag features. Our approach incorporates the utilization of GPT-4 in a question-and-answer mode to obtain language prior knowledge at both the instance and bag levels, which are then integrated into the instance and bag level language prompts. Additionally, a learnable component of the language prompts is trained using the available few-shot labeled data. We conduct extensive experiments on three real WSI datasets encompassing breast cancer, lung cancer, and cervical cancer, demonstrating the notable performance of the proposed method in bag and instance classification. All codes will be made publicly accessible.
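
The prompt-guided pooling step can be sketched as attention over instance features keyed by the prompt embedding; a minimal version, assuming precomputed CLIP features, is shown below.

```python
import torch
import torch.nn.functional as F

def prompt_guided_pooling(instance_feats, prompt_feat, tau=0.07):
    """Aggregate patch (instance) features into one bag feature, weighting
    each instance by its similarity to a learned prompt embedding. A
    minimal sketch of prompt-guided pooling; the paper's exact aggregation
    may differ. instance_feats: (N, D) CLIP features of one WSI's patches.
    prompt_feat: (D,) embedding of the instance-level language prompt."""
    f = F.normalize(instance_feats, dim=1)
    p = F.normalize(prompt_feat, dim=0)
    attn = ((f @ p) / tau).softmax(dim=0)   # prompt-relevance weights
    return attn @ instance_feats            # (D,) bag-level feature
```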

DAW: Exploring the Better Weighting Function for Semi-supervised Semantic Segmentation
Rui Sun Huayu Mai Tianzhu Zhang Feng Wu



Research question: How to fully exploit unlabeled data in semi-supervised semantic segmentation through a better weighting function for pixel-level pseudo-labels.
Motivation: Existing methods face a trade-off between inaccurate yet utilized pseudo-labels and correct yet discarded ones; without thoughtful consideration of the weighting function, this trade-off hinders the model's generalization ability.
Method: We explicitly model the confidence distributions of correct and inaccurate pseudo-labels under a unified weighting function and propose Distribution-Aware Weighting (DAW) to minimize the negative impact of the trade-off; the optimal weighting function turns out to be a hard step function with its jump at the intersection of the two confidence distributions, complemented by distribution alignment between the predictions on labeled and unlabeled data.
Results: Extensive experiments on multiple benchmarks, including mitochondria segmentation, show that DAW performs favorably against state-of-the-art methods.

The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. Existing methods tend to employ certain criteria (weighting functions) to select pixel-level pseudo labels. However, a trade-off arises in these methods between inaccurate yet utilized pseudo-labels and correct yet discarded pseudo-labels when the weighting function is not chosen thoughtfully, hindering the generalization ability of the model. In this paper, we systematically analyze the trade-off in previous methods when dealing with pseudo-labels. We formally define the trade-off between inaccurate yet utilized pseudo-labels and correct yet discarded pseudo-labels by explicitly modeling the confidence distributions of correct and inaccurate pseudo-labels, equipped with a unified weighting function. To this end, we propose Distribution-Aware Weighting (DAW) to strive to minimize the negative equivalence impact raised by the trade-off. We find an interesting fact that the optimal solution for the weighting function is a hard step function, with the jump point located at the intersection of the two confidence distributions. Besides, we devise distribution alignment to mitigate the issue of the discrepancy between the prediction distributions of labeled and unlabeled data. Extensive experimental results on multiple benchmarks including mitochondria segmentation demonstrate that DAW performs favorably against state-of-the-art methods.
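
The jump point of the optimal hard-step weighting can be located numerically once the two confidence distributions are modelled. The sketch below assumes Gaussian fits, an illustrative choice rather than the paper's fitted model.

```python
import numpy as np
from scipy.stats import norm

def step_threshold(mu_c, sd_c, mu_w, sd_w, n=10001):
    """Locate the jump point of the optimal hard-step weighting: the
    confidence at which the densities of correct (mu_c, sd_c) and
    inaccurate (mu_w, sd_w) pseudo-labels intersect, both modelled here
    as Gaussians (an assumption)."""
    lo, hi = min(mu_w, mu_c), max(mu_w, mu_c)
    grid = np.linspace(lo, hi, n)  # the relevant crossing lies between the modes
    gap = norm.pdf(grid, mu_c, sd_c) - norm.pdf(grid, mu_w, sd_w)
    return grid[np.argmin(np.abs(gap))]  # keep pseudo-labels above this confidence

# Example: step_threshold(0.92, 0.04, 0.65, 0.12) gives the confidence cut-off.
```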

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes
Yulei Qin Xingyu Chen Yunhang Shen Chaoyou Fu Yun Gu Ke Li Xing Sun Rongrong Ji



Research question: How to learn visual representations with correct semantics from web data, where label noise is pervasive and existing methods make limited assumptions about clean samples.
Motivation: Web images retrieved with queries such as "tiger cat" (a cat species) or "drumstick" (a musical instrument) are dominated by images of tigers and chickens, so exploiting both web images and their associated texts is requisite to combat real-world noise.
Method: We propose Cross-modality Aligned Prototypes (CAPro), a unified prototypical contrastive learning framework that uses textual prototypes to select clean images by text matching, completes and enhances noisy texts via the visual feature space, polishes the semantically aligned visual prototypes with high-quality samples for cluster regularization and noise removal, and adds collective bootstrapping for smoother label reference.
Results: Experiments on WebVision1k and NUS-WIDE (Web) show that CAPro handles realistic noise in both single-label and multi-label scenarios, achieves new state-of-the-art performance, and is robust to open-set recognition.

Webly supervised learning has attracted increasing attention for its effectiveness in exploring publicly accessible data at scale without manual annotation. However, most existing methods of learning with web datasets are faced with challenges from label noise, and they have limited assumptions on clean samples under various noise. For instance, web images retrieved with queries of "tiger cat" (a cat species) and "drumstick" (a musical instrument) are almost dominated by images of tigers and chickens, which exacerbates the challenge of fine-grained visual concept learning. In this case, exploiting both web images and their associated texts is a requisite solution to combat real-world noise. In this paper, we propose Cross-modality Aligned Prototypes (CAPro), a unified prototypical contrastive learning framework to learn visual representations with correct semantics. For one thing, we leverage textual prototypes, which stem from the distinct concept definition of classes, to select clean images by text matching and thus disambiguate the formation of visual prototypes. For another, to handle missing and mismatched noisy texts, we resort to the visual feature space to complete and enhance individual texts and thereafter improve text matching. Such semantically aligned visual prototypes are further polished up with high-quality samples, and engaged in both cluster regularization and noise removal. Besides, we propose collective bootstrapping to encourage smoother and wiser label reference from appearance-similar instances in a manner of dictionary look-up. Extensive experiments on WebVision1k and NUS-WIDE (Web) demonstrate that CAPro well handles realistic noise under both single-label and multi-label scenarios. CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition. Codes are available at https://github.com/yuleiqin/capro.

LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning
Atsuyuki Miyai Qing Yu Go Irie Kiyoharu Aizawa



Research question: Few-shot out-of-distribution (OOD) detection: detecting OOD images from classes unseen during training using only a few labeled in-distribution (ID) images.
Motivation: Prompt learning methods such as CoOp are effective and efficient for few-shot ID classification but remain limited for OOD detection because ID-irrelevant information may be present in the text embeddings.
Method: We introduce Local regularized Context Optimization (LoCoOp), which performs OOD regularization using portions of CLIP's local features as OOD features during training, learning to push ID-irrelevant nuisances (e.g., backgrounds) away from the ID class text embeddings to enhance ID-OOD separation.
Results: On the large-scale ImageNet OOD detection benchmarks, LoCoOp outperforms zero-shot, fully supervised, and prompt learning methods, even in a one-shot setting with just one label per class.

We present a novel vision-language prompt learning approach for few-shot out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD images from classes that are unseen during training using only a few labeled in-distribution (ID) images. While prompt learning methods such as CoOp have shown effectiveness and efficiency in few-shot ID classification, they still face limitations in OOD detection due to the potential presence of ID-irrelevant information in text embeddings. To address this issue, we introduce a new approach called $\textbf{Lo}$cal regularized $\textbf{Co}$ntext $\textbf{Op}$timization (LoCoOp), which performs OOD regularization that utilizes the portions of CLIP local features as OOD features during training. CLIP's local features have a lot of ID-irrelevant nuisances (\textit{e.g.}, backgrounds), and by learning to push them away from the ID class text embeddings, we can remove the nuisances in the ID class text embeddings and enhance the separation between ID and OOD. Experiments on the large-scale ImageNet OOD detection benchmarks demonstrate the superiority of our LoCoOp over zero-shot, fully supervised detection methods and prompt learning methods. Notably, even in a one-shot setting -- just one label per class, LoCoOp outperforms existing zero-shot and fully supervised detection methods. The code is available via https://github.com/AtsuMiyai/LoCoOp.
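
The OOD regularization can be read as entropy maximization on ID-irrelevant local features against the class text embeddings. The sketch below is one plausible instantiation, with the selection of ID-irrelevant regions assumed to have happened upstream; LoCoOp's exact loss and region-selection rule are in the paper.

```python
import torch
import torch.nn.functional as F

def ood_regularization(local_feats, text_embeds, tau=0.01):
    """Push ID-irrelevant local features away from the ID class text
    embeddings by maximizing the entropy of their class posteriors.
    local_feats: (R, D) CLIP local (region) features flagged ID-irrelevant.
    text_embeds: (C, D) class text embeddings from the learned prompts."""
    sims = F.normalize(local_feats, dim=1) @ F.normalize(text_embeds, dim=1).T
    probs = (sims / tau).softmax(dim=1)                  # (R, C) posteriors
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return -entropy.mean()  # minimizing this maximizes entropy on nuisances
```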

Masked Two-channel Decoupling Framework for Incomplete Multi-view Weak Multi-label Learning
Chengliang Liu Jie Wen Yabo Liu Chao Huang Zhihao Wu Xiaoling Luo Yong Xu



Research question: Incomplete multi-view weak multi-label learning, a complex yet highly realistic cross-application of classic multi-label classification and multi-view learning.
Motivation: Research on this cross-application is still in its early stages, and models must cope with arbitrary absences of views and labels.
Method: We propose a masked two-channel decoupling framework based on deep neural networks that decouples the single-channel view-level representation into a shared representation and a view-proprietary representation, with a cross-channel contrastive loss, a label-guided graph regularization loss to preserve the geometric structure among samples, and a random fragment masking strategy for vector features to strengthen the encoders.
Results: Sufficient and convincing experiments confirm the effectiveness and advancement of the model, which adapts to arbitrary view and label absences while also performing well on ideal full data.

Multi-view learning has become a popular research topic in recent years, but research on the cross-application of classic multi-label classification and multi-view learning is still in its early stages. In this paper, we focus on the complex yet highly realistic task of incomplete multi-view weak multi-label learning and propose a masked two-channel decoupling framework based on deep neural networks to solve this problem. The core innovation of our method lies in decoupling the single-channel view-level representation, which is common in deep multi-view learning methods, into a shared representation and a view-proprietary representation. We also design a cross-channel contrastive loss to enhance the semantic property of the two channels. Additionally, we exploit supervised information to design a label-guided graph regularization loss, helping the extracted embedding features preserve the geometric structure among samples. Inspired by the success of masking mechanisms in image and text analysis, we develop a random fragment masking strategy for vector features to improve the learning ability of encoders. Finally, it is important to emphasize that our model is fully adaptable to arbitrary view and label absences while also performing well on the ideal full data. We have conducted sufficient and convincing experiments to confirm the effectiveness and advancement of our model.

Learning Trajectories are Generalization Indicators
Jingwen Fu Zhizheng Zhang Dacheng Yin Yan Lu Nanning Zheng



Research question: The connection between the learning trajectories of deep neural networks and their generalization capabilities when optimized with (stochastic) gradient descent.
Motivation: Instead of focusing solely on the post-training generalization error, analyzing the contribution of each update step to the change in generalization error gives a more direct understanding of how the learning trajectory influences generalization.
Method: Building on this analysis, we propose a new generalization bound that incorporates extensive trajectory information and depends on the complexity of the learning trajectory and the ratio between the bias and diversity of the training set.
Results: Experiments show that the method captures the generalization error throughout training and tracks its changes under adjustments to learning rates and label noise levels, demonstrating that trajectory information is a valuable indicator of a model's generalization capabilities.

This paper explores the connection between learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized using (stochastic) gradient descent algorithms. Instead of concentrating solely on the generalization error of the DNN post-training, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error. This perspective enables a more direct comprehension of how the learning trajectory influences generalization error. Building upon this analysis, we propose a new generalization bound that incorporates more extensive trajectory information. Our proposed generalization bound depends on the complexity of the learning trajectory and the ratio between the bias and diversity of the training set. Experimental observations reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels. These results demonstrate that learning trajectory information is a valuable indicator of a model's generalization capabilities.

Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation
Shengpu Tang Jenna Wiens



Research question: Semi-offline policy evaluation: off-policy evaluation augmented with human annotations of unobserved counterfactual trajectories.
Motivation: Offline data cannot reflect the distribution shifts induced by new policies, online rollouts are often unsafe in high-stakes domains, and naively augmenting data with annotations leads to biased results.
Method: A new family of importance-sampling (IS) estimators with a novel weighting scheme that incorporates counterfactual annotations without introducing additional bias.
Results: Theoretical analysis shows reduced bias and variance versus standard IS; in bandit and healthcare-simulator experiments, the estimators outperform purely offline IS and are robust to imperfect annotations.

In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While it is tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators based on importance sampling (IS) and a novel weighting scheme that incorporates counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
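
A minimal NumPy sketch of the semi-offline idea follows: mix importance-weighted observed returns with annotated counterfactual returns. The convex weighting used here is an illustrative assumption; the paper's estimators use a more careful weighting scheme designed to avoid additional bias.

```python
import numpy as np

def cf_augmented_is(trajs, annotations, w_obs=0.5):
    """trajs:       list of (rho, G) pairs, where rho is the importance ratio
                    (product of pi_e/pi_b along the trajectory) and G the
                    observed return.
    annotations:    list of human-annotated returns for counterfactual actions.
    w_obs:          convex mixing weight (an assumption, not the paper's scheme)."""
    obs = np.mean([rho * G for rho, G in trajs])          # standard IS estimate
    cf = np.mean(annotations) if annotations else 0.0     # annotated counterfactuals
    return w_obs * obs + (1.0 - w_obs) * cf
```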

No Representation Rules Them All in Category Discovery
Sagar Vaze Andrea Vedaldi Andrew Zisserman



Research question: Generalized Category Discovery (GCD): clustering all unlabelled images in a dataset, whether or not they belong to the labelled categories.
Motivation: Existing GCD benchmarks admit only a single valid clustering, so it is unclear whether models leverage the labels or merely solve unsupervised clustering.
Method: A synthetic dataset, Clevr-4, with four equally valid partitions (shape, texture, color, count), used to diagnose existing methods, plus a new Mean-Teacher-based method, µGCD.
Results: Even very strong unsupervised models fail on Clevr-4; µGCD substantially outperforms baselines on Clevr-4 and sets a new state of the art on the Semantic Shift Benchmark.

In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically, given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognise that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to ascertain whether models are leveraging the available labels to solve the GCD task, or simply solving an unsupervised clustering problem. As such, we present a synthetic dataset, named 'Clevr-4', for category discovery. Clevr-4 contains four equally valid partitions of the data, i.e., based on object 'shape', 'texture', 'color', or 'count'. To solve the task, models are required to extrapolate the taxonomy specified by the labelled set, rather than simply latch onto a single natural grouping of the data. We use this dataset to demonstrate the limitations of unsupervised clustering in the GCD setting, showing that even very strong unsupervised models fail on Clevr-4. We further use Clevr-4 to examine the weaknesses of existing GCD algorithms, and propose a new method which addresses these shortcomings, leveraging consistent findings from the representation learning literature to do so. Our simple solution, which is based on `Mean Teachers' and termed $\mu$GCD, substantially outperforms implemented baselines on Clevr-4. Finally, when we transfer these findings to real data on the challenging Semantic Shift Benchmark suite, we find that $\mu$GCD outperforms all prior work, setting a new state-of-the-art.
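
Since $\mu$GCD is described as building on `Mean Teachers', the sketch below shows the standard exponential-moving-average teacher update that this family of methods relies on; the momentum value is an assumption.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Mean-Teacher-style update: the teacher's weights are an exponential
    moving average of the student's (momentum=0.999 is a typical choice)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```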

L2T-DLN: Learning to Teach with Dynamic Loss Network
Zhaoyang Hai Liyuan Pan Xiabi Liu Zhengzheng Liu Mirna Yunita



Research question: Learning to teach: how a teacher model should set dynamic loss functions for different phases of a student model's training.
Motivation: Existing teacher models condition only on the student's current state, ignoring both the teacher's accumulated experience and the state of the loss function itself.
Method: Formulate loss adjustment as a temporal task using a teacher model with memory units, and add a Dynamic Loss Network so that loss states assist teacher learning and enhance teacher-student interaction.
Results: Improved student learning and performance of various deep models on classification, object detection, and semantic segmentation.

With the concept of teaching introduced to the machine learning community, a teacher model can use dynamic loss functions to teach the training of a student model, setting adaptive loss functions for different phases of the student's learning. In existing works, the teacher model 1) merely determines the loss function based on the present state of the student model, disregarding the teacher's accumulated experience; and 2) only utilizes the states of the student model, e.g., the training iteration number and the loss/accuracy on the training/validation sets, while ignoring the state of the loss function itself. In this paper, we first formulate loss adjustment as a temporal task by designing a teacher model with memory units, which enables student learning to be guided by the experience of the teacher model. Then, with a Dynamic Loss Network, we additionally use the states of the loss to assist teacher learning and enhance the interactions between the teacher and the student model. Extensive experiments demonstrate that our approach can enhance student learning and improve the performance of various deep models on real-world tasks, including classification, object detection, and semantic segmentation.
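
To make the "teacher with memory units" concrete, here is a minimal PyTorch sketch in which an LSTM summarizes the history of student/loss states and emits weights for the loss terms; the state features and output head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MemoryTeacher(nn.Module):
    """A teacher with memory: an LSTM reads the sequence of student/loss
    states and outputs mixing weights for a dynamic loss (a sketch)."""
    def __init__(self, state_dim=4, hidden_dim=32, n_terms=3):
        super().__init__()
        self.rnn = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_terms)

    def forward(self, states):                  # states: (1, T, state_dim)
        h, _ = self.rnn(states)
        # Weights over loss terms, informed by the whole training history.
        return torch.softmax(self.head(h[:, -1]), dim=-1)
```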

Enhancing Adversarial Contrastive Learning via Adversarial Invariant Regularization
Xilie Xu Jingfeng Zhang Feng Liu Masashi Sugiyama Mohan Kankanhalli



Research question: How the style-independence property benefits robust representations learned by adversarial contrastive learning (ACL).
Motivation: Standard invariant regularization (SIR) imposes style independence on standard contrastive learning, but its effect on ACL-learned robust representations is unclear.
Method: Use causal reasoning to interpret ACL and propose adversarial invariant regularization (AIR), regularizing ACL with both SIR and AIR.
Results: AIR implicitly makes natural-adversarial representational distances independent of style factors; invariant regularization significantly improves the standard generalization and robustness of state-of-the-art ACL methods on downstream tasks.

Adversarial contrastive learning (ACL) is a technique that enhances standard contrastive learning (SCL) by incorporating adversarial data to learn a robust representation that can withstand adversarial attacks and common corruptions without requiring costly annotations. To improve transferability, existing work introduced the standard invariant regularization (SIR) to impose the style-independence property on SCL, which can exempt the impact of nuisance style factors in the standard representation. However, it is unclear how the style-independence property benefits ACL-learned robust representations. In this paper, we leverage the technique of causal reasoning to interpret ACL and propose adversarial invariant regularization (AIR) to enforce independence from style factors. We regularize ACL using both SIR and AIR to obtain robust representations. Theoretically, we show that AIR implicitly encourages the representational distance between different views of natural data and their adversarial variants to be independent of style factors. Empirically, our experimental results show that invariant regularization significantly improves the performance of state-of-the-art ACL methods in terms of both standard generalization and robustness on downstream tasks. To the best of our knowledge, we are the first to apply causal reasoning to interpret ACL and develop AIR for enhancing ACL-learned robust representations. Our source code is at https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.
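
The following PyTorch fragment sketches the flavor of the invariance property described above: the representational distance between natural data and their adversarial variants should not depend on the style factor, so the discrepancy of those distances across two style augmentations is penalized. This is an illustration, not the paper's AIR objective.

```python
import torch.nn.functional as F

def repr_distance(f, x_a, x_b):
    # Per-sample cosine distance between two batches of representations.
    z_a, z_b = F.normalize(f(x_a), dim=1), F.normalize(f(x_b), dim=1)
    return 1.0 - (z_a * z_b).sum(dim=1)

def style_invariance_penalty(f, style_aug1, style_aug2, x_nat, x_adv):
    # The natural-vs-adversarial distance should agree regardless of which
    # style augmentation produced the natural view (illustrative criterion).
    d1 = repr_distance(f, style_aug1(x_nat), x_adv)
    d2 = repr_distance(f, style_aug2(x_nat), x_adv)
    return F.mse_loss(d1, d2)
```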

Factorized Contrastive Learning: Going Beyond Multi-view Redundancy
Paul Pu Liang Zihao Deng Martin Q. Ma James Zou Louis-Philippe Morency Russ Salakhutdinov



Research question: Learning self-supervised multimodal representations that capture both shared and modality-unique task-relevant information.
Motivation: Multimodal contrastive learning rests on the multi-view redundancy assumption, yet in many real-world settings task-relevant information also resides in regions unique to a single modality.
Method: FactorCL factorizes task-relevant information into shared and unique representations, maximizes mutual information (MI) lower bounds while minimizing MI upper bounds, and uses multimodal data augmentations to approximate task relevance without labels.
Results: On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results on six benchmarks.

In a wide range of multimodal tasks, contrastive learning has become a particularly appealing approach since it can successfully learn representations from abundant unlabeled data with only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy - that shared information between modalities is necessary and sufficient for downstream tasks. However, in many real-world settings, task-relevant information is also contained in modality-unique regions: information that is only present in one modality but still relevant to the task. How can we learn self-supervised multimodal representations to capture both shared and unique information relevant to downstream tasks? This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy. FactorCL is built from three new contributions: (1) factorizing task-relevant information into shared and unique representations, (2) capturing task-relevant information via maximizing mutual information (MI) lower bounds and removing task-irrelevant information via minimizing MI upper bounds, and (3) multimodal data augmentations to approximate task relevance without labels. On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results on six benchmarks.
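
For contribution (2), the standard InfoNCE objective is the canonical mutual-information lower bound; the sketch below shows it as one plausible building block, with the caveat that FactorCL's exact lower bounds and its MI upper bounds differ.

```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(z1, z2, temperature=0.1):
    """InfoNCE estimate of I(z1; z2): log(batch size) minus the contrastive
    cross-entropy. Positives are the matched rows of z1 and z2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature               # (B, B) pairwise similarities
    labels = torch.arange(z1.shape[0], device=z1.device)
    return math.log(z1.shape[0]) - F.cross_entropy(logits, labels)
```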

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
Muhammad Jehanzeb Mirza Leonid Karlinsky Wei Lin Horst Possegger Mateusz Kozinski Rogerio Feris Horst Bischof



Research question: Closing the gap between zero-shot vision-language (VL) classifiers and dedicated supervised classifiers without any labels or paired VL data.
Motivation: Despite great advances, zero-shot classifiers built on pre-trained VL models still fall short of closed-set classifiers trained with supervised fine-tuning.
Method: Tune the zero-shot classifier using an unlabeled image collection together with LLM-generated texts describing the categories of interest, which substitute for labeled visual instances.
Results: Absolute improvements of up to 11.7% (3.8% on average) over the base VL model in the label-free setting, and 1.3% average gains over leading few-shot prompting baselines that use 5-shot supervision.

Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated by a Large Language Model (LLM) describing the categories of interest, effectively substituting for labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating an absolute improvement of up to $11.7\%$ ($3.8\%$ on average) in the label-free setting. Moreover, despite our approach being label-free, we observe $1.3\%$ average gains over leading few-shot prompting baselines that do use 5-shot supervision.
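
A minimal sketch of the pseudo-labeling step such a label-free tuning loop might use follows; the interfaces, confidence threshold, and the use of a text-trained classifier over image embeddings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pseudo_label_unlabeled(image_embeds, text_classifier, threshold=0.8):
    """A classifier trained on LLM-generated category descriptions assigns
    pseudo-labels to unlabeled image embeddings; only confident predictions
    are kept for fine-tuning (threshold=0.8 is an assumed value)."""
    probs = F.softmax(text_classifier(image_embeds), dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf > threshold
    return image_embeds[keep], labels[keep]
```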

RGMIL: Guide Your Multiple-Instance Learning Model with Regressor
Zhaolong Du Shasha Mao Yimeng Zhang Shuiping Gou Licheng Jiao Lin Xiong



Research question: Producing discriminative instance-level representations in multiple-instance learning (MIL) with limited annotation, as needed in video analysis.
Motivation: Many MIL models analyze and aggregate instance representations while neglecting critical information from the MIL problem itself, making ideal instance-level performance hard to achieve compared with supervised models.
Method: A Regressor-Guided MIL network (RGMIL) with a new aggregator, Regressor-Guided Pooling (RGP), which exploits the regressor to guide aggregation without introducing new parameters.
Results: Near-perfect average bag-level accuracy on more than 20 MIL benchmarks, strong results on MMNIST, and instance-level performance comparable to state-of-the-art supervised models in complicated applications.

In video analysis, an important challenge is insufficient annotated data due to the rare occurrence of critical patterns, and in some applications we need to provide discriminative frame-level representations with limited annotation. Multiple Instance Learning (MIL) is suitable for this scenario. However, many MIL models focus on analyzing the relationships between instance representations and aggregating them, while neglecting the critical information from the MIL problem itself, which makes it difficult to achieve ideal instance-level performance compared with supervised models. To address this issue, we propose the $\textbf{\textit{Regressor-Guided MIL network} (RGMIL)}$, which effectively produces discriminative instance-level representations in a general multi-classification scenario. In the proposed method, we make full use of the $\textit{regressor}$ through our newly introduced $\textit{aggregator}$, $\textbf{\textit{Regressor-Guided Pooling} (RGP)}$. RGP focuses on simulating the correct inference process of humans facing similar problems, without introducing new parameters, and the MIL problem can be accurately described through the critical information from the $\textit{regressor}$ in our method. In experiments, RGP shows dominance on more than 20 MIL benchmark datasets, with an average bag-level classification accuracy close to 1. We also perform a series of comprehensive experiments on the MMNIST dataset. Experimental results illustrate that our $\textit{aggregator}$ outperforms existing methods under different challenging circumstances. Instance-level predictions are even possible under the guidance of the RGP information table in long sequences. RGMIL also achieves instance-level performance comparable to state-of-the-art supervised models in complicated applications. Statistical results support the assumption that a MIL model can compete with a supervised model at the instance level, as long as a structure that accurately describes the MIL problem is provided. The codes are available on $\url{https://github.com/LMBDA-design/RGMIL}$.
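
As a rough illustration of an aggregator in the spirit of RGP, the sketch below turns per-instance regressor scores into pooling weights without introducing new parameters; the softmax weighting is an assumption, not the paper's exact pooling rule.

```python
import torch.nn.functional as F

def regressor_guided_pooling(inst_feats, regressor):
    """inst_feats: (N, D) instance features of one bag.
    regressor:     scores each instance, shape (N, 1); its outputs guide
                   the aggregation (no new parameters introduced)."""
    scores = regressor(inst_feats).squeeze(-1)        # (N,) per-instance scores
    weights = F.softmax(scores, dim=0)                # emphasize high-score instances
    return (weights.unsqueeze(-1) * inst_feats).sum(dim=0)  # bag representation
```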

Importance-aware Co-teaching for Offline Model-based Optimization
Ye Yuan Can Chen Zixuan Liu Willie Neiswanger Xue Liu



Research question: Offline model-based optimization: finding a design that maximizes a property of interest using only an offline dataset.
Motivation: Gradient ascent on a proxy trained from offline data suffers from out-of-distribution inaccuracy; pseudo-labeled data can be used to fine-tune the proxy.
Method: Importance-aware co-teaching (ICT): three symmetric proxies take turns serving as pseudo-labeler, exchange small-loss pseudo-labeled samples in a co-teaching step, and reweight pseudo-labeled samples via meta-learning.
Results: State-of-the-art results across multiple design-bench tasks, with the best mean rank of 3.1 and median rank of 2 among 15 methods.

Offline model-based optimization aims to find a design that maximizes a property of interest using only an offline dataset, with applications in robot, protein, and molecule design, among others. A prevalent approach is gradient ascent, where a proxy model is trained on the offline dataset and then used to optimize the design. This method suffers from an out-of-distribution issue, where the proxy is not accurate for unseen designs. To mitigate this issue, we explore using a pseudo-labeler to generate valuable data for fine-tuning the proxy. Specifically, we propose $\textit{\textbf{I}mportance-aware \textbf{C}o-\textbf{T}eaching for Offline Model-based Optimization}~(\textbf{ICT})$. This method maintains three symmetric proxies with their mean ensemble as the final proxy, and comprises two steps. The first step is $\textit{pseudo-label-driven co-teaching}$. In this step, one proxy is iteratively selected as the pseudo-labeler for designs near the current optimization point, generating pseudo-labeled data. Subsequently, a co-teaching process identifies small-loss samples as valuable data and exchanges them between the other two proxies for fine-tuning, promoting knowledge transfer. This procedure is repeated three times, with a different proxy chosen as the pseudo-labeler each time, ultimately enhancing the ensemble performance. To further improve the accuracy of the pseudo-labels, we perform a secondary step of $\textit{meta-learning-based sample reweighting}$, which assigns importance weights to samples in the pseudo-labeled dataset and updates them via meta-learning. ICT achieves state-of-the-art results across multiple design-bench tasks, achieving the best mean rank $3.1$ and median rank $2$ among $15$ methods. Our source code can be accessed here.
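
The co-teaching step can be illustrated with the classic small-loss exchange, sketched below; the keep ratio and the exchange rule are illustrative assumptions.

```python
import torch

def small_loss_exchange(losses_a, losses_b, keep_ratio=0.8):
    """Each proxy keeps the pseudo-labeled samples on which the *other*
    proxy has small loss, treating them as clean for fine-tuning."""
    k = max(1, int(keep_ratio * losses_a.numel()))
    idx_for_b = torch.topk(losses_a, k, largest=False).indices  # clean under A -> tune B
    idx_for_a = torch.topk(losses_b, k, largest=False).indices  # clean under B -> tune A
    return idx_for_a, idx_for_b
```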

Parallel-mentoring for Offline Model-based Optimization
Can Chen Christopher Beckham Zixuan Liu Xue Liu Christopher Pal



Research question: Offline model-based optimization: maximizing a black-box objective over designs (materials, robots, DNA sequences, proteins) using a static dataset of designs and scores.
Motivation: Proxy models trained on static data are inaccurate for out-of-distribution designs; mean ensembles of proxies and the weak ranking supervision a trained proxy provides both help.
Method: Parallel-mentoring: voting-based pairwise supervision generates consensus ranking labels among three parallel proxies, and an adaptive soft-labeling module based on bi-level optimization corrects label noise from incorrect consensus.
Results: Experiments validate that the resulting ensemble is more robust and mitigates the out-of-distribution issue.

We study offline model-based optimization to maximize a black-box objective function with a static dataset of designs and scores. These designs encompass a variety of domains, including materials, robots, DNA sequences, and proteins. A common approach trains a proxy on the static dataset and performs gradient ascent to obtain new designs. However, this often results in poor designs due to the proxy inaccuracies for out-of-distribution designs. Recent studies indicate that (a) gradient ascent with a mean ensemble of proxies generally outperforms simple gradient ascent, and (b) a trained proxy provides weak ranking supervision signals for design selection. Motivated by (a) and (b), we propose $\textit{parallel-mentoring}$ as an effective and novel method that facilitates mentoring among proxies, creating a more robust ensemble to mitigate the out-of-distribution issue. We focus on the three-proxy case in the main paper and our method consists of two modules. The first module, $\textit{voting-based pairwise supervision}$, operates on three parallel proxies and captures their ranking supervision signals as pairwise comparison labels. These labels are combined through majority voting to generate consensus labels, which incorporate ranking supervision signals from all proxies and enable mutual mentoring. Yet, label noise arises due to possibly incorrect consensus. To alleviate this, we introduce an $\textit{adaptive soft-labeling}$ module with soft-labels initialized as consensus labels. Based on bi-level optimization, this module fine-tunes proxies in the inner level and learns more accurate labels in the outer level to adaptively mentor proxies, resulting in a more robust ensemble. Experiments validate the effectiveness of our method. Our code is available here.
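
The voting-based pairwise supervision module can be sketched as a majority vote over proxy-predicted rankings, as below; the tie-breaking and interfaces are illustrative assumptions.

```python
import torch

def consensus_pairwise_label(scores, i, j):
    """scores: (3, N) predicted scores from the three parallel proxies.
    Each proxy votes on whether design i outranks design j; the majority
    vote becomes the consensus label used for mutual mentoring."""
    votes = (scores[:, i] > scores[:, j]).long()  # one vote per proxy
    return int(votes.sum() >= 2)                  # 1 if the majority says i > j
```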

KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs
Lujun Li Peijie Dong Anggeng Li Zimian Wei Yang Ya



Research question: Automatically discovering knowledge distillers for any teacher-student pair, instead of hand-crafting KD designs.
Motivation: Handcrafted KD designs rely heavily on expert knowledge and may be sub-optimal across different teacher-student pairs.
Method: KD-Zero decomposes the generalized distiller into knowledge transformations, distance functions, and loss weights, then evolves candidate populations by crossover and mutation with sharpness and representation gap as fitness objectives, using a loss-rejection protocol, search-space shrinkage, and proxy settings for efficiency.
Results: KD-Zero consistently outperforms state-of-the-art KD methods across diverse architectures on classification, detection, and segmentation tasks.

Knowledge distillation (KD) has emerged as an effective technique for compressing models and enhancing lightweight models. Conventional KD methods propose various designs to allow the student model to imitate the teacher better. However, these handcrafted KD designs heavily rely on expert knowledge and may be sub-optimal for various teacher-student pairs. In this paper, we present a novel framework, KD-Zero, which utilizes evolutionary search to automatically discover promising distillers from scratch for any teacher-student architecture. Specifically, we first decompose the generalized distiller into knowledge transformations, distance functions, and loss weights. Then, we construct our distiller search space by selecting advanced operations for these three components. With sharpness and representation gap as fitness objectives, we evolve candidate populations and generate better distillers by crossover and mutation. To ensure efficient searching, we employ a loss-rejection protocol, search-space shrinkage, and proxy settings during the search process. In this manner, the discovered distiller can address the capacity-gap and cross-architecture challenges for any teacher-student pair in the final distillation stage. Comprehensive experiments reveal that KD-Zero consistently outperforms other state-of-the-art methods across diverse architectures on classification, detection, and segmentation tasks. Notably, we provide practical insights for designing distillers by analyzing the discovered ones. Code is at https://github.com/lilujunai/KD-Zero.
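
To make the search loop concrete, here is a minimal Python sketch of an evolutionary search over (transformation, distance, weight) triples; the candidate operations and hyperparameters are illustrative placeholders, not the paper's actual search space.

```python
import random

# Placeholder candidate operations for the three distiller components.
TRANSFORMS = ["identity", "normalize", "attention_map"]
DISTANCES = ["l2", "kl", "cosine"]

def random_distiller():
    return {"transform": random.choice(TRANSFORMS),
            "distance": random.choice(DISTANCES),
            "weight": random.uniform(0.1, 10.0)}

def evolve(population, fitness, n_keep=4):
    """One generation: keep the fittest distillers, then refill the
    population by crossover and mutation (rates are assumed values)."""
    ranked = sorted(population, key=fitness, reverse=True)[:n_keep]
    children = []
    while len(ranked) + len(children) < len(population):
        a, b = random.sample(ranked, 2)
        child = {k: random.choice([a[k], b[k]]) for k in a}   # crossover
        if random.random() < 0.3:                             # mutation
            child["transform"] = random.choice(TRANSFORMS)
        children.append(child)
    return ranked + children
```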